News Classification Using Machine Learning
ISSN No:-2456-2165
Abstract:- Many social media webpages and platforms produce textual data. These different kinds of data need to be analysed and processed to extract meaningful information from the raw data. Text classification plays a vital role in extracting useful information, along with summarization and text retrieval. In this work we consider the problem of news classification using a machine learning approach. We use a news dataset containing various categories such as entertainment, education, sports, and politics. On this data we apply a classification algorithm together with word vectorization techniques in order to get the best result. The results obtained are compared on different parameters, namely precision, recall, F1 score, and accuracy, for performance improvement.

Keywords:- Machine Learning, News Classification, Naive Bayes Classifier, Natural Language Processing.

I. INTRODUCTION

We deal with a huge amount of information every day in the form of news articles. Text management techniques are essential because the volume of textual data keeps rising with the passage of time. Text mining tools are required to perform indexing and retrieval of text data. Text mining, the extraction of hidden information from a large amount of data, is a very strong tool for classification, and the text it works on is usually unstructured. Unstructured text is growing because most information is now available in digital form, for example on the world wide web, in e-mail, and in publications. Such text may contain ambiguity at the level of words and sentences, in whatever form or sequence they occur; data of this kind is called unstructured text. Extracting useful information from it requires processing and pre-processing techniques. Raw text cannot be processed directly by a computer, which treats it as a sequence of characters that provides no information by itself. A large amount of text material is available for classification in every field. These fields include medicine, finance, image processing, and many others, where the major objective of text mining is to extract useful information from semi-structured or unstructured text by making the best use of techniques such as supervised or unsupervised classification or Natural Language Processing. Traditional natural language processing algorithms mostly operate on words to assign predefined classes to a particular text or text document. The data set used in this study is taken from the News Classification Dataset (.json) and is separated into train data and test data. News classification is largely based on text mining. News articles carry a large amount of information, but long articles are less practical to work with, so our main focus is specifically on short news headlines. News classification relies on statistical and deep learning methods. Many classifiers can be used for news classification, such as naïve Bayes, support vector machine (SVM) [4], decision tree, k-nearest neighbour (KNN), and many more. In this paper the naïve Bayes technique is used with word vectorization.

II. LITERATURE SURVEY

[1] Mykhailo and Volodymyr proposed a naïve Bayes technique for fake news detection. They implemented it as a software system and tested it on a dataset of Facebook posts. Their focus was on accuracy in distinguishing true articles from fake ones. They examined the similarities between spam messages and fake messages using naïve Bayes approaches, explained the method with its formula, and evaluated test accuracy on the Facebook post dataset in terms of precision and recall. They then classified each post as fake or genuine.

[2] This paper focuses on the classification of Indonesian news into categories. The authors use the Nazief-Adriani stemmer to reduce each word to its base form by removing prefixes and suffixes, with a dictionary from Katlego, and use naïve Bayes as the classification technique. They also use the TF-IDF concept, which checks how many terms occur in a document and by how much their weights are reduced [2]. They explain the two stages of naïve Bayes, a training stage and a classification stage, with examples. Their result analysis reports an accuracy of 94%.

III. OBJECTIVES

To implement classification of news into the correct category.
To determine the precision, recall, and F1 score of the news classification as measures of accuracy.
To implement graphical representation of the news in the form of a bar chart, tree map, and word cloud.
To implement and compare naïve Bayes techniques with and without TF-IDF and check the accuracy.
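A naïve Bayes classifier over word counts, evaluated with precision, recall, and F1 score, can be sketched as follows. This is a minimal from-scratch illustration, not the implementation used in this paper (which relies on sklearn): the headlines in `TRAIN` and `TEST` are invented for the example, Laplace (add-one) smoothing is an assumption, and the function names `train_nb`, `predict_nb`, and `precision_recall_f1` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy headlines (not from the paper's dataset), as (text, category).
TRAIN = [
    ("team wins football match", "sports"),
    ("player scores goal in final", "sports"),
    ("parliament passes new budget bill", "politics"),
    ("minister wins election vote", "politics"),
]
TEST = [
    ("football team scores goal", "sports"),
    ("parliament vote on budget", "politics"),
    ("minister wins final vote", "politics"),
    ("football player wins election", "politics"),
]


def train_nb(examples):
    """Count documents per class and words per class; collect the vocabulary."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.split():
            if token in vocab:  # words never seen in training are skipped
                # Add-one smoothing avoids zero probabilities (assumed choice).
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    model = train_nb(TRAIN)
    y_true = [label for _, label in TEST]
    y_pred = [predict_nb(model, text) for text, _ in TEST]
    for category in ("politics", "sports"):
        print(category, precision_recall_f1(y_true, y_pred, category))
```

The last test headline mixes sports and politics vocabulary and is deliberately included so that the metrics are not all equal to one; on real data the same functions would be applied to the pre-processed headlines of the train and test splits.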
Figure 1 Flow chart - Classifier Training

Figure 1 shows the classification process and the steps taken, one by one, to obtain the result, in particular the pre-processing step and feature extraction. For the classification technique, they use naïve Bayes and SVM classifiers [10].

V. PROPOSED SYSTEM

Figure 2 Flow chart - Proposed Model

During the current pandemic, people want to read the news, but when they search for news on Google or any other website, all the news items are shown one after another [7]. Readers, however, often want one particular category of news, such as politics or sports. We therefore categorise the news: we take a news dataset from Google and determine which category each news item belongs to.

A. Text Pre-processing:
News is collected from many sources, such as newspapers and magazines. Datasets are available in many formats, for example .csv, .json, .pdf, .doc, or .html. Once the news has been collected [7], the dataset, having been retrieved from various sources, must be cleaned so that it is free of noisy and useless information. Unrelated symbols such as full stops, brackets, and semicolons need to be deleted from the data [6]. The data is also filtered of the words appearing in the text that are called stop words. This paper uses libraries such as NumPy, pandas, and sklearn.

News Tokenization: Dividing text into small words or segments is called text tokenization [9]. News headlines are split into individual words. Every word in the headline and content is treated as a string, which is then broken down into small pieces. The end result is used as input for text mining. All of the headlines are merged to form a set of words.

Word Stemming: The process of deleting a portion of a word, or reducing a word to its stem or root, is known as stemming. The resulting stem is not necessarily the word's dictionary root.

B. Feature Engineering:
The process of transforming data to improve the predictive performance of the trained models is called feature engineering [5].
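The pre-processing steps described in this section (removing punctuation, tokenizing, dropping stop words, and stemming) can be sketched as below. This is an illustrative sketch under stated assumptions rather than the pipeline used in this paper: the stop-word list `STOP_WORDS` is a tiny hypothetical sample, and `stem` is a toy suffix stripper written for the example, not the Porter or Nazief-Adriani algorithm.

```python
import re

# Assumed, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "are", "and", "for"}

# Assumed suffixes for the toy stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def clean(text):
    """Lowercase the text and replace punctuation such as . ; ( ) with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Split the cleaned text into word tokens on whitespace."""
    return clean(text).split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(headline):
    """Run cleaning, tokenization, stop-word removal, and stemming in order."""
    return [stem(t) for t in remove_stop_words(tokenize(headline))]


if __name__ == "__main__":
    print(preprocess("Ministers are debating the new education budget (again)."))
    # → ['minister', 'debat', 'new', 'education', 'budget', 'again']
```

The stem "debat" in the output illustrates the point made above that stemming does not always reduce a word to its dictionary root.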
REFERENCES