
Volume 6, Issue 5, May – 2021 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

News Classification Using Machine Learning


Shweta D. Mahajan
Computer Engineering
Bharati Vidyapeeth College of Engineering
Navi Mumbai

Dr. D. R. Ingle
Computer Engineering
Bharati Vidyapeeth College of Engineering
Navi Mumbai

Abstract:- There are plenty of social media webpages and platforms producing textual data. These different kinds of data need to be analysed and processed to extract meaningful information from the raw data. Text classification plays a vital role in extracting useful information, along with summarization and text retrieval. In this work we consider the problem of news classification using a machine learning approach. We use a news dataset containing several categories of data, such as entertainment, education, sports, and politics. To this data we apply a classification algorithm together with word-vectorization techniques in order to get the best result. The results obtained are compared on different parameters, such as Precision, Recall, F1 score, and accuracy, for performance improvement.

Keywords:- Machine Learning, News Classification, Naive Bayes Classifier, Natural Language Processing.

I. INTRODUCTION

There is a huge amount of information that we deal with daily in the form of news articles. Textual data management techniques are essential because textual data keeps growing with the passage of time. Text mining tools are required to perform indexing and retrieval of text data. Text mining extracts hidden information from large amounts of data and is a very strong tool for classification; the text involved is mostly unstructured. Unstructured text keeps rising because most information is available in digital form, for example on the World Wide Web, in e-mail, and in publications. A word or a sentence may be ambiguous in its actual form or in the sequence in which it occurs; text of this kind is called unstructured text. Extracting useful information from such text requires processing and pre-processing techniques, because a computer treats text merely as an ordered string and does not by itself provide any higher-level information. Large amounts of text material are available in every field, including medicine, finance, image processing, and many others, where the major objective of text mining is to extract useful information from semi-structured or unstructured text by making the best use of techniques such as supervised or unsupervised classification and Natural Language Processing. Traditional natural language processing algorithms are known to operate mainly on words to decide predefined classes for a particular text or document. The dataset used in this study is the News Classification Dataset (.json), separated into training data and test data. The news classification performed here is based largely on text mining. News articles carry a large amount of information, but long articles are rarely read in full nowadays, so our main focus is specifically on short news headlines. News classification can be based on statistics or on deep learning, and there are many classifiers available for it, such as naïve Bayes, support vector machines (SVM) [4], decision trees, k-nearest neighbours (KNN), and more. In this paper the Naive Bayes technique is used with word vectorization.

II. LITERATURE SURVEY

[1] Mykhailo and Volodymyr proposed a naïve Bayes technique for fake news detection. They implemented software and tested it on a dataset of Facebook posts, targeting accuracy and checking whether articles are true or fake. They verified the similarities between spam messages and fake messages using naïve Bayes approaches, explained the method with formulas, and evaluated the test accuracy in terms of precision and recall, finally classifying which posts are fake and which are correct.

[2] This paper focuses on Indonesian news classification into categories. The authors use the Nazief-Adriani stemmer to reduce each word to its base form, naïve Bayes as the classification technique, and a dictionary from Katlego, with which prefixes and suffixes are removed.

They also use the TF-IDF concept, checking how many terms occur in a document and how much their weights are reduced [2]. They explain the two stages used in naïve Bayes, a training stage and a classification phase, with examples. After the result analysis they report an accuracy of 94%.

III. OBJECTIVES

 To implement classification of news into the accurate category
 To determine the precision, recall, and F1 score of the news classification
 To implement a graphical representation of the news in the form of a bar chart, tree map, and word cloud
 To implement and compare the naïve Bayes technique with and without TF-IDF and check the accuracy
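The News Classification Dataset (.json) described above, separated into training and test data, can be loaded and split with a short sketch like the following. The file path, the `headline`/`category` field names (taken from the record fields listed in Section D: author, category, date, headline, link, short description), and the exact split call are illustrative assumptions, not code from the paper:

```python
import json
from sklearn.model_selection import train_test_split

def load_headlines(path):
    # Read a JSON-lines file where each record carries (among other
    # fields) a news 'headline' and its 'category' label.
    headlines, categories = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            headlines.append(record["headline"])
            categories.append(record["category"])
    return headlines, categories

def split_dataset(headlines, categories):
    # 80% training / 20% test split, as in the methodology below;
    # stratifying keeps category proportions similar in both parts.
    return train_test_split(headlines, categories,
                            test_size=0.2, random_state=42,
                            stratify=categories)
```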

IJISRT21MAY852 www.ijisrt.com 873
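The processing and pre-processing the introduction calls for (tokenisation, stop-word removal, stemming, all detailed in the methodology) can be sketched in plain Python. The stop-word list and suffix rules here are small illustrative stand-ins, not the paper's exact ones:

```python
import re

# A small illustrative stop-word list; a real pipeline would use a
# fuller list, e.g. the one shipped with sklearn or NLTK.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "that"}

def tokenize(text):
    # Split a headline into lowercase word tokens, dropping
    # punctuation such as commas, brackets and full stops.
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # Very crude suffix stripping; as noted later, the result is a
    # stem, not necessarily the word's dictionary root.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(headline):
    return [stem(t) for t in remove_stop_words(tokenize(headline))]
```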


IV. EXISTING SYSTEM

Figure 1 Flow chart – Classifier Training

Figure 1 shows the process of classification and the steps used to retrieve the result, step by step: a pre-processing step and, especially, feature extraction. For the classification technique, the existing system uses Naïve Bayes and SVM classifiers [10].

V. PROPOSED SYSTEM

Figure 2 Flow chart – Proposed Model

Nowadays, in this pandemic, people want to read the news, but when they search for what has happened on Google or any website, all the news is shown one item at a time [7]. Readers usually want one particular category of news, such as politics or sports, so we categorise the news: we take a news dataset from the web and check which category each news item belongs to.

VI. METHODOLOGY

Our approach is to classify news headlines based on their short description. Fig. 2 shows the whole process flow. It begins with the news dataset in the data retrieval module, that is, the collection of datasets; in our implementation we collected them from web sites and extracted the actual text. The dataset is divided into two parts, a training dataset and a test dataset, with 80% of the dataset used for training. Text pre-processing is applied to the training data before it is given to the classifier, using several methods: news tokenisation, diacritics removal, stop-word removal, and word stemming. This input is used to train the classifier and build the naïve Bayes model [6], whose predictions are then checked on the test news headlines. This is accompanied by an evaluation of the accuracy of the news classifier based on performance measures.

A. Text Pre-processing:
We collect news from many sources, as seen in newspapers, magazines, etc. Datasets are available in many formats: .csv, .json, .pdf, .doc, or .html. After the collection of news is done [7], the dataset retrieved from the various sources has to be cleaned so that it is free of noisy and useless information. We need to delete unrelated symbols such as full stops, brackets, and semicolons [6], and the data is short-listed by removing the words that appear in the text as stop words. This paper uses libraries such as NumPy, pandas, and sklearn.

News Tokenization: Dividing text into small words or segments is called text tokenization [9]. Every word in the headline and content is evaluated as a string, which is then broken down into small pieces. The outcome is used as input for the text mining processing. All of the headlines are merged to form a set of words.

Diacritics Removal: The meaning of diacritics varies depending on the language [9]. Commas, semi-colons, quotations, double quotes, full stops, underscores, special characters, and brackets, among other things, are eliminated from all words.

Stop-words Removal: Stop words are all the words in a text that join other words or lines together [9]. They are considered once and then eliminated. These words appear frequently in news headlines but are deemed uninformative in terms of frequency.

Word Stemming: The process of deleting a portion of a word, reducing it to its stem or root, is known as stemming. The result is not necessarily the word's dictionary root.

B. Feature Engineering:
The process of transforming data to improve the predictive performance of the trained models is called feature engineering [5].

a. Count vectorization: This basically counts how many words there are in the feature set. In machine learning, the feature set has to be extracted from the text document. The feature set has as many dimensions as there are unique words in the full dataset; in this approach every unique word is a separate feature, and each document is represented over this set of features [6]. The count of a word in a document, assigned to its related feature, is called the count

vectorization method. On its own it is not able to capture specific word combinations.
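A minimal count-vectorization sketch with scikit-learn (the two headlines are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

headlines = [
    "markets fall as inflation rises",
    "team wins the final match",
]

# Every unique word becomes one feature; each headline becomes a
# row of raw word counts over that shared vocabulary.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(headlines)

print(sorted(vectorizer.vocabulary_))  # the learned feature words
print(counts.toarray())                # one row of counts per headline
```

Because each feature is a single word, plain counts lose word order; passing `ngram_range=(1, 2)` to `CountVectorizer` is the usual way to recover short word combinations.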
b. TF-IDF vectorization: This rescales word frequencies so that common words are down-weighted [2]. Words that appear frequently in all the documents present, such as "that", "is", "then", receive lower scores. It takes advantage of the concept of Term Frequency–Inverse Document Frequency: term frequency is the number of occurrences of a term in a document, while inverse document frequency down-scales words that appear across many documents [2]. This allows the weight of common words to be reduced; for example, certain terms such as "of", "is", "that" may appear many times but have low importance, so we weight down these frequent terms.
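A small sketch of this down-weighting with scikit-learn's `TfidfVectorizer` (the documents are invented for illustration): the word "that" appears in every document, so its weight ends up below that of a distinctive word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "that match is over",
    "that budget is approved",
    "that verdict is out",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_

# 'that' and 'is' occur in all three documents, so their inverse
# document frequency -- and hence their TF-IDF weight -- is minimal.
print(weights[0, vocab["that"]], weights[0, vocab["match"]])
```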

Figure 4. TreeMap for Distribution of categories

Thus, when a word is frequent in most documents, the numerator and denominator get close to each other and the IDF score approaches zero, so words that are not discriminative enough receive close to zero weight. We applied this feature-extraction method and trained one classification technique, naïve Bayes, giving a comparative analysis of naïve Bayes with the count vectorizer against naïve Bayes with the TF-IDF vectorizer [2]: the classifier is trained on features generated by the TF-IDF vectorization method as well as by count vectorization, and its accuracy is checked with both methods.
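This comparison can be sketched as two scikit-learn pipelines that differ only in the vectorizer. The tiny training set below is a hypothetical stand-in for the news headline dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# A toy stand-in for the headline dataset.
train_text = ["team wins final", "election results announced",
              "striker scores goal", "parliament passes bill"]
train_cat = ["SPORTS", "POLITICS", "SPORTS", "POLITICS"]
test_text = ["team scores goal", "parliament election results"]
test_cat = ["SPORTS", "POLITICS"]

for name, vec in [("count", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
    # Same classifier each time; only the feature extraction changes.
    model = make_pipeline(vec, MultinomialNB())
    model.fit(train_text, train_cat)
    acc = accuracy_score(test_cat, model.predict(test_text))
    print(name, acc)
```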
Figure 5. Wordcloud for all words
C. Data Visualization:
We visualise the categorised data in three different ways [3][4]. In Figure 3 we present the category distribution as a bar chart with two dimensions, categories and counts, showing how many items fall into each category [8]. In Figure 4 we show a treemap of the news categories, and in Figure 5 one more visualisation, a word cloud over all the words in the dataset [3].

Figure 3. Bar Chart for Distribution of categories

D. Naïve Bayes:
In machine learning, naïve Bayes classifiers are a family of simple probabilistic classifiers. They are based on Bayes' theorem applied to text features [1]. Features are presumed to be independent of one another, and the classifier calculates the probability of a text for each class label individually. We trained the naïve Bayes classifier with count vectorization and with TF-IDF vectorization [2]. In this text classification task the naïve Bayes classifier gives good accuracy results, but not every prediction is correct: about 80% to 85% of the predictions are accurate.

The higher accuracy was observed when applying count vectorization as the feature extraction, and the lower accuracy when using the TF-IDF vectorization approach with naïve Bayes. Naïve Bayes works well with textual as well as numeric data formats and is easy to implement and compute, relatively robust, fast, and accurate, but it shows poor performance when the features come from very short text. Naïve Bayes classifies under the assumption that the value of a particular feature is independent of the value of any other feature.

The key concept is to regard each piece of news as a separate entity. First, we use a dataset from the internet whose entries have fields such as author, category, date, headline, link, and short description. The Bayes

theorem is the foundation of Naive Bayes, which assumes that the features in a dataset are mutually independent: the probability of occurrence of one feature is unaffected by the occurrence of the others.

E. Performance measures:
We use precision, recall, F1 score, and support as the performance measures [5], comparing the naïve Bayes classifier (Figure 6) with naïve Bayes plus TF-IDF (Figure 7). Precision is the ratio of the number of correct results to the number of total results [3]. Recall is the ratio of the number of correct results to the number of correct results that should have been returned [3]. F1 score is a function of precision as well as recall [3]. Support is how many times a class occurs in the dataset [9].

Figure 6. Performance measure using Naïve Bayes

Figure 7. Performance measure using TF-IDF

VII. IMPLEMENTATION AND RESULTS

The four existing techniques are considered for implementation purposes. The results of the four models presented are consistent with the suggested model, and the category of news is correctly identified [8]. The demonstration is done with certain machine learning algorithms and Python programming in a Jupyter Notebook. The following are example results of the news classification method using Naïve Bayes [2].

No  News                                                                 Category
1   India's largest ever 'eye sky' take neighbours                       TRAVEL
2   US Vice President Mike Pence Did Not Fake Getting COVID-19 Vaccine   POLITICS
3   Rapper Skinnyfromthe9 Handcuffed Weed Bust                           CRIME
4   How reach Kedarnath Temple: quick guide                              TRAVEL

Table 1: Result of Naïve Bayes classification

VIII. CONCLUSION

After conducting research and analysis, the findings of this study show that Naive Bayes can successfully categorise news, while TF-IDF is at the bottom of the performance measures employed in our methodology. Our next goal is to enhance accuracy while also experimenting with classifiers such as SVM and decision trees.

ACKNOWLEDGMENT

This progress report on the creation of "News Classification" brings us tremendous delight. We are grateful to Bharati Vidyapeeth College of Engineering for providing us with this fantastic opportunity to work on this big project. Dr. Dyanand Ingle, our project supervisor, provided us with invaluable guidance and suggestions. We would also like to thank Prof. Sheetal Thakare, Project Coordinator, for providing us with all of the materials we needed to complete our project. We would like to express our gratitude to the teaching and non-teaching staff of the Department of Computer Engineering for their valuable assistance and support throughout the project hours. We shall not forget to thank everyone who has helped us achieve this goal.

REFERENCES

[1]. Mykhailo Granik, Volodymyr Mesyura, "Fake News Detection Using Naive Bayes Classifier", 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON).
[2]. Garin Septian, Ajib Susanto, Guruh Fajar Shidik, "Indonesian News Classification based on NaBaNA", 2017 International Seminar on Application for Technology of Information and Communication (iSemantic).
[3]. Vignesh Rao, Jayant Sachdev, "A Machine Learning Approach to classify News Articles based on Location", Proceedings of the International Conference on Intelligent Sustainable Systems (ICISS 2017), IEEE Xplore Part Number: CFP17M19-ART, ISBN: 978-1-5386-1959-9.
[4]. David Martens, Bart Baesens, Tony Van Gestel, "Decompositional Rule Extraction from Support Vector Machines by Active Learning", IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 2, February 2009.
[5]. Z. Liu, X. Lv, K. Liu, S. Shi, "Study on SVM compared with the other text classification methods", 2nd Int. Workshop on Education Technology and Computer Science (ETCS 2010), vol. 1, pp. 219-222, 2010.
[6]. M. Ikonomakis, S. Kotsiantis, V. Tampakas, "Text Classification Using Machine Learning Techniques", WSEAS Transactions on Computers, Issue 8, Volume 4, August 2005, pp. 966-974.

[7]. Kiran Shriniwas Doddi, Y. V. Haribhakta, Parag Kulkarni, "Sentiment Classification of News Articles", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (3), 2014, pp. 4621-4623.
[8]. J. SreeDevi, M. Rama Bai, M. Chandrashekar Reddy, "Newspaper Article Classification using Machine Learning Techniques", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-9, Issue-5, March 2020.
[9]. Mazhar Iqbal Rana, Shehzad Khalid, Muhammad Usman Akbar, "News Classification Based On Their Headlines: A Review", ISBN: 978-1-4799-5754-5/14, ©2014 IEEE.
[10]. Anjali Jain, Avinash Shakya, Harsh Khatter, Amit Kumar Gupta, "A smart system for fake news detection using machine learning", 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), ISSN: 2321-9939.

