Entity Based Sentiment Classifier For Social Media Analysis
Master Thesis
in the
August 2019
Declaration of Authorship
I, Cristobal Leiva, declare that this thesis titled, ’Entity based Sentiment Classifier
for Social Media Analysis’ and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree
at this University.
• Where any part of this thesis has previously been submitted for a degree or
any other qualification at this University or any other institution, this has
been clearly stated.
• Where I have consulted the published work of others, this is always clearly
attributed.
• Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
Cristobal Leiva
Signed:
Date:
"Achievement of your happiness is the only moral purpose of your life, and that
happiness, not pain or mindless self-indulgence, is the proof of your moral integrity,
since it is the proof and the result of your loyalty to the achievement of your values."
Contents
1 Introduction
1.1 Problem and motivation
1.2 Contribution
1.3 Overview of the document
2 Background
2.1 Social Media
2.1.1 Twitter
2.2 Sentiment Analysis
2.2.1 Analysis Levels
2.2.2 Sentiment Classification
2.2.2.1 Unsupervised Techniques
2.2.2.2 Supervised Techniques
2.2.3 SentiTrack
2.3 Named Entity Recognition
3 Related Work
3.1 Sentiment Analysis
3.2 Entity Based Sentiment Analysis
4 Approach
4.1 Architecture
4.2 Entity Identification
4.3 Tokenization
4.4 Normalization
4.5 POS Tagging
4.6 Feature Vector Generation
4.6.1 Document-based features generation
4.6.1.1 Binary bag-of-words
4.6.1.2 POS Tags
4.6.1.3 Linguistic features
4.6.2 Entity-based features generation
4.6.2.1 Lexicon Features
4.6.2.2 Emoticon Features
4.7 Support Vector Machine
4.8 SentiTrack Integration
A Appendix
A.1 Glossary
A.2 Technical Specifications
A.3 SentiTrack GUI
Bibliography
List of Figures
List of Tables
Abstract
Having a notion of people’s opinions has always been an essential asset for the
decision-making processes in many business models. Therefore, the popularity of
sentiment analysis of social media content has been growing rapidly in recent
years. However, state-of-the-art sentiment classification approaches lack the
capability to perform entity-based classification, that is, to identify opinion
expressions targeting a specific term, word or entity. This thesis addresses this
particular problem of entity-based sentiment classification.
Combining machine learning and natural language processing methods, the solution
proposed in this project uses several sentiment analysis techniques, such as Named
Entity Recognition (NER), POS tagging, feature vector generation and the
bag-of-words model. These are required to train a Support Vector Machine that
classifies tweets (from Twitter) as positive, negative or neutral based on the
sentiment expressed toward a given target entity (company, person or product).
With real-time sentiment analysis systems in mind, the presented solution is
developed as a high-performance classifier capable of coping with real-time
processing environments. The completeness and effectiveness of this approach are
supported by quality evaluations and performance tests.
Acknowledgements
It delights me immensely to finally present this thesis to the Department of Com-
puter Science at the University of Bonn, for the partial fulfilment of the Master
of Science in Informatik. I would like to thank Professor Dr. Sören Auer for his
constant support and open arms to accept my thesis proposal. This was a fantastic
opportunity and I am fully satisfied with the treatment received by the department
and university. I also want to thank Dr. Simon Scerri for supervising this thesis
project and for always being supportive, even in the most difficult circumstances.
Finally, my sincere gratitude goes to my family and friends, who were always
there for me. To my beloved future wife, who did everything she could to keep me on
track and focused on our goals. To my mother, to whom I owe basically my entire life
and achievements. You are the best mother anyone could ever have. Thank you mom!
Thank you all!
Chapter 1
Introduction
This chapter starts by giving insight into the problem and motivation behind
this thesis; it continues with an outline of the contributions this research project
provides. Lastly, the structure of the document is described.
1.1 Problem and motivation
Information has become the most valuable resource in modern society. From
children to seniors, from New Zealand to Canada, a very large portion of the
human population has shared, to some extent, part of their lives through
technological means. This opens up countless possibilities for science,
especially for Computer Science fields such as Information Retrieval (IR),
Machine Learning (ML), Natural Language Processing (NLP) and the Semantic Web.
The era of data analysis is just starting.
Having a notion of people’s thoughts has always been an essential asset for the
decision-making processes in many business models. Even before the existence of the
World Wide Web, people used to ask for product recommendations or for
opinions about events such as local elections. But with the beginning of the Internet
Era, societies started to use the common wealth of knowledge found on forums,
blogs, and social media network services as an important source of opinions [1].
These opinions may come from customers simply sharing their experience with a
product or from well-known professionals writing elaborate reviews. The Internet
became a valuable pool of experiences available for everyone to use.
Figure 1.1: Social network user growth projection 2010 - 2018 [2]
Figure 1.1 shows the number of social network users worldwide from 2010 to 2014,
with projections until 2018. By 2018, the projection reaches 2.44 billion users
on social media networks, which is more than 30% of the world’s population.
However, the data generated by these systems must be mined and analyzed before
it represents valuable information for interested parties. This master thesis
aims to explore an approach called "Entity Based Sentiment Classification" to tackle
this challenge.
As shown in Figure 1.1, the usage of social networking services such as Twitter
has increased over the past few years. Twitter, as a micro-blogging
system, allows users to publish tweets of up to 140 characters in length to tell others
what they are doing, what they are thinking, or what is happening around them.
As a result, companies and media organizations are constantly looking for ways of
extracting public opinions and feelings about their products and services [3]. The
nature of tweets (short and usually meaningful in the context of marketing)
allows researchers to exploit different data mining and sentiment analysis approaches.
Projects like Twitratr¹, streamcrab² and Talend³ are examples of services intended
to obtain sentiment information from tweets.
¹ http://www.twitrratr.com
² http://www.streamcrab.com
³ http://www.talend.com
Table 1.1 shows two example tweets together with their corresponding sentiment.
The first tweet expresses a positive sentiment, containing one positive word, happy,
and one positive emoticon, :D. The second tweet indicates negative sentiment based on
the hashtag #sad and the emoticon :(.
In other services, such as Sentiment140⁴ and IBM AlchemyAPI⁵, users may enter
a target entity as a search query; the system then fetches tweets in real time
that contain positive or negative sentiments towards the given target entity [4]. This
task is formally named Targeted Sentiment Analysis and can be described as the
extraction of positive, negative or neutral sentiment towards an input target entity
in a given text.
Most approaches that deal with Targeted Sentiment Analysis are based on the
extraction of target-independent features. In 2010, Barbosa and Feng [5] used
machine-learning-based classifiers for the sentiment classification of texts. However,
their classifiers actually work in a target-independent way: all the features used in
the classifiers are independent of the target, so the sentiment is decided no
matter what the target is. Pang and Lee [6] performed a similar sentiment
classification experiment on movie reviews in 2002, again considering only
target-independent features. Movie reviews usually concentrate
opinions expressed towards a single target entity, in this case a specific movie.
This assumption does not hold for entity-based sentiment classification in
tweets. Tweets have a very particular structure in which multiple target entities may
appear in the same context; in this scenario, target-independent sentiment classifiers
will not yield satisfactory results.
Table 1.2 illustrates two example tweets where TiSC⁶ approaches cannot correctly
identify the sentiment towards the given target entities. The first tweet does not express
any positive sentiment towards the given target iPhone but instead towards a second
entity, Amazon; the problem is that the user is only interested in the input entity.
This would be a false positive case for TiSCs. In the second tweet, a similar case is
presented where
⁴ http://www.sentiment140.com
⁵ http://www.alchemyapi.com
⁶ Target-independent Sentiment Classification according to Jian and Liu 2011
A correct sentiment classification for given target entities is crucial for systems
such as SentiTrack⁷. SentiTrack intends to find correlations between the opinions
extracted from real-time tweets and intra-day stock price variations of a set of specific
companies. An entity-based sentiment classification is therefore required for this
task, and providing it is the main objective of this master’s thesis. To
achieve this goal, the proposed solution combines document-based (target-independent)
and entity-based features with a variety of sentiment classification techniques to
train a Support Vector Machine (SVM) capable of producing highly accurate results
for entity-based sentiment classification tasks.
1.2 Contribution
The main contributions that this thesis aims to achieve are the following:
• This work compares and evaluates different techniques for sentiment
classification, reporting the performance, accuracy, precision and recall of
each technique.
⁷ Linked Data-based Social Media Analysis for Stock Market Tracking
Chapter 2
Background
This chapter explains the theoretical background used for the development of an
entity-based sentiment classifier. The chapter is divided into three sections: Section 2.1
explains the basic concepts related to social media and the social networking
service Twitter. Section 2.2 defines Sentiment Analysis and describes different
state-of-the-art sentiment classification techniques. Finally, Section 2.3 explores
existing Named Entity Recognition approaches.
2.1 Social Media
Social Media refers to a set of computer-based tools that allow people,
organizations and companies to share and exchange information with community networks.
Moreover, Social Media tends to change over time, allowing people to create content
dynamically and without restrictions. As a result, there is a vast
number of communities performing all kinds of activities, such as blogging, posting
on forums, podcasting, generating trends on social networks and more.
2.1.1 Twitter
Being one of the most popular websites in the world, Twitter has become an
ever-growing corpus of information. As a social network, Twitter allows constant sharing
of short messages between users; this is also called micro-blogging. Users can
follow each other’s activity and receive updates of their messages. Currently,
this network has around 320 million active users [8].
• Tweet: Tweets are the essence of Twitter; they are short message units (pieces
of information) of at most 140 characters. Users use tweets to
share ideas, status, news or any other kind of information with their followers.
Figure 2.2 shows an example tweet where the actor (user) is "Outlook"
and the message contains #hashtags, @mentions and URLs.
• Users: The users are the source of information on Twitter. They have
followers, who are other users interested in the content being shared.
• Trends: Trends are popular topics close to the geographic location of the
users. They are determined by an algorithm that combines the information
about followers, accounts, and places related to them. The trends intend to
help users to discover relevant topics based on their location.
2.2 Sentiment Analysis
Sentiment analysis (SA), also known as Opinion Mining, can be defined as the use of
computer-based methods to extract an opinion from a given text source. Liu [9, p.
7] defines SA as:
Opinions are essential for all human activities because they affect our decisions.
Before making decisions, people usually evaluate others’ opinions. In the real world,
businesses and organizations are always trying to learn the public opinion about their
products and services in order to adapt their marketing strategies [9, p. 10]. Nowadays,
companies may no longer require surveys and opinion polls to gather the opinion of
their customers. However, analyzing opinion sites and social networks by SA means
is not an easy task. Several fields of Computer Science, such as Natural Language
Processing (NLP), Machine Learning (ML) and Information Retrieval (IR), are
involved in SA.
2.2.1 Analysis Levels
Different levels of SA can be applied to a given text source. These
are the following [9, p. 11]:
2.2.2 Sentiment Classification
Sentiment classification is one of the most studied topics in the area of sentiment
analysis and natural language processing. The main goal is to classify a document
as positive or negative based on the opinion expressed in it; if the document does
not contain any opinion expression, the classification result must be neutral. The
following sections present unsupervised and supervised techniques for sentiment
classification according to Bing Liu [9].
2.2.2.1 Unsupervised Techniques
Sentiment dictionaries, or opinion lexicons, are the core component of any
unsupervised sentiment classification method. These dictionaries are made of words or
phrases with specific sentiment scores. For example, words like fantastic and
excellent have a positive orientation in most sentiment lexicons, while other words
such as terrible and awful are categorized as negative. Sentiment lexicons define the
polarity orientation of words by numeric values; some of them assign an intensity score
to each word and others consider negating contexts for sentiment scores. Section 4.4
explains more about negation contexts.
There are several ways of creating a sentiment lexicon; Bing Liu explains the most
effective ones [9]:
1. Manual approach: As its name suggests, this approach requires the effort
of evaluators to manually assign a sentiment orientation to a set of words. This
task consumes a lot of time and is better used in combination with
other automatic methods. However, it is useful for evaluating the results of
non-manual approaches [9, p. 79].
Table 2.1 illustrates the sentiment classification of two example tweets using a
lexicon-based unsupervised technique where positive and negative words are scored
+1 and -1 respectively. The overall score of each tweet is calculated by adding up
the sentiment values of individual words.
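The scoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration only: the four-word lexicon, the function names and the simple regular-expression tokenization are assumptions made for the example, not the lexicon or code used in this thesis.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"fantastic": 1, "excellent": 1, "terrible": -1, "awful": -1}


def lexicon_score(text, lexicon=LEXICON):
    """Add up the polarity values of all lexicon words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def classify(text, lexicon=LEXICON):
    """Map the overall score to a sentiment label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A text without any lexicon word receives a score of zero and is therefore labelled neutral, which is one simple way of obtaining a third class from a purely lexicon-based method.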
• Three-class classifier: Similar to the two-class classifier, this one also in-
cludes subjectivity classification which means that it classifies documents as
neutral or polar (positive or negative).
One of the main differences between unsupervised and supervised sentiment
classification methods is the training phase. Sentiment classifiers based on supervised
approaches require a set of annotated documents; this annotation is usually done
manually, but in some cases through distant-supervision methods [11]. A
distant-supervision approach generates a training corpus by automatic means: it annotates
documents based on the presence of positive or negative emojis (emoticons). The
annotation accuracy of this method depends on the size of the documents, so it is
mostly used on short texts, as in Twitter sentiment classification tasks.
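As an illustration of distant supervision, the following Python sketch labels tweets by the emoticons they contain. The emoticon sets and the function names are assumptions chosen for the example; a tweet with no emoticon, or with emoticons of both polarities, is left unlabelled and excluded from the corpus.

```python
# Hypothetical emoticon sets; real systems typically use longer lists.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "<3"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}


def distant_label(tweet):
    """Return 'positive', 'negative', or None when emoticons give no clear signal."""
    tokens = tweet.split()
    has_pos = any(token in POSITIVE_EMOTICONS for token in tokens)
    has_neg = any(token in NEGATIVE_EMOTICONS for token in tokens)
    if has_pos == has_neg:  # no emoticon at all, or conflicting emoticons
        return None
    return "positive" if has_pos else "negative"


def build_corpus(tweets):
    """Keep only the tweets that received a label, paired with that label."""
    labelled = ((tweet, distant_label(tweet)) for tweet in tweets)
    return [(tweet, label) for tweet, label in labelled if label is not None]
```

The resulting list of (tweet, label) pairs can then serve as an automatically annotated training corpus.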
After being trained, a classifier is capable of predicting the sentiment of new input
documents. The classification process requires the extraction of feature vectors
from the documents; each vector contains n numerical features. The feature sets
of a classifier are essential for obtaining high accuracy. Some of the most effective
features are [9, p. 25]:
• Sentiment shifters: These are expressions that are used to change the sen-
timent orientations, e.g., from positive to negative or vice versa [9, p. 26].
?? illustrates two example tweets with their respective feature vectors; these features
are composed of the bag-of-words model, part-of-speech tags and sentiment features.
Each of them is explained in turn:
• bag-of-words: To reduce the sparsity of the vectors, one of the most used
preprocessing steps for extracting features is the removal of stop words, which
in these cases are: you, the, my, me, are. Twitter mentions and URLs are also
removed or replaced with placeholders. The resulting vectors are:
– (1) (@mention,1)(best,1)(<3,1)(#happy,1)
– (2) (bf,1)(hates,1)(:(,1)(depressed,1)
Each tweet is represented on the vector space which is the union of both sets.
• sentiment: The first tweet contains three positive tokens while the second one
has three negative. As a result, the sentiment features are: (1)[3,0] & (2)[0,3].
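The bag-of-words and sentiment features above can be reproduced with a short Python sketch. The raw token lists are a hypothetical reconstruction (the stop words listed above were re-added to the tokens shown), and the positive and negative token sets are assumptions made only for this example.

```python
STOP_WORDS = {"you", "the", "my", "me", "are"}
# Hypothetical sentiment sets covering only the tokens of this example.
POSITIVE = {"best", "<3", "#happy"}
NEGATIVE = {"hates", ":(", "depressed"}


def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def build_vocabulary(documents):
    """Union of all tokens over all documents, in first-seen order."""
    vocabulary = []
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def binary_bow(tokens, vocabulary):
    """1 if the vocabulary term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def sentiment_counts(tokens):
    """[number of positive tokens, number of negative tokens]."""
    return [sum(t in POSITIVE for t in tokens), sum(t in NEGATIVE for t in tokens)]


# Hypothetical raw tokens, with mentions already replaced by a placeholder.
tweet_1 = ["@mention", "you", "are", "the", "best", "<3", "#happy"]
tweet_2 = ["my", "bf", "hates", "me", ":(", "depressed"]

docs = [remove_stop_words(tweet_1), remove_stop_words(tweet_2)]
vocabulary = build_vocabulary(docs)
vectors = [binary_bow(d, vocabulary) + sentiment_counts(d) for d in docs]
```

Each vector has one binary entry per vocabulary term, followed by the two sentiment counts, so both tweets live in the same vector space.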
2.2.3 SentiTrack
2. Stream Processing: The set of entities obtained in the previous stage is used
to feed the Twitter Streaming API in order to fetch related public tweets in
real time. Then, tweets are processed for recognition and annotation of
entities using DBPedia Spotlight. Those tweets with annotated entities
are then analyzed using a lexicon-based entity-centric sentiment classifier. The
classification step intends to provide a positive, negative or neutral result based
on the opinions expressed towards each of the identified entities in the tweet.
The entity-centric sentiment classification approach used by SentiTrack is very
simple and not accurate enough. Therefore, an improved entity-based sentiment
classifier is required for the SentiTrack project.
2.3 Named Entity Recognition
Named Entity Recognition (NER), also known as named entity extraction or entity
identification, is an information extraction task used mostly for natural language
processing. The task consists of the extraction of structured information from
unstructured text. For instance, social media networks, micro-blogging sites and
e-commerce systems are constantly generating unstructured data that could be
processed into valuable information for interested parties. NER therefore intends to
identify elements of a given input text and classify them according to a set of categories,
e.g. names of persons, locations, organizations, products, values, etc. [13, p. 1].
The most effective approaches for NER use machine learning (ML) methods to
extract features of Named Entity (NE) examples classified as positive or negative;
these examples form a large collection of annotated documents (the training
corpus).
There are three main ML techniques to perform entity identification [13, p. 4]:
This means that NER is highly dependent on the quality of the training
corpus, which determines how accurate the classifier is.
Chapter 3
Related Work
This chapter presents a variety of projects and scientific work related to sentiment
analysis and the usage of entity based approaches for sentiment classification. The
constant publication of opinionated data in social media networks provides an ever-
growing source of valuable information for interested companies. Therefore, Senti-
ment Analysis has become a very popular research field in the scientific community.
Pang and Lee [1] in 2008 presented a survey that covers techniques and approaches
for polarity sentiment classification and subjectivity identification. Moreover, in
2012 Bing Liu [9] published one of the most cited books related to sentiment analy-
sis, where he provides a very complete survey of most relevant research topics related
to opinion mining.
Based on the aforementioned literature and additional related works, the next section
discusses a variety of sentiment analysis projects and the different approaches used
in them. Finally, this chapter presents a few research works associated with
entity-based sentiment classification and compares these techniques with the methods
utilized and developed in this master’s thesis.
3.1 Sentiment Analysis
The applications of sentiment analysis are many; some of them include the
classification of forum posts, blogs, news, product reviews and social network content.
Because this thesis project is based on social media analysis, specifically
Twitter data, the related work presented in this section is mainly focused on
document-level sentiment analysis of tweets.
The term sentiment analysis was first introduced by Nasukawa and Yi in 2003
[14]; in their work they extracted sentiments associated with positive or negative
polarities for specific subjects using semantic analysis with a syntactic parsing
method. However, research about opinions and sentiments expressed in text had
already appeared in 2001, when Das et al. [15] and Tong [16] published their work on
the analysis of market sentiment.
It was not until 2009 that Bhayani et al. [11] presented the first relevant research
related to the usage of Twitter data for sentiment analysis. Bhayani et al. used
a novel approach for automatic polarity classification of tweets, where messages are
classified as either positive or negative. Based on distant supervision and machine
learning algorithms, Bhayani et al. generated a training corpus by evaluating the
presence of positive or negative emoticons such as ":D" or ":(" in each tweet. With this
method, they managed to achieve an accuracy above 80% for a polarity classification
task.
In 2010, Pak and Paroubek [17] built a sentiment classifier that is able to determine
positive, negative and neutral sentiments of English tweets. Using a distant
supervision approach similar to Bhayani et al. [11] for the generation of a training corpus,
they implemented a multinomial Naive Bayes classifier extracting POS-tags and
unigrams as binary features. Their results showed higher precision when using the
presence of a term rather than its frequency. Also, an increased accuracy was obtained
by using unigrams instead of bigrams or trigrams. In the same year,
Barbosa and Feng [5] presented a 2-step sentiment classification
method which first classifies tweets as subjective or objective (polar or
neutral), followed by a polarity classification of subjective tweets as positive or
negative. Based on support vector machines, this 2-step approach showed increased
accuracy in comparison with single-step classifiers.
In 2011, Kouloumpis et al. [3] explored the utility of linguistic features for the
identification of sentiment in Twitter messages. Their paper presents an extended distant
supervision approach for the generation of training data; the method consists of
including Twitter hashtags such as "#bestfeeling", "#epicfail" and "#news" to
enhance the training data quality and to add a third class (neutral) to the classifier.
Additionally, Kouloumpis et al. implemented a 3-step preprocessing method
composed of the following steps: (1) tokenization, (2) normalization, (3) part-of-speech
tagging. According to Kouloumpis et al.’s results, this preprocessing stage improves
the quality of the extracted features. In contrast to Kouloumpis et al.,
Paltoglou and Thelwall [18] explored in 2012 an unsupervised lexicon-based
approach that predicts the level of emotional intensity contained in tweets.
According to Paltoglou and Thelwall, this approach may be used for subjectivity
identification and sentiment classification tasks, obtaining results comparable to
state-of-the-art machine-learning-based methods.
In 2013, Saif et al. [19] developed a state-of-the-art support vector machine (SVM)
classifier which obtained the best results in the SemEval 2013 Twitter analysis task.
SemEval (Semantic Evaluation) is an international competition for the evaluation of
computational semantic analysis systems. Many scientists and students from all
around the world participated in this event, but Saif et al.’s sentiment classification
approach excelled in its category. The classifier uses a large set of features to train an
SVM; n-grams, POS-tags, hashtags, lexicons, emoticons and elongated
words are just a part of the full set. However, the addition of negation context
handling was one of the determining factors in outperforming the other competitors.
Finally, the methods presented in this master’s thesis for document-level feature
extraction are highly influenced by Saif et al.’s work.
3.2 Entity Based Sentiment Analysis
Currently, there are not many research works related to entity-based sentiment
analysis on Twitter. This section therefore discusses the most relevant
publications that explore the idea of an entity-centric sentiment classifier.
In 2008, Ding et al. [20] presented a holistic lexicon-based approach to
perform topic-based opinion mining. In their work, they determined the sentiment
orientations (positive, negative or neutral) of opinions expressed on product features
in reviews. Using unsupervised lexicon-based techniques, Ding et al.
defined a set of linguistic rules to extract opinion phrases expressed towards specific
products in a given review text. Although Ding et al.’s approach achieved high
accuracy for topic-based sentiment analysis on product reviews, this method may
not perform as well on noisy text such as tweets. In 2010, Khoo et al. [21]
implemented an aspect-based sentiment classifier capable of extracting
both the sentiment orientation and the sentiment strength of movie reviews. They
considered the sentiment expressed towards different aspects of these movies; these
aspects can be seen as sentiment targets. Using sentence-oriented linguistic clauses,
Khoo et al. managed to obtain highly precise sentiment scores for movie aspects. However,
one of the drawbacks of Khoo et al.’s solution is the absence of a neutral class in
their approach.
The most relevant scientific work related to this master’s thesis was developed by
Jiang et al. [4] in 2011. They focused on target-dependent Twitter sentiment
classification: given an input query, they classify the sentiment orientation
of tweets as positive, negative or neutral based on the sentiment expressed
towards that query. Jiang et al. implemented a two-step SVM classifier
incorporating target-dependent and target-independent features. Additionally, they
included related tweets (mentions and replies of each tweet) in the analysis.
According to their experimental results, the two-step methodology greatly improves
the accuracy of entity-based sentiment classifiers. Nevertheless, because of the
complexity of the methods and the necessity of analyzing related tweets for each
document, the processing time is compromised and the approach is not suitable for
real-time systems.
To conclude this chapter, it is important to clarify that this master’s thesis is highly
influenced by the aforementioned research works. However, the specific combination of
methods and techniques used in this project is not documented in any other related
publication.
Chapter 4
Approach
The following section presents the architecture and process pipeline of the
entity-based sentiment classifier. The subsequent sections provide an in-depth
description of the architecture’s components.
4.1 Architecture
1. Entity identification
2. Tokenization
3. Normalization
4. POS-Tagging
5. Feature vector generation
- Document-based features
- Entity-based features
Figure 4.1 illustrates the aforementioned components and processes. The pipeline starts
with two input elements provided by the user or by the system into which the classifier
is integrated. The required input data are the following:
Figure 4.2 shows a simplified workflow of the entity-based sentiment classifier
developed in this thesis. Notice that the input of the classifier is a 2-tuple (a
two-element list) composed of a tweet and a target entity. The output produced by the
classifier is a three-class sentiment classification (positive, negative or neutral) which
represents the opinion expressed in the tweet towards the given target entity.
Continuing with the explanation of Figure 4.1, the first processing step in the proposed
system is called Entity Identification, which determines the presence of other
entities (besides the target one) in the tweets. Then, the Tokenization step
removes unnecessary tokens (terms or words) or replaces them with predefined
placeholders. The workflow continues with a normalization process which is responsible
for most of the linguistic processing of input tweets. After the data has been
normalized, a POS-Tagger assigns part-of-speech labels to each token and sends them to
the Feature Vector Generator. The Feature Vector Generator extracts
document-level and entity-level feature vectors to finally feed the support vector machine.
The following sections describe how each of these processes works.
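The workflow described above can be pictured as a chain of stages that pass a shared document object along. The Python sketch below only illustrates this control flow: the Document class, the run_pipeline function and the trivial stand-in stages are hypothetical names and placeholders, not the components implemented in this thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """State handed from one pipeline stage to the next (hypothetical structure)."""
    tweet: str
    target_entity: str
    other_entities: List[str] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)
    pos_tags: List[str] = field(default_factory=list)
    features: List[float] = field(default_factory=list)


Stage = Callable[[Document], Document]


def run_pipeline(tweet: str, target_entity: str, stages: List[Stage],
                 classifier: Callable[[List[float]], str]) -> str:
    """Apply every stage in order, then classify the resulting feature vector."""
    doc = Document(tweet=tweet, target_entity=target_entity)
    for stage in stages:
        doc = stage(doc)
    return classifier(doc.features)


# Trivial stand-ins, used only to show how the stages are wired together.
def tokenize(doc: Document) -> Document:
    doc.tokens = doc.tweet.split()
    return doc


def count_features(doc: Document) -> Document:
    doc.features = [float(len(doc.tokens))]
    return doc


def dummy_classifier(features: List[float]) -> str:
    return "neutral"  # a trained SVM would produce the label here


label = run_pipeline("Thanks google!!", "Google",
                     [tokenize, count_features], dummy_classifier)
```

In a full system the stage list would contain entity identification, tokenization, normalization, POS tagging and feature vector generation in that order, and the classifier argument would wrap the trained support vector machine.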
4.3 Tokenization
The Tokenization process is responsible for splitting the input tweet text (string) into
tokens (words or terms) and organizing those tokens into their respective sentences.
The Tokenizer (the module in charge of this process) uses natural language processing
methods and logical rules such as regular expressions to trim the text.
Figure 4.4 represents the Tokenization process workflow. Starting with token
segmentation, the words contained in a tweet are separated from each other based
on the presence of white space. Then, a regular-expression-based algorithm checks
each of those tokens for sentence-ending punctuation such as exclamation points,
question marks and full stops. This sentence segmentation step is necessary for
the entity-based feature vector generation process, since the sentiment relevance
of a sentence is associated with the presence of entity tokens. Finally, HTML
character entities (e.g. &amp;, &quot;, etc.) are replaced by the characters they
represent (some emoticons are made of these symbols), and identified entities are
replaced by their respective placeholders. Replacing entities reduces the sparsity
of the vector space model generated later. Table 4.1 shows an example of a
tokenized tweet where tokens are arranged into sentences and entities are replaced
by their respective placeholders (TargetEntity and OtherEntity).
Target Entity     (1) Google
Other Entities    (1) Nexus
Tweet             Thanks google!! Just got my new Nexus <3
Result Tokens     (1) {Thanks, TargetEntity!!}
                  (2) {Just, got, my, new, OtherEntity, <3}
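The tokenization steps described above can be illustrated with a minimal JavaScript sketch. It is not the Tokenizer implemented in this thesis: the function names (tokenize, replaceEntity, escapeRegExp) and the small HTML entity map are assumptions made for this example, and entity replacement is applied before whitespace splitting so that multi-word entities can also be handled.

```javascript
// Illustrative sketch only; all names below are hypothetical.
const HTML_ENTITIES = { "&amp;": "&", "&quot;": '"', "&lt;": "<", "&gt;": ">" };

// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every case-insensitive occurrence of an entity by a placeholder.
function replaceEntity(text, entity, placeholder) {
  return text.replace(new RegExp(escapeRegExp(entity), "gi"), placeholder);
}

function tokenize(tweet, targetEntity, otherEntities) {
  // Decode a few HTML character entities (some emoticons rely on them).
  let text = tweet.replace(/&(amp|quot|lt|gt);/g, (match) => HTML_ENTITIES[match]);

  // Replace entities by placeholders to reduce vocabulary sparsity.
  text = replaceEntity(text, targetEntity, "TargetEntity");
  for (const entity of otherEntities) {
    text = replaceEntity(text, entity, "OtherEntity");
  }

  // Token segmentation based on white space.
  const tokens = text.split(/\s+/).filter((token) => token.length > 0);

  // Sentence segmentation: a token ending in ".", "!" or "?" closes a sentence.
  const sentences = [];
  let current = [];
  for (const token of tokens) {
    current.push(token);
    if (/[.!?]+$/.test(token)) {
      sentences.push(current);
      current = [];
    }
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
}
```

Called with the tweet, target entity and other entity of Table 4.1, this sketch returns the two token lists shown in the table.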
4.4 Normalization
Executed by a preprocessor module, the normalization step does most of the
linguistic processing required for the generation of feature vectors. Normalization
involves the correction, removal and replacement of tokens yielded by the
tokenization process. While some Twitter features such as @mentions and URLs carry
no weight in a sentiment context, #hashtags might actually contain sentiment value
that is necessary for the final classification. The Normalization process is
composed of five processing steps, which are represented in Figure 4.5.
• Fix Elongation: tokens in which a letter is repeated more than twice in a row
are fixed, leaving only two of these letters. For example, the word "loooove!"
would be replaced by "loove!". This process contributes to the effectiveness
of the classifier because, although the resulting words are not necessarily
the correct (original) ones, most lexicons consider two-letter elongated
versions of terms.
• Negation Tagging: negation words such as "not" or "never" can modify the
sentiment orientation of a sentence. For example, "I love you" is an obviously
positive sentence, but "I do not love you" is considered negative. To deal with
this situation, the negation tag "_NEG" is appended to the tokens located
between a negation word and the end of the sentence. The sentiment of negated
tokens is then shifted in the feature generation stage of the classifier.
Table 4.2 presents an example of how the normalization process works; in this
case, two sentences are normalized. The first one shows the negation tagging of
the tokens "their" and "best!", which are positioned after the negation word
"not". The second sentence shows an example of token replacements.
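The two normalization steps described above, Fix Elongation and Negation Tagging, can be sketched in a few lines of JavaScript. This is a simplified illustration under stated assumptions, not the preprocessor module of this thesis: the function names are hypothetical and the negation word list contains only the two words mentioned in the text.

```javascript
// Illustrative sketch only; names and the negation list are assumptions.
const NEGATION_WORDS = new Set(["not", "never"]);

// Collapse any run of three or more identical letters to exactly two.
function fixElongation(token) {
  return token.replace(/([a-zA-Z])\1{2,}/g, "$1$1");
}

// Append "_NEG" to every token that follows a negation word within one
// sentence. The sentence is given as an array of tokens, so the end of the
// array is the end of the negation scope.
function tagNegation(sentenceTokens) {
  let negated = false;
  return sentenceTokens.map((token) => {
    if (negated) return token + "_NEG";
    if (NEGATION_WORDS.has(token.toLowerCase())) negated = true;
    return token;
  });
}
```

For instance, fixElongation("loooove!") gives "loove!", and tagNegation(["I", "do", "not", "love", "you"]) tags only the tokens "love" and "you".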
POS tagging consists of labeling normalized tweet tokens with their respective
part-of-speech (POS) values. There are many POS tagging technologies, but only a
few of them are designed to perform Twitter-specialized POS analysis. The solution
proposed in this master’s thesis uses the Twitter ARK POS Tagger, a Java-based
part-of-speech tagger for English data that is tailor-made for Twitter posts. The
ARK POS Tagger was developed by a group of researchers from Carnegie Mellon
University², who manually annotated 1,827 tweets with POS tags and developed a
specialized POS tagset for tweets. The ARK POS Tagger reports an accuracy of
nearly 90% [22], making it one of the most effective solutions available.
Figure 4.6 shows the ARK POS tagset with examples for each POS tag. The most
relevant POS tags for sentiment classification are adjectives (tag: A) and nouns
(tag: N), which usually express some degree of sentiment. For this reason,
sentiment lexicons like SentiWordNet and AFINN are mostly composed of nouns and
adjectives. However, hashtags (tag: #) in tweets may also contribute significantly
to the extraction of sentiment expressions. For example, the hashtag #BeautifulDay
clearly has a positive connotation.
Table 4.3 shows how the POS tagging process works. In this example, the
normalized tokens not, their_NEG, best_NEG and #Live are replaced by R/not,
O/their_NEG, A/best_NEG and #/#Live respectively. The POS tag followed by the
symbol "/" is prepended to each token, even if the token is already tagged as
negated (_NEG). Additionally, tokens already identified in previous steps, e.g.
@mentions, URLs and punctuation symbols, are ignored by the POS tagger. This is
the final processing step before the feature vector generation starts. The
following sections explain how these tokens are transformed into the vectors that
form a key component of the proposed entity-based sentiment classifier.
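The tagged token format used in Table 4.3 can be illustrated as follows. The sketch does not call the ARK POS Tagger: the tags array is a stand-in for the tagger output, and the function name attachPosTags as well as the convention of using null for tokens that are left untagged are assumptions of this example.

```javascript
// Illustrative sketch only; `tags` stands in for the output of a POS tagger.
// A null tag marks a token that is left untagged (e.g. a URL or an @mention).
function attachPosTags(tokens, tags) {
  if (tokens.length !== tags.length) {
    throw new Error("tokens and tags must have the same length");
  }
  // Prepend "<tag>/" to each token, keeping any "_NEG" suffix untouched.
  return tokens.map((token, i) => (tags[i] === null ? token : tags[i] + "/" + token));
}
```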
² http://www.cmu.edu/
The feature vector generation process is arguably the most important component
of any sentiment classifier. A Support Vector Machine (SVM) based classifier like
the one presented and implemented for this master’s thesis depends highly on the
quality of the features extracted from the raw data (tweets in this case).
Therefore, in order to achieve a highly accurate classification, the production of
feature vectors must be done with precision. The feature generation module is
responsible for extracting numerical values from the already normalized and
tagged tokens; the way these values are generated depends on the type of features
required. This section explores two types of features: document-based and
entity-based features. The sentiment classifier developed in this master’s thesis
uses both types of feature extraction, classifying tweets not only on a document
level, like most sentiment classifiers, but also on an entity level, which is the
final goal of this project.
Document-based features are those extracted from document-level data. This means
that every single normalized and tagged token obtained from the input tweet is
relevant and considered for the generation of vectors. As a result, each tweet is
represented as a feature vector made up of the following sets of features: binary
bag-of-words (unigrams), POS tags and linguistic features.
is present and 0 that it is absent. Table 4.4 shows an example of what a generated
binary bag-of-words vector based on unigrams looks like.
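A binary bag-of-words representation based on unigrams can be sketched as follows. This is a minimal illustration, assuming that the vocabulary is built from the tokens of the training tweets; the function names buildVocabulary and binaryBagOfWords are hypothetical and do not come from the thesis source code.

```javascript
// Illustrative sketch only; function names are hypothetical.

// Assign a vector position to every distinct unigram seen in the training tweets.
function buildVocabulary(tokenizedTweets) {
  const vocabulary = new Map();
  for (const tokens of tokenizedTweets) {
    for (const token of tokens) {
      if (!vocabulary.has(token)) vocabulary.set(token, vocabulary.size);
    }
  }
  return vocabulary;
}

// 1 means the unigram is present in the tweet, 0 means it is absent.
// Repeated tokens still yield 1; tokens outside the vocabulary are ignored.
function binaryBagOfWords(tokens, vocabulary) {
  const vector = new Array(vocabulary.size).fill(0);
  for (const token of tokens) {
    const index = vocabulary.get(token);
    if (index !== undefined) vector[index] = 1;
  }
  return vector;
}
```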
The feature vector generated from part-of-speech (POS) tags consists of the
number of verbs, adverbs, adjectives and nouns contained in a tweet. Following
Saif et al. [19], these four POS tags are considered the most relevant for
sentiment classification. The addition of more POS tags to the feature generation
process might have a negative impact on the classification accuracy. Table 4.5
shows an example of POS tag features; the order of POS tags in the vector is the
following: (1) noun -> (2) adjective -> (3) adverb -> (4) verb
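The POS tag feature vector can be sketched as a simple counter over tagged tokens. The sketch assumes the "tag/token" format of Table 4.3 and the ARK tag letters N (noun), A (adjective), R (adverb) and V (verb); the function name posTagCounts is hypothetical.

```javascript
// Illustrative sketch only. Vector order: noun, adjective, adverb, verb.
const POS_ORDER = ["N", "A", "R", "V"];

function posTagCounts(taggedTokens) {
  const counts = POS_ORDER.map(() => 0);
  for (const tagged of taggedTokens) {
    const separator = tagged.indexOf("/");
    if (separator === -1) continue; // token without a POS tag, e.g. a URL
    const position = POS_ORDER.indexOf(tagged.slice(0, separator));
    if (position !== -1) counts[position] += 1;
  }
  return counts;
}
```

Applied to the tagged tokens of Table 4.3 (R/not, O/their_NEG, A/best_NEG, #/#Live), the sketch counts one adjective and one adverb, while the pronoun and the hashtag are not part of the vector.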
Linguistic features are a set of elements extracted from tweets and represented as
counts in the feature vector. The linguistic features consist of the following
eight elements:
Figure 4.7 illustrates the processing steps required for the generation of
entity-based features. The process starts with the identification of the target
entity context. The following techniques were used to extract target contexts:
1. Sentence separation: A tweet may contain many entities, but only the target
entity and its context should be considered for entity-level feature generation.
Therefore, each sentence in a given tweet is checked for the presence of
entities, and only sentences that fulfil one of the following rules are
considered: (1) the sentence contains the target entity; (2) the sentence
contains no entities and is in the neighborhood of a sentence that contains the
target entity. To illustrate this concept, Figure 4.8 shows how the relevant
context is identified in a tweet with two sentences: the first is relevant
because it contains the target entity "Nexus 5X", while the second is ignored
because it contains another entity ("Apple").
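The two sentence selection rules above can be sketched as a filter over tokenized sentences. The sketch assumes that entities have already been replaced by the TargetEntity and OtherEntity placeholders, and it interprets "neighborhood" as the directly preceding or following sentence; this interpretation and the function name selectTargetContext are assumptions of the example.

```javascript
// Illustrative sketch only; "neighborhood" is taken to mean adjacent sentences.
function selectTargetContext(sentences) {
  const hasTarget = (sentence) => sentence.some((token) => token.includes("TargetEntity"));
  const hasOther = (sentence) => sentence.some((token) => token.includes("OtherEntity"));

  return sentences.filter((sentence, i) => {
    // Rule 1: the sentence contains the target entity.
    if (hasTarget(sentence)) return true;
    // Sentences that mention another entity are ignored.
    if (hasOther(sentence)) return false;
    // Rule 2: entity-free sentence next to a target entity sentence.
    const previousHasTarget = i > 0 && hasTarget(sentences[i - 1]);
    const nextHasTarget = i < sentences.length - 1 && hasTarget(sentences[i + 1]);
    return previousHasTarget || nextHasTarget;
  });
}
```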
1. No. Tokens: the number of sentiment tokens in the tweet. These tokens are
words with sentiment scores above or below zero in a lexicon.
When an entity context has negated tokens, the sentiment scores of those tokens
are inverted. The solution presented in this project uses a combination of
different lexicons, and the quality of these lexicons is critical for the
classifier. Therefore, the following state-of-the-art lexical resources were
selected for this task:
• Bing Liu [24] [25]: Developed by Bing Liu, all terms of this lexicon are
manually labeled. It contains 6,790 positive and negative words, which are not
scored.
• AFINN [26]: Manually labeled lexicon composed of 2,477 words; each word
has a sentiment score in the range of -5 to 5.
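The count of sentiment tokens and the inversion of scores for negated tokens, as described above, can be sketched with a toy lexicon. The three entries of TOY_LEXICON are illustrative values in an AFINN-like range and are not quoted from any of the lexicons listed; the function name lexiconFeatures is hypothetical.

```javascript
// Illustrative sketch only; the lexicon entries are made-up example values.
const TOY_LEXICON = new Map([["love", 3], ["best", 3], ["hate", -3]]);
const NEGATION_SUFFIX = "_NEG";

function lexiconFeatures(tokens, lexicon) {
  let sentimentTokens = 0; // number of tokens with a non-zero lexicon score
  let totalScore = 0;      // sum of scores, inverted for negated tokens
  for (const token of tokens) {
    const negated = token.endsWith(NEGATION_SUFFIX);
    const word = (negated ? token.slice(0, -NEGATION_SUFFIX.length) : token).toLowerCase();
    const score = lexicon.has(word) ? lexicon.get(word) : 0;
    if (score === 0) continue;
    sentimentTokens += 1;
    totalScore += negated ? -score : score;
  }
  return { sentimentTokens, totalScore };
}
```

With this toy lexicon, the tokens of "I love you" produce a total score of 3, whereas the negation-tagged tokens of "I do not love you" produce -3.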
This component represents the final stage of the proposed sentiment classification
solution. There are many supervised learning models, such as Maximum Entropy, Naive
Bayes and Neural Networks. However, Support Vector Machines (SVMs) have the
potential to handle large feature spaces very efficiently [30]. Hence, an SVM
is used to deal with the large set of features generated from tweets and entities;
the generation of these features is explained in Section 4.6.2. Like any other
supervised learning model, SVMs require training data to function. This training
data is represented as numerical feature vectors labeled with their respective
class. The labeling process is usually done manually (by evaluators), but in some
cases distant-supervision methods are used. The job of an SVM is to find a clear
separation between the training vectors of different classes; based on this
learning process, the SVM is able to classify new documents (tweets in this case).
The proposed entity-based classifier uses a support vector machine (SVM) NodeJS
module called Node-SVM, which is a port of the C++ SVM library LIBSVM [31].
This library is one of the most popular SVM solutions available for machine
learning based classification and regression. For a better understanding of the
implemented classifier, the parameters used to set up Node-SVM are the following:
Figure 4.10 illustrates the linear form of an SVM, which is a hyperplane that
divides a set of positive data points (positively labeled tweets) from a set of
negative data points. The separation distance between these two sets is called the
margin. A linear SVM selects the hyperplane with the maximum margin, that is, the
hyperplane whose distance to the closest positive and negative data points is as
large as possible.
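The role of the hyperplane in a linear SVM can be illustrated with a small sketch. It assumes an already trained model given by a weight vector w and a bias b; it does not use Node-SVM, and the function names are hypothetical. For a hyperplane in canonical form, where the closest points satisfy |w·x + b| = 1, the margin width is 2 / ||w||.

```javascript
// Illustrative sketch only; w and b would come from a trained linear SVM.

// Signed value of w·x + b for a feature vector x.
function decisionValue(w, b, x) {
  return w.reduce((sum, weight, i) => sum + weight * x[i], b);
}

// Points on the non-negative side of the hyperplane are labeled positive.
function classify(w, b, x) {
  return decisionValue(w, b, x) >= 0 ? "positive" : "negative";
}

// Width of the margin for a hyperplane in canonical form.
function marginWidth(w) {
  return 2 / Math.hypot(...w);
}
```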
For a full description of the SentiTrack system, refer to Section 2.2.3. The
integration of the sentiment classifier developed in this master’s thesis into the
SentiTrack system was fairly simple, since both platforms are fully developed with
NodeJS (JavaScript) technologies. In order to perform the integration, the
proposed classifier was implemented as a NodeJS module using the JS package
manager (npm) approach. Figure 4.11 shows the package organization of the
implemented module (source code files). The name of the module is
entity_sentiment, which refers to the capabilities of the implemented entity-based
sentiment classifier. The following line of code is required to make use of the
entity_sentiment module:
This line allows NodeJS classes to perform classification using the proposed
solution. In the following Chapter 5, the quality and performance of the developed
classifier are tested and evaluated, and an analysis of the results is presented.
This chapter presents the evaluation and results of the entity-based sentiment
classifier developed in this master’s thesis. The chapter is divided into three
sections. The first one describes the process of collecting the datasets used to
train the classifier and evaluate the proposed solution. The second section covers
the quality evaluation, in which the overall success of the implemented sentiment
classifier is tested and analyzed. Finally, the performance of the classifier is
measured and compared with other similar solutions.
Sentiment classifiers that make use of machine learning methods require a corpus
of labeled documents as training data in order to function. Usually the data is
manually labeled by evaluators. However, the level of sentiment classification will
determine which type of training data is necessary. For sentiment analysis of tweets,
these are the most used types of datasets:
Chapter 5 Evaluation and Analysis of Results 43
• Sanders Analytics¹: this dataset is intended for training and testing
sentiment analysis algorithms. It is composed of 5,513 manually classified
tweets. A 3-class classification method was used (positive, negative, neutral),
and each tweet expresses sentiment towards a specific entity.
A normalization process was necessary to remove repeated tweets and noisy tokens
from the collection. Additionally, a balance between the three sentiment classes
(positive, negative and neutral) had to be achieved in order to guarantee
effective performance of the support vector machine. Table 5.2 shows a summary of
the collected tweets. 70% of the 4900 labeled tweets are used as training data,
leaving 30% for evaluation and testing purposes.
¹ http://www.sananalytics.com/lab/twitter-sentiment/
² http://tweenator.com/index.php?page_id=13
Using the confusion matrix illustrated in Table 5.3, several different metrics were
calculated for the evaluation of the classifier. The scoring metrics considered in this
project are the following:
\[
\text{precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \tag{5.1}
\]
• Recall: the fraction of relevant tweets that are successfully predicted.
Continuing with the positive class example, recall is defined as:
\[
\text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative} + \text{false neutral}} \tag{5.2}
\]
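\[
\text{F score} = 2 \cdot \frac{\text{precision}_{\text{positive}} \cdot \text{recall}_{\text{positive}}}{\text{precision}_{\text{positive}} + \text{recall}_{\text{positive}}} \tag{5.4}
\]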
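The metrics defined above can be computed directly from a three-class confusion matrix, as the following sketch shows. It assumes a matrix layout in which rows are the actual classes and columns the predicted classes; this layout, the function names and the example numbers are assumptions of the sketch and are not taken from Table 5.3.

```javascript
// Illustrative sketch only. matrix[actual][predicted], with class order
// 0 = positive, 1 = negative, 2 = neutral.
function classMetrics(matrix, classIndex) {
  const truePositive = matrix[classIndex][classIndex];
  // Column sum: every tweet predicted as this class (true + false positives).
  const predicted = matrix.reduce((sum, row) => sum + row[classIndex], 0);
  // Row sum: every tweet that actually belongs to this class.
  const actual = matrix[classIndex].reduce((sum, value) => sum + value, 0);

  const precision = predicted === 0 ? 0 : truePositive / predicted;
  const recall = actual === 0 ? 0 : truePositive / actual;
  const fScore = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, fScore };
}

// Fraction of all tweets whose predicted class equals the actual class.
function accuracy(matrix) {
  let correct = 0;
  let total = 0;
  matrix.forEach((row, i) => {
    correct += row[i];
    total += row.reduce((sum, value) => sum + value, 0);
  });
  return total === 0 ? 0 : correct / total;
}
```

For a made-up matrix such as [[8, 1, 1], [2, 6, 2], [0, 3, 7]], the positive class obtains a precision and a recall of 8/10, and the overall accuracy is 21/30.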
Based on the 4900 collected tweets (datasets) and the previously described
metrics, a 4-fold cross-validation test was performed on the proposed entity-based
sentiment classifier. The results are presented in Table 5.4. According to these
results, the proposed classifier achieved an accuracy of 0.635 (64%).
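The partitioning behind a k-fold cross-validation test can be sketched as follows. This is a generic illustration and not the evaluation script used for this thesis: the function name kFoldSplits is hypothetical, items are assigned to folds in round-robin order, and no shuffling or class balancing is performed.

```javascript
// Illustrative sketch only; no shuffling or stratification is applied.
function kFoldSplits(items, k) {
  // Distribute the items over k folds in round-robin order.
  const folds = Array.from({ length: k }, () => []);
  items.forEach((item, i) => folds[i % k].push(item));

  // Each fold is used once as test set; the remaining folds form the training set.
  return folds.map((testFold, i) => ({
    test: testFold,
    train: [].concat(...folds.filter((_, j) => j !== i)),
  }));
}
```

With 4900 tweets and k = 4, each of the four rounds trains on 3675 tweets and tests on the remaining 1225.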
The results in Table 5.4 show that the positive class achieved the lowest
performance, while the neutral class obtained the highest. The neutral class tends
to get better results since its classification depends on the absence of sentiment
expressions. Polarity classification, in contrast, represents a bigger challenge
because of the possible presence of negation contexts and sarcastic comments.
These results show how important the lexicon features are to the overall
performance of the classifier: the contribution of the seven state-of-the-art
lexical resources is significantly higher than the contributions of the other
features. Content and POS tag features achieved very interesting results,
surpassing unigram and emoticon features by a considerable margin. Although
unigrams and bag-of-words are some of the most popular features, the evaluation
results show an insignificant contribution from this vector. Therefore, it is
possible to assume that in an entity-based classification approach, n-grams might
not have the same impact as in document-level sentiment classifiers.
³ https://www.npmjs.com/package/sentiment
⁴ https://github.com/Ulflander/compendium-js
Figure 5.2 illustrates the results obtained by a 4-fold cross-validation test of
the 4900 target-labeled tweets (Section 5.1) performed over three sentiment
classifiers, including the one developed and proposed in this master’s thesis. The
results show a significant difference in F-score and accuracy between the
entity-based classifier and the other two tools. The poor performance of the
former SentiTrack classifier is a consequence of its simplicity: it is based on a
single lexical resource, AFINN, with no consideration of negation context or
emoticons. The difference between the results is related to the classification
techniques: most state-of-the-art sentiment classifiers for Twitter use supervised
methods capable of analyzing many tweet-specific features. Hence, the presented
solution makes use of machine learning methods.
A performance evaluation measures how long the proposed classifier takes to
process a given number of tweets. This is a very important evaluation, given that
one of the objectives of this research project is to develop a sentiment
classifier capable of operating within real-time processing systems such as
SentiTrack. Additionally, the performance test is also applied to the former
SentiTrack classifier and CompendiumJS in order to compare the resulting
processing times. These are the evaluation environment specifications:
The results are shown in Table 5.6. They reflect a large difference in processing
time between the former SentiTrack classifier and the proposed solution. It can be
inferred that the complex pipeline of processes required by the entity-based
classifier and the usage of supervised techniques have an impact on classification
time. However, the performance achieved by the proposed solution is good enough to
cope with real-time processing environments like SentiTrack.
Classifier           Processing time for 1000 tweets (ms)
Entity-based         3447.054
Former SentiTrack    323.310
CompendiumJS         2357.886
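A processing time measurement of this kind can be sketched in NodeJS as follows. This is not the benchmark script used for Table 5.6: the function names are hypothetical, and classify stands for any function that classifies a single tweet. The second helper converts a measured time into a throughput figure.

```javascript
// Illustrative sketch only; `classify` is any function taking one tweet.
function timeClassification(classify, tweets) {
  const start = process.hrtime.bigint(); // high-resolution time in nanoseconds
  for (const tweet of tweets) classify(tweet);
  const elapsedNanoseconds = process.hrtime.bigint() - start;
  return Number(elapsedNanoseconds) / 1e6; // elapsed time in milliseconds
}

// Number of tweets processed per second for a given measurement.
function tweetsPerSecond(tweetCount, elapsedMilliseconds) {
  return tweetCount / (elapsedMilliseconds / 1000);
}
```

Applied to the entity-based result of Table 5.6, tweetsPerSecond(1000, 3447.054) corresponds to roughly 290 tweets per second.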
The overall results were very positive in comparison to previous SentiTrack
experiments in which the former classifier was used: evidence of a moderate
correlation was found for 3 out of 6 companies, with a maximum correlation of
0.84 (84%). These results indicate a successful integration between the proposed
sentiment classifier and SentiTrack, which will provide a more accurate analysis
for future experiments.
Chapter 6
This thesis presented the research, evaluation and solution for an entity-based
sentiment classifier for social media analysis. The implemented system is able to
perform sentiment classification of tweets based on the presence of entities and
the opinion expressions targeting them.
This chapter concludes this master’s thesis project with a summary of the
achievements made, the limitations encountered and possible future extensions and
enhancements.
6.1 Achievements
• The main goal of this thesis is the study of an entity-based sentiment
classification approach for the analysis of social media data. The research
presents the facts, reasons and evaluation results that led the project to the
most suitable methods for the required solution.
• A successful approach was produced for the required solution. The presented
approach is able to extract opinion expressions targeting relevant entities in
tweets.
Chapter 6 Conclusions and Future Work 51
• The clean organization and structure of the developed system allows
interested researchers to improve and modify the provided tools in the future.
6.2 Limitations
• Memory issues limited the development of the classifier to the use of
unigrams with the bag-of-words method. The usage of different levels of n-grams
as features might have contributed significantly to better final results.
• Although entity-targeted opinion expressions were extracted from tweets,
there are many cases where no clear separation of sentences is made, leading
to incorrect classifications.
• Extend the solution to work with other social networks such as Facebook,
LinkedIn and Google+.
Appendix A
Appendix
A.1 Glossary
The source code of the project can be downloaded from the following link which
points to the Git repository of the EIS group, University of Bonn.
https://github.com/EIS-Bonn/Theses/tree/master/2015/Cristobal_Leiva
To run the application, the system has to fulfill the following requirements and the
user needs to follow the instructions given below.
Requirements:
• Node.js + NPM
• Latest MongoDB
How to Install:
2. Configure Twitter API keys: open config.sample.js, fill in the required keys
under the Twitter app config and save it as config.js
3. To start the NodeJS server: start mongo, then run the command "gulp"
[1] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations
and trends in information retrieval, 2(1-2):1–135, 2008.
[3] Efthymios Kouloumpis, Theresa Wilson, and Johanna D Moore. Twitter sentiment
analysis: The good the bad and the omg! ICWSM, 11:538–541, 2011.
[4] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-
dependent twitter sentiment classification. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1, pages 151–160. Association for Computational Linguis-
tics, 2011.
[5] Luciano Barbosa and Junlan Feng. Robust sentiment detection on twitter from
biased and noisy data. In Proceedings of the 23rd International Conference
on Computational Linguistics: Posters, pages 36–44. Association for Computa-
tional Linguistics, 2010.
[6] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment
classification using machine learning techniques. In Proceedings of the ACL-
02 conference on Empirical methods in natural language processing-Volume 10,
pages 79–86. Association for Computational Linguistics, 2002.
[9] Bing Liu. Sentiment analysis and opinion mining. Synthesis lectures on human
language technologies, 5(1):1–167, 2012.
[11] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N Project Report, Stanford, 1:12, 2009.
[12] Priyanka Dank, Simon Scerri, and Ali Khalili. Linked data-based social media
analysis for stock market tracking.
[13] David Nadeau and Satoshi Sekine. A survey of named entity recognition and
classification. Lingvisticae Investigationes, 30(1):3–26, 2007.
[14] Tetsuya Nasukawa and Jeonghee Yi. Sentiment analysis: Capturing favorabil-
ity using natural language processing. In Proceedings of the 2nd international
conference on Knowledge capture, pages 70–77. ACM, 2003.
[15] Sanjiv Das and Mike Chen. Yahoo! for amazon: Extracting market sentiment
from stock message boards. In Proceedings of the Asia Pacific finance asso-
ciation annual conference (APFA), volume 35, page 43. Bangkok, Thailand,
2001.
[16] Richard M Tong. An operational system for detecting and tracking opinions
in on-line discussion. In Working Notes of the ACM SIGIR 2001 Workshop on
Operational Text Classification, volume 1, page 6, 2001.
[17] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis
and opinion mining. In LREC, volume 10, pages 1320–1326, 2010.
[18] Georgios Paltoglou and Mike Thelwall. Twitter, myspace, digg: Unsupervised
sentiment analysis in social media. ACM Transactions on Intelligent Systems
and Technology (TIST), 3(4):66, 2012.
[20] Xiaowen Ding, Bing Liu, and Philip S Yu. A holistic lexicon-based approach to
opinion mining. In Proceedings of the 2008 International Conference on Web
Search and Data Mining, pages 231–240. ACM, 2008.
Bibliography 57
[21] Tun Thura Thet, Jin-Cheon Na, and Christopher SG Khoo. Aspect-based sen-
timent analysis of movie reviews on discussion boards. Journal of Information
Science, page 0165551510388123, 2010.
[22] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel
Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan,
and Noah A Smith. Part-of-speech tagging for twitter: Annotation, features,
and experiments. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies: short papers-
Volume 2, pages 42–47. Association for Computational Linguistics, 2011.
[23] Svetlana Kiritchenko, Xiaodan Zhu, and Saif M Mohammad. Sentiment analysis
of short informal texts. Journal of Artificial Intelligence Research, pages 723–
762, 2014.
[24] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In
Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 168–177. ACM, 2004.
[25] Bing Liu, Minqing Hu, and Junsheng Cheng. Opinion observer: analyzing
and comparing opinions on the web. In Proceedings of the 14th international
conference on World Wide Web, pages 342–351. ACM, 2005.
[26] Finn Årup Nielsen. A new anew: Evaluation of a word list for sentiment analysis
in microblogs. arXiv preprint arXiv:1103.2903, 2011.
[27] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical
resource for opinion mining. In Proceedings of LREC, volume 6, pages 417–422.
Citeseer, 2006.
[28] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual
polarity in phrase-level sentiment analysis. In Proceedings of the conference on
human language technology and empirical methods in natural language process-
ing, pages 347–354. Association for Computational Linguistics, 2005.
[29] Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. Sentiment analysis
of short informal texts. Journal of Artificial Intelligence Research, 50:723–762,
2014.
[30] Thorsten Joachims. Text categorization with support vector machines: Learning
with many relevant features. Springer, 1998.
[31] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector
machines. ACM Transactions on Intelligent Systems and Technology (TIST),
2(3):27, 2011.
[32] John Platt et al. Sequential minimal optimization: A fast algorithm for training
support vector machines. 1998.
[33] Hassan Saif, Miriam Fernandez, Yulan He, and Harith Alani. Evaluation
datasets for twitter sentiment analysis: a survey and a new dataset, the sts-gold.
2013.
[34] Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M Mohammad, Alan
Ritter, and Veselin Stoyanov. Semeval-2015 task 10: Sentiment analysis in
twitter. Proceedings of SemEval-2015, 2015.