
Entity Based Sentiment Classifier For Social Media Analysis

This document is Cristobal Leiva's master's thesis submitted in August 2019 for the degree of Master of Science in Enterprise Information Systems at Rheinische Friedrich-Wilhelms-Universität Bonn. The thesis proposes an entity-based sentiment classifier for analyzing social media. It introduces the background of social media and sentiment analysis, describes the proposed approach of identifying entities in texts and classifying sentiment at the entity level. The approach is evaluated on Twitter data and integrated with the SentiTrack system.

Uploaded by

Aldo Loayza

RHEINISCHE FRIEDRICH-WILHELMS-UNIVERSITÄT BONN

INSTITUT FÜR INFORMATIK III

Master Thesis

Entity Based Sentiment Classifier for Social Media Analysis

Author: Cristobal Leiva


E-mail: [email protected]
Matriculation number: 2616679

First Evaluator: Prof. Dr. Sören Auer


Second Evaluator: Prof. Dr. Jens Lehmann
Supervisor: Dr. Simon Scerri

A thesis submitted in fulfilment of the requirements
for the degree of Master of Science

in the

Enterprise Information Systems


Computer Science Department

August 2019
Declaration of Authorship
I, Cristobal Leiva, declare that this thesis titled, ’Entity based Sentiment Classifier
for Social Media Analysis’ and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree
at this University.

• Where any part of this thesis has previously been submitted for a degree or
any other qualification at this University or any other institution, this has
been clearly stated.

• Where I have consulted the published work of others, this is always clearly
attributed.

• Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

Cristobal Leiva

Signed:

Date:

"Achievement of your happiness is the only moral purpose of your life, and that
happiness, not pain or mindless self-indulgence, is the proof of your moral integrity,
since it is the proof and the result of your loyalty to the achievement of your values."

Ayn Rand (1905-1982), American philosopher.


Contents

1 Introduction
  1.1 Problem and motivation
  1.2 Contribution
  1.3 Overview of the document

2 Background
  2.1 Social Media
    2.1.1 Twitter
  2.2 Sentiment Analysis
    2.2.1 Analysis Levels
    2.2.2 Sentiment Classification
      2.2.2.1 Unsupervised Techniques
      2.2.2.2 Supervised Techniques
    2.2.3 SentiTrack
  2.3 Named Entity Recognition

3 Related Work
  3.1 Sentiment Analysis
  3.2 Entity Based Sentiment Analysis

4 Approach
  4.1 Architecture
  4.2 Entity Identification
  4.3 Tokenization
  4.4 Normalization
  4.5 POS Tagging
  4.6 Feature Vector Generation
    4.6.1 Document-based features generation
      4.6.1.1 Binary bag-of-words
      4.6.1.2 POS Tags
      4.6.1.3 Linguistic features
    4.6.2 Entity-based features generation
      4.6.2.1 Lexicon Features
      4.6.2.2 Emoticon Features
  4.7 Support Vector Machine
  4.8 SentiTrack Integration

5 Evaluation and Analysis of Results
  5.1 Data Collection and Processing
  5.2 Quality Evaluation
    5.2.1 Features Contribution
    5.2.2 Quality Comparison
  5.3 Performance Evaluation
  5.4 SentiTrack Experiment

6 Conclusions and Future Work
  6.1 Achievements
  6.2 Limitations
  6.3 Future Work

A Appendix
  A.1 Glossary
  A.2 Technical Specifications
  A.3 SentiTrack GUI

Bibliography
List of Figures

1.1 Social network user growth projections

2.1 Twitter graph representation
2.2 A tweet example
2.3 Penn Treebank Part-Of-Speech (POS) tags
2.4 SentiTrack Workflow

4.1 Processing pipeline and architecture
4.2 Simplified entity-based sentiment classifier workflow
4.3 DBPedia Spotlight annotation example
4.4 Tokenization workflow
4.5 Normalization workflow
4.6 ARK POS Tagset table with examples
4.7 Entity-based features generation workflow
4.8 Sentence separation context identification of tweet
4.9 But clause context extraction of tweet
4.10 Illustration of a linear Support Vector Machine
4.11 Package files organization

5.1 Features Contribution Graph
5.2 Classifiers Comparison Graph

A.1 SentiTrack front-end, bubble cloud
A.2 SentiTrack front-end, live sentiment data
List of Tables

1.1 Example tweets with sentiment
1.2 Example of unsuccessful cases by target-independent sentiment classification

2.1 Unsupervised classification of tweets example

4.1 Tokenized tweet example
4.2 Normalization example
4.3 ARK POS Tagging example over normalized tokens
4.4 Binary bag-of-words representation of a tweet
4.5 POS Tag feature vector example
4.6 Lexical resources summary
4.7 Entity-based feature vectors example

5.1 Entity-based Twitter corpus example
5.2 Datasets Summary
5.3 3-Class Confusion Matrix
5.4 Precision, Recall and F-Score results
5.5 Features contribution details
5.6 Performance test results
Abstract
Having a notion of people’s opinions has always been an essential asset for the
decision-making processes in many business models. Therefore, the popularity of
sentiment analysis of social media content has been growing rapidly in recent
years. However, state-of-the-art sentiment classification approaches lack the
capability of performing entity-based classification, that is, of identifying opinion
expressions that target a specific term, word or entity. This thesis addresses this
particular problem of entity-based sentiment classification.

Combining machine learning and natural language processing methods, the solution
proposed in this project uses several sentiment analysis techniques such as Named
Entity Recognition (NER), POS tagging, feature vector generation, the bag-of-words
model and others. These are required to train a Support Vector Machine that
classifies tweets (from Twitter) as positive, negative or neutral based on the
sentiments expressed toward a given target entity (company, person or product).
With real-time sentiment analysis systems in mind, the presented solution is
developed as a high-performance classifier capable of coping with real-time
processing environments. The completeness and effectiveness of this approach are
affirmed by quality evaluations and performance tests.
Acknowledgements
It delights me immensely to finally present this thesis to the Department of
Computer Science at the University of Bonn, for the partial fulfilment of the Master
of Science in Informatik. I would like to thank Professor Dr. Sören Auer for his
constant support and for accepting my thesis proposal with open arms. This was a
fantastic opportunity and I am fully satisfied with the treatment I received from the
department and the university. I also want to thank Dr. Simon Scerri for supervising
this thesis project and for always being supportive, even in the most difficult
circumstances.

Finally, my sincere gratitude goes to my family and friends, who were always
there for me: my beloved future wife, who did everything she could to keep me on
track and focused on our goals, and my mother, to whom I owe basically my entire life
and achievements. You are the best mother anyone could ever have. Thank you mom!
Thank you all!

Chapter 1

Introduction

This chapter starts by giving insight into the problem and motivation behind
this thesis and continues with an outline of the contributions this research project
provides. Lastly, the structure of the document is described.

1.1 Problem and motivation

Information has become the most valuable resource in modern society. From children
to seniors, from New Zealand to Canada, a very large portion of the current human
population has shared, to some extent, part of their lives by technological means.
This opens up countless possibilities for science, especially for Computer Science
with topics such as Information Retrieval (IR), Machine Learning (ML), Natural
Language Processing (NLP), the Semantic Web, and others. The era of data analysis
is just starting.

Having a notion of people’s thoughts has always been an essential asset for the
decision-making processes in many business models. Even before the existence of the
World Wide Web, people asked each other for product recommendations or for
opinions about events such as local elections. With the beginning of the Internet
era, societies started to use the common wealth of knowledge found on forums,
blogs, and social media network services as an important source of opinions [1].
These opinions may come from customers simply sharing their experience with a
product or from well-known professionals writing elaborate reviews. The Internet
became a valuable pool of experiences available for everyone to use.


Figure 1.1: Social network user growth projection 2010 - 2018 [2]

Figure 1.1 shows the number of social network users worldwide from 2010 to 2014,
with projections until 2018. The projection for 2018 is 2.44 billion users of social
media networks, which is more than 30% of the world’s population. However, the data
generated by these systems must be mined and analyzed before it represents valuable
information for interested parties. This master thesis aims to explore an approach
called "Entity Based Sentiment Classification" to tackle this challenge.

As Figure 1.1 shows, the usage of social networking services such as Twitter has
increased in the past few years. Twitter, as a micro-blogging system, allows users
to publish tweets of up to 140 characters in length to tell others what they are
doing, what they are thinking, or what is happening around them. Therefore,
companies and media organizations are constantly looking for ways of extracting
public opinions and feelings about their products and services [3]. The nature of
tweets (short and usually meaningful in the context of marketing) allows researchers
to exploit different data mining and sentiment analysis approaches. Projects like
Twitrratr (footnote 1), streamcrab (footnote 2), and Talend (footnote 3) are examples
of services intended to obtain sentiment information from tweets.

1 http://www.twitrratr.com
2 http://www.streamcrab.com
3 http://www.talend.com

Table 1.1 shows two example tweets together with their corresponding sentiment.
The first tweet expresses a positive sentiment, containing one positive adjective,
happy, and one positive emoticon, :D. The second tweet indicates negative sentiment
based on the hashtag #sad and the emoticon :(.

Table 1.1: Example tweets with sentiment

Tweet Text                                Sentiment
I am so happy with my new iPhone :D       Positive
I wont make it to the party :( #sad       Negative

In other services such as Sentiment140 (footnote 4) and IBM AlchemyAPI (footnote 5),
users may insert a target entity as a search query; the system then fetches tweets
in real time that contain positive or negative sentiment towards the given target
entity [4]. This task is formally named Targeted Sentiment Analysis and can be
described as the extraction of positive, negative or neutral sentiment towards an
input target entity in a given text.

Most approaches that deal with Targeted Sentiment Analysis are based on the
extraction of target-independent features. In 2010, Barbosa and Feng [5] used
machine-learning-based classifiers for the sentiment classification of texts.
However, their classifiers actually work in a target-independent way: all the
features used in the classifiers are independent of the target, so the sentiment is
decided no matter what the target is. Pang and Lee in 2002 [6] performed a similar
sentiment classification experiment on movie reviews, in which they only considered
target-independent features. Movie reviews usually concentrate on opinions expressed
towards a single target entity, in this case a specific movie. This approach does
not carry over to entity-based sentiment classification in tweets. Tweets have a
very particular structure in which multiple target entities may exist in the same
context; in this scenario, target-independent sentiment classifiers will not yield
satisfactory results.

Table 1.2 illustrates two example tweets where TiSC (footnote 6) approaches cannot
correctly identify the sentiment towards the given target entities. The first tweet
does not express any positive sentiment towards the given target iPhone but towards
a second entity, Amazon; since the user is only interested in the input entity, this
is a false positive for TiSCs. In the second tweet, the target iPhone is classified
as positive when it should be negative; this happens because the comparison
expression "better than" goes undetected.

4 http://www.sentiment140.com
5 http://www.alchemyapi.com
6 Target-independent Sentiment Classification according to Jian and Liu 2011

Table 1.2: Example of unsuccessful cases by target-independent sentiment classification

Tweet Text                                    Target    Sentiment
My new iPhone arrives today. I <3 Amazon!     iPhone    Positive
The Nexus 5 is better than the new iPhone     iPhone    Positive
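
The failure mode in Table 1.2 can be reproduced with a very small sketch. The code
below is not the classifier of any system cited here; it is a hypothetical
target-independent scorer with a toy lexicon (the names LEXICON and
classify_target_independent are assumptions made for illustration). It scores the
whole tweet and never looks at the target, so both tweets come out as Positive for
the target iPhone.

```python
# Toy polarity lexicon (hypothetical, for illustration only).
LEXICON = {"<3": 1, "better": 1, "love": 1, "terrible": -1, "worse": -1}


def classify_target_independent(tweet, target):
    """Score the whole tweet; `target` is accepted but deliberately ignored."""
    # Crude whitespace tokenization after stripping two punctuation marks.
    tokens = tweet.lower().replace("!", " ").replace(".", " ").split()
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


tweets = [
    ("My new iPhone arrives today. I <3 Amazon!", "iPhone"),
    ("The Nexus 5 is better than the new iPhone", "iPhone"),
]
for text, target in tweets:
    # Both lines print "iPhone -> Positive", matching Table 1.2.
    print(target, "->", classify_target_independent(text, target))
```

Because the target argument has no influence on the score, asking the same function
about Amazon or Nexus 5 returns the same label, which is exactly the limitation an
entity-based classifier has to remove.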

A correct sentiment classification for given target entities is crucial for systems
such as SentiTrack (footnote 7). SentiTrack intends to find correlations between the
opinions extracted from real-time tweets and intra-day stock price variations of a
set of specific companies. This task therefore requires an entity-based sentiment
classification, and providing it is the main objective of this master’s thesis. To
achieve this goal, the proposed solution combines document-based (target-independent)
and entity-based features with a variety of sentiment classification techniques to
train a Support Vector Machine (SVM) capable of producing highly accurate results
for entity-based sentiment classification tasks.

1.2 Contribution

The main contributions that this thesis aims to achieve are the following:

• Develop an entity-based sentiment classifier with state-of-the-art techniques
for Twitter data analysis that yields highly accurate results on target-based
Twitter corpora.

• Create a high-performance sentiment classifier that integrates seamlessly with
real-time analysis environments such as SentiTrack.

• Describe an approach to perform sentiment classification based on the position
of entities and the presence of grammatical condition clauses in a given tweet
text.

• Compare and evaluate different techniques for sentiment classification by
measuring their performance, accuracy, precision and recall.
7 Linked Data-based Social Media Analysis for Stock Market Tracking

1.3 Overview of the document

The introduction in Chapter 1 provides a short overview of the current state of
social media and sentiment analysis, the problems present in current sentiment
classification approaches and the motivation for this thesis. It also describes
the contributions provided by this project. Chapter 2 explains what sentiment
analysis is and covers the current state-of-the-art sentiment classification
techniques, different Named Entity Recognition approaches and the theoretical
background related to social media. Chapter 3 explores the related work in the field
of sentiment analysis and entity-based sentiment classification. Chapter 4 presents
the proposed sentiment classification approach as built and explains its
architecture and features. The evaluation of this project is developed in Chapter 5,
which consists of cross-validation and performance tests. Finally, Chapter 6
summarizes the master’s thesis with its conclusions and possible future work.
Chapter 2

Background

This chapter explains the theoretical background used for the development of an
entity-based sentiment classifier. The chapter is divided into three sections.
Section 2.1 explains the basic concepts related to social media and the social
networking service Twitter. Section 2.2 defines Sentiment Analysis and describes
different state-of-the-art sentiment classification techniques. Finally, Section 2.3
explores existing Named Entity Recognition approaches.

2.1 Social Media

Social Media refers to a set of computer-based tools that allow people,
organizations and companies to share and exchange information with community
networks. Moreover, Social Media tends to change with time, allowing people to
create content in a dynamic way and without restrictions. Therefore, there is a vast
number of communities performing all kinds of activities such as blogging, posting
on forums, podcasting, generating trends on social networks and more.

Social Media relies on various technologies to achieve compelling communication
between users. Web structures like social networks are fundamental for social
media. These structures are mainly made of actors and messages, which can
be visualized as graph nodes with specific properties and unique attributes. This
concept is depicted in Figure 2.1, which represents the interaction between the
different elements of the social network Twitter. Section 2.1.1 explains Twitter and
its elements in detail.


Figure 2.1: Twitter: graph representation [7]

2.1.1 Twitter

Being one of the most popular websites in the world, Twitter has become an
ever-growing corpus of information. As a social network, Twitter allows constant
sharing of short messages between users; this is also called micro-blogging. Users
can follow each other's activity and receive updates of each other's messages.
Currently, this network has around 320 million active users [8].

As described in Figure 2.1, Twitter is composed of the following elements:

• Tweet: Tweets are the essence of Twitter. They are short message units (pieces
of information) of at most 140 characters. Users use tweets to share ideas,
status, news or any other kind of information with their followers.
Figure 2.2 shows an example tweet in which the actor (user) is "Outlook"
and the message contains #hashtags, @mentions and URLs.

Figure 2.2: A tweet example

• Users: The users are the source of information on Twitter. They have followers,
who are other users interested in the content being shared.

• Trends: Trends are popular topics close to the geographic location of the
users. They are determined by an algorithm that combines the information
about followers, accounts, and places related to them. The trends intend to
help users to discover relevant topics based on their location.

2.2 Sentiment Analysis

Sentiment analysis (SA), also known as Opinion Mining, can be defined as the use of
computer-based methods to extract an opinion from a given text source. Liu [9, p.
7] defines SA as:

"The field of study that analyzes people’s opinions, sentiments, evaluations,
appraisals, attitudes, and emotions towards entities such as products,
services, organizations, individuals, issues, events, topics, and their
attributes."

Opinions are essential for all human activities because they affect our decisions.
Before making decisions, people usually evaluate others’ opinions. In the real world,
businesses and organizations are always trying to learn the public opinion about
their products and services in order to adapt their marketing strategies [9, p. 10].
Nowadays, companies may no longer require surveys and opinion polls to gather the
opinions of their customers. However, analyzing opinion sites and social networks by
means of SA is not an easy task. It involves several research areas of Computer
Science, such as Natural Language Processing (NLP), Machine Learning (ML) and
Information Retrieval (IR).

2.2.1 Analysis Levels

SA can be applied to a given text source at different levels, which are the
following [9, p. 11]:

• Document level: The objective of this level of SA is to classify the sentiment
(positive, negative or neutral) of a whole document. Documents in this context
refer to any piece of information that requires analysis. Tweets, .pdf files,
forum posts and e-commerce reviews are all considered documents for sentiment
analysis purposes. Sentiment classification of product reviews on e-commerce
systems is a good example of this level of analysis. Document-level SA is
effective on text sources that express an opinion towards a single entity. When
many entities are present in the document, this type of analysis may not be
very accurate.

• Sentence level: As its name suggests, sentence-level analysis segments a
given document into sentences, and each of these sentences is then processed for
classification. As with document-level SA, each sentence must express its
sentiment towards a single entity in order to achieve high accuracy.

• Entity level: Document-level and sentence-level analysis are not capable of
finding the target of people’s opinions. Entity-level analysis is finer-grained
and assumes the presence of a target entity in the opinion expression. For
example, the sentence "although the iPhone is not a good phone, I still love
Apple as a company" may appear to have a positive tone, but it is not accurate
to classify it as positive. In this case, there are two different sentiments: a
positive sentiment towards the entity Apple and a negative sentiment towards the
entity iPhone. Therefore, the ultimate goal of this level of analysis is to find
out the sentiment expressed towards target entities. This thesis project bases
its sentiment classification approach on this level of analysis.

2.2.2 Sentiment Classification

Sentiment classification is one of the most studied topics in the area of sentiment
analysis and natural language processing. The main goal is to classify a document
as positive or negative based on the opinion expressed in it; if the document does
not contain any opinion expression, the classification result must be neutral. The
following sections present unsupervised and supervised techniques for sentiment
classification according to Bing Liu [9].

2.2.2.1 Unsupervised Techniques

Unsupervised techniques for sentiment classification rely heavily on opinion
words drawn from lexical resources. Consequently, these techniques classify
documents using lexicon-based methods. Every word contained in a document or input
text is evaluated for polarity orientation; this orientation is defined by the
presence of opinion words, which are contained in sentiment dictionaries (lexicons).
Hence, if a document contains more positive than negative words, the sentiment
classification result of the document is positive. The absence of polar-oriented
words in a document results in a neutral classification [9, p. 29].

Sentiment dictionaries or opinion lexicons are the core component of any
unsupervised sentiment classification method. These dictionaries are made of words
or phrases with specific sentiment scores. For example, words like fantastic and
excellent have a positive orientation in most sentiment lexicons, while other words
such as terrible and awful are categorized as negative. Sentiment lexicons define the
polarity orientation of words by numeric values; some of them assign an intensity
score to each word, and others take negation contexts into account when scoring.
Section 4.4 explains more about negation contexts.

There are several ways of creating a sentiment lexicon; Bing Liu explains the most
effective ones [9]:

1. Manual approach: As its name suggests, this approach requires the effort
of evaluators to manually assign a sentiment orientation to a set of words. This
task consumes a lot of time and is best used in combination with
other, automatic methods. However, it is useful for evaluating the results of
non-manual approaches [9, p. 79].

2. Dictionary-based approach: This is arguably the most effective approach
for lexicon creation. It automatically generates a sentiment dictionary based
on synonyms, antonyms and the grammatical relations between words, typically
taken from a resource such as WordNet. WordNet is defined as "a large English
lexical database where nouns, verbs, adjectives and adverbs are grouped into
sets of synonyms called synsets, each expressing a distinct concept" [10].
Although there are many dictionary-based approaches, one of the most useful
works as follows. First, evaluators manually annotate a set of seed words
(e.g. bad, good) with "obvious" polarity orientation. Then, each seed word is
expanded by collecting its synonyms and antonyms from a dictionary (e.g.
WordNet). After that, the collected words with their respective sentiment scores
are appended to the original set of seed tokens. This process is repeated
progressively, resulting in an expanding sentiment lexicon [9, p. 80].

3. Corpus-based approach: A corpus-based approach is mostly used for the
generation of domain-dependent sentiment lexicons. In this context, sentiment
words are extracted from a domain-specific corpus or adapted from an
open-domain one [9, p. 82]. This extraction task is not simple, given that
some words may express different or even opposite sentiments depending on
their context. As an example, take the word "unpredictable". An unpredictable
movie plot might be considered positive, but an unpredictable work
schedule would be negative for most people. The corpus-based approach uses
grammatical rules to expand a set of sentiment seed words. With the use of
conjunction words (e.g. and, yet) it is possible to infer the sentiment
orientation of unclassified words. In the sentence "the computer is powerful
and fast", if the word powerful is a positively oriented seed word, we can infer
that fast is also positive. Therefore, a domain-specific lexicon expansion relies
on the presence of conjunctions.
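
The seed expansion of the dictionary-based approach can be sketched in a few lines.
The example below is a minimal illustration under stated assumptions: the tiny
SYNONYMS and ANTONYMS tables are hypothetical stand-ins for a real resource such as
WordNet, polarity is reduced to +1 and -1, and the function name expand_lexicon is
invented for this sketch. Synonyms inherit the polarity of the word they were reached
from, antonyms receive the opposite polarity, and the loop stops when a round adds no
new words.

```python
# Hypothetical miniature thesaurus standing in for a resource such as WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "great": ["excellent"],
    "bad": ["awful", "poor"],
    "awful": ["terrible"],
}
ANTONYMS = {"good": ["bad"], "fine": ["poor"]}


def expand_lexicon(seeds, max_rounds=5):
    """Grow a {word: polarity} lexicon from manually annotated seed words."""
    lexicon = dict(seeds)
    frontier = list(seeds)
    for _ in range(max_rounds):
        next_frontier = []
        for word in frontier:
            polarity = lexicon[word]
            # Synonyms keep the polarity, antonyms flip it.
            candidates = [(s, polarity) for s in SYNONYMS.get(word, [])]
            candidates += [(a, -polarity) for a in ANTONYMS.get(word, [])]
            for new_word, new_polarity in candidates:
                if new_word not in lexicon:  # first assignment wins
                    lexicon[new_word] = new_polarity
                    next_frontier.append(new_word)
        if not next_frontier:  # nothing new was found, stop iterating
            break
        frontier = next_frontier
    return lexicon


lexicon = expand_lexicon({"good": 1, "bad": -1})
print(lexicon)
```

Starting from the two seeds good and bad, the sketch reaches excellent (positive) and
terrible (negative) in the second round. A real implementation would also need the
manual verification step mentioned above, since automatic expansion propagates errors.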

Table 2.1 illustrates the sentiment classification of two example tweets using a
lexicon-based unsupervised technique where positive and negative words are scored as
+1 and -1 respectively. The overall score of each tweet is calculated by adding up
the sentiment values of the individual words.
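
The scoring just described can be written down directly. The following sketch is an
illustration only: the two word sets contain just the polar tokens that appear in
Table 2.1, tokenization is a plain whitespace split, and the names POSITIVE, NEGATIVE,
lexicon_score and label are assumptions made for this example.

```python
# Polar tokens taken from the two example tweets of Table 2.1 (lower-cased).
POSITIVE = {"best", ":d", "#hypped"}
NEGATIVE = {"terrible", "#mad"}


def lexicon_score(tweet):
    """Add +1 for every positive token and -1 for every negative token."""
    score = 0
    for token in tweet.lower().split():
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score


def label(score):
    """Map the overall score to a sentiment class; zero means neutral."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


tweets = [
    "@TylorSwift concerts are the best :D #hypped",
    "@Apple please stop selling terrible music on iTunes #mad",
]
for tweet in tweets:
    score = lexicon_score(tweet)
    print(score, label(score))  # prints "3 Positive", then "-2 Negative"
```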

2.2.2.2 Supervised Techniques

Supervised sentiment classification can be categorized as a natural language
processing task (text classification). There are many methods to perform text
classification; some of the most used classifiers are Maximum Entropy, Naive Bayes
and Support Vector Machines (SVM). Before starting a sentiment classification task,
the number of classes must be defined. The following are the most common
approaches [9]:

• Two-class classifier: Also known as polarity classification, its objective is
to classify a document as positive or negative, which are the two classes.

• Three-class classifier: Similar to the two-class classifier, this one also
includes subjectivity classification, which means that it classifies documents as
neutral or polar (positive or negative).

Figure 2.3: Penn Treebank Part-Of-Speech (POS) tags [9, p. 33]

Table 2.1: Unsupervised classification of tweets example.

Tweet Content                                                       Score
@TylorSwift concerts are the best(+1) :D(+1) #hypped(+1)            3
@Apple please stop selling terrible(-1) music on iTunes #mad(-1)    -2

• Multi-class classifier: A multi-class sentiment classifier is usually based
on emotion classification. Therefore, documents are classified according to
the emotions expressed in them (e.g. angry, sad, happy).

One of the main differences between unsupervised and supervised sentiment
classification methods is the training phase. Sentiment classifiers based on
supervised approaches require a set of annotated documents. This annotation is
usually done manually, but in some cases it is done through distant-supervision
methods [11]. The distant-supervision approach generates a training corpus by
automatic means: it annotates documents based on the presence of positive or
negative emoticons. The annotation accuracy of this method depends on the size
of the documents, so it is mostly used on short texts, as in Twitter sentiment
classification tasks.
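
The emoticon-based annotation can be sketched as follows. The emoticon sets, the function name `distant_label`, the decision to discard tweets with no emoticon or with conflicting emoticons, and the removal of the emoticon from the stored text are all assumptions of this example rather than details taken from [11].

```python
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}   # illustrative subset
NEGATIVE_EMOTICONS = {":(", ":-(", "=("}         # illustrative subset


def distant_label(tweet):
    """Return (text, label) for a tweet, or None if no label can be assigned."""
    tokens = tweet.split()
    has_positive = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_negative = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_positive == has_negative:
        # Either no emoticon at all or conflicting emoticons: skip the tweet.
        return None
    label = "positive" if has_positive else "negative"
    emoticons = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS
    text = " ".join(t for t in tokens if t not in emoticons)
    return text, label


stream = ["great concert tonight :D", "my phone died :(", "just landed"]
training_corpus = [pair for pair in map(distant_label, stream) if pair]
print(training_corpus)
```

Removing the emoticon from the text keeps a classifier trained on this corpus from simply memorizing the labeling rule.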

After being trained, a classifier is capable of predicting the sentiment of new input
documents. The classification process requires the extraction of feature vectors
from the documents; each vector contains n numerical features. The feature sets
of a classifier are essential for obtaining high accuracy. Some of the most effective
features are [9, p. 25]:

• Terms and their frequency (bag-of-words model): These features are
individual words (unigrams) or n-grams with associated frequency counts. An
n-gram is a contiguous sequence of n tokens (words) from a given sequence of
text. Besides the frequency counts of the tokens, their positions might also be
considered, and the TF-IDF weighting scheme commonly used in information
retrieval tasks may be applied as well.

• Part of speech: The part-of-speech (POS) of words in a document could be
useful. For example, some research shows that adjectives are more likely to
indicate sentiment than words with other parts of speech. Therefore, the number
of different POS tags in a given document represents an effective feature. POS
tags may also be included in other types of features, such as terms and their
frequency, where unigrams are included with their respective POS tag.

• Sentiment words and phrases: As discussed in the previous section, the
presence of positive and negative words is very important for sentiment
classifiers. Extracted from sentiment lexicons, the number of sentiment terms
and phrases in a document represents a very powerful feature.

• Syntactic dependency: The dependency relations between words in a document
may be useful as a feature. With the use of dependency trees it is possible to
find the target of a sentiment expression. However, the creation of these trees
usually has a negative impact on processing time.

• Sentiment shifters: These are expressions that are used to change the sen-
timent orientations, e.g., from positive to negative or vice versa [9, p. 26].

The most important type of sentiment shifters is negation contexts. When a
negation word (e.g. not, no, never) is present in a sentence, the sentiment
score of subsequent words is inverted. For example, the sentence "the iPhone
is not a good phone" has the word good in a negated context, which translates
to a negative sentiment classification of the sentence.

?? illustrates two example tweets with their respective feature vectors. These
features are composed of the bag-of-words model, part-of-speech tags and
sentiment features. Let us explain them one by one:

• bag-of-words: To reduce the sparsity of the vectors, one of the most used
preprocessing steps for extracting features is the removal of stop words, which
in these cases are: you, the, my, me, are. Twitter mentions and URLs are also
removed or replaced with placeholders. The resulting vectors are:

– (1) (@mention,1)(best,1)(<3,1)(#happy,1)
– (2) (bf,1)(hates,1)(:(,1)(depressed,1)

Each tweet is represented on the vector space which is the union of both sets.

• part-of-speech: This feature is composed of the counts of verbs, adjectives,
nouns and adverbs. Other POS tags could be added, but these are the most
relevant for sentiment classification tasks.

• sentiment: The first tweet contains three positive tokens while the second one
has three negative. As a result, the sentiment features are: (1)[3,0] & (2)[0,3].
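
The bag-of-words and sentiment features of the two example tweets can be reproduced with the sketch below. The token lists are taken from the vectors given above; the sets `POSITIVE` and `NEGATIVE` stand in for a real sentiment lexicon and are assumptions of the example, as are the function names. The part-of-speech counts are left out because the tagged tweets are not shown in the text.

```python
from collections import Counter

tweet_1 = ["@mention", "best", "<3", "#happy"]
tweet_2 = ["bf", "hates", ":(", "depressed"]

# Stand-ins for a sentiment lexicon (assumed for this example).
POSITIVE = {"best", "<3", "#happy"}
NEGATIVE = {"hates", ":(", "depressed"}


def build_vocabulary(documents):
    """Vector space: the union of all tokens, in order of first appearance."""
    vocabulary = []
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def feature_vector(tokens, vocabulary):
    """Concatenate term counts with [positive count, negative count]."""
    counts = Counter(tokens)
    bag_of_words = [counts[term] for term in vocabulary]
    sentiment = [sum(t in POSITIVE for t in tokens),
                 sum(t in NEGATIVE for t in tokens)]
    return bag_of_words + sentiment


vocabulary = build_vocabulary([tweet_1, tweet_2])
vector_1 = feature_vector(tweet_1, vocabulary)
vector_2 = feature_vector(tweet_2, vocabulary)
print(vector_1)
print(vector_2)
```

The last two positions of each vector hold the sentiment features [3,0] and [0,3] mentioned above.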

2.2.3 SentiTrack

SentiTrack is a system that performs sentiment analysis of tweets in real time
using Semantic Web technologies to track the stock market behavior of a certain
set of companies. In order to determine the public opinion towards a company,
SentiTrack uses the Twitter API to pull tweets related to that company and
processes them through a sentiment analysis pipeline [12]. An illustration of
SentiTrack’s process workflow is shown in Figure 2.4.

The SentiTrack processing workflow is divided into the following three stages:

1. Context Expansion: In this stage, SentiTrack starts by selecting the set of
companies to analyze. Then, secondary entities related to the chosen companies
are retrieved from DBpedia using SPARQL queries. The secondary entities in
this context are persons and products, e.g. Company: Microsoft, Person: Bill
Gates, Product: Windows.

2. Stream Processing: The set of entities obtained in the previous stage is used
to feed the Twitter Streaming API in order to fetch related public tweets in
real time. Then, the tweets are processed for recognition and annotation of
entities using DBpedia Spotlight. Moreover, the tweets with annotated entities
are analyzed using a lexicon-based entity-centric sentiment classifier. The
classification step intends to provide a positive, negative or neutral result
based on the opinions expressed towards each of the identified entities in the
tweet. The entity-centric sentiment classification approach used by SentiTrack
is very simple and not accurate enough. Therefore, an improved entity-based
sentiment classifier is required for the SentiTrack project.

3. Periodic Sentiment Approximation: This stage analyzes the sentiment data
obtained in the previous steps. It creates a periodic sentiment projection in
real time of the opinions expressed on Twitter towards a target company.
Additionally, a secondary system fetches real-time stock market values for the
target company, which is necessary to evaluate the correlation between the
sentiment data obtained and the variations in stock values.

One of the contributions of this master’s thesis is to integrate a more accurate
sentiment classifier in SentiTrack.

Figure 2.4: SentiTrack Workflow [12]



2.3 Named Entity Recognition

Named Entity Recognition (NER), also known as named entity extraction or entity
identification, is an information extraction task used mostly in natural
language processing. The task consists of extracting structured information
from unstructured text. For instance, social media networks, micro-blogging
sites and e-commerce systems are constantly generating unstructured data that
could be processed into valuable information for interested parties. NER
therefore intends to identify elements of a given input text and classify them
according to a set of categories, e.g. names of persons, locations,
organizations, products, values, etc. [13, p. 1].

The most effective approaches for NER use machine learning (ML) methods to
learn the features of positive and negative examples of named entities (NE)
from a large collection of annotated documents (the training corpus).

There are three main ML techniques to perform entity identification [13, p. 4]:

1. Supervised learning: Some of these techniques are Maximum Entropy, Decision
Trees, Support Vector Machines and Conditional Random Fields. According to
Nadeau [13, p. 4]:

"A baseline Supervised learning method that is often proposed consists of
tagging words of a test corpus when they are annotated as entities in the
training corpus. The performance of the baseline system depends on the
vocabulary transfer, which is the proportion of words, without repetitions,
appearing in both training and testing corpus."

This means that NER is highly dependent on the quality of the training corpus,
which determines how accurate the classifier is.

2. Semi-supervised learning: Semi-supervised learning (SSL) involves a limited
degree of supervision. This technique only requires a set of seed words to
start a learning process. For example, the system starts with a set of seed
words related to a specific topic such as "technology". Then, the system
identifies contextual clues of the given words to classify new unknown terms.
The accuracy this approach can achieve is not as high as that of fully
supervised learning techniques, but it is effective for the identification of
entities related to unpopular topics for which good training corpora are
difficult to find.

3. Unsupervised learning: This is a NER technique that usually depends on
lexical resources (e.g., WordNet) to identify entities in a document. The types
of entities are extracted from the given lexicons. Therefore, the quality and
number of lexical resources will define the success of this technique.
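
The baseline supervised method quoted under item 1 can be sketched as a simple lookup. The function names, the toy corpora and the entity labels below are hypothetical. The quoted definition of vocabulary transfer does not state a denominator, so dividing by the size of the test vocabulary is an assumption of this sketch.

```python
def train_baseline(training_corpus):
    """Remember every word annotated as an entity in the training corpus.

    training_corpus is a list of (word, label) pairs; label is None for
    words that are not entities.
    """
    return {word: label for word, label in training_corpus if label is not None}


def tag(words, known_entities):
    """Tag a test word as an entity only if it was seen as one in training."""
    return [(word, known_entities.get(word)) for word in words]


def vocabulary_transfer(training_words, test_words):
    """Share of distinct test words that also occur in the training corpus."""
    training_vocab, test_vocab = set(training_words), set(test_words)
    return len(training_vocab & test_vocab) / len(test_vocab)


# Toy data, invented for illustration only.
training = [("Bill", "PERSON"), ("Gates", "PERSON"),
            ("founded", None), ("Microsoft", "ORGANIZATION")]
test = ["Microsoft", "hired", "Bill"]

known = train_baseline(training)
print(tag(test, known))
print(vocabulary_transfer([word for word, _ in training], test))
```

A word such as "hired" that never appeared as an entity in training stays untagged, which is why the baseline depends so strongly on the overlap between the two corpora.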
Chapter 3

Related Work

This chapter presents a variety of projects and scientific work related to sentiment
analysis and the usage of entity based approaches for sentiment classification. The
constant publication of opinionated data in social media networks provides an ever-
growing source of valuable information for interested companies. Therefore, Senti-
ment Analysis has become a very popular research field in the scientific community.
Pang and Lee [1] in 2008 presented a survey that covers techniques and approaches
for polarity sentiment classification and subjectivity identification. Moreover, in
2012 Bing Liu [9] published one of the most cited books related to sentiment
analysis, where he provides a very complete survey of the most relevant research
topics related to opinion mining.

Based on the aforementioned literature and additional related works, the next
section discusses a variety of sentiment analysis projects and the different
approaches used in them. Finally, this chapter presents a few research works
associated with entity-based sentiment classification and compares these
techniques with the methods utilized and developed in this master’s thesis.

3.1 Sentiment Analysis

The applications of sentiment analysis are many; some of them include the
classification of forum posts, blogs, news, product reviews and social network
content. Because this thesis project is based on social media analysis,
specifically Twitter data, the related work presented in this section is mainly
focused on document-level sentiment analysis of tweets.


The term sentiment analysis was first introduced by Nasukawa and Yi in 2003
[14]. In their work they extracted sentiments with positive or negative
polarity for specific subjects, using semantic analysis with a syntactic
parsing method. However, research about opinions and sentiments expressed in
text had already appeared in 2001, when Das et al. [15] and Tong [16] published
their work on the analysis of market sentiment.

It was not until 2009 that Bhayani et al. [11] presented the first relevant
research on the usage of Twitter data for sentiment analysis. Bhayani et al.
used a novel approach for automatic polarity classification of tweets, where
messages are classified as either positive or negative. Based on distant
supervision and machine learning algorithms, Bhayani et al. generated a
training corpus by evaluating the presence of positive or negative emoticons
such as ":D" or ":(" in each tweet. With this method, they managed to achieve
an accuracy above 80% for a polarity classification task.

In 2010 Pak and Paroubek [17] built a sentiment classifier that is able to
determine positive, negative and neutral sentiments of English tweets. Using a
distant supervision approach similar to Bhayani et al. [11] for the generation
of a training corpus, they implemented a multinomial Naive Bayes classifier
with POS tags and unigrams as binary features. Their results showed higher
precision when using term presence rather than term frequency. Also, a higher
accuracy was obtained with unigrams than with bigrams or trigrams. On the other
hand, Barbosa and Feng [5] presented in the same year a 2-step sentiment
classification method which first classifies tweets as subjective or objective
(polar or neutral), followed by a polarity classification of the subjective
tweets as positive or negative. Based on support vector machines, this 2-step
approach showed an increased accuracy in comparison with single-step
classifiers.

In 2011 Kouloumpis et al. [3] explored the utility of linguistic features for
the identification of sentiment in Twitter messages. This paper presents an
extended distant supervision approach for the generation of training data. The
methods used consist of the inclusion of Twitter hashtags such as
"#bestfeeling, #epicfail, #news, etc." to enhance the quality of the training
data and to add a third class (neutral) to the classifier. Additionally,
Kouloumpis et al. implemented a 3-step preprocessing method composed of the
following steps: (1) tokenization, (2) normalization, (3) part-of-speech
tagging. According to Kouloumpis et al.’s results, this preprocessing stage
improves the quality of the extracted features. In contrast to Kouloumpis et
al., Paltoglou and Thelwall [18] explored in 2012 an unsupervised lexicon-based
approach that predicts the level of emotional intensity contained in tweets.
According to Paltoglou and Thelwall, this approach may be used for subjectivity
identification and sentiment classification tasks, obtaining results comparable
to state-of-the-art machine learning based methods.

In 2013 Saif et al. [19] developed a state-of-the-art support vector machine
(SVM) classifier which obtained the best results in the SemEval 2013 Twitter
analysis task. SemEval (Semantic Evaluation) is an international competition
for the evaluation of computational semantic analysis systems. Many scientists
and students from all around the world participated in this event, but Saif et
al.’s sentiment classification approach excelled in its category. The
classifier uses a large set of features to train an SVM; n-grams, POS tags,
hashtags, lexicons, emoticons and elongated words are just a part of the full
set. However, the addition of negation context handling was one of the
determining factors in outperforming the other competitors. Finally, the
methods presented in this master’s thesis for document-level feature extraction
are highly influenced by Saif et al.’s work.

3.2 Entity Based Sentiment Analysis

Currently, there are not many research works related to entity-based sentiment
analysis on Twitter. Therefore, this section discusses the most relevant
publications that explore the idea of an entity-centric sentiment classifier.
In 2008, Ding et al. [20] presented a holistic lexicon-based approach to
perform topic-based opinion mining. In their work, they determined the
sentiment orientations (positive, negative or neutral) of opinions expressed on
product features in reviews. Using unsupervised lexicon-based techniques, Ding
et al. defined a set of linguistic rules to extract opinion phrases expressed
towards specific products in a given review text. Although Ding et al.’s
approach achieved high accuracy for topic-based sentiment analysis on product
reviews, this method may not perform as well with noisy text such as tweets.
Khoo et al. [21] implemented in 2010 an aspect-based sentiment classifier which
was capable of extracting both the sentiment orientation and the sentiment
strength of movie reviews. They considered the sentiment expressed towards
different aspects of these movies. These aspects can be seen as sentiment
targets. Using sentence-oriented linguistic clauses, Khoo et al. managed to
obtain highly precise sentiment scores for movie aspects. However, one of the
drawbacks of Khoo et al.’s solution is the absence of a neutral class in their
approach.

The most relevant scientific work related to this master’s thesis was developed
by Jiang et al. [4] in 2011. They focused on target-dependent Twitter sentiment
classification, which means that given an input query, they classify the
sentiment orientation of the tweets as positive, negative or neutral based on
the presence of positive, negative or neutral sentiments towards that query. In
their approach, Jiang et al. implemented a two-step SVM classifier
incorporating target-dependent and target-independent features. Additionally,
they included related tweets (mentions and replies of each tweet) in the
analysis. According to their experimental results, the two-step methodology
greatly improves the accuracy of entity-based sentiment classifiers.
Nevertheless, because of the complexity of the methods and the necessity of
analyzing related tweets for each document, the processing time is compromised
and the approach is not suitable for real-time systems.

To conclude this chapter, it is important to clarify that this master’s thesis
is highly influenced by the aforementioned research works. However, the specific
combination of methods and techniques used in this project is not documented in
any other related publication.
Chapter 4

Approach

This chapter gives a comprehensive description of the development of an
Entity-based Sentiment Classifier for social media analysis, which is the
ultimate result of this thesis. This approach relies on natural language
processing (NLP) methods and machine learning (ML) tools to achieve a highly
accurate sentiment classification.

The following section presents the architecture and process pipeline of the
entity-based sentiment classifier. The subsequent sections provide an in-depth
description of the architecture’s components.

4.1 Architecture

The architecture of the entity-based classifier is represented as a pipeline
of processes, each of which is essential for the correct operation of the
classifier. The processes are the following:

1. Entity identification

2. Tokenization

3. Normalization

4. POS-Tagging

5. Feature Vector Generation

- Document-based features

- Entity-based features

6. Support Vector Machine



Figure 4.1: Processing pipeline and architecture



Figure 4.1 illustrates the aforementioned components and processes. The pipeline
starts with two input elements provided by the user or by the system into which
the classifier is integrated. The required input data are the following:

• Tweet: Tweets are microblogging posts shared on the social media network
Twitter. Tweets have a restriction of 140 characters and may contain URLs,
mentions, hashtags or multimedia content. The classifier proposed in this
thesis should be able to process every word or phrase contained in tweets in
order to yield an accurate sentiment classification. For this task, specialized
natural language processing methods are necessary.
• Target Entity: The target entity refers to a query term (usually a company,
product or person) which the classifier takes as an input in order to find the
opinion expressed towards it. Therefore, the given target entity must be
present in the tweet.

Figure 4.2: Simplified entity-based sentiment classifier workflow

Figure 4.2 shows a simplified workflow of the entity-based sentiment classifier
developed in this thesis. Notice that the input of the classifier is a 2-tuple
(a two-element list) composed of a tweet and a target entity. The output
produced by the classifier is a three-class sentiment classification (positive,
negative or neutral) which represents the opinion expressed in the tweet
towards the given target entity.

Continuing with the explanation of Figure 4.1, the first processing step in the
proposed system is called Entity Identification, which determines the presence
of other entities (besides the target one) in the tweets. Then, the
Tokenization step proceeds to remove unnecessary tokens (terms or words) or
replace them with predefined placeholders. The workflow continues with a
normalization process which is responsible for most of the linguistic
processing of input tweets. After the data has been normalized, a POS tagger
assigns part-of-speech labels to each token and sends them to the Feature
Vector Generator. The Feature Vector Generator proceeds to extract document-
and entity-level feature vectors to finally feed the support vector machine.
The following sections describe how each of these processes works.

4.2 Entity Identification

Entity identification, also known as named-entity recognition (NER), is an
information retrieval task that intends to label tokens of a given text with
pre-defined categories such as companies, persons, locations, etc. The proposed
approach requires the identification of entities in tweets to isolate the
contextual sentiments expressed towards the target entity. Therefore, opinions
expressed towards non-target entities are not considered for sentiment
classification. The entity contextual separation process is explained in the
following sections.

Figure 4.3: DBPedia Spotlight annotation example

The proposed sentiment classification approach uses the DBpedia Spotlight
service to carry out the entity identification task. This service annotates
the DBpedia¹ resources contained in the tweets, generating a list of contained
entities. Figure 4.3 presents an example text annotated by the DBpedia
Spotlight service, which in this case provides metadata about two identified
DBpedia resources: Samsung and LG. Although the service returns information
about entities such as type and DBpedia URI, the proposed classifier only
considers the identification of existing entities in the given text, ignoring
the rest of the information. Further usage of the DBpedia Spotlight service
might be explored in future work and enhancements. Finally, a list of extracted
entities is forwarded to the Tokenizer module, which is explained in the next
section.
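
As a rough sketch of this step, the function below keeps only the surface forms of the annotated resources and discards the remaining metadata. The dictionary keys (`Resources`, `@surfaceForm`, `@URI`) are assumed to follow the JSON output of the Spotlight annotation service and should be checked against its documentation. The sample response is hand-written for illustration and is not the content of Figure 4.3.

```python
def extract_entities(spotlight_response):
    """Return the surface forms of annotated resources, without duplicates.

    Only the presence of an entity is used; type and URI are ignored.
    The key names are assumptions about the service's JSON format.
    """
    seen, entities = set(), []
    for resource in spotlight_response.get("Resources", []):
        surface_form = resource["@surfaceForm"]
        if surface_form.lower() not in seen:
            seen.add(surface_form.lower())
            entities.append(surface_form)
    return entities


# Hand-written example in the assumed response format.
sample_response = {
    "@text": "Samsung and LG announce new phones",
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Samsung",
         "@surfaceForm": "Samsung", "@offset": "0"},
        {"@URI": "http://dbpedia.org/resource/LG_Corporation",
         "@surfaceForm": "LG", "@offset": "12"},
    ],
}

entities = extract_entities(sample_response)
print(entities)
```

A text without any recognized resource yields an empty list, so the Tokenizer would only replace the target entity.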
¹ http://wiki.dbpedia.org

4.3 Tokenization

The Tokenization process is responsible for splitting the input tweet text
(string) into tokens (words or terms) and organizing those tokens into their
respective sentences. The Tokenizer (the module in charge of this process) uses
natural language processing methods and rules such as regular expressions to
trim the text.

Figure 4.4: Tokenization workflow

Figure 4.4 represents the Tokenization process workflow. Starting with the
token segmentation, words contained in tweets are separated from each other
based on the presence of white space. Then, a regular expression checks each of
those tokens for sentence-ending punctuation such as exclamation points,
question marks and full stops. This sentence segmentation step is necessary for
the entity-based feature vector generation process, since the sentiment
relevance of sentences is associated with the presence of entity tokens.
Finally, HTML symbols (e.g. &amp;, &quot;, etc.) are replaced by the characters
they encode (some emoticons are made of these symbols), followed by the
replacement of identified entities with their respective placeholders. The
replacement of entities reduces the sparsity of the vector space model
generated later. Table 4.1 shows an example of a tokenized tweet where tokens
are arranged into sentences and entities are replaced by their respective
placeholders (TargetEntity and OtherEntity).

Table 4.1: Tokenized tweet example

Target Entity     (1) Google
Other Entities    (1) Nexus
Tweet             Thanks google!! Just got my new Nexus &lt;3
Result Tokens     (1) {Thanks, TargetEntity!!}
                  (2) {Just, got, my, new, OtherEntity, <3}
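
The tokenization steps can be sketched with the Python standard library as shown below. This is an illustrative approximation, not the thesis implementation: the function name `tokenize`, the two regular expressions and the restriction to single-word entities are assumptions of the example.

```python
import html
import re

SENTENCE_END = re.compile(r"[.!?]+$")              # token closes a sentence
WORD_WITH_TRAILING = re.compile(r"^(\w+)(\W*)$")   # word plus punctuation


def tokenize(tweet, target_entity, other_entities=()):
    """Split a tweet into sentences of tokens with entity placeholders.

    Only single-word entities are handled in this sketch.
    """
    target = target_entity.lower()
    others = {entity.lower() for entity in other_entities}
    sentences, current = [], []
    for raw in tweet.split():                       # token segmentation
        ends_sentence = bool(SENTENCE_END.search(raw))
        token = html.unescape(raw)                  # e.g. "&lt;3" becomes "<3"
        match = WORD_WITH_TRAILING.match(token)
        if match:
            core, trailing = match.groups()
            if core.lower() == target:
                token = "TargetEntity" + trailing
            elif core.lower() in others:
                token = "OtherEntity" + trailing
        current.append(token)
        if ends_sentence:                           # sentence segmentation
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


result = tokenize("Thanks google!! Just got my new Nexus &lt;3",
                  target_entity="Google", other_entities=["Nexus"])
print(result)
```

Called with the tweet from Table 4.1, the sketch returns the same two sentences of tokens shown in the table.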

4.4 Normalization

Executed by a preprocessor module, the normalization step does most of the
linguistic processing required for the generation of feature vectors.
Normalization of data involves the correction, removal and replacement of
tokens yielded by the tokenization process. While some Twitter features like
@mentions and URLs are weightless in a sentiment context, #hashtags might
actually contain sentiment value which is necessary for the final
classification. The Normalization process is composed of five processing steps,
which are represented in Figure 4.5.

Figure 4.5: Normalization workflow



• Fix Elongation: tokens that contain more than two repeated letters are
fixed, leaving only two of these letters. For example, the word "loooove!"
would be replaced by "loove!". This process contributes to the effectiveness
of the classifier because, although the resulting fixed words might not
necessarily be the correct ones (originals), most lexicons include two-letter
elongated versions of terms.

• Slang / Abbr. Correction: slang words and abbreviations are very common on
microblogging sites such as Twitter. Therefore, to enhance the effectiveness of
the lexicon features, these tokens are substituted by their correct forms. For
example, the slang word "w8" is replaced by "wait" and "u" is replaced by
"you".

• URLs / Mentions Replacement: in some cases URLs or user names (@mentions)
are relevant for opinion extraction. The values themselves are ignored, but
the fact that there are references to URLs and users might be useful. Hence,
URLs and mentions are replaced by the placeholders "someURL" and "someUser"
respectively.

• Negation Tagging: negation words such as "not" or "never" can modify the
sentiment orientation of a sentence. For example, "I love you" is an obviously
positive sentence, but "I do not love you" is considered negative. To deal with
this situation, a negation tag "_NEG" is appended to the tokens located between
a negation word and the end of the sentence. The sentiment of negated tokens
will be shifted in the feature generation stage of this classifier.

• Clean Tokens: As the final step in the normalization process, unnecessary
tokens are removed. Stopwords like "you", "my" or "the" do not represent any
sentiment value and are consequently removed. Also, tokens with no letters
(excluding emoticons) are ignored in further processing.

Table 4.2: Normalization example

Tokens                                 Normalization Result
(1){not, their, best, !}               (1){not, their_NEG, best_NEG, !}
(2){http://t.co, @Muse, #LiveMuse}     (2){someURL, someUser, #LiveMuse}

Table 4.2 presents an example of how the normalization process works; in this
case two sentences are normalized. The first one shows the negation tagging of
the tokens "their" and "best" because they are positioned after the negation
word "not". The second sentence shows an example of token replacements.
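
The first four normalization steps can be sketched as follows. The slang dictionary, the negation word list and all function names are assumptions made for the example, and the Clean Tokens step is left out so that the output can be compared directly with Table 4.2.

```python
import re

SLANG = {"w8": "wait", "u": "you"}           # tiny illustrative dictionary
NEGATION_WORDS = {"not", "no", "never"}      # assumed negation word list
ELONGATION = re.compile(r"([A-Za-z])\1{2,}")  # a letter repeated 3+ times


def fix_elongation(token):
    """Reduce any letter repeated more than twice to two occurrences."""
    return ELONGATION.sub(r"\1\1", token)


def replace_url_or_mention(token):
    """Replace URLs and @mentions with their placeholders."""
    if token.startswith(("http://", "https://", "www.")):
        return "someURL"
    if token.startswith("@") and len(token) > 1:
        return "someUser"
    return token


def normalize(sentence):
    """Normalize one sentence, given as a list of tokens."""
    normalized, negated = [], False
    for token in sentence:
        token = replace_url_or_mention(token)
        if token not in ("someURL", "someUser"):
            token = fix_elongation(token)
            token = SLANG.get(token.lower(), token)
        # Tag every token with letters that follows a negation word.
        if negated and any(ch.isalpha() for ch in token):
            token += "_NEG"
        if token.lower() in NEGATION_WORDS:
            negated = True
        normalized.append(token)
    return normalized


print(normalize(["not", "their", "best", "!"]))
print(normalize(["http://t.co", "@Muse", "#LiveMuse"]))
```

In this sketch URLs are replaced before elongation is fixed, because otherwise a prefix such as "www." would be shortened like an elongated word.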

4.5 POS Tagging

POS tagging consists of labeling normalized tweet tokens with their respective
part-of-speech (POS) values. There are many POS tagging technologies, but only
a few of them are designed to perform Twitter-specialized POS analysis. The
solution proposed in this master’s thesis uses the Twitter ARK POS Tagger,
which is a Java-based part-of-speech tagger for English data that is
tailor-made for Twitter posts. The ARK POS Tagger was developed by a group of
researchers from Carnegie Mellon University², who manually annotated 1,827
tweets with POS tags and developed a specialized POS tagset for tweets. The ARK
POS Tagger reports an accuracy of nearly 90% [22], making it one of the most
effective solutions available.

Figure 4.6 shows the ARK POS tagset with examples for each POS tag. The most
relevant POS tags for sentiment classification are adjectives (tag:A) and nouns
(tag:N), which usually express some degree of sentiment. For this reason,
sentiment lexicons like SentiWordNet and AFINN are mostly composed of nouns and
adjectives. However, hashtags (tag:#) in tweets may also contribute
significantly to the extraction of sentiment expressions. For example, the
hashtag #BeautifulDay clearly has a positive connotation.

Table 4.3: ARK POS Tagging example over normalized tokens

Normalized Tokens                    POS Tagging
(1){not, their_NEG, best_NEG}        {R/not, O/their_NEG, A/best_NEG}
(2){someURL, someUser, #Live}        {someURL, someUser, #/#Live}

Table 4.3 shows how the POS tagging process works. In this example, the
normalized tokens not, their_NEG, best_NEG and #Live are replaced by R/not,
O/their_NEG, A/best_NEG and #/#Live respectively. The POS tag followed by the
symbol "/" is prepended to each token, even if the token is already tagged as
negated (_NEG). Additionally, tokens already identified in previous steps, e.g.
@mentions, URLs and punctuation symbols, are ignored by the POS tagger. This is
the final processing step before starting the feature vector generation. The
following sections explain how these tokens are transformed into vectors, which
are a key component of the proposed entity-based sentiment classifier.
² http://www.cmu.edu/

Figure 4.6: ARK POS Tagset table with examples [22]



4.6 Feature Vector Generation

The feature vector generation process is arguably the most important component
in any sentiment classifier. A Support Vector Machine (SVM) based classifier
like the one presented and implemented for this master’s thesis depends highly
on the quality of the features extracted from the raw data (tweets in this
case). Therefore, in order to achieve a highly accurate classification, the
production of feature vectors must be done with precision. The feature
generation module is responsible for the extraction of numerical values from
the already normalized and tagged tokens. The way these values are generated
depends on the type of features required. Hence, this section explores two
types of features: document-based and entity-based features. The sentiment
classifier developed in this master’s thesis uses both types of feature
extraction, classifying tweets not only on a document level like most sentiment
classifiers but also on an entity level, which is the final goal of this
project.

4.6.1 Document-based features generation

Document-based features are those extracted from document-level data. This means
that every single normalized-tagged token obtained from the input tweets is relevant
and considered for the generation of vectors. As a result, each tweet is represented
as a feature vector made up of the following set of features: binary bag-of-words
(unigrams), POS tags and linguistic features.

4.6.1.1 Binary bag-of-words

The bag-of-words approach is used in natural language processing to represent
training data (documents) as a set of words. This set of words is called the
vector space and is made up of every token existing in the training data. New
documents to be classified are then evaluated against this vector space to
generate a vector representation of each new entry. The n-gram approach is a
common way of categorizing words from documents and validating their presence
in the vector space. In this master’s thesis, unigrams were used to generate a
vector space of the training data (tweets). There are many ways of representing
unigrams in feature vectors, but the binary approach was the method selected
for this project, since it offers simplicity and speed without sacrificing
quality. Binary bag-of-words, also known as boolean term frequency, consists of
representing terms contained in documents as 1s or 0s, where 1 means that the
term is present and 0 that it is absent. Table 4.4 shows an example of what a
binary bag-of-words based on unigrams looks like.

Table 4.4: Binary bag-of-words representation of a tweet

Tweets Binary Bag-of-words


happy birthday friend! :) {1,1,1,1,0,0,0}
always be happy ;) {1,0,0,0,1,1,1}
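The encoding in Table 4.4 can be reproduced with a few lines of JavaScript. This is a minimal sketch, not the thesis implementation: the function names `buildVocabulary` and `toBinaryVector` are illustrative, and the tweets are assumed to be already tokenized and normalized.

```javascript
// Collect every distinct token of the training data, in order of first appearance.
// The resulting array plays the role of the vector space.
function buildVocabulary(tokenizedTweets) {
  const vocabulary = [];
  const seen = new Set();
  for (const tokens of tokenizedTweets) {
    for (const token of tokens) {
      if (!seen.has(token)) {
        seen.add(token);
        vocabulary.push(token);
      }
    }
  }
  return vocabulary;
}

// Binary (boolean term frequency) encoding: 1 if the term is present, 0 otherwise.
function toBinaryVector(tokens, vocabulary) {
  const present = new Set(tokens);
  return vocabulary.map((term) => (present.has(term) ? 1 : 0));
}

// The two tweets of Table 4.4, assumed to be tokenized already.
const trainingTweets = [
  ["happy", "birthday", "friend", ":)"],
  ["always", "be", "happy", ";)"],
];

const vocabulary = buildVocabulary(trainingTweets);
const bowVectors = trainingTweets.map((t) => toBinaryVector(t, vocabulary));
console.log(bowVectors);
```

With the vocabulary ordered by first appearance, the two vectors match the rows of Table 4.4. A token that never occurred in the training data is simply ignored when a new tweet is encoded.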

4.6.1.2 POS Tags

The feature vector generated from part-of-speech (POS) tags consists of the number
of verbs, adverbs, adjectives and nouns contained in a tweet. Following Mohammad et
al. [19], these four POS tags are considered the most relevant for sentiment
classification. Adding more POS tags to the feature generation process might
have a negative impact on the classification accuracy. Table 4.5 shows an example of
POS tag features. The order of the POS tags in the vector is the
following: (1) noun, (2) adjective, (3) adverb, (4) verb.

Table 4.5: POS Tag feature vector example

Tweets POS Tags


happy birthday friend! :) {2,1,0,0}
always be happy ;) {0,1,1,1}
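The counts in Table 4.5 can be sketched as follows. The tag names ("noun", "adjective", and so on) and the `{ token, tag }` token shape are assumptions made for illustration; the POS tagger used in the thesis may emit a different tag set.

```javascript
// Vector order follows the text: noun, adjective, adverb, verb.
const POS_ORDER = ["noun", "adjective", "adverb", "verb"];

// Count how many tokens carry each of the four relevant tags.
// Tokens with any other tag (e.g. emoticons) are ignored.
function posTagVector(taggedTokens) {
  const counts = POS_ORDER.map(() => 0);
  for (const { tag } of taggedTokens) {
    const index = POS_ORDER.indexOf(tag);
    if (index !== -1) {
      counts[index] += 1;
    }
  }
  return counts;
}

// The two tweets of Table 4.5 with hand-assigned (hypothetical) tags.
const taggedTweet1 = [
  { token: "happy", tag: "adjective" },
  { token: "birthday", tag: "noun" },
  { token: "friend", tag: "noun" },
  { token: ":)", tag: "emoticon" },
];
const taggedTweet2 = [
  { token: "always", tag: "adverb" },
  { token: "be", tag: "verb" },
  { token: "happy", tag: "adjective" },
  { token: ";)", tag: "emoticon" },
];

const posVector1 = posTagVector(taggedTweet1);
const posVector2 = posTagVector(taggedTweet2);
console.log(posVector1, posVector2);
```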

4.6.1.3 Linguistic features

Linguistic features are a set of elements extracted from tweets and represented as
counts in the feature vectors. The linguistic features consist of the following
elements:

1. all-caps: the number of words with all characters in upper case.

2. hashtags: the number of hashtags present in the tweet.

3. elongated words: the number of elongated words. e.g. loooove!

4. negation context: the number of negation contexts.

5. punctuation: the number of contiguous sequences of question marks, exclamation
marks, or both. e.g. !!!, ???, !!??
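The counts listed above can be approximated with regular expressions. The sketch below rests on several assumptions that the text does not spell out: an all-caps word needs at least two letters, an elongated word repeats a letter three or more times, a punctuation sequence has length two or more, and each negation cue (from a short hypothetical list) opens one negation context.

```javascript
// Hypothetical negation cue list; the thesis does not enumerate its cues.
const NEGATION_CUE = /^(?:not|no|never|cannot)$|n't$/i;

function linguisticFeatures(tweet) {
  const words = tweet.split(/\s+/).filter((w) => w.length > 0);
  // Strip surrounding punctuation so that "GREAT!!!" still counts as all-caps.
  const bare = words.map((w) => w.replace(/^[^A-Za-z#]+|[^A-Za-z]+$/g, ""));
  return {
    allCaps: bare.filter((w) => /^[A-Z]{2,}$/.test(w)).length,
    hashtags: (tweet.match(/#\w+/g) || []).length,
    elongated: bare.filter((w) => /([A-Za-z])\1{2,}/.test(w)).length,
    negations: bare.filter((w) => NEGATION_CUE.test(w)).length,
    punctuation: (tweet.match(/[!?]{2,}/g) || []).length,
  };
}

// Made-up example tweet.
const linguistic = linguisticFeatures(
  "I do not LOVE this #phone, it is sooooo slow!!! really??"
);
console.log(linguistic);
```

In a full pipeline these five counts would be appended to the bag-of-words and POS tag vectors of the same tweet.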

4.6.2 Entity-based features generation

Entity-based features, unlike document-based ones, are extracted from entity-level data.
Entity-level data can be described as the context information of a target entity. The
idea is to separate the contextual sentiment of every entity in a document
(a tweet in this case) and then generate sentiment features only for the target entity,
ignoring non-target ones. This subsection explains which entity-based features
were used for the development of the proposed entity-based classifier.

Figure 4.7: Entity-based features generation workflow

Figure 4.7 illustrates the processing steps required for the generation of entity-based
features. The process starts with the identification of the target entity context. The
following techniques were used to extract target contexts:

1. Sentence separation: A tweet may contain many entities, but only the target
entity and its context should be considered for entity-level feature generation.
Therefore, each sentence in a given tweet is evaluated for entity presence and
only sentences that fulfil one of the following rules are considered: (1) the
sentence contains the target entity; (2) the sentence contains no entities and is
in the neighborhood of a target entity sentence. Figure 4.8 shows how the relevant
context is identified in a tweet with two sentences: the first is relevant because
it contains the target entity "Nexus 5X", while the second is ignored because it
mentions another entity ("Apple").

Figure 4.8: Sentence separation context identification of tweet



2. "But" clause: The "but" clause context extraction method is similar to the
sentence separation process. Sentences with "but"-like clauses (“with the exception
of”, “except that” and “except for”) are split at the position of these clauses.
Figure 4.9 illustrates this idea: in this example, only the content to the left of
the token "but" is extracted and considered for sentiment evaluation. The remaining
part of the sentence is irrelevant because it contains another (non-target) entity.

Figure 4.9: But clause context extraction of tweet
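The two context extraction techniques can be combined in a short sketch. Several points are assumptions rather than details taken from the thesis: the entities are already known (named entity recognition is not shown), "neighborhood" is read as "directly adjacent segment", the same rules are applied to the segments produced by the "but" split, and the example tweets are made up.

```javascript
// "but" and the "but"-like clauses named in the text.
const BUT_LIKE = /\b(?:but|with the exception of|except that|except for)\b/i;

// Split a tweet into sentences, then split each sentence at "but"-like clauses.
function splitIntoSegments(tweet) {
  const sentences = tweet.match(/[^.!?]+[.!?]*/g) || [];
  return sentences
    .flatMap((sentence) => sentence.split(BUT_LIKE))
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0);
}

function mentions(segment, entity) {
  return segment.toLowerCase().includes(entity.toLowerCase());
}

// Keep (1) segments that contain the target entity and (2) entity-free
// segments directly adjacent to a target segment.
function extractTargetContext(tweet, target, otherEntities) {
  const segments = splitIntoSegments(tweet);
  const hasTarget = segments.map((s) => mentions(s, target));
  const hasOther = segments.map((s) => otherEntities.some((e) => mentions(s, e)));
  return segments.filter((segment, i) => {
    if (hasTarget[i]) return true;
    if (hasOther[i]) return false;
    return Boolean(hasTarget[i - 1] || hasTarget[i + 1]);
  });
}

const sentenceCase = extractTargetContext(
  "I love my Nexus 5X. Apple phones are too expensive.",
  "Nexus 5X",
  ["Apple"]
);
const butCase = extractTargetContext(
  "The Nexus 5X camera is great but the iPhone screen is better",
  "Nexus 5X",
  ["iPhone"]
);
console.log(sentenceCase, butCase);
```

Substring matching is used here only to keep the sketch short; a real pipeline would rely on the entity spans produced by the tagger.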

4.6.2.1 Lexicon Features

After a successful extraction of sentiment contexts, lexicon features can be generated
from those contexts. Lexicon features are the heart of a sentiment classifier and
usually represent the most effective set of features. Many different opinion lexicons
can be used to represent the numerical sentiment values of tweets. The classifier
presented in this master’s thesis combines seven different state-of-the-art sentiment
lexicons. For each of these lexicons, the following set of features is calculated to
generate the sentiment feature vector [19]:

1. No. Tokens: the number of sentiment tokens in the tweet. These tokens are
words with sentiment scores above or below zero in a lexicon.

2. Total Score: total sentiment score calculated in tweet.

3. Max Score: highest sentiment score obtained in the tweet.

4. Last Token: sentiment score of last token.
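For a single lexicon, the four features above can be computed as in the following sketch. It assumes that an unscored lexicon such as Bing Liu maps positive words to +1 and negative words to -1, that "last token" means the last token that carries a sentiment score, and that negated tokens (whose scores are inverted) arrive flagged by an earlier pipeline step. The function name and the token shape are illustrative.

```javascript
// Returns [number of sentiment tokens, total score, max score, last score].
function lexiconFeatures(tokens, lexicon) {
  const scores = [];
  for (const { token, negated } of tokens) {
    const score = lexicon.get(token.toLowerCase());
    if (score === undefined || score === 0) continue; // not a sentiment token
    scores.push(negated ? -score : score); // invert the score inside a negation context
  }
  if (scores.length === 0) return [0, 0, 0, 0];
  const total = scores.reduce((sum, s) => sum + s, 0);
  return [scores.length, total, Math.max(...scores), scores[scores.length - 1]];
}

// Tiny stand-in for an unscored lexicon (+1 positive, -1 negative).
const toyLexicon = new Map([
  ["awesome", 1],
  ["best", 1],
  ["good", 1],
  ["awful", -1],
]);

const contextTokens = ["my", "TargetEntity", "is", "awesome", "best", "day", "ever", ":D"]
  .map((token) => ({ token, negated: false }));

const lexiconVector = lexiconFeatures(contextTokens, toyLexicon);
const negatedVector = lexiconFeatures(
  [{ token: "not", negated: false }, { token: "good", negated: true }],
  toyLexicon
);
console.log(lexiconVector, negatedVector);
```

Running the same function once per lexicon and concatenating the results would give the combined lexicon part of the feature vector.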

When an entity context has negated tokens, the sentiment scores of those tokens are
inverted. The solution presented in this project uses a combination of different
lexicons, and the quality of these lexicons is critical for the classifier. Therefore,
the following state-of-the-art lexical resources were selected for this task:

• MaxDiff [23]: A manually labeled lexicon developed by crowdsourcing and using
the MaxDiff method. The lexicon contains 1,500 positive and negative words, with
negative words scored from -1 to 0 and positive words from 0 to 1.

• Bing Liu [24] [25]: Developed by Bing Liu; all terms of this lexicon are
manually labeled. It contains about 6,790 positive and negative words, which are
not scored.

• AFINN [26]: A manually labeled lexicon composed of 2,477 words. Each word
has a sentiment score in the range of -5 to 5.

• SentiWordNet [27]: Built over the WordNet lexical database, which contains
about 150,000 words. This lexical resource assigns sentiment scores in a range of
-1 to 1 with decimal values. Unlike the Bing Liu and AFINN lexicons, SentiWordNet
scores are generated from manually labeled seed words using semi-supervised
techniques.

• MPQA [28]: Just like SentiWordNet, MPQA is a subjectivity lexicon created
using semi-supervised techniques. This lexical resource has about 6,880 labeled
words with no score intensities, only positive and negative labels.

• NRC Hashtag / Sentiment140 [29]: Developed by Saif Mohammad; both lexicons
were generated fully automatically with distant supervision techniques. The NRC
Hashtag and Sentiment140 lexicons contain 54,129 and 62,468 words respectively.
Both represent the sentiment value of words with real-valued scores from
-∞ (most negative) to ∞ (most positive).

Table 4.6: Lexical resources summary.

Lexicon Score Range No. Words


MaxDiff Twitter Real-values 1,500
AFINN -5 to 5 2,477
BingLiu Pos / Neg 6,785
SentiWordNet -1 to 1 147,292
MPQA Pos / Neg 6,886
NRC Hashtag Real-values 54,129
Sentiment140 Real-values 62,468

4.6.2.2 Emoticon Features

The informal nature of tweets is characterized by the common usage of emoticons
to express sentiment. Therefore, emoticons are also considered for the generation
of feature vectors. Only the emoticons contained in the target-entity context are
used to generate these features. The vector composition is fairly simple: the number
of positive emoticons (e.g. ":)", ":D", ";)") followed by the number of negative
emoticons (e.g. ":(", "D:").

Table 4.7 presents an example of entity-based feature vectors. Notice that
there is only one emoticon in the example and it is positive, hence the
resulting vector is {1,0} (1 positive, 0 negative). Similar rules apply to the BingLiu
lexicon features: there are two positive words in the context and no negative ones.

Table 4.7: Entity-based feature vectors example

Target-entity context tokens                              Feature Vectors

{my, TargetEntity, is, awesome, best, day, ever, :D}      (BingLiu){2, 2, 1, 1}  (Emoji){1,0}
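The emoticon part of Table 4.7 can be sketched as a simple count over the context tokens. The two emoticon sets contain only the examples mentioned in the text; the full lists used by the classifier are not given, so they are placeholders.

```javascript
// Placeholder emoticon sets, limited to the examples named in the text.
const POSITIVE_EMOTICONS = new Set([":)", ":D", ";)"]);
const NEGATIVE_EMOTICONS = new Set([":(", "D:"]);

// Returns [number of positive emoticons, number of negative emoticons]
// found in the target-entity context.
function emoticonVector(contextTokens) {
  let positive = 0;
  let negative = 0;
  for (const token of contextTokens) {
    if (POSITIVE_EMOTICONS.has(token)) positive += 1;
    else if (NEGATIVE_EMOTICONS.has(token)) negative += 1;
  }
  return [positive, negative];
}

const emoticonFeatures = emoticonVector([
  "my", "TargetEntity", "is", "awesome", "best", "day", "ever", ":D",
]);
console.log(emoticonFeatures);
```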

4.7 Support Vector Machine

This component represents the final stage of the proposed sentiment classification
solution. There are many supervised learning models, such as Maximum Entropy, Naive
Bayes and Neural Networks. However, Support Vector Machines (SVMs) have the
potential to handle large feature spaces in a very efficient way [30]. Hence, an SVM
is used to deal with the large set of features generated from tweets and entities;
the generation of these features is explained in Section 4.6.2. Like any other
supervised learning model, SVMs require training data to function. This training data
is represented as numerical feature vectors labeled with their respective class. The
labeling process is usually done manually (by evaluators), but in some cases
distant-supervision methods are used. The job of an SVM is to find a clear separation
between the training vectors of different classes. Based on this learning process,
the SVM is able to classify new document entries (tweets in this case).

The proposed entity-based classifier uses a support vector machine (SVM) NodeJS
module called Node-SVM, which is a port of the C++ SVM library LIBSVM [31].
This library is one of the most popular SVM solutions available for machine learning
based classification and regression. The parameters used to set up Node-SVM are the
following:

• SVM Type: C-Support Vector Classification (C_SVC). This is an n-class
classification (n ≥ 2), which allows multi-class classification (positive,
negative and neutral).

• Kernel: The default linear kernel is used.

• Normalization: Mean normalization is applied during SVM data pre-processing.

• Shrinking: Shrinking heuristics are used.

• Probability: Probability estimates are disabled.
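The settings above can be written down as a plain configuration object, and the normalization step can be illustrated independently of any library. Both parts are hedged sketches: the option names in `svmSettings` are illustrative and should be checked against the Node-SVM documentation, and the formula (x - mean) / (max - min) is one common definition of mean normalization, not necessarily the one Node-SVM applies internally.

```javascript
// Illustrative settings object mirroring the list above.
// The key names are assumptions, not a verified Node-SVM API.
const svmSettings = {
  svmType: "C_SVC",
  kernelType: "LINEAR",
  normalize: true,
  shrinking: true,
  probability: false,
};

// Mean normalization per feature (column): (x - mean) / (max - min).
// Columns with zero range are mapped to 0 to avoid division by zero.
function meanNormalize(vectors) {
  const dimensions = vectors[0].length;
  const stats = [];
  for (let d = 0; d < dimensions; d++) {
    const column = vectors.map((v) => v[d]);
    const mean = column.reduce((sum, x) => sum + x, 0) / column.length;
    const range = Math.max(...column) - Math.min(...column);
    stats.push({ mean, range });
  }
  return vectors.map((v) =>
    v.map((x, d) => (stats[d].range === 0 ? 0 : (x - stats[d].mean) / stats[d].range))
  );
}

// Toy feature vectors: the second feature is constant.
const normalizedVectors = meanNormalize([
  [2, 10],
  [4, 10],
  [6, 10],
]);
console.log(svmSettings, normalizedVectors);
```

Normalizing matters here because the feature vector mixes binary values, small counts and unbounded lexicon scores, which would otherwise have very different scales.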

Figure 4.10 illustrates the linear form of an SVM, which is a hyperplane that divides a
set of positive data points (positive labeled tweets) from a set of negative data points.
The separation distance between these two sets is called the margin. A linear
SVM chooses the hyperplane that maximizes this margin, that is, the distance from the
hyperplane to the closest positive and negative data points.

Figure 4.10: Illustration of a linear Support Vector Machine [32]
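The geometry in Figure 4.10 can be made concrete with a small sketch of a trained linear SVM in the two-class case. The weight vector `w` and bias `b` below are hypothetical values, not parameters learned by the thesis classifier. The sign of w·x + b gives the predicted side of the hyperplane, and for the canonical formulation the margin width is 2 / ||w||.

```javascript
// Decision value of a linear SVM: w . x + b.
function decisionValue(weights, bias, x) {
  return weights.reduce((sum, w, i) => sum + w * x[i], bias);
}

// Positive side of the hyperplane -> "positive", otherwise "negative".
function predictSide(weights, bias, x) {
  return decisionValue(weights, bias, x) >= 0 ? "positive" : "negative";
}

// Width of the margin between the two supporting hyperplanes: 2 / ||w||.
function marginWidth(weights) {
  const norm = Math.sqrt(weights.reduce((sum, w) => sum + w * w, 0));
  return 2 / norm;
}

// Hypothetical parameters for a two-dimensional feature space.
const w = [3, 4];
const b = -5;

const sideA = predictSide(w, b, [3, 4]); // decision value 9 + 16 - 5 = 20
const sideB = predictSide(w, b, [0, 0]); // decision value -5
const width = marginWidth(w);            // 2 / 5
console.log(sideA, sideB, width);
```

For the three classes used in this thesis, LIBSVM-style C_SVC builds several such two-class separators and combines their votes.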

4.8 SentiTrack Integration

For a full description of the SentiTrack system refer to Section 2.2.3. The integration
of the sentiment classifier developed in this master’s thesis with the SentiTrack
system was fairly simple, since both platforms are fully developed with NodeJS
(JavaScript) technologies. To perform the integration, the proposed classifier was
implemented as a NodeJS module using the JS package manager (npm) approach.
Figure 4.11 shows the package organization of the implemented module (source code
files). The name of the module is entity_sentiment, which refers to the capabilities
of the implemented entity-based sentiment classifier. The following line of code is
required to make use of the entity_sentiment module:

var entitysentiment = require("entity_sentiment");

This line allows NodeJS classes to perform classification using the proposed solution.
In Chapter 5, the quality and performance of the developed classifier are tested
and evaluated, and an analysis of the results is presented.

Figure 4.11: Package files organization


Chapter 5

Evaluation and Analysis of Results

This chapter presents the evaluation and results of the entity-based sentiment
classifier developed in this master’s thesis. The chapter is divided into four
sections. The first describes the process of collecting the datasets used to train
the classifier and evaluate the proposed solution. The second covers the quality
evaluation, where the overall success of the implemented sentiment classifier is
tested and analyzed. The third measures the performance of the classifier and
compares it with other similar solutions. The last reports an experiment performed
after integrating the classifier into SentiTrack.

5.1 Data Collection and Processing

Sentiment classifiers that make use of machine learning methods require a corpus
of labeled documents as training data in order to function. Usually the data is
manually labeled by evaluators. However, the level of sentiment classification will
determine which type of training data is necessary. For sentiment analysis of tweets,
these are the most used types of datasets:

• Document-level: Documents are labeled based on the sentiment orientation
expressed in the whole tweet span, without considering the presence of entities
or differentiating between expressions. This is the most common type of
Twitter sentiment corpus, but it is not useful for an entity-based sentiment
classifier.

• Entity-level: Unlike document-level training data, an entity-based labeled
corpus must reflect the sentiment expressed towards a given entity or query.
Table 5.1 shows how the dataset is organized.


Table 5.1: Entity-based Twitter corpus example.

Sentiment  Query/Target  Tweet

Positive   apple         I’m lovin the iPhone update especially the slide down bar at top of screen =) good job @Apple.
Negative   twitter       #Twitter are you freaking kidding me #wth... http://t.co/zKn2bu5R
Neutral    microsoft     Developers: Let Microsoft Market Your App for Windows Phone http://t.co/QZUqhCxx

For the development and evaluation of the presented entity-based sentiment classifier,
a collection of several entity-based labeled corpora was created. This collection is
composed of the following datasets:

• Sanders Analytics¹: This dataset is intended for training and testing sentiment
analysis algorithms. It is composed of 5513 manually classified tweets. A 3-class
classification method was used (positive, negative, neutral) and each tweet
expresses sentiment towards a specific entity.

• STS-Gold² [33]: Developed by Mohammad Saif, this is a dataset where tweets
and targets (entities) are annotated individually. Therefore, only the opinions
expressed towards those entities are relevant for the resulting labels.

• SemEval 2015 / 2016 [34]: SemEval (Semantic Evaluation) is an event held
every year where new semantic analysis systems are evaluated. Teams from
universities and institutions around the globe submit their solutions to a set
of tasks defined by the competition committee. One of these tasks is about
Twitter sentiment analysis; for it, SemEval provides labeled datasets
that allow participants to train and evaluate their sentiment classifiers.

A normalization process was necessary to remove repeated tweets and noisy tokens
from the collection. Additionally, a balance between the three sentiment
classes (positive, negative and neutral) had to be achieved in order to guarantee
effective performance of the support vector machine. Table 5.2 shows a
summary of the collected tweets. 70% of the 4900 labeled tweets are used as training
data, leaving 30% for evaluation and testing purposes.
¹ http://www.sananalytics.com/lab/twitter-sentiment/
² http://tweenator.com/index.php?page_id=13

Table 5.2: Datasets Summary

Dataset No. of Tweets #Negative #Neutral #Positive


Sanders Analytics 5513 654 2503 570
STS-Gold 498 177 139 192
SemEval 2016 2825 862 965 998
SemEval 2015 1105 213 422 470
Normalization 4900 1634 1633 1633
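The balancing and the 70/30 split can be sketched as follows. The thesis does not state how the balance was achieved, so downsampling every class to the size of the smallest one and splitting each class separately (a stratified split) are assumptions; the toy data is made up and the input is assumed to be shuffled already.

```javascript
// Group tweets by label and cut every class down to the size of the smallest one.
function balanceByDownsampling(tweets) {
  const byLabel = new Map();
  for (const tweet of tweets) {
    if (!byLabel.has(tweet.label)) byLabel.set(tweet.label, []);
    byLabel.get(tweet.label).push(tweet);
  }
  const smallest = Math.min(...[...byLabel.values()].map((group) => group.length));
  return new Map([...byLabel].map(([label, group]) => [label, group.slice(0, smallest)]));
}

// Split each class separately so that both parts stay balanced.
function stratifiedSplit(balanced, trainRatio) {
  const trainSet = [];
  const testSet = [];
  for (const group of balanced.values()) {
    const cut = Math.round(group.length * trainRatio);
    trainSet.push(...group.slice(0, cut));
    testSet.push(...group.slice(cut));
  }
  return { trainSet, testSet };
}

// Made-up toy corpus: 5 positive, 3 negative and 4 neutral tweets.
const toyCorpus = [];
for (const [label, count] of [["positive", 5], ["negative", 3], ["neutral", 4]]) {
  for (let i = 0; i < count; i++) {
    toyCorpus.push({ label, text: label + " tweet " + i });
  }
}

const { trainSet, testSet } = stratifiedSplit(balanceByDownsampling(toyCorpus), 0.7);

// Sizes implied by the thesis figures: 70% of 4900 tweets for training.
const thesisTrainSize = Math.round(4900 * 0.7); // 3430
const thesisTestSize = 4900 - thesisTrainSize;  // 1470
console.log(trainSet.length, testSet.length, thesisTrainSize, thesisTestSize);
```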

5.2 Quality Evaluation

In order to measure the quality of the entity-based sentiment classifier developed in
this master’s thesis, standard evaluation metrics were considered. Both correct
and incorrect predictions must be counted to calculate these metrics. Table 5.3
presents the three-class confusion matrix used to compare the predicted classes
against the labels of the test dataset.

Table 5.3: 3-Class Confusion Matrix

Data Class Classified as Pos Classified as Neg Classified as Neu


Positive true positive false negative false neutral
Negative false positive true negative false neutral
Neutral false positive false negative true neutral

Using the confusion matrix illustrated in Table 5.3, several different metrics were
calculated for the evaluation of the classifier. The scoring metrics considered in this
project are the following:

• Precision: the rate of correct predictions of a class over all predictions of
that class (true + false). For the positive class, precision is defined as follows:

\[
\text{precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \tag{5.1}
\]

• Recall: the fraction of relevant tweets that are successfully predicted.
Continuing with the positive class example, recall is defined as:

\[
\text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative} + \text{false neutral}} \tag{5.2}
\]

• Accuracy: an overall representation of correct predictions. For a three-class
classifier, it is defined as follows:

\[
\text{accuracy} = \frac{\text{total correct prediction}}{\text{total correct prediction} + \text{total incorrect prediction}} \tag{5.3}
\]

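• F-Score (F-Measure): defined as the harmonic mean of precision and recall.
This is the resulting formula for the positive class:

\[
F\text{-score} = 2 \cdot \frac{\text{precision}_{\text{positive}} \cdot \text{recall}_{\text{positive}}}{\text{precision}_{\text{positive}} + \text{recall}_{\text{positive}}} \tag{5.4}
\]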
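The four metrics can be computed directly from a confusion matrix laid out like Table 5.3 (rows are the true classes, columns the predicted classes). The counts in the sketch below are hypothetical and only serve to exercise the formulas; they are not the thesis results.

```javascript
// Class order used for rows (true class) and columns (predicted class).
const CLASSES = ["positive", "negative", "neutral"];

// Precision, recall and F-score for the class at index c.
function classMetrics(matrix, c) {
  const truePositives = matrix[c][c];
  const predictedAsC = matrix.reduce((sum, row) => sum + row[c], 0); // column sum
  const actuallyC = matrix[c].reduce((sum, x) => sum + x, 0);        // row sum
  const precision = truePositives / predictedAsC;
  const recall = truePositives / actuallyC;
  const fScore = (2 * precision * recall) / (precision + recall);
  return { precision, recall, fScore };
}

// Accuracy: correct predictions (diagonal) over all predictions.
function accuracy(matrix) {
  let correct = 0;
  let total = 0;
  matrix.forEach((row, i) =>
    row.forEach((count, j) => {
      total += count;
      if (i === j) correct += count;
    })
  );
  return correct / total;
}

// Hypothetical counts for 30 test tweets.
const confusion = [
  [8, 1, 1], // true positive class
  [2, 6, 2], // true negative class
  [0, 3, 7], // true neutral class
];

const positiveMetrics = classMetrics(confusion, CLASSES.indexOf("positive"));
const overallAccuracy = accuracy(confusion);
console.log(positiveMetrics, overallAccuracy);
```

For the positive class in this toy matrix, 8 of the 10 tweets predicted as positive are correct and 8 of the 10 truly positive tweets are found, so precision, recall and F-score all come out near 0.8.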

Based on the 4900 collected tweets (datasets) and the previously described metrics, a
4-fold cross validation test was performed on the proposed entity-based sentiment
classifier. The results are presented in Table 5.4. According to these results, the
proposed classifier achieved an accuracy of 0.635 (about 64%).

Table 5.4: Precision, Recall and F-Score results

Data Class Precision Recall Fscore


Positive 0.594 0.668 0.629
Negative 0.650 0.615 0.632
Neutral 0.669 0.622 0.645
Total: 0.637 0.635 0.635
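The 4-fold protocol and the F-Score column of Table 5.4 can be checked with a short sketch. How the folds were drawn in the thesis (shuffled, stratified or sequential) is not stated, so the round-robin assignment below is an assumption. The second part recomputes the per-class F-Score from the precision and recall values printed in the table.

```javascript
// Assign items to k folds round-robin; each fold serves once as the test set.
function kFoldSplits(items, k) {
  const folds = Array.from({ length: k }, () => []);
  items.forEach((item, i) => folds[i % k].push(item));
  return folds.map((testFold, i) => ({
    test: testFold,
    train: folds.filter((_, j) => j !== i).flat(),
  }));
}

// Indices stand in for the 4900 collected tweets.
const tweetIndices = Array.from({ length: 4900 }, (_, i) => i);
const splits = kFoldSplits(tweetIndices, 4);

// F-score as the harmonic mean of precision and recall.
function fScore(precision, recall) {
  return (2 * precision * recall) / (precision + recall);
}

// Precision and recall per class as printed in Table 5.4.
const tableRows = {
  positive: [0.594, 0.668],
  negative: [0.650, 0.615],
  neutral: [0.669, 0.622],
};

const recomputedFScores = Object.fromEntries(
  Object.entries(tableRows).map(([label, [p, r]]) => [label, fScore(p, r).toFixed(3)])
);
console.log(splits.map((s) => [s.train.length, s.test.length]), recomputedFScores);
```

Each of the four rounds trains on 3675 tweets and tests on 1225, and the recomputed F-Scores agree with the table to three decimals.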

The results in Table 5.4 show that the positive class achieved the lowest F-Score
while the neutral class obtained the highest. The neutral class tends to get better
results since its classification depends on the absence of sentiment expressions.
Polarity classification represents a bigger challenge because of the possible presence
of negation contexts and sarcastic comments.

5.2.1 Features Contribution

As explained in Section 4.6.2, the generation of feature vectors is an essential process
for sentiment classifiers. The contributions made by each of the features
generated for the proposed classifier are shown in Table 5.5 and Figure 5.1.

Table 5.5: Features contribution details

                F-Score   Accuracy   Diff-F    Diff-A

All Features    0.629     0.635
/ - Unigrams    0.619     0.626      -0.01     -0.009
/ - Emoticons   0.618     0.626      -0.01     -0.009
/ - Content     0.536     0.571      -0.093    -0.064
/ - POS Tags    0.541     0.576      -0.088    -0.059
/ - Lexicon     0.362     0.495      -0.267    -0.14
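The Diff columns of Table 5.5 are the scores obtained without a feature group minus the scores obtained with all features. The sketch below recomputes them from the first two columns and ranks the feature groups by their contribution. It reads the "/ -" rows as ablation runs, which is an interpretation of the table layout.

```javascript
// Scores with all features enabled (first row of Table 5.5).
const baselineScores = { fScore: 0.629, accuracy: 0.635 };

// Scores obtained after removing one feature group at a time.
const ablationRuns = [
  { removed: "Unigrams", fScore: 0.619, accuracy: 0.626 },
  { removed: "Emoticons", fScore: 0.618, accuracy: 0.626 },
  { removed: "Content", fScore: 0.536, accuracy: 0.571 },
  { removed: "POS Tags", fScore: 0.541, accuracy: 0.576 },
  { removed: "Lexicon", fScore: 0.362, accuracy: 0.495 },
];

// Difference to the baseline, rounded to three decimals.
const ablationDiffs = ablationRuns.map((run) => ({
  removed: run.removed,
  diffF: Number((run.fScore - baselineScores.fScore).toFixed(3)),
  diffA: Number((run.accuracy - baselineScores.accuracy).toFixed(3)),
}));

// Largest drop first: the feature group whose removal hurts most.
const rankedByImpact = [...ablationDiffs].sort((a, b) => a.diffF - b.diffF);
console.log(rankedByImpact);
```

At three decimals the emoticon row gives a Diff-F of -0.011, which the table rounds to -0.01.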

Figure 5.1: Feature Contributions Graph

These results show how important the lexicon features are to the overall performance
of the classifier: the contribution of the seven state-of-the-art lexical resources is
significantly higher than the contributions of the other features. Content
and POS tag features also contributed considerably, surpassing unigram and
emoticon features by a wide margin. Although unigrams (bag-of-words) are among
the most popular features, the evaluation results show only a small contribution
from this vector. Therefore, it is possible that in an entity-based classification
approach, n-grams do not have the same impact as in document-level sentiment
classifiers.

5.2.2 Quality Comparison

In order to extend the evaluation of the proposed sentiment classifier, a comparison
between this master’s thesis solution and two additional sentiment classification
tools was made. The following tools were considered:

• Former SentiTrack Classifier: SentiTrack, as described in Section 2.2.3,
requires a sentiment classifier to project the public’s opinion about specific
companies on Twitter. The proposed sentiment classifier aims to replace the already
existing classifier, which is referred to as the former SentiTrack classifier. The
former classifier uses unsupervised techniques such as lexicons and linguistic rules
to classify tweets as positive, negative or neutral. It is based on a very
popular NodeJS module called Sentiment³.

• CompendiumJS⁴: a suite of natural language processing tools for the NodeJS
platform. This module provides a lexicon-based sentiment classifier capable of
performing three-class classification.

Figure 5.2: Classifiers Comparison Graph

³ https://www.npmjs.com/package/sentiment
⁴ https://github.com/Ulflander/compendium-js

Figure 5.2 illustrates the results obtained by a 4-fold cross validation test on the
4900 target-labeled tweets (Section 5.1), performed over three sentiment classifiers
including the one developed and proposed in this master’s thesis. The results show a
significant difference in F-Score and Accuracy between the entity-based classifier
and the other two tools. The former SentiTrack classifier’s poor performance is a
consequence of its simplicity: it is based on a single lexical resource, AFINN, with
no consideration of negation contexts or emoticons. The difference between the results
is related to the classification techniques: most state-of-the-art sentiment classifiers
for Twitter use supervised methods capable of analyzing many tweet-specific features,
which is why the presented solution makes use of machine learning methods.

5.3 Performance Evaluation

A performance evaluation measures how long the proposed classifier takes to process
a given number of tweets. This is a very important evaluation given that one of
the objectives of this research project is to develop a sentiment classifier capable of
functioning in real-time processing systems such as SentiTrack. The performance test
is also applied to the former SentiTrack classifier and CompendiumJS in
order to compare the resulting processing times. These are the evaluation environment
specifications:

• Processor: Intel Core i5-2320 CPU @ 3.00GHz
• Memory (RAM): 8 GB
• Operating System: 64-bit Windows 7

The results are shown in Table 5.6. They reflect a large difference in processing
time between the former SentiTrack classifier and the proposed solution. It can be
inferred that the complex pipeline of processes required by the entity-based classifier
and the usage of supervised techniques have an impact on classification time. However,
at about 3.4 ms per tweet, the performance achieved by the proposed solution is good
enough to cope with real-time processing environments like SentiTrack.

Table 5.6: Performance test results

1000 Tweets
Entity-based (ms) 3447.054
Former SentiTrack (ms) 323.310
CompendiumJS (ms) 2357.886
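A measurement of this kind can be sketched with the high-resolution timer of NodeJS. The harness below is an illustration, not the benchmark code used in the thesis: `dummyClassify` is a hypothetical stand-in for a real classifier. The `throughput` helper converts the times of Table 5.6 into tweets per second.

```javascript
// Time how long a classifier needs for a batch of tweets, in milliseconds.
function timeClassifier(classify, tweets) {
  const start = process.hrtime.bigint();
  for (const tweet of tweets) {
    classify(tweet);
  }
  const elapsedNanoseconds = process.hrtime.bigint() - start;
  return Number(elapsedNanoseconds) / 1e6;
}

// Tweets processed per second for a given batch size and elapsed time.
function throughput(tweetCount, elapsedMs) {
  return tweetCount / (elapsedMs / 1000);
}

// Hypothetical stand-in for a real classifier.
function dummyClassify(tweet) {
  return ["positive", "negative", "neutral"][tweet.length % 3];
}

const sampleTweets = Array.from({ length: 1000 }, (_, i) => "sample tweet number " + i);
const measuredMs = timeClassifier(dummyClassify, sampleTweets);

// Throughput implied by the times reported in Table 5.6 (1000 tweets each).
const reportedThroughput = {
  entityBased: Math.round(throughput(1000, 3447.054)),
  formerSentiTrack: Math.round(throughput(1000, 323.310)),
  compendiumJS: Math.round(throughput(1000, 2357.886)),
};
console.log(measuredMs, reportedThroughput);
```

Read this way, the reported times correspond to roughly 290, 3093 and 424 tweets per second respectively.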

5.4 SentiTrack Experiment

After the proposed entity-based sentiment classifier was integrated into SentiTrack,
an experiment was performed. The experiment looks for a correlation between two
variables, social media sentiment and stock market prices, by filtering and
classifying live tweets and recording stock market movements over a week for a set of
6 companies. A correlation test was then run over the collected sentiments and
stock market data for each evaluated company.
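A correlation test between two series can be sketched with the Pearson coefficient. The text does not say which coefficient was used in the experiment, so Pearson is an assumption here, and the two series below are made-up numbers, not the collected sentiment or stock data.

```javascript
// Pearson correlation coefficient between two series of equal length.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  return covariance / Math.sqrt(varianceX * varianceY);
}

// Made-up daily values: average sentiment score and closing price.
const dailySentiment = [1, 2, 3, 4];
const dailyClose = [2, 4, 6, 8];

const perfectPositive = pearson(dailySentiment, dailyClose);
const perfectNegative = pearson([1, 2, 3], [3, 2, 1]);
console.log(perfectPositive, perfectNegative);
```

The coefficient ranges from -1 to 1, and values close to either end indicate a strong linear relationship between the two series.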

The overall results were very positive in comparison to previous SentiTrack
experiments where the former classifier was used: evidence of a moderate correlation
was found for 3 out of 6 companies, with a maximum correlation of 0.84 (84%). These
results indicate a successful integration between the proposed sentiment classifier
and SentiTrack, which will provide a more accurate analysis for future experiments.
Chapter 6

Conclusions and Future Work

This thesis presented the research, evaluation and solution for an entity-based
sentiment classifier for social media analysis. The implemented system is able to
perform sentiment classification of tweets based on the presence of entities and the
opinion expressions targeting them.

This chapter concludes this master’s thesis project with a summary of the
achievements made, the limitations encountered and possible future extensions and
enhancements.

6.1 Achievements

• The main goal of this thesis is the study of an entity-based sentiment
classification approach for the analysis of social media data. The research presents
the facts, reasons and evaluation results that led the project to the
most suitable methods for the required solution.

• A successful approach was produced for the required solution. The presented
approach is able to extract opinion expressions aimed at relevant entities in
tweets.

• The solution achieved satisfactory results in terms of Accuracy and F-Score,
surpassing the evaluated alternatives for sentiment classification by a significant
margin.

• In terms of processing time, the developed classifier is capable of working in
real-time processing systems.


• The presented solution was successfully integrated into SentiTrack, providing an
improved sentiment analysis experience.

• The clean organization and structure of the developed system allow interested
researchers to improve and modify the provided tools in the future.

6.2 Limitations

• Memory issues limited the development of the classifier to the use of unigrams
with the bag-of-words method. The usage of higher-order n-grams as features
might have contributed to better final results.

• Some NodeJS modules used by the developed classifier may produce incompatibility
errors in specific versions of Windows OS. This issue is related to the usage
of C++ libraries and other native resources.

• Although entity-targeted opinion expressions were extracted from tweets,
there are many cases where sentences cannot be clearly separated,
leading to incorrect classifications.

• The presented solution is unable to identify the presence of advertisements and
sarcasm in tweets. Therefore, sentiment analysis of specific products may
yield inconsistent results.

6.3 Future Work

• Expanding the sentiment classifier with a dependency parser capable of
performing in real-time systems would considerably improve the accuracy of
the solution.

• Improvements in accuracy can be made by using a more accurate named entity
recognition module and higher-order n-grams as features.

• Performance of the system could be enhanced by exploring the usage of different
POS tagging solutions and automatic tokenization libraries.

• The solution could be extended to work with other social networks such as
Facebook, LinkedIn and Google Plus.
Appendix A

Appendix

A.1 Glossary

Blog        Abbreviation of weblog.
Emoticon    Short for emotion icon, e.g. ":D", ":)", ":-)".
Hashtag     Topic labels used in Twitter posts.
Microblog   Sites where users express their ideas in the form of small units of text, e.g. Twitter.
Tweet       Short message or post.
SO          Sentiment orientation.
NLP         Natural language processing.
Lexicon     Dictionary or lexical resource.


A.2 Technical Specifications

The source code of the project can be downloaded from the following link which
points to the Git repository of the EIS group, University of Bonn.

Link to thesis repository:

https://github.com/EIS-Bonn/Theses/tree/master/2015/Cristobal_Leiva

To run the application, the system has to fulfill the following requirements and the
user needs to follow the instructions given below.

Requirements:

• Node.js + NPM

• Latest MongoDB

• Bower (get by running "npm install -g bower")

• Gulp (get by running "npm install -g gulp")

How to Install:

1. Install the required NodeJS modules: npm install

2. Configure Twitter API keys: open config.sample.js, fill in the required keys
under the twitter app config and save it as config.js

3. To start NodeJS server: start mongo then run the command "gulp"

4. Run the web browser: http://localhost:8082/resa



A.3 SentiTrack GUI

Figure A.1: SentiTrack front-end, bubble cloud

Figure A.2: SentiTrack front-end, live sentiment data


Bibliography

[1] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations
and trends in information retrieval, 2(1-2):1–135, 2008.

[2] Statista. number-of-worldwide-social-network-users.
http://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/,
2016. [Online; accessed 21-March-2016].

[3] Efthymios Kouloumpis, Theresa Wilson, and Johanna D Moore. Twitter senti-
ment analysis: The good the bad and the omg! Icwsm, 11:538–541, 2011.

[4] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-
dependent twitter sentiment classification. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1, pages 151–160. Association for Computational Linguis-
tics, 2011.

[5] Luciano Barbosa and Junlan Feng. Robust sentiment detection on twitter from
biased and noisy data. In Proceedings of the 23rd International Conference
on Computational Linguistics: Posters, pages 36–44. Association for Computa-
tional Linguistics, 2010.

[6] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment
classification using machine learning techniques. In Proceedings of the ACL-
02 conference on Empirical methods in natural language processing-Volume 10,
pages 79–86. Association for Computational Linguistics, 2002.

[7] Kenny Bastani. twitter-graph. http://neo4j.com/blog/oscon-twitter-graph/,
2014. [Online; accessed 21-March-2016].

[8] Statista. Leading social networks worldwide as of January 2016, ranked
by number of active users.
http://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/,
2016. [Online; accessed 21-March-2016].

[9] Bing Liu. Sentiment analysis and opinion mining. Synthesis lectures on human
language technologies, 5(1):1–167, 2012.

[10] Princeton University. WordNet. https://wordnet.princeton.edu/, 2016.
[Lexical database of English].

[11] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N Project Report, Stanford, 1:12, 2009.

[12] Priyanka Dank, Simon Scerri, and Ali Khalili. Linked data-based social media
analysis for stock market tracking.

[13] David Nadeau and Satoshi Sekine. A survey of named entity recognition and
classification. Lingvisticae Investigationes, 30(1):3–26, 2007.

[14] Tetsuya Nasukawa and Jeonghee Yi. Sentiment analysis: Capturing favorabil-
ity using natural language processing. In Proceedings of the 2nd international
conference on Knowledge capture, pages 70–77. ACM, 2003.

[15] Sanjiv Das and Mike Chen. Yahoo! for amazon: Extracting market sentiment
from stock message boards. In Proceedings of the Asia Pacific finance asso-
ciation annual conference (APFA), volume 35, page 43. Bangkok, Thailand,
2001.

[16] Richard M Tong. An operational system for detecting and tracking opinions
in on-line discussion. In Working Notes of the ACM SIGIR 2001 Workshop on
Operational Text Classification, volume 1, page 6, 2001.

[17] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis
and opinion mining. In LREc, volume 10, pages 1320–1326, 2010.

[18] Georgios Paltoglou and Mike Thelwall. Twitter, myspace, digg: Unsupervised
sentiment analysis in social media. ACM Transactions on Intelligent Systems
and Technology (TIST), 3(4):66, 2012.

[19] Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. NRC-Canada:
Building the state-of-the-art in sentiment analysis of tweets. In Proceedings
of the seventh international workshop on Semantic Evaluation Exercises
(SemEval-2013), Atlanta, Georgia, USA, June 2013.

[20] Xiaowen Ding, Bing Liu, and Philip S Yu. A holistic lexicon-based approach to
opinion mining. In Proceedings of the 2008 International Conference on Web
Search and Data Mining, pages 231–240. ACM, 2008.

[21] Tun Thura Thet, Jin-Cheon Na, and Christopher SG Khoo. Aspect-based sen-
timent analysis of movie reviews on discussion boards. Journal of Information
Science, 36(6):823–848, 2010.

[22] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel
Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan,
and Noah A Smith. Part-of-speech tagging for Twitter: Annotation, features,
and experiments. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies: short papers-
Volume 2, pages 42–47. Association for Computational Linguistics, 2011.

[23] Svetlana Kiritchenko, Xiaodan Zhu, and Saif M Mohammad. Sentiment analysis
of short informal texts. Journal of Artificial Intelligence Research, 50:723–762,
2014.

[24] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In
Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 168–177. ACM, 2004.

[25] Bing Liu, Minqing Hu, and Junsheng Cheng. Opinion observer: analyzing
and comparing opinions on the web. In Proceedings of the 14th international
conference on World Wide Web, pages 342–351. ACM, 2005.

[26] Finn Årup Nielsen. A new ANEW: Evaluation of a word list for sentiment analysis
in microblogs. arXiv preprint arXiv:1103.2903, 2011.

[27] Andrea Esuli and Fabrizio Sebastiani. SentiWordNet: A publicly available lexical
resource for opinion mining. In Proceedings of LREC, volume 6, pages 417–422.
Citeseer, 2006.

[28] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual
polarity in phrase-level sentiment analysis. In Proceedings of the conference on
human language technology and empirical methods in natural language process-
ing, pages 347–354. Association for Computational Linguistics, 2005.

[29] Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. Sentiment analysis
of short informal texts. Journal of Artificial Intelligence Research, 50:723–762, 2014.

[30] Thorsten Joachims. Text categorization with support vector machines: Learning
with many relevant features. Springer, 1998.

[31] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector
machines. ACM Transactions on Intelligent Systems and Technology (TIST),
2(3):27, 2011.

[32] John Platt et al. Sequential minimal optimization: A fast algorithm for training
support vector machines. 1998.

[33] Hassan Saif, Miriam Fernandez, Yulan He, and Harith Alani. Evaluation
datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
2013.

[34] Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M Mohammad, Alan
Ritter, and Veselin Stoyanov. SemEval-2015 task 10: Sentiment analysis in
Twitter. Proceedings of SemEval-2015, 2015.
