Text Mining Project Report
Contents
1 Introduction
2 Tweet preprocessing
2.1 Removing Links/URLs
2.2 Removing usernames
2.3 Removing Hashtags
2.4 Repeated characters
2.5 Removing stop-words
2.6 Conclusion
3 Dataset
3.1 sentiment140 dataset
3.2 SemEval17 dataset
4 Sentiment-Specific Word Embedding
5 Word Embedding Visualization
6 Experiments and Results
7 Future work
8 Conclusion
Abstract
Most tasks in natural language processing look at individual words and try to extract information from factors such as their distribution in the documents, their syntactic contexts and their frequency weighting, but they ignore the sentiment carried by the continuous form of these words.
In this report we present a method that learns word embeddings for Twitter sentiment classification.
Word embeddings for social media sentiment analysis and classification have gained increasing importance, as they give better results than the other related existing methods.
Keywords: sentiment analysis; tweet sentiment; sentiment-specific word embedding; text classification
1 Introduction
Recently, social web networks have experienced huge growth in the number of their users, who share their opinions and express their feelings daily. These data are extremely valuable, and a great deal of information can be extracted from them.
Twitter sentiment classification is one of the popular and attractive tasks that is widely used to analyze feedback, reviews and opinions towards products, persons, events, etc. This kind of analysis is very important for decision making, especially in today's connected world.
The existing approaches for Twitter analysis are generally based on training a machine learning model and using it to predict the polarity of new tweets. The features used to train the classifier are typically extracted with traditional NLP methods such as bag of words and TF-IDF. These traditional features do not yield a robust model and fail to generalize to unseen data, so researchers have tried to find other feature representations to improve the learning performance. In this context, word embedding has emerged as a promising alternative and has become the most widely used technique for constructing vector representations of words. Word2Vec and C&W are two word embedding models that map words to vector representations integrating the context and the syntactic relationships between words, but they ignore the sentiment polarity of words.
Sentiment-Specific Word Embedding (SSWE) is an embedding algorithm that encodes sentiment information in the vector representation. In this project, we explore this technique to learn continuous representations of tweet words and use them to train a Twitter sentiment classifier. Our approach consists of training an SSWE model, using it to extract features from tweets, and then training a Twitter sentiment classifier on these features.
2 Tweet preprocessing
When analyzing tweets posted on Twitter, one can remark that the posts often present repetitive patterns, namely hashtags before words, retweet symbols, handles, a remarkable use of punctuation and emoticons, as well as grammatically correct words mixed with misspelled ones. Other patterns such as repeated characters are also very common, and abbreviations are widely used in tweets. All of this naturally makes the information extraction process more complex, and for that reason we followed the preprocessing steps described in the following subsections.
2.1 Removing Links/URLs
URLs and links are perhaps the most widely used patterns and appear to be the least useful for our task. Embedding links and URLs as vectors would clearly be wrong, as these web addresses have no impact on the sentiment of the tweet. Hence, we removed these patterns using the following regular expression and the re module, which provides regular expression operations:
import re

def removeUrls(tweet):
    # Replace every www or http(s) link with the placeholder token 'URL'
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
2.2 Removing usernames
User mentions are used to refer to a person by name. Mentions begin with an at sign "@" followed by the username. Usernames are a unique reference to a person's account and add no information about the sentiment of the tweet, which is why they are removed.
def removeUsername(tweet):
    # Lowercase the tweet and replace every @mention with a placeholder
    tweet = tweet.lower()
    return re.sub(r'@[^\s]+', 'AT_USER', tweet)
2.3 Removing Hashtags
Users put the hashtag symbol (#) before a relevant keyword or phrase in their tweets to categorize it or give it priority over the rest. Keeping the hashtags may mislead the embedding learning, which would treat these "hashtagged" words as new words rather than their original form. Therefore, we decided to remove all hashtag symbols and keep the words that follow them, again using the re module:
def removeHashtags(tweet):
    # Strip the leading '#' and keep the word itself
    return re.sub(r'#([^\s]+)', r'\1', tweet)
2.4 Repeated characters
People often use repeated characters to express their emotional state; it is very common to find tweets like "I'm happyyyyyy" or "We won, omggggg!". If not preprocessed, these elongated forms may mislead the embedder, which would treat them and the original word as distinct tokens. We therefore reduce such sequences of repeated characters, as sketched below.
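A minimal sketch of this normalization step, in the style of the other preprocessing helpers; collapsing any character repeated three or more times down to two is an assumption, since the exact reduction rule of our preprocess module is not shown here:

def reduceRepeatedChars(tweet):
    # Collapse any character repeated 3+ times to two occurrences,
    # e.g. "happyyyyyy" -> "happyy" (assumed rule; the actual scheme
    # in the preprocess module may differ).
    return re.sub(r'(.)\1{2,}', r'\1\1', tweet)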
2.5 Removing stop-words
Stop words are words that do not carry enough significance to be used in search queries or in text analysis. They are usually filtered out of the corpus because they add a vast amount of unnecessary information. Every language has its set of stop words; for instance, words like "This", "here" and "I" belong to the stop-word list of the English language. Eliminating these words from the tweets is beneficial for the following reasons [6]:
• The learning phase becomes much faster, since many features are eliminated from the tweets.
• Prediction can also be more accurate, since we are eliminating noise and distracting features.
It is also important to mention that the stop-word list is not unique for a given language: one can find several different stop-word lists for English, and what matters is to know which one to use for our application. For instance, we found lists containing words like "good" or "bad"; we judged that such lists would not be suitable, as we may need the embeddings of these words for the sentiment classification step. This has led us to choose the stop-word list carefully in order to improve the predictive power of our classifier, as illustrated in the sketch below.
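A possible implementation of this filtering step, assuming NLTK's English stop-word list; the whitelist of sentiment-bearing words to keep and the whitespace tokenization are illustrative assumptions:

from nltk.corpus import stopwords   # requires: nltk.download('stopwords')

# Words kept even if a stop-word list contains them, because their
# embeddings may carry sentiment (hypothetical whitelist).
KEEP = {"good", "bad", "not", "no"}
STOPWORDS = set(stopwords.words("english")) - KEEP

def removeStopwords(tweet):
    # Drop stop-words token by token; whitespace tokenization is a
    # simplification of what the preprocess module may actually do.
    return " ".join(w for w in tweet.split() if w.lower() not in STOPWORDS)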
2.6 Conclusion
In this chapter, we have described the preprocessing applied to our dataset. To summarize, we performed three subtasks: the first consists of removing URLs/links, usernames and hashtags; the second removes repeated letters in a word; and the last one removes stop-words. We have also filtered out some other patterns, for instance datetimes, days, months, years and similar tokens. More detailed information about the preprocessing can be found in the module we implemented for this step, namely preprocess.
3 Dataset
This section describes the datasets used during this project. Our project is composed of two main parts: the first consists in learning a sentiment-specific word embedding (SSWE) model and the second consists in training a classifier to predict the sentiment expressed by tweets. Therefore, two different datasets were used to achieve these objectives.
3.1 sentiment140 dataset
The sentiment140 dataset is a corpus of tweets automatically annotated with their polarity. Each row of the dataset has the following fields:

Column   Value
0        the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1        the id of the tweet (e.g. 2087)
2        the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
3        the query (e.g. lyx); if there is no query, this value is NO_QUERY
4        the user that tweeted (e.g. robotickilldozr)
5        the text of the tweet (e.g. Lyx is cool)

Since the SSWE model uses the polarity of the tweets to construct the word representations, this dataset provides us with the necessary data and information to construct these vector representations.
deepnl also needs a large training set annotated with polarity; in our case the training set is a large number of tweets. deepnl defines the format of the training input: the tweets training file should follow the format of the SemEval 2013 "Sentiment Analysis in Twitter" task, i.e. one tweet per line in the form
<SID><tab><UID><tab><polarity><tab><TWITTER_MESSAGE>
where the polarity values can be positive, negative, neutral or objective.
Footnotes:
1. http://help.sentiment140.com/for-students/
2. https://github.com/attardi/deepnl
Before using the sentiment140 dataset, we cleaned it because some data rows had a distorted format, and we also removed tweets that are too short (fewer than 3 words). After cleaning, we obtained a new dataset with 1102286 tweets. We then removed the unnecessary fields from the sentiment140 dataset and kept only the tweet text and its polarity. As a result, the training dataset has the new format
<TWITTER_MESSAGE><tab><polarity>
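A possible sketch of this cleaning and conversion step using pandas; the file names are illustrative, the column order follows the sentiment140 format described above, and the 3-word threshold follows the description here:

import pandas as pd

# sentiment140 CSV columns: polarity, id, date, query, user, text
cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv("sentiment140.csv", names=cols, encoding="latin-1")

# Drop rows whose text is too short (fewer than 3 words)
df = df[df["text"].str.split().str.len() >= 3]

# Keep only the tweet text and its polarity, in the
# <TWITTER_MESSAGE><tab><polarity> format expected downstream
df[["text", "polarity"]].to_csv("sentiment140_clean.tsv",
                                sep="\t", index=False, header=False)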
Table 2 describes the sentiment140 dataset after cleaning:

Property                     Value
Number of tweets             1102286
Number of positive tweets    534590
Number of negative tweets    568102
Start date                   Mon Apr 06 22:19:45 PDT 2009
End date                     Tue Jun 16 08:40:49 PDT 2009
3.2 SemEval17 dataset
For the SemEval17 dataset, we set the polarity "objective or neutral" to "neutral". The SemEval training data contains 18051 tweets, with the polarity distribution described in Table 4. The SemEval test data contains 11987 tweets, with the polarity distribution described in Table 5.
4 Sentiment-Specific Word Embedding
Vector space models have been used in distributional semantics since the 1990s. Many models have been proposed for estimating continuous word representations, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) being two such examples. The term word embeddings was originally coined by Bengio et al. in 2003 [1], who trained them in a neural network language model together with the model's parameters.
However, Collobert and Weston were the first to prove the effectiveness of pre-trained word embeddings in their paper [2], in which they established word embeddings as a highly effective tool for representing words in vector spaces for later use in classification or other tasks; they also described a neural network architecture that most of today's embedding approaches are built upon. It was Mikolov et al. (2013) who really brought a big contribution to this area through the invention of word2vec, a powerful two-layer neural network model trained to reconstruct the linguistic contexts of words in a very precise and efficient way.
Lately, the SSWE model, which stands for sentiment-specific word embedding and which encodes sentiment information in the continuous representation of words, has made a breakthrough in vector space models. In this report we first give an overview of the classical model architectures C&W and word2vec; later on, a more detailed overview will be given for the SSWE model and its variants SSWEh and SSWEr.
4.1 Classical technique of Word Embedding
Collobert and Weston (C&W) proved that word embeddings trained on a sufficiently large dataset carry syntactic and semantic meaning. Their solution consists of training a neural network that gives a higher score f_\theta to a correct word window than to an incorrect one. They use a ranking criterion of the following form:
\[ J_\theta = \sum_{x \in X} \sum_{w \in V} \max\big(0,\ 1 - f_\theta(x) + f_\theta(x^{(w)})\big) \]
Here x is a sample window containing n words and X is the set of all possible windows of the corpus. For each window x, a corrupted, incorrect version x^{(w)} is produced by replacing the centre word of x with another word w from the vocabulary V. The objective of the ranking criterion is to separate the correct and the incorrect window by a margin of 1 [2].
While this model eliminates the computational complexity of the softmax layer, it keeps an intermediate fully-connected hidden layer, which itself constitutes another source of complexity. C&W needed seven weeks to train their model with a vocabulary of |V| = 130000.
Word2vec, proposed by Mikolov et al. in 2013 [5], is the most popular word embedding model. Technically, word2vec is not considered to be part of deep learning, because it does not have multiple non-linear layers and cannot learn feature hierarchies. However, word2vec comes with training strategies that take additional context into account. These strategies improve both the speed and the accuracy of the prediction and remove the computational complexity of the previous word embedding models.
In the following, we look further at those strategies.
CBOW (continuous bag-of-words): the surrounding words are used to predict the target word, with the objective function
\[ J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}) \]
This objective function is used to compute the gradients with respect to the unknown parameters, which are updated at each iteration via Stochastic Gradient Descent.
Skip-gram: instead of using the surrounding words to predict the target word, skip-gram uses the centre word to predict the surrounding words, as shown in Figure 3. The objective function is
\[ J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n,\ j \ne 0} \log p(w_{t+j} \mid w_t) \]
where the conditional probability is given by the softmax function
\[ p(w_{t+j} \mid w_t) = \frac{\exp(h^{\top} v'_{w_{t+j}})}{\sum_{w_i \in V} \exp(h^{\top} v'_{w_i})} \]
Since the skip-gram model does not have a hidden layer producing an intermediate vector, h is simply the word embedding v_{w_t} of the input word w_t.
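As a small illustration of this softmax, here is a toy numpy sketch; the matrices, sizes and the random initialization are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                      # vocabulary size, embedding size
v_in = rng.normal(size=(V, d))       # input word embeddings v_w
v_out = rng.normal(size=(V, d))      # output (context) embeddings v'_w

def skipgram_probs(t):
    # h is simply the input embedding of the centre word w_t
    h = v_in[t]
    scores = v_out @ h               # h^T v'_{w_i} for every w_i in V
    scores -= scores.max()           # numerical stability
    p = np.exp(scores)
    return p / p.sum()               # p(w_{t+j} | w_t) over the vocabulary

p = skipgram_probs(42)
print(p.shape, p.sum())              # (1000,) 1.0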
Figure 3: Skip-gram

4.2 SSWE model

4.2.1 SSWEh
The first approach, SSWEh, has the same neural network architecture as the C&W model with an additional softmax layer, as shown in Figure 4(b). The idea behind this model is to predict the sentiment distribution of the tweet based on each input ngram, using the continuous vector of the top layer of the neural network; as a result, it yields a feature vector for the specific input. In practice, a window slides through the input sentence and the sentiment polarity is predicted for each ngram with a shared neural network.
The softmax layer is used to map the neural network outputs into conditional probabilities. SSWEh assumes that the positive distribution has the form [1,0] and the negative distribution the form [0,1], so it is trained to predict [1,0] for positive ngrams and [0,1] for negative ngrams, which are overly strict constraints. The letter h in SSWEh refers to these hard constraints. The cross-entropy error of the softmax layer is:
\[ loss_h(t) = -\sum_{k=\{0,1\}} f_k^{g}(t) \log\big(f_k^{h}(t)\big) \]  (1)
where f^{g}(t) is the gold sentiment distribution and f^{h}(t) is the predicted sentiment distribution.
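A toy Python version of this cross-entropy, using the gold and predicted distributions defined above (the numeric values are made up):

import numpy as np

def loss_h(f_gold, f_pred, eps=1e-12):
    # Cross-entropy between the gold distribution [1,0] or [0,1]
    # and the softmax output of the network
    f_gold, f_pred = np.asarray(f_gold), np.asarray(f_pred)
    return -np.sum(f_gold * np.log(f_pred + eps))

print(loss_h([1, 0], [0.9, 0.1]))   # small loss: confident and correct
print(loss_h([1, 0], [0.2, 0.8]))   # large loss: confident and wrong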
4.2.2 SSWEr
The second approach, SSWEr, relaxes the hard constraints of SSWEh: instead of predicting an exact probability distribution, it only requires that the predicted positive score of a positive ngram be higher than its predicted negative score, and vice versa for negative ngrams. This is expressed by the following hinge ranking loss:
\[ loss_r(t) = \max\big(0,\ 1 - \delta_s(t) f_0^{r}(t) + \delta_s(t) f_1^{r}(t)\big) \]
where f_0^{r} is the predicted positive score, f_1^{r} is the predicted negative score and \delta_s(t) is an indicator function reflecting the sentiment polarity of a sentence:
\[ \delta_s(t) = \begin{cases} 1 & \text{if } f^{g}(t) = [1, 0] \\ -1 & \text{if } f^{g}(t) = [0, 1] \end{cases} \]
4.2.3 SSWEu
Neither SSWEh nor SSWEr generates the corrupted ngram: they learn sentiment-specific word embeddings by integrating the sentiment polarity of sentences, but they ignore the syntactic contexts of words. SSWEu is a unified model that captures both the sentiment information of sentences and the syntactic contexts of words; it is described in Figure 4(c).
SSWEu takes the original (or corrupted) ngram and the sentiment polarity of a sentence as input and predicts a two-dimensional vector for each input ngram. It defines two scalars (f_0^{u}, f_1^{u}) that represent, respectively, the language model score and the sentiment score of the input ngram. The training of the SSWEu model aims to achieve the following goals:
• The original ngram should obtain a higher language model score f_0^{u}(t) than the corrupted ngram f_0^{u}(t^r).
• The sentiment score of the original ngram f_1^{u}(t) should be more consistent with the gold polarity annotation of the sentence than that of the corrupted ngram f_1^{u}(t^r).
The overall objective is a linear combination of two losses:
\[ loss_u(t, t^r) = \alpha \cdot loss_{cw}(t, t^r) + (1 - \alpha) \cdot loss_{us}(t, t^r) \]
where loss_{cw}(t, t^r) is the syntactic (hinge) loss of the C&W model and loss_{us}(t, t^r) is the sentiment loss:
\[ loss_{us}(t, t^r) = \max\big(0,\ 1 - \delta_s(t) f_1^{u}(t) + \delta_s(t) f_1^{u}(t^r)\big) \]
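A small Python sketch of these two losses and their combination; the value of alpha and the scores passed in are illustrative:

def loss_cw(f0_t, f0_tr):
    # C&W syntactic hinge loss: the original ngram should score
    # at least 1 higher than the corrupted one
    return max(0.0, 1.0 - f0_t + f0_tr)

def loss_s(f1_t, f1_tr, delta_s):
    # Sentiment hinge loss: the sentiment score of the original ngram
    # should agree with the gold polarity more than the corrupted one
    return max(0.0, 1.0 - delta_s * f1_t + delta_s * f1_tr)

def loss_u(f_t, f_tr, delta_s, alpha=0.5):
    # Linear combination of the syntactic and sentiment losses
    return alpha * loss_cw(f_t[0], f_tr[0]) + (1 - alpha) * loss_s(f_t[1], f_tr[1], delta_s)

print(loss_u(f_t=[1.2, 0.8], f_tr=[0.3, -0.4], delta_s=1))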
5 Word Embedding Visualization
In this section, we describe the approach we have chosen to demonstrate the semantic relationships between the word representations obtained as described in the previous part of the report.
Many features and metrics have been used in the literature to determine the semantic quality of different word embedding algorithms. The choice of such metrics appears to be as crucial as the choice of the embedding algorithm itself, as these metrics evaluate how good a semantic structure has been constructed between words and how well this structure has been learned.
Our approach to studying the semantic quality of the sentiment-specific word embedding is divided in two parts: a Global Visualization, in which we try to understand the overall distribution of our vector representations and explain the structure created over the whole set of vectors, and a Specific Visualization, in which we deepen the analysis of the word embedding by investigating the relationships between individual word representations in order to track the similarities between semantically close words, for example synonyms.
5.1 Global visualization:
In order to visualize the set of word embeddings as a whole, we first have to transform it so that it becomes visually clear and more intuitive, allowing us to infer the general structure and distribution of the data. One problem is how to visualize the high-dimensional word representations, as the chosen size of the embedding vectors extends far beyond 2-dimensional or 3-dimensional spaces. Such a high-dimensional dataset can be very difficult, or even impossible, to visualize given the complexity of the problem and the available resources. To aid visualization of the structure of the word embedding, the dimensionality must be reduced in some way.
Hence, we have opted to map our representations from the original embedding size down to the 2D space. Because a simple random projection of the data, or even PCA, is likely to lose the more interesting structure within the data, we have chosen to map our data using t-distributed Stochastic Neighbor Embedding (t-SNE), which is a very effective way to preserve the neighborhood structure of the data, i.e. the clustering in the high-dimensional space is preserved in the low-dimensional space (the 2D space in our case).
t-SNE converts the similarities between data points into joint probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the low-dimensional space are represented by Student t-distributions, the sampling distribution of the t statistic given by:
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]  (5)
where \bar{x} is the sample mean, \mu is the population mean, s is the standard deviation of the sample and n is the sample size. This allows t-SNE to be particularly sensitive to local structure, to reduce the tendency to crowd points together at the center and, more importantly, to reveal data that lie in multiple, different manifolds or clusters.
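A minimal sketch of this dimensionality reduction with scikit-learn, assuming the embedding vectors have been loaded into a numpy array with purely numeric rows (file name and t-SNE parameters are illustrative):

import numpy as np
from sklearn.manifold import TSNE

# vectors: one row per vocabulary word, e.g. shape (|V|, 50)
vectors = np.loadtxt("vectors.txt")

# Map the high-dimensional embeddings to 2D while preserving
# the neighborhood structure of the original space
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(vectors)
print(points_2d.shape)   # (|V|, 2)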
To add more interactivity for the user, we added some features to the visualization, such as displaying the corresponding word when the user hovers over a data point. This is done by associating the vectors with their corresponding words: our visualizer inherits from the KeyedVectors class of the gensim.models.keyedvectors module, and we store the words and vectors in a KeyedVectors instance.
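A sketch of how the vocabulary and vector files can be wrapped in a KeyedVectors instance, assuming gensim 4.x and that the two files are aligned line by line (file names illustrative):

import numpy as np
from gensim.models import KeyedVectors

# Read the two files produced by the embedding step
# (assumed aligned: i-th word <-> i-th vector)
with open("vocabulary.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f]
vectors = np.loadtxt("vectors.txt")

kv = KeyedVectors(vector_size=vectors.shape[1])
kv.add_vectors(words, vectors)      # store words and vectors together (gensim 4.x API)

# The instance can now answer similarity queries and back the interactive plot
print(kv.most_similar("good", topn=5))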
Figure 5: Zoom-in and click over two word representations to show the associated vocabulary words
5.2 Specific visualization:
This section describes the approach we followed to analyze the relationships between individual word representations. The main goal of any word embedding scheme is to encode semantic similarity in the word representations. To verify this property on our embedding, different similarity measures were studied, and the ones best suited to our problem were chosen to analyze the behavior of the embedding algorithm when comparing similar words (similar in the sense of synonymy).
Much as in the case of the Word2Vec embedding, and because the SSWE embedding also uses the context of words during learning, along with the polarity of the sentences in which they appear, similar words tend to be close to each other in terms of the geometric distance between them. For that reason, it seemed necessary to measure the distance-based similarity between different words of our embedded vocabulary.
The cosine similarity is a widely used similarity measure between two non-zero vectors of an inner product space; it measures the cosine of the angle between the two vectors. The similarity values can be normalized to lie between 0 and 1. Values close to 0 (the two vectors being geometrically perpendicular when 0 is reached) correspond to completely uncorrelated vectors, which in the word embedding setting means that the two words share no semantic relationship. Conversely, the more positively correlated the two input vectors are, the larger the cosine similarity value.
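For reference, a minimal numpy sketch of the cosine similarity used in these comparisons:

import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    # 0 for perpendicular (unrelated) vectors, close to 1 for vectors
    # pointing in the same direction (semantically similar words)
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))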
When testing on our word representations, we were able to verify that the semantics of the words were successfully learned. In the following figures, we show some normalized cosine similarity results for a few chosen emotion words:
6 Experiments and Results
In this section we present the implementation details of the word embedding algorithm and of the classification model; afterwards, we present and discuss the results we obtained.
We train the sentiment-specific embedding model on the sentiment140 dataset presented in the Dataset section; indeed, SSWE needs a huge training set in order to build a large-scale training corpus. We used the deepnl library, which implements the SSWE neural network architecture, and in particular its dl-sentiwords.py script, which creates sentiment-specific word embeddings. It relies on the dl-conv.py script, a convolutional neural network with the same layers as those specified by the SSWE model. We integrated them into our project as an Embedding module. Before using the sentiment140 dataset, we cleaned and pre-processed it as described in the preprocessing section, and we also removed tweets that are too short (fewer than 5 words). The training parameters are described in Table 6.
The outputs of this first model are two files of the same size:
• Vocabulary file: a text file containing the vocabulary extracted from the dataset.
• Vector file: a text file containing the embedding vectors of the words in the vocabulary file.
6.3 Visualization results:
Figure 6: word representation distribution after being mapped from a vector space
of size 50 to a 2D-space using t-SNE
Figure 7: word representation distribution after being mapped from a vector space
of size 200 to a 2D-space using t-SNE
We note that the general distribution of the word representations has a particular oval shape for an embedding size of 200, whereas it is coarser for the representation of words in a vector space of dimension 50. The latter representation shows a higher density of vectors along the x-axis of the manifold, but also presents some distinctive parts at its extremities. Given that the feature vectors were initialized before learning with pseudo-random values drawn from a normal distribution, the fact that the vector distribution no longer follows a normal distribution (e.g. for an embedding size of 50) shows that the model has indeed learned something beyond its initialization. On the other hand, studying only the shape of the distribution is not sufficient to assess our embedding, so we also needed to study the relationships between different words in the embedding space. For that reason, as described in the specific visualization part above, we computed cosine similarities for some chosen keywords; the figures below list the most similar words to some of these keywords.
Figure 8: The most similar words using the cosine similarity of the word: good in
a vector space of dimension 50
Figure 9: The most similar words using the cosine similarity of the word: happy in
a vector space of dimension 50
Figure 10: The most similar words using the cosine similarity of the word: joy in a
vector space of dimension 50
Figure 11: The most similar words using the cosine similarity of the word: joy in a
vector space of dimension 200
Figure 12: The most similar words using the cosine similarity of the word: joy in a
vector space of dimension 200
Figure 13: The most similar words using the cosine similarity of the word: joy in a
vector space of dimension 200
We can note very clearly that the semantic learning of the embedding becomes much better when the size of the embedding is properly chosen. After trying different embedding sizes ranging from 30 to 250 (including 100, 150, 180, 200 and 230), we can deduce that an embedding size of 50 is optimal for obtaining the best embedding results in terms of the semantic understanding of the embedded words.
In the annex pages, we have also visualized the most similar words to other keywords using the cosine similarity measure, but also using the MCO similarity (multiplicative combination objective similarity).
6.4 Twitter sentiment classification
For both models (the SVM and the neural network classifiers), the feature vector of a tweet is constructed by concatenating the word embeddings of its tokens, where each token's embedding is multiplied by its term frequency-inverse document frequency (tf-idf) weight, defined as the product of the two functions tf_{i,j} and idf_{i}:
\[ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \]  (7)
\[ idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \]  (8)
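A simplified sketch of this feature construction, assuming a fixed maximum tweet length for padding and a precomputed tf-idf weight per token (both assumptions, since the report does not spell out how variable tweet lengths are handled):

import numpy as np

EMB_DIM, MAX_TOKENS = 50, 30        # illustrative sizes

def tweet_features(tokens, embeddings, tfidf):
    # embeddings: dict token -> np.ndarray of shape (EMB_DIM,)
    # tfidf:      dict token -> tf-idf weight of the token for this tweet
    vecs = [tfidf.get(t, 0.0) * embeddings[t]
            for t in tokens[:MAX_TOKENS] if t in embeddings]
    feat = np.concatenate(vecs) if vecs else np.zeros(0)
    # Pad with zeros so every tweet yields a vector of the same length
    out = np.zeros(EMB_DIM * MAX_TOKENS)
    out[:feat.size] = feat
    return out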
6.5 Results
For the training of the SSWEu embedding algorithm, we tried different parameter settings, namely:
• The number of epochs of the convolutional neural network of the SSWE model (n.b.: not the number of epochs of the classification neural network that we later use to classify the tweets).
• The size of the embedding vector, and hence the dimensionality of the embedding space (as described in detail in part 8.1).
When using the neural network classifier, we trained the model with different numbers of epochs in order to determine which value yields the best accuracy. The numbers of epochs and the corresponding accuracies of the resulting models are listed in the following table:
Table 7: Accuracy results for different numbers of epochs of the classification neural network
To evaluate our machine learning models (SVM and neural network), using the optimal parameter settings of each model (e.g. the number of epochs for the neural network), we used the following metrics:
• The F1-score
• The Accuracy
In the following table (Table 8), we report the obtained results:
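For completeness, a minimal sketch of how these two metrics can be computed with scikit-learn, given gold and predicted labels (the label arrays shown are placeholders):

from sklearn.metrics import accuracy_score, f1_score

y_true = ["positive", "negative", "neutral", "positive"]   # placeholder gold labels
y_pred = ["positive", "negative", "positive", "positive"]  # placeholder predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))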
7 Future work
First, we would like to further explore other methods for representing the feature vector of a tweet (besides concatenation and summation) and see their impact on the classification results. Also, how to learn SSWE effectively at the document level remains an interesting direction for future work.
8 Conclusion
References
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A
neural probabilistic language model. Journal of machine learning research,
3(Feb):1137–1155, 2003.
[2] Ronan Collobert and Jason Weston. A unified architecture for natural language
processing: Deep neural networks with multitask learning. In Proceedings of the
25th international conference on Machine learning, pages 160–167. ACM, 2008.
[3] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N Project Report, Stanford, 1(2009):12, 2009.
[4] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity
with lessons learned from word embeddings. Transactions of the Association for
Computational Linguistics, 3:211–225, 2015.
[5] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[6] Hassan Saif, Miriam Fernández, Yulan He, and Harith Alani. On stopwords,
filtering and data sparsity for sentiment analysis of twitter. 2014.
[7] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning
sentiment-specific word embedding for twitter sentiment classification. In ACL
(1), pages 1555–1565, 2014.