Many sentiment analysis applications rely on opinion lexicons, which are linguistic resources that associate words with sentiment values. Twitter-specific sentiment applications must deal with a language that includes many informal expressions not observed in traditional media, e.g., acronyms, misspelled words, and abbreviations. The diversity and sparseness of this language make the manual creation of a Twitter-oriented opinion lexicon a time-consuming task. In this project, we present different supervised models for automatically discovering Twitter opinion words from a corpus of tweets.
The first version of this model was published at IJCAI'15. An extended version of that model was later published in the Knowledge-Based Systems Journal in 2016.
The proposed method combines information from automatically annotated tweets and existing hand-made opinion lexicons to expand an opinion lexicon in a supervised fashion. The expanded lexicon contains part-of-speech (POS) disambiguated entries with a probability distribution for positive, negative, and neutral polarity classes, similarly to SentiWordNet.
To obtain this distribution using machine learning, we propose word-level attributes based on (a) the morphological information conveyed by POS tags and (b) associations between words and the sentiment expressed in the tweets that contain them. We consider tweets with both hard and soft sentiment labels. The sentiment associations are modelled in two different ways: using pointwise mutual information semantic orientation (PMI-SO) [Turney, 2002], and using stochastic gradient descent semantic orientation (SGD-SO), which learns a linear relationship between words and sentiment. The training dataset is labelled by a seed lexicon formed by combining multiple hand-annotated lexicons.
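As an illustration of the PMI-SO association, the following minimal sketch scores each word as PMI(word, positive) minus PMI(word, negative), which simplifies to a log-ratio of tweet counts. It is not the project's implementation: the function name `pmi_so`, the `"pos"`/`"neg"` label strings, the additive smoothing constant, and the restriction to hard labels are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_so(tweets, smoothing=1.0):
    """Compute a PMI-SO score for every word in a hard-labelled corpus.

    tweets: iterable of (tokens, label) pairs, label in {"pos", "neg"}.
    smoothing: hypothetical additive constant that avoids log(0) for
               words seen in only one class.
    """
    word_label = Counter()   # number of tweets of a class containing a word
    label_count = Counter()  # number of tweets per class
    vocab = set()
    for tokens, label in tweets:
        label_count[label] += 1
        for word in set(tokens):  # count a word once per tweet
            word_label[(word, label)] += 1
            vocab.add(word)

    if label_count["pos"] == 0 or label_count["neg"] == 0:
        raise ValueError("both classes must appear in the corpus")

    scores = {}
    for word in vocab:
        pos = word_label[(word, "pos")] + smoothing
        neg = word_label[(word, "neg")] + smoothing
        # PMI(w, pos) - PMI(w, neg)
        #   = log2( count(w, pos) * count(neg) / (count(w, neg) * count(pos)) )
        scores[word] = math.log2(
            (pos * label_count["neg"]) / (neg * label_count["pos"])
        )
    return scores
```

A word that occurs mostly in positive tweets receives a positive score, a word that occurs mostly in negative tweets a negative one, and a word spread evenly across both classes a score near zero.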
To avoid the high cost of manually annotating tweets with polarity classes when calculating the word-level sentiment associations, we rely on two heuristics for automatically obtaining polarity-annotated tweets: emoticon-based annotation and model transfer. In the first approach, only tweets with positive or negative emoticons are considered, and each is labelled according to the polarity indicated by its emoticon. In the second approach, we train a probabilistic message-level classifier on a corpus of emoticon-annotated tweets and use it to label a target corpus of unlabelled tweets with a probability distribution over positive and negative sentiment.
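The emoticon-based heuristic can be sketched as follows. The emoticon sets, the whitespace tokenisation, and the decision to discard tweets that contain emoticons of both polarities are assumptions made for this example, not details taken from the papers.

```python
# Hypothetical emoticon sets; the lists used in the project may differ.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", "=("}


def emoticon_label(tweet):
    """Return "positive", "negative", or None when no clear label exists."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all or conflicting emoticons:
        # the tweet is left out of the annotated corpus (an assumption).
        return None
    return "positive" if has_pos else "negative"


def annotate(tweets):
    """Keep only the tweets that receive an emoticon-based label."""
    labelled = []
    for tweet in tweets:
        label = emoticon_label(tweet)
        if label is not None:
            labelled.append((tweet, label))
    return labelled
```

The resulting pairs could then serve either directly as polarity-annotated tweets or as training data for the message-level classifier used in the model transfer approach.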
This model was published as a short paper at SIGIR'15. It is a word-level classification model for automatically generating a Twitter-specific opinion lexicon from a corpus of unlabelled tweets. Each tweet in the corpus is represented by two vectors: a bag-of-words vector and a semantic vector based on word clusters trained with the Brown clustering algorithm. We propose a distributional representation for words that treats each word as the centroid of the tweet vectors in which it appears. The lexicon is generated by training a word-level classifier, using these centroids to form the instance space and a seed lexicon to label the training instances.
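The centroid representation can be illustrated with a small sketch. It covers only the bag-of-words vector and stores vectors as sparse dictionaries; the Brown-cluster vector would presumably be averaged in the same way. The function names and the pre-tokenised input are assumptions made for this example.

```python
from collections import Counter, defaultdict


def bag_of_words(tokens):
    """Sparse bag-of-words vector of a tweet: token -> frequency."""
    return {tok: float(freq) for tok, freq in Counter(tokens).items()}


def word_centroids(tweets):
    """Represent each word as the mean of the tweet vectors it appears in.

    tweets: list of token lists.
    Returns a dict mapping word -> sparse centroid (feature -> mean value).
    """
    sums = defaultdict(lambda: defaultdict(float))  # word -> summed vector
    counts = defaultdict(int)                       # word -> number of tweets
    for tokens in tweets:
        vec = bag_of_words(tokens)
        for word in set(tokens):  # a tweet contributes once per word
            counts[word] += 1
            for feature, value in vec.items():
                sums[word][feature] += value
    return {
        word: {f: v / counts[word] for f, v in features.items()}
        for word, features in sums.items()
    }
```

Words that appear in the seed lexicon would then provide labelled training instances, and the trained classifier could be applied to the centroids of the remaining words.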
The seed lexicon, used in both models to label words with positive, negative, and neutral sentiment classes, was created by taking the union of the following manually created lexical resources.
We discarded every word for which a polarity clash between two or more lexicons was observed. A sample of these words is shown below:
This model was published as a short paper at the Web Intelligence (WI) 2016 conference. We expand the NRC word-emotion association lexicon for the language used in Twitter using multi-label classification of words. We compare different word-level features extracted from unlabelled tweets such as unigrams, Brown clusters, POS tags, and word2vec embeddings. The results show that the expanded lexicon achieves major improvements over the original lexicon when classifying tweets into emotional categories. In contrast to previous work, our methodology does not depend on tweets annotated with emotional hashtags, thus enabling the identification of emotional words from any domain-specific collection using unlabelled tweets.
Expanded lexicons as tab-separated files.
Training and testing words in ARFF format for MEKA.
Word2Vec word vectors as a tab-separated file.
Contact: fjb11 at students.waikato.ac.nz