Twitter Sentiment Analysis

Introduction

Millions of people worldwide use social networks daily to express themselves on a variety of issues. With the advent of Web 2.0, platforms for expressing opinion, such as blogs, Facebook and Twitter, have become available to anyone with an internet connection. With so many people expressing themselves online, the question naturally arose whether socio-economic events can be predicted using information gleaned from the Web. Could public sentiment extracted from the web offer a cheaper alternative to election polls and similar surveys, which involve massive effort?

O'Connor et al. [1], in a recent paper, linked public sentiment extracted from Twitter with polling data on consumer confidence and elections. They came up with a simple technique to score positive and negative sentiment on the issue of interest (jobs, economic health, favoured presidential candidate, etc.) from tweets. Although they found the day-to-day sentiment variation to be quite volatile, scores smoothed over a window of the past k days correlated well with well-known consumer confidence polls (Gallup Daily, Michigan ICS). A correlation of about 80% with the Gallup polls was found for a smoothing window of 30 days. Even though the results were noisy, the authors describe them as very encouraging for such preliminary research.
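To make the smoothing step concrete, here is a minimal sketch of a k-day trailing average followed by a Pearson correlation against a poll series. It is not the code from [1]; the function names are hypothetical, and it assumes one sentiment score and one poll value per day, with both lists aligned by date.

```python
from math import sqrt

def moving_average(scores, k):
    """Trailing mean: each entry averages the current day and the k-1 days before it."""
    return [sum(scores[i - k + 1:i + 1]) / k for i in range(k - 1, len(scores))]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def smoothed_correlation(daily_scores, poll_values, k):
    """Correlate k-day smoothed sentiment with the poll value on the same day."""
    smoothed = moving_average(daily_scores, k)
    # The first smoothed value belongs to day k-1, so skip the poll's first k-1 days.
    return pearson(smoothed, poll_values[k - 1:])
```

Calling `smoothed_correlation(scores, poll, 30)` would correspond to the 30-day window mentioned above. Both series need some variation, otherwise the denominator is zero.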

For several reasons, Twitter looks like the ideal tool for gathering public sentiment from the web. With its 140-character limit and huge community, it is better suited to text extraction than blogs. With 7 million or so tweets posted each day, it is potentially a huge data mine. "...each tweet may be regarded as microscopic instantiations of mood. It follows that the collection of all tweets published over a given period can unveil changes in the state of public mood at a large scale." [4]

Bollen et al. [2] came up with an expanded lexicon called the Google Profile of Mood States (GPOMS). It builds upon the well-known standard Profile of Mood States (see POMS), combined with n-gram frequency data from Google that was built by analysing a terabyte of web data (see T. Brants, A. Franz, Google Web 1T 5-gram Version 1). Unlike other tools, which usually map text sentiment to two dimensions corresponding to positive and negative, their lexicon let them map data along six dimensions: Calm, Alert, Sure, Vital, Kind and Happy. This is a definite improvement in capturing the rich array of human emotions.

They went on to test the hypothesis that public sentiment extracted from Twitter can be used to predict stock market movements. Mood timeline scores were combined with stock movement over the previous k days and fed to a Self-Organising Fuzzy Neural Network to predict whether the Dow Jones would go up or down. The best accuracy, 86.7%, was achieved using timeline scores of the emotion 'Calm'. The study has been in the spotlight since, offering a hitherto unexplored insight into the power of Twitter sentiment analysis.
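As a rough illustration of how such a prediction problem might be framed, the sketch below builds training examples from the previous k days of index values and mood scores. It covers the feature construction only; the fuzzy neural network itself is not implemented here. The function name and the exact feature layout are assumptions, not details taken from [2].

```python
def lagged_examples(index_close, mood_scores, k):
    """Build (features, label) pairs from two aligned daily series.

    features: the previous k closing values followed by the previous k mood scores.
    label:    1 if the index closed higher on day t than on day t-1, else 0.
    """
    examples = []
    for t in range(k, len(index_close)):
        features = list(index_close[t - k:t]) + list(mood_scores[t - k:t])
        label = 1 if index_close[t] > index_close[t - 1] else 0
        examples.append((features, label))
    return examples
```

Any binary classifier could then be trained on these pairs to predict the up/down label.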

Initial Idea

Our aim is to come up with our own version of GPOMS. The Twitter API provides a convenient way to collect millions of tweets from around the world. To filter out noise and retain only those tweets that might indicate sentiment, we could follow a strategy inspired by students at Stanford [3] and Bollen et al. [2], along these lines:

* Tweets containing the expressions "i feel", "i am feeling", "i'm feeling", "i dont feel", "I'm", "Im", "I am", "makes me"
* Emoticons
* Words with lengthened substrings, e.g. hahahahaha, yaaaaaaay
* Capitalised words
* Tweets containing URLs etc. would be removed.
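The filter above could be sketched with regular expressions as follows. This is only an illustration under stated assumptions: the emoticon pattern covers a handful of common Western emoticons, "capitalised words" is taken to mean words of two or more upper-case letters, and the name `keep_tweet` is hypothetical.

```python
import re

# Self-referential expressions from the list above (matched case-insensitively).
FEELING_RE = re.compile(
    r"\b(?:i feel|i am feeling|i'm feeling|i dont feel|i'm|im|i am|makes me)\b",
    re.IGNORECASE,
)
# A few common emoticons such as :) ;-( =D :P and <3 (assumed subset).
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp/\\|\]\[]|<3")
# A one- or two-letter unit repeated at least three times in a row (hahaha, yaaay).
LENGTHENED_RE = re.compile(r"([a-z]{1,2})\1{2,}", re.IGNORECASE)
# Words written entirely in capitals; the two-letter minimum is an assumption.
CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

def keep_tweet(text):
    """Return True if the tweet has no URL and shows at least one sentiment cue."""
    if URL_RE.search(text):
        return False
    cues = (FEELING_RE, EMOTICON_RE, LENGTHENED_RE, CAPS_RE)
    return any(pattern.search(text) for pattern in cues)
```

The URL check runs first so that the "://" inside a link is never mistaken for an emoticon.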

Building GPOMS, however, is the main challenge in extracting public mood from Twitter. As a first step in building an expanded lexicon that can be applied to tweet terms, we can add 4-grams and 5-grams containing co-occurrences of words from the 65-word POMS lexicon.
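One possible way to derive co-occurrence weights from n-gram counts is sketched below. The weighting scheme (each term's co-occurrence counts normalised to sum to 1, and each seed term mapped to itself with weight 1) is our assumption, not a scheme specified in [2]. The input is assumed to be an iterable of (tokens, count) pairs, and the name `build_expanded_lexicon` is hypothetical.

```python
from collections import defaultdict

def build_expanded_lexicon(ngram_counts, poms_terms):
    """Map each term to {POMS term: weight} based on n-gram co-occurrence.

    ngram_counts: iterable of (tuple_of_tokens, frequency) pairs.
    poms_terms:   set of seed terms from the POMS lexicon.
    """
    cooc = defaultdict(lambda: defaultdict(int))
    for tokens, count in ngram_counts:
        unique = set(tokens)
        seeds = unique & poms_terms
        # Every non-seed token in the n-gram co-occurs with every seed in it.
        for seed in seeds:
            for token in unique - poms_terms:
                cooc[token][seed] += count

    # Seed terms map to themselves with full weight (an assumption).
    lexicon = {seed: {seed: 1.0} for seed in poms_terms}
    for token, by_seed in cooc.items():
        total = sum(by_seed.values())
        lexicon[token] = {seed: c / total for seed, c in by_seed.items()}
    return lexicon
```

In practice, stop words such as "and" or "the" would need to be dropped before or after this step, since they co-occur with almost every seed term.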

Once the GPOMS lexicon is built, the task is simplified. Any tweet term matching a GPOMS n-gram can be mapped to its original POMS terms using co-occurrence weights and then to its respective dimension via the POMS scoring table. "The score of each POMS mood dimension is thus determined as the weighted sum of the co-occurence weights of each tweet term that matched the GPOMS lexicon" [2].

Image source: J. Bollen et al [2]
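The scoring step described above can be sketched as a weighted sum over matched terms. The lexicon layout (term mapped to {POMS term: weight}) and the toy POMS-to-dimension table in the usage note are illustrative assumptions, not the actual POMS scoring table.

```python
from collections import defaultdict

def score_tweet(tokens, lexicon, poms_to_dimension):
    """Sum co-occurrence weights per mood dimension for one tokenised tweet.

    lexicon:           {term: {poms_term: weight}} (expanded lexicon).
    poms_to_dimension: {poms_term: dimension name} (scoring table).
    """
    scores = defaultdict(float)
    for token in tokens:
        # Terms that are not in the lexicon contribute nothing.
        for poms_term, weight in lexicon.get(token, {}).items():
            scores[poms_to_dimension[poms_term]] += weight
    return dict(scores)
```

For example, with a toy lexicon `{"chill": {"relaxed": 0.5, "cheerful": 0.5}, "joyful": {"cheerful": 1.0}}` and a toy table `{"relaxed": "Calm", "cheerful": "Happy"}`, the tokens `["so", "chill", "and", "joyful"]` contribute 0.5 to Calm and 1.5 to Happy. Averaging these per-tweet scores over all tweets of a day would give one point on the mood timeline.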

If time allows, we shall test whether the sentiment analysis agrees with some major events. A simple example would be checking whether happiness levels increase around festive events such as Diwali, Thanksgiving or Christmas, depending on the period when the tweets were gathered. Happy scores should also align with positive emotion scores obtained from tools such as OpinionFinder (see MPQA).

References

[1] B. O'Connor, R. Balasubramanyan, B.R. Routledge, N.A. Smith, From tweets to polls: linking text sentiment to public opinion time series, in: Proceedings of the International AAAI Conference on Weblogs and Social Media, Washington, DC, May 2010, http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1536
[2] J. Bollen, H. Mao, X.-J. Zeng, Twitter mood predicts the stock market, Journal of Computational Science 2(1):1–8, 2010
[3] D. Debbini, P. Estin, M. Goutagny, Modeling the Stock Market using Twitter Sentiment Analysis, http://cs229.stanford.edu/proj2011/DebbiniEstinGoutagny-ModelingTheStockMarketUsingTwitterSentimentAnalysis.pdf
[4] J. Bollen, A. Pepe, H. Mao, Modeling public mood and emotion: Twitter sentiment and socioeconomic phenomena, arXiv:0911.1583 [cs.CY], Nov 2009