Inside SentiMap.us: Part 1

A Pythonic How-To on Twitter Sentiment

Posted by TJ Torres on March 23, 2015

Introduction

This is the first in a series of planned posts describing the building of my web app project SentiMap.us. The basic premise behind SentiMap is to plot localized trends in mood via realtime sentiment analysis of a geotagged Twitter stream. This post will focus mainly on the backend sentiment analysis portion of the project and take you through streaming tweets, feature extraction, and classification. However, if you get impatient waiting for future posts, the full code is available on GitHub for you to peruse at your leisure.

Without getting too far into the details just yet, the main functionality behind the sentiment classification comes from using Word2Vec to construct tweet feature vectors and then training a random forest classifier on a pre-labeled tweet corpus. We will cover each of these topics in greater detail later, but for now let's focus on the setup.

The code for this first post will be entirely in Python and will use the following libraries:

  • Tweepy (Python library for dealing with the Twitter API)
  • NLTK (Natural Language Toolkit used for stopwords)
  • gensim (Python library containing Word2Vec algorithm)
  • Pandas (for managing the Tweet training corpus)
  • Scikit-Learn (providing machine learning functionality)

pip install tweepy nltk gensim pandas scikit-learn

Note that in order to use NLTK's stop word corpus you must first download it via

>>> import nltk
>>> nltk.download()
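
Alternatively, if you'd rather skip the interactive downloader, you can grab just the stop word corpus directly:

>>> import nltk
>>> nltk.download('stopwords')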

With that out of the way, the basic flow follows two main lines:

Training the Classifier

  1. Load the GoogleNews pre-trained Word2Vec vectors.
  2. Import and format/clean the Sentiment140 Twitter Corpus.
  3. Vectorize the training data tweets.
  4. Train a classifier on the cleaned corpus data (I chose a Random Forest).

Predicting Tweet Sentiment From Stream

  1. Grab JSON-formatted output from the streaming endpoint of the Twitter API.
  2. Strip/clean tweet text.
  3. Vectorize cleaned tweet.
  4. Predict sentiment and store data in database.

Representing Words as Vectors

We'll start out by training a machine learning classifier to predict tweet sentiment. The general methodology behind any classification scheme is to take a dataset and transform each data point into a vector that, we hope, represents its most salient features with respect to the task at hand. This process of feature extraction can be nearly trivial for some applications, but in general it represents a substantive obstacle with no unique solution.

In the case of natural language processing (NLP), the process of feature extraction for use in sentiment analysis is particularly ill-defined. When we, as people, parse a sentence or phrase we can draw upon years of past experience and contextual clues from countless interactions with language to determine whether said phrase is generally positive, negative, or neutral. For instance, most people would generally decide that the sentence "He is not the brightest crayon in the box." is reasonably negative. However, without the idiomatic context, the phrase itself might be viewed as merely neutral.

Additionally, taking this lack of context to the extreme, we might consider the same phrase where even the order of the words (that is, the contextual clues in the text itself) were beyond our knowledge, or, in its ultimate manifestation, the ordering of letters and characters. It's easy to see the difficulty in deciding anything, other than perhaps letter frequency, about a phrase that has been deconstructed so completely. Yet this is somewhat akin to the world that machines inhabit, having no previous knowledge of language context beyond syntactic rules for translating code documents into binary.

Thus, while there are certainly more naive approaches to constructing vector representations of documents, the big challenge is to build systems that simulate the process of learning contextual knowledge. To that end, we will focus on a system of algorithms, called Word2Vec, designed to extract contextual clues for specific words via the analysis of massive text datasets.

As much as I'd love to discuss how Word2Vec works in gritty technical detail (perhaps a future blog post), for now I will point you to a blog post written by a friend of mine, which is quite accessible to the lay-person but contains a large number of technical links at the bottom for those who wish to delve further. Suffice it to say that Word2Vec "learns" N-dimensional vector representations of words by analyzing which words appear within a certain distance of one another across a training corpus of text. At the end of the day we want two words to have vectors that are "close" to each other when it is probabilistically favorable for them to appear near each other in a collection of documents. A decent metric for determining this "closeness" is the angle between the vectors, which is easily obtained via the inner (or dot) product. For example, one might expect "lion" and "giraffe" to be mentioned in close proximity fairly often, and thus the angle between their vector representations should be small.
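
To make "closeness" concrete, once the GoogleNews model is loaded (as we do in the next code block), gensim lets us query the cosine similarity between two word vectors directly. A quick illustrative check might look like this:

# assumes `model` is the gensim Word2Vec model loaded below
print model.similarity('lion', 'giraffe')    # cosine similarity between the two word vectors
print model.similarity('lion', 'keyboard')   # presumably a lower score for unrelated words
print model['lion'].shape                    # each word vector is a 300-dimensional numpy array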

Given the amount of text needed to train these models, it is often advantageous to use large corpora of words with their pre-trained vector representations. As the original implementation of Word2Vec was developed by researchers at Google, the codebase, as well as pre-trained vector sets, can be found on their site here. For this example we will use their vectors that have been trained on GoogleNews stories.

Since our vectors are pre-trained, we just need to load the binary file into gensim's Word2Vec implementation. Vectors for individual words are then easily accessible by using the word strings as keys on our model instance. Next, we'll construct a function that takes a cleaned string (along with our stop word set and model), removes common words of little contextual meaning, called stop words, and outputs a vector equal to the average of the vectors of the remaining words.

import gensim as gs
import numpy as np
from nltk.corpus import stopwords

# common English words that carry little contextual meaning
stop_set = set(stopwords.words("english"))

# load the pre-trained GoogleNews vectors (300 dimensions per word)
model = gs.models.Word2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

def phrase2vec(phrase, stop_set, model):
    """Average the Word2Vec vectors of the non-stop words in a phrase."""
    words = phrase.lower().split()
    words = [w for w in words if w not in stop_set]
    size = 0
    vec = np.zeros(300)
    for word in words:
        try:
            vec = np.add(vec, model[word])
            size += 1
        except KeyError:
            # word not in the pre-trained vocabulary; skip it
            pass
    if size == 0:
        size = 1
    return np.divide(vec, size)

We now have a nice function for vectorizing tweets, once they have been properly formatted.
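
As a quick sanity check, we can run a toy phrase through it and confirm that we get back a single 300-dimensional vector (the average described above):

vec = phrase2vec("the lion sleeps tonight", stop_set, model)
print vec.shape    # (300,)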

Cleaning the Data

Next, we'll need some training data to run our feature extraction method on. Though there are several Twitter sentiment corpora on the web, many of them sacrifice sample size for accuracy of the training labels. Since we are dealing with the potentially very powerful, but somewhat nebulous, concept of word vectors, we should instead err on the side of a much larger training set, perhaps at a slight cost in labeling accuracy. This can be accomplished by using heuristics for classification rather than relying on manual labeling. The Sentiment140 corpus builds its classification heuristic from emoticons: tweets that contain mostly positive emoticons are labeled positive, and likewise for the negative set.

With this heuristic one can immediately automate the data classification task for the training set, which allows for a far greater sample size (in this case 1.6M tweets), at the expense of a possible drop in accuracy.
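
To make the idea concrete, here is a minimal sketch of what such an emoticon heuristic could look like. This is purely illustrative (the emoticon sets are hypothetical, not the exact ones Sentiment140 used, and the real pipeline also strips the emoticons from the text afterwards); the 0/4 labels match the convention used in the corpus.

# illustrative emoticon sets (hypothetical, not the exact ones used by Sentiment140)
POSITIVE = set([':)', ':-)', ':D', '=)'])
NEGATIVE = set([':(', ':-(', ":'("])

def emoticon_label(tweet):
    """Return 4 (positive), 0 (negative), or None when the tweet gives no signal."""
    tokens = tweet.split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return 4
    elif neg > pos:
        return 0
    return None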

To analyze and clean our data we're going to work with Pandas, but first let's take a brief look at the format of the training data.

$ head -4 training.1600000.processed.noemoticon.csv

"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"
"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"
"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "
$

From the first few lines you can see the general format of the data and should notice several things. For instance, the training data file doesn't have a header row, so we'll have to name the columns ourselves. Also, the text of the tweets has been stripped of emoticons, but it still contains plenty of errant strings, like URLs and @name tags, that we'll need to remove before proceeding to vectorize them.

First let's import the csv file and store it as a dataframe.

import pandas as pd
name_list = ['sentiment','id','time','query','user','text']
df = pd.read_csv("training.1600000.processed.noemoticon.csv",\
                 header=None, names= name_list)
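
It's worth a quick peek at what we've loaded. In this corpus the sentiment column uses 0 for negative and 4 for positive tweets, split roughly evenly across the 1.6M rows:

print df.shape                         # (1600000, 6)
print df['sentiment'].value_counts()   # labels 0 and 4, roughly 800k each
print df['text'].head()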

Now we'll create a function that takes a string and strips it down to only the text content we care about, removing URLs, @name tags, emoji character sets, and the "#" from hashtags. We can also make the data more uniform by collapsing repeated letters, as in the string "loooooooooove", down to at most two. This is best accomplished using Python's regular expressions module. I'll construct the function here, keeping things a bit long-form for the sake of readability.

import re

# compile regular expressions that match emoji/non-latin unicode and repeated characters
emoji = re.compile(u'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', re.UNICODE)
multiple = re.compile(r"(.)\1{1,}", re.DOTALL)

def format(tweet):

    # strip emoji and other non-latin characters
    stripped = emoji.sub('', tweet)

    # strip URLs
    stripped = re.sub(r'https?[^\s]+', '', stripped)

    # strip "@name" components
    stripped = re.sub(r'@[A-Za-z0-9_]+', '', stripped)

    # strip HTML entities such as '&amp;', '&lt;', etc.
    stripped = re.sub(r'&\w+;', '', stripped)

    # replace common punctuation with spaces
    stripped = re.sub(r'[#!\-+:/]', ' ', stripped)

    # strip the common "RT" retweet marker
    stripped = re.sub(r'\bRT\b', '', stripped)

    # collapse whitespace down to single spaces
    stripped = re.sub(r'\s+', ' ', stripped).strip()

    # collapse runs of repeated letters down to two ("loooooove" -> "loove")
    stripped = multiple.sub(r"\1\1", stripped)

    # keep only letters, digits, apostrophes, and spaces
    # if we wished to deal with foreign-language tweets, we would need to
    # translate them before taking this step.
    stripped = re.sub(r"[^a-zA-Z0-9']", ' ', stripped).strip()

    return stripped

Take a look at Python's RegEx HOW-TO if you're confused about the patterns above. You can also test them out yourself by visiting Regexr.
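
To see the cleaning in action, here's a made-up example tweet run through the function (the exact output, of course, depends on the patterns above):

sample = "RT @someuser: Loooove this song!!! http://t.co/abc123 #NowPlaying"
print format(sample)    # -> something like "Loove this song NowPlaying"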

Training the Classifier

Now we're ready to train our classifier. Let's start off by making an array of feature vectors for our training data, as well as an array of target values for the sentiment.

# initialize a numpy array of the proper shape
training_data = np.zeros((df.shape[0], 300))

# clean and vectorize each tweet, storing the result row by row
for counter, tweet in enumerate(df['text'].values):
    filtered = format(tweet)
    training_data[counter] = phrase2vec(filtered, stop_set, model)

training_target = df['sentiment'].values
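
Note that vectorizing all 1.6M tweets takes a while. Once it finishes, a quick sanity check on the arrays (shapes assume the full corpus):

print training_data.shape          # (1600000, 300)
print np.unique(training_target)   # [0 4] -- negative and positive labels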

Now we simply pick which ML algorithm we wish to train. In this case I will be using a random forest classifier. Once we train our model we can easily export the result using scikit-learn's joblib utility, which builds on Python's pickle module (say that 5 times fast) to dump our trained classifier to a file.

from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib

cl = RandomForestClassifier()

cl.fit(training_data, training_target)

joblib.dump(cl,'Pickle/rfc.pkl')

That's basically it for training. So far we have trained and exported a classifier to use for predicting tweet sentiment later on. At this point it would normally be necessary to validate the model via the method of your choice. Scikit-learn has some nice cross-validation functionality for just this purpose, or you can grab the test data from the same Sentiment140 folder and compute your test metrics. For simple classification tasks like this I find F-scores rather useful. Also note that we have only used the default random forest configuration here; there are many options with which to tune performance for your particular setup.
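
For example, here is a minimal hold-out evaluation sketch (just an illustration, using a simple train/test split rather than full cross-validation; the cross_validation module name reflects scikit-learn releases of the time, newer versions use sklearn.model_selection):

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer releases
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(
    training_data, training_target, test_size=0.2, random_state=42)

cl = RandomForestClassifier()
cl.fit(X_train, y_train)

# F-score for the positive class (labeled 4 in this corpus)
print f1_score(y_test, cl.predict(X_test), pos_label=4)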

Streaming and Predicting Tweets

Now that we have a model, we can move on to predicting sentiment for a tweet stream in realtime. In order to start streaming tweets, we'll use a nice Python package called Tweepy, which interfaces with the Twitter API and allows us to focus our attention on the task at hand rather than dealing with OAuth validation ourselves. You will need some credentials before we can start, so head over to Twitter's application page and register your own "app" to receive them.

Once you've got your credentials and installed Tweepy, we can proceed to setting up a streaming client. In this example I'll set one up that outputs the sentiment-valued stream to a file. First we need to construct a Listener class that inherits from tweepy.streaming.StreamListener. We'll also be handling JSON, so we need to import Python's json module as well. We will use the on_data method to perform a task every time we receive a tweet.

import csv
import json
import sys
import time
from datetime import datetime

import gensim as gs
from nltk.corpus import stopwords
from sklearn.externals import joblib
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# format() and phrase2vec() from the earlier sections are assumed to be
# defined in (or imported into) this script as well.


def get_sentiment(text, stop, model, trained_classifier):
    """Return the classifier's probability that a cleaned tweet is positive."""
    vec = phrase2vec(text, stop, model)
    return trained_classifier.predict_proba(vec.reshape(1, -1))[0][1]


class Listener(StreamListener):

    def __init__(self, classifier, stops, model, start):
        super(Listener, self).__init__()
        self.cl = classifier
        self.stop = stops
        self.model = model
        self.start = start

    def on_data(self, data):
        # parse the JSON payload of the stream event
        elem = json.loads(data)

        # ignore events (deletes, limit notices, etc.) that carry no tweet text
        if 'text' in elem and elem.get('lang') in ('en', 'en-gb'):
            tweet = elem['text']
            filtered = format(tweet)

            # only score tweets with more than two remaining words
            if len(filtered.split()) > 2:
                sentiment = get_sentiment(filtered, self.stop, self.model, self.cl)

                with open("./Data/{0}".format(sys.argv[1]), 'a') as f:
                    writer = csv.writer(f)
                    writer.writerow([tweet.encode('utf-8'),
                                     filtered.encode('utf-8'),
                                     sentiment,
                                     datetime.utcnow()])

        # keep streaming until the requested number of hours has elapsed
        t1 = time.time()
        t0 = self.start
        if t1 - t0 <= float(sys.argv[2]) * 3600.:
            sys.stdout.write("\r{0}".format(t1 - t0))
            sys.stdout.flush()
            return True
        else:
            print "\nTime's Up!"
            return False

    def on_error(self, status):
        print status

This class takes incoming tweets and writes them to the file specified by the first command-line argument. The second argument is reserved for the time (in hours) we wish to keep the stream alive. You can easily edit this to continue indefinitely if desired.

From here, we merely need to load our models and set up the stream with our credentials.

if __name__ == '__main__':

    ####################Variables that contain the user credentials for the Twitter API 
    access_token = "ACCESS_TOKEN"
    access_token_secret = "ACCESS_TOKEN_SECRET"
    consumer_key = "CONSUMER_KEY"
    consumer_secret = "CONSUMER_SECRET"
    #########################################################

    #Write the output file header
    with open("./Data/{0}".format(sys.argv[1]), 'w') as f:
        f.write("original_text,filtered_output,sentiment,time\n")


    print "Loading Classifier...\n\n"

    #load trained sklearn classifier
    cl = joblib.load('Pickle/rfc.pkl')

    print "Classifier Loaded...loading model...\n\n"

    #load w2v vectors from GoogleNews training set. 
    model = gs.models.Word2Vec.load_word2vec_format(
        './static/Data/GoogleNews-vectors-negative300.bin', binary=True)


    print "Model Loaded...\n\n"

    #load set of stop words
    stop_set = set(stopwords.words("english"))

    start = time.time()

    l = Listener(cl, stop_set, model, start)

    #This handles Twitter authentication and the connection to Twitter Streaming API

    #Oauth Handling
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    stream = Stream(auth, l)
    print "Listening..."

    #initiate stream
    stream.sample()

When we run the script with python path_to_script.py output_file_name time_in_hours, we should see our file being populated with tweets and their corresponding sentiments.
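
For instance, with a hypothetical script name and a two-hour run, the invocation would look something like:

$ python tweet_stream.py sentiment_output.csv 2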

More to Come

This concludes the first post on SentiMap.us. Hopefully you've gained a basic understanding of the nuts and bolts behind realtime sentiment analysis of Twitter streams.

In the next post we will cover using MongoDB to hold fixed-size collections of tweets, setting up a basic web server using Flask, and serving the data out to clients using the pub/sub messaging paradigm.