A Web Application to Determine Sentiment from Movie Reviews

8/3/19

This blog post was largely inspired by chapters 8 and 9 of Sebastian Raschka's 'Python Machine Learning' book.

The code for the webapp can be found at: https://github.com/raybellwaves/WebAppNLP

The website for the app is: http://raybellwaves.pythonanywhere.com/

Introduction

Natural Language Processing

Natural Language Processing (NLP) is a sub-field of Artificial Intelligence (AI) which involves processing and analyzing natural language [1]. You are most likely exposed to this technology daily without realizing it. Examples include telling Alexa to play something, writing something on social media, or having Google translate something for you. Using these examples, NLP can be broken down further into speech recognition, sentiment analysis (Facebook knows what to put at the top of your news feed to either enrage you or make you happy) and machine translation.

Sentiment analysis

In this blog post I will demonstrate how machine learning can be applied to text documents to infer sentiment. I will use a dataset of movie reviews from the Internet Movie Database (IMDB) [2] and build a predictor that can distinguish between positive and negative reviews.

I will show how to clean and prepare text data and how to create feature vectors from text documents. While 'feature vectors' can sound complicated, it is simply the process of converting text to numbers. Machine learning algorithms (math) cannot work with text, so it has to be converted to numbers. 'Feature' is the word used in machine learning to describe a variable which is fed into the machine learning algorithm. If I wanted to predict the price of a house then I would very likely use the number of rooms in the house as a feature. A 'vector' is a collection of numbers.

Imagine you have a document with two sentences: the first being 'the cat jumps' and the second being 'the dog eats'. A simple way to convert each sentence to a vector is using the bag-of-words model [3]. A dictionary is created with all the words in the document {cat, dog, eats, jumps, the} and they are assigned an index {cat->0, dog->1, eats->2, jumps->3, the->4}. A sentence is then converted to a vector the same size as the dictionary. It is initialized with 0's and the unique words are mapped to their index in the dictionary. The sentence 'the cat jumps' will look like:

[1 0 0 1 1]

The words 'cat', 'jumps' and 'the' have been mapped to their index location. These values are also known as raw term frequencies.
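As an illustration (this snippet is not part of the app itself), scikit-learn's CountVectorizer implements exactly this bag-of-words transformation for the two example sentences:

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = ['the cat jumps', 'the dog eats']
bag = count.fit_transform(docs)
print(sorted(count.vocabulary_.items()))  # [('cat', 0), ('dog', 1), ('eats', 2), ('jumps', 3), ('the', 4)]
print(bag.toarray()[0])                   # [1 0 0 1 1] for 'the cat jumps'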

Lastly, I will show how to use a very simple machine learning algorithm which is well suited for this problem.

Machine learning and state-of-the-art NLP

As much as 'AI' and 'machine learning' are buzzwords, they are essentially pattern-recognition algorithms. There is a trade-off between model complexity (and increased accuracy) and interpretability. Simple models may yield lower accuracy scores than more complex models, but they are faster, easier to work with and offer some level of intuition. For this blog post I will therefore focus on a simple model, but I will comment on state-of-the-art NLP in this section as it is an area of active research.

OpenAI recently made tech headlines by stating that their latest NLP model (GPT-2) has huge success in 'language generation, reading comprehension, machine translation, question answering, and summarization' [4]. The part which mostly stood out was the statement 'Due to our concerns about malicious applications of the technology, we are not releasing the trained model'. This has become bragging rights for companies now, as they are keen to say "it is so good that it could be disruptive technology". An example of a malicious use in this case could be the creation of 'bots' that generate content on social media to spread misinformation. The publicity paid off for OpenAI, as they recently received $1bn from Microsoft [5]. The algorithm behind their model was 'improved' (modified) to create the algorithm used in BERT, developed by Google. BERT uses a 'bidirectional Transformer encoder'. Over the last couple of years, Transformers have established themselves in the NLP community as they provide much better parallelism than the widely established Recurrent Neural Network architectures [6].

The Machine Learning Model

Cleaning text data

First the movie review data is split into two columns: review and sentiment. Review contains the movie review text and sentiment contains a label which is 1 for a positive review and 0 for a negative review.

Next the movie review data is 'tokenized' using the following:

import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)  # remove HTML markup
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())  # keep emoticons
    text = re.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')  # strip non-word characters, append emoticons without the '-' 'nose'
    tokenized = [w for w in text.split() if w not in stop]  # split into words and drop stop words
    return tokenized

The python package re (regular expression) is used to clean up the text by:

  • Removing HTML markup (anything between < and >, such as <br /> tags).
  • Emoticons such as :-), :P, and =) are kept as they can provide useful information about sentiment.
  • Text is converted to lower case.
  • Stop words (taken from the python package nltk) such as 'and' and 'the' are removed as they do not provide useful information about sentiment.
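As a quick check, here is a minimal usage example I added (the sample sentence is made up, and the output assumes nltk's English stop word list):

tokenizer("</a>This :) is :( a test :-)!")
# returns ['test', ':)', ':(', ':)']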

The model

Before the text data is fed into the machine learning model it is transformed using a HashingVectorizer. This creates a matrix of token occurrences (similar to the way the bag-of-words model works). It has the advantage of being lightweight as it doesn't need to hold a vocabulary in memory, and it is well suited for pickling (more on this later).

from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21, # make large to avoid hash collisions
                         preprocessor=None,
                         tokenizer=tokenizer)

The classifier used is a stochastic gradient descent classifier with a logistic regression loss. I used a large value for max_iter to avoid early stopping and improve accuracy:

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log', # logistic regression loss
                    penalty='l2', # add ridge regression to avoid overfitting training data
                    max_iter=1000000, # make large to avoid early stop
                    n_jobs=-1) # use all cores

This classifier is a better choice than a standard logistic regression here because it has a partial_fit method. This allows the weights (model) to be updated as new data becomes available.
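To illustrate, below is a minimal sketch of out-of-core training with partial_fit. Here get_minibatch and doc_stream are hypothetical helpers that yield batches of raw reviews and labels from the CSV file (they are not shown above), and the batch count and size are arbitrary:

import numpy as np

classes = np.array([0, 1])  # negative and positive labels
for _ in range(45):  # e.g. 45 mini-batches of 1,000 reviews each
    X_text, y = get_minibatch(doc_stream, size=1000)  # hypothetical helper
    if not X_text:
        break
    X = vect.transform(X_text)  # hash the tokenized reviews into a sparse matrix
    clf.partial_fit(X, y, classes=classes)  # update the model weights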

Using these settings I got ~90% accuracy on the training and test data sets.

Pickle

The pickle module converts a python object into a byte stream. This is a convenient way to store the weights of the model and creates what is often known as a 'pre-trained' model.
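A minimal sketch of how the trained objects can be pickled (the directory and file names here are my assumptions, not necessarily those used in the repository):

import os
import pickle

dest = os.path.join('movieclassifier', 'pkl_objects')  # assumed output directory
os.makedirs(dest, exist_ok=True)
pickle.dump(stop, open(os.path.join(dest, 'stopwords.pkl'), 'wb'), protocol=4)
pickle.dump(clf, open(os.path.join(dest, 'classifier.pkl'), 'wb'), protocol=4)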

The Web App

pythonanywhere

I use pythonanywhere, which offers a free beginner account, to host the model.

flask

The website is written using flask. Flask is written in pure python and provides an easy way to pass variables to be rendered in HTML. For example, I can pass the prediction (positive/negative), a color (green/red) and the probability of the prediction to an HTML file as:

<div>This movie review is <strong style="color: {{ font_color }}">{{ prediction }}</strong> (probability: {{ probability }}%).</div>
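For context, a minimal sketch of a Flask route that could render this template. Here classify is a hypothetical helper that wraps vect.transform and the pickled classifier's predict/predict_proba, and the form field and template names are assumptions:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/results', methods=['POST'])
def results():
    review = request.form['moviereview']  # assumed name of the review text box
    label, proba = classify(review)  # hypothetical helper returning 'positive'/'negative' and a probability
    return render_template('results.html',  # assumed template name
                           prediction=label,
                           font_color='green' if label == 'positive' else 'red',
                           probability=round(proba * 100, 2))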

sqlite

I want to store the reviews and sentiment labels that users submit so I can improve the model. For that I use sqlite. I can use the python API via sqlite3 as:

import sqlite3

def sqlite_entry(path, document, y):
    # store the review text, its label and a timestamp in the review_db table
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("INSERT INTO review_db (review, sentiment, date)"
              " VALUES (?, ?, DATETIME('now'))",
              (document, y))
    conn.commit()
    conn.close()
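The function above assumes the review_db table already exists. A minimal sketch of creating it (the database file name is an assumption; the table and column names are taken from the INSERT statement):

import sqlite3

conn = sqlite3.connect('reviews.sqlite')  # assumed database file name
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS review_db'
          ' (review TEXT, sentiment INTEGER, date TEXT)')
conn.commit()
conn.close()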

Auto update

I have created an update.py which works by updating the model every time the web application starts, but I have opted not to switch it on. The pickled model could get corrupted with multiple users using the site, and ideally I would want to do some cleaning of the user reviews to ensure they are meaningful. If I wanted to maintain it (which I don't really), it is probably best for me to download the reviews, clean them, update the model and upload the updated pickled classifier.
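For reference, a minimal sketch of what such an update step could look like: pull the stored reviews out of the SQLite database and feed them to partial_fit (the database path and batch size here are assumptions, not necessarily what update.py does):

import sqlite3
import numpy as np

def update_model(db_path, clf, batch_size=10000):
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute('SELECT review, sentiment FROM review_db')
    results = c.fetchmany(batch_size)
    while results:
        data = np.array(results)
        X = vect.transform(data[:, 0])  # hash the stored review text
        y = data[:, 1].astype(int)
        clf.partial_fit(X, y, classes=np.array([0, 1]))  # update the weights
        results = c.fetchmany(batch_size)
    conn.close()
    return clf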