The dataset can be accessed here. For preprocessing, the text of each speech is stripped of newlines, joined, and then split again on sentence enders (!?.). It is converted to lowercase and all punctuation is replaced by spaces. Finally, the sentences are joined into a single text with '.' as the separator. Stopwords are removed and the remaining words are lemmatized.
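A minimal sketch of this pipeline using NLTK (the function name preprocess and the exact regexes are assumptions, not the original code):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(speech):  # speech: raw text of one address
    text = ' '.join(speech.splitlines())              # strip newlines and rejoin
    sentences = re.split(r'[!?.]', text)              # split on sentence enders
    cleaned = []
    for sent in sentences:
        sent = re.sub(r'[^\w\s]', ' ', sent.lower())  # lowercase; punctuation -> space
        words = [lemmatizer.lemmatize(w) for w in sent.split() if w not in stop_words]
        if words:
            cleaned.append(' '.join(words))
    return '.'.join(cleaned)                          # rejoin with '.' as separator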
AFINN is a list of words with sentiment scores. A quick and dirty way to analyse sentiment is to replace each word with its score from a list such as AFINN and compute the average sentiment score of each speech. Significant dips in the sentiment scores line up with important events on the US history timeline.
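A minimal sketch of the scoring, assuming the AFINN word list has already been loaded into a dict afinn mapping words to integer scores (the loading step and names are assumptions, not the original code):

def avg_sentiment(words, afinn):  # words: tokenized speech; afinn: word -> score in [-5, 5]
    scores = [afinn[w] for w in words if w in afinn]
    return sum(scores) / len(scores) if scores else 0.0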
The first half of the 20th century has very low scores compared to the second half, presumably due to the two World Wars and the Great Depression. After the end of World War II, sentiment scores climbed steeply over the next few years. At the start of the new millennium, there are dips in the score due to 9/11, the Iraq War and the recession.
Average sentiment score of SOTU speeches from 1901 to 2016. Red bars indicate Republican presidents and blue bars indicate Democrats.
We now focus on the speeches of 4 presidents: George H. W. Bush (4), William Clinton (8), George W. Bush (8) and Barack Obama (8). The numbers in parentheses denote the number of years each was president, and hence the number of SOTU speeches delivered. We are therefore looking at 28 speeches in total.
After the usual preprocessing of stopword removal and lemmatization, we find the unigram counts of each of the 28 speeches. Combining all speeches gives the total vocabulary. The counts of each speech are then Laplace smoothed to obtain a unigram probability distribution. Finally, we compute the average KL divergence (AKLD) of each pair of speeches. Since KL divergence, as defined by the equation below, is not symmetric, we use the symmetrised measure AKLD(p,q) = 0.5*(KLD(p,q) + KLD(q,p)). The results are described below.
KLD(P || Q) = Σ_x P(x) log(P(x) / Q(x))

The equation for KL divergence for two discrete distributions P and Q.
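A sketch of the smoothing and AKLD computation (counts is an assumed name for a speech's unigram count vector over the shared vocabulary):

import numpy as np

def laplace_smooth(counts):  # counts: 1-D array of unigram counts over the vocabulary
    return (counts + 1.0) / (counts.sum() + len(counts))

def kld(p, q):  # KL divergence between two discrete distributions
    return np.sum(p * np.log(p / q))

def akld(p, q):  # symmetrised average KL divergence
    return 0.5 * (kld(p, q) + kld(q, p))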
The 28x28 heatmap generated by plotting the pairwise average KL divergences of each pair of speeches.
We observe clear blocks forming along the diagonal, which indicate low KL divergence among speeches made by the same person. This means that speeches by the same president have similar content, which is an expected result. More interestingly, speeches of presidents from the same party also have lower KL divergences than speeches of presidents from different parties. For example, George W. Bush's speeches have a lower KL divergence (darker colour block) with his father's than with Clinton's or Obama's. Thus unigrams have good discriminative potential for identifying speakers.
Topic modelling using Latent Dirichlet Allocation (LDA) is an unsupervised method that takes a corpus as input and outputs a fixed number of 'topics'. Each topic is a collection of related words that tend to occur together in documents, and each document is a mixture of topics.
For this analysis, we use sklearn's LDA. First we run a CountVectorizer on the corpus; the resulting term-frequency matrix is then passed to the LatentDirichletAllocation class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=2000, stop_words='english')
tf = tf_vectorizer.fit_transform(documents_sents)
tf_feature_names = tf_vectorizer.get_feature_names_out()
num_topics = 10
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, learning_method='online', learning_offset=50., random_state=0)
lda.fit(tf)
for topic_idx, topic in enumerate(lda.components_):  # print the top 15 words of each topic
    print("Topic %d" % topic_idx)
    print(" ".join([tf_feature_names[i] for i in topic.argsort()[:-15 - 1:-1]]))
We use the speeches of the same 4 presidents in this section: George H. W. Bush, William Clinton, George W. Bush and Barack Obama. We get the following 10 topics, each of which we have named by looking at its top 15 words:
For further analysis, we create a feature vector out of the topics generated in the last section. The number of topics is 10. First we find the top 15 words of each topic:
topwords = []
for topic_idx, topic in enumerate(lda.components_):  # topic is a np array of 2000 numbers (vocab size)
    tmp = topic.argsort()[:-15 - 1:-1]  # indices of the top 15 words
    topwords.append({tf_feature_names[i]: topic[i] for i in tmp})
# topwords is of length 10; topwords[i] maps the top words of topic i to their scores
Then we generate a 10-D feature for each sentence of a speech. For each word in the sentence, we increment the ith component of the 10-D vector if the word belongs to the top words of the ith topic. The code to build the sentence topic feature is shown below:
sent_topic_ft = [0] * 10  # 10 is the number of topics
for word in words:  # words is a list of words in a sentence of a speech
    for tp_idx, tp in enumerate(topwords):  # topwords is generated by the earlier code segment
        if word in tp:
            sent_topic_ft[tp_idx] += 1
Finally, a 10-D document topic feature is obtained by averaging the sentence topic features across all sentences of the speech. Thus each speech is reduced to a single 10-D feature.
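The averaging step is then a one-liner (sent_features is an assumed name for the list of per-sentence vectors of one speech):

import numpy as np

doc_topic_ft = np.mean(np.array(sent_features), axis=0)  # 10-D document topic feature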
To test the goodness of this feature, we first ask which presidents use each topic the most. For that, we find the top 3 speeches for each topic (that is, for topic i, the speeches with the maximum value in the ith dimension of their 10-D topic feature) and then list their speakers:
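This selection can be sketched as follows (doc_features, a 28x10 matrix stacking the document features, and speakers, the list of speech labels, are assumed names):

import numpy as np

for i in range(10):  # 10 topics
    top3 = np.argsort(doc_features[:, i])[::-1][:3]  # speeches weighting topic i most heavily
    print("Topic %d: %s" % (i, ", ".join(speakers[j] for j in top3)))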
The results make sense. For example, Bush Jr. and Obama talk about the Iraq war, which started during the former's presidency and ended during the latter's. Democratic presidents Obama and Clinton talk a lot about jobs, health care and social programs, while Republican presidents Bush Sr. and Bush Jr. talk about world affairs and the Middle East.
Next we look at the top 3 topics in each speech:
The speeches contain the expected topics. For example Obama's speeches are dominated by the 'jobs' topic, perhaps due to the Great Recession. Bush Jr.'s speeches focus a lot on world affairs.
Finally, we use the 10-D topic feature for each speech and construct a heatmap of distances between speeches. To get more discriminative features, we use metric learning. Specifically, we learn a linear transformation that separates the features so that speeches by the same speaker lie close together in the embedding, while speeches by different speakers lie far apart. This method is called Large Margin Nearest Neighbour (LMNN). We use the PyLMNN package, with the number of neighbours set to 2.
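A sketch of the metric-learning step with PyLMNN (the class name follows the package's README, though the exact constructor signature may differ across versions; doc_features and labels are assumed names):

from pylmnn import LargeMarginNearestNeighbor as LMNN
from scipy.spatial.distance import pdist, squareform

lmnn = LMNN(n_neighbors=2)          # number of target neighbours set to 2
lmnn.fit(doc_features, labels)      # labels: speaker id of each of the 28 speeches
embedded = lmnn.transform(doc_features)
heatmap = squareform(pdist(embedded))  # 28x28 matrix of pairwise Euclidean distances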
28x28 heatmap using the raw 10-D topic features
28x28 heatmap using the 10-D topic features after LMNN metric learning
Observe that though blocks of similarity are present along the diagonal, they are much less pronounced for the raw features (left). On applying LMNN, the blocks become more prominent (right). Also note that the raw feature is entirely unsupervised. Thus even with a 10-D feature we see quite impressive discriminative power, which is further enhanced by metric learning (which is supervised). Compare this with the unigram heatmap, which forms nicely interpretable blocks but uses thousands of features (depending on the vocabulary size, which is 19,123 in this case). The proposed feature is therefore a good method of dimensionality reduction.
Notice that inside Clinton's and Obama's 8x8 blocks, there are 2 sub-blocks of size 4x4. These could correspond to their 2 terms, which might focus on different issues. It is intriguing that the feature vector is able to pick out presidential terms.
As a baseline for this dimensionality reduction method, we also look at the heatmap of Euclidean distances between speech features obtained by applying PCA to the unigram probability distributions, again reducing to 10 dimensions, and we apply LMNN on top of the PCA features. The baseline can be sketched as follows (unigram_probs, holding the 28 smoothed unigram distributions as rows, is an assumed name):
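from sklearn.decomposition import PCA

pca_features = PCA(n_components=10).fit_transform(unigram_probs)  # 28 x 10 baseline features

We get the following two heatmaps, with clear-cut blocks of similarity using PCA and even sharper ones using PCA+LMNN.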
28x28 heatmap using raw 10-D PCA-reduced features from the unigram distribution
28x28 heatmap using 10-D PCA-reduced features from the unigram distribution, after LMNN metric learning
The blog here analyses the speeches based on unigrams. Let us extend that analysis to bigrams. The NLTK library is used, and 4 measures of bigram collocation are considered: pointwise mutual information (PMI), frequency (FREQ), chi-squared (CHI) and likelihood (LIKE). Only post-Reagan presidents are considered.
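A sketch of the collocation scoring with NLTK (tokens, a flat list of preprocessed words from one group of speeches, is an assumed input, as is the frequency filter of 3):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # drop very rare bigrams
top_pmi = finder.nbest(BigramAssocMeasures.pmi, 25)
top_freq = finder.nbest(BigramAssocMeasures.raw_freq, 25)
top_chi = finder.nbest(BigramAssocMeasures.chi_sq, 25)
top_like = finder.nbest(BigramAssocMeasures.likelihood_ratio, 25)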
We observe that the FREQ measure is not very good: it picks up bigrams like 'right now' and 'make sure' based on their raw frequencies, which are not very interesting. More sophisticated measures like PMI and CHI perform better.
Below we list the top 25 bigrams based on FREQ for Democratic and Republican presidents in the post-Reagan era:
Democratic
'tax credit', 'clean energy', 'high school', 'work together', 'working families', 'young people', 'health insurance', 'vice president', 'every american', 'fellow americans', 'first time', 'right now', 'world s', 'new jobs', 'two years', 'middle class', 'years ago', 'social security', 'make sure', 'united states', '21st century', 'america s', 'last year', 'american people', 'health care'
Republican
'small businesses', 'federal government', 'nuclear weapons', 'years ago', 'mass destruction', 'nation s', 'every american', 'iraq s', 'law enforcement', 'work together', 'united nations', 'last year', 'al qaida', 'health insurance', 'make sure', 'al qaeda', 'tax relief', 'fellow citizens', 'saddam hussein', 'middle east', 'america s', 'american people', 'health care', 'social security', 'united states'
Some interesting phrases in the Democratic speeches are: tax credit, clean energy, high school, working families, health insurance, new jobs. These bigrams are expected, since clean energy, families, education and health care are part of the Democrats' core focus.
Some interesting phrases in the Republican speeches are: nuclear weapons, mass destruction, health insurance, tax relief, middle east. These bigrams are expected due to the Republicans' emphasis on the war in the Middle East and on lower taxation.
Statistical tests can be performed to identify bigrams that occur significantly more often in one party's speeches than in the other's. The chi-square test and the G-test give similar results.
Null hypothesis: the bigram occurs at the same rate in the speeches of both parties.
Test statistic: chi-square statistic on the 2x2 contingency table.
Null distribution: chi-square with 1 degree of freedom.
Significance level: 0.05
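A sketch of the per-bigram test with SciPy (the counts below are made up for illustration; passing lambda_="log-likelihood" yields the G-test variant):

from scipy.stats import chi2_contingency

# 2x2 contingency table for one bigram: rows = party,
# columns = (count of this bigram, count of all other bigrams)
table = [[45, 60000], [12, 70000]]
chi2, p, dof, _ = chi2_contingency(table)  # chi-square test, dof = 1
g, p_g, _, _ = chi2_contingency(table, lambda_="log-likelihood")  # G-test
significant = p < 0.05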
In the table below, we take the top bigrams (by FREQ) from each party and separate them into 'significant' and 'insignificant':
Significantly different frequencies
Republican: nuclear weapons, years ago, mass destruction, iraq s, law enforcement, united nations, last year, al qaeda, tax relief, fellow citizens, saddam hussein, middle east, social security, united states
Democrat: clean energy, high school, fellow americans, first time, right now, world s, new jobs, two years, middle class, years ago, social security, united states, 21st century, last year

Insignificant
Republican: small businesses, federal government, nation s, every american, work together, health insurance, make sure, america s, american people, health care
Democrat: tax credit, work together, young people, health insurance, make sure, america s, american people, health care
Common phrases like 'american people' and 'make sure' are not significant, while distinctive phrases like 'mass destruction', 'middle class' and 'nuclear weapons' are. The bigram analysis shows a clear focus on the Iraq war in the Republican speeches, and on social policies like clean energy and education in the Democratic speeches.