The dataset can be accessed here. For preprocessing, the text of each speech is stripped of newlines, joined, and then split again on sentence enders (!?.). It is converted to lowercase and all punctuation is replaced by spaces. Finally, the sentences are joined into a single text with '.' as the separator. Stopwords are removed and the remaining words are lemmatized.
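A minimal sketch of this pipeline using NLTK (the function name preprocess and the exact regexes are assumptions, not the original code):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(speech):  # speech: raw text of one address
    text = ' '.join(speech.splitlines())              # strip newlines and rejoin
    sentences = re.split(r'[!?.]', text)              # split on sentence enders
    cleaned = []
    for sent in sentences:
        sent = re.sub(r'[^\w\s]', ' ', sent.lower())  # lowercase; punctuation -> space
        words = [lemmatizer.lemmatize(w) for w in sent.split() if w not in stop_words]
        if words:
            cleaned.append(' '.join(words))
    return '.'.join(cleaned)                          # rejoin with '.' as separator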
AFINN is a list of words with sentiment scores. A quick and dirty way to analyse sentiment is to replace each word with its score from a list such as AFINN and compute the average sentiment score of each speech. Significant dips in the sentiment scores line up with important events on the US history timeline.
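A minimal sketch of the scoring, assuming the AFINN word list has already been loaded into a dict afinn mapping words to integer scores (the loading step and names are assumptions, not the original code):

def avg_sentiment(words, afinn):  # words: tokenized speech; afinn: word -> score in [-5, 5]
    scores = [afinn[w] for w in words if w in afinn]
    return sum(scores) / len(scores) if scores else 0.0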
The first half of the 20th century has very low scores compared to the second half, presumably due to the two World Wars and the Great Depression. After the end of World War II, sentiment scores climbed steeply over the next few years. At the start of the new millennium, there are dips in the score due to 9/11, the Iraq War and the recession.
Average sentiment score of SOTU speeches from 1901 to 2016. Red bars indicate Republican presidents and blue bars indicate Democrats.
We now focus on the speeches of 4 presidents: George H. W. Bush (4), William Clinton (8), George W. Bush (8) and Barack Obama (8). The numbers in parentheses denote the number of years each was president, and hence the number of SOTU speeches delivered. We are therefore looking at 28 speeches in total.
After the usual preprocessing of stopword removal and lemmatization, we find the unigram counts of each of the 28 speeches. Combining all speeches gives the total vocabulary. The counts of each speech are then Laplace smoothed to obtain a unigram probability distribution. Finally, we compute the average KL divergence (AKLD) of each pair of speeches. Since KL divergence, as defined by the equation below, is not symmetric, we use the symmetrised measure AKLD(p,q) = 0.5*(KLD(p,q) + KLD(q,p)). The results are described below.
KLD(P || Q) = Σ_x P(x) log(P(x) / Q(x))

The equation for KL divergence for two discrete distributions P and Q.
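A sketch of the smoothing and AKLD computation (counts is an assumed name for a speech's unigram count vector over the shared vocabulary):

import numpy as np

def laplace_smooth(counts):  # counts: 1-D array of unigram counts over the vocabulary
    return (counts + 1.0) / (counts.sum() + len(counts))

def kld(p, q):  # KL divergence between two discrete distributions
    return np.sum(p * np.log(p / q))

def akld(p, q):  # symmetrised average KL divergence
    return 0.5 * (kld(p, q) + kld(q, p))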
The 28x28 heatmap generated by plotting the pairwise average KL divergences of each pair of speeches.
We observe clear blocks forming along the diagonal, which indicate low KL divergence among speeches made by the same person. This means that speeches by the same president have similar content, which is an expected result. More interestingly, speeches of presidents from the same party also have lower KL divergences than speeches of presidents from different parties. For example, George W. Bush's speeches have a lower KL divergence (darker colour block) with his father's than with Clinton's or Obama's. Thus unigrams have good discriminative potential for identifying speakers.
Topic modelling using Latent Dirichlet Allocation (LDA) is an unsupervised method that takes a corpus as input and outputs a fixed number of 'topics'. Each topic is a collection of related words that tend to occur together in documents, and each document is a mixture of topics.
For this analysis, we use sklearn's LDA. First we run a CountVectorizer on the corpus; the resulting term-frequency matrix is then passed to the LatentDirichletAllocation class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=2000, stop_words='english')
tf = tf_vectorizer.fit_transform(documents_sents)
tf_feature_names = tf_vectorizer.get_feature_names_out()
num_topics = 10
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, learning_method='online', learning_offset=50., random_state=0)
lda.fit(tf)
for topic_idx, topic in enumerate(lda.components_):  # print the top 15 words of each topic
    print("Topic %d" % topic_idx)
    print(" ".join([tf_feature_names[i] for i in topic.argsort()[:-15 - 1:-1]]))
We use the speeches of the same 4 presidents in this section: George H. W. Bush, William Clinton, George W. Bush and Barack Obama. We get the following 10 topics, each of which we have named by looking at its top 15 words:
For further analysis, we create a feature vector out of the topics generated in the last section. The number of topics is 10. First we find the top 15 words of each topic:
topwords = []
for topic_idx, topic in enumerate(lda.components_):  # topic is a np array of 2000 numbers (vocab size)
    tmp = topic.argsort()[:-15 - 1:-1]  # indices of the top 15 words
    topwords.append({tf_feature_names[i]: topic[i] for i in tmp})
# topwords is of length 10; topwords[i] maps the top words of topic i to their scores
Then we generate a 10-D feature for each sentence of a speech. For each word in the sentence, we increment the ith component of the 10-D vector if the word belongs to the top words of the ith topic. The code to build the sentence topic feature is shown below:
sent_topic_ft = [0] * 10  # 10 is the number of topics
for word in words:  # words is a list of words in a sentence of a speech
    for tp_idx, tp in enumerate(topwords):  # topwords is generated by the earlier code segment
        if word in tp:
            sent_topic_ft[tp_idx] += 1
Finally, a 10-D document topic feature is obtained by averaging the sentence topic features across all sentences of the speech. Thus each speech is reduced to a single 10-D feature.
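The averaging step is then a one-liner (sent_features is an assumed name for the list of per-sentence vectors of one speech):

import numpy as np

doc_topic_ft = np.mean(np.array(sent_features), axis=0)  # 10-D document topic feature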
To test the goodness of this feature, we first ask which presidents use each topic the most. For that, we find the top 3 speeches for each topic (that is, for topic i, the speeches with the maximum value in the ith dimension of their 10-D topic feature) and then list their speakers:
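This selection can be sketched as follows (doc_features, a 28x10 matrix stacking the document features, and speakers, the list of speech labels, are assumed names):

import numpy as np

for i in range(10):  # 10 topics
    top3 = np.argsort(doc_features[:, i])[::-1][:3]  # speeches weighting topic i most heavily
    print("Topic %d: %s" % (i, ", ".join(speakers[j] for j in top3)))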
The results make sense. For example, Bush Jr. and Obama talk about the Iraq war, which started during the former's presidency and ended during the latter's. Democratic presidents Obama and Clinton talk a lot about jobs, health care and social programs, while Republican presidents Bush Sr. and Bush Jr. talk about world affairs and the Middle East.
Next we look at the top 3 topics in each speech:
The speeches contain the expected topics. For example Obama's speeches are dominated by the 'jobs' topic, perhaps due to the Great Recession. Bush Jr.'s speeches focus a lot on world affairs.
Finally, we use the 10-D topic feature for each speech and construct a heatmap of distances between speeches. To get more discriminative features, we use metric learning. Specifically, we learn a linear transformation that separates the features so that speeches by the same speaker lie close together in the embedding, while speeches by different speakers lie far apart. This method is called Large Margin Nearest Neighbour (LMNN). We use the PyLMNN package, with the number of neighbours set to 2.
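A sketch of the metric-learning step with PyLMNN (the class name follows the package's README, though the exact constructor signature may differ across versions; doc_features and labels are assumed names):

from pylmnn import LargeMarginNearestNeighbor as LMNN
from scipy.spatial.distance import pdist, squareform

lmnn = LMNN(n_neighbors=2)          # number of target neighbours set to 2
lmnn.fit(doc_features, labels)      # labels: speaker id of each of the 28 speeches
embedded = lmnn.transform(doc_features)
heatmap = squareform(pdist(embedded))  # 28x28 matrix of pairwise Euclidean distances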
28x28 heatmap using the raw 10-D topic features
28x28 heatmap using the 10-D topic features after LMNN metric learning
Observe that though blocks of similarity are present along the diagonal, they are much less pronounced for the raw features (left). On applying LMNN, the blocks become more prominent (right). Also note that the raw feature is entirely unsupervised. Thus even with a 10-D feature we see quite impressive discriminative power, which is further enhanced by metric learning (which is supervised). Compare this with the unigram heatmap, which forms nicely interpretable blocks but uses thousands of features (depending on the vocabulary size, which is 19,123 in this case). The proposed feature is therefore a good method of dimensionality reduction.
Notice that inside Clinton's and Obama's 8x8 blocks, there are 2 sub-blocks of size 4x4. These could correspond to their 2 terms, which might focus on different issues. It is intriguing that the feature vector is able to pick out presidential terms.
As a baseline for this dimensionality reduction method, we also look at the heatmap of Euclidean distances between speech features obtained by applying PCA to the unigram probability distributions, again reducing to 10 dimensions, and we apply LMNN on top of the PCA features. The baseline can be sketched as follows (unigram_probs, holding the 28 smoothed unigram distributions as rows, is an assumed name):
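from sklearn.decomposition import PCA

pca_features = PCA(n_components=10).fit_transform(unigram_probs)  # 28 x 10 baseline features

We get the following two heatmaps, with clear-cut blocks of similarity using PCA and even sharper ones using PCA+LMNN.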
28x28 heatmap using raw 10-D PCA-reduced features from the unigram distribution
28x28 heatmap using 10-D PCA-reduced features from the unigram distribution, after LMNN metric learning
The blog here analyses the speeches based on unigrams. Let us extend that analysis to bigrams. The NLTK library is used, and 4 measures of bigram collocation are considered: pointwise mutual information (PMI), frequency (FREQ), chi-squared (CHI) and likelihood (LIKE). Only post-Reagan presidents are considered.
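A sketch of the collocation scoring with NLTK (tokens, a flat list of preprocessed words from one group of speeches, is an assumed input, as is the frequency filter of 3):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # drop very rare bigrams
top_pmi = finder.nbest(BigramAssocMeasures.pmi, 25)
top_freq = finder.nbest(BigramAssocMeasures.raw_freq, 25)
top_chi = finder.nbest(BigramAssocMeasures.chi_sq, 25)
top_like = finder.nbest(BigramAssocMeasures.likelihood_ratio, 25)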
We observe that the FREQ measure is not very good: it picks up bigrams like 'right now' and 'make sure' based on their raw frequencies, which are not very interesting. More sophisticated measures like PMI and CHI perform better.
Below we list the top 25 bigrams based on FREQ for Democratic and Republican presidents in the post-Reagan era:
Democratic
'tax credit', 'clean energy', 'high school', 'work together', 'working families', 'young people', 'health insurance', 'vice president', 'every american', 'fellow americans', 'first time', 'right now', 'world s', 'new jobs', 'two years', 'middle class', 'years ago', 'social security', 'make sure', 'united states', '21st century', 'america s', 'last year', 'american people', 'health care'
Republican
'small businesses', 'federal government', 'nuclear weapons', 'years ago', 'mass destruction', 'nation s', 'every american', 'iraq s', 'law enforcement', 'work together', 'united nations', 'last year', 'al qaida', 'health insurance', 'make sure', 'al qaeda', 'tax relief', 'fellow citizens', 'saddam hussein', 'middle east', 'america s', 'american people', 'health care', 'social security', 'united states'
Some interesting phrases in the Democratic speeches are: tax credit, clean energy, high school, working families, health insurance, new jobs. These bigrams are expected, since clean energy, families, education and health care are part of the Democrats' core focus.
Some interesting phrases in the Republican speeches are: nuclear weapons, mass destruction, health insurance, tax relief, middle east. These bigrams are expected due to the Republicans' emphasis on the war in the Middle East and on lower taxation.
Statistical tests can be performed to identify bigrams that occur significantly more often in one party's speeches than in the other's. The chi-square test and the G-test give similar results.
Null hypothesis: the bigram occurs at the same rate in the speeches of both parties.
Test statistic: chi-square statistic on the 2x2 contingency table.
Null distribution: chi-square with 1 degree of freedom.
Significance level: 0.05
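A sketch of the per-bigram test with SciPy (the counts below are made up for illustration; passing lambda_="log-likelihood" yields the G-test variant):

from scipy.stats import chi2_contingency

# 2x2 contingency table for one bigram: rows = party,
# columns = (count of this bigram, count of all other bigrams)
table = [[45, 60000], [12, 70000]]
chi2, p, dof, _ = chi2_contingency(table)  # chi-square test, dof = 1
g, p_g, _, _ = chi2_contingency(table, lambda_="log-likelihood")  # G-test
significant = p < 0.05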
In the table below, we take the top bigrams (by FREQ) from each party and separate them into 'significant' and 'insignificant':
Significantly different frequencies
Republican: nuclear weapons, years ago, mass destruction, iraq s, law enforcement, united nations, last year, al qaeda, tax relief, fellow citizens, saddam hussein, middle east, social security, united states
Democrat: clean energy, high school, fellow americans, first time, right now, world s, new jobs, two years, middle class, years ago, social security, united states, 21st century, last year

Insignificant
Republican: small businesses, federal government, nation s, every american, work together, health insurance, make sure, america s, american people, health care
Democrat: tax credit, work together, young people, health insurance, make sure, america s, american people, health care
Common phrases like 'american people' and 'make sure' are not significant, while distinctive phrases like 'mass destruction', 'middle class' and 'nuclear weapons' are. The bigram analysis shows a clear focus on the Iraq war in the Republican speeches, and on social policies like clean energy and education in the Democratic speeches.