The Congressional Speech dataset can be found here. It is a collection of speeches made in Congress in support of or in opposition to certain bills. Each speech is tagged with the speaker who delivered it, the speaker's political affiliation, and whether the bill passed. The data is pre-processed by removing all punctuation and converting all words to lowercase.
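The exact cleaning script isn't shown, but a minimal sketch of equivalent pre-processing, assuming text holds one raw speech, might look like this:

import re

def preprocess(text):
    # drop punctuation (keep word characters and whitespace), lowercase everything
    return re.sub(r'[^\w\s]', '', text).lower()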
As we saw in this post, Republican and Democratic presidents showed some clear differences in the words they used. Perhaps that difference in word choice can be leveraged to identify whether a speech was made by a Republican or a Democrat.
For each speech, tf-idf scores are calculated for all words in the training vocabulary.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# raw term counts first, then re-weighted into tf-idf scores
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(data)
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts)
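One detail implied by "training vocabulary" above: when the speeches are split into training and test sets, only the training half should be used for fitting. A minimal sketch, assuming hypothetical lists train_speeches and test_speeches:

# fit the vocabulary and idf weights on the training speeches only
train_tfidf = tfidf_transformer.fit_transform(count_vectorizer.fit_transform(train_speeches))
# reuse them unchanged on the test speeches
test_tfidf = tfidf_transformer.transform(count_vectorizer.transform(test_speeches))

(The counting and weighting steps can also be combined into a single sklearn TfidfVectorizer.)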
Then multiple training algorithms are run, most of which reach test accuracy in the 60-70% range. The best performance (68.4%) is achieved by a Random Forest classifier with 200 estimators, which also takes the longest to train.
Testing accuracy, training time, and testing time of the different algorithms
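As a rough sketch, the best-performing configuration could be reproduced along these lines, assuming the tf-idf matrix and party labels have already been split into hypothetical X_train, X_test, y_train, and y_test:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 200 trees, matching the best-performing configuration reported above
clf = RandomForestClassifier(n_estimators=200)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))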
It seems plausible that bills that passed were supported by speeches with more positive sentiment than bills that failed. Sentiment scores are calculated from AFINN, which replaces each word with a score between -5 and +5. The mean sentiment score of each speech is then used in a two-sample t-test comparing the two groups.
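A minimal sketch of the scoring and test, assuming the afinn Python package and scipy, with speeches_passed and speeches_failed as hypothetical lists of cleaned speech strings:

from afinn import Afinn
from scipy.stats import ttest_ind

afinn = Afinn()

def mean_sentiment(speech):
    # total AFINN score of the speech (unmatched words contribute 0),
    # averaged over the number of words
    words = speech.split()
    return afinn.score(speech) / len(words) if words else 0.0

passed = [mean_sentiment(s) for s in speeches_passed]
failed = [mean_sentiment(s) for s in speeches_failed]
t_stat, p_value = ttest_ind(passed, failed)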
The t-statistic was found to be 6.85, with a p-value of 9.02e-12. The histogram of sentiment scores below shows that bills that passed had significantly more positive sentiment (mean score = 0.027) than bills that failed (mean score = 0.012).
Histogram of sentiment scores. Bills that passed have speeches with significantly higher sentiment scores.