I've always thought that the best way to learn is by doing. During my Intro to NLP course in the Fall of 2018 I came up with the idea to create a Twitter bot that tweeted out something I could control with my new-found Python skills. After we went over ngram analysis and generation in class I knew I wanted to do something along those lines with my bot, and I eventually settled on analyzing titles from Georgetown's student newspaper, The Hoya. Making the bot taught me more than I expected from the get-go: how to use BeautifulSoup to scrape websites for relevant information, how to organize a project in Python and import functions from scripts I wrote myself, and most importantly how to troubleshoot and debug my own code, as well as how to effectively search for answers and learn from others online. Even though not many people may follow this bot on Twitter, I'll be proud knowing it was my first real Python project.
In an ngram analysis you need to decide on a useful context. The picture in this section depicts a trigram model, which looks at windows of three words: a two-word context plus the word that follows it. My bot uses trigrams as well: the script goes through the data I feed it (titles from The Hoya) and counts the occurrences of each trigram, storing them in a nested dictionary. For example, in the first trigram the quick continues to brown. In the dictionary, then, a new key the quick is created, and under it the continuation brown gets a count of 1. Should the script come across another occurrence of the quick with a different continuation like yellow, then a new continuation yellow is added under the quick, also with a count of 1. Once the script has counted all occurrences of all trigrams in the training data it can generate a new title. To do this a random first context is selected from the dictionary. Given that random first context, the continuation with the highest count is selected. The "window" then moves and looks at words two and three, again selecting the continuation with the highest count. This continues until the selected output length is reached, and the title is complete!
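To make that concrete, here is a rough sketch of the counting and generation steps. This is not my exact code; the function names, the plain whitespace split, and the default length here are just for illustration:

import random
from collections import defaultdict

def count_trigrams(titles):
    """Count every continuation seen after each two-word context."""
    counts = defaultdict(lambda: defaultdict(int))
    for title in titles:
        words = title.split()  # the real script tokenizes with NLTK
        for first, second, third in zip(words, words[1:], words[2:]):
            counts[(first, second)][third] += 1
    # convert to plain dicts so the structure is easy to save later
    return {context: dict(conts) for context, conts in counts.items()}

def generate_title(counts, length=8):
    """Start from a random context, then greedily pick the most frequent continuation."""
    context = random.choice(list(counts.keys()))
    output = list(context)
    while len(output) < length:
        continuations = counts.get((output[-2], output[-1]))
        if not continuations:
            break  # dead end: this context never appeared mid-title
        output.append(max(continuations, key=continuations.get))
    return ' '.join(output)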
When I was enrolled in Introduction to NLP in the Fall semester of 2018 I had plenty of opportunities to work on extra homework challenges and personal projects to hone my coding skills. For example, while working as a research assistant on a phonology study I was tasked with creating a large spreadsheet for my team to enter data into. Had I copied and pasted everything by hand it would have taken me many hours to finish (not to mention the strain it would have put on my hands and wrists). Instead my mind went to how I could write a Python script to do the entire thing for me. With the help of one of my friends I was able to write the whole script in just a few hours, which I was proud of given my level of experience with Python at the time. After that I was hungry for new projects that would let me work more with Python, ideally something more related to NLP. When we went over ngram analysis and generation in class my mind immediately went to what I could do with it: a Twitter bot. From that point on it was in the back of my head.
I needed a corpus in order to run my ngram analysis. The easy option would have been to take a book from Project Gutenberg, but who wants to follow a Twitter bot that spits out nonsense from a book like Pride and Prejudice? I had in mind to use a satirical site like The Onion, or even Georgetown's own The Heckler, but I instead chose to use article titles from The Hoya. I chose The Hoya for a number of reasons: the titles are all relatively short and could thus be easily reproduced by the generator without getting stuck in a loop, there are lots of titles going all the way back to 1998 freely available on the website, and the URLs are organized in a way that made looping through all of the years and pages a breeze.
I needed to learn how to scrape websites, though, which seemed like a daunting task at first. I tried a few Python libraries like Newspaper3k for article scraping and analysis but ultimately chose BeautifulSoup because I could specify exactly what I wanted from the HTML. After browsing StackOverflow and trying to adapt other people's solutions for what seemed like an eternity, I took a step back and realized I needed to learn how to do this from the ground up rather than rearrange somebody else's code. I found a really great YouTube video by a fellow named Corey Schafer who explained everything about HTML and BeautifulSoup that I needed to get started. A simple 45-minute video taught me more than roughly 2 hours of trying to use other people's code. The takeaway? It's more important to understand exactly what is happening in your code than it is to get a quick solution that may not work in the long run. After watching Corey's video I was able to write a script that went through all pages for the years 1998-2018 and scraped around 22,000 article titles. It may not have been the fastest or most efficient code ever, but I understood exactly how it worked and implemented it myself.
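The scraping itself ends up being surprisingly little code once you know what to look for. Something in this spirit is all it takes, though the URL pattern and the CSS class here are placeholders rather than The Hoya's actual markup:

import requests
from bs4 import BeautifulSoup

def scrape_titles(year, page):
    # placeholder URL pattern -- the real archive pages are organized by year and page number
    url = 'https://thehoya.com/{}/page/{}/'.format(year, page)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # placeholder selector -- in practice you inspect the HTML and grab the tag/class the titles live in
    return [tag.get_text(strip=True) for tag in soup.find_all('h2', class_='entry-title')]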
This portion of the project was probably the most difficult. I did have some starter code from my Intro NLP course that let me easily create new titles based on the title data I scraped, but the output was not super pretty. After searching around for a detokenizer in NLTK for a while, I decided to just define one myself. I tried to think of all the situations in which I would want to "detokenize" and wrote a few regex substitutions to take care of them:
import re

def detokenize(input):
    """
    detokenize strings of tokens created by nltk.word_tokenize
    :param input: string of tokens separated by whitespace
    :return: string of words as normally seen in English
    """
    # reattach possessives and contractions (’s, s’, ’re, ’ll, ’t, ’ve, ’Cuse, O’/D’ names)
    input = re.sub(r'(\s)(’)(\s)(s|S)', r'\2\4', input)
    input = re.sub(r'(s|S)(\s)(’)', r'\1\3', input)
    input = re.sub(r'(\s)(’)(\s)(re|ll|t|ve)', r'\2\4', input)
    input = re.sub(r'(\s)(’)(\s)(Cuse)', r'\2\4', input)
    input = re.sub(r'(O|D)(\s)(’)(\s)', r'\1\3\4', input)
    # remove the extra space before punctuation
    input = re.sub(r'(\s)(,)(\s)', r'\2\3', input)
    input = re.sub(r'(\s)(\.|\!|\?)', r'\2', input)
    input = re.sub(r'(\s)(:|;)', r'\2', input)
    # close up the spaces inside quotation marks
    input = re.sub(r'(‘|“)(\s)(.*)(\s)(”|’)', r'\1\3\5', input)
    return input
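Run on a tokenized (and made-up) title, it does things like this:

detokenize("Georgetown ’ s Best Coffee , Ranked")
# → 'Georgetown’s Best Coffee, Ranked'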
This part of the project did not take very long, but it went a long way toward improving the quality of the generated titles. To that end I also defined a few tokens that I did not want to start a title, such as the following:
stop_list_begin = ['s', '’', ',', ':', ';', '&']
In the above case I simply had the generator choose another random first context for the title. In the future, I could try to eliminate the possibility of these characters ever being in the first position.
I also defined some words that I thought were fine at the beginning of a title, but wanted capitalized should they show up there:
cap_list_begin = ['of', 'in', 'on', 'to', 'the', 'for', 'not', 'and']
Finally, I defined a list for words that I did not want to end a title. In this case I had the generator choose one more word to finish it off:
stop_list_end = ['The', 'the', 'in', ':', 'of', 'a', 'to', 'on', '‘', '“', '&', 'Every', 'With', 'and', 'for', 'at']
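Putting the lists together, the generation loop ends up looking roughly like this. Again, this is just a sketch built on the counts dictionary and names from the earlier sketch, not the bot's exact code:

import random

def generate_clean_title(counts, length=8):
    # redraw the starting context until it does not begin with a banned token
    context = random.choice(list(counts.keys()))
    while context[0] in stop_list_begin:
        context = random.choice(list(counts.keys()))
    output = list(context)
    # capitalize lowercase function words that land in the first position
    if output[0] in cap_list_begin:
        output[0] = output[0].capitalize()
    # keep adding the most frequent continuation; if the last word is on the
    # end stop list, grab one more word to finish the title off
    while len(output) < length or output[-1] in stop_list_end:
        continuations = counts.get((output[-2], output[-1]))
        if not continuations:
            break
        output.append(max(continuations, key=continuations.get))
    return detokenize(' '.join(output))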
Creating a project that I could call with a single command in my terminal was a bit more work than I expected. In fact, there is an xkcd about this (like there is for just about everything). I found myself with more tabs open in my browser than I knew what to do with: StackOverflow, YouTube, and various forums all open at once. I tried to take it step-by-step: once I was able to generate new titles I turned my attention to being able to tweet from my terminal.
I found a few articles online about creating Twitter bots, some more helpful than others. There are a few discussion boards about Twitter bots, and even a big Slack group dedicated to discussing them. A lot of the searching just involved narrowing down my search terms, since I did not quite know what I wanted to create other than "a Twitter bot." I soon realized that I was trying to create a bot that tweeted non sequiturs: a bot that did not reply to or retweet others' tweets but instead just tweeted to whoever wanted to listen. Tweepy is a great library for favoriting, retweeting, and replying to tweets and DMs, but I wanted to just send simple tweets, and most tutorials were aimed at the former kind of bot. I finally found a great article by Molly White that went through how to implement one of these non sequitur bots, and even gave some starter code to get one running.
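If you go the Tweepy route, the actual tweeting is only a handful of lines. This is the standard Tweepy pattern for a plain status update, with placeholder keys, rather than the exact code from Molly's article or from my bot:

import tweepy

# fill these in with the keys and tokens from your Twitter developer app
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

# in the bot this would be a freshly generated title instead of a fixed string
api.update_status('Hello from my terminal!')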
Once I had that code I needed a way to import the functions I had defined in my other script, which I had never done before. This meant a whole slew of YouTube videos about
if __name__ == '__main__':
    generate_title()
as well as specifying the encoding at the top of my script, plus a few other issues that I ran into along the way. Eventually, though, I was able to send my first tweet from my terminal! This was a huge, really satisfying milestone, and it came at about 3:30 am one morning because I didn't want to give up any sooner.
Once I was able to use cron to schedule my terminal to tweet out titles, I had to figure out a way to host my bot somewhere so that I did not have to leave my laptop open and running all day long. I spent almost an entire day trying to figure out how to get my bot to run on Heroku; messing around with Git repos and Heroku's CLI was mind-numbing and felt way above my understanding. All of the YouTube videos and articles I could find assumed a basic knowledge that I seemed not to have. I felt pretty hopeless at this point, like I was never going to get my bot up and running unless I had my laptop open. That was until (again, late at night) I came up with a great idea: Raspberry Pi!
Not the food, but the small computer. I had an old Raspberry Pi 2 lying around from an emulator project I worked on a few years back, so I took its micro-SD card, reformatted it, and installed Raspbian, an OS for the Raspberry Pi based on Debian. All I had to do was take the files from my laptop, put them on the Raspberry Pi, set my crontab to run the bot.py file every 3 hours, and forget about it! I could unplug my mouse, keyboard, and display and just have it sit quietly on my desk, tweeting at regular intervals: no messing around with Procfiles or Dynos or anything like that. AND this solution was free! Doesn't get much better than that.
This project started as a small idea and blossomed into me learning more than I expected from the get-go. It was very satisfying to start with my concept and, over the course of a few days, come out with a working Twitter bot. In the future there are probably a few improvements I can make to ensure that the titles come out a bit more readable. Additionally, I would like to get more familiar with Git and GitHub so that I can push new code to my Raspberry Pi and make version control a bit smoother. I also think that incorporating some form of machine learning could help the titles a lot as well. Those are all ideas for a future project, however. For now I am pleased with what I have created.
I picked up a Raspberry Pi Zero W, a model that comes with built-in WiFi. This lets me plug the Pi in on my desk without needing to run a LAN cable to it, which was quite annoying. Additionally, the Pi Zero W is much smaller, and it is totally capable of running the code (albeit more slowly). I think I will count all the trigram continuations once and store them in a file that the script can read, so that the Pi does not need to go through the titles and count them every time the script runs.
I am getting some Unicode errors, as shown in the Pi's terminal in the picture. I'll need to fix those in the future, as I believe they are messing up the detokenization and stop lists.
After having some issues with the bot tweeting out strings that were not detokenized and that started with lower-case words from my stop list (all issues I thought I had fixed), I figured I needed to change how the script was dealing with Unicode characters. After some fiddling around and reading online, though, I think all I needed to do was change the default Python version on the Raspberry Pi to 3.5. Time will tell whether that got the job done, but I think it did.
I wanted to improve the script's performance, as it was taking a long time for the smaller 1 GHz single-core processor to count the trigram continuations on every run. On top of that, it did not make sense to recount each time, since the counts are never going to change as long as the training data stays the same. I looked around for how I could save that nested dictionary to a file that my script could then read, and familiarized myself with Python's "pickle" module. This let me serialize the dictionary into a file called counts.pkl, which I then read in the generate.py file. It was easy to implement and shaves a good 20-25 seconds off the time it takes the bot to tweet.
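The pickling itself is only a couple of lines in each script. Something along these lines is all it takes (counts.pkl and generate.py are the real file names from the project; the rest is just the standard pickle pattern applied to the counts dictionary from the earlier sketch):

import pickle

# in the counting script: serialize the nested counts dictionary once
with open('counts.pkl', 'wb') as f:
    pickle.dump(counts, f)

# in generate.py: load the counts back instead of recounting every run
with open('counts.pkl', 'rb') as f:
    counts = pickle.load(f)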