Goals:

(1) Let's do something fun!

I would like to analyze something we are actually interested in.

(2) Let's do a real-world implementation.

I would like to apply regular expressions to questions we would actually become curious about as we explore our dataset.

(3) Let's explore text data.

Though STATA might not be the best tool for text analysis, it can handle the basics.

(4) Let's try challenging exercises.

We will practice deciding which function (regexm, regexr, or regexs) to use, without any direct hints.
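
As a quick reminder before the exercises, here is a minimal sketch of what each of the three functions does. It runs on a made-up string rather than the tweet data, and the variable names (s, has_number, number, masked) are only for illustration.

```stata
* Toy example (not the tweet data): one observation with a made-up string.
clear
input str40 s
"Order 66 shipped on Monday"
end

* regexm() returns 1 if the pattern matches and 0 otherwise.
generate has_number = regexm(s, "[0-9]+")

* regexs(n) returns the n-th parenthesized subexpression
* of the most recent regexm() match.
generate number = regexs(1) if regexm(s, "([0-9]+)")

* regexr() replaces the first match of the pattern with new text.
generate masked = regexr(s, "[0-9]+", "##")

list
```

In short: regexm answers "does it match?", regexs answers "what matched?", and regexr changes what matched.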

1. Download the dataset

Then what?

We are going to analyze tweets posted by Trump.

The above hyperlink will connect you to the website named TrumpTwitterArchive.

As you can see in the picture on the left, the website provides Trump's tweets categorized by some interesting keywords to explore.

Today, we will download the 234 tweets that contain the word "loser."

Please click "234 tweets" to download the dataset.


Now you may select some options before you export the data.

I checked "show retweets," "show manual retweets," "show retweet count," and "show favorite count" so that I would have extra information.



I selected the CSV format to export the dataset.

Click "Export."




It will show you a list of tweets in CSV format, delimited by commas.

Now select them all (ctrl+a) and copy (ctrl+c) them.


Now you can paste (ctrl+v) the tweet data into your favorite text editor and save it in CSV format.

Here I used Notepad++.

2. Import the dataset into STATA

Importing CSV data is easy.

Click [File] - [Import] - [Text data (delimited, *.csv, ...)]


Then STATA will let you select some options when you import the dataset.

I set [Comma] for [Delimiter] because it is a CSV file.

When you import your dataset manually, STATA shows you the command below, so that you can reuse it when you write your do-file.

  • import delimited file_name.csv, delimiter(comma)
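
In a do-file, the import step might look like the sketch below. file_name.csv is the placeholder from the command above. The extra options are my assumptions about what is convenient for tweet data, not something the import dialog adds for you.

```stata
* Sketch of the import step for a do-file.
* clear drops any data in memory; varnames(1) reads variable names from row 1.
* bindquotes(strict) is an assumption: it keeps quoted tweet text together
* even when a tweet contains line breaks.
import delimited "file_name.csv", delimiter(comma) varnames(1) bindquotes(strict) clear

* Check that the expected variables (e.g., created_at) were read in.
describe
list in 1/5
```

If the variable names or the number of observations look wrong, compare against the raw CSV file in your text editor before moving on.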

3. Exercises

(1) The "created_at" variable contains the date and time when a tweet was posted.

(1-1) Using the "created_at" variable, make separate variables for date and time. For this exercise, please use the fact that a date consists of two two-digit numbers and one four-digit number (00-00-0000) and a time consists of three two-digit numbers (00:00:00).
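
To show what "spelling out the number of digits" can look like, here is a small sketch on a made-up string with a different format (a phone-style number), so it does not give away the answer. The variable names are hypothetical. The second command assumes Stata 14 or later, where the Unicode regex functions accept the {n} repetition syntax.

```stata
* Toy example with a made-up string, not the tweet data.
clear
input str40 s
"Call 555-0147 before noon"
end

* Three digits, a hyphen, then four digits, written by repeating [0-9].
generate phone = regexs(1) if regexm(s, "([0-9][0-9][0-9]-[0-9][0-9][0-9][0-9])")

* Same idea with the Unicode regex functions, which accept {n}.
generate phone2 = ustrregexs(1) if ustrregexm(s, "([0-9]{3}-[0-9]{4})")

list
```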


(1-2) Now let's find a simpler solution. This time, please do not specify the number of digits when you write the commands that create the date and time variables (hint: you may use the same regular expression to extract both the date and the time). The commands must be shorter and more generic than what you wrote for (1-1).

(2) To understand the context of tweets using the word "loser(s)," it might be helpful to find out whether they contain "@," as it indicates a public reply on Twitter. For this exercise, please generate a "tag" variable that flags whether a tweet contains "@" or not, so that we can see how many tweets mentioning "loser(s)" are replies to or comments on others.
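
The general pattern for a tag variable can be sketched on toy data as follows. This example flags a different character ("#") on three made-up strings, and the names s and has_hash are hypothetical.

```stata
* Toy example with made-up strings, not the tweet data.
clear
input str40 s
"Great day #winning"
"No hashtag here"
"#one and #two"
end

* regexm() returns 1 when the pattern is found and 0 otherwise,
* so its result can be stored directly as a 0/1 tag.
generate has_hash = regexm(s, "#")

* Count how many strings are tagged.
tabulate has_hash
```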

(3) Then exactly who or what is mentioned when public replies are posted? Now we would like to extract who or what is addressed in these tweets.

(3-1) Let's extract the single word typed right after "@" (e.g., @CNN). This can be tricky. Keep your eyes on the patterns, and try to be as comprehensive as possible.


(3-2) Now that you have extracted the names written after "@," you may have found that some tweets mention Trump himself (@realDonaldTrump). Let's make another tag variable to indicate whether a tweet falls into this case.


(3-3) As we are more interested in the other people, institutions, etc. mentioned in the tweets, let's replace "realDonaldTrump" with a missing value. Generate another variable based on the variable created for (3-1) and then recode it.

(4) "!" is mostly used to emphasize a word or a sentence, so I wonder which words tend to be yelled in these tweets. Let's extract the single word typed right before "!". Please keep in mind that multiple "!" can be used. You'll get bonus points if you also extract multiple words.

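As a warm-up for this kind of "word before a symbol" pattern, here is a sketch that uses question marks instead of exclamation marks on made-up strings. The names s and asked are hypothetical. Note that the pattern assumes a word is made of letters only, which is a simplification.

```stata
* Toy example with made-up strings, not the tweet data.
clear
input str40 s
"Really? Are you sure??"
"No question here."
end

* Capture one or more letters that come right before
* one or more question marks.
generate asked = regexs(1) if regexm(s, "([a-zA-Z]+)[?]+")

list
```

Only the first match in each string is captured, so a string with several questions still yields a single word.
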
(5) Is there anything else in the dataset that interests you? Please write a command using regular expressions to better understand the contents of the tweets, and explain why you chose the command and why it would work for your purpose.

Enjoy!