SOC 561 - Solutions

Solutions

Congratulations on finishing the exercises! Now we will go over the solutions.

(1) The "created_at" variable contains the date and time when a tweet was posted.

Let's take a quick look at the dataset first.

(1-1) Using the "created_at" variable, make separate variables for date and time. For this exercise, please use the fact that the date consists of two two-digit numbers and one four-digit number (00-00-0000) and the time consists of three two-digit numbers (00:00:00).

Let's quickly review what regexm and regexs do.

To extract a specific part of a string, you can use the if regexm(variable_name, "what_you_want_to_extract") syntax.

The m in regexm stands for match: regexm looks in variable_name for the pattern you specified ("what_you_want_to_extract") and returns 1 if it finds a match and 0 otherwise.

For instance, if you specify if regexm(variable_name, "[0-9][0-9][-][0-9][0-9]"), the pattern will match something like 15-12 or 11-61. Each [0-9] stands for a single digit, so "[0-9][0-9][-][0-9][0-9]" means: find [digit][digit]-[digit][digit], that is, a two-digit number, a hyphen, and another two-digit number, anywhere in the string.

If you type gen new_variable = regexs(n) if regexm(variable_name, "what_you_want_to_extract"), where n is 0, 1, 2, or 3, regexs will extract either the entire match (n = 0) or the first/second/third substring of the match (n = 1, 2, 3). You define substrings with parentheses (e.g., "(substring1)(substring2)(substring3)").

For instance, suppose you typed:

gen newvar = regexs(0) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")

It will extract the entire match of ([0-9][0-9])([-])([0-9][0-9]), e.g., 12-16.

gen newvar = regexs(1) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")

It will extract the first substring ([0-9][0-9]), e.g., 12.

gen newvar = regexs(2) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")

It will extract the second substring ([-]), e.g., -.

gen newvar = regexs(3) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")

It will extract the third substring ([0-9][0-9]), e.g., 16.
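If you would like to try these four commands yourself, here is a minimal sketch on a one-row toy dataset. The value "12-16" and the variable names part0 to part3 are made up for illustration and are not part of the tweet data; each result gets its own name because Stata will not generate a variable that already exists.

```stata
* Toy data: a single made-up string (not from the tweet dataset)
clear
input str10 variable_name
"12-16"
end

* Same pattern every time; only the number inside regexs() changes
gen part0 = regexs(0) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")
gen part1 = regexs(1) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")
gen part2 = regexs(2) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")
gen part3 = regexs(3) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")

list variable_name part0 part1 part2 part3

* assert stops with an error if a result is not what we expect
assert part0 == "12-16"
assert part1 == "12"
assert part2 == "-"
assert part3 == "16"
```

If all four assert lines run silently, each regexs() number picked up the piece described above.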

Now let's write commands to extract date and time drawing on those examples.

As the "created_at" variable has values like "12-11-2018 11:43:13", I can use the digit patterns to extract the date and time.

I would write:

gen date = regexs(0) if(regexm(created_at, "[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]"))

regexs(0) means Stata should extract the entire string it matches.

[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9] specifies that Stata should look for [two digits]-[two digits]-[four digits].

gen time = regexs(0) if(regexm(created_at, "[0-9][0-9][:][0-9][0-9][:][0-9][0-9]"))

Again, regexs(0) means Stata should extract the entire string it matches.

[0-9][0-9][:][0-9][0-9][:][0-9][0-9] specifies that [two digits]:[two digits]:[two digits] should be found; the match is extracted and used to generate the time variable.

And yes, they worked!

We can check the transformation using sort and list.

sort created_at date time
list created_at date time
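To see the two commands at work without the real dataset, here is a small sketch on toy data. The first value is the example quoted above; the second value is made up for illustration.

```stata
* Toy data: first row is the example from the text, second row is hypothetical
clear
input str25 created_at
"12-11-2018 11:43:13"
"01-02-2019 08:05:59"
end

* Date: two digits, hyphen, two digits, hyphen, four digits
gen date = regexs(0) if regexm(created_at, "[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]")
* Time: two digits, colon, two digits, colon, two digits
gen time = regexs(0) if regexm(created_at, "[0-9][0-9][:][0-9][0-9][:][0-9][0-9]")

list created_at date time

* Check both rows
assert date == "12-11-2018" in 1
assert time == "11:43:13"   in 1
assert date == "01-02-2019" in 2
assert time == "08:05:59"   in 2
```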

(1-2) Now let's find a simpler solution. Please do not specify the number of digits when you write the commands to make the date and time variables. (Hint: you may use the same regular expression to extract both the date and the time.) The commands must be shorter and more generic than what you wrote for (1-1).

As it is very verbose to spell out every digit (too much copy and paste!), we will now try a more generic solution.

What I did:

gen date2 = regexs(1) if(regexm(created_at, "(.*)[ ](.*)"))

This time I specified substrings to get the most out of a single pattern. Let's take a look at what I wrote inside the quotation marks.

Going over the created_at variable, I could see that the date and the time are separated by a single space.

So I specified the single space with [ ]. Whatever comes before [ ] is the date, and whatever comes after [ ] is the time.

.* means any character (.) repeated zero or more times (*). So the first substring (.*) captures the date and the second substring (.*) captures the time.

As regexs(1) means Stata should extract the first substring, we get the date for the date2 variable!

gen time2 = regexs(2) if(regexm(created_at, "(.*)[ ](.*)"))

The same logic applies here! I used the same pattern and typed regexs(2) to extract the time for the time2 variable.

This generic solution works neatly.

We can again check the transformation using sort and list.

sort created_at date2 time2
list created_at date2 time2
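One thing worth knowing about the generic pattern is that .* is greedy: it grabs as much as it can. The sketch below shows what that means on toy data. The first row is the example from above; the second row, with an extra " PM" at the end, is a made-up case that does not occur in the exercise data as far as I know.

```stata
* Toy data: row 1 has one space, row 2 (hypothetical) has two spaces
clear
input str30 created_at
"12-11-2018 11:43:13"
"12-11-2018 11:43:13 PM"
end

gen date2 = regexs(1) if regexm(created_at, "(.*)[ ](.*)")
gen time2 = regexs(2) if regexm(created_at, "(.*)[ ](.*)")

list created_at date2 time2

* One space: the string splits into date and time as intended
assert date2 == "12-11-2018" & time2 == "11:43:13" in 1
* Two spaces: the first (.*) runs up to the LAST space
assert date2 == "12-11-2018 11:43:13" & time2 == "PM" in 2
```

So the generic solution is safe here because every value has exactly one space; with messier data, the more explicit pattern from (1-1) would be the more careful choice.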

(2) To understand the context of tweets using the word "loser(s)," it might be helpful to find out whether tweets contain "@," as it indicates a public reply on Twitter. For this exercise, please generate a "tag" variable that flags whether a tweet contains "@" so that we can see how many tweets mentioning "loser(s)" are replies/comments to others.

This question is the easiest! We can literally write "@" to match "@" and generate a tag variable.

gen tag = regexm(text, "@")

regexm looks for "@" in the "text" variable and generates the "tag" variable, coded 1 when "@" is found in "text" and 0 when it is not.

tabulate tag

It turns out that 65.57% of "loser(s)" tweets are replies/comments to someone.

To check the transformation, I used sort and by: list. Although Stata does not display the full values of the text variable (the contents of the tweets), we can still get a sense that the commands have worked: @NYDailyNews, @politico, and @GOP are mentioned in those tweets.

sort tag text
by tag: list tag text
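Here is a minimal sketch of the tagging step on three made-up tweets (the texts are invented for illustration). As a cross-check, it compares the result with strpos(), a non-regex string function that returns the position of "@" or 0 when "@" is absent.

```stata
* Toy data: three hypothetical tweets
clear
input str60 text
"@CNN what a loser"
"Nobody likes a sore loser"
"Losers! cc @politico"
end

* 1 if the tweet contains "@", 0 otherwise
gen tag = regexm(text, "@")

* Share of tweets with and without "@"
tabulate tag

assert tag == 1 in 1
assert tag == 0 in 2
assert tag == 1 in 3
* Same answer without regular expressions
assert tag == (strpos(text, "@") > 0)
```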

(3) Then exactly who or what is mentioned when public replies are posted? Now we would like to extract who or what is mentioned in these tweets.

(3-1) Let's extract any single word typed right after "@" (e.g., @CNN). This can be tricky. Keep your eyes on the patterns, and try to be as comprehensive as possible.

We know that we can specify "@" as it is, but how can we specify the word that comes right after "@" when we don't know what it will look like?

At least we know that it will consist of letters and/or numbers. Thankfully, "[a-zA-Z0-9]" matches any single digit or letter, uppercase or lowercase. I would also add "+" and write "[a-zA-Z0-9]+," as "+" means that the preceding expression (in this case, [a-zA-Z0-9]) must occur one or more times. As we would like to extract a single word that comes after "@," at least one letter or digit must be found, so I use "+" instead of "*". I added "[ ]+" right after the substring specification because a space is the most common delimiter between words, and one or more spaces could be used in tweets.

gen called = regexs(1) if(regexm(text, "[@]([a-zA-Z0-9]+)[ ]+"))

After going over the first result, I added "[_]*[-]*[:]*[.]*[!]*[?]*[)]*" for more flexibility, as some mentions are followed by "_", "-", ":", ".", "!", "?", or ")" before the space.

gen called2 = regexs(1) if(regexm(text, "[@]([a-zA-Z0-9]+)[_]*[-]*[:]*[.]*[!]*[?]*[)]*[ ]+"))

The result shows that the second command is more comprehensive.

Now I tried sort and by: list to see whether called or called2 extracted every word that comes right after "@."

As we already made the "tag" variable to indicate whether the tweet contains "@," we can use this variable to sort called and called2.

Ideally, called or called2 should extract the single word that comes right after "@" whenever tag == 1.

sort tag called called2
by tag: list tag called called2

Well, unfortunately, 22 values are missing. Looking at the raw data, I thought "[ ]+" in the regular expression could be the issue. For instance, if "@username" is at the very end of a tweet, there is no space after it, so our pattern fails to match.

Thus, I made a modified "text" variable by appending a single space to the end of every tweet.

Then, based on the newly generated variable, I reran the command I wrote for called2 and named the new variable called3.

gen text_sp = text + " "
gen called3 = regexs(1) if(regexm(text_sp, "[@]([a-zA-Z0-9]+)[_]*[-]*[:]*[.]*[!]*[?]*[)]*[ ]+"))
sort tag called called2 called3
by tag: list tag called called2 called3

Still not perfect, but definitely improved. The remaining "@username" cases have special symbols in them, which are far harder to capture correctly (".+" won't work, as it also matches spaces, which are the delimiter between words). So let's move on to the next question and keep those 15 missing values as a challenge for the future.
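For anyone who wants to take up that challenge, here is one possible direction, offered only as a sketch and not as the official solution. It assumes that some of the missing cases are names with an underscore in the middle, so it puts "_" inside the brackets and drops the requirement that a space must follow. The tweets, the name @Some_Name, and the variable called4 are all made up for illustration.

```stata
* Toy data: three hypothetical tweets
clear
input str60 text
"@CNN what a loser"
"Total losers, right @politico"
"So sad @Some_Name!"
end

* The approach from the text: add a trailing space, then require [ ]+
gen text_sp = text + " "
gen called3 = regexs(1) if regexm(text_sp, "[@]([a-zA-Z0-9]+)[_]*[-]*[:]*[.]*[!]*[?]*[)]*[ ]+")

* Sketch of an alternative: allow "_" inside the name, no delimiter required
gen called4 = regexs(1) if regexm(text, "[@]([a-zA-Z0-9_]+)")

list text called3 called4

assert called3 == "CNN"       & called4 == "CNN"       in 1
assert called3 == "politico"  & called4 == "politico"  in 2
* The underscore in the middle of the name defeats called3
assert called3 == ""          & called4 == "Some_Name" in 3
```

The trade-off is that called4 is less strict: it would also pick up the part after "@" in something like an email address, so it is worth listing the results and checking them by eye.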

(3-2) Now you have successfully extracted the names written after "@" and found that some tweets mention Trump himself (@realDonaldTrump). Let's make another tag variable to indicate whether a tweet is such a case.

gen tag2 = regexm(called3, "realDonaldTrump")

regexm looks for "realDonaldTrump" in the "called3" variable and generates the "tag2" variable, coded 1 when "realDonaldTrump" is found and 0 when it is not.

I used bysort and list to check the transformation.

bysort tag2 called3: list tag2 called3

Great! tag2 tagged called3 variable when it contains "realDonaldTrump."

(3-3) As we are more interested in other people/institutions etc. who are mentioned in tweets, let's replace "realDonaldTrump" with a missing value. Generate another variable based on the variable created for (3-1) and then recode the variable.

gen nothim = called3
replace nothim = regexr(nothim, "realDonaldTrump", "")

I first generate the "nothim" variable and then replace "realDonaldTrump" with "", which is the missing value for a string variable.

Basic syntax: regexr(variable, "what you would like to replace", "what to replace it with")

The r in regexr stands for "replace": it lets us replace the part of a string that matches a pattern with something else.

I also used sort and by: list to check the transformation.

sort tag2 called3 nothim
by tag2: list called3 nothim

It worked! The newly generated nothim variable codes "realDonaldTrump" as a missing value.
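One detail to keep in mind is that regexr only replaces the part of the string that matches. The sketch below illustrates this on toy data; the name "realDonaldTrumpFan" and the variable nothim2 are made up, and I do not know whether such a name occurs in the real tweets. The second version anchors the pattern with ^ (start of string) and $ (end of string) so that only an exact match is blanked out.

```stata
* Toy data: the last name is hypothetical
clear
input str20 called3
"realDonaldTrump"
"CNN"
"politico"
"realDonaldTrumpFan"
end

gen tag2 = regexm(called3, "realDonaldTrump")

* The approach from the text: blank out the matched part
gen nothim = called3
replace nothim = regexr(nothim, "realDonaldTrump", "")

* Sketch of a stricter version: only replace when the whole value matches
gen nothim2 = regexr(called3, "^realDonaldTrump$", "")

list called3 tag2 nothim nothim2

assert missing(nothim) & missing(nothim2) in 1
assert nothim == "CNN" & nothim2 == "CNN" in 2
* Only the matched part is removed, so "Fan" is left over in nothim
assert nothim == "Fan" & nothim2 == "realDonaldTrumpFan" in 4
```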

(4) "!" is mostly used to emphasize a word or a sentence, so I wonder which words tend to be yelled in these tweets. Let's extract any single word typed right before "!". Please keep in mind that multiple "!" can be used. You'll get bonus points if you also extract multiple words.

As I would like to allow some flexibility, I first added .* to the front and the end of the pattern.

And then I typed "[ ]" so that Stata will extract a single word, which usually comes right after a space.

Next comes the thing I want to extract: something I don't know in advance but that must exist. So I used "." to indicate any character and "+" to indicate one or more of them. As we are interested in the word that comes before one or more "!," I added "!+" so that Stata knows there must be one or more "!" after the word we are extracting.

gen yell = regexs(1) if(regexm(text, ".*[ ](.+)!+.*"))
sort yell text
list yell text if yell == ""
list yell text if yell != ""

Although Stata does not display the full values of the text variable (the contents of the tweets), we can again get a sense that the command has worked. "win," "winner," "wonderful," and "year" are yelled in these tweets.
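A caveat when a word is followed by several "!": because .+ is greedy, it keeps all but the last "!" inside the extracted word. The sketch below shows this on two made-up tweets and tries a hypothetical alternative, yell2, that captures only characters that are neither a space nor "!". This is one possible variation, not the official solution.

```stata
* Toy data: two hypothetical tweets
clear
input str60 text
"What a loser!!!"
"Such a winner! So sad"
end

* The command from the text
gen yell = regexs(1) if regexm(text, ".*[ ](.+)!+.*")

* Sketch of an alternative: one or more characters that are not a space or "!"
gen yell2 = regexs(1) if regexm(text, "([^ !]+)!+")

list text yell yell2

* With "!!!", the greedy (.+) keeps two of the three exclamation marks
assert yell == "loser!!" & yell2 == "loser"  in 1
* With a single "!", both versions agree
assert yell == "winner"  & yell2 == "winner" in 2
```

Note one difference: when a tweet has several yelled words, the command from the text picks up the last one (the leading .* pushes the match as far right as possible), while yell2 picks up the first one.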

You may also extract the second, third, and fourth words before "!" if you like, by using [ ] as a delimiter between words.

I tried extracting the word that comes right before the word before "!".

gen yell_pre = regexs(1) if(regexm(text, ".*[ ](.+)[ ](.+)!+.*"))
sort yell_pre yell text
list yell_pre yell text if yell == ""
list yell_pre yell text if yell != ""

Hopefully this will give us some context.

It did! Now we can see that "you--two losers," "your money," "your panel," "zero credibility," and "a stupida Armstrong" are yelled in these tweets, which gives us more information about what these tweets are about.

(5) Is there anything else in the dataset that interests you? Please write a command using regular expressions to better understand the contents of the tweets, and explain why you chose the command and why it would work for your purpose.

This question does not have a solution. Have fun!
