Let's first take a look at the dataset.
Let's quickly go over what regexm and regexs do.
To extract specific parts of a string, you can use the syntax if regexm(variable_name, "what_you_want_to_extract").
The m in regexm stands for match: regexm checks whether the pattern you specified, "what_you_want_to_extract", occurs anywhere in the variable_name variable, and returns 1 if it does and 0 if it does not.
For instance, if you specify if regexm(variable_name, "[0-9][0-9][-][0-9][0-9]"), the condition holds for strings that contain something like 15-12 or 11-61. In other words, "[0-9][0-9][-][0-9][0-9]" means: find [one digit][one digit]-[one digit][one digit], that is, two digits, a hyphen, and two more digits, anywhere in the string.
If you type gen new_variable = regexs(n) if regexm(variable_name, "what_you_want_to_extract"), where n is 0, 1, 2, or 3, then regexs extracts either the entire match (n = 0) or the first, second, or third substring of it. You specify substrings using parentheses, e.g. "(substring1)(substring2)(substring3)".
For instance, suppose the pattern is "([0-9][0-9])([-])([0-9][0-9])" and the string contains 12-16:
regexs(0) extracts the entire match ([0-9][0-9])([-])([0-9][0-9]), e.g. 12-16.
regexs(1) extracts the first substring ([0-9][0-9]), e.g. 12.
regexs(2) extracts the second substring ([-]), e.g. -.
regexs(3) extracts the third substring ([0-9][0-9]), e.g. 16.
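The four cases above can be tried out with a minimal sketch like the one below. The toy data and the variable names whole, first, second, and third are made up for illustration and are not from the original dataset.

```stata
clear
* toy data: one row that contains a "dd-dd" pattern and one that does not
input str20 variable_name
"score 12-16 final"
"no match here"
end

* regexs() refers to the regexm() evaluated in the same command,
* so the pattern is repeated on every line
gen whole  = regexs(0) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")
gen first  = regexs(1) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")
gen second = regexs(2) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")
gen third  = regexs(3) if regexm(variable_name, "([0-9][0-9])([-])([0-9][0-9])")

list
```

Rows where regexm finds no match are left as missing (empty strings) in the new variables.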
Now let's write commands to extract the date and time, drawing on those examples.
As the "created_at" variable has values like "12-11-2018 11:43:13", I can use the digit patterns to extract the date and time.
I would write one command for the date and one for the time, each of the form gen new_variable = regexs(0) if regexm(created_at, "pattern").
regexs(0) means Stata should extract the entire string it matches.
[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9] specifies that Stata should look for [two digits]-[two digits]-[four digits], which becomes the date variable.
For the time, regexs(0) again means Stata should extract the entire string it matches.
[ ][0-9][0-9][:][0-9][0-9][:][0-9][0-9] specifies that [one space and two digits]:[two digits]:[two digits] should be found; whatever matches is extracted and used to generate the time variable.
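As a minimal sketch of these two commands, assuming the patterns are as described above: the first sample row is the example value quoted earlier, the second row is made up, and the trim() step is an addition of this sketch.

```stata
clear
input str19 created_at
"12-11-2018 11:43:13"
"03-05-2017 08:01:59"
end

* date: two digits, hyphen, two digits, hyphen, four digits
gen date = regexs(0) if regexm(created_at, "[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]")

* time: one space, then two digits, colon, two digits, colon, two digits
gen time = regexs(0) if regexm(created_at, "[ ][0-9][0-9][:][0-9][0-9][:][0-9][0-9]")

* the leading space is part of the match, so strip it (assumption of this sketch)
replace time = trim(time)

list
```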
And yes! They worked.
We can check the transformation using sort and list.
Because spelling out every digit is so verbose (too much copying and pasting!), we will now try a more generic solution.
Here is what I did.
This time I specified substrings to get the most out of a single pattern. Let's take a look at what I wrote inside the quotation marks.
Going over the created_at variable, I could see that the date and time are separated by a single space,
so I specified that single space with [ ]. Any characters that come before [ ] are the date, and any characters that come after [ ] are the time.
.* specifies that any character (.) may occur zero or more times (*). So in the pattern "(.*)[ ](.*)", the first substring (.*) is the date and the second substring (.*) is the time.
As regexs(1) means Stata should extract the first substring it finds, we get the date information for the date2 variable!
The same logic applies to the time! I used the same pattern and typed regexs(2) to extract the time information for the time2 variable.
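A minimal sketch of the generic version, assuming the pattern is "(.*)[ ](.*)" as described; the sample row is the example value of created_at quoted earlier.

```stata
clear
input str19 created_at
"12-11-2018 11:43:13"
end

* first substring: everything before the space; second: everything after it
gen date2 = regexs(1) if regexm(created_at, "(.*)[ ](.*)")
gen time2 = regexs(2) if regexm(created_at, "(.*)[ ](.*)")

list
```

This relies on created_at containing exactly one space. Because .* is greedy, a value with several spaces would be split at the last one.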
This generic solution works neatly.
We can again check the transformation using sort and list.
This question is the easiest! We can literally write "@" to match "@" and generate a tag variable.
regexm looks for "@" in the "text" variable, and the generated "tag" variable is coded 1 when "@" is found and 0 when it is not.
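A minimal sketch of the tagging step; the two tweets below are made-up placeholders, not rows from the dataset.

```stata
clear
input str40 text
"@politico what a loser"
"Total losers, all of them"
end

* regexm() returns 1 if "@" occurs anywhere in text and 0 otherwise
gen tag = regexm(text, "@")

tabulate tag
```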
It turns out that 65.57% of the "loser(s)" tweets are replies or comments to someone.
To check the transformation, I used sort and by list. Although Stata does not display the full values of the text variable (the contents of the tweets), we can still get a sense that the commands have worked. @NYDailyNews, @politico, and @GOP are mentioned in those tweets.
We know that we can write "@" as it is, but how can we specify the word that comes right after "@" when we don't know what it will look like?
At least we know that it will consist of letters and/or numbers. Thankfully, "[a-zA-Z0-9]" matches any digit and any letter, capitalized or not. I also add "+" and write "[a-zA-Z0-9]+", as "+" means that the preceding expression (in this case, [a-zA-Z0-9]) must occur one or more times. As we would like to extract a single word that comes after "@", at least one character should be found, so I use "+" instead of "*". I added spacing, "[ ]+", right after the substring specification, as a space is the most common delimiter between words and one or more spaces could be used in tweets.
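A minimal sketch of this first attempt. It assumes the "@" sits outside the parentheses, so that only the username is captured; the sample tweets are made up around handles mentioned earlier.

```stata
clear
input str40 text
"@NYDailyNews is failing"
"Wake up @GOP: losers"
end

* "@", then one or more letters/digits (captured), then one or more spaces
gen called = regexs(1) if regexm(text, "@([a-zA-Z0-9]+)[ ]+")

list
```

The second row stays missing because ":" rather than a space follows the username.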
After going over the first result, I added "[_]*[-]*[:]*[.]*[!]*[?]*[)]*" for more flexibility, as some public replies end with "_", "-", ":", ".", "!", "?", or ")" before the space.
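A minimal sketch of the more flexible pattern, again assuming the "@" is outside the captured substring and using made-up sample tweets.

```stata
clear
input str40 text
"@NYDailyNews is failing"
"Wake up @GOP: losers"
end

* same as before, but optional punctuation may sit between the username and the space
gen called2 = regexs(1) if regexm(text, "@([a-zA-Z0-9]+)[_]*[-]*[:]*[.]*[!]*[?]*[)]*[ ]+")

list
```

Each punctuation mark is followed by "*", so it may appear zero or more times and usernames followed directly by a space still match.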
The result shows that the second command is more comprehensive.
Now I tried sort and by list to see whether called or called2 extracted every word that comes right after "@".
As we already made the "tag" variable to indicate whether the tweet contains "@", we can use this variable to sort called and called2.
Ideally, called or called2 should extract the single word that comes right after "@" whenever tag = 1.
Well, unfortunately, 22 values are missing. Looking at the raw data, I thought "[ ]+" in the regular expression could be the issue. For instance, if "@username" is at the very end of a tweet, there is no space after it, so the pattern fails to match.
Thus, I made a modified version of the "text" variable by appending a single space to the end of every tweet.
Then, based on the newly generated variable, I reused the command I wrote for called2 and named the new variable called3.
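A minimal sketch of this workaround. The name text2 for the padded variable is hypothetical, and the sample tweet is made up.

```stata
clear
input str40 text
"You are wrong @realDonaldTrump"
end

* without a trailing space the username at the end of the tweet is not captured
gen called2 = regexs(1) if regexm(text, "@([a-zA-Z0-9]+)[_]*[-]*[:]*[.]*[!]*[?]*[)]*[ ]+")

* append one space to every tweet, then rerun the same pattern
gen text2 = text + " "
gen called3 = regexs(1) if regexm(text2, "@([a-zA-Z0-9]+)[_]*[-]*[:]*[.]*[!]*[?]*[)]*[ ]+")

list text called2 called3
```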
Still not perfect, but definitely improved. The other "@username" values contain special symbols, which are far harder to capture correctly (".+" won't work, as it also matches spaces, which delimit words). So let's move on to the next question and leave those 15 missing values as a challenge for the future.
regexm looks for "realDonaldTrump" in the "called3" variable, and the generated "tag2" variable is coded 1 when it is found and 0 when it is not.
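A minimal sketch of this step, with a few made-up values of called3 typed in directly.

```stata
clear
input str20 called3
"realDonaldTrump"
"politico"
""
end

* 1 if called3 contains "realDonaldTrump", 0 otherwise (including missing)
gen tag2 = regexm(called3, "realDonaldTrump")

list
```

Since regexm matches anywhere in the string, a longer username that merely contains "realDonaldTrump" would be tagged too. Anchoring the pattern as "^realDonaldTrump$" would be one way to require an exact match.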
I used bysort and list to check the transformation.
Great! tag2 flags the called3 variable whenever it contains "realDonaldTrump".
I will first generate the "nothim" variable, and then replace "realDonaldTrump" with "", which is the missing value for a string variable.
Basic syntax: regexr(variable, "would like to replace", "with this")
The r in regexr means "replace", so it lets us replace something in a string with something else.
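A minimal sketch of the two steps, assuming nothim starts out as a copy of called3; the sample values are made up.

```stata
clear
input str20 called3
"realDonaldTrump"
"politico"
end

* start from a copy, then blank out the matched text
gen nothim = called3
replace nothim = regexr(nothim, "realDonaldTrump", "")

list
```

regexr only replaces the first match it finds in each string, which is enough here because each value holds a single username.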
I also used sort and by list to check the transformation.
It worked! The newly generated nothim variable codes "realDonaldTrump" as missing.
As I would like to allow some flexibility, I first added .* to the front and the end of the pattern.
And then I typed "[ ]" so that Stata will extract a single word, which mostly comes right after a single space.
Next comes the thing I would like to extract: something I don't know, but which should exist. So I used "." to indicate that it could be any character and "+" to indicate that there should be one or more of them. As we are interested in the word that comes before one or more "!", I added "!+" so that Stata knows there should be one or more "!" after the word we are extracting.
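Putting these pieces together, a minimal sketch could look like the following. The pattern ".*[ ](.+)!+.*" is a reconstruction from the description above, the variable name yelled is hypothetical, and the sample tweets are made up.

```stata
clear
input str40 text
"What a total win!"
"Such a loser"
end

* anything, a space, the word (captured), one or more "!", anything
gen yelled = regexs(1) if regexm(text, ".*[ ](.+)!+.*")

list
```

Tweets without "!" (or without a space before the yelled word) are left missing.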
Although Stata does not display the full values of the text variable (the contents of the tweets), we can again get a sense that the command has worked. "win", "winner", "wonderful", and "year" are yelled in these tweets.
You can also extract the second, third, or fourth word before "!" if you like, by using [ ] as a delimiter between words.
I tried extracting the word that comes right before the word that comes before "!".
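One way this could be sketched is with two captured words separated by [ ]. The pattern, the variable name yelled2, and the sample tweet are assumptions for illustration, not the original command.

```stata
clear
input str40 text
"He has zero credibility!"
end

* two words separated by a space, followed by one or more "!"
gen yelled2 = regexs(1) + " " + regexs(2) if regexm(text, ".*[ ](.+)[ ](.+)!+.*")

list
```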
I hope it will give us some context.
It did! Now we can see that "you--two losers", "your money", "your panel", "zero credibility", and "a stupida Armstrong" are yelled in these tweets, which gives us more information to understand what these tweets are about.