1. Rename variables and keep variables of interest only.
When I imported the dataset into STATA, I checked the option that tells STATA to read the first row as variable names, but for some reason it did not work. So I renamed all the variables and then kept only the variables of interest.
rename v1 tconst
rename v2 titleType
rename v3 primaryTitle
rename v4 originalTitle
rename v5 isAdult
rename v6 startYear
rename v7 endYear
rename v8 runtimeMin
rename v9 genres
describe
keep tconst runtimeMin genres
describe
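If the first-row option fails in the import dialog, the import can also be tried from the command line. The following is only a minimal sketch, not what was run above. It assumes the raw data is a tab-separated file saved as title.basics.tsv in the working directory; the file name and every option shown are my assumptions. With varnames(1), the header row supplies the variable names, so the manual renaming would not be needed, although the names would then be whatever the header contains.

```stata
* Assumption: tab-separated file named title.basics.tsv in the working directory
* varnames(1)        : read variable names from the first row
* case(preserve)     : keep the capitalization used in the header row
* bindquotes(nobind) : treat quotation marks inside titles as ordinary characters
* stringcols(_all)   : read every column as a string to avoid type surprises
import delimited using "title.basics.tsv", delimiters("\t") varnames(1) ///
    case(preserve) bindquotes(nobind) stringcols(_all) clear

* Check which variable names were actually created before keeping any
describe
```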
2. Generate a numeric unique identifier for the units of analysis.
I sorted by "tconst," the alphanumeric unique identifier.
sort tconst
Because the first row holds the wrong values (variable names instead of actual values), I dropped the row where tconst=="tconst".
drop if tconst=="tconst"
I generated a numeric unique identifier for each title and then dropped all observations with id > 1000, because the full dataset is too large for my laptop (STATA runs too slowly).
gen id = _n
drop if id > 1000
describe
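A few optional checks can confirm that the identifier behaves as intended. This is a sketch, assuming the original file had at least 1,000 titles; each command stops with an error if its condition fails.

```stata
* Confirm that id uniquely identifies observations
isid id
* Confirm that the original alphanumeric identifier is unique as well
isid tconst
* Confirm that exactly 1,000 observations remain
* (assumes the original data had at least 1,000 titles)
assert _N == 1000
```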
3. Split the "genres" variable so that our data will be structured in wide format.
Now I am going to restructure this dataset into the standard wide format. As the screenshot below shows, the one to three genres of each title are stored in a single variable, "genres," so I will split it into three variables sharing the stubname "genres."
split genres, parse(,)
STATA has generated genres1, genres2, and genres3.
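To see what split does, here is a small self-contained sketch on toy data. The three rows are made up for illustration and are not taken from the dataset above. Titles with fewer than three genres get an empty string in the unused variables.

```stata
* Toy data for illustration only (hypothetical rows)
clear
input str9 tconst str30 genres
"tt0000001" "Documentary,Short"
"tt0000002" "Animation,Short"
"tt0000003" "Animation,Comedy,Romance"
end

* Split on commas; new variables are named with the stub "genres"
split genres, parse(,)

* The first two toy titles have only two genres, so genres3 is empty for them
assert genres3 == "" in 1/2
assert genres3 == "Romance" in 3

list tconst genres1 genres2 genres3
```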
Before I apply the reshape command, I rename the original variable "genres" to "old" so that STATA won't be confused when I type the stubname "genres" shared by the three variables "genres1," "genres2," and "genres3" in the reshape command.