Data Cleaning for Reshaping
1. Rename variables and keep variables of interest only.
When I imported the dataset into Stata, I checked the option to treat the first row as variable names, but for some reason it did not work. So I renamed all the variables myself and then kept only the variables of interest.
- rename v1 tconst
- rename v2 titleType
- rename v3 primaryTitle
- rename v4 originalTitle
- rename v5 isAdult
- rename v6 startYear
- rename v7 endYear
- rename v8 runtimeMin
- rename v9 genres
- describe
- keep tconst runtimeMin genres
- describe
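As a sketch of one thing to try at import time, assuming the file is read with import delimited: the varnames(1) option asks Stata to take the variable names from the first row. The file name "mydata.tsv" is a placeholder, not the actual file, and the names would then come from the file's header row, so they may differ from the ones chosen above.
- * "mydata.tsv" is a hypothetical file name; replace it with the real path
- import delimited using "mydata.tsv", varnames(1) clear
- describe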
2. Generate a numeric unique identifier for the units of analysis.
I sorted the data by "tconst," the alphanumeric unique identifier.
- sort tconst
Because the first row holds the variable names instead of actual values, I dropped that row, the one where tconst=="tconst".
- drop if tconst=="tconst"
I generated a numeric unique identifier for each title and then dropped all observations with id > 1000, because the full dataset is too large for my laptop (Stata runs too slowly on it).
- gen id = _n
- drop if id > 1000
- describe
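As an optional check that uses only the variables created above, isid confirms that a variable uniquely identifies the observations and stops with an error if it does not. The last command is an alternative to "drop if id > 1000" that keeps rows by position; at this point it changes nothing, because only 1000 observations remain.
- * both isid commands are silent when the identifier is unique
- isid tconst
- isid id
- * alternative to "drop if id > 1000"; requires at least 1000 observations
- keep in 1/1000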
3. Split the "genres" variable so that the data are structured in wide format.
Now I am going to restructure this dataset into standard wide format. As you can see in the screenshot below, each title's one to three genres are stored in a single variable, "genres," so I will split it into three variables sharing the stub name "genres."
- split genres, parse(,)
Stata has generated genres1, genres2, and genres3.
Before I apply the reshape command, I rename the original variable "genres" to "old" so that Stata won't be confused when I give it the stub name "genres," which is shared by the three variables "genres1," "genres2," and "genres3."
- rename genres old
Let's check whether it worked!
- sort id genres1 genres2 genres3 old
- list id genres1 genres2 genres3 old
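For orientation, the reshape call that this renaming prepares for might look like the sketch below. The name genre_num for the new index variable is a placeholder chosen for illustration, and the command actually used may differ.
- * genre_num is a hypothetical name for the j() variable
- reshape long genres, i(id) j(genre_num)
- list id genre_num genres in 1/10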