Data Cleaning for Reshaping

1. Rename variables and keep only the variables of interest.

When I imported the dataset into STATA, I checked the option that tells STATA to read the first row as variable names, but for some reason it did not work. So I renamed all the variables by hand and then kept only the variables of interest.

  • rename v1 tconst
  • rename v2 titleType
  • rename v3 primaryTitle
  • rename v4 originalTitle
  • rename v5 isAdult
  • rename v6 startYear
  • rename v7 endYear
  • rename v8 runtimeMin
  • rename v9 genres
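
The nine commands above can also be collapsed into a single grouped rename. This is only a sketch of an equivalent alternative, assuming the columns are still named v1 through v9 exactly as they were imported; the file name in the commented-out import line is a placeholder of mine, not something taken from this dataset.

```stata
* Option A (commented out): ask for the header row at import time.
* "mydata.tsv" is a hypothetical file name used only for illustration.
* import delimited using "mydata.tsv", varnames(1) clear

* Option B: the data are already in memory as v1-v9, so rename them
* in one command (old names in the first list, new names in the second).
* The /// line continuation works in a do-file, not at the command prompt.
rename (v1 v2 v3 v4 v5 v6 v7 v8 v9) ///
       (tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMin genres)
```

Run this instead of the individual rename commands, not in addition to them, because v1 through v9 no longer exist once they have been renamed.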


  • describe
  • keep tconst runtimeMin genres


  • describe

2. Generate a numeric unique identifier for the units of analysis.

I sorted the data by "tconst," the alphanumeric unique identifier.

  • sort tconst

Because the first row holds the wrong values (variable names instead of actual values), I dropped the row where tconst=="tconst".

  • drop if tconst=="tconst"

I generated a numeric unique identifier for each title and then dropped all observations with id > 1000, because the full dataset is too large for my laptop (STATA runs too slowly).

  • gen id = _n
  • drop if id > 1000


  • describe
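
A few optional checks can confirm that the identifier behaves as intended. This is a minimal sketch, assuming the sort, drop, and gen commands above have already been run in the same session.

```stata
* tconst should uniquely identify each title; isid stops with an error if it does not.
isid tconst

* The new numeric id should be unique as well.
isid id

* After the subsetting step, no more than 1,000 observations should remain.
count
assert r(N) <= 1000
```

If any of these checks fails, STATA stops with an error, which is easier to notice than scanning the describe output by eye.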

3. Split the "genres" variable so that the data are structured in wide format.

Now I am going to restructure this dataset into standard wide format. As you can see in the screenshot below, each title's one to three genres are stored in a single variable, "genres," so I will split it into three variables sharing the stubname "genres."

  • split genres, parse(,)

STATA has generated genres1, genres2, and genres3.
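
To see what split does in isolation, here is a small self-contained sketch on three made-up rows. The genre strings are illustrative values I typed in, not rows copied from the dataset, and preserve/restore is used so that the data already in memory are not lost.

```stata
* Set the real data aside, build a toy dataset, then bring the real data back.
preserve
clear

* Three illustrative rows with one, two, and three comma-separated genres.
input str30 genres
"Drama"
"Documentary,Short"
"Animation,Comedy,Romance"
end

* Split on commas; new variables take the stub "genres" plus a number.
split genres, parse(",")

* Rows with fewer than three genres get empty strings in the unused variables.
list, noobs

restore
```

The number of new variables equals the largest number of comma-separated pieces found in any row, which is why the real data end up with exactly three.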


Before I apply the reshape command, I rename the original variable "genres" to "old" so that STATA won't be confused when I type the stubname "genres," which is shared by the three variables "genres1," "genres2," and "genres3," in the reshape command.

  • rename genres old


Let's check whether it worked!

  • sort id genres1 genres2 genres3 old
  • list id genres1 genres2 genres3 old
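
With the data in this shape, the reshape step that the cleaning prepares for could look roughly like the following. This is a sketch under my own assumptions rather than the exact command used later: "genre_num" is a hypothetical name for the new index variable, and dropping the empty rows is one possible choice, not a requirement.

```stata
* Go from one row per title (wide) to one row per title-genre pair (long).
* "genres" is the shared stubname; id identifies each title;
* genre_num (hypothetical name) will hold the suffix 1, 2, or 3.
reshape long genres, i(id) j(genre_num)

* Titles with fewer than three genres produce rows with an empty string.
drop if genres == ""

list id genre_num genres in 1/10
```

Because the original combined variable was renamed to "old" beforehand, the long-format variable "genres" that reshape creates does not collide with an existing variable of the same name.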