SOC 561 - Data Cleaning

Data Cleaning for Reshaping

1. Rename variables and keep only the variables of interest.

When I imported the dataset into Stata, I checked the option to treat the first row as variable names, but for some reason it did not work. So I renamed all the variables by hand and then kept only the variables of interest.

rename v1 tconst
rename v2 titleType
rename v3 primaryTitle
rename v4 originalTitle
rename v5 isAdult
rename v6 startYear
rename v7 endYear
rename v8 runtimeMin
rename v9 genres

describe

keep tconst runtimeMin genres

describe
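
If the first-row option cooperates, the renaming step could be skipped by reading the names at import time. The following is only a sketch under assumptions: the file name is a placeholder, and the file is assumed to be tab-separated. The case(preserve) option keeps capitalization such as titleType, since import delimited lowercases names by default.

```stata
* Sketch only: "mydata.tsv" is a placeholder file name, not taken from these notes.
* varnames(1) asks Stata to use row 1 as the variable names.
* delimiters("\t") assumes the file is tab-separated.
* case(preserve) keeps the capitalization found in the header row.
import delimited using "mydata.tsv", delimiters("\t") varnames(1) case(preserve) clear

* Confirm that the variables carry the header names rather than v1, v2, ...
describe
```

With an import like this, the variables would keep the file's own header names, so any later commands would need to use those names.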

2. Generate a numeric unique identifier for the units of analysis.

I sorted the data by "tconst," the alphanumeric unique identifier.

sort tconst

Because the first row holds the wrong values (the variable names instead of actual values), I dropped the observation where tconst == "tconst".

drop if tconst=="tconst"

I generated a numeric unique identifier for each title and then dropped all observations with id > 1000, because the full dataset is too large for my laptop (Stata runs too slowly).

gen id = _n
drop if id > 1000

describe
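
A quick way to confirm that the new identifier behaves as intended is sketched below. It assumes only the variables created above (id and tconst).

```stata
* isid exits with an error if the listed variable does not uniquely
* identify the observations; it prints nothing when the check passes.
isid id
isid tconst

* count reports how many observations remain after the drop.
count
```

If the file held at least 1000 titles, count should report 1000.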

3. Split the "genres" variable so that the data are structured in wide format.

Now I am going to restructure this dataset into the standard wide format. As the screenshot below shows, each title's one to three genres are stored together in the single variable "genres," so I will split it into three variables that share the stub name "genres."

split genres, parse(,)

Stata has generated genres1, genres2, and genres3.

Before applying the reshape command, I rename the original variable "genres" to "old" so that Stata won't be confused when I give reshape the stub name "genres," which is shared by the three variables "genres1," "genres2," and "genres3."

rename genres old

Let's check whether it worked!

sort id genres1 genres2 genres3 old
list id genres1 genres2 genres3 old
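
For orientation, the reshape that this cleaning prepares for might look like the sketch below. It is an illustration rather than the course's official solution, and genre_num is a hypothetical name chosen here for the new index variable.

```stata
* Sketch only: genre_num is a hypothetical name for the index variable.
* genres1-genres3 are stacked into a single variable named genres,
* with genre_num recording which of the three slots each row came from.
reshape long genres, i(id) j(genre_num)

* Titles with fewer than three genres leave empty strings behind.
drop if genres == ""
```

The variables tconst, runtimeMin, and old do not vary within id, so reshape simply carries them along to every row of the same title.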
