SOC 561 - Exercises

Goal: Real-world implementation with a real dataset (not toy data containing ten or so observations).

I would like to walk you through the process of selecting an eligible dataset, importing it into Stata, and managing it with the reshape command as part of research practice. Try to make your own decisions, and give your own rationale, at each step of the process.

1. Download dataset

Subsets of IMDb data are publicly available!

Click here or the image on the left to see the details of the data and to download it.

2. Decide what to do with the unfamiliar dataset format (and then import it into Stata)

It is great to obtain interesting data for free! Now what?

Unfortunately, the datasets come in unfamiliar formats: gz and tsv.

Do we need Stat/Transfer, or can we use Stata's built-in import options?

The IMDb data webpage explains:

"IMDb Dataset Details Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. "

I first googled "gz" to see how to unzip those datasets.

You may use WinZip if you already have it:

https://www.winzip.com/gz-file.html

Or gzip, which is free software:

https://www.gnu.org/software/gzip/

Or Total Commander, which I love (shareware that is free to try):

https://www.ghisler.com/

You can just drag and drop the file inside the archive (in this case, "data.tsv") to the other pane to unzip it.

It works with almost every archive format.

("ZIP, 7ZIP, ARJ, LZH, RAR, UC2, TAR, GZ, CAB, ACE archive handling + plugins")
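If gzip is already installed, the unzipping step can probably also be run from inside Stata by passing the command to the operating system. This is only a sketch. It assumes a gzip executable is on the system path and that the download is saved as "title.basics.tsv.gz" in the current working directory; both are assumptions, so adjust the name to whatever your file is called.

```stata
* Assumption: gzip is installed and on the system PATH,
* and title.basics.tsv.gz sits in the current working directory.
shell gzip -d title.basics.tsv.gz

* List the .tsv files in the working directory to check the result
dir *.tsv
```

Note that gzip -d replaces the .gz file with the unzipped file, so keep a copy of the download if you want to retain the archive.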

Then I googled "tsv" and found that Stata can import TSV files.

https://www.stata.com/features/overview/importing-and-exporting-text-delimited-data/

https://www.stata.com/manuals13/dimportdelimited.pdf

https://www.imf.org/external/help/tsv.htm

Cool! You can copy and paste the code below to import the dataset into Stata:

import delimited data_title.basics.tsv, delimiter(tab)
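The IMDb description quoted earlier (UTF-8, headers in the first line, \N for missing values) suggests a few extra options that may be worth trying. The following is a sketch for a do-file, not the only way to do it. It assumes Stata 14 or newer (for the encoding option), the file name used above, and the column names startYear, endYear, and runtimeMinutes listed on the IMDb page, which Stata lowercases on import by default. Reading everything as text and turning off quote binding are my own choices here, made because titles can contain quotation marks.

```stata
* Read every column as text first so that \N can be handled uniformly
import delimited using "data_title.basics.tsv", delimiters("\t") varnames(1) ///
    encoding("UTF-8") bindquotes(nobind) stringcols(_all) clear

* IMDb writes \N for missing values; recode it to an empty string
foreach v of varlist _all {
    replace `v' = "" if `v' == "\N"
}

* Convert the columns that should be numeric (names assumed from the IMDb page)
destring startyear endyear runtimeminutes, replace
```

If destring reports that a variable contains nonnumeric characters, browse those rows before forcing the conversion; it may mean some lines were not parsed as expected.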

3. Exercises

(1) First of all, we should decide which dataset is eligible for reshaping.

Click IMDb datasets to see the available options.

(1-1) Among the seven options, I chose "title.basics.tsv" for this exercise. Can you explain why reshaping is applicable to this dataset? Please refer to the IMDb Dataset Details in your explanation.

(1-2) Among the remaining six options, is there another dataset that might be eligible for reshaping? Pick one and explain why.

(2) I would like to reshape "title.basics.tsv" from wide to long, but the current format is not quite ready for that. Why is that, and what would you do to make it ready for reshaping? Please provide your code and the result of the data cleaning.

Hint: the generate, split, rename, and keep/drop commands are helpful for cleaning the dataset.

*You may take a look at the Data Cleaning page if you are not familiar with the relevant commands (though I strongly encourage you to google them first and figure them out yourself).

(3) Now the dataset is ready. Use the reshape command to restructure the dataset from wide to long, and then preserve the transformed dataset.

(4) Restructure the dataset again, this time from long to wide.

(5) Lastly, try restore and see what happens.
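If the syntax of these commands is new to you, the sequence in steps (2) to (5) can be sketched on a tiny made-up dataset before you try it on the IMDb file. Everything below is hypothetical: the variables id, tags, tag, and slot are invented for illustration and are not the IMDb columns, so you still need to decide which variables play these roles in the real data. Run it from a do-file, since it uses // comments.

```stata
* Hypothetical toy data, typed in only to illustrate the commands
clear
input str2 id str20 tags
"a1" "red,blue"
"a2" "green"
"a3" "red,green,blue"
end

* Cleaning: break the comma-separated string into numbered variables
split tags, parse(",") generate(tag)    // creates tag1 tag2 tag3
drop tags

* Wide to long: the stub "tag" plus a new index variable "slot"
reshape long tag, i(id) j(slot)
preserve                                 // keep a copy of the long data

* Long to wide again
reshape wide tag, i(id) j(slot)
list

restore                                  // bring back the preserved copy
list
```

Comparing the two list outputs shows which version of the data each command leaves in memory.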

Cheat sheet 1: How to reshape and preserve data

Cheat sheet 2: When reshaping would work

Cheat sheet 3: Video example on reshape

Data Cleaning

Solutions
