14. Data Cleaning

"If you think your data is clean, you haven't looked at it hard enough." - Eben Hewitt

Lesson Prerequisites

This lesson assumes that you have basic knowledge of Stata syntax and of writing code in .do files (Intro to Stata, Stata Best Practices), an understanding of the sources of error in data collection (Collecting High Quality Data, Questionnaire Design), and familiarity with the pros and cons of digital data collection (e.g. the first section on mobile data collection in the SurveyCTO lesson). Knowledge of SurveyCTO itself is not necessary for this lesson.

0. Intro to the lesson

The objective of data cleaning is to facilitate data analysis while avoiding the introduction of coding errors.

1. The Importance of Data Cleaning

Data cleaning helps you to avoid errors during analysis, enable others to use and interpret the data, and protect the confidentiality of respondents.

2. Folder Structure

Proper folder structure can help multiple users navigate the data flow, find files during analysis, and reduce time spent on version control.

3. Data Preparation (1): Duplicates, Renaming, Labeling

Duplicates are common in raw data, and you should deliberately select which duplicates to drop rather than choosing at random. Variable names, labels, and notes are crucial metadata.
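
As a rough illustration of selecting duplicates deliberately, the sketch below flags every copy of a repeated ID, lists the copies for review, and then drops according to an explicit rule. The toy data, the variable names (hhid, submitdate, complete), and the rule itself (keep the completed interview) are assumptions made for this example, not a prescription from the lesson.

```stata
* Hypothetical toy data: household 102 was submitted twice
clear
input int hhid str10 submitdate byte complete
101 "2023-01-05" 1
102 "2023-01-05" 0
102 "2023-01-07" 1
103 "2023-01-06" 1
end

* Flag every observation whose hhid appears more than once, then inspect
duplicates tag hhid, generate(dup)
list if dup > 0

* Assumed rule for this sketch: among duplicates, keep the complete interview
drop if dup > 0 & complete == 0
drop dup

* Stop with an error unless hhid now uniquely identifies observations
isid hhid

* Record the decision in the metadata
label variable hhid "Household ID (unique after duplicate resolution)"
```

The final isid line is what makes the choice auditable: the .do file fails loudly if any duplicate survives the rule.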

4. Data Preparation (2): New Data, Missing Data, Logic Checks

You often need to incorporate data from the field during data cleaning, especially to deal with missing values. Logic checks are also important for catching and correcting errors from data collection.
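
One possible way to express missing-value handling and logic checks in a .do file is shown below. The toy data, the variable names, and the convention that -99 means "don't know" are all assumptions for illustration; your survey's codes will differ.

```stata
* Hypothetical toy data; -99 is an assumed survey code for "don't know"
clear
input int hhid int age byte has_children byte n_children
101 34 1 2
102 -99 0 0
103 51 0 0
end

* Recode the survey code to an extended missing value so it cannot
* be mistaken for a real age in later calculations
replace age = .d if age == -99

* Range check: every non-missing age must be plausible
assert inrange(age, 0, 120) | missing(age)

* Consistency check: households reporting no children have a count of zero
assert n_children == 0 if has_children == 0
```

Each assert halts the .do file when it fails, so a violated check surfaces at cleaning time rather than during analysis.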

5. Data Preparation (3): Creating Variables, De-identifying Data

Aggregating or transforming variables is often necessary ahead of analysis. You will also need to de-identify the data before sharing it with others.

6. Data Transformations (1): Appending, Merging

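Appending and merging are procedures for combining two datasets: appending adds observations (rows), while merging adds variables (columns) by matching on one or more identifiers.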
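
A minimal sketch of both operations, using hypothetical survey rounds and a hypothetical treatment assignment file (all names and values are invented for this example):

```stata
* Hypothetical household-level treatment assignment
clear
input int hhid byte treat
101 1
102 0
end
tempfile assign
save `assign'

* Hypothetical round 1 data
clear
input int hhid byte round float income
101 1 50
102 1 65
end
tempfile round1
save `round1'

* Hypothetical round 2 data, left in memory
clear
input int hhid byte round float income
101 2 55
102 2 70
end

* Append stacks observations: same variables, more rows
append using `round1'

* Merge adds variables: many household-round rows match one assignment row
merge m:1 hhid using `assign'

* Every observation should have matched; stop with an error if not
assert _merge == 3
drop _merge
```

Stating the merge type explicitly (m:1 here) and checking _merge afterwards are cheap safeguards against silently mismatched identifiers.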

7. Data Transformations (2): Reshaping, Collapsing

Reshaping the data may be necessary to analyze outcomes at a different level. Collapsing the data is an efficient way to calculate summary statistics for subgroups.
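
The sketch below illustrates both steps on invented data: a wide household file is reshaped to one row per household-round, then collapsed to village-by-round summary statistics. Variable names and values are assumptions for this example.

```stata
* Hypothetical wide data: one row per household, one income column per round
clear
input int hhid byte village float income1 float income2
101 1 50 55
102 1 65 70
103 2 40 48
end

* Reshape wide to long: one row per household-round
reshape long income, i(hhid) j(round)

* Collapse to the village-round level: mean income and household count
collapse (mean) mean_income = income (count) n_hh = income, by(village round)
list
```

Because collapse replaces the data in memory, it is usually run on a copy (for example inside preserve and restore) when the household-level data are still needed.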

Additional Resources

  • IDinsight Data Cleaning checklist (link)

  • IDinsight .do File Checklist (link)

  • multract .ado file for splitting multiple response variables into binary variables (link)

  • Andrade et al (2021) "iefieldkit to document primary data collection and cleaning in Stata", World Bank (link)

  • Gentzkow & Shapiro (2014) "Code and Data for the Social Sciences: A Practical Guide" (link)

  • Kopper et al "Data cleaning and management", J-PAL Research Resources (link)

Banner photo: Edmond Halley's map of trade winds, 1686. Accessed from https://commons.wikimedia.org/wiki/File:Edmond_Halley%27s_map_of_the_trade_winds,_1686.jpg