"If you think your data is clean, you haven't looked at it hard enough." - Eben Hewitt
This lesson assumes basic knowledge of Stata syntax and of writing code in .do files (Intro to Stata, Stata Best Practices), an understanding of the sources of error in data collection (Collecting High Quality Data, Questionnaire Design), and familiarity with the pros and cons of digital data collection (e.g. the first section on mobile data collection in the SurveyCTO lesson). Knowledge of SurveyCTO itself is not necessary for this lesson.
The objective of data cleaning is to facilitate data analysis while avoiding the introduction of coding errors.
Data cleaning helps you to avoid errors during analysis, enable others to use and interpret the data, and protect the confidentiality of respondents.
Proper folder structure can help multiple users navigate the data flow, find files during analysis, and reduce time spent on version control.
Duplicates are common in raw data. Decide deliberately which duplicate to keep and which to drop, rather than dropping one at random. Variable names, labels, and notes are crucial metadata.
You often need to incorporate data from the field during data cleaning, especially to deal with missing values. Logic checks are also important to eliminate errors from data collection.
Aggregating or transforming variables is often necessary ahead of analysis. You will also need to de-identify the data before sharing it with others.
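Appending and merging are both procedures for combining two datasets: appending stacks the observations of datasets that share variables, while merging adds variables by matching observations on a common identifier.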
Reshaping the data may be necessary to analyze outcomes at a different level. Collapsing the data is an efficient way to calculate summary statistics for subgroups.
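A minimal Stata sketch of three of the steps above (resolving duplicates, merging, and collapsing), using made-up toy data. The variable names (hhid, submit_day, village, income, village_name) and the rule of keeping the latest submission per household are assumptions for illustration only; in practice the choice of which duplicate to keep should be informed by feedback from the field.

```stata
* Hypothetical toy data: household 2 was submitted twice
clear
input hhid submit_day village income
1 1 1 100
2 1 1 250
2 2 1 260
3 2 2 80
4 3 2 .
end

* Inspect duplicates on the ID before dropping anything
duplicates report hhid
duplicates tag hhid, generate(dup)
list if dup > 0

* Assumed rule: keep the latest submission for each household
bysort hhid (submit_day): keep if _n == _N
drop dup

* Stop with an error if hhid still does not uniquely identify rows
isid hhid

tempfile hh
save `hh'

* Hypothetical village-level lookup table
clear
input village str12 village_name
1 "North"
2 "South"
end
tempfile villages
save `villages'

* Many households per village, so this is an m:1 merge;
* assert(match) makes the merge fail if any row is unmatched
use `hh', clear
merge m:1 village using `villages', assert(match) nogenerate

* Village-level summary statistics; missing income values are
* excluded from both the mean and the count
collapse (mean) mean_income = income (count) n_income = income, by(village village_name)
list
```

Run as a single .do file, since the tempfile names are local macros that only exist within one run. Note that collapse replaces the data in memory, so save the household-level dataset first if you need it again.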
IDinsight Data Cleaning checklist (link)
IDinsight .do File Checklist (link)
multract .ado file for splitting multiple response variables into binary variables (link)
Andrade et al. (2021) "iefieldkit to document primary data collection and cleaning in Stata", World Bank (link)
Gentzkow & Shapiro (2014) "Code and Data for the Social Sciences: A Practical Guide" (link)
Kopper et al. "Data cleaning and management", J-PAL Research Resources (link)
Banner photo: Edmond Halley's map of trade winds, 1686. Accessed from https://commons.wikimedia.org/wiki/File:Edmond_Halley%27s_map_of_the_trade_winds,_1686.jpg