Incorrect or inconsistent data can lead to false analysis and misleading data stories derived from the data. For instance, a government agency may want to analyze population census data to decide which regions require further spending and investment on services. Incorrect data could mean the government spends money in the wrong places. As another example, many companies use customer information data like contact information, addresses, and preferences. If the addresses are inconsistent, the company will suffer the cost of resending mail or even losing customers.
Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, data file, or database. The term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. Data cleansing may be performed with tools like Spreadsheets.
The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of rules (such as the age of people must be between 0 and 150). The validation may be strict (such as rejecting any address that does not have a valid postal code) or loose (such as correcting records that partially match existing, known records).
High-quality data needs to pass a set of quality criteria, that includes:
Data-Type Constraints – e.g., values in a particular column must be of a particular datatype, e.g., numeric (integer or real), date, etc.
❖ Example: If an age value should be an integer, then a value of "18 years" is an invalid type because "years" is a sequence of characters, not an integer.
Range Constraints: Typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.
❖ Example: A human age should be between 0 and 150.
Mandatory Constraints: Certain columns cannot be empty.
❖ Example: A unique ID number for each data entry.
Format Constraints: Data may have to be in a certain form such as a number having exactly two places to the right of a decimal point, or a string being a maximum length.
❖ Example: A US zip code must not be more than five characters.
Unique Constraints: A field, or a combination of fields, must be unique across a dataset.
❖ Example: No two persons can have the same social security number.
Set-Membership Constraints: The values for a column come from a predetermined set of values or codes.
❖ Example: Test scores may be A, B, C, D, F, or Unknown (not recorded).
Pattern Constraints: Sometime text fields have to be validated according to a certain pattern of characters.
❖ Example: Phone numbers may be required to have the pattern (###) ###-#### where # is any digit.
Cross-field validation: Certain conditions that utilize multiple fields must be met.
❖ Example: Data showing how many points scored by each player on a team must equal the point total for the team.
All spreadsheet programs support importing data sets in CSV (Comma Separated Values) format. CSV files are text file that use commas to separate values. This makes a spreadsheet program a good tool to clean smaller, simpler, data sets. Larger and/or more complex data sets usually need to be cleaned with more advanced computer programs that are written specifically to wrangle data.
The Cleaning Data Tutorial covers using a spreadsheet to:
Filter data
Filter data over more than one column.
Conditional filtering
Sort data (to help find data out of order or out of range, and to help find duplicates)
Standardize free form data
Some other data cleaning operations that spreadsheets are good for, are:
Single action cleaning of data. For instance, if you get data from an organization, like earthquake data from the National Geological Survey website, the organization might use abbreviations that make the data hard to read. Changing the field names, which are the first row in most spreadsheets, is a one-time easy change with a spreadsheet tool (like Google Sheets).
Use the spreadsheet Find tool (Ctrl+F for Windows; Command+F for Mac) to search for specific data items.
Replacing all data values with another. Most spreadsheets have a "Find and Replace" menu item that will allow replacing all instances of a value with another value. This can allow you to change abbreviations in data values to easier to read values in the entire data set.
Formatting numeric data. Most spreadsheets allow formatting numeric data by restricting the number of decimal places, requiring a leading sign such as a dollar sign. They also allow various date formats.