This tutorial will take you through cleaning your data. It will use an example where students took a survey of their classmates over several weeks of what they did in the evening. Review the Cleaning Data Overview for more information on why cleaning data is important and some examples of data constraints relevant to cleaning data.
Do the following:
➤ Create a copy of the "What Did You Do Last Night" survey data spreadsheet on your Google Drive.
Sometimes a data file is too large to look at all at once. One way to address this problem is to filter the data so that you are only looking at some of the rows.
Do the following:
➤ Follow the example in the video provided.
➤ Using your own copy of the "What Did You Do Last Night" data spreadsheet, filter one of the columns.
Filters may include more than one value - simply select all the values you'd like to include. You can even use a conditional statement as a filter. For example, you can make a filter that will only show values if the value is less than some number you specify. It's also possible to filter on multiple columns at the same time. More complex filters help refine how you look at your data.
Do the following:
➤ Follow the example in the video provided.
➤ Using your own copy of the "What Did You Do Last Night" data, add filters to two different columns.
In one column choose at least two values.
In the other column use a Conditional filter, as shown below.
Sorting will reorder your rows of data by one of the columns. This makes it easy to see the smallest or largest value in each column. You may also notice patterns in your data once it is in sorted order.
Do the following:
➤ Follow the example in the video provided.
➤ Using your own copy of the data, sort your data by at least one of the columns, both A -> Z and Z -> A
Even when it is collected carefully (and especially if it isn't) data usually has issues that make it difficult to use for analysis. Values may have been entered incorrectly, data may not be in a usable format, or your data might just be poorly organized. It's important to fix these issues prior to exploring the data for trends and patterns. The process of preparing data for analysis is called "cleaning" the data
As you looked through the "What Did You Do Last Night" data, you probably already noticed some errors. Sometimes it's easy to know how to fix an error in the data; other times, it's not so obvious. In general, if you are certain what the data is supposed to be, you can fix it manually by typing the correction. Otherwise, it is better to just delete it.
Do the following:
➤ Follow the example in the video provided.
➤ Use sorting and searching to find errors in your data. Consider the following:
Are some values too large or too small to make sense?
Did some users input text when they should have used a number?
➤ Correct the errors that you can and delete the values that don't make sense.
In order to use computers to analyze data we usually need it to be standardized in some way. Data collected as "free form text" will be particularly susceptible to this problem. If you ask people "What did you do last night" you will likely get a different response from every single person.
Making charts or tables with this information would be meaningless and confusing. Free form text data like this may be useful for a human to read but cannot easily be used by a computer. In order to fix this you might need to create new columns of data manually which would allow you to categorize the free form text into data that is more useful for computation.
In the video provided, you'll see one possible way to categorize responses to the question, "What did you do to relax last night?". In this case the data is being categorized by what the person was doing. The resulting column of data will be much easier to use when analyzing this data with a computer.
Note: This process is being done by hand using only one of many possible methods for creating categories. There are many different ways to categorize data and there will rarely be one clearly "right" way to do it. You'll have to make judgment calls about what makes the most sense with the data provided.
Do the following:
➤ Follow the example in the video provided.
➤ Create at least one new column in your "What Did You Do Last Night" dataset that standardizes a column of data collected as "free form text".
To do this, you will need to invent a set of categories that applies to your data as demonstrated in the video.