Welcome to the "How to Clean Data Faster" Asynchronous Module.
You will work through this module by scrolling through this learning space. To expand documents and slide decks that are included, you can click on the gray arrow at the top right corner of each item.
Feel free to focus on the pieces of this module that are most relevant to your topics of interest.
Please complete the Exit Ticket at the end of the module. We will use your submission to track completion.
Please contact datafellows@eddirection.org if help is needed.
In lieu of a Note-Catcher, you will be provided with a sample data set you can use to practice the formulas we will be covering in today's module.
Session Outcome: Data Fellows will learn 11 functions in Excel/Google sheets that can accelerate their data cleaning workflows.
Success Criteria: More specifically, Data Fellows will be able to:
Use pre-existing information to create new columns within a spreadsheet
Extract and concatenate values
Alter and replace values
Anticipated time to complete this module: 90 minutes
If you are not yet comfortable using formulas in Excel or Google Sheets, we recommend that you begin with completing the asynchronous module, "Descriptive Analysis in Google Sheets/Excel". There you will learn the basics of how formulas work in electronic spreadsheets.
Data cleaning is a foundational step in the data visualization process. It is also one of the most cumbersome. As Data Fellows have wrestled with organizing and cleaning their RSSP data sets, we have heard sentiments like, “This is taking me such a long time to do. I'm doing this all by hand, and I feel bad because I know that someone, somewhere out there knows how to do this faster than I do.”
First, we want to affirm that if it is taking you longer to clean your data than you would like, that is okay! It is more important that you do your job slowly and correctly than quickly and wrong. That being said, your stress levels would probably go down if you had access to tools and techniques that could accelerate your ability to clean and organize your data. The goal of this module is to teach you how to be that “someone out there” who can do this work incredibly fast.
As we introduce new tools and techniques to help accelerate your data cleaning timelines, we recognize that the applications we use in this module will be fairly basic given the simplistic nature of our sample data set. However, as you go through the exercises below, we encourage you to think about how you can apply these functions to solve the problems you face in your work. As a data analyst, this is where you get to be creative. Your ability to successfully create compelling data sets depends on your capacity to creatively use the tools to which you have access. If you reflect on your collaboration with your Data Fellow peers over the last 9 months, chances are you have seen them solve problems within their data sets or dashboards using a tool you knew about in a way you hadn't considered. Those examples of creativity are what enable them to be so successful -- they thoughtfully consider all the ways each tool can be applied.
As many of you rapidly progress in your analytical prowess, we encourage you to continue adding new tools to your repertoire and learning how to sequentially connect those tools to solve complex problems. The further you get in data analysis, the more difficult and unique problems will become. Lean into your creative abilities to discover solutions.
With that in mind, let’s get started:
IF Function
Problem: You want to create a new column of binary values based on values in a pre-existing column.
Example: You have a column of raw student test scores and you want to create a new column that labels students as proficient or not proficient.
Function Notation: IF(logical_expression, value_if_true, value_if_false)
Notation Explanation: Don’t get intimidated by the function notation! It is actually surprisingly simple and intuitive. The “logical expression” effectively means “the logical statement you want the computer to examine”. An example of this would be selecting cell W2 in the sample spreadsheet and writing < 75. This would mean that you want the computer to see if the selected cell has a value that is less than 75. The rest of the notation indicates what you want the computer to do if the value is less than 75 or if it is not less than 75. In the context of our example, we want the computer to insert the value "Not Proficient" in the new cell if the value is less than 75 or "Proficient" if the value is greater than 75.
Application: Using the sample spreadsheet, insert a new column to the right of column W. Title that column "Student Proficiency", try writing an IF function to label students with scores greater than or equal to 75 as proficient and those less than 75 as not proficient.
Answer: =IF(W2>=75, "Proficient", "Not Proficient")
Explanation: We start by writing the logical statement we want the computer to examine (W2<=74). We then use a comma to indicate what value we want to be inserted in the new cell if the logical statement is true and then we use a second comma to indicate what value we want to be inserted if the logical statement is false.
We put the words Not Proficient and Proficient in quotation marks because we want the computer to recognize that they are strings (non-numerical values).
Pro-Tip: When you write the formula for the first cell correctly, you can simply click and hold the bottom right corner of the cell, drag the cell down to the bottom of the column, and formula will be applied to all subsequent cells, helping you avoid having to write the formula in for each cell.
Nested IF Function
Problem: You want to create a new column of more than two values using an existing column.
Example: You have student test scores and want to categorize students as "Did Not Meet", "Approaches", "Meets", or "Masters".
Function Notation: =IF(logical_expression, value_if_true, IF(logical_expression, value_if_true, IF(logical_expression, value_if_true, IF(logical_expression, value_if_true))))
Notation Explanation: Just like the notation of the IF function, the logical expression refers to the statement we want the computer to examine. An example would be if cell Y2 in the sample spreadsheet is less than or equal to 1693 (written as Y2<=1693). The next piece of the notation describes what the value of the new cell value should be if the logical expression is true. In our case, if the value in the cell we selected is less than or equal to 1693, then the value “Did Not Meet” will appear in the cell.
The next component of the notation is another IF statement. When we write the second IF statement, we are telling Google Sheets to do the following “If the logical expression is true, write the value 'Did Not Meet'. If the logical expression is false, then use the second IF statement to determine what to put in the cell." The computer will then test the second logical expression. If it is true, it will input the corresponding value. If it is false, it will refer to the third IF statement to determine what to put into the cell. The pattern continues until we have created a series of IF statements that cover all values in our reference column.
Application: Title the empty column Z "STAAR Designations" and write a nested IF function that labels scores that are less than or equal 1570 as “Did Not Meet”, scores that are less than or equal to 1693 as “Approaches”, scores that are less or equal to 1757 as “Meets”, and scores that are less than or equal to 2163 as “Masters”.
Answer: =IF(Y2<=1570,"Did Not Meet",IF(Y2<=1693,"Approaches",IF(Y2<=1757,"Meets",IF(Y2<=2163,"Masters"))))
Pro-Tip: It's important to keep track of the number of parenthesis you use when writing nested IF statements. Because we use four IF statements in this formula, it's important we use four sets of parenthesis.
IFS Function
Problem: If you want to use an existing column to create a new column with reclassified values that you have manually assigned.
Example: You have a column of teacher names and want to use that information to create a column that indicates the schools to which the teachers belong.
Function Notation: =IFS(condition_1, value_1, condition_2, value_2, condition_3, value_3)
Notation Explanation: The notation is simple. Condition_1 is the first condition you will use to assign values to the new column. Value_1 is the first value the computer will put into the new column. Condition_2 is the second condition you will use to assign values to the new column. Value_2 is the second value the computer will put into the new column.
Application: Using the knowledge that Murray and Saltz teach at Springview Jr. High and Kaufman and Baum teach at Canyon Jr. High, insert a column to the write of column H, title it "Campus", and write an IFS statement that creates a new column that correctly identifies which teachers belong to which schools.
Answer: =IFS(H2 = "Kaufman", "Canyon Jr. High", H2 = "Saltz","Springview Jr. High", H2 = "Baum", "Canyon Jr. High", H2 = "Murray", "Springview Jr. High")
CONCATENATE Function
Problem: You want to combine string values from multiple cells into one cell. Remember, string values refer to non-numerical data and are most commonly expressed in letters.
Example: You have student first names in one column, student last names in a separate column, and you want to combine both into one column.
Function Notation: =Concatenate(string1, string2, string3)
Notation Explanation: The notation is straight forward. Simply select the cell with the first string value you want to appear in your new cell. Then select the second string value you want to appear in your new cell. You can select more than two cells to concatenate, but you do not have to.
Application: Insert a new column to the right of column C in the sample spread sheet, title it "Student Full Name", and write a CONCATENATE function that takes student first and last names and combines them into one cell.
Answer: =CONCATENATE(B2,C2)
But wait! There is no space in between the names. How do we fix that?
The solution is simple: =CONCATENATE(B2," ",C2)
The blank space surrounded by quotation marks tells the computer that we want to add a space between the values of the two cells we have selected.
LEFT, RIGHT, and MID Functions
Problem: You want to use only a piece of a string value from existing cells to create a new column.
Example: You have a numeric code preceding a student characteristic and you want to extract just the code.
Function Notation: =LEFT(string, number_of_characters)
Notation Explanation: Here, using the LEFT function, we select the cell from which we want to extract a value, and then we list the number of characters (starting from the left and moving to the right) we want to extract and insert into our new cell.
Application: Insert a column to the right of column E, title it "Ethnicity Code", and write a LEFT function that inserts only the number from column E into the "Ethnicity Code" column.
Answer: =LEFT(E2,1)
UPPER, LOWER, and PROPER Functions
Problem: You want the string values in a column to be in all lower case, all upper case, or proper case.
Example: During the data entry process, not all teacher names were written using proper case.
Function Notation: =PROPER(text)
Notation Explanation: Write the PROPER command and then select the text you want to convert to proper case.
Application: Insert a new column to the right of column J, title it Teacher, and write a PROPPER function that converts all teacher names to proper case.
Answer: =PROPER(J2)
After writing the function and dragging it down to the bottom of the data set, you will need to take a couple additional steps to avoid having two columns of teacher names. You will first want to copy all of the teacher names in your newly created column, right click on the same column, use the "paste special" feature on the pop up menu, and select the "values only" option. This will paste the values, not the underlying formulas, into the column. You can then delete the old "Teacher" column with the inconsistent capitalization.
VALUE Function
Problem: The computer thinks that numerical values in your spreadsheet are actually strings.
Example: If numerical data you extracted from a report arrives in your spreadsheet coded as strings.
Demonstration: To show how this can happen, try to sum the values of the "Ethnicity Code" column you created earlier. We recommend using the formula =SUM(F2:F80).
If done correctly, the output of the function is zero. This is unexpected -- the numbers 1, 2, and 3 summed across 79 rows should not equal zero. The reason why these values are summing up to zero is because the computer thinks they are strings. Because we extracted these values from the Ethnicity Column and used the LEFT function, which is designed to work with string values, the computer has no indication that the new values are numeric.
We can convert string values into numeric values by using the VALUE function.
Function Notation: =VALUE(text)
Notation Explanation: The notation is identical to what we saw above with “PROPER”. We simply need to write the VALUE command and select the text we want to convert.
Application: Insert a new column to the right of column F, title it Ethnicity Code, and write a VALUE function that converts the string values to numeric values.
Answer: =VALUE(F2)
To prove that it worked, try summing up the values in the new column and see what happens.
Similar to the previous exercise, you will need to take a couple additional steps to avoid having two columns of ethnicity codes in your data set. You will first want to copy all of the ethnicity codes in your newly created column, right click on the same column, use the "paste special" feature on the pop up menu, and select the "values only" option. This will paste the values, not the underlying formulas, into the column. You can then delete the old "Ethnicity Codes" column.
UNIQUE Function
Problem: You want to get a quick sense of all the different types of values that exist in a column.
Example: You want to create a list of all the teachers who have taken an RSSP survey and compare it to the list of all teachers who were supposed to take the survey.
Function Notation: =UNIQUE(range)
Notation Explanation: Don’t you love how simple these functions are? You know the drill: just write the UNIQUE command and select the range of cells from which you want a list of unique values.
Note: When you select the values in a column, know that if you include the column header in the range, the column header will appear in your list of unique values
Application: Select a blank cell at the bottom of the spreadsheet (we chose cell H86) and write a UNIQUE function that produces a list of the teacher names in column J.
Answer: =UNIQUE(J2:J80)
SORT Function
Problem: You want sort the rows of your spreadsheet in a specific order.
Example One: Continuing the example from above, if you have created a list of teachers who have completed a survey and want to compare it to a list of all the teachers in a school, it could be helpful to have the list of unique values sorted alphabetically.
Example two: If you want to sort your data by ascending/descending order of test scores.
Function Notation: =SORT(range, sort_column, is_ascending)
Notation Explanation: Like many functions in spreadsheets, there are multiple components that you can, but do not have to, include. We will demonstrate this below.
The "range" component of the function is what we would expect; it is the list of cells we want the computer to sort. The sort_column section refers to the column by which we want to sort our values. For example, if we selected all of the columns in our sample spreadsheet and wanted to sort them by scaled STAAR scores, we would write the column number 28 in the second component of the function because scaled the STAAR scores column is the 28th column in the range we have selected.
The final component determines if we want to sort the range in ascending or descending order. If we want to sort it in ascending order, we would write a 1. If we wanted to sort it in a descending order, we would write a 2.
However, the application we are using for our example is much simpler. We simply need to write the SORT command followed by the range of values we want to include. In this instance, the function's default setting for string values is to sort them alphabetically.
Application: Select a blank cell at the bottom of the spreadsheet and write a SORT function that produces a list of the teacher names sorted alphabetically.
Answer: =SORT(H86:H89) . Your answer could be different depending on the cells you used to insert the values from the UNIQUE function you wrote as a part of the previous application.
FILTER Function
Problem: You want to see all of the rows for a specific value, but you don’t want to have to look through your entire spreadsheet to find them.
Example: You want to only see the rows for students who are Hispanic.
Function Notation: =FILTER(range, include)
Notation Explanation: The first component of this function instructs us to select the range of cells we want filtered. The second component directs us to include the logical test we want to use to filter our data.
For example, if I wanted to only return rows of students who had Mr. Saltz as their teacher, I would select all of the cells in my spreadsheet (excluding column headers), and I would write H2:H80=H3 for the second component. The logical test effectively says “I want to see the range of cells I have selected only if the cells in H2 through H80 are equal to H3”.
Application: Select a new cell at the bottom of the sample data sheet and write a FILTER function that produces a list of all the rows for students who are Hispanic.
Answer: =FILTER(A2:AD80,E2:E80=E2)
Reflect: Filtering your data like this in your spreadsheet can help you get a good sense of what your subgroups look like, enabling you to double check that the visualizations in your dashboard are correct. Are there any of the visualizations in your dashboard that could benefit from verification that they are accurately displaying what you think they are displaying?
REPLACE Function
Problem: You want to replace a portion of text in a cell.
Example: A typo in the walkthrough form misspelled one of the school names and you want to correct it so the right name shows up in your dashboard.
Function Notation: =REPLACE(text, position, length, new_text)
Notation Explanation: The first component instructs us to select the cell whose text we want to alter, we then indicate the position in the text at which we want to begin our alteration by listing the letter number. For example, if I wanted to start my alteration at the 5th letter of the word in the cell, I would write the number 5 in for the component. We then indicate how many subsequent letters we want to change. If we wanted to change the next four letters of the word in the cell, we would write the number 4 in for the component. The final component is simply listing the letters we want to use to replace the old text. If we wanted to replace the previous letters “nenr” with the new letters “mont”, we would write “mont” in the final component of the function.
Application: Scroll over to column AJ. There you will find a typo for Canyon Jr. High. Using a blank column, write a REPLACE function that fixes the typo.
Answer: =REPLACE(AJ2, 5, 2, "on")
Note: If you write one formula and drag it down the entire column, it will also change letters for Springview Jr. High, which is correctly spelled. What other function can you use to organize the rows by school, enabling you to drag the formula down only to the Canyon schools?
Answer: Use SORT to organize the spreadsheet by campus name.
Additional Answer: After completing the session, Fatma Tamer identified an alternative (and more effective) solution to the problem. She embedded the REPLACE function within an IF function. Embedding the REPLACE function within an IF function enables you to fix the typo without having to sort the column. Here is the formula she wrote: =IF(AJ2="Canyim Jr. High", replace(AJ2,5,2,"on"), AJ2). Shout out to Fatma for her creative approach to solving this problem!
Applications Answer Key
You can check your work on the applications with the attached sample data set answer key.
Congratulations on completing the module. Please complete the Exit Ticket form by clicking on the link above. We will use the information you submit to track your completion.