Data Cleaning & Basic Analysis
Ed Direction Data Fellows: Asynchronous Module
May 19, 2022
Ed Direction Data Fellows: Asynchronous Module
May 19, 2022
Welcome to the Data Cleaning and Basic Analysis Optional Asynchronous Module.
Helpful Instructions and Reminders:
You will work through this module by scrolling through this learning space. To expand documents and slide decks that are included, you can click on the gray arrow at the top right corner of each item.
Feel free to focus on the pieces of this module that are most relevant to your topics of interest.
Please complete the Exit Ticket and Module Completion Form at the end of the module. We will use your submission to track completion.
Please contact datafellows@eddirection.org if help is needed.
In this module you will use a dummy dataset on this Google Sheet in lieu of a note catcher. Save a copy to use throughout the module. Refer to the dataset each time you see the icon on the left.
Session Outcome: Data fellows will know basic concepts, tools, and resources to clean their datasets. They will have experience with simple calculations and data visualizations in google sheets.
Success Criteria: Upon completion of this module, participants will be able to:
Clean a dataset, paying close attention to column names, consistency of data types, accuracy, and usable numbers
Perform calculations to find a count, average, and median in a dataset
Create pie charts and bar graphs to visualize data
What is Data Cleaning?
Data cleaning is the process of preparing data to be in used in your dashboard. A clean data set can be easily read by the data platform, translating your data into useful displays. By cleaning your data and eliminating any issues in your dataset at the beginning, you avoid encountering problems and errors down the line that are trickier to identify.
In the slide below are the key steps to data cleaning. Each step will be discussed in more detail throughout the module.
Long column names like “Language Arts: Informational Text: Language, Craft, and Structure” may be helpful when reading a spreadsheet, but they can be very problematic when you are trying to use that spreadsheet in a programming language, dashboard software, or even for visualizations.
HOT TIPS!
Use abbreviations and shorten column names as much as possible (you can keep the original name in a comment in the cell). The column names should be the shortest way of saying what is in the column.
Replace spaces and symbols with underscores, as they can cause issues in programming language (ex. Language Arts: Informational Text: Language, Craft, and Structure could become Language_Arts_Informational_Text).
Step 1: Review the column names in your practice Google Sheet. Which of the column names need to change? Change them to a version which follows the rules discussed above.
Data values are also sometimes called data types or data formats. These all refer whether the data is read as a number, percentage, date, etc.
Even if the data looks the same, it is important to ensure that this is the case by selecting your desired format for values in the toolbar at the top of your sheet. This process can "corrupt" values that did not originally match the selected format, so make sure to check your dataset for errors. Note: Several programming languages have similar concepts, such as choosing whether a value is a number or a factor.
Step 2: Look at your practice Google Sheet and find the columns for ‘Term Tested’, ‘$Term ##Rostered&&’, and ‘Grade’.
Are all the values in those columns consistent?
Are the numbers exactly the same? Are they all stored as the correct type of value?
Select the column and click ‘123’ in the top bar to see how they are saved. Then, make the necessary changes so that these rows are consistent.
You may receive data that is imperfect for a number of reasons. There may be duplicates, system errors, and issues with the formulas. There may be missing or misplaced values. In smaller data sets, you can manually look over the sheet, but it may be a challenge in larger data sets. Here are some ways to check your dataset for accuracy and other issues.
Read through the first three ways to check for accuracy (slides 2-4), and then complete the activity (also noted on slide 5).
Now, read through about checking for consistent written values (slide 6), and complete the activity (also noted on slide 7).
Step 4: Open your dataset and click on the filter (three lines in the column header) for ‘School’ and ‘Subject Matter’, and ‘Course Matter’. Under the search bar, all the different values in the column will appear. All of the values in these columns should be exactly the same, so there should only be one. If there are more than one, find the cells with different values, and replace them so that all the values are the same.
Read the last two ways to check for accuracy (slides 8-9) and complete the activity (slide 10).
You may receive numerical data with letters, symbols, or words attached that are unusable for calculations. (The exceptions are dates, currencies, and percentages, if they are formatted correctly.)
The values on the right represent a number, and may even contain numbers, but they can not be shown as such in either Excel/Google Sheets calculations or the dashboard. To address this, keep the column with that ‘true’ value, and add an empty column next to it, preferably with the same name, with ‘_num’ added to show that this is the numerical version. For instance, ‘Language Arts’ would have a column next to it named ‘Language_Arts_Num’. You can then use an Excel formula to change the value into a usable format. If you have experience in a coding language, that may be a simpler option for this type of data cleaning. When you are done, test your work by applying the appropriate value formatting discussed in the previous step.
Step 6: Insert and name a column to the right of ‘Lexile Score’ (right click on the top of the column, and then select ‘Insert Column Right’). Watch the first 4:30 of this video and use a Split function to just get the lexile score number into the column, based on the column to the left.
Bonus Activity: Use the Split function to create columns for the high and low numbers of the Lexile range.
In this section, we will be doing some simple calculations using the same sheet from Part 1. If you had challenges with cleaning the data, use the tab labeled "Correct Version" for a fully cleaned dataset.
Count functions help you understand the frequency of non-numeric variables, like school district, homeroom teacher, and school year.
HOT TIP!
Count functions are different than averages. For example, although grade is represented by a number, it actually represents levels. It is more useful to know that there are 24 students in Grade 1 and 20 students in Grade 2 than to know that the average student is in Grade 1.45.
Watch this short video about how to use the 'countif' function in Google Sheets.
Step 1: In your sheet, type the function under ‘# of LP Middle School students’ into the box above it and check the number. Then use the same ‘countif’ formula to find the number of students studying reading, under the other yellow box.
Averages are a simple calculation which take a number of statistics of the same type, and compares them, giving a summary of a unit or group across multiple occurrences. To take an average, you add up all of the data points, and then divide them by the number of data points. The number will then be comparable to the original data points.
HOT TIP!
In datasets where one or a few units have especially high or low numbers (called ‘skewed’ data), you may still want to compare, but an average might not be a good way to do it. A median works by ranking all the values in a dataset from largest to smallest, and then selecting the single datapoint that sits in the middle. This helps eliminate skewed data and may give a better picture of the ‘true’ value.
To calculate an average, or median of a column, enter ‘=average()’ or ‘=median()’ with the range of cells to calculate in the parentheses.
Step 2: Under the orange Boxes in the RIT score column, find the average and median for the test scores in this class. Are they similar values? Is the data skewed?
Charts and graphs are an important way to visualize your data for your audience, and they are also for you as you initially look over a dataset. Making a chart can be even easier than performing a calculation, and it might tell you even more!
Pie charts should be used for situations where the units included add up to 100% of something, such as a classroom, district, or dataset. They feature pie ‘slices’ which show how much of the whole each sub-group makes up.
Bar charts compare values (typically counts or averages) across different units. They give a simple view of data that shows the differences between these things through the height of a bar.
Practice: Which of the following pie charts make sense as a way to show their data? Why? (After considering your answer, click the arrow to the right to see our reasoning.)
Graph 1 makes sense as a pie chart, graph 2 does not.
The first graph shows something which is a whole (the number of students within a district) and separates them into groups (by school), and shows the percent of the whole that each of these groups make up. This is a logical place to use a pie chart.
The second graph shows a single student’s performance on 4 tests. Test scores are not usually added together, so this chart does not make much sense to the reader.
Select your data: Start by selecting the columns you want to include in your graph. You can only select one or two columns for these types of graphs. One of the columns needs to be the names, or unit, which you want to show your numbers by. Without these names, the numbers will show up unlabeled.
Practice: Let’s say you are trying to use the dataset below to make a bar chart which compares how many days each student missed. Which column/columns would you select? (After considering your answer, click the arrow to the right to see our reasoning.)
Either First/Last name, or Days Missed.
To select these columns, drag your mouse from the top to the bottom of the variables. If the columns are not next to each other, hold down the control (CTRL) key as you select the second column to add it to the ‘selection.’ Make sure that all of the rows in your columns are selected and that the blue box goes from top to bottom.
Choose a chart: With the data still selected, click the ‘insert’ button on the top bar and then select chart (Google calls these charts, but you might call them graphs). This will automatically create a graph, based on the information provided. If you want to change the type of graph, click on the three dots in the top right corner of the graph, and select a different type from the menu.
Step 3: Use your dataset to make a graph of the Lexile scores for this group. Which chart is a more appropriate way to display this data? Pie or bar?
Optional: Watch this short tutorial to learn how to create pie charts and bar charts in Google Sheets (start at 1:13).
Congratulations on completing the Progress Monitoring module. Please complete the Exit Ticket and Module Completion forms by clicking on the links below. We will use the information you submit to track your completion.