Data Cleaning
What is Data Cleaning?
Data cleaning, also referred to as data cleansing, is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within one or more datasets. When multiple datasets are combined into one data table, there is a high probability that you will have duplicate pieces of data, data in which users have used multiple representations for numbers or terms, data that is incomplete, or data that is irrelevant. By cleaning your data first, you save yourself LOTS of headaches later. As the saying goes, quality data beats fancy algorithms. If your data is not clean, your outcome can be inaccurate or at the least, questionable.
Insider Tip: Data is a plural word. Datum is the singular of data. That being said, when referring to data, you can use data are or data is. Either is acceptable.
What are the characteristics of quality data?
Validity: The degree to which your data conforms to defined rules or constraints; how accurately a method measures what is intended to measure.
Example: A teacher gives a benchmark test that is supposed to measure how much students have learned over a period of time and then reviews the students for that test using the content of the test, your data will not be valid.
Accuracy: Ensure your data is error-free and can be used as a reliable source of information. Reliable data helps educators make informed decisions concerning their students
Example: Accurate data is needed to determine whether a school qualifies for Title 1 funding.
Completeness (Data integrity:) The degree to which all required data is known; How comprehensive the data is.
Example: If you are missing student numbers needed to specifically identify a student, your data would not be complete until they are added.
Consistency. Ensure your data is consistent within the same dataset and/or across multiple data sets. Inconsistency occurs when two values in the data set contradict each other.
Example; One set of data use GA for state name and another set uses Georgia for state name
Uniformity. The degree to which the data is specified using the same unit of measure.
Example: Dates entered in different formats: MM/DD/YYYY and DD/MM/YYYY. Would 09/10/2021 be September 10th or October 9th?
What Types of Data do you find most often in Education?
Achievement Data: provides information on student learning and achievement. These data include standardized test scores, classroom-based assessments, rubric-scored writing portfolios, and class grades.
What evidence shows the knowledge, skills, and understandings our students have achieved?
What evidence shows the academic proficiency our students have achieved?
Demographic Data: provides information about the characteristics of students: enrollment, student mobility, attendance, disabilities, ethnic background, limited English proficiency, economic status, gender, dropout rates, and behavioral problems. The purpose of this data is to ensure that the needs of all types of students are being met adequately and to indicate important demographic trends (e.g., an increase in English language learners).
Who are our students? What trends do we see in our student population?
What factors outside of school help us understand our students?
Program Data: provides information on all school programs. These data represent the conditions directly under the control of the school, such as its curriculum scope and sequence, academic programs, teacher training and experience, professional development, afterschool programming, and Title I programs.
What are our academic programs? Which students are involved in each?
How successful are our programs in attaining the academic standards?
Perception Data: provides information on the attitudes and beliefs of various stakeholders in the school, such as teachers, students, and parents. Some categories for which perceptions are helpful include academic standards, school leadership, quality of instruction, and school climate.
How do parents, teachers, students, and members of the school community feel about the job our school is doing?
How satisfied are our “customers” with our programs?
What is the workflow to clean data?
Import your data from external sources.
Make a copy of your data
Ensure it is in tabular form with no blank rows or columns,
Do all tasks that don't involve data manipulation
Remove duplicate or irrelevant observations
Fix structural errors
Handle missing data
Validate
Reporting
Formulas needed to clean data:
Concatenate: =CONCATENATE(A5," ",B5) Video
V Look-up: =VLOOKUP(search_key, range, index, [is_sorted]) Video
Change text case: =LOWER(range) (or UPPER or PROPER) Video
Resources:
https://www.tableau.com/learn/articles/what-is-data-cleaning
https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
https://multimedia.journalism.berkeley.edu/tutorials/cleaning-data/
https://geckoengage.com/articles/5-ways-to-keep-your-student-data-accurate-clean-and-updated/
Steps to Clean Student Data
Practice Activity:
Open the spreadsheet : Sample Spreadsheet
Try using the Google Tools>Cleanup Suggestions and Google Tools>Column Stats to start the cleanup process
Follow the steps in the documentation above to clean the data and prepare it for reporting.