Making sure data is clean and ready to use is a task that can't be overlooked. The following is an example of a typical sales dataset that had some formatting issues and some empty cells (null values). While in some cases we could process data as we receive it, any conclusions we might try to draw from it would most likely be problematic and misleading. In other words, useless.

An initial exploration of the data lets us see what needs fixing. In this case, I first inspected the data using common commands such as data.head(10), mainly to see the column headers and whether the values were integers, floats, or strings. When dealing with big tables, it is recommended to look at the first and last rows instead of loading all of them, as this might slow down the tools being used. At this point we can also spot other problems, such as empty cells or duplicates, and see how many rows and columns there are.
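A minimal sketch of that first inspection in pandas. The column names and values here are invented stand-ins for the real sales file, just to show the commands:

```python
import pandas as pd

# Tiny inline sample standing in for the real sales file
# (column names are assumptions, not the original dataset's)
data = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "region":   ["North", "South", "South", "East"],
    "price":    [10.0, 12.5, 12.5, None],
})

# Peek at the first and last rows rather than printing the whole table
print(data.head(10))
print(data.tail(5))

# How many rows and columns, plus dtypes and non-null counts per column
print(data.shape)
data.info()

# Quick counts of duplicate rows and missing values per column
print(data.duplicated().sum())
print(data.isna().sum())
```

On a real file you would start with `pd.read_csv(...)` instead of building the frame inline; everything after that line is the same.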
Depending on how the data was entered and which systems were used to collect it, we may have very consistent or very inconsistent information. In this case, I was asked to change the format of the date, probably to make it consistent with other datasets.
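A date reformat like this can be sketched with pd.to_datetime. The column name and the assumption that the source dates are day-first are mine, not the original dataset's:

```python
import pandas as pd

# Sample day-first dates (an assumption about the source format)
data = pd.DataFrame({"order_date": ["03-01-2024", "15-02-2024", "28-03-2024"]})

# Parse as day-first, then re-emit every date in a single ISO format
data["order_date"] = pd.to_datetime(data["order_date"], dayfirst=True)
data["order_date"] = data["order_date"].dt.strftime("%Y-%m-%d")
print(data["order_date"].tolist())  # ['2024-01-03', '2024-02-15', '2024-03-28']
```

Parsing first and formatting second is safer than string surgery, because pandas will raise on dates that don't match instead of silently producing nonsense.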
After sorting, filtering, renaming, and fixing the quirks in the dataset, the data was finally ready to reveal its secrets. I performed some brief calculations: total sales per region, the average price per product, and how many orders each customer had placed. Surely, this was enough to start seeing exactly how well the business was doing!
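Those three summaries are one-liners with groupby. Again, the column names below are assumptions standing in for the real ones:

```python
import pandas as pd

# Minimal stand-in for the cleaned dataset (column names are assumptions)
data = pd.DataFrame({
    "region":   ["North", "North", "South"],
    "product":  ["A", "B", "A"],
    "customer": ["c1", "c2", "c1"],
    "price":    [10.0, 20.0, 30.0],
    "sales":    [100.0, 200.0, 50.0],
})

# Total sales per region
total_sales_per_region = data.groupby("region")["sales"].sum()

# Average price per product
avg_price_per_product = data.groupby("product")["price"].mean()

# How many orders each customer placed
orders_per_customer = data["customer"].value_counts()

print(total_sales_per_region)
print(avg_price_per_product)
print(orders_per_customer)
```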
GitHub repository: original files are available here.
Google Colab: here.
In the data world, there is a famous saying: "Garbage In, Garbage Out". If your data is "dirty", your analysis, no matter how advanced, will lead to unreliable conclusions and poor business decisions. Clean data increases productivity, builds trust in your dashboards, and is essential for training accurate AI models. Also called data scrubbing or cleansing, data cleaning includes:
Deduplication: Identifying and removing duplicate records that might have been created during data collection or when merging different sources.
Handling Missing Values: Deciding whether to delete rows with gaps or "impute" them (filling them in using averages or other logical guesses).
Standardising Formats: Ensuring that dates (like DD-MM-YY vs MM-DD-YY), currencies, and naming conventions (like "USA" vs "United States") are uniform across the entire set.
Fixing Structural Errors: Correcting typos, inconsistent capitalisation, and mislabelled categories.
Managing Outliers: Identifying extreme values that don't fit the rest of the data. You must decide if they are genuine anomalies to keep or errors to remove.
Validation: A final "sanity check" to ensure the data now follows business rules and makes sense for the intended analysis.
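Several of the steps above can be chained in a few lines of pandas. This is a toy sketch with invented values, not the original dataset:

```python
import pandas as pd

# A toy "dirty" table illustrating the issues listed above (values invented)
df = pd.DataFrame({
    "country": ["USA", "United States", "usa", "UK"],
    "amount":  [100.0, 100.0, None, 250.0],
})

# Standardising formats: uniform capitalisation, one naming convention
df["country"] = df["country"].str.strip().str.upper().replace({"UNITED STATES": "USA"})

# Handling missing values: impute the gap with the column mean
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Deduplication: drop rows that are now exact repeats
df = df.drop_duplicates().reset_index(drop=True)

# Validation: sanity-check a simple business rule
assert (df["amount"] > 0).all()
print(df)
```

Order matters here: standardising first is what exposes "USA" and "United States" as duplicates, and imputing with the mean is only one of several reasonable choices (median, a constant, or dropping the row are others).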