Validating Data

With a large database, it is often hard to tell what the data actually contains, so you can't be sure that your analysis will work as intended and produce valid results. Validating your data before using it ensures that the data is at least close to what you expect it to be.

Pandas DataFrame

Instead of opening a file, let's make our own data. We will use the pandas library to create a DataFrame.

Remember that, to do this, we start from a dictionary. The keys of this dictionary are the column labels, while the values are the data itself.
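
As a minimal sketch of this step, the snippet below builds a DataFrame from a dictionary. The names, ages, and the variable names `data` and `people` are made up for illustration; the last row deliberately repeats an earlier one.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
# Keys become column labels; each list holds that column's values.
data = {
    "Name": ["Ann", "Bob", "Cara", "Dan", "Bob"],
    "Age": [23, 31, 27, 35, 31],
    "Gender": ["F", "M", "F", "M", "M"],
}

people = pd.DataFrame(data)
print(people)
```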

Documentation: DataFrame

Removing Duplicates

Duplicates in data can skew the results of your analysis. The previous example contains a duplicate row. Because the data is in a DataFrame, we can easily remove duplicate rows with drop_duplicates().
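
A minimal sketch of removing duplicates, assuming a small made-up DataFrame (the `people` variable and its values are hypothetical) whose last row repeats an earlier one. By default, drop_duplicates() returns a new DataFrame and keeps the first occurrence of each repeated row.

```python
import pandas as pd

# Hypothetical sample data; the last row duplicates the second one.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan", "Bob"],
    "Age": [23, 31, 27, 35, 31],
    "Gender": ["F", "M", "F", "M", "M"],
})

# duplicated() marks every row that repeats an earlier row.
print(people.duplicated())

# drop_duplicates() returns a copy without the repeated rows.
deduped = people.drop_duplicates()
print(len(people), len(deduped))
```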

Documentation: drop_duplicates

Creating a Data Map and Data Plan

A data map is an overview of the dataset. In our example, we build the data map around "Gender", with "Name" and "Age" as the remaining columns. Because we group by "Gender", the statistics show how many F and M entries appear in the data, along with each group's mean, standard deviation, minimum, maximum, and 25%, 50%, and 75% quartiles.

The groupby() function places the "Name" and "Age" data into groups. To determine whether the data map is viable, you obtain statistics using describe(). The output shows statistics only for "Age", because "Name" is not numeric.
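
The grouping described above can be sketched as follows, again assuming a small made-up DataFrame (`people` and its values are hypothetical) with duplicates already removed.

```python
import pandas as pd

# Hypothetical sample data with the duplicate row removed.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan", "Bob"],
    "Age": [23, 31, 27, 35, 31],
    "Gender": ["F", "M", "F", "M", "M"],
}).drop_duplicates()

# One row per gender; the columns are the statistics for "Age".
stats = people.groupby("Gender").describe()
print(stats)
```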

Documentation: groupby, describe

Of course, you may not want all the data that describe() provides. Say we just want to see the count and mean.
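
One possible way to keep only the count and mean is to select those columns from the result of describe(). This is a sketch on made-up data (`people` is hypothetical), not necessarily the selection method used in the lesson.

```python
import pandas as pd

# Hypothetical sample data with the duplicate row removed.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan", "Bob"],
    "Age": [23, 31, 27, 35, 31],
    "Gender": ["F", "M", "F", "M", "M"],
}).drop_duplicates()

stats = people.groupby("Gender").describe()

# The result has two column levels: the column name ("Age"),
# then the statistic. Pick "Age" first, then the statistics we want.
count_mean = stats["Age"][["count", "mean"]]
print(count_mean)
```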

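In this example, we group by "Age" instead of "Gender". Note that the statistics are different: the header now contains "Gender" and "Name", and because both columns are non-numeric, describe() reports count, unique, top, and freq rather than the mean and quartiles.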
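
A sketch of grouping by age on made-up data (`people` is hypothetical). With "Age" used as the grouping key, only the text columns remain to be described.

```python
import pandas as pd

# Hypothetical sample data with the duplicate row removed.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan", "Bob"],
    "Age": [23, 31, 27, 35, 31],
    "Gender": ["F", "M", "F", "M", "M"],
}).drop_duplicates()

# "Age" becomes the row index; "Name" and "Gender" are summarized
# with the statistics pandas uses for non-numeric data.
by_age = people.groupby("Age").describe()
print(by_age)
```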

Although the unstacked data is relatively easy to read and compare, you can add .stack() for a more compact presentation.

Documentation: stack
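
A minimal sketch of stacking the statistics, on made-up data (`people` is hypothetical). stack() moves the statistic names from the column header into the row index, so each gender gets one row per statistic. Some pandas versions may print a FutureWarning about the stack implementation here.

```python
import pandas as pd

# Hypothetical sample data with the duplicate row removed.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan", "Bob"],
    "Age": [23, 31, 27, 35, 31],
    "Gender": ["F", "M", "F", "M", "M"],
}).drop_duplicates()

# Rows are now (Gender, statistic) pairs with a single "Age" column.
stacked = people.groupby("Gender").describe().stack()
print(stacked)
```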

Exercise 1.3

The CSV file contains the sex, September weight (kg), April weight (kg), September BMI, and April BMI of 67 college freshmen.

Read the CSV file as in the previous lesson. Then, remove the duplicates in the data.
Check the number of rows using len() before and after you remove the duplicates. Print the results in the following format:
Number of data (original) : {number of data}
Number of data (no duplicates) : {num. of data with no duplicate}
Group the data by sex, then find its statistics using describe(). Print the result in stacked form using stack().
Print only the maximum and minimum values from those statistics.
