Sometimes the data you receive is missing information in specific fields. Having a strategy for dealing with missing data is important.
Finding missing data in your dataset is essential to avoid getting incorrect results from your analysis. The file bmi.csv used here is missing a few values.
By looking at the CSV directly, it is hard to tell where the missing data is. Using .isnull() clearly shows where the missing values are.
Try uncommenting line 5 and notice the difference. NaN stands for Not a Number.
After you figure out that your dataset is missing information, you need to consider what to do about it. You can always ignore it, or you can do something about it. We will look at two possibilities. (The CSV data is reduced to 20 entries for now so you can see the whole dataset easily.)
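As a separate sketch (not the lesson's own code cell), the detection step can be tried on a tiny in-memory table. The column names other than "Weight(Sep)" and all values below are made up for illustration; the real bmi.csv may look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for bmi.csv; only "Weight(Sep)" is a name from the lesson.
csv_text = """Name,Height,Weight(Sep)
Ann,160,55
Bob,172,
Cid,,68
"""

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields in the CSV are read as NaN.
print(df)

# isnull() returns a table of the same shape: True marks a missing cell.
mask = df.isnull()
print(mask)

# Summing the True values gives the number of missing cells per column.
missing_per_column = mask.sum()
print(missing_per_column)
```

Printing the mask next to the original table makes the gaps easy to spot, and the per-column counts give a quick summary when the table is too long to read.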
Filling in Missing Data
The following example shows one technique for filling in missing data. Here we fill the missing values in the "Weight(Sep)" column with that column's mean.
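A minimal sketch of this technique, assuming a small made-up table in place of bmi.csv (only the column name "Weight(Sep)" is taken from the lesson):

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real bmi.csv data may differ.
df = pd.DataFrame({"Weight(Sep)": [50.0, np.nan, 70.0, 60.0]})

# mean() skips NaN by default, so this averages only 50, 70 and 60.
mean_weight = df["Weight(Sep)"].mean()

# fillna() returns a new Series, so assign it back to the column.
df["Weight(Sep)"] = df["Weight(Sep)"].fillna(mean_weight)
print(df)
```

Filling with the mean keeps every row in the table, at the cost of replacing unknown values with an estimate.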
Remove (drop) the Missing Entries
This time, instead of filling, we remove (drop) the entries that have missing data, using .dropna(). Note that some index values are missing from the new table because their rows were removed.
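The effect on the index can be seen in a short sketch. The table below is made up for illustration; "Height" is an assumed column name.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each have one missing value.
df = pd.DataFrame({
    "Height": [160.0, 172.0, np.nan, 181.0],
    "Weight(Sep)": [55.0, np.nan, 68.0, 77.0],
})

# dropna() returns a new table without the rows that contain any NaN.
cleaned = df.dropna()
print(cleaned)

# The surviving rows keep their original labels, so the index has gaps.
print(list(cleaned.index))
```

The original table is left unchanged, since dropna() returns a copy unless told otherwise.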
We have homes.csv with some missing data. It lists fifty home sales, with selling price, listing price, living space, rooms, bedrooms, bathrooms, age, acreage, and taxes.
The missing data in the CSV is marked with '-'. Find a way to replace '-' with NaN.
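As a hint, here is one possible approach on a tiny made-up table rather than homes.csv itself. The column names "Sell", "List" and "Baths" are assumptions; check the real header row before reusing them.

```python
import io
import pandas as pd

# Hypothetical excerpt; the real homes.csv has more columns and rows.
csv_text = """Sell,List,Baths
142,160,1
175,-,1.5
129,132,-
"""

# na_values tells read_csv which extra strings to treat as missing.
homes = pd.read_csv(io.StringIO(csv_text), na_values=["-"])

# The '-' cells are now NaN, and the affected columns are numeric.
print(homes)
print(homes.isnull())
```

Converting the placeholder while reading keeps the columns numeric, which matters later for calls such as .median().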
Print the whole homes.csv using .isnull() to see where the missing data is.
The number of bathrooms is not a big deal.
Fill the missing numbers of bathrooms with the median. (Use .median())
Then, buying a house without a listing price is hard, so drop all the entries without a listing price.
Some entries have 3 or more missing values.
Find a way to drop all the entries with 3 or more missing values.
After that, save the new data in a variable.
Find a way to reset the index in that new variable.
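As a hint for the last three steps, here is a sketch on a made-up four-column table. The column names and values are hypothetical, and this is only one of several ways to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical data. Missing values per row: 0, 3, 2, 1.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [1.0, np.nan, np.nan, 4.0],
    "d": [1.0, 2.0, 3.0, 4.0],
})

# Count the missing values in each row and keep rows with fewer than 3.
missing_per_row = df.isnull().sum(axis=1)
kept = df[missing_per_row < 3]

# Equivalent with dropna: thresh is the minimum number of non-missing values
# a row needs in order to be kept, so "at most 2 missing" means columns - 2.
same = df.dropna(thresh=df.shape[1] - 2)

# drop=True discards the old labels instead of adding them as a column.
kept = kept.reset_index(drop=True)
print(kept)
```

After the reset, the rows are numbered 0, 1, 2 again with no gaps.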