Data Cleaning: Our data cleaning process aimed to fill in missing values, smooth out noise, identify outliers, and correct inconsistencies in the data. Missing values in the features were either imputed or removed.
Ensuring that every column in the dataset had the correct data type was a crucial step. To handle missing values, we used the following method:
"Use the most probable value to fill in the missing value": We employed the 'mode()' function to find the most frequent value in each affected column and filled the gaps with it, because the missing values in our dataset occurred in categorical and binary columns.
Data distribution: Different data types require different types of analysis, so we applied different methods to numerical and categorical features.
For Numerical Features:
We used histograms or kernel density estimation (KDE) plots to visualize the distribution.
We then checked the skewness to understand the shape of the distribution, both before and after removing outliers.
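The before-and-after skewness check can be sketched as below. This is only an illustration: the sample values are made up, and the 1.5 x IQR rule used to flag outliers is an assumption (it matches the usual box plot whiskers), since the text does not state which rule was applied.

```python
import pandas as pd

# Made-up numerical feature with one extreme value.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 40.0])

def within_iqr_fence(s, k=1.5):
    """Boolean mask of values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

skew_before = values.skew()               # sample skewness with the outlier
trimmed = values[within_iqr_fence(values)]
skew_after = trimmed.skew()               # sample skewness after dropping it
```

A clearly positive skew_before that moves toward zero in skew_after indicates that the long right tail was driven by the removed points.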
For Categorical Features:
We used bar plots or pie charts to visualize the distribution of categories.
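The numbers behind a bar plot or pie chart are category counts and shares, which can be computed as in this small sketch. The gender values here are invented for illustration.

```python
import pandas as pd

# Invented categorical feature.
gender = pd.Series(["Female", "Male", "Female", "Female", "Male", "Other"])

counts = gender.value_counts()                 # absolute frequency, sorted descending
shares = gender.value_counts(normalize=True)   # relative frequency, sums to 1
```

If matplotlib is installed, counts.plot(kind="bar") draws the corresponding bar plot.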
Data Pre-processing: We analyzed each attribute statistically by plotting box plots and found numerous outliers. We handled these outliers by dropping them, but even after dropping, some outliers remained in certain attributes (e.g., BMI). For such attributes, we used the capping method.
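Capping replaces values beyond a limit with the limit itself instead of deleting the row. The sketch below assumes the limits are the 1.5 x IQR fences; the text does not say which limits were used, and the BMI values and function names are hypothetical.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR (assumed limits)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(s, k=1.5):
    """Clip values outside the fences to the nearest fence; no rows are dropped."""
    lower, upper = iqr_bounds(s, k)
    return s.clip(lower=lower, upper=upper)

# Hypothetical BMI values with one extreme entry.
bmi = pd.Series([18.5, 21.0, 22.4, 24.9, 25.3, 27.1, 29.8, 31.2, 58.0])
capped = cap_outliers(bmi)
```

Unlike dropping, capping keeps the number of rows unchanged, which is useful when the same rows hold valid values in other attributes.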
We used the Ordinal Encoding method to convert the smoking_history attribute into numerical form because its categories have an order, e.g., never = 0, no_info, etc.
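Ordinal encoding with an explicit mapping can be sketched as follows. Only never = 0 comes from the text above. The other category names and every other code in the mapping are placeholder assumptions, as is the helper encode_ordinal.

```python
import pandas as pd

# Only "never": 0 is taken from the text. All other entries are placeholders
# chosen for illustration and may differ from the mapping actually used.
SMOKING_ORDER = {"never": 0, "no_info": 1, "former": 2, "current": 3}

def encode_ordinal(s, mapping):
    """Map each category to its integer code; fail loudly on unmapped categories."""
    unknown = set(s.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped categories: {sorted(unknown)}")
    return s.map(mapping)

smoking = pd.Series(["never", "current", "no_info", "never", "former"])
encoded = encode_ordinal(smoking, SMOKING_ORDER)
```

Raising an error on unmapped categories avoids silently introducing new missing values, which Series.map would otherwise produce.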