The data is cleaned and preprocessed through a series of steps to prepare it for further analysis and modeling.
First, irrelevant columns, including "Name," "SSN," "ID," and "Customer_ID," are removed. Next, a function named `adjustCreditHistoryAge` is defined to convert credit history age into months and is applied to the `Credit_History_Age` column. Subsequently, the `replaceSpecialCharacters` function is defined and applied to strip special characters from string values in specific columns (such as `Age` and `Annual_Income`), converting them into numeric values.
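The following is a minimal sketch of these first steps, not the original implementation. The function names come from the text, but their bodies are assumptions: the sketch assumes that `Credit_History_Age` holds strings of the form "22 Years and 1 Months" and that the special characters are stray non-numeric symbols such as a trailing underscore. The sample rows are invented for illustration.

```python
import re

import numpy as np
import pandas as pd


def adjustCreditHistoryAge(value):
    """Convert a string such as '22 Years and 1 Months' to total months.

    The input format is an assumption; unparseable values become NaN.
    """
    if pd.isna(value):
        return np.nan
    match = re.search(r"(\d+)\s*Years?\D+(\d+)\s*Months?", str(value))
    if match is None:
        return np.nan
    return int(match.group(1)) * 12 + int(match.group(2))


def replaceSpecialCharacters(series):
    """Keep only digits, dots and minus signs, then convert to numbers."""
    cleaned = series.astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical sample rows, for illustration only.
df = pd.DataFrame({
    "Name": ["A", "B"],
    "Credit_History_Age": ["22 Years and 1 Months", None],
    "Age": ["28_", "34"],
    "Annual_Income": ["19114.12_", "34847.84"],
})

# Drop identifier columns; errors="ignore" skips any that are absent.
df = df.drop(columns=["Name", "SSN", "ID", "Customer_ID"], errors="ignore")

df["Credit_History_Age"] = df["Credit_History_Age"].apply(adjustCreditHistoryAge)
for col in ["Age", "Annual_Income"]:
    df[col] = replaceSpecialCharacters(df[col])
```

Using `errors="coerce"` turns values that still cannot be parsed into missing values instead of raising an error, so they can be handled in the later missing-value step.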
To handle outliers, the `removeAgeOutliers` and `removeOtherOutliers` functions are defined, which remove outliers in the age and other columns, respectively. Then specific columns with missing values are dropped. Following this, the `Type_of_Loan` column is processed to convert it into separate loan type count columns.
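The text does not state which outlier rule was used, so the sketch below substitutes common choices and should be read as an assumption throughout: a fixed plausible range for age and a 1.5 × IQR rule for other numeric columns. The helper `countLoanTypes` is a hypothetical name, and the comma-separated format of `Type_of_Loan` (with a trailing "and") is likewise assumed.

```python
import pandas as pd


def removeAgeOutliers(df, low=18, high=100):
    """Keep rows whose Age lies in an assumed plausible range."""
    return df[df["Age"].between(low, high)]


def removeOtherOutliers(df, columns, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR] in any given column.

    Missing values are kept here so they can be handled separately.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        in_range = df[col].between(q1 - k * iqr, q3 + k * iqr)
        mask &= in_range | df[col].isna()
    return df[mask]


def countLoanTypes(df, column="Type_of_Loan"):
    """Replace the loan-list column with one count column per loan type."""
    items = (
        df[column]
        .fillna("")
        .str.replace(r"\band\b", ",", regex=True)  # treat "and" as a separator
        .str.split(",")
        .explode()
        .str.strip()
    )
    items = items[items != ""]
    # One indicator row per listed loan, summed back to one row per customer.
    counts = pd.get_dummies(items, dtype=int).groupby(level=0).sum()
    counts = counts.reindex(df.index, fill_value=0)
    return df.drop(columns=[column]).join(counts)


# Hypothetical sample rows, for illustration only.
df = pd.DataFrame({
    "Age": [25, 40, 5000, 33, 51, 29],
    "Annual_Income": [30000.0, 42000.0, 39000.0, 1.0e7, 36000.0, 33000.0],
    "Type_of_Loan": [
        "Auto Loan, and Student Loan",
        None,
        "Auto Loan",
        "Payday Loan",
        "Auto Loan, Auto Loan, and Payday Loan",
        "Student Loan",
    ],
})

df = removeAgeOutliers(df)
df = removeOtherOutliers(df, ["Annual_Income"])
df = countLoanTypes(df)
```

A row that lists the same loan type twice receives a count of 2 in that type's column, which is what distinguishes these count columns from a plain one-hot encoding.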
Additionally, rows with specific abnormal values, such as occupation being "_______" or minimum payment being "NM," are removed. To ensure the correct data types, several columns are converted to numeric types.
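As an illustration of this filtering and type conversion, the sketch below uses assumed column names (`Occupation`, `Payment_of_Min_Amount`, and `Num_of_Loan`), since the text does not name the columns involved. The placeholder values "_______" and "NM" are taken from the text, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions.
df = pd.DataFrame({
    "Occupation": ["Scientist", "_______", "Teacher", "Lawyer"],
    "Payment_of_Min_Amount": ["Yes", "No", "NM", "No"],
    "Num_of_Loan": ["4", "2", "1", "3_"],
})

# Flag rows that contain either placeholder value, then drop them.
abnormal = (df["Occupation"] == "_______") | (df["Payment_of_Min_Amount"] == "NM")
df = df[~abnormal].copy()

# Convert selected columns to numeric; unparseable entries become NaN.
numeric_columns = ["Num_of_Loan"]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")
```

Taking a `.copy()` after filtering avoids pandas' chained-assignment warning when the numeric columns are overwritten afterwards.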
Next, categorical variables with no meaningful ordering are one-hot encoded using the `pd.get_dummies` function, while the remaining ordinal variables are converted to integers according to their rank in the corresponding sequence. The `Payment_Behaviour` column is further processed by splitting it into "Spent_Amount_Payment_Behaviour" and "Value_Amount_Payment_Behaviour" columns and mapping the corresponding categorical values to numbers. Finally, the values in the "Credit_Score" column are mapped to numeric values, and data statistics, including the number of missing values and the total number of rows, are printed.
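The encoding steps can be sketched as follows. This is an illustrative example under stated assumptions, not the original code: the category labels, the integer codes assigned to them, the choice of `Occupation` and `Credit_Mix` as example columns, and the "High_spent_Small_value_payments" string pattern are all assumed. Only the output column names and `pd.get_dummies` come from the text.

```python
import pandas as pd

# Hypothetical sample rows; labels and formats are assumptions.
df = pd.DataFrame({
    "Occupation": ["Scientist", "Teacher", "Lawyer"],
    "Credit_Mix": ["Good", "Bad", "Standard"],
    "Payment_Behaviour": [
        "High_spent_Small_value_payments",
        "Low_spent_Large_value_payments",
        "High_spent_Medium_value_payments",
    ],
    "Credit_Score": ["Good", "Poor", "Standard"],
})

# Nominal variable (no ordering): one-hot encode.
df = pd.get_dummies(df, columns=["Occupation"], dtype=int)

# Ordinal variable: map each label to its rank in the assumed ordering.
df["Credit_Mix"] = df["Credit_Mix"].map({"Bad": 0, "Standard": 1, "Good": 2})

# Split Payment_Behaviour into its spending level and payment size.
parts = df["Payment_Behaviour"].str.extract(
    r"^(?P<spent>[A-Za-z]+)_spent_(?P<value>[A-Za-z]+)_value_payments$"
)
df["Spent_Amount_Payment_Behaviour"] = parts["spent"].map({"Low": 0, "High": 1})
df["Value_Amount_Payment_Behaviour"] = parts["value"].map(
    {"Small": 0, "Medium": 1, "Large": 2}
)
df = df.drop(columns=["Payment_Behaviour"])

# Target variable: map to integers in the assumed order.
df["Credit_Score"] = df["Credit_Score"].map({"Poor": 0, "Standard": 1, "Good": 2})

# Summary statistics.
print("Missing values per column:")
print(df.isna().sum())
print("Total rows:", len(df))
```

Values that do not match the assumed pattern or mapping come out as missing, so the printed missing-value counts also serve as a check that every category was covered.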
These steps effectively clean and organize the data, preparing it for subsequent analysis and modeling[3].