Data Preprocessing:
Dropping irrelevant columns:
•patient_id: A unique identifier that carries no information about the prediction target.
•patient_gender: A constant column (the same value in every row).
•breast_cancer_diagnosis_desc: Fully determined by breast_cancer_diagnosis_code, so it adds no information beyond the code.
•male: Redundant, since it equals 100 – female.
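
The checks behind these drops can be sketched in pandas. This is a minimal illustration rather than the original pipeline: the function name `drop_irrelevant` and the 0.1 tolerance are assumptions, and each column is dropped only if the stated reason actually holds in the data.

```python
import numpy as np
import pandas as pd


def drop_irrelevant(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier, constant, and redundant columns (hypothetical helper)."""
    to_drop = {"patient_id"}  # unique identifier, not a feature

    # Constant columns: a single distinct value, counting NaN as a value.
    to_drop |= {c for c in df.columns if df[c].nunique(dropna=False) == 1}

    # The description is redundant if every code maps to at most one description.
    desc_per_code = df.groupby("breast_cancer_diagnosis_code")[
        "breast_cancer_diagnosis_desc"
    ].nunique()
    if (desc_per_code <= 1).all():
        to_drop.add("breast_cancer_diagnosis_desc")

    # `male` is redundant if male + female is 100 wherever both are present.
    # The tolerance of 0.1 is an assumption to absorb rounding.
    total = (df["male"] + df["female"]).dropna()
    if np.allclose(total, 100, atol=0.1):
        to_drop.add("male")

    return df.drop(columns=[c for c in df.columns if c in to_drop])
```

Verifying each condition before dropping keeps the step safe to rerun on a new data split where, for example, patient_gender might no longer be constant.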
Handling noisy data:
•Incorrect patient_state and Division values were corrected using patient_zip3.
•Population data was fixed to be consistent with patient_zip3.
•Noise was detected in breast_cancer_diagnosis_code and handled as follows:
    •Male-specific codes were changed to their female equivalents.
    •Mistyped codes were corrected.
    •ICD-9 codes were recoded to ICD-10.
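
Two of the fixes above can be sketched as follows. The source does not say how the corrections were made, so both helpers are assumptions: `harmonize_by_zip3` uses a majority vote within each patient_zip3, and `male_to_female_code` relies on the ICD-10-CM C50 layout in which the fifth character is 1 for female breast and 2 for male breast (for example C50.421 versus C50.411). Codes are accepted with or without the dot.

```python
import re

import pandas as pd


def _most_common(s: pd.Series):
    """Most frequent non-missing value, or None if the group is all missing."""
    modes = s.mode()  # ignores NaN by default
    return modes.iloc[0] if len(modes) else None


def harmonize_by_zip3(df, cols=("patient_state", "Division")):
    """Replace each column with its majority value per patient_zip3.

    Majority voting is an assumption; it also fills missing entries.
    """
    out = df.copy()
    for col in cols:
        majority = out.groupby("patient_zip3")[col].agg(_most_common)
        out[col] = out["patient_zip3"].map(majority)
    return out


def male_to_female_code(code: str) -> str:
    """Map a male C50 code to its female counterpart; leave other codes as-is."""
    # C50 + site digit + sex digit (2 = male) + laterality digit
    return re.sub(r"^(C50\.?\d)2(\d)$", r"\g<1>1\2", code)
```

Applied to a column, the second helper would be used as `df["breast_cancer_diagnosis_code"].map(male_to_female_code)`. Mistyped codes and the ICD-9 to ICD-10 recoding need an explicit lookup table, which is not given here.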
Imputing missing data:
•Temperature data was sorted by location and time. Then, missing values were imputed using forward fill followed by backward fill within each location group.
•Missing values in population data were imputed using the mean value for each corresponding state.
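
Both imputation steps map onto grouped pandas operations. The sketch below assumes a long-format temperature table with hypothetical columns `location`, `date`, and `temperature`, and a population column grouped by `patient_state`; the actual column names and layout may differ.

```python
import pandas as pd


def impute_temperature(temps: pd.DataFrame) -> pd.DataFrame:
    """Forward fill, then backward fill, within each location."""
    out = temps.sort_values(["location", "date"]).copy()
    # transform keeps the original index, so values never leak across locations.
    out["temperature"] = out.groupby("location")["temperature"].transform(
        lambda s: s.ffill().bfill()
    )
    return out


def impute_by_state_mean(df: pd.DataFrame, cols, state_col="patient_state"):
    """Fill missing values in `cols` with the mean of the same state."""
    out = df.copy()
    for col in cols:
        state_mean = out.groupby(state_col)[col].transform("mean")
        out[col] = out[col].fillna(state_mean)
    return out
```

The backward fill only matters for gaps at the start of a location's series, where there is no earlier value to carry forward. A location or state with no observed values at all stays missing.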
Feature engineering:
•The bmi column was binned into groups.
•Because of the high percentage of missing values (NaN) in columns such as metastatic_first_novel_treatment, metastatic_first_novel_treatment_type, and patient_race, missing entries in these columns were filled with placeholder values.
•Boolean indicator columns were created to flag missing values, so that the fact that a value was missing is preserved after the gaps are filled.
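
The three steps above might look like the following sketch. The bin edges (18.5, 25, 30), the group labels, the "Unknown" placeholder, and the `_missing` suffix are all assumptions chosen for illustration; the source does not state the actual values.

```python
import pandas as pd

# Columns named in the text as having a high share of missing values.
SPARSE_COLS = [
    "metastatic_first_novel_treatment",
    "metastatic_first_novel_treatment_type",
    "patient_race",
]


def engineer_features(df: pd.DataFrame, sparse_cols=SPARSE_COLS) -> pd.DataFrame:
    out = df.copy()

    # Flag missingness first, before the placeholder overwrites it.
    for col in sparse_cols:
        out[f"{col}_missing"] = out[col].isna()

    # Fill the gaps with a placeholder category (assumed label).
    out[sparse_cols] = out[sparse_cols].fillna("Unknown")

    # Bin BMI into left-closed intervals; the cut-offs are assumed.
    out["bmi_group"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,
    )
    return out
```

Rows with a missing bmi get a missing bmi_group here; they could equally be assigned their own category, depending on the model used downstream.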