The WiDS Datathon 2024 focuses on predicting the metastatic diagnosis period for breast cancer patients using a dataset comprising patient characteristics, medical history, demographic information, and climate data. Our goal is to develop a predictive model that accurately estimates the metastatic diagnosis period, leveraging the provided data and addressing real-world data challenges such as missing values.
PROJECT PHASES:
1] DATA EXPLORATION
2] DATA PRE-PROCESSING
3] MODEL BUILDING
4] MODEL EVALUATION AND VALIDATION
5] PREDICTION AND SUBMISSION
1] DATA EXPLORATION
The dataset contains 13,173 rows and 152 columns. We examined it to understand its structure and the relationships between features. This initial phase involved summarizing key statistics, visualizing distributions, and identifying patterns and anomalies. We explored the demographic, medical, and climate data to gain insight into their potential impact on the metastatic diagnosis period.
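The kind of summary described above can be sketched with pandas. This is a minimal illustration, not the exact code we ran: the `summarize` helper, the toy DataFrame, and all column names in it are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect basic structural facts about a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,
        "n_duplicates": int(df.duplicated().sum()),
        # Fraction of missing values per column, worst first.
        "missing_fraction": df.isna().mean().sort_values(ascending=False),
        # Count, mean, std, and quartiles for the numeric columns.
        "numeric_summary": df.describe(),
        # Number of distinct non-missing values per non-numeric column.
        "categorical_cardinality": df.select_dtypes(exclude="number").nunique(),
    }


# Toy stand-in for the real data; column names are assumed for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "patient_age": rng.integers(30, 90, size=8).astype(float),
    "payer_type": ["COMMERCIAL", "MEDICAID", None, "COMMERCIAL",
                   "MEDICARE", "MEDICAID", "COMMERCIAL", None],
    "avg_temp": rng.normal(55, 10, size=8),
    "metastatic_diagnosis_period": rng.integers(0, 365, size=8),
})
toy.loc[1, "patient_age"] = np.nan

report = summarize(toy)
print(report["shape"])
print(report["missing_fraction"])
```

On the real data, the same helper would be called on the DataFrame loaded from the competition's training file.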
2] DATA PRE-PROCESSING
In this phase, we cleaned and transformed the data to make it suitable for modeling. The dataset contained no duplicate rows. Missing values were imputed: categorical features with the mode and numerical features with the mean. Categorical variables were then encoded numerically, and numerical features were scaled to standardize their ranges. We also engineered new features from existing data, such as aggregated climate variables and interaction terms. These steps were important for improving model performance and for preparing the dataset for the machine learning pipeline.
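One way to express the imputation, encoding, and scaling steps is a scikit-learn `ColumnTransformer`. The sketch below is illustrative only: the column names and toy data are assumptions, and one-hot encoding is assumed as the encoding scheme because the write-up does not specify one. Feature engineering is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, chosen for illustration.
numeric_cols = ["patient_age", "avg_temp"]
categorical_cols = ["payer_type"]

preprocessor = ColumnTransformer(
    [
        # Numerical features: fill gaps with the mean, then standardize.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical features: fill gaps with the mode, then one-hot encode
        # (the encoding scheme is an assumption of this sketch).
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

toy = pd.DataFrame({
    "patient_age": [45.0, np.nan, 61.0, 70.0],
    "avg_temp": [50.0, 60.0, np.nan, 70.0],
    "payer_type": ["COMMERCIAL", np.nan, "MEDICAID", "COMMERCIAL"],
})
toy = toy.drop_duplicates()  # no-op here, mirrors the duplicate check

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Fitting the transformer on the training data and only calling `transform` on the test data keeps the imputation and scaling statistics from leaking test information into the model.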
3] MODEL BUILDING
We experimented with multiple machine learning algorithms to identify the best predictive model: Linear Regression, Random Forest, Gradient Boosting, and XGBoost. Each model was evaluated with cross-validation to estimate its out-of-sample performance and to detect overfitting. We tuned hyperparameters with grid search and randomized search. Our goal was a model that accurately predicts the metastatic diagnosis period while generalizing well to unseen data.
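The comparison and tuning loop can be sketched as follows. This is a minimal example under stated assumptions: it uses synthetic data from `make_regression` in place of the competition data, the parameter grid is arbitrary, and XGBoost is left out so that the block depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the preprocessed training data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean cross-validated RMSE per model (scikit-learn reports it negated).
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

# Grid search over a small, arbitrary grid for one of the candidates.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print(cv_rmse)
print(search.best_params_)
```

For larger search spaces, `RandomizedSearchCV` takes `param_distributions` and `n_iter` in place of `param_grid` and samples a fixed number of configurations instead of trying all of them.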
4] MODEL EVALUATION AND VALIDATION
The models were assessed using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared. We validated the models on held-out data subsets to check that they were not overfitting and could generalize to new data. This phase involved iterative testing and refinement, including analysis of the error distributions, until the models met the desired accuracy and reliability standards. This evaluation process helped us select the most robust model for the final predictions.
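For reference, the three metrics can be computed directly from the residuals. The `regression_metrics` helper and the four example values below are made up for illustration and are not results from our models.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and R-squared from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # penalizes large errors more
    ss_res = np.sum(residuals ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": float(mae), "rmse": float(rmse), "r2": float(r2),
            "residuals": residuals}


# Made-up values: residuals are [-2, 2, -3, 3].
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics["mae"])   # 2.5
print(metrics["rmse"])  # sqrt(6.5)
print(metrics["r2"])    # 1 - 26/500 = 0.948
```

The returned `residuals` array is what an error-distribution analysis would plot or summarize.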
5] PREDICTION AND SUBMISSION
After selecting the best-performing model, we trained it on the entire training dataset to maximize its predictive power. We then used this final model to generate predictions for the test dataset, focusing on accuracy and completeness. The predicted metastatic diagnosis periods were compiled into a submission file as per the competition requirements. We ensured that all rows in the test set were accounted for, and the submission file was formatted correctly to avoid any issues during evaluation. This final step was critical for our standing in the competition.
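The refit, predict, and export steps might look like the sketch below. Everything specific in it is an assumption: the data is synthetic, `GradientBoostingRegressor` is only a placeholder for whichever model was selected, and the `patient_id` and `metastatic_diagnosis_period` column names stand in for whatever the competition's submission format requires.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
feature_cols = ["f1", "f2", "f3"]  # hypothetical feature names

# Synthetic stand-ins for the preprocessed train and test sets.
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=feature_cols)
train["metastatic_diagnosis_period"] = (
    100 + 20 * train["f1"] + rng.normal(size=100)
)
test = pd.DataFrame(rng.normal(size=(25, 3)), columns=feature_cols)
test.insert(0, "patient_id", np.arange(1000, 1025))

# Refit the chosen model on the full training set (placeholder model).
final_model = GradientBoostingRegressor(random_state=0)
final_model.fit(train[feature_cols], train["metastatic_diagnosis_period"])

submission = pd.DataFrame({
    "patient_id": test["patient_id"],
    "metastatic_diagnosis_period": final_model.predict(test[feature_cols]),
})

# Completeness checks: one prediction per test row, no gaps, no repeated IDs.
assert len(submission) == len(test)
assert submission["patient_id"].is_unique
assert submission.notna().all().all()

# Written to a temporary directory here; a real run would use a fixed path.
out_path = Path(tempfile.mkdtemp()) / "submission.csv"
submission.to_csv(out_path, index=False)
print(out_path)
```

Passing `index=False` keeps the pandas row index out of the file, which is a common cause of rejected submissions.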