The WiDS Datathon 2024 focuses on a prediction task using a dataset of approximately 19,000 records, split into training and test sets. The dataset includes detailed patient characteristics such as age, race, BMI, and zip code, along with diagnosis and treatment information (e.g., breast cancer diagnosis code, metastatic cancer diagnosis code, treatments). Additionally, it incorporates geo-demographic data at the zip-code level (e.g., income, education, rent, race, poverty) and climate data linking health outcomes to external conditions.
Each row in the data corresponds to a single patient and includes their Diagnosis Period. The task is to predict the Metastatic Diagnosis Period for patients in the Test Dataset using the provided characteristics and patient information. The dataset may include messy data typical of real-world scenarios, and participants are expected to address these issues appropriately.
You are provided with:
Training Dataset (train.csv): Contains observed values of the outcome [Metastatic Diagnosis Period] for each row.
Test Dataset (test.csv): Omits the observed values of the outcome [Metastatic Diagnosis Period] for each row; these are the values to be predicted.
Example Solution File: Provided as a reference for preparing submissions.
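As a starting point for working with these files, the following is a minimal sketch (not an official loader) that reads a CSV into rows, reports the fraction of missing values per column, and separates the outcome from the predictors. It uses only the Python standard library. The column name `metastatic_diagnosis_period` and the set of strings treated as missing are assumptions; check them against the actual header and contents of train.csv.

```python
import csv
from collections import Counter

# Assumed name of the outcome column; confirm against the train.csv header.
TARGET = "metastatic_diagnosis_period"

# Assumed spellings of a missing value; adjust after inspecting the data.
MISSING_TOKENS = {"", "na", "nan", "null"}


def read_rows(handle):
    """Read an open CSV file object into a list of dicts, one per patient row."""
    return list(csv.DictReader(handle))


def missing_fraction(rows):
    """Return {column: fraction of rows with a missing value}.

    Columns with no missing values are left out of the result.
    """
    if not rows:
        return {}
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if col is None:
                # Extra fields beyond the header; ignored in this sketch.
                continue
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[col] += 1
    return {col: n / len(rows) for col, n in counts.items()}


def split_features_target(rows, target=TARGET):
    """Separate the outcome column from the predictor columns."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


# Example usage (assumes train.csv is in the working directory):
#   with open("train.csv", newline="") as f:
#       train_rows = read_rows(f)
#   print(missing_fraction(train_rows))
#   X_rows, y = split_features_target(train_rows)
```

Values are kept as strings here, so numeric columns still need to be converted before modeling. The missing-value report is one way to decide which columns need imputation or removal.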
The primary data source for this datathon challenge is a real-world evidence dataset from HealthVerity (HV), one of the largest healthcare data ecosystems in the US. Specifically, the HV dataset used in this challenge contains health-related information on patients in the US diagnosed with metastatic triple-negative breast cancer.
Additionally, we have enriched the dataset with the US Zip Codes Database, meticulously compiled from authoritative sources such as the U.S. Postal Service™, U.S. Census Bureau, National Weather Service, American Community Survey, and the IRS. This enrichment provides additional socio-economic information based on the geographical locations of the patients.
Furthermore, the dataset has been augmented with zip-code-level climate data to explore relationships between health outcomes and climate patterns.
The training dataset comprises 13,173 records and includes 152 attributes.