Affiliation: This project was done by Ravi Chacko, Charles D. Holmes, DoHyun Kim, Josh Siegel, and me as our final project for CSE 517, Machine Learning.
Project Description: An enormous amount of effort has gone into registering cancer cases in the US over the past 40 years. The Surveillance, Epidemiology, and End Results (SEER) database claims to contain 95% of all cancer diagnoses. These data give epidemiologists a tool for understanding trends in cancer cases. They also give oncologists a tool for providing prognostic information following a cancer diagnosis and for understanding how various clinical indicators and clinical interventions affect prognosis. The software provided with the SEER data (SEERStat) offers only linear and logistic regression to explore trends. For this reason, the majority of the analysis conducted on these data has remained limited to simple univariate linear relationships. Some machine learning tools have been applied to medical data, but their scope has been limited. The goal of our analysis was to clearly identify the factors that contribute to survival time following a cancer diagnosis, with the aim of improving the information available to both epidemiologists and clinicians, for example, which lifestyle changes would improve longevity. We also aimed to produce a high-accuracy model for predicting patient longevity after a cancer diagnosis.
Write-up attached below