Undergraduates participating in BDSI 2019 were assigned to one of three research groups: Machine Learning, Genomics, or Data Mining. Together with 14 other students, I was part of the Data Mining group. We were supervised by faculty mentors Prof. Johann Gagnon-Bartsch and Prof. Jonathan Terhorst and graduate student instructors Zoe Rehnberg and Anwesha Bhattacharyya in the Department of Statistics at the University of Michigan, Ann Arbor. The goal of our research project was to build classification methods that predict the effect of a chemotherapy drug on a cell line based on the cell line's genomic information.
Below you can find a week-by-week outline of our research and a broader project description. This information is also available in the file Research Description.
- Learned about different classification methods as a group.
- Selected 7 drugs.
- Started testing classification methods, initially only for the expression dataset.
- Began writing classifier scripts that can be used for all datasets.
- Worked on imputing the copynumber data.
Note: I have not included the file for copynumber imputation, as we later identified a mistake in it. We used the R package ‘missForest’. We also tried ‘MICE’, ‘Amelia’, and ‘Hmisc’.
- Continued working on single set classifiers.
- Started working on classifiers using multiple datasets.
- Cleaned the methylation dataset (splitting per drug and removing the columns with missing data).
- Worked on an implicit method for imputing the copynumber dataset.
- Redefined drug efficacy and selected drugs.
- Cleaned the screening, expression and methylation datasets.
Note: I have not included the code for cleaning the expression and methylation datasets or for splitting the methylation dataset, as these are identical to, respectively: Rehnberg, Expression Cleaning; Rehnberg, Methylation Cleaning; Research Group, Splitting Methylation.
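As a rough illustration of the methylation cleaning step listed above (splitting per drug and removing the columns with missing data), here is a minimal Python sketch. It is not the project's code, which is not included here. The data layout (a dictionary of rows keyed by cell line, `None` for a missing value, and a list of screened cell lines per drug) and all names such as `split_and_drop_missing`, `CL1`, and `drug_A` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical layout: one row per cell line, mapping feature name -> value,
# with None marking a missing measurement.
Row = Dict[str, Optional[float]]


def split_and_drop_missing(
    methylation: Dict[str, Row],
    screened: Dict[str, List[str]],
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """For each drug, keep the cell lines screened with that drug, then drop
    every feature (column) that is missing for at least one of them."""
    per_drug = {}
    for drug, cell_lines in screened.items():
        # Keep only screened cell lines that actually have methylation data.
        rows = {c: methylation[c] for c in cell_lines if c in methylation}
        if not rows:
            per_drug[drug] = {}
            continue
        # Assumes every row lists the same features.
        features = list(next(iter(rows.values())).keys())
        complete = [
            f for f in features
            if all(r.get(f) is not None for r in rows.values())
        ]
        per_drug[drug] = {
            c: {f: r[f] for f in complete} for c, r in rows.items()
        }
    return per_drug


# Made-up toy data, for illustration only.
methylation = {
    "CL1": {"cg01": 0.2, "cg02": None, "cg03": 0.9},
    "CL2": {"cg01": 0.4, "cg02": 0.5, "cg03": 0.1},
    "CL3": {"cg01": 0.7, "cg02": 0.3, "cg03": None},
}
screened = {"drug_A": ["CL1", "CL2"], "drug_B": ["CL2", "CL3"]}
result = split_and_drop_missing(methylation, screened)
```

In this sketch the split happens before the columns are dropped, so each drug's subset can retain a different set of columns. Whether the original scripts worked in that order is an assumption.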
- Ran classifiers and recorded performance.
- Compared performance with that under the old definition of efficacy.
- Started preparing for Symposium.
- Worked on the Abstract, Poster, and Presentation.
- Presented at Symposium.
Note: There is a mistake in the plot for drug 1054 in the presentation and poster. The plot currently shows perfect classifier performance