Research

Undergraduates participating in BDSI 2019 were assigned to one of three research groups: Machine Learning, Genomics, and Data Mining. Together with 14 other students, I was part of the Data Mining Group. We were supervised by Faculty Mentors Prof. Johann Gagnon-Bartsch and Prof. Jonathan Terhorst and Graduate Student Instructors Zoe Rehnberg and Anwesha Bhattacharyya at the Department of Statistics at the University of Michigan at Ann Arbor. The goal of our research project was to build classification methods that predict the effect of a chemotherapy drug on a cell line from the cell line's genomic information.


Below you can find a week-by-week summary of our research and a broader project description. This information is also available in the file Research Description.

Week 3

  • Selected 7 drugs.
  • Started testing classification methods, initially only for the expression dataset.
  • Began writing classifier scripts that can be used for all datasets.
  • Worked on imputing the copynumber data.

Note: I have not included the file for copynumber imputation, as we later identified a mistake in it. We used the R package ‘missForest’. We also tried ‘MICE’, ‘Amelia’, and ‘Hmisc’.

1. Research Group, KNN

2. Research Group, LDA

3. Research Group, Logistic Regression LASSO

4. Research Group, Naïve Bayes

5. Research Group, PCR Linear

6. Research Group, PCR Logistic

7. Research Group, Random Forest

8. Research Group, SVM with Linear Kernel

9. Research Group, SVM with Polynomial Kernel

10. Research Group, SVM with Radial Kernel

11. Research Group, SVM with Sigmoid Kernel
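
As a rough illustration of the simplest method in the list above, the following Python sketch shows how a k-nearest-neighbours classifier assigns a label by majority vote. The linked scripts are the actual implementations; this sketch is not taken from them. The function name `knn_predict`, the two-feature toy data, and the labels "sensitive" and "resistant" are all made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    distances.sort(key=lambda pair: pair[0])
    # Majority vote over the labels of the k closest samples.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): rows are cell lines, columns are two expression features.
train_x = [[0.1, 0.2], [0.0, 0.4], [0.2, 0.1], [1.0, 0.9], [0.9, 1.1], [1.2, 1.0]]
train_y = ["resistant", "resistant", "resistant", "sensitive", "sensitive", "sensitive"]

prediction = knn_predict(train_x, train_y, [0.95, 1.0], k=3)
print(prediction)
```

In a real expression dataset the number of features is far larger than the number of cell lines, so the choice of k and of the distance measure would need to be validated rather than fixed as here.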

Week 4

  • Continued working on single set classifiers.
  • Started working on classifiers using multiple datasets.
  • Cleaned the methylation dataset (splitting per drug and removing the columns with missing data).
  • Worked on an implicit method for imputing the copynumber dataset.
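
The methylation cleaning step above (splitting per drug and removing columns with missing data) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the linked script: it assumes that "splitting per drug" means keeping the cell lines screened against each drug, and the names `split_and_clean`, `methylation`, `probes`, and `screened`, as well as all values, are hypothetical.

```python
def split_and_clean(methylation, probes, screened):
    """For each drug, keep the cell lines screened on it, then drop every
    probe column that has a missing value among those cell lines."""
    cleaned = {}
    for drug, cell_lines in screened.items():
        # Keep only the cell lines that were screened against this drug.
        rows = {cl: methylation[cl] for cl in cell_lines if cl in methylation}
        # A column survives only if no kept cell line has a missing value in it.
        keep = [
            j for j in range(len(probes))
            if all(values[j] is not None for values in rows.values())
        ]
        cleaned[drug] = {
            "probes": [probes[j] for j in keep],
            "data": {cl: [values[j] for j in keep] for cl, values in rows.items()},
        }
    return cleaned


# Toy data (hypothetical): None marks a missing methylation value.
probes = ["p1", "p2", "p3"]
methylation = {
    "CL1": [0.2, None, 0.7],
    "CL2": [0.4, 0.1, 0.9],
    "CL3": [None, 0.3, 0.5],
}
screened = {"drugA": ["CL1", "CL2"], "drugB": ["CL2", "CL3"]}

cleaned = split_and_clean(methylation, probes, screened)
```

Dropping columns after splitting, rather than before, keeps more probes per drug, because a column is removed only when it is missing for a cell line that was actually screened on that drug.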


1. Research Group, Combined KNN voters

2. Research Group, Combined PCR

3. Research Group, Splitting Methylation

4. Research Group, Implicit Imputation via PCA
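
One plausible way to combine classifiers trained on different datasets is a majority vote over their per-cell-line predictions. The Python sketch below illustrates that idea only; whether the "Combined KNN voters" script listed above works this way is an assumption. The function `combine_votes`, the 0/1 labels, and the predictions are hypothetical.

```python
from collections import Counter


def combine_votes(predictions_by_dataset):
    """Majority vote per cell line across the predictions of per-dataset classifiers."""
    # Use the cell lines of the first dataset; all datasets are assumed to share them.
    cell_lines = next(iter(predictions_by_dataset.values())).keys()
    combined = {}
    for cl in cell_lines:
        votes = Counter(preds[cl] for preds in predictions_by_dataset.values())
        combined[cl] = votes.most_common(1)[0][0]
    return combined


# Hypothetical predictions (1 = drug effective, 0 = not effective) from three classifiers,
# each trained on a different dataset.
predictions = {
    "expression": {"CL1": 1, "CL2": 0, "CL3": 1},
    "methylation": {"CL1": 1, "CL2": 1, "CL3": 0},
    "copynumber": {"CL1": 0, "CL2": 0, "CL3": 0},
}

combined = combine_votes(predictions)
```

With an odd number of voters and two classes, as here, ties cannot occur; an even number of voters would need an explicit tie-breaking rule.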


Week 5

  • Redefined drug efficacy and selected drugs.
  • Cleaned the screening, expression and methylation datasets.

Note: I have not included the code for cleaning the expression and methylation datasets or for splitting the methylation dataset, as these are identical to Rehnberg, Expression Cleaning; Rehnberg, Methylation Cleaning; and Research Group, Splitting Methylation, respectively.

  • Ran classifiers and recorded performance.
  • Compared performance to the old definition of efficacy.
  • Started preparing for the Symposium.

1. Research Group, Screening Bimodal Cleaning

2. Research Group, Plotting IC50 distributions

Week 6

  • Worked on the Abstract, Poster, and Presentation.
  • Presented at the Symposium.

Note: There is a mistake in the plot for drug 1054 in the presentation and poster. The plot currently shows perfect classifier performance.


1. Research Group, Plotting Research Results

2. Data Mining Group, Presentation

3. Research Group, Abstract

4. Research Group, Poster

ProjectDescription.pdf