BDSI Materials

On this page I have listed the materials that I interacted with during the Big Data Summer Institute 2019 at the University of Michigan at Ann Arbor. You can also find them in the file BDSI Materials Reference Sheet. With the exception of the additional materials and research code, the files are to be made publicly available by BDSI organizers on the bdsi wiki-page.

The grouping of the lectures by topic (Being a Data Scientist, Statistics, etc.) is entirely my own choice. It has not been approved by BDSI officials. While I did my best at organizing the files, you may find the structure chaotic. One reason is that many of the lectures were interdisciplinary and fit well into more than one category. Thus, if you are looking for something in particular, I recommend Ctrl+F.

The keywords are also listed according to my own understanding. They have not been approved by BDSI officials. If you find them inappropriate, it is my fault. I would appreciate it if you contacted me in that case so that I can make the necessary changes.

Before uploading the files to the folders, I changed the filenames and converted them to PDF/HTML format. I also made changes to many of the scripts located in Research Code so that the structure, variable names, comments, etc. are consistent.

At this point I have decided not to include the journey lectures and lectures related to career development (writing CVs/resumes, etc.), as many of these contain personal information. You will still be able to find them on the bdsi wiki-page.

This page is designed to reflect my personal experience during BDSI. Thus, in the research sections I have only included the Data Mining Group research lectures and the code and presentations I worked on within my Research Group (James Carzon, Karen Gao, Kailey Mulligan and me). If you are interested in the research of other groups or materials from past years, please visit the bdsi wiki-page.

Again, please contact me if anything on this page or throughout the website is confusing.

1. LeFaive, Reproducible Research

Keywords: Replicating vs Reproducing, Organizing projects, Documentation, Version Control, Package Managers, Workflow Automation, Coding Practices

2. BDSI Warmup

Keywords: Stochastic vs Algorithmic Statistical Models, Leo Breiman

Associated Files: Breiman, 2001, “Statistical Modeling: The Two Cultures”; Olshen, 2001, “A Conversation with Leo Breiman”; Donoho, 2015, “50 Years of Data Science”

3. On Being a Scientist

Keywords: Treatment of Data, Plagiarism, Sharing of Results, Conflict of Interest, Researchers and Society

Associated Files: National Academy of Sciences, 2009, “On Being a Scientist”; Baggerly and Coombes, 2009, “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology”

4. Ethical Statistical Practice

Keywords: Study Design, Data Collection (Human Subject Protection), Data Management, Data Analysis

Associated Files: Committee on Professional Ethics, 1999, “Ethical Guidelines for Statistical Practice”; Al-Marzouki et al., 2005, “The effect of scientific misconduct on the results of clinical trials: A Delphi survey”

5. Statistical Significance

Readings: Benjamin et al., 2017, “Redefine statistical significance”; Wasserstein, Schirm, and Lazar, 2019, “Moving to a World Beyond ‘p < 0.05’”

Note: These readings are not associated with any of the lectures in this section. We had them as pre-reading for BDSI.

1. Wu, Causal Inference in Medicine and Public Health

Keywords: Intervention, Rubin’s Causal Model, Stable Unit Treatment Value Assumption (SUTVA), Randomized Experiments and Observational Studies

1. Hartman, Probability

Keywords: Probability and Statistics, Set Theory, Counting, Conditioning and Independence, Bayes Theorem, Random Variables, Probability Functions, Cumulative Distribution Functions, Expectation, Poisson Distribution, Normal Distribution, Z-scores and P-values, R example

2. Hartman, Linear Algebra

Keywords: Scalars and Vectors, Matrix Algebra, Least Squares Regression, Linear (In)dependence and Rank, R example

3. Adar, (Social) Network Analysis

Keywords: Graphs in the real world, Describing Networks (Degree Distributions, Metrics, Models for Generating Random Graphs, Importance of Nodes, Subgraphs (Motifs + Motif Detection Software), Community Detection (Betweenness Clustering, Modularity, Walktrap), Fixing Networks (Sampling, Link Prediction in Relational Data), Processes on Networks, Planarity, Graph Visualizations, Software for Network Analysis, Applications (Classification, Text Summarization, Recommender Systems, Epidemiology)

4. Kang, Optimization

Keywords: Types of optimization, Single-dimensional Optimization (Bracketing, Golden Section Search, Parabola Method, Brent’s Method), Multi-dimensional (Nelder–Mead, Coordinate Descent, Gradient Descent, Stochastic/Batch/Mini-batch Gradient Descent), Newton’s Method, Quasi-Newton Method; Broyden–Fletcher–Goldfarb–Shanno (BFGS) Update, L-BFGS-B (BFGS with memory limit and box constraints), Specialized Methods, “Iteratively Reweighted Least Squares” (IRLS) for Logistic Regression, “Least-Angle Regression” (LARS) for LASSO, Expectation Maximization, Annealing (Markov Chain – Monte Carlo), Linear Programming, Quadratic Programming, Semidefinite Programming, Alternating Direction Method of Multipliers (ADMM), Reference to Other Methods

1. Beesley, Model Selection I and II

Keywords: Selecting features (P-value model selection, Backward Elimination, Forward Selection, Stepwise Regression), Quantile-Quantile plots, Residuals, adjusted R^2, Likelihood Ratio Testing, Akaike Information Criterion, Bayesian Information Criterion, Prediction Measures (Prediction sum of squares, Mallows’s Cp, Mean Squared Prediction Error, Mean Square absolute error), Sensitivity, Specificity, ROC, Cross Validation, Tuning parameters; Penalized Linear Models (LASSO, Ridge, Elastic Net); Classification and Regression Trees (Growing Trees, Pruning Trees, Bagging and Boosting Forests, Out of bag prediction)

2. Boonstra, Model Assessment

Keywords: Training, Validating, Testing, Concordance Index, Deviance, Mean Square Error, Concordance, ROC, Specificity, Sensitivity, R example

1. Wiens, Machine Learning I

Keywords: Threshold Linear Mappings: Zero-One Loss, Hinge Loss, Gradient Descent, Stochastic Gradient Descent

2. Wiens, Machine Learning II

Keywords: Feed-Forward Neural Networks

3. Koutra, Unsupervised Learning I

Keywords: Class representatives, K-means Algorithm

4. Koutra, Unsupervised Learning II

Keywords: Hierarchical Clustering, Metrics for Hierarchical Clustering, Comparing Hierarchical Clustering and K-means, Spectral Clustering

5. Gryak, Data Mining I

Keywords: K-means ++, K-means ISODATA, Linear Manifold Clustering, Kohonen self-organizing maps, Classification, K Nearest Neighbors, Applications to Acute Respiratory Distress Syndrome

6. Gryak, Data Mining II

Keywords: Deep-learning (Activation Functions, Cybenko Theorem, Backpropagation), Support Vector Machines (Kernels, Privileged Information SVM+, Uncertainty of Labels LULUPI); Trees and Random Forests, Feature Extraction from Superpixels, Applications to Subdural Hematoma

1. Kay, Information Visualization I

Keywords: Grammar of Graphics (Data, Visual Channels, and Marks), Choices for Visualizing Information (Quantitative, Ordinal and Categorical), Color Choices, Labels and Figure Descriptions, Viewing Order, Small Multiples

2. Kay, Information Visualization II

Keywords: Communicating Uncertainty, Probabilistic Uncertainty vs Uncertainty in Study

3. Griffiths, Reading Like a [Scientific] Writer

Keywords: Structure of Scientific Texts (Intro, Methods, Results, Analysis, Discussion), Genre, Genre Specific Details

Associated Files: Gopen and Swan, “The Science of Scientific Writing”

4. Griffiths, Writing from Point A to Point D

Keywords: Simple Strategies for Conveying Complex Ideas, Sentence and Paragraph Organization

1. Little I, Sampling

Keywords: Hypothesis Testing, P-value, Confidence Intervals, Precision and Accuracy, Sampling, Treatment assignment

2. Little II, Treatment Control

Keywords: Confounding, Bias in observational data, Causal effect, Internal and External validity, Randomized assignments, Blinding and Masking, Clinical trials

Associated Files: Creagan et al., 1979, “Failure of high-dose vitamin C (ascorbic acid) therapy to benefit patients with advanced cancer. A controlled trial”; Cameron and Pauling, 1976, “Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer”; Ryden et al., 1980, “Prophylaxis of ventricular tachyarrhythmias with intravenous and oral tocainide in patients with and recovering from acute myocardial infarction”; Sprague, 1981, “Arthroscopic Debridement for Degenerative Knee Joint Disease”; Baumgaertner et al., 1990, “Arthroscopic Debridement of the Arthritic Knee”; Moseley et al., 2002, “A Controlled Trial of Arthroscopic Surgery for Osteoarthritis of the Knee”; CAST, 1989, “Preliminary Report: Effect of Encainide and Flecainide on Mortality in a Randomized Trial of Arrhythmia Suppression After Myocardial Infarction”

3. Little III, Maximum Likelihood

Keywords: Parameter estimation, Likelihood Method (Maximum Likelihood Inference and Bayesian Inference), Statistical Models (Normal, Binomial, Generalized Linear), Maximum Likelihood Properties, Interval Estimation, Significance Tests

Associated Files: Fisher, 1922, “On the Mathematical Foundations of Theoretical Statistics”

4. Hector, Linear Regression

Keywords: Least Squares Regression, Analysis of Variance (ANOVA), Assumptions of Linear Regression

5. Hector, Logistic Regression

Keywords: Logit function, Likelihood Equations, Odds, Log Odds, Odds Ratio, R example

Associated Files: Hartman, Logistic Example CHD

6. Hartman, Generalized Linear Models

Keywords: Exponential Family Distributions, Linear Predictor, Link Function, R Example

7. Hartman, Correlated Data Models

Keywords: Clustered Data, Longitudinal Data, Mixed Effects Model, Random Effects, R example

8. Wen, Bayesian Statistics I

Keywords: Bayes theorem (prior, likelihood, posterior), Arguments for Bayesian Statistics, Exchangeability and De Finetti’s Theorem

9. Wen, Bayesian Statistics II

Keywords: Setting priors, Bayesian model for variable selection

10. Chen, Bayesian Data Analysis

Keywords: Bayes Theorem, Genetic Example, Markov Chain – Monte Carlo (MCMC), Approximate Bayesian Computation and Variational Inference

Associated Files: Chen, discrete MH

1. Flickinger, R dplyr

Associated Files: Flickinger, dplyr; RStudio dplyr Cheat Sheet; Flickinger, dplyr nycflights; Flickinger, dplyr ocsls

2. Flickinger, R ggplot2

Associated Files: Flickinger, ggplot2 mpg; Flickinger, ggplot2 nycflights; RStudio ggplot2 Cheat Sheet

3. Kamran, Python for Data Science

Keywords: Basics, NumPy, Pandas, Matplotlib

4. Barker, Cluster Computing

Keywords: Submitting Jobs, Basic Linux Commands

5. Boonstra, R Markdown

1. Denton and Li, Prostate Cancer Surveillance Using Data-driven Markov Decision Processes

Keywords: Hidden Markov Models, The Baum-Welch Algorithm, Partially Observable Markov Decision Process (POMDP), Bayesian Networks, Multiple Types of Observation, Detecting Prostate Cancer

Associated Files: Barnett et al., 2017, “Two-Stage Biomarker Protocols for Improving the Precision of Early Detection of Prostate Cancer”; Barnett et al., 2017, “Optimizing Active Surveillance Strategies to Balance the Competing Goals of Early Detection of Grade Progression and Minimizing Harm From Biopsies”; Zhang et al., 2012, “Optimization of Prostate Biopsy Referral Decisions”

2. Provost, Human-Centered Computing

Keywords: Detecting Speech Rhythm, Detecting Emotion (Language, Anomaly Detection), Applications to Identifying Bipolar Disorder

3. Zelner, Spatial Epidemiology

Keywords: Neighbor Effects, Ecology, Tools in Spatial Epidemiology

Associated Files: Zelner, R Example

4. Lisabeth, Stroke Disparities Research

Keywords: Stroke Risk Factors, Challenges in Collecting Stroke Research Data, Stroke Disparities

5. Surakka, From Genomics to Prevention of Cardiovascular diseases

Keywords: Genetics (Inheritance and Mutations), Genome-Wide Association Analyses, Polygenic Risk Score, Coronary Heart Disease

6. Kheterpal, Precision Health and Big Data

Keywords: Machine Learning in Apps vs Machine Learning in Health, Misspellings in Medical Text, Visualizing Patient’s condition During Surgery, Anesthetics, Michigan Predictive Activity and Clinical Trajectories

7. Baladandayuthapani, Bayesian Data Integration and Precision Medicine

Keywords: Data Sources for Cancer Research, Software for Visualizing Cancer Data, Challenges in Cancer Research, Sparse Hierarchical Prior Distributions, Network-Based Approaches, Radiomics and Radiogenomics

8. Rao, Radiomics

Keywords: Linking Tumor-Derived Phenotype information with Genetics for Personalized Medicine, Phenotype-Guided Drug Discovery, Assessing Immune Status

9. Singh, Natural Language Processing

Keywords: Applications to Public Health, String Processing

1. Bhattacharyya and Rehnberg, Introduction

Keywords: BDSI Data Mining 2019 Overview, Genomics of Drug Sensitivity in Cancer (GDSC), Data Types

2. Rehnberg, Screening Cleaning

Keywords: Genomics of Drug Sensitivity in Cancer (GDSC), Cell Lines, Drug Screening, Half-Maximal Inhibitory Concentration (IC50), Downloading and Cleaning Screening Data

3. Rehnberg, Expression Cleaning

Keywords: Gene Expressions, Collecting Gene Expression Data, Microarrays, Downloading Expression Data, Preprocessing Screening and Expression Data

4. Rehnberg, Methylation Cleaning

Keywords: DNA Methylation, Methylation and Cancer, Collecting Methylation Data, Cleaning Screening, Expression, and Methylation Data

5. Rehnberg, Copynumber Cleaning

Keywords: Copy Number Variation, Copy Number Variation Data Collection, Cleaning Screening, Expression, Methylation, and Copy Number Variation Data

6. Bhattacharyya, Classification Methods

Keywords: Optimal Bayes Classifier, Parametric and Non-parametric Classifiers, Naïve Bayes, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, K-Nearest Neighbors, Random Forests and Decision Trees (CART), Visualizing Decision Boundaries

7. Bhattacharyya, Support Vector Machines

Keywords: Margins, Linear Support Vector Machines, Non-Linear Support Vector Machines

8. Bhattacharyya, Penalization and Feature Selection

Keywords: Overfitting, Regularized Logistic Regression, Regularized/Diagonal Linear Discriminant Analysis (DLDA), T-test Feature Selection

9. Bhattacharyya, PCA and PCR

Keywords: Principal Components, Principal Component Regression

10. Bhattacharyya, Cross Validation

Keywords: Leave-One-Out-Cross-Validation (LOOCV), K-fold Cross Validation, Parameter Tuning

11. Bhattacharyya, Research Guidelines

Keywords: High-Dimensional Data, Class Imbalance, Transforming Predictors, Missing Data

12. Rehnberg, Combining Datasets

1. Data Mining Group, Presentation

Keywords: Examining the Effects of Cancer Cell Line Genetics on Drug Sensitivity, Genomics of Drug Sensitivity in Cancer Data, Pre-processing and Missing Data, (Group 1) Chemocracy: A Voting-Based Method, (Group 2) Accounting for Adverse Effects of Drugs, (Group 3) Data Combination and Result Uncertainty, (Group 4) Integration of Tumor Site into Predictive Models

Note: There is a mistake on slide 30. The plot currently shows a perfect classifier performance for drug 1054.

2. Research Group, Abstract

3. Research Group, Poster

Note: There is a mistake in the plot for drug 1054. The plot currently shows a perfect classifier performance.

1. Research Group, f1.score

Keywords: F1 score

2. Research Group, kappa.cohen

Keywords: Cohen’s Kappa Metric

3. Research Group, getp

Keywords: T-test P-value, Showing how well a feature distinguishes between effective and ineffective drug-cell line combination, Used for feature selection

4. Research Group, Manual Set-up

Keywords: Subset Data for a Specific Drug, Obtain P-values for Every Column

5. Research Group, Screening Bimodal Cleaning

Keywords: Bimodal Distributions of IC50, Redefining Drug Efficacy, Cleaning Screening Data

6. Research Group, Splitting Methylation

Keywords: Updating Row Names of Methylation Data, Splitting Methylation Data per Drug

7. Research Group, Implicit Imputation via PCA

Keywords: Imputing Copy Number Variation Data, Scaling Entries in Principal Component Analysis

8. Research Group, KNN

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

9. Research Group, LDA

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

10. Research Group, Logistic Regression LASSO

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

11. Research Group, Naïve Bayes

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

12. Research Group, PCR Linear

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

13. Research Group, PCR Logistic

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

14. Research Group, Random Forest

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

15. Research Group, SVM with Linear Kernel

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

16. Research Group, SVM with Polynomial Kernel

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

17. Research Group, SVM with Radial Kernel

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

18. Research Group, SVM with Sigmoid Kernel

Keywords: Single Data Set, Works for all data sets, Parameter Tuning

19. Research Group, Combined KNN voters

Keywords: Methylation and Expression Data Sets, Voting Model

20. Research Group, Combined PCR

Keywords: All Data Sets, Common Approximate Principal Components

21. Research Group, Plotting IC50 distributions

Keywords: IC50 distributions per drug (Raw, Fitting Bimodal Distributions, Parameters of Bimodal Distributions, Effective and Ineffective Distributions), Used for Presenting Research

22. Research Group, Plotting Research Results

Keywords: Comparing Performance with respect to Efficacy Definition, Best Classifiers, Best Datasets
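For readers unfamiliar with the metrics named in the list above (f1.score, kappa.cohen, getp), here is a minimal Python sketch of what such helpers typically compute. It is not the Research Group's code: the function names, the 0/1 label encoding, the toy data, and the gene names are all assumptions made for illustration. In place of a t-test p-value, the sketch ranks features by the absolute pooled-variance t statistic, which gives the same ordering as the two-sided p-value when every feature has the same group sizes.

```python
from collections import Counter
from statistics import mean, variance


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken to be zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(y_true, y_pred):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: product of the marginal label counts, summed over labels.
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; returning 0.0 is this sketch's choice
    return (observed - expected) / (1 - expected)


def pooled_t_statistic(group_a, group_b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


# Toy labels: 1 = effective, 0 = ineffective drug-cell line combination (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print("F1:", f1_score(y_true, y_pred))
print("kappa:", cohen_kappa(y_true, y_pred))

# Hypothetical features: (values in effective group, values in ineffective group).
features = {
    "gene_a": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "gene_b": ([2.0, 4.0, 6.0], [3.0, 4.0, 5.0]),
}
ranking = sorted(features, key=lambda g: abs(pooled_t_statistic(*features[g])),
                 reverse=True)
print("features ranked by |t|:", ranking)
```

A feature whose values differ strongly between the effective and ineffective groups gets a large absolute t statistic (and therefore a small p-value), so it would be kept first under this kind of feature selection.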

1. Chen, Predicting Solar Storms

Keywords: Predicting Solar Storms with Machine Learning, Solar Flares, Classifying Strong Flares, Predict the Advent of a First Flare, Time-to-event Modeling for Flare Arrival Time, Predicting with Long Short-Term Memory (LSTM)

2. Davis-Kean, Testing Developmental Theories

Keywords: Predictor Variables for 10- and 16-year-olds’ total math scores from British and U.S., Confirming Empirical Science with Data, Pre-Registering Existing Data Analysis

3. Feinberg, Online Dating

Keywords: Data (Profile Data, Search Data, Browsing Data, Messaging Data), “Screening Potential Mates”, “Utility” Functions

4. McShane and Gal, Statistical Significance

Keywords: Statistical Significance and the Dichotomization of Evidence, P-value<0.05 Dichotomization, Common Errors Associated with P-value Use and Dichotomization

5. Mukherji, Music

Keywords: Learning structure vs Learning by Repetition, Similarities and differences between musical “languages”, Parametric Variation in Language, Formalization of Bhatkhande-ian Theory

6. Wu, Mobile Health

Keywords: Areas of Mobile Health, Delivering intervention for the right person at the right time, Identify Real-time, Objective Predictors of Depression under Stress, Intern Health Study-Data Collection, Randomization Scheme, Imputing Missing Data

1. Data Mining Group, Presentation

Keywords: Examining the Effects of Cancer Cell Line Genetics on Drug Sensitivity, Genomics of Drug Sensitivity in Cancer Data, Pre-processing and Missing Data, (Group 1) Chemocracy: A Voting-Based Method, (Group 2) Accounting for Adverse Effects of Drugs, (Group 3) Data Combination and Result Uncertainty, (Group 4) Integration of Tumor Site into Predictive Models

2. Genomics Group, Presentation

Keywords: Quantitative Analysis of Polygenic Risk Score Prediction in the Genes for Good Cohort, Polygenic Risk for Hypertension through the Lens of Population Genetics, Neuroticism and Alcohol Use: A Bidirectional Mendelian Randomization Study, Survival Analysis of Glioblastoma Patients Through Tumor Tissue Deconvolution

3. Machine Learning Group, Presentation

Keywords: Predicting Survival of ICU Patients, Classification of Irregularly Sampled Clinical Time Series Data with Convolutional Neural Networks, Exploring the Robustness of Deep Learning Architectures, Exploring Feature Importance for Predictions of In Hospital Mortality

Additional Materials

Here are some additional materials that I found helpful during BDSI.

The two books were recommended by BDSI faculty. I used them to gain a deeper understanding of the mathematical details behind some of the topics covered during the lectures and research sessions. Particularly intriguing to me was the chapter about Dimensionality Reduction in “Hands-On Machine Learning with Scikit-Learn and TensorFlow”, including Locally Linear Embedding (LLE), Multidimensional Scaling (MDS), Isomap, and t-Distributed Stochastic Neighbor Embedding (t-SNE).

I found the paper myself as a follow-up to Chen, Bayesian Data Analysis.


1. A. Géron, “Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems”, O'Reilly Media, 2017.

2. G. James et al., “An Introduction to Statistical Learning: with Applications in R”, Springer, 2013.


1. R. Neal, “MCMC using Hamiltonian dynamics”, 2012, arXiv: