BDSI Materials

On this page I have listed the materials that I worked with during the Big Data Summer Institute (BDSI) 2019 at the University of Michigan, Ann Arbor. You can also find them in the file BDSI Materials Reference Sheet. With the exception of the additional materials and research code, the files are to be made publicly available by the BDSI organizers on the bdsi wiki-page.

The grouping of the lectures by topic (Being a Data Scientist, Statistics, etc.) is entirely my own choice. It has not been approved by BDSI officials. While I did my best to organize the files, you may find the structure chaotic. One reason is that many of the lectures were interdisciplinary and fit well into more than one category. Thus, if you are looking for something in particular, I recommend Ctrl+F.

The keywords are also listed according to my own understanding. They have not been approved by BDSI officials. If you find them inappropriate, it is my fault. I would appreciate it if you contacted me in that case so that I can make the necessary changes.

Before uploading the files to the folders, I changed the filenames and converted them to PDF/HTML format. I also edited many of the scripts located in Research Code so that the structure, variable names, comments, etc. are consistent.

At this point I have decided not to include the journey lectures and the lectures related to career development (writing CVs, resumes, etc.), as many of these contain personal information. You will still be able to find them on the bdsi wiki-page.

This page is designed to reflect my personal experience during BDSI. Thus, in the research sections I have only included the Data Mining Group research lectures and the code and presentations I worked on within my Research Group (James Carzon, Karen Gao, Kailey Mulligan and me). If you are interested in the research of other groups or materials from past years, please visit the bdsi wiki-page.

Again, please contact me if anything on this page or throughout the website is confusing.

Lectures

Being a Data Scientist

1. LeFaive, Reproducible Research

Keywords: Replicating vs Reproducing, Organizing projects, Documentation, Version Control, Package Managers, Workflow Automation, Coding Practices

2. BDSI Warmup

Keywords: Stochastic vs Algorithmic Statistical Models, Leo Breiman

Associated Files: Breiman, 2001, “Statistical Modeling: The Two Cultures”; Olshen, 2001, “A Conversation with Leo Breiman”; Donoho, 2015, “50 Years of Data Science”

3. On Being a Scientist

Keywords: Treatment of Data, Plagiarism, Sharing of Results, Conflict of Interest, Researchers and Society

Associated Files: National Academy of Sciences, 2009, “On Being a Scientist”; Baggerly and Coombes, 2009, “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology”

4. Ethical Statistical Practice

Keywords: Study Design, Data Collection (Human Subject Protection), Data Management, Data Analysis

Associated Files: Committee on Professional Ethics, 1999, “Ethical Guidelines for Statistical Practice”; Al-Marzouki et al., 2005, “The effect of scientific misconduct on the results of clinical trials: A Delphi survey”

5. Statistical Significance

Readings: Benjamin et al., 2017, “Redefine statistical significance”; Wasserstein, Schirm, and Lazar, 2019, “Moving to a World Beyond ‘p < 0.05’”

Note: These readings are not associated with any of the lectures in this section. They were assigned as pre-reading for BDSI.

Causal Inference

1. Wu, Causal Inference in Medicine and Public Health

Keywords: Intervention, Rubin’s Causal Model, Stable Unit Treatment Value Assumption (SUTVA), Randomized Experiments and Observational Studies

Mathematics

1. Hartman, Probability

Keywords: Probability and Statistics, Set Theory, Counting, Conditioning and Independence, Bayes Theorem, Random Variables, Probability Functions, Cumulative Distribution Functions, Expectation, Poisson Distribution, Normal Distribution, Z-scores and P-values, R example

2. Hartman, Linear Algebra

Keywords: Scalars and Vectors, Matrix Algebra, Least Squares Regression, Linear (In)dependence and Rank, R example

3. Adar, (Social) Network Analysis

Keywords: Graphs in the Real World, Describing Networks (Degree Distributions, Metrics, Models for Generating Random Graphs, Importance of Nodes), Subgraphs (Motifs + Motif Detection Software), Community Detection (Betweenness Clustering, Modularity, Walktrap), Fixing Networks (Sampling, Link Prediction in Relational Data), Processes on Networks, Planarity, Graph Visualizations, Software for Network Analysis, Applications (Classification, Text Summarization, Recommender Systems, Epidemiology)

4. Kang, Optimization

Keywords: Types of Optimization, Single-dimensional Optimization (Bracketing, Golden Section Search, Parabola Method, Brent’s Method), Multi-dimensional Optimization (Nelder-Mead, Coordinate Descent, Gradient Descent, Stochastic/Batch/Mini-batch Gradient Descent), Newton’s Method, Quasi-Newton Method; Broyden-Fletcher-Goldfarb-Shanno (BFGS) Update, L-BFGS-B (BFGS with memory limit and box constraints), Specialized Methods, “Iteratively Reweighted Least Squares” (IRLS) for Logistic Regression, “Least-Angle Regression” (LARS) for LASSO, Expectation Maximization, Annealing (Markov Chain Monte Carlo), Linear Programming, Quadratic Programming, Semidefinite Programming, Alternating Direction Method of Multipliers (ADMM), Reference to Other Methods

Choosing and Assessing Models

1. Beesley, Model Selection I and II

Keywords: Selecting Features (P-value Model Selection, Backward Elimination, Forward Selection, Stepwise Regression), Quantile-Quantile Plots, Residuals, Adjusted R^2, Likelihood Ratio Testing, Akaike Information Criterion, Bayesian Information Criterion, Prediction Measures (Prediction Sum of Squares, Mallows’s Cp, Mean Squared Prediction Error, Mean Square absolute error), Sensitivity, Specificity, ROC, Cross Validation, Tuning Parameters; Penalized Linear Models (LASSO, Ridge, Elastic Net); Classification and Regression Trees (Growing Trees, Pruning Trees, Bagging and Boosting Forests, Out-of-bag Prediction)

2. Boonstra, Model Assessment

Keywords: Training, Validating, Testing, Concordance Index, Deviance, Mean Square Error, Concordance, ROC, Specificity, Sensitivity, R example

Machine Learning

1. Wiens, Machine Learning I

Keywords: Threshold Linear Mappings: Zero-One Loss, Hinge Loss, Gradient Descent, Stochastic Gradient Descent

2. Wiens, Machine Learning II

Keywords: Feed-Forward Neural Networks

3. Koutra, Unsupervised Learning I

Keywords: Class representatives, K-means Algorithm

4. Koutra, Unsupervised Learning II

Keywords: Hierarchical Clustering, Metrics for Hierarchical Clustering, Comparing Hierarchical Clustering and K-means, Spectral Clustering

5. Gryak, Data Mining I

Keywords: K-means++, K-means ISODATA, Linear Manifold Clustering, Kohonen Self-Organizing Maps, Classification, K-Nearest Neighbors, Applications to Acute Respiratory Distress Syndrome

6. Gryak, Data Mining II

Keywords: Deep-learning (Activation Functions, Cybenko Theorem, Backpropagation), Support Vector Machines (Kernels, Privileged Information SVM+, Uncertainty of Labels LULUPI); Trees and Random Forests, Feature Extraction from Superpixels, Applications to Subdural Hematoma

Communicating Results

1. Kay, Information Visualization I

Keywords: Grammar of Graphics (Data, Visual Channels, and Marks), Choices for Visualizing Information (Quantitative, Ordinal and Categorical), Color Choices, Labels and Figure Descriptions, Viewing Order, Small Multiples

2. Kay, Information Visualization II

Keywords: Communicating Uncertainty, Probabilistic Uncertainty vs Uncertainty in Study

3. Griffiths, Reading Like a [Scientific] Writer

Keywords: Structure of Scientific Texts (Intro, Methods, Results, Analysis, Discussion), Genre, Genre-Specific Details

Associated Files: Gopen and Swan, “The Science of Scientific Writing”

4. Griffiths, Writing from Point A to Point D

Keywords: Simple Strategies for Conveying Complex Ideas, Sentence and Paragraph Organization

Statistics

1. Little I, Sampling

Keywords: Hypothesis Testing, P-value, Confidence Intervals, Precision and Accuracy, Sampling, Treatment assignment

2. Little II, Treatment Control

Keywords: Confounding, Bias in observational data, Causal effect, Internal and External validity, Randomized assignments, Blinding and Masking, Clinical trials

Associated Files: Creagan et al., 1979, “Failure of high-dose vitamin C (ascorbic acid) therapy to benefit patients with advanced cancer. A controlled trial”; Cameron and Pauling, 1976, “Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer”; Ryden et al., 1980, “Prophylaxis of ventricular tachyarrhythmias with intravenous and oral tocainide in patients with and recovering from acute myocardial infarction”; Sprague, 1981, “Arthroscopic Debridement for Degenerative Knee Joint Disease”; Baumgaertner et al., 1990, “Arthroscopic Debridement of the Arthritic Knee”; Moseley et al., 2002, “A Controlled Trial of Arthroscopic Surgery for Osteoarthritis of the Knee”; CAST, 1989, “Preliminary Report: Effect of Encainide and Flecainide on Mortality in a Randomized Trial of Arrhythmia Suppression After Myocardial Infarction”

3. Little III, Maximum Likelihood

Keywords: Parameter estimation, Likelihood Method (Maximum Likelihood Inference and Bayesian Inference), Statistical Models (Normal, Binomial, Generalized Linear), Maximum Likelihood Properties, Interval Estimation, Significance Tests

Associated Files: Fisher, 1922, “On the Mathematical Foundations of Theoretical Statistics”

4. Hector, Linear Regression

Keywords: Least Squares Regression, Analysis of Variance (ANOVA), Assumptions of Linear Regression

5. Hector, Logistic Regression

Keywords: Logit function, Likelihood Equations, Odds, Log Odds, Odds Ratio, R example

Associated Files: Hartman, Logistic Example CHD

6. Hartman, Generalized Linear Models

Keywords: Exponential Family Distributions, Linear Predictor, Link Function, R Example

7. Hartman, Correlated Data Models

Keywords: Clustered Data, Longitudinal Data, Mixed Effects Model, Random Effects, R example

8. Wen, Bayesian Statistics I

Keywords: Bayes theorem (prior, likelihood, posterior), Arguments for Bayesian Statistics, Exchangeability and De Finetti’s Theorem

9. Wen, Bayesian Statistics II

Keywords: Setting priors, Bayesian model for variable selection

10. Chen, Bayesian Data Analysis

Keywords: Bayes Theorem, Genetic Example, Markov Chain Monte Carlo (MCMC), Approximate Bayesian Computation and Variational Inference

Associated Files: Chen, discrete MH

Computing

1. Flickinger, R dplyr

Associated Files: Flickinger, dplyr; RStudio dplyr Cheat Sheet; Flickinger, dplyr nycflights; Flickinger, dplyr ocsls

2. Flickinger, R ggplot2

Associated Files: Flickinger, ggplot2 mpg; Flickinger, ggplot2 nycflights; RStudio ggplot2 Cheat Sheet

3. Kamran, Python for Data Science

Keywords: Basics, NumPy, Pandas, Matplotlib

4. Barker, Cluster Computing

Keywords: Submitting Jobs, Basic Linux Commands

5. Boonstra, R Markdown

Public Health

1. Denton and Li, Prostate Cancer Surveillance Using Data-driven Markov Decision Processes

Keywords: Hidden Markov Models, The Baum-Welch Algorithm, Partially Observable Markov Decision Process (POMDP), Bayesian Networks, Multiple Types of Observation, Detecting Prostate Cancer

Associated Files: Barnett et al., 2017, “Two-Stage Biomarker Protocols for Improving the Precision of Early Detection of Prostate Cancer”; Barnett et al., 2017, “Optimizing Active Surveillance Strategies to Balance the Competing Goals of Early Detection of Grade Progression and Minimizing Harm From Biopsies”; Zhang et al., 2012, “Optimization of Prostate Biopsy Referral Decisions”

2. Provost, Human-Centered Computing

Keywords: Detecting Speech Rhythm, Detecting Emotion (Language, Anomaly Detection), Applications to Identifying Bipolar Disorder

3. Zelner, Spatial Epidemiology

Keywords: Neighbor Effects, Ecology, Tools in Spatial Epidemiology

Associated Files: Zelner, R Example

4. Lisabeth, Stroke Disparities Research

Keywords: Stroke Risk Factors, Challenges in Collecting Stroke Research Data, Stroke Disparities

5. Surakka, From Genomics to Prevention of Cardiovascular diseases

Keywords: Genetics (Inheritance and Mutations), Genome-Wide Association Analyses, Polygenic Risk Score, Coronary Heart Disease

6. Kheterpal, Precision Health and Big Data

Keywords: Machine Learning in Apps vs Machine Learning in Health, Misspellings in Medical Text, Visualizing a Patient’s Condition During Surgery, Anesthetics, Michigan Predictive Activity and Clinical Trajectories

7. Baladandayuthapani, Bayesian Data Integration and Precision Medicine

Keywords: Data Sources for Cancer Research, Software for Visualizing Cancer Data, Challenges in Cancer Research, Sparse Hierarchical Prior Distributions, Network-Based Approaches, Radiomics and Radiogenomics

8. Rao, Radiomics

Keywords: Linking Tumor-Derived Phenotype information with Genetics for Personalized Medicine, Phenotype-Guided Drug Discovery, Assessing Immune Status

9. Singh, Natural Language Processing

Keywords: Applications to Public Health, String Processing

Research

Research Lectures

1. Bhattacharyya and Rehnberg, Introduction

Keywords: BDSI Data Mining 2019 Overview, Genomics of Drug Sensitivity in Cancer (GDSC), Data Types

2. Rehnberg, Screening Cleaning

Keywords: Genomics of Drug Sensitivity in Cancer (GDSC), Cell Lines, Drug Screening, Half-Maximal Inhibitory Concentration (IC50), Downloading and Cleaning Screening Data

3. Rehnberg, Expression Cleaning

Keywords: Gene Expressions, Collecting Gene Expression Data, Microarrays, Downloading Expression Data, Preprocessing Screening and Expression Data

4. Rehnberg, Methylation Cleaning

Keywords: DNA Methylation, Methylation and Cancer, Collecting Methylation Data, Cleaning Screening, Expression, and Methylation Data

5. Rehnberg, Copynumber Cleaning

Keywords: Copy Number Variation, Copy Number Variation Data Collection, Cleaning Screening, Expression, Methylation, and Copy Number Variation Data

6. Bhattacharyya, Classification Methods

Keywords: Optimal Bayes Classifier, Parametric and Non-parametric Classifiers, Naïve Bayes, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, K-Nearest Neighbors, Random Forests and Decision Trees (CART), Visualizing Decision Boundaries

7. Bhattacharyya, Support Vector Machines

Keywords: Margins, Linear Support Vector Machines, Non-Linear Support Vector Machines

8. Bhattacharyya, Penalization and Feature Selection

Keywords: Overfitting, Regularized Logistic Regression, Regularized/Diagonal Linear Discriminant Analysis (DLDA), T-test Feature Selection

9. Bhattacharyya, PCA and PCR

Keywords: Principal Components, Principal Component Regression

10. Bhattacharyya, Cross Validation

Keywords: Leave-One-Out-Cross-Validation (LOOCV), K-fold Cross Validation, Parameter Tuning

11. Bhattacharyya, Research Guidelines

Keywords: High-Dimensional Data, Class Imbalance, Transforming Predictors, Missing Data

12. Rehnberg, Combining Datasets

Presenting Research

1. Data Mining Group, Presentation

Keywords: Examining the Effects of Cancer Cell Line Genetics on Drug Sensitivity, Genomics of Drug Sensitivity in Cancer Data, Pre-processing and Missing Data, (Group 1) Chemocracy: A Voting-Based Method, (Group 2) Accounting for Adverse Effects of Drugs, (Group 3) Data Combination and Result Uncertainty, (Group 4) Integration of Tumor Site into Predictive Models

Note: There is a mistake on slide 30. The plot currently shows perfect classifier performance for drug 1054.

2. Research Group, Abstract

3. Research Group, Poster

Note: There is a mistake in the plot for drug 1054. The plot currently shows perfect classifier performance.

Research Code

Helper Functions

1. Research Group, f1.score

Keywords: F1 score

2. Research Group, kappa.cohen

Keywords: Cohen’s Kappa Metric

3. Research Group, getp

Keywords: T-test P-value (shows how well a feature distinguishes between effective and ineffective drug-cell line combinations; used for feature selection)

4. Research Group, Manual Set-up

Keywords: Subset Data for a Specific Drug, Obtain P-values for Every Column
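
For readers unfamiliar with the two metrics named in the first two helper functions, here is a minimal Python sketch of their standard textbook definitions. It is not the Research Group's code (which is linked above and is not reproduced here); the function names `f1_score` and `cohen_kappa`, the 0/1 label encoding, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def f1_score(actual, predicted, positive=1):
    """F1 = harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        # No true positives: precision and/or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(actual, predicted):
    """Cohen's kappa = (observed - expected agreement) / (1 - expected)."""
    n = len(actual)
    observed = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    if expected == 1.0:
        # Both label vectors contain a single identical class: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy labels (hypothetical): 1 = effective, 0 = ineffective.
    actual = [1, 1, 1, 0, 0, 0, 1, 0]
    predicted = [1, 1, 0, 0, 0, 1, 1, 0]
    print(f1_score(actual, predicted))
    print(cohen_kappa(actual, predicted))
```

On the toy labels, precision and recall are both 3/4, so the F1 score is 0.75; observed agreement is 0.75 against a chance agreement of 0.5, so kappa is 0.5. Kappa is the more conservative of the two here because it discounts the agreement that would be expected from the class frequencies alone.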

Cleaning

5. Research Group, Screening Bimodal Cleaning

Keywords: Bimodal Distributions of IC50, Redefining Drug Efficacy, Cleaning Screening Data

6. Research Group, Splitting Methylation

Keywords: Updating Row Names of Methylation Data, Splitting Methylation Data per Drug

7. Research Group, Implicit Imputation via PCA

Keywords: Imputing Copy Number Variation Data, Scaling Entries in Principal Component Analysis