BDSI Materials
On this page I have listed the materials that I worked with during the Big Data Summer Institute (BDSI) 2019 at the University of Michigan, Ann Arbor. You can also find them in the file BDSI Materials Reference Sheet. With the exception of the additional materials and research code, the files are to be made publicly available by the BDSI organizers on the bdsi wiki-page.
The grouping of the lectures by topic (Being a Data Scientist, Statistics, etc.) is entirely my own choice and has not been approved by BDSI officials. While I did my best to organize the files, you may find the structure chaotic. One reason is that many of the lectures were interdisciplinary and fit well into more than one category. Thus, if you are looking for something in particular, I recommend using Ctrl+F.
The keywords also reflect my own understanding and have not been approved by BDSI officials. If you find them inappropriate, the fault is mine. I would appreciate it if you contacted me in that case so that I can make the necessary changes.
Before uploading the files to the folders, I changed the filenames and converted the files to PDF/HTML format. I also edited many of the scripts located in Research Code so that their structure, variable names, comments, etc. are consistent.
At this point I have decided not to include the journey lectures and the lectures related to career development (writing CVs, resumes, etc.), as many of these contain personal information. You can still find them on the bdsi wiki-page.
This page is designed to reflect my personal experience during BDSI. Thus, in the research sections I have only included the Data Mining Group research lectures and the code and presentations I worked on within my Research Group (James Carzon, Karen Gao, Kailey Mulligan and me). If you are interested in the research of other groups or materials from past years, please visit the bdsi wiki-page.
Again, please contact me if anything on this page or throughout the website is confusing.
1. LeFaive, Reproducible Research
Keywords: Replicating vs Reproducing, Organizing projects, Documentation, Version Control, Package Managers, Workflow Automation, Coding Practices
2. BDSI Warmup
Keywords: Stochastic vs Algorithmic Statistical Models, Leo Breiman
Associated Files: Breiman, 2001, “Statistical Modeling: The Two Cultures”; Olshen, 2001, “A Conversation with Leo Breiman”; Donoho, 2015, “50 Years of Data Science”
3. On Being a Scientist
Keywords: Treatment of Data, Plagiarism, Sharing of Results, Conflict of Interest, Researchers and Society
Associated Files: National Academy of Sciences, 2009, “On Being a Scientist”; Baggerly and Coombes, 2009, “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology”
4. Ethical Statistical Practice
Keywords: Study Design, Data Collection (Human Subject Protection), Data Management, Data Analysis
Associated Files: Committee on Professional Ethics, 1999, “Ethical Guidelines for Statistical Practice”; Al-Marzouki et al., 2005, “The effect of scientific misconduct on the results of clinical trials: A Delphi survey”
5. Statistical Significance
Readings: Benjamin et al., 2017, “Redefine statistical significance”; Wasserstein, Schirm, and Lazar, 2019, “Moving to a World Beyond ‘p<0.05’”
Note: These readings are not associated with any of the lectures in this section. We had them as a pre-read for BDSI.
1. Wu, Causal Inference in Medicine and Public Health
Keywords: Intervention, Rubin’s Causal Model, Stable Unit Treatment Value Assumption (SUTVA), Randomized Experiments and Observational Studies
1. Hartman, Probability
Keywords: Probability and Statistics, Set Theory, Counting, Conditioning and Independence, Bayes Theorem, Random Variables, Probability Functions, Cumulative Distribution Functions, Expectation, Poisson Distribution, Normal Distribution, Z-scores and P-values, R example
2. Hartman, Linear Algebra
Keywords: Scalars and Vectors, Matrix Algebra, Least Squares Regression, Linear (In)dependence and Rank, R example
3. Adar, (Social) Network Analysis
Keywords: Graphs in the Real World, Describing Networks (Degree Distributions, Metrics), Models for Generating Random Graphs, Importance of Nodes, Subgraphs (Motifs + Motif Detection Software), Community Detection (Betweenness Clustering, Modularity, Walktrap), Fixing Networks (Sampling, Link Prediction in Relational Data), Processes on Networks, Planarity, Graph Visualizations, Software for Network Analysis, Applications (Classification, Text Summarization, Recommender Systems, Epidemiology)
4. Kang, Optimization
Keywords: Types of Optimization, Single-dimensional Optimization (Bracketing, Golden Section Search, Parabola Method, Brent’s Method), Multi-dimensional Optimization (Nelder-Mead, Coordinate Descent, Gradient Descent, Stochastic/Batch/Mini-batch Gradient Descent), Newton’s Method, Quasi-Newton Method, Broyden-Fletcher-Goldfarb-Shanno (BFGS) Update, L-BFGS-B (BFGS with memory limit and box constraints), Specialized Methods, “Iteratively Reweighted Least Squares” (IRLS) for Logistic Regression, “Least-Angle Regression” (LARS) for LASSO, Expectation Maximization, Annealing (Markov Chain Monte Carlo), Linear Programming, Quadratic Programming, Semidefinite Programming, Alternating Direction Method of Multipliers (ADMM), Reference to Other Methods
1. Beesley, Model Selection I and II
Keywords: Selecting Features (P-value Model Selection, Backward Elimination, Forward Selection, Stepwise Regression), Quantile-Quantile Plots, Residuals, Adjusted R^2, Likelihood Ratio Testing, Akaike Information Criterion, Bayesian Information Criterion, Prediction Measures (Prediction Sum of Squares, Mallows’ Cp, Mean Squared Prediction Error, Mean Square Absolute Error), Sensitivity, Specificity, ROC, Cross Validation, Tuning Parameters, Penalized Linear Models (LASSO, Ridge, Elastic Net), Classification and Regression Trees (Growing Trees, Pruning Trees, Bagging and Boosting Forests, Out-of-bag Prediction)
2. Boonstra, Model Assessment
Keywords: Training, Validating, Testing, Concordance Index, Deviance, Mean Square Error, Concordance, ROC, Specificity, Sensitivity, R example
1. Wiens, Machine Learning I
Keywords: Threshold Linear Mappings: Zero-One Loss, Hinge Loss, Gradient Descent, Stochastic Gradient Descent
2. Wiens, Machine Learning II
Keywords: Feed-Forward Neural Networks
3. Koutra, Unsupervised Learning I
Keywords: Class representatives, K-means Algorithm
4. Koutra, Unsupervised Learning II
Keywords: Hierarchical Clustering, Metrics for Hierarchical Clustering, Comparing Hierarchical Clustering and K-means, Spectral Clustering
5. Gryak, Data Mining I
Keywords: K-means++, K-means ISODATA, Linear Manifold Clustering, Kohonen Self-Organizing Maps, Classification, K-Nearest Neighbors, Applications to Acute Respiratory Distress Syndrome
6. Gryak, Data Mining II
Keywords: Deep-learning (Activation Functions, Cybenko Theorem, Backpropagation), Support Vector Machines (Kernels, Privileged Information SVM+, Uncertainty of Labels LULUPI); Trees and Random Forests, Feature Extraction from Superpixels, Applications to Subdural Hematoma
1. Kay, Information Visualization I
Keywords: Grammar of Graphics (Data, Visual Channels, and Marks), Choices for Visualizing Information (Quantitative, Ordinal and Categorical), Color Choices, Labels and Figure Descriptions, Viewing Order, Small Multiples
2. Kay, Information Visualization II
Keywords: Communicating Uncertainty, Probabilistic Uncertainty vs Uncertainty in Study
3. Griffiths, Reading Like a [Scientific] Writer
Keywords: Structure of Scientific Texts (Intro, Methods, Results, Analysis, Discussion), Genre, Genre-Specific Details
Associated Files: Gopen and Swan, “The Science of Scientific Writing”
4. Griffiths, Writing from Point A to Point D
Keywords: Simple Strategies for Conveying Complex Ideas, Sentence and Paragraph Organization
1. Little I, Sampling
Keywords: Hypothesis Testing, P-value, Confidence Intervals, Precision and Accuracy, Sampling, Treatment assignment
2. Little II, Treatment Control
Keywords: Confounding, Bias in observational data, Causal effect, Internal and External validity, Randomized assignments, Blinding and Masking, Clinical trials
Associated Files: Creagan et al., 1979, “Failure of high-dose vitamin C (ascorbic acid) therapy to benefit patients with advanced cancer. A controlled trial”; Cameron and Pauling, 1976, “Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer”; Ryden et al., 1980, “Prophylaxis of ventricular tachyarrhythmias with intravenous and oral tocainide in patients with and recovering from acute myocardial infarction”; Sprague, 1981, “Arthroscopic Debridement for Degenerative Knee Joint Disease”; Baumgaertner et al., 1990, “Arthroscopic Debridement of the Arthritic Knee”; Moseley et al., 2002, “A Controlled Trial of Arthroscopic Surgery for Osteoarthritis of the Knee”; CAST, 1989, “Preliminary Report: Effect of Encainide and Flecainide on Mortality in a Randomized Trial of Arrhythmia Suppression After Myocardial Infarction”
3. Little III, Maximum Likelihood
Keywords: Parameter estimation, Likelihood Method (Maximum Likelihood Inference and Bayesian Inference), Statistical Models (Normal, Binomial, Generalized Linear), Maximum Likelihood Properties, Interval Estimation, Significance Tests
Associated Files: Fisher, 1922, “On the Mathematical Foundations of Theoretical Statistics”
4. Hector, Linear Regression
Keywords: Least Squares Regression, Analysis of Variance (ANOVA), Assumptions of Linear Regression
5. Hector, Logistic Regression
Keywords: Logit function, Likelihood Equations, Odds, Log Odds, Odds Ratio, R example
Associated Files: Hartman, Logistic Example CHD
6. Hartman, Generalized Linear Models
Keywords: Exponential Family Distributions, Linear Predictor, Link Function, R Example
7. Hartman, Correlated Data Models
Keywords: Clustered Data, Longitudinal Data, Mixed Effects Model, Random Effects, R example
8. Wen, Bayesian Statistics I
Keywords: Bayes theorem (prior, likelihood, posterior), Arguments for Bayesian Statistics, Exchangeability and De Finetti’s Theorem
9. Wen, Bayesian Statistics II
Keywords: Setting priors, Bayesian model for variable selection
10. Chen, Bayesian Data Analysis
Keywords: Bayes Theorem, Genetic Example, Markov Chain Monte Carlo (MCMC), Approximate Bayesian Computation and Variational Inference
Associated Files: Chen, discrete MH
1. Flickinger, R dplyr
Associated Files: Flickinger, dplyr; Rstudio Dplyr Cheat Sheet; Flickinger, dplyr nycflights; Flickinger, dplyr ocsls
2. Flickinger, R ggplot2
Associated Files: Flickinger, ggplot2 mpg; Flickinger, ggplot2 nycflights; Rstudio Ggplot2 Cheat Sheet
3. Kamran, Python for Data Science
Keywords: Basics, NumPy, Pandas, Matplotlib
4. Barker, Cluster Computing
Keywords: Submitting Jobs, Basic Linux Commands
5. Boonstra, R markdown
1. Denton and Li, Prostate Cancer Surveillance Using Data-driven Markov Decision Processes
Keywords: Hidden Markov Models, The Baum-Welch Algorithm, Partially Observable Markov Decision Process (POMDP), Bayesian Networks, Multiple Types of Observation, Detecting Prostate Cancer
Associated Files: Barnett et al., 2017, “Two-Stage Biomarker Protocols for Improving the Precision of Early Detection of Prostate Cancer”; Barnett et al., 2017, “Optimizing Active Surveillance Strategies to Balance the Competing Goals of Early Detection of Grade Progression and Minimizing Harm From Biopsies”; Zhang et al., 2012, “Optimization of Prostate Biopsy Referral Decisions”
2. Provost, Human-Centered Computing
Keywords: Detecting Speech Rhythm, Detecting Emotion (Language, Anomaly Detection), Applications to Identifying Bipolar Disorder
3. Zelner, Spatial Epidemiology
Keywords: Neighbor Effects, Ecology, Tools in Spatial Epidemiology
Associated Files: Zelner, R Example
4. Lisabeth, Stroke Disparities Research
Keywords: Stroke Risk Factors, Challenges in Collecting Stroke Research Data, Stroke Disparities
5. Surakka, From Genomics to Prevention of Cardiovascular Diseases
Keywords: Genetics (Inheritance and Mutations), Genome-Wide Association Analyses, Polygenic Risk Score, Coronary Heart Disease
6. Kheterpal, Precision Health and Big Data
Keywords: Machine Learning in Apps vs Machine Learning in Health, Misspellings in Medical Text, Visualizing Patient’s condition During Surgery, Anesthetics, Michigan Predictive Activity and Clinical Trajectories
7. Baladandayuthapani, Bayesian Data Integration and Precision Medicine
Keywords: Data Sources for Cancer Research, Software for Visualizing Cancer Data, Challenges in Cancer Research, Sparse Hierarchical Prior Distributions, Network-Based Approaches, Radiomics and Radiogenomics
8. Rao, Radiomics
Keywords: Linking Tumor-Derived Phenotype information with Genetics for Personalized Medicine, Phenotype-Guided Drug Discovery, Assessing Immune Status
9. Singh, Natural Language Processing
Keywords: Applications to Public Health, String Processing
1. Bhattacharyya and Rehnberg, Introduction
Keywords: BDSI Data Mining 2019 Overview, Genomics of Drug Sensitivity in Cancer (GDSC), Data Types
2. Rehnberg, Screening Cleaning
Keywords: Genomics of Drug Sensitivity in Cancer (GDSC), Cell Lines, Drug Screening, Half-Maximal Inhibitory Concentration (IC50), Downloading and Cleaning Screening Data
3. Rehnberg, Expression Cleaning
Keywords: Gene Expressions, Collecting Gene Expression Data, Microarrays, Downloading Expression Data, Preprocessing Screening and Expression Data
4. Rehnberg, Methylation Cleaning
Keywords: DNA Methylation, Methylation and Cancer, Collecting Methylation Data, Cleaning Screening, Expression, and Methylation Data
5. Rehnberg, Copynumber Cleaning
Keywords: Copy Number Variation, Copy Number Variation Data Collection, Cleaning Screening, Expression, Methylation, and Copy Number Variation Data
6. Bhattacharyya, Classification Methods
Keywords: Optimal Bayes Classifier, Parametric and Non-parametric Classifiers, Naïve Bayes, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, K-Nearest Neighbors, Random Forests and Decision Trees (CART), Visualizing Decision Boundaries
7. Bhattacharyya, Support Vector Machines
Keywords: Margins, Linear Support Vector Machines, Non-Linear Support Vector Machines
8. Bhattacharyya, Penalization and Feature Selection
Keywords: Overfitting, Regularized Logistic Regression, Regularized/Diagonal Linear Discriminant Analysis (DLDA), T-test Feature Selection
9. Bhattacharyya, PCA and PCR
Keywords: Principal Components, Principal Component Regression
10. Bhattacharyya, Cross Validation
Keywords: Leave-One-Out-Cross-Validation (LOOCV), K-fold Cross Validation, Parameter Tuning
11. Bhattacharyya, Research Guidelines
Keywords: High-Dimensional Data, Class Imbalance, Transforming Predictors, Missing Data
12. Rehnberg, Combining Datasets
1. Data Mining Group, Presentation
Keywords: Examining the Effects of Cancer Cell Line Genetics on Drug Sensitivity, Genomics of Drug Sensitivity in Cancer Data, Pre-processing and Missing Data, (Group 1) Chemocracy: A Voting-Based Method, (Group 2) Accounting for Adverse Effects of Drugs, (Group 3) Data Combination and Result Uncertainty, (Group 4) Integration of Tumor Site into Predictive Models
Note: There is a mistake on slide 30. The plot currently shows a perfect classifier performance for drug 1054.
2. Research Group, Abstract
3. Research Group, Poster
Note: There is a mistake in the plot for drug 1054. The plot currently shows a perfect classifier performance.
1. Research Group, f1.score
Keywords: F1 score
2. Research Group, kappa.cohen
Keywords: Cohen’s Kappa Metric
3. Research Group, getp
Keywords: T-test P-value, Showing how well a feature distinguishes between effective and ineffective drug-cell line combinations, Used for feature selection
4. Research Group, Manual Set-up
Keywords: Subset Data for a Specific Drug, Obtain P-values for Every Column
5. Research Group, Screening Bimodal Cleaning
Keywords: Bimodal Distributions of IC50, Redefining Drug Efficacy, Cleaning Screening Data
6. Research Group, Splitting Methylation
Keywords: Updating Row Names of Methylation Data, Splitting Methylation Data per Drug
7. Research Group, Implicit Imputation via PCA
Keywords: Imputing Copy Number Variation Data, Scaling Entries in Principal Component Analysis
8. Research Group, KNN
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
9. Research Group, LDA
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
10. Research Group, Logistic Regression LASSO
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
11. Research Group, Naïve Bayes
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
12. Research Group, PCR Linear
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
13. Research Group, PCR Logistic
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
14. Research Group, Random Forest
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
15. Research Group, SVM with Linear Kernel
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
16. Research Group, SVM with Polynomial Kernel
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
17. Research Group, SVM with Radial Kernel
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
18. Research Group, SVM with Sigmoid Kernel
Keywords: Single Data Set, Works for all data sets, Parameter Tuning
19. Research Group, Combined KNN voters
Keywords: Methylation and Expression Data Sets, Voting Model
20. Research Group, Combined PCR
Keywords: All Data Sets, Common Approximate Principal Components
21. Research Group, Plotting IC50 distributions
Keywords: IC50 distributions per drug (Raw, Fitting Bimodal Distributions, Parameters of Bimodal Distributions, Effective and Ineffective Distributions), Used for Presenting Research
22. Research Group, Plotting Research Results
Keywords: Comparing Performance with respect to Efficacy Definition, Best Classifiers, Best Datasets
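For readers unfamiliar with the two evaluation metrics named in the list above (F1 score and Cohen's kappa), here is a minimal Python sketch of how each can be computed from actual and predicted labels. This is not the Research Group's code: the function names, argument names, and the choice of Python are assumptions made purely for illustration.

```python
from collections import Counter


def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the `positive` label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(actual, predicted):
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(actual)
    observed = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: product of the marginal label frequencies.
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    if expected == 1.0:
        # Both labelings use a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels: 1 = effective, 0 = ineffective.
    actual = [1, 1, 1, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 1]
    print(f1_score(actual, predicted))
    print(cohen_kappa(actual, predicted))
```

Both metrics are often preferred over plain accuracy when the classes are imbalanced, which is one of the topics listed under Research Guidelines.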
1. Chen, Predicting Solar Storms
Keywords: Predicting Solar Storms with Machine Learning, Solar Flares, Classifying Strong Flares, Predict the Advent of a First Flare, Time-to-event Modeling for Flare Arrival Time, Predicting with Long Short-Term Memory (LSTM)
2. Davis-Kean, Testing Developmental Theories
Keywords: Predictor Variables for 10- and 16-year-olds’ total math scores from British and U.S., Confirming Empirical Science with Data, Pre-Registering Existing Data Analysis
3. Feinberg, Online Dating
Keywords: Data (Profile Data, Search Data, Browsing Data, Messaging Data), “Screening Potential Mates”, “Utility” Functions
4. McShane and Gal, Statistical Significance
Keywords: Statistical Significance and the Dichotomization of Evidence, P-value<0.05 Dichotomization, Common Errors Associated with P-value Use and Dichotomization
5. Mukherji, Music
Keywords: Learning structure vs Learning by Repetition, Similarities and differences between musical “languages”, Parametric Variation in Language, Formalization of Bhatkhande-ian Theory
6. Wu, Mobile Health
Keywords: Areas of Mobile Health, Delivering intervention for the right person at the right time, Identify Real-time, Objective Predictors of Depression under Stress, Intern Health Study-Data Collection, Randomization Scheme, Imputing Missing Data
1. Data Mining Group, Presentation
Keywords: Examining the Effects of Cancer Cell Line Genetics on Drug Sensitivity, Genomics of Drug Sensitivity in Cancer Data, Pre-processing and Missing Data, (Group 1) Chemocracy: A Voting-Based Method, (Group 2) Accounting for Adverse Effects of Drugs, (Group 3) Data Combination and Result Uncertainty, (Group 4) Integration of Tumor Site into Predictive Models
2. Genomics Group, Presentation
Keywords: Quantitative Analysis of Polygenic Risk Score Prediction in the Genes for Good Cohort, Polygenic Risk for Hypertension through the Lens of Population Genetics, Neuroticism and Alcohol Use: A Bidirectional Mendelian Randomization Study, Survival Analysis of Glioblastoma Patients Through Tumor Tissue Deconvolution
3. Machine Learning Group, Presentation
Keywords: Predicting Survival of ICU Patients, Classification of Irregularly Sampled Clinical Time Series Data with Convolutional Neural Networks, Exploring the Robustness of Deep Learning Architectures, Exploring Feature Importance for Predictions of In Hospital Mortality
Additional Materials
Here are some additional materials that I found helpful during BDSI.
The two books were recommended by BDSI faculty. I used them to gain a deeper understanding of the mathematical details behind some of the topics covered during the lectures and research sessions. Particularly intriguing to me was the chapter on Dimensionality Reduction in “Hands-On Machine Learning with Scikit-Learn and TensorFlow”, which covers Locally Linear Embedding (LLE), Multidimensional Scaling (MDS), Isomap, and t-Distributed Stochastic Neighbor Embedding (t-SNE).
I found the paper myself as a follow-up to Chen, Bayesian Data Analysis.
Books
1. A. Géron, “Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems”, O'Reilly Media, 2017.
2. G. James et al., “An Introduction to Statistical Learning: with Applications in R”, Springer, 2013.
Papers
1. R. Neal, “MCMC using Hamiltonian dynamics”, 2012, arXiv: https://arxiv.org/pdf/1206.1901.pdf