Home > Learning > BI/DS Tutorials and Workshops > 2026 Summer Workshops
Three-week course. Four hours per day. R-based labs. Virtual or in-person (at USC Columbia).
June 1-5: Foundations of Data Science in R
June 8-12: Statistical Modeling
June 15-19: Bioinformatics & High-Dimensional Data
9 to 11 am: Lecture and discussion
1 to 3 pm: R Lab and application
Target: All participants working with biological data. Assumes no prior R experience.
Monday, June 1, Day 1
MORNING
Course Overview & Orientation to Data Science
Why data science in biology? Course logistics, expectations, reproducibility principles.
AFTERNOON
Getting Started with R and RStudio
Tuesday, June 2, Day 2
MORNING
R Programming Fundamentals
Data types, vectors, matrices, lists, data frames; indexing; control flow (if/else, for loops); writing functions.
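As a rough illustration of the constructs listed for this session, the sketch below uses invented measurements. The names `lengths_mm`, `samples`, and `size_class` are hypothetical and are not taken from the course materials.

```r
# Hypothetical measurements; all values are invented for illustration.
lengths_mm <- c(12.1, 15.4, 9.8, 20.3)     # numeric vector
species    <- c("A", "B", "A", "B")        # character vector

# A data frame stores equal-length vectors as columns.
samples <- data.frame(species = species, length_mm = lengths_mm)

# Indexing by position, by logical condition, and by column name.
first_two <- lengths_mm[1:2]
long_ones <- samples[samples$length_mm > 10, ]
col_mean  <- mean(samples[["length_mm"]])

# A small function that uses if/else control flow.
size_class <- function(x, cutoff = 15) {
  if (x >= cutoff) "large" else "small"
}

# A for loop that applies the function to each element in turn.
classes <- character(length(lengths_mm))
for (i in seq_along(lengths_mm)) {
  classes[i] <- size_class(lengths_mm[i])
}
classes
```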
AFTERNOON
Working with Data in R
Importing CSV/Excel files; inspecting data (str, summary, head); basic data manipulation with base R.
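The import-and-inspect steps might look like the following sketch. It writes a tiny invented CSV to a temporary file so that it runs on its own; the column names are assumptions, and Excel import (which needs an add-on package) is not shown.

```r
# Write a small hypothetical CSV to a temporary file (invented data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample_id,treatment,expression",
             "s1,control,5.2",
             "s2,treated,7.9",
             "s3,control,NA",
             "s4,treated,8.4"), csv_path)

dat <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect structure, summary statistics, and the first rows.
str(dat)
summary(dat)
head(dat, 3)

# Basic manipulation with base R: subset rows, add a column, aggregate.
treated <- dat[dat$treatment == "treated", ]
dat$log_expression <- log2(dat$expression)

# The formula interface of aggregate() drops rows with missing values.
group_means <- aggregate(expression ~ treatment, data = dat, FUN = mean)
group_means
```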
Wednesday, June 3, Day 3
MORNING
Data Wrangling with the tidyverse
Tidy data principles; dplyr verbs (filter, select, mutate, group_by, summarize); pipes (%>%); tidyr (pivot_longer, pivot_wider).
AFTERNOON
tidyverse Lab
Hands-on cleaning and reshaping a messy biological dataset; joining tables; handling missing data.
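A minimal sketch of the reshaping, joining, and missing-data steps, assuming the dplyr and tidyr packages are installed. The tables `wide` and `gene_info` are invented stand-ins for a messy dataset, not the one used in the lab.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide count table: one column per condition_replicate.
wide <- tibble(
  gene   = c("g1", "g2", "g3"),
  ctrl_1 = c(10, 5, NA),
  ctrl_2 = c(12, 6, 3),
  trt_1  = c(20, 5, 4),
  trt_2  = c(22, 7, 5)
)

# Reshape to tidy (long) form, drop missing counts, add a derived column.
long <- wide %>%
  pivot_longer(-gene, names_to = c("condition", "replicate"),
               names_sep = "_", values_to = "count") %>%
  filter(!is.na(count)) %>%
  mutate(log_count = log2(count + 1))

# Hypothetical annotation table to join on the gene identifier.
gene_info <- tibble(gene = c("g1", "g2", "g3"),
                    chromosome = c("1", "X", "2"))

summary_tbl <- long %>%
  left_join(gene_info, by = "gene") %>%
  group_by(gene, chromosome, condition) %>%
  summarize(mean_count = mean(count), .groups = "drop")

# Back to wide form: one column of mean counts per condition.
wide_summary <- summary_tbl %>%
  pivot_wider(names_from = condition, values_from = mean_count)
wide_summary
```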
Thursday, June 4, Day 4
MORNING
Data Visualization with ggplot2
Grammar of graphics; geom types (point, bar, box, line, histogram, density); faceting; themes; color palettes for biological data.
AFTERNOON
ggplot2 Lab
Recreating publication-quality figures from genomics/ecology datasets; customizing axes, legends, and themes.
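The layering of geoms, facets, scales, and themes can be sketched as below, assuming ggplot2 is installed. The built-in `iris` data stands in for a genomics or ecology dataset, and the palette and sizes are arbitrary choices rather than course recommendations.

```r
library(ggplot2)

# Build the plot layer by layer: data + aesthetics, geom, facets, scales, theme.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ Species) +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)", color = "Species") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

# Save as a vector PDF to a temporary file; the device is chosen from the extension.
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 6, height = 3)
```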
Friday, June 5, Day 5
MORNING
Probability, Distributions, and Statistical Inference
Random variables; common distributions (Normal, Binomial, Poisson); Central Limit Theorem; p-values, confidence intervals, and their correct interpretation.
AFTERNOON
Simulation & Inference Lab
Simulating data in R; visualizing distributions; one- and two-sample t-tests; chi-square tests; Wilcoxon rank-sum test; interpreting output.
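A possible shape for the simulation and testing exercises is sketched here with base R only. All data are simulated, and the group means, sample sizes, and table counts are invented.

```r
set.seed(1)  # fixed seed so the simulation is repeatable

# Central Limit Theorem: means of skewed exponential samples look roughly Normal.
sample_means <- replicate(2000, mean(rexp(30, rate = 1)))
mean(sample_means)   # close to the true mean of 1
sd(sample_means)     # close to 1 / sqrt(30)
# In an interactive session, hist(sample_means) shows the bell shape.

# Two simulated groups with a true difference in means of 2.
control <- rnorm(20, mean = 10, sd = 2)
treated <- rnorm(20, mean = 12, sd = 2)

one_sample <- t.test(control, mu = 10)     # one-sample t-test
tt <- t.test(treated, control)             # Welch two-sample t-test (R's default)
wt <- wilcox.test(treated, control)        # Wilcoxon rank-sum test

# Chi-square test on a 2 x 2 table of hypothetical counts.
tab <- matrix(c(30, 10, 15, 25), nrow = 2,
              dimnames = list(group = c("control", "treated"),
                              outcome = c("no", "yes")))
cs <- chisq.test(tab)

c(t_test = tt$p.value, wilcoxon = wt$p.value, chi_square = cs$p.value)
```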
Target: Applied regression methods for quantitative, binary, time-to-event, and clustered outcomes.
Monday, June 8, Day 6
MORNING
Simple and Multiple Linear Regression
Model formulation; OLS estimation; interpretation of coefficients; assumptions; R² and model fit; introduction to confounding.
AFTERNOON
Linear Regression Lab
Fitting lm() models; diagnostic plots (residuals, Q-Q, leverage); testing assumptions; applying to a quantitative biological trait.
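The fit-then-diagnose workflow could look like this sketch. It uses R's built-in `ToothGrowth` data (tooth length in guinea pigs) as a stand-in quantitative trait; the dataset actually used in the lab is not specified in this outline.

```r
# Multiple linear regression: tooth length on dose and supplement type.
fit <- lm(len ~ dose + supp, data = ToothGrowth)

summary(fit)    # coefficients, standard errors, R-squared
confint(fit)    # 95% confidence intervals for the coefficients

r2 <- summary(fit)$r.squared

# Standard diagnostic plots (residuals vs fitted, Q-Q, scale-location,
# leverage), written to a temporary PDF so the script runs non-interactively.
diag_file <- tempfile(fileext = ".pdf")
pdf(diag_file, width = 7, height = 7)
par(mfrow = c(2, 2))
plot(fit)
dev.off()
```

In an interactive RStudio session the `pdf()` and `dev.off()` calls can be dropped so the plots appear in the Plots pane.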
Tuesday, June 9, Day 7
MORNING
Model Selection and Variable Importance
Overfitting and bias-variance tradeoff; AIC/BIC; stepwise selection (and its limitations); introduction to cross-validation.
AFTERNOON
Model Selection Lab
Comparing nested models; using step() and AIC; k-fold cross-validation with caret or rsample; interpreting results critically.
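The comparison steps can be sketched in base R as follows. The outline names caret or rsample for cross-validation; this sketch swaps in hand-written folds to avoid extra packages. The built-in `mtcars` data and the chosen predictors are placeholders only.

```r
set.seed(42)

# Two nested models on a built-in dataset (a stand-in, not course data).
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

anova(small, large)   # F-test comparing the nested models
AIC(small, large)     # lower AIC indicates a better fit/complexity tradeoff

# Backward stepwise selection by AIC, starting from the larger model.
selected <- step(large, direction = "backward", trace = 0)

# Hand-rolled k-fold cross-validation of out-of-sample RMSE.
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_rmse <- function(formula) {
  errs <- sapply(1:k, function(i) {
    train <- mtcars[folds != i, ]
    test  <- mtcars[folds == i, ]
    fit   <- lm(formula, data = train)
    sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  })
  mean(errs)
}

rmse_small <- cv_rmse(mpg ~ wt)
rmse_large <- cv_rmse(mpg ~ wt + hp + qsec + drat)
c(small = rmse_small, large = rmse_large)
```

Comparing the cross-validated errors with the in-sample AIC ranking is one way to check whether the larger model is overfitting.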
Wednesday, June 10, Day 8
MORNING
Logistic Regression and Binary Outcomes
Generalized linear models; logit link; odds ratios and their interpretation; model diagnostics; introduction to classification metrics.
AFTERNOON
Logistic Regression Lab
Fitting glm() for case-control data; computing and plotting ROC curves (pROC); evaluating sensitivity/specificity; applying to a disease outcome dataset.
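A minimal sketch of the logistic regression steps on simulated data, using base R only. The variables `exposure`, `age`, and `disease` and the true coefficients are invented; ROC curves with pROC, which the lab covers, are not shown here.

```r
set.seed(7)

# Simulate a hypothetical binary disease outcome.
n        <- 400
exposure <- rbinom(n, 1, 0.4)
age      <- rnorm(n, mean = 50, sd = 10)
lin_pred <- -1 + 0.9 * exposure + 0.03 * (age - 50)   # log-odds scale
disease  <- rbinom(n, 1, plogis(lin_pred))
dat      <- data.frame(disease, exposure, age)

# Logistic regression is a GLM with binomial family and logit link.
fit <- glm(disease ~ exposure + age, family = binomial, data = dat)

# Exponentiated coefficients are odds ratios; Wald intervals on the same scale.
odds_ratios <- exp(coef(fit))
or_ci       <- exp(confint.default(fit))

# Classify at a 0.5 probability cutoff and compute sensitivity/specificity.
pred_prob   <- predict(fit, type = "response")
pred_class  <- as.integer(pred_prob >= 0.5)
sensitivity <- sum(pred_class == 1 & dat$disease == 1) / sum(dat$disease == 1)
specificity <- sum(pred_class == 0 & dat$disease == 0) / sum(dat$disease == 0)

round(cbind(OR = odds_ratios, or_ci), 2)
```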
Thursday, June 11, Day 9
MORNING
Survival Analysis
Censoring and time-to-event data; Kaplan-Meier estimator; log-rank test; Cox proportional hazards model; checking the PH assumption.
AFTERNOON
Survival Analysis Lab
KM curves with survminer; log-rank tests; fitting coxph(); visualizing hazard ratios; applying to a clinical or ecological dataset.
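The core survival calls might be sketched as below, assuming the survival package (normally shipped with R) is available. The `lung` dataset bundled with that package is used as a stand-in clinical dataset, and plotting with survminer is left out.

```r
library(survival)

# Kaplan-Meier estimates by sex; in `lung`, status is coded 1 = censored, 2 = dead.
km <- survfit(Surv(time, status) ~ sex, data = lung)

# Log-rank test for a difference between the two survival curves.
lr <- survdiff(Surv(time, status) ~ sex, data = lung)

# Cox proportional hazards model adjusting for age.
cox <- coxph(Surv(time, status) ~ sex + age, data = lung)
hazard_ratios <- exp(coef(cox))

# Check the proportional hazards assumption via scaled Schoenfeld residuals.
ph_check <- cox.zph(cox)

hazard_ratios
ph_check
# In an interactive session, plot(km) draws the Kaplan-Meier curves.
```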
Friday, June 12, Day 10
MORNING
Mixed Models and Clustered/Repeated Data
Why standard regression fails with clustered data; random intercepts and slopes; fixed vs. random effects; model interpretation; ICC.
AFTERNOON
Mixed Models Lab
Fitting lme4::lmer() and glmer(); random effect structures; model comparison; applying to longitudinal biological data.
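As one possible illustration of random intercepts, random slopes, and the ICC, the sketch below fits two models to the `sleepstudy` data that ships with lme4 (reaction times measured repeatedly on the same subjects). It assumes lme4 is installed; the lab's own dataset may differ.

```r
library(lme4)

# Random intercept per subject.
ri <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

# Random intercept and random slope for Days per subject.
rs <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)

# Likelihood-ratio comparison of the two random-effect structures.
anova(ri, rs)

# Intraclass correlation from the random-intercept model:
# between-subject variance divided by total variance.
vc  <- as.data.frame(VarCorr(ri))
icc <- vc$vcov[vc$grp == "Subject"] / sum(vc$vcov)

fixef(rs)   # population-level (fixed) effects
icc
```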
Target: Genomics workflows, dimensionality reduction, penalized regression, and reproducible reporting.
Monday, June 15, Day 11
MORNING
Genomics Data: Structure, Formats, and Public Databases
FASTQ, BAM, VCF, count matrix formats; overview of NCBI/GEO/dbGaP; accessing public datasets; data provenance and metadata.
AFTERNOON
Accessing Public Genomics Data
Using GEOquery and Biobase to retrieve expression datasets; exploring metadata; quality assessment with basic EDA.
Tuesday, June 16, Day 12
MORNING
Differential Expression Analysis
RNA-seq workflow overview (alignment → counts → DE); negative binomial models; DESeq2/edgeR framework; normalization strategies.
AFTERNOON
DESeq2 Lab
Full DESeq2 workflow: importing count data, size factor normalization, dispersion estimation, Wald/LRT tests, results tables.
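To make the size factor step concrete, here is a hand-rolled base R illustration of the median-of-ratios idea. It is a conceptual sketch on an invented count matrix, not the DESeq2 implementation and not a substitute for calling the package in the lab.

```r
# Hypothetical counts: 4 genes x 3 samples. Sample s2 is sequenced twice as
# deep as s1, and s3 half as deep, with no real expression differences.
base_counts <- c(10, 100, 50, 8)
counts <- cbind(s1 = base_counts, s2 = 2 * base_counts, s3 = 0.5 * base_counts)
rownames(counts) <- paste0("gene", 1:4)

# Per-gene log geometric mean across samples (the pseudo-reference sample).
log_geo_means <- rowMeans(log(counts))

# Genes with a zero count have a non-finite log geometric mean and are skipped.
keep <- is.finite(log_geo_means)

# Size factor = median over genes of the ratio of each count to the reference.
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col[keep]) - log_geo_means[keep]))
})

# Dividing each column by its size factor puts samples on a common scale.
normalized <- sweep(counts, 2, size_factors, "/")
size_factors
normalized
```

Because the invented samples differ only in depth, every normalized column equals the base counts, which is the behavior the normalization is meant to achieve.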
Wednesday, June 17, Day 13
MORNING
Multiple Testing, FDR, and Visualization of High-Dimensional Results
Family-wise error rate vs. FDR; Bonferroni, Benjamini-Hochberg; q-values; volcano plots; MA plots; heatmaps.
AFTERNOON
Multiple Testing & Visualization Lab
Applying p.adjust(); generating volcano and MA plots with ggplot2; hierarchical clustering and heatmaps with pheatmap/ComplexHeatmap.
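The effect of the two corrections can be seen on a small vector of invented p-values, using only base R's `p.adjust()`. The numbers below are made up for illustration and do not come from any dataset.

```r
# Ten hypothetical raw p-values.
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36)

bonf <- p.adjust(p, method = "bonferroni")   # controls family-wise error rate
bh   <- p.adjust(p, method = "BH")           # controls false discovery rate

# Number of tests called significant at 0.05 under each approach.
n_sig <- c(raw        = sum(p < 0.05),
           bonferroni = sum(bonf < 0.05),
           bh         = sum(bh < 0.05))
n_sig   # raw = 5, bonferroni = 1, bh = 2

data.frame(p = p, bonferroni = bonf, bh = round(bh, 3))
```

Bonferroni is the most conservative here, while Benjamini-Hochberg retains more discoveries at the cost of tolerating a controlled share of false positives.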
Thursday, June 18, Day 14
MORNING
Dimensionality Reduction and Penalized Regression
PCA and its geometric interpretation; scree plots; biplots; introduction to LASSO/ridge/elastic net; coordinate descent; tuning λ.
AFTERNOON
PCA and glmnet Lab
prcomp() and factoextra for PCA; fitting penalized regression with glmnet; cross-validated λ selection; coefficient path plots; interpreting sparse solutions.
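The PCA half of this lab could start from a sketch like the following, which uses base R's `prcomp()` on the built-in `iris` measurements as a low-dimensional stand-in for expression data. factoextra and glmnet are not used here, so the penalized regression steps are not shown.

```r
# Numeric measurements only; centering and scaling put variables on one scale.
x   <- iris[, 1:4]
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component (the scree plot values).
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Sample coordinates on the first two principal components.
scores <- pca$x[, 1:2]
head(scores)

# Loadings: how each original variable contributes to each component.
pca$rotation

# In an interactive session:
#   plot(var_explained, type = "b")   # scree plot
#   biplot(pca)                       # samples and loadings together
```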
Friday, June 19, Day 15
MORNING
Reproducible Research and Course Capstone
R Markdown / Quarto for reproducible reporting; literate programming; project organization best practices; version control concepts.
AFTERNOON
Capstone Lab & Presentations
Students produce a short reproducible analysis report (R Markdown/Quarto) integrating skills from the course; brief group presentations and discussion.