Past projects

Spring 2020:

A Bayesian approach to super-resolution microscopy

  • Researcher: Shahzar Rizvi

  • Mentor: Bryan Liu

  • Faculty sponsor: Jon McAuliffe

  • Abstract: Recent advances in super-resolution microscopy techniques have enabled images to be taken with resolution higher than the diffraction limit. In brief, these techniques attach fluorophores to the sample of interest. The fluorophores are then stimulated to stochastically fluoresce. A key step in processing the resulting image is the estimation of the location, brightness, and number of fluorophores in a given frame.

We propose an approach grounded in the framework of amortized variational Bayes. This approach, still in development, has been applied to cataloging crowded starfields, where the problem is similar: instead of detecting fluorophores imaged by a microscope, we detect stars imaged by a telescope.

Our approach will be compared with existing approaches on both detection rate and computational speed.
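The generative model underlying such detection problems can be sketched in a few lines. This is a minimal illustration only; the Gaussian point-spread function, Poisson noise model, and all parameter names below are assumptions for demonstration, not the project's actual model.

```python
import numpy as np

def render_frame(locs, brightness, size=32, psf_sigma=1.5, background=5.0, rng=None):
    """Render one frame: each fluorophore contributes a Gaussian point-spread
    function; pixel counts are Poisson-distributed around the noiseless image."""
    rng = np.random.default_rng() if rng is None else rng
    yy, xx = np.mgrid[0:size, 0:size]
    image = np.full((size, size), background, dtype=float)
    for (y, x), b in zip(locs, brightness):
        image += b * np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * psf_sigma ** 2))
    return rng.poisson(image)  # photon shot noise

rng = np.random.default_rng(0)
locs = [(10.2, 14.7), (20.5, 8.3), (25.1, 25.9)]  # unknown in practice
frame = render_frame(locs, brightness=[200, 150, 180], rng=rng)
```

Inference, i.e., recovering the number, locations, and brightnesses of the fluorophores from `frame`, is the hard part; amortized variational Bayes trains a neural network on simulated frames so that an approximate posterior is produced in a single forward pass at test time.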

Building models and case studies using NIMBLE [Slides]

  • Researcher: Tianchi Liu

  • Mentor: Sally Paganin

  • Faculty sponsor: Chris Paciorek

  • Abstract: NIMBLE is R-based statistical software designed for building and sharing statistical models, enabling the use and customization of different sampling algorithms (MCMC, particle filtering, etc.).

To improve the software's accessibility, this project focuses on expanding the available learning resources, starting from examples coded in other, similar software (BUGS/JAGS/Stan). Depending on the student's interests, other activities may include developing case studies, implementing sampling algorithms, or contributing to the software itself.

Developing an R package for estimating causal effects in panel data

  • Researcher: Alan Liang

  • Mentor: Eli Ben-Michael

  • Faculty sponsor: Avi Feller

Fast computation of Bayesian multinomial logistic regression

  • Researcher: Kyle McEvoy

  • Mentor: Jared Fisher

  • Abstract: The goal of this project is to investigate the application of data augmentation methods to Bayesian multinomial logistic regression problems. When using a logistic regression model within a Bayesian framework, coefficients are often estimated using Markov Chain Monte Carlo (MCMC) methods, but this process can be slow and computationally taxing. Data augmentation methods have the potential to speed up this process by inducing independence in the coefficients of interest, opening up the possibility of parallelized sampling. In this project, we implemented our data augmentation approach in R with a Generalized Elliptical Slice Sampler, and we also attempted to implement our model in Stan using the default No-U-Turn Sampler. The results of the R implementation are compared with Stan's built-in Bayesian multinomial logistic regression model on the Iris dataset.
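For readers unfamiliar with the sampler family, here is a minimal sketch of standard elliptical slice sampling for a model with a zero-mean Gaussian prior. The project's generalized variant and its multinomial-logistic likelihood are not reproduced here; the toy Gaussian likelihood below is an assumption purely for demonstration.

```python
import numpy as np

def ess_step(f, log_lik, chol_prior, rng):
    """One elliptical slice sampling update for a prior N(0, Sigma),
    with Sigma supplied via its Cholesky factor."""
    nu = chol_prior @ rng.standard_normal(f.shape)  # auxiliary prior draw
    log_y = log_lik(f) + np.log(rng.uniform())      # slice level
    theta = rng.uniform(0.0, 2 * np.pi)
    theta_min, theta_max = theta - 2 * np.pi, theta
    while True:
        f_prop = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_prop) > log_y:                 # accept: above the slice
            return f_prop
        # shrink the bracket toward the current state and retry
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy demo: prior N(0, I) and Gaussian likelihood centred at 1, so the
# exact posterior is N(0.5, 0.5) in each coordinate.
rng = np.random.default_rng(0)
log_lik = lambda f: -0.5 * np.sum((f - 1.0) ** 2)
f = np.zeros(3)
chol = np.eye(3)
samples = []
for _ in range(2000):
    f = ess_step(f, log_lik, chol, rng)
    samples.append(f.copy())
samples = np.array(samples)
```

The update has no tuning parameters and always accepts, which is one reason slice-sampling schemes pair well with data-augmentation representations of logistic models.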

Home court/field/ice, causality and time series

  • Researcher: Liam Shaw

  • Mentor: Jared Fisher

Library size distribution for grouped microbiome data

  • Researcher: Christina Jin

  • Mentor: Johnny Hong

  • Faculty sponsor: Will Fithian

  • Abstract: In microbiome data analysis, uneven library sizes among observations present great challenges for proper statistical inference. The library size of an observation is the total number of sequencing reads in the observation. In practice, it is not uncommon for library sizes to vary across several orders of magnitude. Moreover, observations in microbiome data are often grouped (for example, fecal samples from patients in different treatment groups). The analysis of microbial composition often relies on permutation tests, which assume exchangeability of observations and hence the same library-size distribution across groups. The goal of the project is to investigate whether library sizes are typically dependent on group membership in grouped microbiome data.
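The question can itself be probed with a simple two-group permutation test on the library sizes. The sketch below is illustrative only: the difference-in-means statistic, the lognormal toy data, and all names are assumptions, not the project's actual data or method.

```python
import numpy as np

def permutation_pvalue(sizes, groups, n_perm=5000, rng=None):
    """Permutation p-value for a difference in mean library size
    between two groups (labels 0 and 1)."""
    rng = np.random.default_rng() if rng is None else rng
    sizes, groups = np.asarray(sizes, float), np.asarray(groups)
    observed = abs(sizes[groups == 0].mean() - sizes[groups == 1].mean())
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(groups)  # shuffle group labels
        stat = abs(sizes[perm == 0].mean() - sizes[perm == 1].mean())
        count += stat >= observed
    return (count + 1) / (n_perm + 1)   # add-one correction

# Toy library sizes spanning orders of magnitude, as described above.
rng = np.random.default_rng(1)
sizes = np.concatenate([rng.lognormal(9, 1, 30),    # group 0
                        rng.lognormal(10, 1, 30)])  # group 1
groups = np.repeat([0, 1], 30)
p = permutation_pvalue(sizes, groups, rng=rng)
```

A small p-value indicates library size depends on group membership, in which case the exchangeability assumption behind downstream permutation tests on composition is violated.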

Look who's talking

  • Researchers: Shivin Devgon, Hubert Luo, Catherine Wang, Gracie Yao

  • Mentor: Amanda Glazer

  • Faculty sponsor: Philip Stark

Model selection for differential expression in RNA sequencing [Slides]

  • Researcher: Claire Man

  • Mentor: Hector Roux de Bezieux

  • Faculty sponsor: Sandrine Dudoit

  • Abstract: Slingshot is a recently introduced package for single-cell RNA sequencing. The package was designed to infer a global trajectory of cell development and represent this structure as a curve. The curve encodes "pseudotime": the projection of each cell onto the curve, indicating how far the cell has progressed from the beginning of the trajectory (Street).

Another recently developed package, tradeSeq, tackles the problem of finding differentially expressed genes; it relies on a negative binomial model that has been shown to be appropriate for this type of data.

We test how Slingshot, acting as a pre-processing step, transforms the data, and how this affects subsequent modelling with tradeSeq. To do so, we build a Principal Component Analysis (PCA) reconstruction of the curve obtained from Slingshot, mapping it back to the same dimension as the original count matrix. We then compare the results of applying tradeSeq to the reconstructed count matrix and to the original count matrix.
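The PCA reconstruction step can be sketched in a few lines. This is a simplified stand-in: the toy count matrix, the choice of k, and all names are assumptions, and a real pipeline would use the packages' own projections rather than this bare SVD.

```python
import numpy as np

def pca_reconstruct(X, k):
    """Project X onto its top-k principal components, then map the
    low-dimensional scores back to the original dimension."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T        # low-dimensional representation
    return scores @ Vt[:k] + mu   # back to the original gene space

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(100, 50)).astype(float)  # toy count matrix
recon = pca_reconstruct(counts, k=10)
```

Reconstruction error shrinks as k grows; with k equal to the rank of the centered matrix the reconstruction is exact, so k controls how much of the original structure survives the round trip.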

Normalization of DNA chromatin conformation (ATAC-seq) datasets [Poster]

  • Researcher: Ameek Bindra

  • Mentor: Koen Van den Berge

  • Faculty sponsor: Sandrine Dudoit

  • Abstract: In high-dimensional biological datasets, normalization is key. Without normalization, one may obtain spurious and irreproducible results, while proper normalization often allows one to unravel the underlying biology in greater depth. Our group has been developing normalization methods for ATAC-seq data; ATAC-seq probes the chromatin conformation (i.e., folding) structure of DNA in biological cells. In this project, we will evaluate these normalization methods and provide insight into the unique biological results they may offer.

Optimal robust adversarial reinforcement learning

  • Researcher: Ryan Roggenkemper

  • Mentor: Ivana Malenica

  • Faculty sponsor: Mark Van der Laan

  • Abstract: The aim of the project is to build robust, stable, and efficient policies that readily transfer across multiple environments and initialization states by making use of advances in adversarial reinforcement learning. In this problem setting, a primary agent (the protagonist) seeks to learn an optimal policy while a secondary agent (the antagonist) introduces state destabilizations; this results in the learned policy being robust to diverse states while avoiding overfitting. Coupled with the formalism of adversarial examples that create natural observations, learned policies can be robustified while achieving statistical efficiency (i.e., variance minimization) at each iteration. We introduce the use of a formal notion of efficiency from semiparametric theory (the efficient influence function; EIF) to develop a variant of adversarial reinforcement learning that seeks to optimize a bias-variance trade-off by enforcing that an estimating equation based on the EIF be solved at each step. To achieve a targeted update of the policy for deep neural networks, we build on a previously introduced regularization of the training objective, called targeted regularization. Our final contribution is a statistical formalism that allows for inference on the final estimate under the learned optimal policy, obtained via Deep Q-learning.

Recovering gene expression datasets' true dimensionality: does initialization matter?

  • Researchers: Star Li and David Lyu

  • Mentors: Hector Roux de Bezieux and Koen Van den Berge

  • Faculty sponsor: Sandrine Dudoit

  • Abstract: Single-cell RNA sequencing (scRNA-seq) data has been at the frontier of bioinformatics research. Since the observed gene expression measurements are high-dimensional, dimensionality reduction techniques must be applied in downstream analysis for visualization. Aside from traditional principal component analysis (PCA), the most popular techniques in the field are t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). Both of these methods are unsupervised and require initialization. Our research examines how visualization quality changes with different sets of hyperparameters and different initializations. Our primary findings are that (1) in small samples, perplexity is the only t-SNE hyperparameter that influences visualization quality, and regardless of its value, global structure preservation is close to non-existent; (2) UMAP appears to preserve global structure, and its embedding is more similar to that produced by PCA; (3) t-SNE and UMAP identify non-linear structure in the data, perhaps at the cost of accurately capturing global structure. In addition, we developed a Shiny app allowing users to change the initialization setup and examine the resulting visualization using human embryonic stem cells collected by La Manno et al. (2016).
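One concrete way initialization enters these methods: many t-SNE and UMAP implementations accept a user-supplied starting embedding (e.g., the `init` array argument of scikit-learn's `TSNE`), and a common deterministic choice is the scaled top two principal components. The sketch below computes such an initialization; the scaling constant and all names are assumptions for illustration.

```python
import numpy as np

def pca_init(X, n_components=2, scale=1e-4):
    """Scaled PCA embedding, usable as a deterministic initialization
    for t-SNE/UMAP instead of a random start."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    emb = Xc @ Vt[:n_components].T
    # shrink so early optimization steps are not dominated by the init scale
    return emb / np.std(emb[:, 0]) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))  # toy expression matrix (cells x genes)
init = pca_init(X)
```

Because the starting embedding already reflects the data's global linear structure, this kind of initialization is one lever for the global-structure preservation discussed above, and it makes runs reproducible.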

Sequential ensemble learning of individualized treatment rules

  • Researcher: Yimeng Wang

  • Mentor: Aurelien Bibaut

  • Faculty sponsor: Mark Van der Laan

  • Abstract: We develop new contextual bandit algorithms for the task of sequentially learning individualized treatment rules.

Specifically, we develop variants of two existing contextual bandit algorithms, the Policy Elimination algorithm (Dudik et al., 2011) and the Epsilon-Greedy algorithm, so as to adapt them to large nonparametric policy classes. We also develop an ensemble learner that combines several Epsilon-Greedy algorithms, for which we provide regret guarantees.

Our methods belong to the family of oracle-based contextual bandits: they rely on classification and regression subroutines, which, for numerous policy classes, are implemented in existing supervised learning packages. We also show that they are computationally efficient.
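In the oracle-based spirit described above, a minimal epsilon-greedy contextual bandit with a regression subroutine per arm can be sketched as follows. The ridge-regression oracle, the simulated linear environment, and all parameters are illustrative assumptions, not the project's algorithms.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Epsilon-greedy contextual bandit with a ridge-regression
    reward oracle for each arm."""
    def __init__(self, n_arms, dim, eps=0.1, lam=1.0, rng=None):
        self.eps = eps
        self.rng = rng if rng is not None else np.random.default_rng()
        self.A = [lam * np.eye(dim) for _ in range(n_arms)]  # X'X + lam*I
        self.b = [np.zeros(dim) for _ in range(n_arms)]      # X'y

    def choose(self, x):
        if self.rng.uniform() < self.eps:                    # explore
            return int(self.rng.integers(len(self.A)))
        preds = [x @ np.linalg.solve(A, b) for A, b in zip(self.A, self.b)]
        return int(np.argmax(preds))                         # exploit

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Simulated environment: arm a's expected reward is x @ theta[a].
rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 5))
bandit = EpsilonGreedyBandit(n_arms=3, dim=5, rng=rng)
total = 0.0
for _ in range(500):
    x = rng.standard_normal(5)
    a = bandit.choose(x)
    r = x @ theta[a] + 0.1 * rng.standard_normal()
    bandit.update(a, x, r)
    total += r
```

The regression oracle here is deliberately simple; in the oracle-based framing, it could be swapped for any supervised learner over a richer policy class without changing the exploration logic.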