David Rossell's Web

Book and Software

Book on high-dimensional model selection (work in progress)

Most of the statistical methods I have developed are freely available as R packages. Much of that work went into an R package called modelSelection (previously called mombf), which focuses on tools for high-dimensional inference in popular model classes such as regression, generalized linear models (GLMs), generalized additive models (GAMs) and graphical models. My free online book gives a gentle introduction to model selection, focusing on Bayesian methods and L0 criteria, and offers a hands-on guide with examples.

Below are some other R packages that I have developed over the years.

casper

URL: casper

The package infers alternative splicing patterns from paired-end RNA-seq data. The boost in performance relative to other methods comes from two simple notions:

- When you summarize your data, don't throw away information

- Induce shrinkage in your estimators

The package also addresses how to design RNA-seq experiments adequately: library preparation, sequencing depth and number of samples. Formally, the approach is based on Bayesian decision theory and the theory of optimal sequential experiments. In non-technical terms: one should (1) set up reasonable criteria and (2) look at all the data available so far to guide future decisions (i.e. how to run, or continue running, the experiment).

Cartoon with RNA-seq reads. Some span more than 2 exons and are highly informative about splicing. casper considers this information, but most methods discard it.

chroGPS

URL: chroGPS

The package provides a way to visualise large amounts of genetic and epigenetic data. The idea is to create intuitive maps that help navigate the epigenome, i.e. a GPS system. The perks of the methodology are that it is easy to use and interpret, that we carefully considered how to create good maps in a computationally manageable time, and that we investigated how to adjust for biases when integrating data from multiple sources.

gaga

URL: gaga

The software performs differential expression analysis by classifying genes into expression patterns, which we believe are more intuitively appealing (and more in line with what the scientist really wants) than the usual pairwise comparisons. Here's an example where one wishes to compare expression across 3 groups, which gives rise to 5 possible patterns.

Pattern 0: A = B = C

Pattern 1: A = B ≠ C

Pattern 2: A = C ≠ B

Pattern 3: A ≠ B = C

Pattern 4: A ≠ B ≠ C

In contrast, a typical analysis (e.g. pairwise comparisons after an F-test) might tell us "A is not different from B, B is not different from C, but A is different from C". Would this solution really make any sense to anybody?
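The five patterns listed above are exactly the ways of partitioning three groups into blocks with equal expression. The following is a minimal Python sketch, not gaga's own code, that enumerates such patterns for any number of groups; the function names `expression_patterns` and `describe` are hypothetical and chosen only for illustration.

```python
def expression_patterns(groups):
    """Yield every partition of `groups` into blocks of equal expression."""
    if not groups:
        yield []
        return
    first, rest = groups[0], groups[1:]
    for partition in expression_patterns(rest):
        # Put `first` into each existing block in turn...
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def describe(partition):
    """Render a partition such as [['A', 'B'], ['C']] as 'A = B ≠ C'."""
    blocks = sorted(sorted(block) for block in partition)
    return " ≠ ".join(" = ".join(block) for block in blocks)


patterns = sorted(describe(p) for p in expression_patterns(["A", "B", "C"]))
for pattern in patterns:
    print(pattern)
```

The number of patterns grows quickly with the number of groups (four groups already give 15), so in practice one would typically restrict attention to the patterns of scientific interest.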

Another method implemented in the package, and one I find particularly interesting, is the design of optimal sequential high-throughput experiments. We all know that reliable conclusions require a reliable experimental design and that sequential clinical trials are more ethical, economical and generally adequate than their fixed-sample counterparts. However, when it comes to high-throughput experiments we forget all these nice principles and hope that our designs will magically be efficient and lead to reliable conclusions. gaga implements a decision-theoretic framework which boils down to assessing the advantages of collecting vs. not collecting more data.

Sequential design for a high-throughput study to find differentially expressed genes. Each time we observe new samples (x-axis), we assess the expected increase in True Positive findings (y-axis). While this quantity is above the solid line we continue experimentation; otherwise we stop.
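The stopping rule in the caption can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the expected gains are taken as given (in gaga they come from the model, which is not reproduced here), the solid line is treated as a constant `threshold`, and the function name and numbers are hypothetical.

```python
def batches_to_collect(expected_gains, threshold):
    """Return how many batches to collect before stopping.

    expected_gains[i] is the expected increase in true positives from
    adding batch i + 1. We stop at the first batch whose expected gain
    does not exceed `threshold`.
    """
    for batch, gain in enumerate(expected_gains):
        if gain <= threshold:
            return batch
    return len(expected_gains)


# Hypothetical expected gains, made up for illustration only.
gains = [40.0, 22.0, 11.0, 4.0, 1.5]
print(batches_to_collect(gains, threshold=5.0))
```

In a real sequential design the expected gain would be recomputed after each new batch from all the data observed so far, rather than fixed in advance as in this toy list.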

htSeqTools

URL: htSeqTools

The package provides numerous data processing and quality control tools for sequencing data. For instance, it provides PCA analogues based on Multi-dimensional Scaling, finds and compares genomic regions accumulating large numbers of reads, and removes PCR over-amplification artefacts. A note regarding the latter: most methods for removing PCR artefacts assume that all repeated sequences are PCR artefacts. That can be seriously wrong!

Some sequences repeat naturally, especially when covering a narrow genomic area with many reads (e.g. ChIP-seq experiments). The number of natural repeats changes from experiment to experiment depending on coverage, targeted genomic regions, experimental protocol, etc. We attempt to quantify just how many natural repeats are expected in each case based on the observed data.

Number of read repeats in human and S. cerevisiae ChIP-seq data. For these human data a read may be repeated up to roughly 10 times. This is not due to PCR over-amplification but to the study targeting a small part of the genome. For the S. cerevisiae data a read may repeat naturally up to 100 times, consistent with the genome being shorter.
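To see why the number of natural repeats depends on coverage and on the size of the targeted region, here is a back-of-the-envelope Python sketch. It is not the method used in htSeqTools, which estimates the number of natural repeats from the observed data. It simply assumes that reads start uniformly at random over the targeted positions, so that the count per position is roughly Poisson. The function names, the `expected_excess` cut-off and the read and position counts are all hypothetical.

```python
import math


def poisson_tail(k, rate):
    """Return P(X >= k) for X ~ Poisson(rate), with rate > 0."""
    cdf = sum(
        math.exp(-rate + i * math.log(rate) - math.lgamma(i + 1))
        for i in range(k)
    )
    return max(0.0, 1.0 - cdf)


def max_natural_repeats(n_reads, n_positions, expected_excess=0.01):
    """Largest repeat count that is plausible without PCR artefacts.

    Assumes reads start uniformly at random over `n_positions`, so the
    count at a position is roughly Poisson(n_reads / n_positions).
    Returns the largest k such that at least `expected_excess` positions
    are expected to hold k or more reads (and never less than 1).
    """
    rate = n_reads / n_positions
    k = 1
    while n_positions * poisson_tail(k + 1, rate) >= expected_excess:
        k += 1
    return k


# Same number of reads, wide vs. narrow target (hypothetical sizes).
for n_positions in (10_000_000, 100_000):
    k = max_natural_repeats(1_000_000, n_positions)
    print(f"{n_positions:>10} positions: up to {k} natural repeats")
```

With the same number of reads, the narrower target yields a much larger plausible repeat count, which is the qualitative point made above. Real data are rarely uniform, which is one reason to estimate the number of natural repeats from the data themselves.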

Conditional on the hypothesis that theta is not 0, non-local prior densities formalize the idea that 0 is not a possible value for theta. They contrast with priors that set the mode at 0 (even though 0 is excluded under the assumed hypothesis).
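The contrast described in this caption can be checked numerically. The Python sketch below compares a zero-mean normal prior with one standard example of a non-local prior, the moment (MOM) prior, whose density is theta^2 / tau times the N(0, tau) density. The dispersion `tau = 1` and the function names are assumptions made for illustration, not defaults of any package.

```python
import math


def normal_pdf(theta, tau=1.0):
    """Density of a N(0, tau) prior, whose mode is at 0."""
    return math.exp(-theta**2 / (2 * tau)) / math.sqrt(2 * math.pi * tau)


def mom_pdf(theta, tau=1.0):
    """Density of the MOM non-local prior: theta^2 / tau * N(theta; 0, tau)."""
    return theta**2 / tau * normal_pdf(theta, tau)


for theta in (0.0, 0.5, 1.0, 2.0):
    print(f"theta={theta:.1f}  normal={normal_pdf(theta):.4f}  mom={mom_pdf(theta):.4f}")
```

The MOM density is exactly 0 at theta = 0 and peaks at plus or minus sqrt(2 * tau), whereas the normal density is largest at 0.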

Two-piece distributions

Javier Rubio developed the package twopiece (http://twopiece.r-forge.r-project.org), which implements basic functionality for two-piece distributions, including evaluation of the density function and random number generation. The package also contains functions that accompany our Chapter 10 in the Handbook of Mixture Analysis, implementing multivariate mixtures for continuous data that allow for heavy tails and asymmetry.

The modelSelection package also contains functionality for two-piece distributions, mainly variable selection when the residuals may be asymmetric and have thicker-than-normal tails (e.g. as described in Rossell & Rubio, JASA 2017). To our surprise, allowing even for such simple deviations from normality can result in significant improvements, particularly in terms of sensitivity to detect truly active variables.
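For readers unfamiliar with two-piece distributions, here is a minimal Python sketch of density evaluation and random number generation for the two-piece normal, which uses scale `sigma1` to the left of the mode `mu` and `sigma2` to the right. It does not reproduce the interface of the twopiece or modelSelection R packages; the function names and parameter values are hypothetical.

```python
import math
import random


def two_piece_normal_pdf(x, mu=0.0, sigma1=1.0, sigma2=1.0):
    """Two-piece normal density: scale sigma1 left of mu, sigma2 right of it."""
    sigma = sigma1 if x < mu else sigma2
    z = (x - mu) / sigma
    return 2.0 / (sigma1 + sigma2) * math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def two_piece_normal_rvs(n, mu=0.0, sigma1=1.0, sigma2=1.0, seed=None):
    """Draw n values: choose a side with probability proportional to its
    scale, then place a half-normal deviate on that side of mu."""
    rng = random.Random(seed)
    p_left = sigma1 / (sigma1 + sigma2)
    draws = []
    for _ in range(n):
        half_normal = abs(rng.gauss(0.0, 1.0))
        if rng.random() < p_left:
            draws.append(mu - sigma1 * half_normal)
        else:
            draws.append(mu + sigma2 * half_normal)
    return draws


sample = two_piece_normal_rvs(10_000, sigma1=1.0, sigma2=3.0, seed=1)
share_right = sum(x >= 0.0 for x in sample) / len(sample)
print(f"share of draws right of the mode: {share_right:.2f}")
```

With `sigma2` three times `sigma1`, about three quarters of the probability mass lies to the right of the mode, which is the kind of asymmetry that a normal error model cannot capture.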
