Most of the statistical methods I have developed are freely available as R packages. Below is a list, with some shameless advertising of their features. Just for fun, there are also some download statistics.
The casper package infers alternative splicing patterns from paired-end RNA-seq data. The boost in performance relative to other methods comes from two simple notions:
- When you summarize your data, don't throw away information
- Induce shrinkage in your estimators
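The second notion can be illustrated with a minimal sketch, written in Python for brevity and not taken from the package's code. It assumes a hypothetical gene with three isoforms and made-up read counts, and compares the raw proportions with the posterior mean under a symmetric Dirichlet prior (the `prior_count` parameter is an illustrative choice, not the package's default).

```python
def mle_proportions(counts):
    """Raw proportions: each count divided by the total."""
    total = sum(counts)
    return [c / total for c in counts]


def shrunken_proportions(counts, prior_count=1.0):
    """Posterior mean under a symmetric Dirichlet(prior_count) prior.

    Adding prior_count to every category pulls the estimates
    towards equal proportions (shrinkage).
    """
    total = sum(counts) + prior_count * len(counts)
    return [(c + prior_count) / total for c in counts]


# Made-up read counts for three hypothetical isoforms.
counts = [0, 3, 7]
print("raw:     ", mle_proportions(counts))
print("shrunken:", shrunken_proportions(counts))
```

With few reads, the raw estimate declares the first isoform completely absent, whereas the shrunken estimate leaves it a small positive proportion.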
The package also addresses how to design RNA-seq experiments adequately: library preparation, sequencing depth and number of samples. Formally, the approach is based on Bayesian decision theory and the theory of optimal sequential experiments. In non-technical terms: one should (1) set up reasonable criteria and (2) look at all the data available so far to guide future decisions (i.e. how to run, or whether to continue running, the experiment).
Cartoon with RNA-seq reads. Some span >2 exons and are highly informative about splicing. casper considers this information, but most methods discard it
The package provides a nice way to visualise large amounts of genetic and epigenetic data. The idea is to create intuitive maps that help navigate the epigenome, i.e. a GPS system. The perks of the methodology are that it is easy to use and interpret, that we carefully considered how to create good maps in a computationally manageable time, and that we investigated how to adjust for biases when integrating data from multiple sources.
The software performs differential expression analysis by classifying genes into expression patterns, which we believe are more intuitively appealing (and more in line with what the scientist really wants) than the usual pairwise comparisons. Here's an example where one wishes to compare expression across 3 groups, which gives rise to 5 possible patterns.
Pattern 0: A = B = C
Pattern 1: A = B ≠ C
Pattern 2: A = C ≠ B
Pattern 3: A ≠ B = C
Pattern 4: A ≠ B ≠ C
In contrast, a typical analysis (e.g. F-test) might tell us "A is not different from B, B is not different from C, but A is different from C". Would this solution really make any sense to anybody?
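The five patterns listed above are exactly the ways of partitioning the three groups into sets with equal expression. As a small illustration (in Python, independent of the package's own code), the sketch below enumerates these partitions for any number of groups; the function names are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def describe(partition):
    """Write a partition as an expression pattern, e.g. 'A = B != C'."""
    return " != ".join(" = ".join(block) for block in partition)


patterns = list(set_partitions(["A", "B", "C"]))
for pattern in patterns:
    print(describe(pattern))
print(len(patterns), "patterns")
```

The patterns are printed in a different order from the list above. The count grows quickly with the number of groups (15 patterns for 4 groups), so in practice one would typically restrict attention to the patterns of scientific interest.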
Another feature that I find interesting is a method to design optimal sequential high-throughput experiments. We all know that reliable conclusions require a reliable experimental design, and that sequential clinical trials are more ethical, economical and generally adequate than their fixed-sample counterparts. However, when it comes to high-throughput experiments we forget all these nice principles and hope that our designs will magically be efficient and lead to reliable conclusions. gaga implements a decision-theoretic framework which boils down to assessing the advantages of collecting vs. not collecting more data.
Sequential design for a high-throughput study to find differentially expressed genes. Each time we observe new samples (x-axis), we assess the expected increase in True Positive findings (y-axis). While this quantity is above the solid line we continue experimentation, else we stop
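The stopping rule in the figure can be sketched in a few lines of Python. This is only a schematic illustration, not the package's implementation: the expected gains are made-up numbers that would in practice be recomputed from the data after each batch, and the constant `threshold` stands in for the solid line.

```python
def batches_to_collect(expected_gains, threshold):
    """Count how many further batches are worth collecting.

    expected_gains[i] is the expected increase in true positives
    from collecting batch i + 1; sampling continues while this
    gain is above the threshold and stops the first time it is not.
    """
    collected = 0
    for gain in expected_gains:
        if gain <= threshold:
            break
        collected += 1
    return collected


# Made-up expected gains, decreasing as more samples are observed.
gains = [40.0, 22.0, 9.0, 3.5, 1.2]
print(batches_to_collect(gains, threshold=5.0))  # 3
```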
The package provides numerous data processing and quality control tools for sequencing data. For instance, it provides PCA analogues based on Multi-dimensional Scaling, finds and compares genomic regions accumulating large numbers of reads, and removes PCR over-amplification artefacts. A note regarding the latter: most methods for removing PCR artefacts assume that all repeated sequences are PCR artefacts. That can be seriously wrong!
Some sequences repeat naturally, especially when covering a narrow genomic area with many reads (e.g. ChIP-seq experiments). The number of natural repeats changes from experiment to experiment depending on coverage, targeted genomic regions, experimental protocol, etc. We attempt to quantify just how many natural repeats are expected in each case based on the observed data.
Number of read repeats in human and S. cerevisiae ChIP-seq data. For these human data a read may be repeated up to roughly 10 times. This is not due to PCR over-amplification but to the study targeting a small part of the genome. For the S. cerevisiae data a read may repeat naturally up to 100 times, consistent with the genome being shorter.
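To see why natural repeats depend on coverage, consider a deliberately crude approximation (sketched in Python; this is not the estimator used by the package, which works from the observed data). If `n_reads` reads fall uniformly at random over `n_positions` possible start positions, the number of reads per position is roughly Poisson with mean `n_reads / n_positions`, and one can ask how many repeats are still plausible by chance. All names and numbers below are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def repeat_cutoff(n_reads, n_positions, max_expected=1.0):
    """Smallest k such that fewer than `max_expected` positions are
    expected to carry k or more reads under uniform sampling."""
    lam = n_reads / n_positions
    k = 1
    while n_positions * poisson_sf(k, lam) >= max_expected:
        k += 1
    return k


# Same number of reads spread over a wide or a narrow target region.
print("wide target:  ", repeat_cutoff(1_000_000, 10_000_000))
print("narrow target:", repeat_cutoff(1_000_000, 100_000))
```

The narrower the targeted region, the more repeats arise without any PCR over-amplification, so a fixed rule such as "discard every repeated read" would throw away genuine signal.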
The package implements a range of methods related to non-local priors, including Bayes factors, high-dimensional model selection, posterior inference and density evaluation. Currently most methods target linear or binary regression models. The manual can be accessed by typing vignette("mombf") after loading the mombf library.
Conditional on the hypothesis that theta is not 0, non-local prior densities formalize the idea that 0 is not a possible value for theta. They contrast with priors which set the mode at 0 (even though 0 is excluded under the assumed hypothesis)
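As a concrete instance of the idea in the caption, the sketch below (plain Python, not the package's R interface) evaluates one well-known non-local prior, the first-order moment (MOM) prior, whose density is theta^2 / tau times a Normal(0, tau) density. The dispersion `tau = 1` is an arbitrary choice for illustration.

```python
import math


def normal_pdf(theta, var=1.0):
    """Density of a Normal(0, var) prior, which has its mode at 0."""
    return math.exp(-theta ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mom_pdf(theta, tau=1.0):
    """First-order MOM non-local prior: theta^2 / tau * Normal(theta; 0, tau).

    The density vanishes at theta = 0 and has modes at +/- sqrt(2 * tau).
    """
    return theta ** 2 / tau * normal_pdf(theta, tau)


for theta in (0.0, 0.5, 1.0, math.sqrt(2.0), 3.0):
    print(f"theta={theta:.3f}  normal={normal_pdf(theta):.4f}  mom={mom_pdf(theta):.4f}")
```

The Normal prior puts its highest density exactly at the value excluded by the hypothesis, whereas the MOM prior assigns it zero density.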
Javier Rubio developed the package twopiece (http://twopiece.r-forge.r-project.org), which implements basic functionality for two-piece distributions, including evaluation of the density function and random number generation. The package also contains functions that accompany our Chapter 10 in the Handbook of Mixture Analysis, implementing multivariate mixtures for continuous data that allow for heavy tails and asymmetry.
Package mombf also contains functionality for two-piece distributions, mainly variable selection when the residuals may be asymmetric and have thicker-than-normal tails (e.g. as described in Rossell & Rubio, JASA 2017). To our surprise, allowing even for such simple deviations from normality can result in significant improvements, particularly in terms of sensitivity to detect truly active variables.
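For readers unfamiliar with two-piece distributions, here is a minimal Python sketch of the simplest case, the two-piece normal, which joins two half-normals with different scales at the mode. It is an illustration only and does not reproduce the interface of twopiece or mombf; the function names are hypothetical, and this particular example captures asymmetry but not heavy tails.

```python
import math
import random


def two_piece_normal_pdf(x, mu=0.0, sigma_left=1.0, sigma_right=1.0):
    """Density with scale sigma_left below mu and sigma_right above it."""
    sigma = sigma_left if x < mu else sigma_right
    z = (x - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return 2.0 / (sigma_left + sigma_right) * phi


def two_piece_normal_sample(rng, mu=0.0, sigma_left=1.0, sigma_right=1.0):
    """Draw one value: choose a side, then scale a half-normal draw."""
    half = abs(rng.gauss(0.0, 1.0))
    if rng.random() < sigma_left / (sigma_left + sigma_right):
        return mu - sigma_left * half
    return mu + sigma_right * half


rng = random.Random(1)
draws = [two_piece_normal_sample(rng, 0.0, 1.0, 3.0) for _ in range(5)]
print(draws)
print(two_piece_normal_pdf(-1.0, 0.0, 1.0, 3.0), two_piece_normal_pdf(1.0, 0.0, 1.0, 3.0))
```

With `sigma_right` larger than `sigma_left` the distribution is right-skewed: a fraction `sigma_right / (sigma_left + sigma_right)` of the probability mass lies above the mode.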
The package computes Bayes factors using test statistics in ANOVA setups with order restrictions. Different order restrictions can be imposed and tested.
- On Unix/Mac OS: download the tar.gz, open a terminal, type tar xvzf isoregbf_0.0.3.tar.gz to unpack it, and then type R CMD INSTALL isoregbf.
- On Windows: follow the same process, but you first need to install Rtools and MinGW in order to install R packages from source.
- Source files: isoregbf_0.0.3.tar.gz
- Manual: load the package and type vignette("isoregbfmanual")
Semi-parametric Differential Expression Analysis via Partial Mixture Estimation
The file ebayes.l2e.r below implements the methods described in our paper. Simply download and source the file from R.
Peter Mueller and I developed the R library seqdesphII implementing the methods presented in our paper “Screening designs for drug development”.
This library is not yet available through the usual CRAN mirrors, but it’s fully functional and can be easily downloaded and installed.
- INSTALLATION INSTRUCTIONS
- Windows pre-compiled library
- Source files