Software bits


Most of the statistical methods I have worked on over the years are freely available as R or Bioconductor packages (coded jointly with numerous collaborators). Some packages group several sets of related methods developed in separate papers. Below is a list in which I shamelessly advertise some of their features.


URL: casper

The package infers alternative splicing patterns from paired-end RNA-seq data. The boost in performance relative to other methods comes from two simple notions:

- When you summarize your data, don't throw away information

- Induce shrinkage in your estimators
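To make the second notion concrete, here is a minimal Python sketch of one common way to induce shrinkage: estimating proportions by their posterior mean under a symmetric Dirichlet prior. It is an illustration only and not casper's actual estimator; the function name, the toy counts and the prior weight are all assumptions.

```python
def shrunken_proportions(counts, prior_weight=1.0):
    """Posterior mean of proportions under a symmetric Dirichlet prior.

    counts: read counts per isoform (hypothetical toy input).
    prior_weight: pseudo-count added to every isoform (assumed value).
    """
    if prior_weight < 0:
        raise ValueError("prior_weight must be non-negative")
    total = sum(counts) + prior_weight * len(counts)
    return [(c + prior_weight) / total for c in counts]


counts = [9, 1, 0]  # toy data: three isoforms of one gene
raw = [c / sum(counts) for c in counts]
shrunk = shrunken_proportions(counts, prior_weight=1.0)

print("raw:   ", raw)
print("shrunk:", shrunk)
```

The raw estimate declares the third isoform absent on the basis of only ten reads, whereas the shrunken estimate pulls all proportions towards equal shares. The pull weakens as the number of reads grows.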

The package also addresses the issue of how to adequately design RNA-seq experiments: library preparation, sequencing depth and number of samples. Formally, the approach is based on Bayesian decision theory and the theory of optimal sequential experiments. In plain terms: one should (1) set up reasonable criteria and (2) look at all the data available so far to guide future decisions (i.e. how to run or continue running the experiment).

Cartoon with RNA-seq reads. Some span >2 exons and are highly informative about splicing. casper considers this information, but most methods discard it


URL: chroGPS

The package provides a nice way to visualise large amounts of genetic and epigenetic data. The idea is to create intuitive maps that help navigate the epigenome, i.e. a GPS system. The perks of the methodology are that it is easy to use and interpret, that we carefully considered how to create good maps in a computationally manageable time, and that we investigated how to adjust for biases when integrating data from multiple sources.

Visualize epigenetic factors

Visualize genes


URL: gaga

The software performs differential expression analysis by classifying genes into expression patterns, which we believe are more intuitively appealing (and closer to what the scientist really wants) than the usual pairwise comparisons. Here is an example in which one wishes to compare expression across 3 groups, which gives rise to 5 possible patterns.

Pattern 0: A = B = C

Pattern 1: A = B ≠ C

Pattern 2: A = C ≠ B

Pattern 3: A ≠ B = C

Pattern 4: A ≠ B ≠ C (all three groups differ)

In contrast, a typical analysis based on pairwise comparisons (e.g. following an F-test) might tell us "A is not different from B, B is not different from C, but A is different from C". Would such a conclusion really make sense to anybody?
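The five patterns listed above are exactly the ways of partitioning three groups into sets with equal expression. The following Python sketch enumerates them; it is only an illustration (the gaga package itself is written for R, and the function names here are my own).

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


def describe(partition):
    """Groups joined by '=' share a mean; '|' separates means that differ."""
    return " | ".join(" = ".join(block) for block in partition)


patterns = list(set_partitions(["A", "B", "C"]))
for pattern in patterns:
    print(describe(pattern))
```

The number of patterns grows quickly with the number of groups (it is the Bell number), so with many groups one would typically restrict attention to the patterns of scientific interest.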

Another method implemented in the package, and an interesting one (to me), addresses how to design optimal sequential high-throughput experiments. We all know that reliable conclusions require a reliable experimental design and that sequential clinical trials are more ethical, economical and generally adequate than their fixed-sample counterparts. However, when it comes to high-throughput experiments we forget all these nice principles and hope that our designs will magically be efficient and lead to reliable conclusions. gaga implements a decision-theoretic framework which boils down to assessing the advantages of collecting vs. not collecting more data.

Sequential design for a high-throughput study to find differentially expressed genes. Each time we observe new samples (x-axis), we assess the expected increase in True Positive findings (y-axis). While this quantity is above the solid line we continue experimentation, otherwise we stop.
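The stopping rule in the caption above can be sketched in a few lines of Python. This is a deliberately simplified illustration: the expected gains are given as a fixed list of made-up numbers, whereas in a real sequential design each gain is recomputed from the data observed so far, and the constant threshold stands in for the solid line. None of the names below come from gaga.

```python
def batches_to_collect(expected_gains, threshold):
    """Count how many further batches are collected before stopping.

    expected_gains[i]: expected increase in true positives from adding
        batch i + 1 (hypothetical, pre-computed values).
    threshold: minimum gain that justifies another batch (assumed constant).
    """
    collected = 0
    for gain in expected_gains:
        if gain <= threshold:
            break  # the gain no longer exceeds the line: stop here
        collected += 1
    return collected


gains = [12.0, 7.0, 3.0, 1.0]  # made-up diminishing returns
print(batches_to_collect(gains, threshold=2.5))
```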


URL: htSeqTools

The package provides numerous data processing and quality control tools for sequencing data. For instance, it provides PCA analogues based on Multi-dimensional Scaling, finds and compares genomic regions accumulating large numbers of reads, and removes PCR over-amplification artefacts. A note regarding the latter: most methods for removing PCR artefacts assume that all repeated sequences are PCR artefacts. That can be seriously wrong!

Some sequences repeat naturally, especially when covering a narrow genomic area with many reads (e.g. ChIP-seq experiments). The number of natural repeats changes from experiment to experiment depending on coverage, targeted genomic regions, experimental protocol, etc. We attempt to quantify just how many natural repeats are expected in each case based on the observed data.

Number of read repeats in Human and S. cerevisiae ChIP-seq data. For these human data a read may be repeated up to roughly 10 times. This is not due to PCR over-amplification but to the study targeting a small part of the genome. For the S. cerevisiae data a read may repeat naturally up to 100 times, consistent with the genome being shorter.
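To see why natural repeats depend on coverage and on the size of the targeted region, consider a toy model in which reads start uniformly at random over the available positions, so that the number of reads per position is roughly Poisson. The Python sketch below is not the estimator used in htSeqTools, which works from the observed data; the uniform model, the function names and the numbers are all assumptions made for illustration.

```python
import math


def poisson_upper_tail(k, rate, n_terms=500):
    """P(X >= k) for X ~ Poisson(rate), summed directly in the upper tail.

    Summing n_terms terms from k upwards is adequate for the small rates
    used here; it is not a general-purpose implementation.
    """
    log_rate = math.log(rate)
    return sum(
        math.exp(-rate + j * log_rate - math.lgamma(j + 1))
        for j in range(k, k + n_terms)
    )


def largest_expected_repeat(n_reads, n_positions):
    """Largest k such that at least one position is expected to carry
    k or more reads purely by chance, under uniform sampling."""
    rate = n_reads / n_positions
    k = 1
    while n_positions * poisson_upper_tail(k + 1, rate) >= 1:
        k += 1
    return k


n_reads = 10_000_000  # hypothetical sequencing depth
large_target = largest_expected_repeat(n_reads, 3_000_000_000)
small_target = largest_expected_repeat(n_reads, 12_000_000)
print(large_target, small_target)
```

With the same number of reads, the smaller target region tolerates many more repeats before a repeat looks suspicious. Real ChIP-seq reads concentrate in enriched regions, so the natural repeat counts in practice are higher than this uniform model suggests.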


The mombf package provides numerous methods related to Non-Local Priors, including Bayes factors, high-dimensional model selection, posterior inference and density evaluations. Currently most methods target linear or binary regression models. The manual can be accessed by typing vignette("mombf") after loading the mombf library.

Conditional on the hypothesis that theta is not 0, non-local prior densities formalize the idea that 0 is not a possible value for theta. They contrast with local priors, which place their mode at 0 (even though 0 is excluded under the assumed hypothesis).
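As a small numerical illustration of the caption above, the Python sketch below compares a Normal prior centred at 0 with a moment (MOM) non-local prior, written here as (theta^2 / tau) times a Normal(0, tau) density. The parameterization and the value of tau are assumptions made for this example and may differ from the conventions used inside mombf.

```python
import math


def normal_density(theta, tau):
    """Density of a Normal(0, tau) prior, where tau is the variance."""
    return math.exp(-theta ** 2 / (2 * tau)) / math.sqrt(2 * math.pi * tau)


def mom_density(theta, tau):
    """Moment non-local prior: (theta^2 / tau) * Normal(theta; 0, tau)."""
    return (theta ** 2 / tau) * normal_density(theta, tau)


def integrate(f, lower, upper, n_steps=20000):
    """Midpoint-rule approximation to the integral of f over [lower, upper]."""
    width = (upper - lower) / n_steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(n_steps))


tau = 1.0  # assumed prior dispersion
total_mass = integrate(lambda t: mom_density(t, tau), -12.0, 12.0)

for theta in (0.0, 0.5, 1.0, 2.0):
    print(theta, normal_density(theta, tau), mom_density(theta, tau))
print("total mass of the MOM prior:", total_mass)
```

The Normal prior is highest at theta = 0, whereas the MOM prior vanishes there and still integrates to 1, because the second moment of a Normal(0, tau) variable equals tau.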


The isoregbf package computes Bayes factors using test statistics in ANOVA setups with order restrictions. Different order restrictions can be imposed and tested.

  • Installation on Unix/Mac OS: download the tar.gz, open a terminal and type tar xvzf isoregbf_0.0.3.tar.gz to unpack it, then type R CMD INSTALL isoregbf. On Windows follow the same process, but you first need to install Rtools and MinGW in order to install R packages from source.
  • Source files: isoregbf_0.0.3.tar.gz 
  • Manual: load the package and type vignette("isoregbfmanual") 

Semi-parametric Differential Expression Analysis via Partial Mixture Estimation 

The file ebayes.l2e.r below implements the methods described in our paper. Simply download and source the file from R.


Peter Mueller and I developed the R library seqdesphII implementing the methods presented in our paper “Screening designs for drug development”.
This library is not yet available through the usual CRAN mirrors, but it’s fully functional and can be easily downloaded and installed.