Software bits

Most statistical methods I have worked on over the years are freely available in R or Bioconductor packages (jointly coded with numerous collaborators). In fact, some packages group several sets of related methods developed in separate papers. Below you can find a list, where I shamelessly advertise some of their features.

casper
URL: casper

The package infers alternative splicing patterns from paired-end RNA-seq data. The boost in performance relative to other methods comes from two simple notions:

- When you summarize your data, don't throw away information.
- Induce shrinkage in your estimators.

The package also addresses the issue of how to adequately design RNA-seq experiments: library preparation, sequencing depth and number of samples. Formally, the approach is based on Bayesian decision theory and the theory of optimal sequential experiments. In non-technical jargon: one should (1) set up reasonable criteria and (2) look at all data available so far to guide future decisions (i.e. how to run or continue running the experiment).

Figure: Cartoon with RNA-seq reads.
Some span more than two exons and are highly informative about splicing. casper considers this information, but most methods discard it.

chroGPS
URL: chroGPS

The package provides a nice way to visualise large amounts of genetic and epigenetic data. The idea is that it creates intuitive maps that help navigate the epigenome, i.e. a GPS system. The perks of the methodology are that it is easy to use and interpret, that we carefully considered how to create good maps in a computationally manageable time, and that we investigated how to adjust for biases when integrating data from multiple sources.

gaga
URL: gaga

The software performs differential expression analysis by classifying genes into expression patterns, which we believe are more intuitively appealing (and more in line with what the scientist really wants) than the usual pairwise comparisons. Here's an example where one wishes to compare expression across 3 groups, which gives rise to 5 possible patterns.

Pattern 0: A = B = C
Pattern 1: A = B ≠ C
Pattern 2: A = C ≠ B
Pattern 3: A ≠ B = C
Pattern 4: A ≠ B ≠ C

In contrast, a typical analysis (e.g. an F-test) might tell us "A is not different from B, B is not different from C, but A is different from C". Would this solution really make any sense to anybody?

Another interesting (to me) implemented method is how to design optimal sequential high-throughput experiments. We all know that reliable conclusions require a reliable experimental design, and that sequential clinical trials are more ethical, economical and generally adequate than their fixed-sample counterparts. However, when it comes to high-throughput experiments we forget all these nice principles and hope that our designs will magically be efficient and lead to reliable conclusions. gaga implements a decision-theoretic framework which boils down to assessing the advantages of collecting vs. not collecting more data.

Figure: Sequential design for a high-throughput study to find differentially expressed genes.
Each time we observe new samples (x-axis), we assess the expected increase in True Positive findings (y-axis). While this quantity is above the solid line we continue experimentation; otherwise we stop.

htSeqTools
URL: htSeqTools

The package provides numerous data processing and quality control tools for sequencing data. For instance, it provides PCA analogues based on Multidimensional Scaling, finds and compares genomic regions accumulating large numbers of reads, and removes PCR over-amplification artefacts.

A note regarding the latter: most methods for removing PCR artefacts assume that all repeated sequences are PCR artefacts. That can be seriously wrong! Some sequences repeat naturally, especially when covering a narrow genomic area with many reads (e.g. ChIP-seq experiments). The number of natural repeats changes from experiment to experiment depending on coverage, targeted genomic regions, experimental protocol... We attempt to quantify just how many natural repeats are expected in each case based on the observed data.

Figure: Number of read repeats in human and S. cerevisiae ChIP-seq data. For these human data a read may be repeated up to roughly 10 times. This is not due to PCR over-amplification but to the study targeting a small part of the genome. For the S. cerevisiae data a read may repeat naturally up to 100 times, consistent with the genome being shorter.

mombf

The package provides numerous methods related to non-local priors, including Bayes factors, high-dimensional model selection, posterior inference and density evaluations. Currently most methods target linear or binary regression models. The manual can be accessed by typing vignette("mombf") after loading the mombf library.

Figure: Conditional on the hypothesis that theta is not 0, non-local prior densities formalize the idea that 0 is not a possible value for theta.
They contrast with priors that place their mode at 0 (even though 0 is excluded under the assumed hypothesis).

isoregbf

The package computes Bayes factors using test statistics in ANOVA setups with order restrictions. Different order restrictions can be imposed and tested.
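
Going back to the non-local priors in the mombf entry, the idea can be made concrete with a few lines of code. The sketch below is plain Python rather than R, and it is not the mombf interface. It assumes one well-known non-local prior, the first-order moment (MOM) prior, whose density is theta^2 / v times a normal density with mean 0 and variance v. The function names and the choice v = 1 are illustrative assumptions.

```python
import math


def normal_density(theta, variance=1.0):
    """Density of a N(0, variance) prior: a prior with its mode at 0."""
    return math.exp(-theta ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def mom_density(theta, variance=1.0):
    """First-order moment (MOM) prior: theta^2 / variance times the normal density.

    The theta^2 factor forces the density to be exactly 0 at theta = 0,
    and dividing by the variance keeps the total probability equal to 1.
    """
    return (theta ** 2 / variance) * normal_density(theta, variance)


# Compare both priors at a few values of theta (variance = 1 is an assumption).
for theta in (0.0, 0.5, 1.0, 2.0):
    print(f"theta={theta:4.1f}  normal={normal_density(theta):.4f}  MOM={mom_density(theta):.4f}")
```

The normal prior puts its highest density at theta = 0, while the MOM prior vanishes there and pushes its mass towards non-zero values, which is the contrast the figure caption describes.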
Semiparametric Differential Expression Analysis via Partial Mixture Estimation

The file ebayes.l2e.r below implements the methods described in our paper. Simply download and source the file from R.

seqdesphII

Peter Mueller and I developed the R library seqdesphII implementing the methods presented in our paper “Screening designs for drug development”.
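
A closing note on the gaga entry above: the five expression patterns for three groups are exactly the ways of splitting A, B and C into sets with equal expression. The sketch below enumerates them for any number of groups. It is plain Python rather than R, it is not the gaga interface and makes no claim about how gaga stores patterns; the function names are illustrative assumptions.

```python
def set_partitions(items):
    """Yield every way of splitting `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or put it in a block of its own.
        yield [[first]] + partition


def describe(partition):
    """Write a partition in the page's notation, e.g. 'A = B ≠ C'."""
    blocks = sorted(sorted(block) for block in partition)
    return " ≠ ".join(" = ".join(block) for block in blocks)


# Groups with equal expression share a block; different blocks differ.
patterns = [describe(p) for p in set_partitions(["A", "B", "C"])]
for pattern in patterns:
    print(pattern)
```

The order in which the patterns are printed need not match the numbering used on this page. The number of patterns grows quickly with the number of groups, which is one reason to think about which patterns are scientifically relevant before running the analysis.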
