HMM Book

Hidden Markov Models with Applications in Computational Biology

Standard first-order Hidden Markov Models (HMMs) are frequently used methods for the analysis of sequential data in a broad range of scientific domains including applications in speech recognition or computational biology. HMMs are versatile and structurally simple models enabling probabilistic modeling based on a sound mathematical and algorithmic grounding. However, still most of the developed HMM-based approaches are only applying concepts of standard first-order HMMs. This is sufficient enough to reach good results in many applications, but most results can also be substantially improved by utilizing higher-order HMMs as demonstrated in the domains of speech or handwriting recognition, robotics or the analysis of protein and DNA sequences.

My main intention for writing this book based on my initial PhD thesis was to create an easily accessible and comprehensive extension of the mathematical and algorithmic basics of standard first-order HMMs to higher-order HMMs coupled with some selected practical applications in modern computational biology. Additionally, one of my other goals was to improve and ease the creation of biologically meaningful models by the integration of biological prior knowledge into the training of HMMs. This has indeed proved to be of utmost importance for my selected case studies. Moreover, I have also developed two novel more specialized model extensions (i) parsimonious higher-order HMMs and (ii) HMMs with scaled transition matrices to improve the modeling of sequential data. The parsimonious higher-order HMM enables a data-dependent interpolation between a mixture model that ignores spatial dependencies between adjacent sequence positions and a higher-order HMM that exhaustively models spatial dependencies in a sequence of data. This allows to use improved modeling characteristics of higher-order HMMs in cases of limited availability of sequential data and contributes to avoid overfitting of models. The HMM with scaled transition matrices enables the integration of additional prior knowledge into the state-transition process of the model for realizing an improved modeling of measurements in chromosomal contexts such as the differentiation between gene-pair orientations or the modeling of the distance between adjacent genes on a chromosome. I have applied all these different models in specific contexts covering the identification of differentially expressed genes in breast tumors, the identification of transcription factor target genes in yeast and plant data and for comparative genomics of two accessions of the model plant Arabidopsis thaliana. Additionally, I have made use of independent data for model evaluations in each case study and I have compared all extensions of HMMs to standard first-order HMMs and other typically used methods.

With this book, I try to address readers that already have some basic knowledge about standard first-order HMMs. What the reader can expect is to gain more insights into higher-order HMMs followed by details how one can derive specific model extensions. This should also allow the reader to develop own approaches based on higher-order HMMs. Since my research interests are in computational biology, some basic knowledge in biology is helpful, but definitely not mandatory for getting into the algorithmic extensions.

I have structured this book into two main parts. After a general introduction, the book starts with a theoretical part (chapters 2 -- 4) where I develop the algorithmic basics of higher-order HMMs and where I describe the extensions to parsimonious higher-order HMMs and HMMs with scaled transition matrices. In the second part (chapters 5 -- 8), I consider applications where I apply these models to the analysis of different DNA microarray data sets. Finally, I close the book with a comprehensive discussion of the achieved results. The first part of the book can be read independently of the second part and can be considered as a repository for the algorithmic basics of higher-order HMMs that has not been outlined in such a great detail in the existing literature.