1. Introduction

What is the Genomic HyperBrowser?



There is a short answer to that question: The Genomic HyperBrowser is a statistical analysis system for genomic data. But what does this mean?

Let's look at the term: Genomic Data. The first point to notice is that genomic data is more than just gene lists. As an example, we will look at an output from the UCSC Genome Browser. Here we have different datasets defined over the human genome. We have repeating elements, genes, CpG islands, association studies. We have SNPs, conserved elements. We have output from the ENCODE project, and so on. Adding to this, next-generation sequencing techniques, such as ChIP-seq, are rapidly generating enormous amounts of genomic information. These kinds of datasets are usually called annotation tracks, or simply tracks.

All available genomic datasets can be envisioned as a huge matrix. On one axis, we have the actual base pairs, the nucleotide positions that the datasets refer to. There can be billions of base pairs in each genome. On the other axis, we have the different tracks: datasets from different experimental techniques, or related to different features of the genome. Adding to this, we have a third axis: all available genomes. New reference genomes are being sequenced on a daily basis. Personal genomes are also rapidly being generated, as the genomes of individual people are sequenced. In addition, we could add other axes, such as cell types and disease states. So the question logically arises: given this enormous amount of data, how do you compare the datasets?

Let's come down to earth with an example: Here, we have a part of the human chromosome arm 3p with different datasets plotted along the genome coordinates. The top curve is the DNA melting temperature, as predicted by a computer algorithm. We can see that there is a peak here, and looking at the other datasets, we notice that there is a peak in the number of genes in the same area. So it looks like there is a connection between these two datasets: the melting temperature seems to be higher in areas with many genes. But is this really so? And how do we do a statistical analysis of this relation? Adding even more complexity to this question, we notice that the percentage of GC in the genome is also higher in the same area. So what exactly is the relation here? As GC content is used directly in the computer algorithm, could it be that the peak in melting temperature is caused only by the amount of GC, which in turn is known to be higher inside genes? Or is there a connection between melting temperature and genes that is not explained by the GC content? These kinds of questions can be answered by the Genomic HyperBrowser.
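To make the confounding question concrete, here is a minimal sketch in Python. It is not how the HyperBrowser itself answers the question; it only illustrates one common way to ask whether two tracks are still related once a third is taken into account, namely a partial correlation. All data below are synthetic and all names (pearson, partial_correlation, gc, melting, gene_density) are made up for this illustration. The synthetic tracks are built so that melting temperature and gene density both depend on GC content and on nothing else.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic per-bin values (assumption: one value per genome bin).
rng = random.Random(42)
n_bins = 2000
gc = [rng.uniform(0.3, 0.6) for _ in range(n_bins)]

# Both tracks are driven by GC content plus independent noise,
# so any relation between them is due to GC alone.
melting = [70 + 40 * g + rng.gauss(0, 1.0) for g in gc]
gene_density = [50 * g + rng.gauss(0, 1.5) for g in gc]

raw = pearson(melting, gene_density)
controlled = partial_correlation(melting, gene_density, gc)

print(f"correlation, melting vs. genes:          {raw:.3f}")
print(f"partial correlation, controlling for GC: {controlled:.3f}")
```

In this constructed case the raw correlation is strong, while the partial correlation is close to zero, which is what "explained by the GC content" would look like. With real tracks the outcome could of course be different, and a linear correction like this is only one of several possible ways to control for a third track.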

So let's look at the HyperBrowser approach to answering these kinds of questions. The Genomic HyperBrowser is not only a tool, but also a methodology. We have created an abstract methodology in which we have identified five different types of genomic datasets. We have points, which are features located at specific base pairs. We have segments, which are features that span a region of the genome. And we have functions, which are datasets that assign a value to each base pair. Adding to this, we have valued versions of points and segments, so that, for example, a gene, which is a segment, can have an attached mark, such as an expression value. Having selected two tracks, each of a certain genomic type, the system presents a set of predefined questions that can be asked of this combination of datasets. In this example, we ask the question: Are the points of the first dataset located inside the segments of the other dataset? In addition to this, we have to select a null model. Choosing a null model is a difficult task, but it is essential for statistical validity. The null model should represent the set of random events that characterize the datasets. Having defined a null model and a question, the system selects an appropriate statistical test, which can be either exact or based on Monte Carlo simulation. The test is then carried out and the results are calculated. We have global results available for the whole genome. We can also ask the question for a set of bins, giving p-values or effect size values locally along the genome.
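The points-inside-segments question can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HyperBrowser's own implementation. The function names, the toy tracks and the genome length are all hypothetical. The null model used here is just one simple choice: the segments are kept fixed, and the points are placed uniformly and independently along the genome. Under that particular null model the number of points inside segments follows a binomial distribution, so the same question can be answered both by an exact test and by Monte Carlo.

```python
import math
import random
from bisect import bisect_right


def count_points_inside(points, segments):
    """Count points falling inside sorted, non-overlapping half-open segments."""
    starts = [start for start, _ in segments]
    count = 0
    for p in points:
        i = bisect_right(starts, p) - 1  # last segment starting at or before p
        if i >= 0 and p < segments[i][1]:
            count += 1
    return count


def monte_carlo_p_value(points, segments, genome_length, n_samples=999, seed=0):
    """One-sided Monte Carlo p-value for 'more points inside than expected'.

    Assumed null model: segments fixed, points uniform and independent.
    """
    rng = random.Random(seed)
    observed = count_points_inside(points, segments)
    at_least_as_extreme = 0
    for _ in range(n_samples):
        randomized = [rng.randrange(genome_length) for _ in points]
        if count_points_inside(randomized, segments) >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed data as one of the samples.
    return observed, (at_least_as_extreme + 1) / (n_samples + 1)


def exact_p_value(n_points, observed, coverage):
    """Exact binomial upper tail under the same assumed null model."""
    return sum(
        math.comb(n_points, k) * coverage ** k * (1 - coverage) ** (n_points - k)
        for k in range(observed, n_points + 1)
    )


# Toy tracks (hypothetical): a 10,000 bp genome, three segments, twenty points.
genome_length = 10_000
segments = [(1000, 1500), (4000, 4600), (8000, 8300)]
points = [1010, 1100, 1200, 1300, 1400, 4050, 4100, 4200, 4500, 8050,
          8100, 8250, 200, 2500, 3000, 5000, 6000, 7000, 9000, 9500]

coverage = sum(end - start for start, end in segments) / genome_length
observed, p_mc = monte_carlo_p_value(points, segments, genome_length)
p_exact = exact_p_value(len(points), observed, coverage)

print(f"points inside segments: {observed} of {len(points)}")
print(f"Monte Carlo p-value:    {p_mc:.3f}")
print(f"exact p-value:          {p_exact:.2e}")
```

A different null model, for example one that preserves the distances between neighbouring points, would generally have no simple exact distribution, and Monte Carlo would then be the natural choice. The same two functions could also be run separately on each bin of the genome to get local p-values instead of a single global one.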