Descriptive statistics

Tutorial 1: Descriptive statistics (Gene coverage)

In this tutorial, you will carry out two different analyses on genes.

Part 1

In this part, we are interested in answering the question: How much of the genome is covered by genes?

Start by creating a new user. This is needed if you want to be able to store the results for later access.
  • Click "Register" in the "User" menu in the top right of the screen
  • Type your e-mail and a password, and click "Create"
If you already have a user, you can instead log in with the user by clicking "Login" in the "User" menu.

The interface of the Genomic HyperBrowser is divided into four panels: one top panel and three panels dividing the area below vertically. The leftmost one, the tool panel, allows you to select tools and analyses. The top section of the tool panel contains tools created for the Genomic HyperBrowser, and the bottom section contains Galaxy tools. The middle panel is used for specifying input parameters and also for viewing data and analysis results. Lastly, there is the rightmost panel, the history panel, which can be thought of as a laboratory log book. Here, we store input data, intermediate data and the results from different analyses.

After the first login, a new history will be created, by default named "Unnamed history". You usually create a new history for each analysis, or connected set of analyses. You should rename the history, in order to easily locate the history if you want to go back to it at a later point:
  • Click the pencil button next to the history name
  • Type "Gene coverage" and press the Return key
All output from tools and analyses will be stored in this history, until a new one is created.

Now, let's start the analysis:
  • Select "The Genomic HyperBrowser" and "Perform analysis" in the tool panel
This is the main interface of the Genomic HyperBrowser.
  • Select "Human Mar. 2006 (hg18/NCBI36)" as the genome build
Next, you need to select one or two tracks, or datasets, that you want to analyze. Clicking the track selection boxes brings up a list of categories of the tracks that are in the system.
  • As the first track, select "Genes and gene subsets" as category, and then "Genes" in the next level of categories. Now all gene definitions that are in the system are listed. Select the "CCDS" track
  • Click the information button marked with "i"
Pressing it displays information about the selected track. Among other things, the box contains a description of the track, article references, the genomic type of the track (or track format) and the number of elements. In this case, the genomic type of the track is "unmarked segments", meaning that each genomic element has a position and a length, and the track contains almost 20 000 genes.
  • Close the information box
  • In this analysis we only want to include one track, so for Second Track, select "No track (single track analysis)"
The system now presents all applicable analysis options, based dynamically on the current choice of tracks.
  • Select the category "Descriptive statistics" and the analysis "Proportional coverage"
The selected statistical question is now presented as a fully formulated English sentence.

Some statistical tests may need a simplified track format as input. You will have the option to specify how it should be simplified. In this analysis we want to leave the track type as it is, as the default "Original format ('Unmarked segments')".

The last thing that needs to be specified is where, and at what scale, we want to do the analysis. In this case you want to do the analysis over all cytobands, producing both a global result and a local result for each cytoband.
  • Select "Cytobands", and leave the default "*" in the textbox. Here you could have narrowed the scale down to individually specified cytobands
Things are now in place to run the analysis.
  • Click the information button at the bottom of the screen (the run description)
A summary of your selected options appears. Make a habit of checking this output to make sure that everything is set the way you intended.
  • Click "Start analysis" to run the analysis
A "Perform analysis" element appears in the history panel. It is initially gray while in the queue, turns yellow when the analysis starts and, finally, turns green when the analysis is finished.
  • Click the eye symbol of the history element
The global result is displayed in the middle panel. According to CCDS, 25.97% of the human genome is covered by genes. The "assembly gap coverage" score is the proportion of base pairs that have not been sequenced in the genome assembly used. Most assembly gaps are centromeres or other heterochromatic regions. Here, you can see that a little more than 7% of the genome has not been sequenced.
  • Click the link "html" in the "Table: values per bin" column
The table presenting local results is shown, with the base pair coverage of genes for each cytoband. The table can be sorted by clicking the header of the column to sort on. Also notice that in some cytobands, the "assembly gap coverage" is 1, or 100%. These are typically centromeric regions.
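To make the statistic concrete, the following is a minimal sketch of how a proportional coverage could be computed by hand, both globally and per bin. It is not the HyperBrowser's implementation. It assumes that segments and bins are half-open (start, end) intervals on a single sequence, and the function names and toy coordinates are hypothetical.

```python
def merge_segments(segments):
    """Merge overlapping or adjacent half-open (start, end) segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Extend the previous segment instead of counting bases twice.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_bp(segments, bin_start, bin_end):
    """Count base pairs of [bin_start, bin_end) covered by at least one segment."""
    total = 0
    for start, end in merge_segments(segments):
        lo, hi = max(start, bin_start), min(end, bin_end)
        if hi > lo:
            total += hi - lo
    return total


def proportional_coverage(segments, bins):
    """Return (global proportion, list of per-bin proportions)."""
    per_bin = []
    covered_total = 0
    length_total = 0
    for bin_start, bin_end in bins:
        covered = covered_bp(segments, bin_start, bin_end)
        length = bin_end - bin_start
        per_bin.append(covered / length)
        covered_total += covered
        length_total += length
    return covered_total / length_total, per_bin


# Toy data (hypothetical coordinates, not real genes or cytobands).
genes = [(10, 30), (20, 50), (120, 140)]
bins = [(0, 100), (100, 200)]
global_prop, local_props = proportional_coverage(genes, bins)
print(global_prop, local_props)
```

Overlapping segments are merged first so that a base pair covered by several genes is counted only once. The global value is the total covered length divided by the total bin length, which is generally not the same as the plain average of the per-bin values when bins differ in size.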

Part 2

In this part of the tutorial, we are interested in answering the question: How do two different gene definitions compare in terms of genome coverage?
  • Select "The Genomic HyperBrowser" and "Perform analysis" again
  • Make sure "Human Mar. 2006 (hg18/NCBI36)" is selected as the genome build
  • As the first track, select "Genes and gene subsets", "Genes" and "CCDS"
  • As the second track, select the same subdirectory as above, but now choose the "Refseq" gene definition
  • Select "Descriptive statistics" and "Bp coverage"
  • Leave the formats of the tracks as the default "Original format ('Unmarked segments')"
  • In the Region and scale box, select "Cytobands" and leave the default "*" in the textbox
  • Click "Start analysis"
  • Click the eye symbol of the new history element
In the global results, you can find the number of base pairs covered by only one of the gene definitions, by both definitions, and by no genes of any definition. You can also find the corresponding proportions of the genome. Notice that almost 26% of the genome is covered by both definitions and that the Refseq definition on its own adds another 9%. The proportion of the genome covered by only CCDS is very small. It thus looks like the CCDS genes are something of a subset of the Refseq genes.
  • Click "html" under "Table: values per bin", for the local results
A table of the same statistics appears, with the values calculated for each cytoband.
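The four categories reported in Part 2 (covered by only the first track, only the second track, both, or neither) can be illustrated with a small sweep over segment boundaries. This is a sketch under stated assumptions, not the HyperBrowser's own code: intervals are half-open (start, end) pairs on one sequence, and the function name and toy coordinates are hypothetical.

```python
def coverage_categories(track_a, track_b, region_start, region_end):
    """Count bp in [region_start, region_end) by which tracks cover them."""
    # Build +1/-1 events at each segment boundary, clipped to the region.
    events = []
    for label, track in (("a", track_a), ("b", track_b)):
        for start, end in track:
            start, end = max(start, region_start), min(end, region_end)
            if end > start:
                events.append((start, label, 1))
                events.append((end, label, -1))
    events.sort()

    counts = {"only_a": 0, "only_b": 0, "both": 0, "neither": 0}
    depth = {"a": 0, "b": 0}  # number of open segments per track
    prev = region_start
    for pos, label, delta in events:
        length = pos - prev
        if length > 0:
            in_a, in_b = depth["a"] > 0, depth["b"] > 0
            if in_a and in_b:
                counts["both"] += length
            elif in_a:
                counts["only_a"] += length
            elif in_b:
                counts["only_b"] += length
            else:
                counts["neither"] += length
        depth[label] += delta
        prev = pos
    # Everything after the last boundary is uncovered.
    counts["neither"] += region_end - prev
    return counts


# Toy data (hypothetical): track A lies entirely inside track B.
track_a = [(10, 30)]
track_b = [(5, 40), (60, 70)]
counts = coverage_categories(track_a, track_b, 0, 100)
proportions = {key: value / 100 for key, value in counts.items()}
print(counts)
print(proportions)
```

In this toy example the "only_a" count is zero because every base covered by the first track is also covered by the second, which is the same pattern the tutorial points out for CCDS relative to Refseq.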