Bioinformatics analysis of genomic annotations mostly consists of manipulating plain text files in which data are organized into columns. Combining the rows and columns of one or more flat files lets the bioinformatician generate new knowledge easily and efficiently, through comparisons and the calculation of summary statistics. We will use the UCSC Table browser to explore what these files look like, focusing on the RefSeq gene annotations.
Galaxy is an excellent web-based genome analysis tool designed for experimentalists with no computational background. With this web server it is possible to perform most of the tasks that Bioinformatics labs usually automate in Linux environments. Our main purpose here is to show the potential of Galaxy for managing basic genome information (e.g. lists of genes); in later sessions we will focus on the Galaxy services for massive sequencing data.
[ETC: 25 mins]
Open the UCSC Table browser
[http://genome.ucsc.edu/cgi-bin/hgTables]
Explore the Table browser interface
We will work with the mouse genome (mm10 assembly)
Select the NCBI RefSeq track and the UCSC RefSeq (refGene) table
Finally, let us work with the genome (region)
Press the summary/statistics button to see the totals
Go back to the Table browser main site
Press the get output button (output format: all fields from selected table)
Go back to the Table browser main site
Define output format to selected fields from primary and related tables
Press the get output button
Choose name, chrom, strand, txStart, txEnd, name2
Press the get output button
Go back to the Table browser main site
Keep the selected fields from primary and related tables option
Press the get output button
Switch the refSeqStatus table on
Press the allow selection button
Click on the status option
Press the get output button
Explore the current status of the list of transcripts
Go back to the Table browser main site
Set the output format to sequence (genomic)
Choose CDS and extract the sequences. Check the Start/Stop codons
Repeat with the promoter sequence (1 Kb) option
Play with the options to distinguish each element in a separate sequence
Reset the output format to selected fields from primary and related tables
Press the (filter) create button
Define the filter: genes on chr1, strand +, with >= 10 exons
Press the get output button
Add the exonCount to the fields to display
Press the get output button
Next, restrict the filter to genes whose status is Validated or Reviewed
Press the get output button
Go back to the Table browser main site
Clear the filter
Press the paste list button
Type these gene names: Nanog, Sox2, Pou5f1, Myc
Press the summary/statistics button to see the totals
Go back to the Table browser main site
Get their name, chrom, strand, txStart, txEnd, name2 and refSeqStatus
How many transcripts per gene do you see in the list?
Go back to the Table browser main site
Change the output format to custom track
Visualize these results (genes) in the graphical UCSC genome browser
Explore the LiftOver tool interface
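Two of the questions above (how many transcripts each gene has, and where a 1 Kb promoter lies) can also be answered directly on the exported fields. The following is a minimal Python sketch, not the Table browser's own method. It assumes rows in the order name, chrom, strand, txStart, txEnd, name2, treats coordinates as 0-based and half-open as UCSC tables do, and uses invented identifiers and coordinates (NM_TOY1, GeneA, ...). The function names are hypothetical.

```python
from collections import defaultdict

# Toy rows in the column order name, chrom, strand, txStart, txEnd, name2.
# All identifiers and coordinates are invented for illustration only.
ROWS = [
    ("NM_TOY1", "chr1", "+", 5000, 9000, "GeneA"),
    ("NM_TOY2", "chr1", "+", 5200, 9000, "GeneA"),
    ("NM_TOY3", "chr2", "-", 20000, 26000, "GeneB"),
]


def transcripts_per_gene(rows):
    """Map each gene symbol (name2) to the list of its transcript ids (name)."""
    genes = defaultdict(list)
    for name, _chrom, _strand, _start, _end, name2 in rows:
        genes[name2].append(name)
    return dict(genes)


def promoter_window(strand, tx_start, tx_end, size=1000):
    """Return the (start, end) interval of `size` bases upstream of the transcript.

    On the + strand the transcript begins at txStart, so the promoter ends there;
    on the - strand it begins at txEnd, so the promoter starts there.
    """
    if strand == "+":
        return max(0, tx_start - size), tx_start
    return tx_end, tx_end + size


if __name__ == "__main__":
    for gene, transcripts in transcripts_per_gene(ROWS).items():
        print(gene, len(transcripts), ",".join(transcripts))
    for name, chrom, strand, start, end, _gene in ROWS:
        print(name, chrom, *promoter_window(strand, start, end))
```

The strand check is the important design point: txStart is always the smaller coordinate, so for genes on the - strand the upstream region lies after txEnd, not before txStart.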
[ETC: 10 mins]
Open the Galaxy main server, or reach other servers here:
Register in Galaxy to work with your own account (you will receive a confirmation e-mail)
Open the Get Data menu in the Tools panel
Click on the UCSC Main table browser
Configure UCSC table browser for:
- D. melanogaster (BDGP R5/dm3)
- RefSeq genes, region: genome
- Output format: all fields from the table
Press the get output button
Next, press Send query to Galaxy (2-5 mins)
Focus on the History (on the right panel)
Click on the name of this job "1 UCSC Main on D. melanogaster..."
Press the Information icon to see the file size of this output
Press the EYE icon to visualize this dataset
Scroll along the content: each line corresponds to a gene transcript
Study the name of the attributes (columns)
Press the PENCIL to change the Name of this dataset to "Fly genes (dm3)"
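Studying the column names is easier when each one is paired with its position, because Galaxy tools such as Cut and Filter refer to columns by number (c1, c2, ...). The sketch below is one possible way to list them in Python. It assumes a tab-separated table whose header line starts with "#", as in UCSC Table browser output; the sample uses a shortened header and invented values, and describe_table is a hypothetical helper.

```python
import csv
import io

# A small stand-in for the downloaded table. The header is shortened and
# every value is invented; a real export has more columns.
SAMPLE = (
    "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\texonCount\tname2\n"
    "585\tNM_TOY1\tchr2L\t+\t7528\t9484\t3\tGeneA\n"
    "585\tNM_TOY2\tchr2L\t-\t9838\t21376\t12\tGeneB\n"
)


def describe_table(handle):
    """Return (column_names, row_count) for a tab-separated table with a header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # drop the comment marker on the header line
    row_count = sum(1 for row in reader if row)  # ignore empty lines
    return header, row_count


if __name__ == "__main__":
    columns, n_rows = describe_table(io.StringIO(SAMPLE))
    for index, column in enumerate(columns, start=1):
        print(f"c{index}\t{column}")
    print("rows:", n_rows)
```

To run it on a file saved from Galaxy, replace io.StringIO(SAMPLE) with an open file handle.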
[ETC: 15 mins]
Let us work on our "Fly genes (dm3)" dataset
First, we will select only certain attributes of the transcript:
Open the Text Manipulation menu
Click on Cut columns from a table
Use the description on the right to select these columns (in this order):
chrom, txStart, txEnd, strand, exonCount, name2, name
Execute the query
Open the results with the EYE icon and check the content
Edit the Name of this new dataset to "Fly genes (dm3, compact)"
Second, we will narrow the search down to one chromosome
Open the Filter and Sort menu
Click on Filter data on any column
Use this function to get the list of genes on chromosome 'chr2L' (positive strand)
Execute and check the correctness of the output
Refine this list by removing all genes with fewer than 10 exons
Click on Sort data
Sort our current dataset by gene name
Edit the Name of this new dataset to "Fly genes (dm3, processed)"
Open the Join, Subtract and Group menu
Click on Group data
Find out how to count the number of genes in our processed dataset
Edit the Name of this new dataset to "Fly genes (dm3, summary)"
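The cut, filter, sort and group steps above form a small pipeline, which can be sketched in Python as follows. This is an illustration of the logic under stated assumptions, not what Galaxy runs internally: the input rows are invented, their column order (name, chrom, strand, txStart, txEnd, exonCount, name2) is a simplification of the real table, and all function names are hypothetical.

```python
from operator import itemgetter

# Toy input, column order: name, chrom, strand, txStart, txEnd, exonCount, name2.
# Every identifier and coordinate is invented for illustration.
TRANSCRIPTS = [
    ("NM_TOY1", "chr2L", "+", 100, 900, 12, "GeneB"),
    ("NM_TOY2", "chr2L", "+", 150, 900, 11, "GeneB"),
    ("NM_TOY3", "chr2L", "+", 2000, 5000, 10, "GeneA"),
    ("NM_TOY4", "chr2L", "-", 7000, 9000, 15, "GeneC"),
    ("NM_TOY5", "chr3R", "+", 100, 800, 20, "GeneD"),
    ("NM_TOY6", "chr2L", "+", 9500, 9900, 2, "GeneE"),
]


def cut(rows):
    """Reorder columns to chrom, txStart, txEnd, strand, exonCount, name2, name."""
    return [
        (chrom, start, end, strand, exons, name2, name)
        for name, chrom, strand, start, end, exons, name2 in rows
    ]


def keep(rows, chrom="chr2L", strand="+", min_exons=10):
    """Keep rows on one chromosome and strand with at least `min_exons` exons."""
    return [r for r in rows if r[0] == chrom and r[3] == strand and r[4] >= min_exons]


def sort_by_gene(rows):
    """Sort rows by gene name (column 6 of the cut table, index 5)."""
    return sorted(rows, key=itemgetter(5))


def count_by_gene(rows):
    """Group by gene name and count the transcripts in each group."""
    counts = {}
    for row in rows:
        counts[row[5]] = counts.get(row[5], 0) + 1
    return counts


if __name__ == "__main__":
    processed = sort_by_gene(keep(cut(TRANSCRIPTS)))
    summary = count_by_gene(processed)
    for gene, n_transcripts in summary.items():
        print(gene, n_transcripts)
    print("genes:", len(summary))
```

Because each line is a transcript, grouping by gene name before counting avoids counting a gene once per transcript.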
[ETC: 10 mins]
Let us work on our "Fly genes (dm3, summary)" dataset to annotate the GO functions of its genes:
Open BIOMART in a new tab of your browser
Choose the database ENSEMBL GENES
Choose the dataset Drosophila melanogaster
Click on Filters (left menu)
Select chr2L in the REGION box
Press the Count button on top to see how many genes are included
Click on Attributes (left menu)
Open the GENE section and switch on the Gene Name field (switch off all the other boxes)
Open the EXTERNAL section to select GO term accession and the GO Term Name
Click on Results (top menu) to visualize a preview
Tick the Unique results only box
Choose TSV (Tab Separated Value) format instead of HTML (web)
Select compressed file (.gz) to save time and space
Press the Go button and wait
Save the file mart_export.txt.gz on your Desktop
Go back to our Galaxy session
Press the Download/Upload icon on top (left, Tools)
Press the Choose local files button and select mart_export.txt.gz
Press Start, close the window once the uploading is finished
A new job has been incorporated into the history
Press the EYE icon to visualize this new dataset
Edit the Name of this new dataset to "Fly genes (GO)"
Open the Join, Subtract and Group menu
Click on Join two Datasets
Combine our "Fly genes (dm3, summary)" with the new "Fly genes (GO)"
Keep the header lines before executing the job
Explore the GO functions that annotate genes in chr2L using this function
Edit the Name of this new dataset to "Fly genes (dm3, summary, GO)"
Open the Filter and Sort menu
Click on Select lines
Use this service to count how many genes are related to 'dorsal closure'
Find how many genes are involved in 'transcription factor activity'
Repeat the same for other biological functions (e.g. 'proliferation', 'apoptosis', etc.)
Choose a name for your History and open your list of Saved histories
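The join and line-selection steps of this exercise can be sketched in Python as shown below. This is a minimal illustration assuming two small tables keyed by gene name; the gene names, counts and GO accessions (GO:TOY0001, ...) are invented placeholders, and the function names are hypothetical. Galaxy's Select tool matches a pattern against whole lines, whereas this sketch uses a simpler case-insensitive substring match on the GO term name.

```python
# Toy stand-ins for the two Galaxy datasets; every value is invented.
GENE_SUMMARY = [  # gene name, number of transcripts
    ("GeneA", 1),
    ("GeneB", 2),
    ("GeneC", 1),
]
GO_ANNOTATION = [  # gene name, GO term accession, GO term name
    ("GeneA", "GO:TOY0001", "dorsal closure"),
    ("GeneA", "GO:TOY0002", "transcription factor activity"),
    ("GeneB", "GO:TOY0001", "dorsal closure"),
    ("GeneD", "GO:TOY0003", "apoptotic process"),
]


def join_on_gene(summary, annotation):
    """Inner join: one output row per (gene, GO term) pair present in both tables."""
    terms_by_gene = {}
    for gene, accession, term in annotation:
        terms_by_gene.setdefault(gene, []).append((accession, term))
    joined = []
    for gene, count in summary:
        for accession, term in terms_by_gene.get(gene, []):
            joined.append((gene, count, accession, term))
    return joined


def genes_matching(joined, pattern):
    """Distinct genes whose GO term name contains `pattern`, ignoring case."""
    needle = pattern.lower()
    return sorted({gene for gene, _n, _acc, term in joined if needle in term.lower()})


if __name__ == "__main__":
    joined = join_on_gene(GENE_SUMMARY, GO_ANNOTATION)
    for row in joined:
        print(*row, sep="\t")
    for query in ("dorsal closure", "transcription factor activity"):
        print(query, len(genes_matching(joined, query)))
```

A gene with several GO terms produces several joined lines, so counting distinct gene names rather than matching lines gives the number of genes related to a function.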