With the advent of RNA-seq technologies, it is now feasible to monitor and quantify the expression of the full transcriptome of a genome. Such a volume of information is invaluable for identifying novel transcribed regions and for unveiling new alternatively spliced forms of known genes. Here we will learn how to access and process RNA-seq information, using NCBI GEO and SRA to explore the raw data. Next, we will use Galaxy to explore a sample of raw RNA-seq data (paired-end, strand-specific) that has previously been aligned to the mouse genome. We aim to examine the content of this file, compute basic statistics about the quality of the mapping and the resulting reads, and build a custom track in BED format to graphically represent the mapped reads along the mouse genome. Finally, we will open the UCSC Browser to visualize a fragment of a real RNA-seq experiment. Keep in mind that, in a real case, mapping an RNA-seq sample usually takes more than one day, even when running on powerful workstations.
[ETC: 15 mins]
Open the following publication at PUBMED:
Read the Summary of the article
Use the link Related information (menu on the right) to access the GEO dataset GSE57982
Explore the list of RNAseq samples associated with this project
Open the Primary T-ALL sample_1 entry (GSM1399182)
Carefully read the information about the experimental and computational details
Click on the SRA link SRX553457 to access the raw data information at the SRA
Click on one run to obtain more information about this RNAseq
If you click the Download->FASTA/FASTQ menu, you can obtain the raw reads in FASTA/FASTQ format (caution: FASTQ files are very large and can fill up your hard drive)
Alternatively, search for the same RNAseq experiment in the ENA [LINK]
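To get a feeling for what the raw data downloaded from SRA or ENA contain, the following minimal sketch reads FASTQ records and reports the length and mean base quality of each read. It is only an illustration: the function name read_fastq and the two example records are made up, and the qualities are assumed to use the Phred+33 encoding and the plain four-lines-per-record layout.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33)
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Two invented records, only to show the format; real files hold millions of reads
example = io.StringIO(
    "@read1/1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2/1\nTTGCA\n+\n!!!!!\n"
)

records = list(read_fastq(example))
for name, seq, quals in records:
    print(name, len(seq), sum(quals) / len(quals))
```

For a real file, the io.StringIO object would be replaced by an open file handle (or gzip.open in text mode for compressed FASTQ files).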
[ETC: 30 mins]
Open Galaxy:
[https://usegalaxy.org/] ----- Mirrors (more web servers here)
Log in to your own Galaxy account with your username and password
Create a New History (+)
RNAseq mapping is a time-consuming activity, so we will work with a set of RNAseq aligned reads
Download this BAM file (aligned reads) to your computer to skip the mapping step
Use the Upload icon at the top to upload this file to Galaxy (format: auto-detect, genome: mm9)
Prepare the BAM file for visualization by running Samtools sort
Use the display at UCSC main link to see the reads mapped over mm9
Show only the RefSeq genes and the BAM reads. Explore chr1
Check how the aligned reads fit into exonic-intronic structures (e.g. Xkr4 gene)
Click on one read to explore its characteristics
Rename this file in Galaxy as RNAseq mapped reads (BAM, paired-end, strand-specific)
Run the SAMtools flagstat service on the BAM file to see the final statistics of the mapping
Analyze the number of reads of each class
Rename the result as Mapping statistics
Go back to the BAM file
Carefully analyze the output file (picture below)
Use UCSC BLAT to check, for three reads, that the mapping to mm9 is correct
Open the following web resource to understand the SAM flag of each mapped read
Please remember that we are working with paired-end, strand-specific RNAseq
Double-check the analysis of the SAM FLAG of each read with the Mapping statistics
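The cross-check between individual SAM flags and the mapping statistics can be sketched in a few lines of Python. The bit values below are those defined in the SAM specification; the helper names (decode_flag, tally) and the list of example flags are invented for illustration, and the tally is a simplified count per class, not a reimplementation of SAMtools flagstat.

```python
from collections import Counter

# Bit values from the SAM specification; the short labels are our own
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "read1"),
    (0x80, "read2"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def tally(flags):
    """Count reads per class, in the spirit of a flagstat summary."""
    counts = Counter()
    for flag in flags:
        counts["total"] += 1
        if not flag & 0x4:
            counts["mapped"] += 1
        for name in decode_flag(flag):
            counts[name] += 1
    return counts


# Hypothetical flags: two properly paired fragments and one unmapped pair
example_flags = [99, 147, 83, 163, 77, 141]

print(decode_flag(99))
stats = tally(example_flags)
for key in ("total", "mapped", "paired", "proper_pair", "read1", "read2"):
    print(key, stats[key])
```

In a paired-end, strand-specific library, the combination of the read1/read2 bits with the reverse_strand bit indicates the strand of the original transcript; which combination corresponds to which strand depends on the library protocol, so it should be confirmed against the experimental details of the sample.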
Open the BED block of functions
Run the bedtools Genome Coverage service to create a BedGraph profile
Unset the option: Report regions with zero coverage
Use the display at UCSC main link to see the profile in mm9
You will notice that the custom track is formatted by Galaxy as BED (not BedGraph)
Use the Table browser to get a copy of the records
Add the new custom track and paste the bedgraph information:
track type=bedGraph name=test
Focus on the Sox17 locus to see the BAM reads and the BED/BEDGRAPH tracks
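To make the relationship between the mapped reads and the BedGraph profile more concrete, here is a minimal sketch of how a coverage profile can be derived from read intervals. It is not the algorithm used by bedtools: the function coverage_bedgraph is hypothetical, the intervals are invented coordinates (0-based, half-open, as in BED), and spliced reads are assumed to have been split into their aligned blocks beforehand. Regions with zero coverage are omitted, as in the Galaxy step above.

```python
from collections import defaultdict


def coverage_bedgraph(chrom, intervals):
    """Return (chrom, start, end, depth) segments with depth > 0.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    """
    # +1 where a read starts, -1 where it ends
    deltas = defaultdict(int)
    for start, end in intervals:
        deltas[start] += 1
        deltas[end] -= 1

    segments = []
    depth = 0
    prev = None
    for pos in sorted(deltas):
        if deltas[pos] == 0:
            continue  # depth does not change here, keep extending the segment
        if prev is not None and depth > 0:
            segments.append((chrom, prev, pos, depth))
        depth += deltas[pos]
        prev = pos
    return segments


# Invented read positions on chr1, for illustration only
reads = [(100, 150), (120, 170), (300, 350)]
segments = coverage_bedgraph("chr1", reads)

print("track type=bedGraph name=test")
for chrom, start, end, depth in segments:
    print(f"{chrom}\t{start}\t{end}\t{depth}")
```

The printed lines have the same layout as the records pasted into the custom track: a track line followed by one chromosome, start, end and value record per segment.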
[ETC: 10 mins]
Open the UCSC Genome browser (mouse: mm9)
Open the configuration of the antisense track and switch the Negate values box on
Analyze the antisense expression at the TSS of both genes
Goecks J, Nekrutenko A, Taylor J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. Galaxy: a web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. Galaxy: a platform for interactive large-scale genome analysis. Genome Research. 2005 Oct; 15(10):1451-5.