ChIPseq is a high-throughput method to report the binding regions of a biological feature (protein, histone modification) along the genome in a single experiment. Here we introduce the first stage of the computational analysis of these experiments: mapping the reads to a reference genome (mouse). First, using a small piece of an original ChIPseq raw data file, we will use Galaxy to map the reads and build custom tracks in BED and BedGraph format to graphically represent the mapped reads along the mouse genome. Keep in mind that, in a real case, mapping a ChIPseq sample usually takes several hours even on powerful workstations. Next, in a real scenario, we will open an article that publishes the results of multiple ChIPseq experiments and use NCBI GEO and SRA to explore the raw and the processed data.
TABLE OF CONTENTS:
Accessing ChIPseq data in GEO (I): general information and raw data
Accessing ChIPseq data in GEO (II): genome-wide profiles and peaks
[ETC: 25 mins]
Open Galaxy and login with your username:
Create a New History (+)
Download this FASTQ file to your computer [LINK]
Use the Upload icon in Galaxy to add this file to our history
(Type: fastq)
Rename this file in Galaxy as ChIPseq Raw data (FASTQ)
With the Eye icon, examine the content of this FASTQ raw data file
Identify the sequence of the reads and the total number of reads
Find the following read (use the search option of your browser):
@SRR391032.60 GAII02:5:1:17:1151
Copy the sequence of this ChIPseq read to the clipboard
Open a new tab for the UCSC Genome browser
Open the BLAT tool and paste the sequence of the read
Run BLAT on mouse (mm10) by pressing Submit
Check that the sequence in the genome viewer coincides with our read
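The manual inspection above (counting reads and locating one read by its identifier) can also be sketched in a few lines of Python. This is only an illustration, assuming the standard four-line FASTQ layout (header, sequence, separator, quality). The records in DEMO_FASTQ are invented and are not taken from the course file, and the function names read_fastq and find_read are hypothetical.

```python
import io

# Invented demo records; IDs and sequences are NOT taken from the course file.
DEMO_FASTQ = (
    "@demo.1 lane:1\n"
    "ACGTACGTAC\n"
    "+\n"
    "IIIIIIIIII\n"
    "@demo.2 lane:1\n"
    "TTGACCGATA\n"
    "+\n"
    "@IIIIHHHGG\n"  # a quality line may itself start with '@'
    "@demo.3 lane:1\n"
    "GGGCATTTAC\n"
    "+\n"
    "IIIIIIII##\n"
)


def read_fastq(handle):
    """Yield (header, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if header == "":  # end of file
            break
        header = header.rstrip("\n")
        if header == "":  # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError("unexpected FASTQ header: " + header)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def find_read(handle, read_id):
    """Return the sequence of the read whose ID (first header word) matches."""
    for header, sequence, _quality in read_fastq(handle):
        if header.split()[0] == read_id:
            return sequence
    return None


records = list(read_fastq(io.StringIO(DEMO_FASTQ)))
print("total reads:", len(records))
print("demo.2 sequence:", find_read(io.StringIO(DEMO_FASTQ), "demo.2"))
```

Reading records in groups of four lines, rather than searching for lines that begin with "@", avoids miscounting when a quality string happens to start with that character.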
Open the Genomic File Manipulation functions
Find the block of FASTQ Quality Control functions
Run the FastQC tool to obtain a report on the quality of these reads
Carefully examine the final FastQC report: parameters and results
Find the Genomic File Manipulation -> FASTA/FASTQ: FASTQ Groomer tool
Run it on our raw data file (job 1) to prepare the raw data for use in Galaxy
Rename the resulting file as ChIPseq Raw data (FASTQ) processed
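As background for the quality report and the grooming step: FASTQ quality characters encode Phred scores, and FASTQ Groomer standardizes files to the Sanger convention (ASCII offset 33). The sketch below shows how such scores can be decoded and how an older offset-64 string would be re-encoded. Whether the course file actually needs conversion is not stated here, so the example strings are invented and the function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError("quality string does not match offset %d" % offset)
    return scores


def phred64_to_phred33(quality):
    """Re-encode an offset-64 quality string using the Sanger offset 33."""
    return "".join(chr(score + 33) for score in phred_scores(quality, offset=64))


def mean_quality(quality, offset=33):
    """Average Phred score of one read."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


# Invented quality strings, for illustration only.
old_style = "hhBB"                      # offset 64: scores 40, 40, 2, 2
sanger = phred64_to_phred33(old_style)  # same scores, offset 33
print(sanger, phred_scores(sanger), mean_quality(sanger))
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 40 means roughly one expected error in 10,000 base calls.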
Open the Genomics Analysis -> Mapping block of functions in Galaxy
Open the Bowtie2 mapping service (reference genome: mm10)
Run the mapping on the ChIPseq Raw data (FASTQ) processed file:
Rename the output to "BAM reads"
Carefully analyze the output and the Bowtie2 mapping statistics
Confirm with UCSC BLAT that the mapping on mm10 is correct for three reads
Use the display at UCSC main link to see the reads mapped in mm10 (e.g. view the reads along a whole chromosome)
Let us use Galaxy to count how many reads were mapped to the forward or the reverse strand
Open the Genomic File Manipulation -> SAM/BAM
Run the SAMtools view function to convert our BAM mapped reads to SAM (output format)
Rename the output of the job to "SAM reads"
To count the number of reads of each class, open Join, Subtract and Group -> Group
(group by column 2, the SAM FLAG field; operation: count on column 2)
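The Group operation works because column 2 of a SAM record is the FLAG field, a bit field defined by the SAM specification: bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read. A minimal Python sketch of the same count follows, assuming single-end reads. The SAM lines are invented for illustration and strand_counts is a hypothetical helper name.

```python
from collections import Counter

UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: read aligned to the reverse strand

# Invented single-end SAM records (11 mandatory tab-separated fields).
DEMO_SAM = [
    "@SQ\tSN:chr19\tLN:61431566",
    "r1\t0\tchr19\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t16\tchr19\t250\t42\t5M\t*\t0\t0\tTTGAC\tIIIII",
    "r3\t16\tchr19\t300\t30\t5M\t*\t0\t0\tGGCAT\tIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tCCCCC\tIIIII",
]


def strand_counts(sam_lines):
    """Count forward, reverse and unmapped reads from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):  # skip blanks and header lines
            continue
        flag = int(line.split("\t")[1])       # column 2 is the FLAG
        if flag & UNMAPPED:
            counts["unmapped"] += 1
        elif flag & REVERSE:
            counts["reverse"] += 1
        else:
            counts["forward"] += 1
    return counts


print(dict(strand_counts(DEMO_SAM)))
```

Testing individual bits rather than comparing whole FLAG values keeps the count correct when other bits are also set, for example in paired-end data.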
Run the SAMtools flagstat service on the BAM file to see the BAM statistics of the mapping
Rename the result as Mapping statistics
Run the Generate pileup from BAM dataset tool to count the number of reads per position
Examine the content and the help to understand what a pileup representation is
Open the BED menu in the Genomic File Manipulation block
Open the bedtools Genome Coverage service to create a BedGraph profile
(input type: BAM, BAM reads)
Unset the option: Report regions with zero coverage
Use the display at UCSC main link to see the profile in mm10
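To make the pileup and BedGraph ideas concrete, the sketch below computes read depth from a handful of aligned intervals and reports it as BedGraph-style rows (chromosome, start, end, depth), omitting zero-coverage regions as in the option unset above. It is a simplified illustration, not the bedtools implementation: it takes invented (chrom, start, end) tuples in 0-based half-open coordinates instead of a BAM file, and the function name bedgraph is hypothetical.

```python
from collections import defaultdict


def bedgraph(intervals):
    """Return (chrom, start, end, depth) rows for regions with depth > 0."""
    # For each chromosome, record +1 where a read starts and -1 where it ends.
    events = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in intervals:
        events[chrom][start] += 1
        events[chrom][end] -= 1

    rows = []
    for chrom in sorted(events):
        depth = 0
        previous = None
        for position in sorted(events[chrom]):
            if depth > 0 and previous is not None:
                last = rows[-1] if rows else None
                if last and last[0] == chrom and last[2] == previous and last[3] == depth:
                    # Extend the previous row when the depth did not change.
                    rows[-1] = (chrom, last[1], position, depth)
                else:
                    rows.append((chrom, previous, position, depth))
            depth += events[chrom][position]
            previous = position
    return rows


# Invented read alignments on chr19, for illustration only.
demo_reads = [("chr19", 10, 20), ("chr19", 15, 25), ("chr19", 40, 45)]
for row in bedgraph(demo_reads):
    print("\t".join(str(field) for field in row))
```

In the demo, the two overlapping reads produce a depth of 2 between positions 15 and 20, and the uncovered gap between 25 and 40 yields no row.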
[ETC: 20 mins]
Open the following publication in PubMed:
Read the Summary of the article
Use the GEO DataSets link in the Related information panel to find this ChIPseq data on the NCBI GEO website
Examine this information: first, focus on the number of samples and the type of sequencing experiments
Next, click on the ChIP_H3K4me3_WT link (GSM2645495) to open this experiment
Carefully read the information about the experimental details
Study the bioinformatics protocol (genome assembly, mapping) in the Data processing section
Click on the SRA link SRX2875249 to access the raw data information at the SRA
Click on the run to obtain more information about the sample
Find the Download button on the top menu to download the raw data file
(caution: do not start the download; raw data files occupy several gigabytes)
Alternatively, search for the same experiment in the European Nucleotide Archive (ENA) to download the raw data
[ETC: 20 mins]
Open the NCBI GEO homepage
[http://www.ncbi.nlm.nih.gov/geo/]
We will resume our work on the GSM2645495 entry (ChIP_H3K4me3_WT)
Study the bioinformatics protocol (profiles and peak calling) in the Data processing section
Explore this entry to find the processed data in the Supplementary files section
Open the UCSC genome browser
Open the genome browser for Mus musculus (mm9)
Press the add custom tracks button
Upload the gzipped BedGraph (BG) and BED files for H3K4me3 in chr19
Go to chromosome 19 to visualize the data (profile and peaks)
Change the color of these tracks and activate the auto-scale mode
Study the occupancy profile of H3K4me3 with respect to genes
Use the UCSC table browser to see which genes are marked by H3K4me3
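Conceptually, finding the genes marked by H3K4me3 is an interval-overlap question: which annotated regions intersect at least one peak? The sketch below illustrates that idea with a simple pairwise comparison. It is not what the UCSC Table Browser runs internally. The peaks, promoter coordinates and gene names (geneA, geneB, geneC) are invented, and marked_genes is a hypothetical helper name.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two 0-based half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def marked_genes(peaks, promoters):
    """Return names of promoters that overlap at least one peak."""
    marked = []
    for gene, chrom, start, end in promoters:
        for peak_chrom, peak_start, peak_end in peaks:
            if peak_chrom == chrom and overlaps(start, end, peak_start, peak_end):
                marked.append(gene)
                break  # one overlapping peak is enough
    return marked


# Invented BED-like peaks and promoter regions, for illustration only.
demo_peaks = [("chr19", 1000, 1500), ("chr19", 9000, 9400)]
demo_promoters = [
    ("geneA", "chr19", 1200, 2200),   # overlaps the first peak
    ("geneB", "chr19", 5000, 6000),   # no peak nearby
    ("geneC", "chr19", 8500, 9000),   # ends exactly where the second peak starts
]
print(marked_genes(demo_peaks, demo_promoters))
```

With half-open coordinates, as used in BED files, a region that ends exactly where a peak begins does not count as overlapping, which is why geneC is not reported.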
G. Mas, E. Blanco, C. Ballare, M. Sanso, Y. Spill, D. Hu, Y. Aoi, F. Le Dily, A. Shilatifard, M. A. Marti-Renom and L. Di Croce. Promoter bivalency favors an open chromatin architecture in embryonic stem cells. Nature Genetics 50: 1452–1462 (2018).
Goecks J, Nekrutenko A, Taylor J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. Galaxy: a web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. Galaxy: a platform for interactive large-scale genome analysis. Genome Research. 2005 Oct; 15(10):1451-5.