ChIP-Seq

Analysis Step By Step using Galaxy

1 - Get your data into Galaxy

Get Data -> Princeton HTSEQ or Upload File

2 - If necessary, split dataset(s) using barcode splitter

NGS: QC and manipulation -> FASTX-Toolkit for FASTQ data -> Barcode Splitter

3 - Run FastQC on fastq file(s) for each sample/replicate. Look for low quality, bias in sequence, etc.

NGS: QC and manipulation -> Fastqc: Fastq QC

4 - Map reads to reference using Bowtie. Do this for both ChIP and Control/Input sample.

NGS: Mapping -> Map with Bowtie for Illumina

Include only uniquely mapping reads by setting:

Suppress all alignments for a read if more than n reportable alignments exist (-m): 1
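
As a rough illustration of what the -m 1 setting does, the sketch below drops every alignment of any read that has more than one reportable alignment. The tuple layout (read name, chromosome, position) and the function name keep_unique are made up for this example; Bowtie applies the rule internally while mapping.

```python
from collections import Counter

def keep_unique(alignments, m=1):
    """Drop all alignments of any read that has more than m alignments."""
    counts = Counter(read for read, _chrom, _pos in alignments)
    return [a for a in alignments if counts[a[0]] <= m]

# Toy alignments: (read name, chromosome, position)
alignments = [
    ("read1", "chr1", 100),
    ("read2", "chr1", 500),
    ("read2", "chr7", 900),  # read2 maps to two places, so it is suppressed
    ("read3", "chr2", 42),
]

unique = keep_unique(alignments)
print(unique)
```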

5 - Convert to BAM file (and add read group if not already done when mapping).

NGS: Picard -> Add or Replace Groups

6 - Combine technical replicates (optional)

NGS: Sam Tools -> Merge BAM Files

7 - Call Peaks

NGS: Peak Calling -> MACS14

- Select control and ChIP SAM/BAM files

- Effective genome size (size of sequenceable and mappable genome, minus repeats, etc.)

- Tag size: length of reads

- Band Width: used for model building, set to expected insert size

- p-value cutoff

- MFOLD - You will need to try various values; above 10 is recommended.

- Save wiggle creates file for visualization (may have issues in Galaxy browser at the moment)
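
For reference, the Galaxy form fields above correspond roughly to command line options of MACS 1.4. The sketch below only assembles an argument list and does not run anything. The flag names are given as I understand the macs14 command line, so check them against macs14 --help for your installed version. The file names, the experiment name, and the default values are placeholders for this example.

```python
def macs14_args(treatment, control, name, gsize="hs", tsize=36,
                bw=300, pvalue=1e-5, mfold="10,30", wig=False):
    """Build an argument list for macs14 (assumed flag names, see note above)."""
    args = [
        "macs14",
        "-t", treatment,    # ChIP sample
        "-c", control,      # control/input sample
        "-n", name,         # prefix for output files
        "-g", str(gsize),   # effective genome size
        "-s", str(tsize),   # tag size (read length)
        "--bw", str(bw),    # band width, set to expected insert size
        "-p", str(pvalue),  # p-value cutoff
        "-m", mfold,        # MFOLD range used for model building
    ]
    if wig:
        args.append("-w")   # save wiggle files for visualization
    return args

# Hypothetical file names, for illustration only
args = macs14_args("chip.bam", "input.bam", "my_experiment", wig=True)
print(" ".join(args))
```

The list can be passed to subprocess.run(args, check=True) if MACS is installed locally.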

8 - Intersect Biological Replicates (optional, see discussion below)

Combine replicates into one file for each sample

Operate on Genomic Intervals -> Concatenate

Combine peaks for each sample

Operate on Genomic Intervals -> Cluster

max distance between intervals: 1

min number of intervals per cluster: 2
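
The Concatenate and Cluster steps above can be sketched in a few lines of Python. This is a simplified stand-in for the Galaxy tools, not their actual implementation: it assumes BED-style (chrom, start, end) tuples, treats "max distance" as the gap between the end of the current cluster and the start of the next interval, and counts intervals rather than distinct replicates.

```python
def cluster(intervals, max_dist=1, min_count=2):
    """Merge intervals lying within max_dist of each other and keep
    only clusters built from at least min_count intervals."""
    clusters = []
    for chrom, start, end in sorted(intervals):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start - last[2] <= max_dist:
            last[2] = max(last[2], end)  # extend the current cluster
            last[3] += 1                 # count the member interval
        else:
            clusters.append([chrom, start, end, 1])
    return [(c, s, e) for c, s, e, n in clusters if n >= min_count]

# Toy peak lists from two biological replicates
rep1 = [("chr1", 100, 200), ("chr1", 1000, 1100)]
rep2 = [("chr1", 150, 260), ("chr2", 50, 90)]

# "Concatenate" is list concatenation; "Cluster" keeps peaks seen twice
shared = cluster(rep1 + rep2)
print(shared)
```

Because clusters are counted by interval, two overlapping peaks from the same replicate would also pass the filter; tracking a replicate label per interval avoids that.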

9 - Get unique or common peaks in two samples

Shared Sites: Use Concatenate and Cluster operations as for biological replicates above.

Unique Site: Operate on Genomic Intervals -> Subtract

Return "Intervals with no overlap"

where minimal overlap is: 1
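
A minimal sketch of the Subtract step with "Intervals with no overlap", assuming BED-style (chrom, start, end) tuples with half-open coordinates. The function names are invented for this example, and the quadratic scan is only suitable for small peak lists.

```python
def overlaps(a, b, min_overlap=1):
    """True if a and b share at least min_overlap bases on the same chromosome."""
    return a[0] == b[0] and min(a[2], b[2]) - max(a[1], b[1]) >= min_overlap

def subtract_no_overlap(first, second, min_overlap=1):
    """Return intervals of `first` that overlap nothing in `second`."""
    return [a for a in first
            if not any(overlaps(a, b, min_overlap) for b in second)]

sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 180, 250)]

unique_a = subtract_no_overlap(sample_a, sample_b)
print(unique_a)
```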

10 - Annotate Peaks

Get annotations: Get Data -> UCSC Main

Get promoter regions: Operate on Genomic Intervals -> Get Flanks

Find peaks that overlap with annotations: Operate on Genomic Intervals -> Intersect
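
The Get Flanks and Intersect steps can be illustrated as follows. This is a sketch under stated assumptions: genes are (chrom, start, end, strand) tuples, the promoter is taken to be the 1000 bp upstream of the gene start (an arbitrary choice for this example), and the gene and peak coordinates are invented.

```python
def upstream_flank(gene, length=1000):
    """Return the region of `length` bases upstream of a gene, strand-aware."""
    chrom, start, end, strand = gene
    if strand == "+":
        return (chrom, max(0, start - length), start)
    return (chrom, end, end + length)  # upstream of a minus-strand gene

def intersecting(peaks, regions):
    """Return peaks overlapping any region by at least one base."""
    return [p for p in peaks
            if any(p[0] == r[0] and p[1] < r[2] and r[1] < p[2]
                   for r in regions)]

genes = [("chr1", 5000, 9000, "+"), ("chr1", 20000, 24000, "-")]
promoters = [upstream_flank(g) for g in genes]

peaks = [("chr1", 4500, 4700), ("chr1", 12000, 12200), ("chr1", 24100, 24300)]
promoter_peaks = intersecting(peaks, promoters)
print(promoter_peaks)
```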

Other useful tools available in BedTools section

Other tutorials and resources online

Things to Consider

Library Prep / Sequencing Issues

  • Paired end or single end - Single end is often sufficient, but paired end allows precise determination of fragment size, thus potentially providing better resolution of peaks.
  • Sequencing depth - To determine if depth is sufficient empirically, subsample your Fastq files (e.g. 20%, 40%, 60%, etc.) and run the analysis. Plot number of peaks for successively larger subsamples of the data. If you see a plateau, you have sufficient depth of coverage.
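
One way to prepare the subsamples for the depth check described above is to shuffle the reads once and take successively larger prefixes, so that each subsample contains the previous one. The sketch below assumes reads already parsed into 4-line FASTQ records; the function name and the toy records are illustrative. Counting peaks still requires running the peak caller on each subsample.

```python
import random

def nested_subsamples(records, fractions, seed=0):
    """Shuffle once, then return a prefix of the shuffled list per fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    return {f: shuffled[:round(len(shuffled) * f)] for f in fractions}

# Toy FASTQ records: (header, sequence, separator, qualities)
records = [(f"@read{i}", "ACGT", "+", "IIII") for i in range(10)]

fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
subsamples = nested_subsamples(records, fractions)
sizes = [len(subsamples[f]) for f in fractions]
print(sizes)
```

Run the full analysis on each subsample and plot the number of peaks against the fraction; a curve that flattens suggests sufficient depth.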

Replicates

  • Technical replicates should be run through QC separately. If they all look OK, then you can concatenate the data (preferably combine BAM files with read groups) unless you are interested in measuring technical variation.
  • Biological replicates should probably be processed individually (as long as you have deep enough coverage, see above). You can then use simple intersection of the peaks to increase your confidence in the results. You could certainly get more sophisticated (estimating variance, etc.) but this will take some additional work (look into RNA-seq methods?). Beware that the "weakest" replicate could dominate the analysis if you require a peak to be found in all replicates.

Analysis Issues

  • Duplicate reads - Many people discard duplicate reads (reads mapping to the same coordinate on the same strand), assuming they are PCR duplicates that would contribute to false positives. This is the default behavior for MACS. However, it is possible, especially with deep coverage, to have "real" duplicates, and discarding them could remove useful data.
    • The --keep-dup option in MACS (not currently available via Galaxy) controls how MACS treats duplicate tags at the exact same location (the same coordinate and the same strand). With the 'auto' option, MACS calculates the maximum number of tags allowed at the exact same location based on a binomial distribution, using 1e-5 as the p-value cutoff; the 'all' option keeps every tag. If an integer is given, at most that number of tags will be kept at the same location. Default: 1.
  • Low quality mapping - Reads with low mapping qualities may be discarded prior to peak detection. Many aligners (e.g. BWA) will assign a mapping quality of zero (0) to reads that map to multiple places, so removing low quality mappings will remove these. Controlling the parameters of the alignment software will also have an effect on what reads are aligned and how many mismatches and gaps are allowed.
  • Reads Mapping to Multiple Locations - Reads mapping to multiple locations are handled in various ways by alignment software, and care must be taken to interpret such mappings appropriately. Default settings for BWA and Bowtie will assign a read that maps to multiple locations to one location randomly and assign a mapping quality of zero. There are options, however, to not map these at all or map them to all (or up to some maximum number) of positions.
  • Parameterization - If you use MACS, make sure you run it multiple times using different min_fold parameters (and sometimes you must tune max_fold as well). Some people also use multiple tools and combine the results (typically intersection of peaks).
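
The duplicate and mapping quality filters discussed above can be sketched together. This is not how MACS or any aligner implements them. It assumes reads are already parsed into (name, chrom, position, strand, mapping quality) tuples, and keep_dup here mimics only the integer form of --keep-dup, not 'auto'.

```python
from collections import defaultdict

def filter_reads(reads, min_mapq=1, keep_dup=1):
    """Drop reads below min_mapq, then keep at most keep_dup reads
    per (chrom, position, strand)."""
    seen = defaultdict(int)
    kept = []
    for name, chrom, pos, strand, mapq in reads:
        if mapq < min_mapq:
            continue  # e.g. multi-mapped reads given quality 0
        key = (chrom, pos, strand)
        seen[key] += 1
        if seen[key] <= keep_dup:
            kept.append((name, chrom, pos, strand, mapq))
    return kept

reads = [
    ("r1", "chr1", 100, "+", 30),
    ("r2", "chr1", 100, "+", 30),  # duplicate of r1
    ("r3", "chr1", 100, "-", 30),  # same position, opposite strand
    ("r4", "chr1", 250, "+", 0),   # mapping quality zero
    ("r5", "chr2", 75, "+", 12),
]

kept_names = [r[0] for r in filter_reads(reads)]
print(kept_names)
```

Raising keep_dup to 2 retains r2 as well, which shows how the setting trades PCR-duplicate removal against loss of genuine reads at deeply covered sites.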

Software Options

  • MACS - Commonly used, available in Galaxy. May be best suited for transcription factor binding sites (narrow peaks).
  • CCAT - Also available in Galaxy, sometimes recommended for histone mark ChIP-Seq (wide peaks).
  • SICER - Clustering approach targeted at histone data (wide peaks). Not currently available in Galaxy, but could be added relatively easily if requested.
  • PeakSeq - Generally well reviewed peak detection software.
  • ChIPseqR - Software package for R, designed to detect nucleosome positions or histone modifications which typically have larger binding domains than transcription factors.
    • Humburg, P., Helliwell, C. A., Bulger, D. & Stone, G. ChIPseqR: analysis of ChIP-seq experiments. BMC Bioinformatics 12, 39 (2011).
  • CisGenome - Windows software, requires mapped files (in BED format), handles replicates.
  • HTSeq - Python package and scripts for working with high-throughput sequencing data, including FASTQ, GFF/GTF, and VCF files.

Comparisons of various software approaches

Validation

  • ChIP-qPCR should be used to validate results.