Projects‎ > ‎

(2012) ENCODE: TF ChIP-seq peak calling using the Irreproducibility Discovery Rate (IDR) framework


Last Updated: Jul 7, 2013

Mailing List

Please join the IDR mailing list https://groups.google.com/group/idr-discuss for FAQs, discussions and updates on software.

Summary

Reproducibility is essential to reliable scientific discovery in high-throughput experiments. The IDR (Irreproducible Discovery Rate) framework is a unified approach to measure the reproducibility of findings identified from replicate experiments and provide highly stable thresholds based on reproducibility. Unlike the usual scalar measures of reproducibility, the IDR approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. In layman's terms, the IDR method compares a pair of ranked lists of identifications (such as ChIP-seq peaks). These ranked lists should not be pre-thresholded i.e. they should provide identifications across the entire spectrum of high confidence/enrichment (signal) and low confidence/enrichment (noise). The IDR method then fits the bivariate rank distributions over the replicates in order to separate signal from noise based on a defined confidence of rank consistency and reproducibility of identifications i.e the IDR threshold.

The method was developed by Qunhua Li and Peter Bickel's group and is extensively used by the ENCODE and modENCODE projects and is part of their ChIP-seq guidelines and standards. A publication describing the statistical details of the IDR framework is http://www.stat.washington.edu/qli/IDR-AOAS.pdf or http://projecteuclid.org/euclid.aoas/1318514284

Below we provide an automated pipeline around the IDR framework with various tweaks to systematically analyze ChIP-seq data. One important addition is the use of the IDR framework for assessing the self-consistency of a single dataset and for "rescuing" datasets (pooled-consistency) when replicates are not very comparable in terms of data quality (i.e. getting the maximum juice out of your datasets). This happens more often that you might imagine. It is not trivial to obtain "truly equivalent and comparable replicates". In the event, that one replicate is significantly better than the other, any evaluation or thresholding based on reproducibility will be lower bounded by the worst replicate. Ofcourse, the ideal scenario is to repeat your experiments until you get high quality comparable replicates. However, due to limitations of resources (financial or experimental material) this is often not accomplished. Hence from a practical stand point, the rescue strategy allows one to be less conservative and "smooth out" artificial differences between replicates.

A publication on this pipeline is in the works ...

NOTE: The current pipeline listed below is heavily tested and verified for human transcription factor ChIP-seq data and punctate chromatin marks (e.g. H3k9ac, H3k27ac, H3k4me3, H3k4me1 etc.) using the SPP peak caller.

We do NOT recommend using it as is for broad chromatin marks ChIP-seq. We are still developing a robust pipeline for broad chromatin marks (e.g. H3k9me3, H3k27me3, H3k36me3 etc.). The main issue is that most peak callers do not work well with broad marks and so we are switching to a generic genome-wide segmentation or windowing approach for broad marks. We are also testing a few callers such as RSEG and MACS2 which claim to be decent for broad marks.


Peak callers tested with IDR

  • SPP - Works out of the box
  • MACS1.4 - DO NOT use with IDR for reasons mentioned below
  • MACS2 - Works well with IDR with occasional problems of too many ties in ranks for low quality ChIP-seq data.
  • HOMER - I have not personally tested HOMER with IDR. But the HOMER developers have a detailed pipeline and code (in beta) for IDR analysis with HOMER at https://github.com/karmel/homer-idr 
  • PeakSeq - Run with modified PeakSeq parameters to obtain large number of peaks to input into IDR
  • HotSpot - Works well with IDR
  • MOSAiCS - Works well with IDR
  • GPS/GEM - Works well with IDR (need to use relaxed thresholds to call peaks)
  • Scripture - DO NOT use with IDR. Rank distributions are very strange and non-canonical
  • BELT - Works reasonably well with IDR. Occasionally has too many ties in the ranking measure.
See this presentation for a quick tour of comparative analysis of these peak callers. NOTE: The analyses in this presentation are NOT supposed to be treated as published results AND should not be used as a formal citation as an argument for one peak caller or the other. The analysis was performed by me (Anshul Kundaje) alongside the ENCODE Binding working group and was used as an empirical evaluation to test the behavior of different peak callers specifically wrt compatibility with IDR and the behavior of peak rank distributions. The analysis was vetted by all the groups involved in the analysis but should not be considered as an official verdict from ENCODE regarding efficacy of these peak callers.

Intuitive Explanation of IDR and IDR plots

For an intuitive description of the IDR method and a description of the various plots generated by the IDR package read this IDR-101 document compiled by Qunhua Li. A detailed explanation of the model and the underlying statistics is in the AoAS paper 


Code for IDR Analysis

IDR CODE: Download this Code archive
NOTE: This is different from the CRAN version of the IDR package. The CRAN package implements a slightly different underlying algorithm and is more general purpose for pairs of ranked lists. It has not yet been fully tested in the ChIP-seq setting and does not have the necessary wrappers to work directly with peak files. So I recommend using the above linked code even though it is a bit clunky. We will be releasing a unified version soon.

SPP PEAK CALLER CODE: http://code.google.com/p/phantompeakqualtools/ . Modified from the official version to add some additional functionality and an easy to use unified script. Please use the version of SPP specifically included in the phantompeakqualtools package.

MACSv2 PEAK CALLER CODE: The official version.

IDR CODE README

===========================
Qunhua Li and Anshul Kundaje (Oct,2010)
===========================
This set of programs are used for consistency analysis on peak calling results on multiple replicates of a dataset
================
DEPENDENCIES
================
unix, R version 2.9 or higher
================
FILES:
================
  • batch-consistency-analysis.r : for pairwise IDR analysis of replicates
  • batch-consistency-plot.r: for creating diagnostic and IDR plots
  • functions-all-clayton-12-13.r: helper function
  • genome_table.txt: This tab-delimited file MUST contain the size of each chromosome of the genome of the organism that the peak files are referring to
================
INPUT FILE FORMATS
================
(1) genome_table.txt
It contains two space delimited fields
Col1: chromosome name (These MUST match the chromosome names in the peak files)
Col2: chromosome size (in bp)

(1) Peak Files
Peak files MUST be in narrowPeak format (and unzipped ... the code currently doesnt handle gzipped peak files directly)

NarrowPeak files are in BED6+4 format. It consists of 10 tab-delimited columns

1.chrom	 string	 Name of the chromosome
2.chromStart	 int	 The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
3.chromEnd	 int	 The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the   feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
4.name	 string	 Name given to a region (preferably unique). Use '.' if no name is assigned
5.score	 int	 Indicates how dark the peak will be displayed in the browser (1-1000). If '0', the DCC will assign this based on signal value.         Ideally average signalValue per base spread between 100-1000.
6.strand	 char	 +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
7.signalValue	 float	 Measurement of overall (usually, average) enrichment for the region.
8.pValue	 float	 Measurement of statistical signficance (-log10). Use -1 if no pValue is assigned.
9.qValue	 float	 Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
10.peak	 int	 Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.

*NOTE*: the p-value and q-value columns MUST be in -log10() scale

The narrowPeak format has 3 columns that can be used to rank peaks 
(signal.value, p.value (-log_10) and q.value (-log_10)).
The peak summit column must have values relative to the start coordinate of the peaks.
You can use any of these columns but make sure that whichever measure you are using rank peaks is relatively continuous without too many ties.
e.g. For the SPP peak caller it is recommended to use signal.value column
e.g. For MACS2 it is recommended to use the p.value column
e.g. PeakRanger/PeakSeq peak caller has relatively continuous q.values without too many ties. So for PeakSeq it is better to use q.value
================
RUNNING INSTRUCTIONS
================

(1) batch-consistency-analysis.r

This is used to run pairwise consistency analysis on a pair of replicate peak files

----------------
GENERAL USAGE:
----------------
Rscript batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]

Typical usage for SPP peak caller peaks
Rscript batch-consistency-analysis.r [peakfile1] [peakfile2] -1 [outfile.prefix] 0 F signal.value

Typical usage for MACSv2 peak caller peaks
Rscript batch-consistency-analysis.r [peakfile1] [peakfile2] -1 [outfile.prefix] 0 F p.value

[peakfile1] and [peakfile2] are the peak calls for the pair of replicates in narrowPeak format. They must be uncompressed files.
e.g. /peaks/reps/chipSampleRep1_VS_controlSampleRep0.narrowPeak AND /peaks/reps/chipSampleRep2_VS_controlSampleRep0.narrowPeak

[peak.half.width]: Set this to -1 if you want to use the reported peak width in the peak files.
IMPORTANT: Currrently this parameter does not work properly so please pre-truncate your peaks if desired before feeding to IDR. Always set this parameter to -1.
[outfile.prefix] is a prefix that will be used to name the output data for this pair of replicates.
The prefix must also include the PATH to the directory where you want to store the output data.
e.g. /consistency/reps/chipSampleRep1_VS_chipSampleRep2

[min.overlap.ratio]: fractional bp overlap (ranges from 0 to 1) between peaks in replicates to be considered as overlapping peaks. 
Set to 0 if you want to allow overlap to be defined as >= 1 bp overlap.
If set to say 0.5 this would mean that atleast 50% of the peak in one replicate should be covered by a peak in the other replicate to count as an overlap.
IMPORTANT: This parameter has not been tested fully. It is recommended to set this to 0.
[is.broadpeak]: Is the peak file format narrowPeak or broadPeak. Set to F if it is narrowPeak/regionPeak or T if it is broadPeak. BroadPeak files do not contain Column 10.

[ranking.measure] is the ranking measure to use. It can take only one of the following values
signal.value (Col7 in narrowPeak file), p.value (Col8 in narrowPeak file) or q.value (Col9 in narrowPeak file)

OUTPUT:
The results will be written to the directory contained in [outfile.prefix]
a. The output from EM fitting: suffixed by -em.sav
b. The output for plotting empirical curves: suffixed by -uri.sav
	Note: 1 and 2 are objects that can be loaded back to R for plotting or other purposes (e.g. retrieve data)
c. The parameters estimated from EM and the log of consistency analysis, suffixed by -Rout.txt
d. The number of peaks that pass specific IDR thresholds for the pairwise analysis: suffixed by npeaks-aboveIDR.txt
e. The full set of peaks that overlap between the replicates with local and global IDR: suffixed by overlapped-peaks.txt


(2) batch-consistency-plot.r

This is used to plot the IDR plots and diagnostic plots for a single or multiple pairs of replicates.

----------------
GENERAL USAGE:
----------------
Rscript batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] ....

[n.pairs] is the number of pairs of replicates that you want to plot on the same plot
e.g. 1 or 3 or ...

[output.prefix] is a prefix that will be used to name output data from this analysis. 
NOT TO BE CONFUSED with [outfile.prefix] in batch-consistency-analysis.r
The prefix must also include the PATH to the directory where you want to store the output data.
e.g. /consistency/plots/chipSampleAllReps

[input.file.prefix 1, 2, 3 ...] are the [outfile.prefix] values used to name the output from pairwise analysis on all replicates
e.g. /consistency/reps/chipSampleRep1_VS_chipSampleRep2
     /consistency/reps/chipSampleRep1_VS_chipSampleRep3
     /consistency/reps/chipSampleRep2_VS_chipSampleRep3

OUTPUT:
1. summary consistency plots in .ps format: suffixed by -plot.ps
These plots are very informative about the quality and similarity of the replicates.

===================================================
GETTING NUMBER OF PEAKS THAT PASS AN IDR THRESHOLD
===================================================
For each pairwise analysis, we have a *overlapped-peaks.txt file

The last column (Column 11) of the overlapped-peaks.txt file has the global IDR score for each pair of overlapping peaks
To get the number of peaks that pass an IDR threshold of T (e.g. 0.01) you simply find the number of lines that have a global IDR score <= T

awk '$11 <= 0.01 {print $0}' [overlappedPeaksFileName] | wc -l

IDR PIPELINE

Most of these instructions are for use with the SPP peak caller. I have created a slightly modified version of the official SPP package (with some additional functionality and a pretty easy to use automated script for peak calling) that you can download here.

I have added code snippets for the MACSv2 peak caller as well where necessary in the README below. I highly recommend NOT using MACS1.4 with IDR due to 3 fundamental problems with MACS1.4.
  • If you relax MACS1.4 peak calling thresholds (e.g. set p-value threshold to 1e-3 instead of 1e-5), MACS will start calling very broad peaks. This is because the peak width detection is unfortunately tied to peak detection itself. Ideally, a peak caller should provide a setting where one can relax a threshold and simply obtain more peaks with lower enrichment without affecting the scores or peak-widths of higher enrichment peaks. Due to these artificially broad peaks in MACS, the definition of peak overlap becomes very fuzzy and it is difficult to reliably decide whether 2 regions that have some base pair overlap in two different datasets are truly representing the same "peak" or whether the overlap is simply an artifact of the broad peak widths. One way to get around this would have been to artificially truncate the peak widths to some fixed smaller widths (.e.g. +/- 500 bp around peak summits for punctate TFs). However, when one truncates peak widths it effectively changes their peak scores and so we cannot directly use the peak scores that are output by MACS.
  • The MACS1.4 p-value computation module does not use log-space computation. Hence, more than often for datasets with high signal-to-noise, the MACS p-values will saturate i.e. the top X peaks will all have the same exact p-value i.e. -log10(pvalue) of ~320 or so. This 'X' can often be quite large > 3K-5K peaks. These peaks all essentially have 'tied' scores and the IDR has no way of knowing how to rank these peaks. IDR deals with ties by randomly breaking the ties. So what happens, is the strongest peaks (due to the ties) appear to have lower rank correlation that the slightly weaker peaks (that do not have tied scores). This completely destroys the assumptions of the IDR model (which are infact appropriate i.e. stronger peaks should show higher consistency). The results from the IDR model are thus often completely opposite to what they should be.
  • A less severe problem is that MACS1.4 more than often does not accurately estimate the fragment length for shifting tags. There is a work-around for this that I mention below.
We recommend using the SPP peak caller with IDR since it has been thoroughly tested. If you do plan to use MACS, use MACSv2 which has fixed each of these 3 problems listed above. We would highly recommend NOT using MACS1.4 with IDR.

Ok so now lets start with the analysis pipeline:

CALL PEAKS ON INDIVIDUAL REPLICATES

- Lets say we have a particular TF ChIP-seq dataset. I will refer to it as chipSample from here on.

- Lets say we have 3 replicates (read alignment files) for chipSample named
	- chipSampleRep1.tagAlign.gz
	- chipSampleRep2.tagAlign.gz
	- chipSampleRep3.tagAlign.gz
NOTE: The tagAlign format is a simple BED format for 
If your alignment files are in BAM format, convert them to tagAlign using SAMtools and the bamToBed function from bedTools
samtools view -b -F 1548 -q 30 chipSampleRep1.bam | bamToBed -i stdin | awk 'BEGIN{FS="\t";OFS="\t"}{$4="N"; print $0}' | gzip -c > chipSampleRep1.tagAlign.gz
- Each of the ChIP replicates might be paired with its own input/control tagAlign file or a single tagAlign file.
	- controlSampleRep1.tagAlign.gz
	- controlSampleRep2.tagAlign.gz
	- controlSampleRep3.tagAlign.gz
	
- The first thing we need to do is pool all control datasets that correspond to all replicates of chipSample i.e. simply concatenate the control tagAlign files.
zcat controlSampleRep1.tagAlign.gz controlSampleRep2.tagAlign.gz controlSampleRep3.tagAlign.gz | gzip -c > controlSampleRep0.tagAlign.gz

- Now we are ready to call peaks on each of the original replicates (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=chipSampleRep1.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/reps -savr -savp -rf -out=/stats/phantomPeakStatsReps.tab
Rscript run_spp.R -c=chipSampleRep2.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/reps -savr -savp -rf -out=/stats/phantomPeakStatsReps.tab
Rscript run_spp.R -c=chipSampleRep3.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/reps -savr -savp -rf -out=/stats/phantomPeakStatsReps.tab
NOTE: If you have removed duplicates from your sample use run_spp_nodups.R instead of run_spp.R otherwise you will get errors
- This will generate 3 peak files in directory /peaks/reps/
	chipSampleRep1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep2_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep3_VS_controlSampleRep0.regionPeak.gz
These file are in the standard UCSC/ENCODE narrowPeak format which is an extended BED format.
- Strand cross-correlation based data quality measures for the 3 runs will be appended to file /stats/phantomPeakStats.tab
(See README.txt in the SPP peak calling package for details about the strand-cross-correlation based data quality measures files)
IMPORTANT: It is important to check the cross-correlation plot that is produced by SPP to make sure the estimated fragment length is appropriate. Often for weaker datasets or datasets with few binding sites, the estimated fragment length may be incorrectly estimated as read-length or a value close to read-length (due to a phantom peak in the cross-correlation profile). The code automatically tries to account for this. However, it can make mistakes. For most ChIP-seq datasets, it is good to use the exclusion parameter (-x) with run_spp.R which will help the code skip any cross-correlation peak in the exclusion zone (values that are impossible as fragment lengths). I recommend that you use -x=-500:85 it is relatively unlikely that your fragment lengths will be in that range.
#MACS2: If you are using MACS2 then use the following commands
macs2 callpeak -t chipSampleRep1.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep1_VS_controlSampleRep0 -g hs -p 1e-3 --to-large
macs2 callpeak -t chipSampleRep2.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep2_VS_controlSampleRep0 -g hs -p 1e-3 --to-large
macs2 callpeak -t chipSampleRep3.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep3_VS_controlSampleRep0 -g hs -p 1e-3 --to-large
NOTE: You can skip --to-large if your ChIP and control samples (per replicate) have substantial number of reads (e.g. for human TF ChIP-seq atleast 15 M reads)
One robust alternative to estimate the --shiftsize parameter is using strand cross-correlation analysis. You can use a subroutine from the SPP peak caller package to estimate the predominant fragment length. Divide the estimated fragment length by half to get the shift-size parameter for MACS2. Then use the following MACS2 commands
macs2 callpeak -t chipSampleRep1.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep1_VS_controlSampleRep0 -g hs -p 1e-3 --to-large --nomodel --shiftsize <fraglen/2>
macs2 callpeak -t chipSampleRep2.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep2_VS_controlSampleRep0 -g hs -p 1e-3 --to-large --nomodel --shiftsize <fraglen/2>
macs2 callpeak -t chipSampleRep3.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep3_VS_controlSampleRep0 -g hs -p 1e-3 --to-large --shiftize <fraglen/2>
IMPORTANT: It is important to check the cross-correlation plot that is produced by SPP to make sure the estimated fragment length is appropriate. Often for weaker datasets or datasets with few binding sites, the estimated fragment length may be incorrectly estimated as read-length or a value close to read-length (due to a phantom peak in the cross-correlation profile). The code automatically tries to account for this. However, it can make mistakes. For most ChIP-seq datasets, it is good to use the exclusion parameter (-x) with run_spp.R which will help the code skip any cross-correlation peak in the exclusion zone (values that are impossible as fragment lengths). I recommend that you use -x=-500:85 it is relatively unlikely that your fragment lengths will be in that range.
#MACS2: MACS2 will output peaks in various formats (bed, xls and encodePeak). You want to use the *encodePeak files. These are in the standard UCSC/ENCODE narrowPeak format. Sort the *encodPeak files from best to worst using the -log10(pvalue) column i.e. column 8. Then truncate the number of peaks to the top 100k-125k. Using more than this simply increases the running time of the IDR analysis with no advantage. Infact using more peaks with MACS2 can cause problems with the IDR model because MACS2 seems to produce strange highly correlated peak scores for very weak and noisy detections. This can confuse the IDR model.
sort -k 8nr,8nr chipSampleRep1_VS_controlSampleRep0_peaks.encodePeak | head -n 100000 | gzip -c > chipSampleRep1_VS_controlSampleRep0.regionPeak.gz
sort -k 8nr,8nr chipSampleRep2_VS_controlSampleRep0_peaks.encodePeak | head -n 100000 | gzip -c > chipSampleRep1_VS_controlSampleRep0.regionPeak.gz
sort -k 8nr,8nr chipSampleRep2_VS_controlSampleRep0_peaks.encodePeak | head -n 100000 | gzip -c > chipSampleRep1_VS_controlSampleRep0.regionPeak.gz

CALL PEAKS ON POOLED REPLICATES

- Now pool all ChIP replicates
zcat chipSampleRep1.tagAlign.gz chipSampleRep2.tagAlign.gz chipSampleRep3.tagAlign.gz | gzip -c > chipSampleRep0.tagAlign.gz
	
- Call peaks on the pooled data (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=chipSampleRep0.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/pooled/ -savr -savp -rf

IMPORTANT: It is important to check the cross-correlation plot that is produced by SPP to make sure the estimated fragment length is appropriate. Often for weaker datasets or datasets with few binding sites, the estimated fragment length may be incorrectly estimated as read-length or a value close to read-length (due to a phantom peak in the cross-correlation profile). The code automatically tries to account for this. However, it can make mistakes. For most ChIP-seq datasets, it is good to use the exclusion parameter (-x) with run_spp.R which will help the code skip any cross-correlation peak in the exclusion zone (values that are impossible as fragment lengths). I recommend that you use -x=-500:85 it is relatively unlikely that your fragment lengths will be in that range.
- This will generate a peak file in directory /peaks/pooled
	chipSampleRep0_VS_controlSampleRep0.regionPeak.gz
#MACS2: Use analogous commands are for the individual original replicates (above).

FOR SELF-CONSISTENCY ANALYSIS CALL PEAKS ON PSEUDOREPLICATES OF INDIVIDUAL REPLICATES

For each replicate e.g. chipSampleRep1.tagAlign.gz
- Randomly split the mapped reads in the tagAlign file into 2 parts (pseudoReplicates) with approximately equal number of reads
	You can use the following code snippet to do this
	*NOTE*: It uses the unix 'shuf' command that is standard with most installations. Incase you dont have it, I have included a 64-bit binary in the SPP package.
	
fileName='chipSampleRep1.tagAlign.gz' # input tagAlign file name
outputDir='/mapped/selfPseudoReps' # output directory for pseudoreplicate files
outputStub='chipSampleRep1' # prefix name for pseudoReplicate files

nlines=$( zcat ${fileName} | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + 1) / 2 )) # half that number
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into two parts
gzip "${outputDir}/${outputStub}00"
gzip "${outputDir}/${outputStub}01"
mv "${outputDir}/${outputStub}00.gz" "${outputDir}/${outputStub}.pr1.tagAlign.gz"
mv "${outputDir}/${outputStub}01.gz" "${outputDir}/${outputStub}.pr2.tagAlign.gz"

- You will now have 2 self-pseudoReplicates in the directory named /mapped/selfPseudoReps/
	chipSampleRep1.pr1.tagAlign.gz
	chipSampleRep1.pr2.tagAlign.gz

- Now call peaks on each of these using the pooled control dataset as the control (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep1.pr1.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/selfPseudoReps -savr -savp -rf
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep1.pr2.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/selfPseudoReps -savr -savp -rf

IMPORTANT: It is important to check the cross-correlation plot that is produced by SPP to make sure the estimated fragment length is appropriate. Often for weaker datasets or datasets with few binding sites, the estimated fragment length may be incorrectly estimated as read-length or a value close to read-length (due to a phantom peak in the cross-correlation profile). The code automatically tries to account for this. However, it can make mistakes. For most ChIP-seq datasets, it is good to use the exclusion parameter (-x) with run_spp.R which will help the code skip any cross-correlation peak in the exclusion zone (values that are impossible as fragment lengths). I recommend that you use -x=-500:85 it is relatively unlikely that your fragment lengths will be in that range.
- This will create two peak files in the directory /peaks/selfPseudoReps chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak.gz chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak.gz
#MACS2: Use analogous commands are for the individual original replicates (above).

CREATE PSEUDOREPLICATES OF POOLED DATA AND CALL PEAKS

- Now, we create pseudoReplicates on the pooled data (same code as above)

fileName='chipSampleRep0.tagAlign.gz' # input tagAlign file name
outputDir='/mapped/pooledPseudoReps' # output directory for pseudoreplicate files
outputStub='chipSampleRep0' # prefix name for pseudoReplicate files

nlines=$( zcat ${fileName} | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + 1) / 2 )) # half that number
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into two parts
gzip "${outputDir}/${outputStub}00"
gzip "${outputDir}/${outputStub}01"
mv "${outputDir}/${outputStub}00.gz" "${outputDir}/${outputStub}.pr1.tagAlign.gz"
mv "${outputDir}/${outputStub}01.gz" "${outputDir}/${outputStub}.pr2.tagAlign.gz"

- You will now have 2 pooled-pseudoReplicates in the directory named /mapped/pooledPseudoReps/
	chipSampleRep0.pr1.tagAlign.gz
	chipSampleRep0.pr2.tagAlign.gz

- Now call peaks on each of these using the pooled control dataset as the control (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep0.pr1.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/pooledPseudoReps -savr -savp -rf
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep0.pr2.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/pooledPseudoReps -savr -savp -rf
IMPORTANT: It is important to check the cross-correlation plot that is produced by SPP to make sure the estimated fragment length is appropriate. Often for weaker datasets or datasets with few binding sites, the estimated fragment length may be incorrectly estimated as read-length or a value close to read-length (due to a phantom peak in the cross-correlation profile). The code automatically tries to account for this. However, it can make mistakes. For most ChIP-seq datasets, it is good to use the exclusion parameter (-x) with run_spp.R which will help the code skip any cross-correlation peak in the exclusion zone (values that are impossible as fragment lengths). I recommend that you use -x=-500:85 it is relatively unlikely that your fragment lengths will be in that range.
- This will create two peak files in the directory /peaks/pooledPseudoReps chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak.gz chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak.gz
#MACS2: Use analogous commands are for the individual original replicates (above).

INPUT TO IDR ANALYSIS

For IDR analysis, we always compare a pair of peak files
For dataset chipSample, we will have ended up with the following files

PEAKS ON ORIGINAL REPLICATES
	chipSampleRep1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep2_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep3_VS_controlSampleRep0.regionPeak.gz

PEAKS ON SELF-PSEUDOREPLICATES
	chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak.gz

	chipSampleRep2.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep2.pr2_VS_controlSampleRep0.regionPeak.gz

	chipSampleRep3.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep3.pr2_VS_controlSampleRep0.regionPeak.gz
	
PEAKS ON POOLED PSEUDOREPLICATES
	chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak.gz

IDR ANALYSIS ON ORIGINAL REPLICATES

We will perform the following pairwise analyses 
/peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak.gz AND /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak.gz
/peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak.gz AND /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak.gz
/peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak.gz AND /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak.gz

First gunzip all the peak files

Then run the IDR analysis on all pairs. For SPP we use signal.value as the ranking measure.

Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep1_VS_chipSampleRep2 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep1_VS_chipSampleRep3 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep2_VS_chipSampleRep3 0 F signal.value
#MACS2: If you are using peaks based on the MACS2 peak caller, then use p.value as the ranking measure.
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep1_VS_chipSampleRep2 0 F p.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep1_VS_chipSampleRep3 0 F p.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep2_VS_chipSampleRep3 0 F p.value
You can plot the IDR plots using
Rscript batch-consistency-plot.r 3 /consistency/reps/chipSampleAllReps /consistency/reps/chipSampleRep1_VS_chipSampleRep2 /consistency/reps/chipSampleRep1_VS_chipSampleRep3 /consistency/reps/chipSampleRep2_VS_chipSampleRep3

IDR ANALYSIS ON SELF-PSEUDOREPLICATES

We will perform the following pairwise analyses 
/peaks/selfPseudoReps/chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/selfPseudoReps/chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak.gz
/peaks/selfPseudoReps/chipSampleRep2.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/selfPseudoReps/chipSampleRep2.pr2_VS_controlSampleRep0.regionPeak.gz
/peaks/selfPseudoReps/chipSampleRep3.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/selfPseudoReps/chipSampleRep3.pr2_VS_controlSampleRep0.regionPeak.gz

First gunzip all the peak files

Then run the IDR analysis on all pairs

Rscript batch-consistency-analysis.r /peaks/selfPseudoReps/chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak /peaks/selfPseudoReps/chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/selfPseudoReps/chipSampleRep1.pr1_VS_chipSampleRep1.pr2 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/selfPseudoReps/chipSampleRep2.pr1_VS_controlSampleRep0.regionPeak /peaks/selfPseudoReps/chipSampleRep2.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/selfPseudoReps/chipSampleRep2.pr1_VS_chipSampleRep2.pr2 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/selfPseudoReps/chipSampleRep3.pr1_VS_controlSampleRep0.regionPeak /peaks/selfPseudoReps/chipSampleRep3.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/selfPseudoReps/chipSampleRep3.pr1_VS_chipSampleRep3.pr2 0 F signal.value

#MACS2: If you are using peaks based on the MACS2 peak caller, then use p.value as the ranking measure.
You can plot the IDR plots using Rscript batch-consistency-plot.r 3 /consistency/selfPseudoReps/chipSampleAllSelfReps /consistency/selfPseudoReps/chipSampleRep1.pr1_VS_chipSampleRep1.pr2 /consistency/selfPseudoReps/chipSampleRep2.pr1_VS_chipSampleRep2.pr2 /consistency/selfPseudoReps/chipSampleRep3.pr1_VS_chipSampleRep3.pr2

IDR ANALYSIS ON POOLED-PSEUDOREPLICATES

We will perform the following pairwise analyses 
/peaks/pooledPseudoReps/chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/pooledPseudoReps/chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak.gz

First gunzip all the peak files

Then run the IDR analysis on all pairs

Rscript batch-consistency-analysis.r /peaks/pooledPseudoReps/chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak /peaks/pooledPseudoReps/chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/pooledPseudoReps/chipSampleRep0.pr1_VS_chipSampleRep0.pr2 0 F signal.value
#MACS2: If you are using peaks based on the MACS2 peak caller, then use p.value as the ranking measure.
You can plot the IDR plots using Rscript batch-consistency-plot.r 1 /consistency/pooledPseudoReps/chipSamplePooledReps /consistency/pooledPseudoReps/chipSampleRep0.pr1_VS_chipSampleRep0.pr2

GETTING THRESHOLDS TO TRUNCATE PEAK LISTS

The following IDR thresholds are recommended for the different types of IDR analyses.
For self-consistency and comparison of true replicates
  • If you started with ~150 to 300K relaxed pre-IDR peaks for large genomes (human/mouse), then threshold of 0.01 or 0.02 generally works well. 
  • If you started with < 100K pre-IDR peaks for large genomes (human/mouse), then threshold of 0.05 is more appropriate. This is because the IDR sees a smaller noise component and the IDR scores get weaker. This is typically for use with peak callers that are unable to be adjusted to call large number of peaks (eg. PeakSeq or QuEST)
  • For smaller genomes such as worm, if you start with ~15K to 40K peaks then once again IDR thresholds of 0.01 or 0.02 work well.
  • For self-consistency analysis of datasets with shallow sequencing depths, you can use an IDR threshold as relaxed as 0.1 if you start with < 100K pre-IDR peaks.
For pooled-consistency analysis
  • If you started with ~150 to 300K relaxed pre-IDR peaks for large genomes (human/mouse), then threshold of 0.0025 or 0.005 generally works well. We use a tighter threshold for pooled-consistency since pooling and subsampling equalizes the pseudo-replicates in terms of data quality. So we err on the side of caution and use more stringent thresholds. The equivalence between a pooled-consistency threshold of 0.0025 and original replicate consistency threshold of 0.01 was calibrated based on a gold-standard pair of high quality replicate datasets for the CTCF transcription factor in human.
  • You can also use 0.01 if you don't want to be very stringent.
  • If you started with < 100K pre-IDR peaks for large genomes (human/mouse) , then a threshold of 0.01 is more appropriate. This is because the IDR sees a smaller noise component and the IDR scores get weaker so we have to relax the thresholds. This is typically for use with peak callers that are unable to be adjusted to call large number of peaks (eg. PeakSeq or QuEST)
  • For smaller genomes such as worm, if you start with ~15K to 40K peaks then once again IDR thresholds of 0.01 work well.
Each of these comparisons will give us a certain number of peaks that pass their respective IDR thresholds. We will refer to them as follows
ORIGINAL REPLICATE THRESHOLD numPeaks_Rep1_Rep2=$( awk '$11 <= 0.01 {print $0}' /consistency/reps/chipSampleRep1_VS_chipSampleRep2-overlapped-peaks.txt | wc -l ) numPeaks_Rep1_Rep3=$( awk '$11 <= 0.01 {print $0}' /consistency/reps/chipSampleRep1_VS_chipSampleRep3-overlapped-peaks.txt | wc -l ) numPeaks_Rep2_Rep3=$( awk '$11 <= 0.01 {print $0}' /consistency/reps/chipSampleRep2_VS_chipSampleRep3-overlapped-peaks.txt | wc -l ) We pick the maximum value out of these 3. Lets call it 'max_numPeaks_Rep' SELF-CONSISTENCY THRESHOLDS numPeaks_Rep1_pr=$( awk '$11 <= 0.02 {print $0}' /consistency/selfPseudoReps/chipSampleRep1.pr1_VS_chipSampleRep1.pr2-overlapped-peaks.txt | wc -l ) numPeaks_Rep2_pr=$( awk '$11 <= 0.02 {print $0}' /consistency/selfPseudoReps/chipSampleRep2.pr1_VS_chipSampleRep2.pr2-overlapped-peaks.txt | wc -l ) numPeaks_Rep3_pr=$( awk '$11 <= 0.02 {print $0}' /consistency/selfPseudoReps/chipSampleRep3.pr1_VS_chipSampleRep3.pr2-overlapped-peaks.txt | wc -l ) These 3 numbers should be in a similar range. If any of these is substantially different from the others, it points that replicate being different in some way. POOLED-PSEUDOREPLICATE THRESHOLD numPeaks_Rep0=$( awk '$11 <= 0.0025 {print $0}' /consistency/pooledPseudoReps/chipSampleRep0.pr1_VS_chipSampleRep0.pr2-overlapped-peaks.txt | wc -l )

FLAGGING REPLICATES FOR LOW CONSISTENCY

Typically the self-consistency thresholds for all the original replicates should be reasonably similar (within a factor of 2).
Also, max_numPeaks_Rep (comparing original replicates) and numPeaks_Rep0 (comparing pooled-pseudoreplicates) should be similar (within a factor of 2).
We typically flag datasets that fail both the above mentioned criteria. We also flag datasets that fail either one of these criteria by a significant margin (i.e the above ratios are significantly >> 2).
You should refer back to the strand cross-correlation based quality metrics, sequencing depth differences and library complexity metrics for the replicates as this can further support data quality problems.

FINAL SET OF PEAK CALLS

You can now generate a conservative and an optimal final set of peak calls for the dataset chipSample For the conservative set - Sort the peaks in chipSampleRep0_VS_controlSampleRep0.regionPeak.gz by the signal.value (or p.value or q.value depending on what you used to rank the peaks in the IDR analysis) column (column 7 is signal.value) in descending order - Pick the top max_numPeaks_Rep peaks
zcat chipSampleRep0_VS_controlSampleRep0.regionPeak.gz | sort -k7nr,7nr | head -n [max_numPeaks_Rep] | gzip -c > spp.conservative.chipSampleRep0_VS_controlSampleRep0.regionPeak.gz

#MACS2: Sort peaks by p.value before selecting the top 'N' peaks
zcat chipSampleRep0_VS_controlSampleRep0.regionPeak.gz | sort -k8nr,8nr | head -n [max_numPeaks_Rep] | gzip -c > macs2.conservative.chipSampleRep0_VS_controlSampleRep0.regionPeak.gz

For the optimal set
- Sort the peaks in chipSampleRep0_VS_controlSampleRep0.regionPeak.gz by the signal.value (or p.value or q.value depending on what you used to rank the peaks in the IDR analysis) column (column 7 is signal.value) in descending order
- Select the maximum of max_numPeaks_Rep and numPeaks_Rep0 i.e. optThresh = max(max_numPeaks_Rep,numPeaks_Rep0)
- Pick the top 'optThresh' peaks
zcat chipSampleRep0_VS_controlSampleRep0.regionPeak.gz | sort -k7nr,7nr | head -n [optThresh] | gzip -c > spp.optimal.chipSampleRep0_VS_controlSampleRep0.regionPeak.gz
#MACS2: Sort peaks by p.value before selecting the top 'N' peaks
zcat chipSampleRep0_VS_controlSampleRep0.regionPeak.gz | sort -k8nr,8nr | head -n [optThresh] | gzip -c > macs2.optimal.chipSampleRep0_VS_controlSampleRep0.regionPeak.gz

UNIFORMLY PROCESSED ENCODE PEAK CALLS

The default peak calls at the UCSC ENCODE DCC are NOT uniformly processed. They are simply peak calls submitted by the individual ENCODE production groups that are generated using a variety of different peak callers and thresholding methods.

The ENCODE Analysis Working Group (AWG) regularly re-processes all the data uniformly using a high quality peak caller with IDR.

It is highly recommended that you use the uniformly processed datasets linked below. These are official ENCODE products. I just list them here for your convenience.

June 2012 Freeze TFBS peak calls (SPP+IDR)

690 datasets of transcription factor ChIP-seq peaks based on data from all five ENCODE TFBS ChIP-seq production groups from the project inception in 2007 through the ENCODE March 2012 data freeze. The track covers 161 unique regulatory factors (generic and sequence-specific factors), spanning 91 human cell types, some under various treatment conditions.


This track represents peak calls (regions of enrichment) generated by the ENCODE Analysis Working Group (AWG) using the uniform processing pipeline developed for the ENCODE Integrative Analysis effort and published in a set of coordinated papers in September 2012. Peak calls from that effort (based on datasets from the January 2011 ENCODE data freeze) are available at the ENCODE Analysis Data Hub. The new Uniform TFBS track at UCSC includes newer data, slightly modified processing methods, and improved metadata. Quality metrics are included in metadata, with detailed metrics in a quality spreadsheet linked to the track description. Browser users will see the uniform peaks first when using track search for TFBS, and this track is now the default track shown when the ENCODE TF Binding menu item is selected in the browser.