(2012) ENCODE: Wiggler - Normalized genome-wide signal coverage tracks of ENCODE data

Summary

As part of the ENCODE consortium Jan 2011 data release, we generated a comprehensive set of genome-wide normalized signal coverage tracks for various ENCODE data types. These processed datasets were used for all analyses presented in the Sept 2012 ENCODE papers.




Data Files (Jan 2011 Release)

BAM files:

The raw alignment files in BAM format (as submitted by the individual ENCODE production labs) can be downloaded from http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/
These raw files contain reads mapped using different alignment parameters (some files retain multi-mapping reads while others do not).

We uniformly processed all the alignment files to remove multi-mapping reads, using custom uniqueness maps of the hg19 genome. These filtered alignment files in tagAlign format (an extended BED format) are available here.
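
For illustration, here is a minimal Python sketch of this filtering step. It assumes the pysam library for BAM access and a hypothetical helper is_unique(chrom, pos, k) standing in for a lookup into the custom hg19 uniqueness map for read length k; the actual pipeline used its own implementation.

    import pysam

    def bam_to_filtered_tagalign(bam_path, out_path, is_unique):
        """Stream a BAM file and write uniquely mapping reads in tagAlign
        (extended BED) format: chrom, start, end, sequence (or Ns),
        score, strand."""
        with pysam.AlignmentFile(bam_path, "rb") as bam, open(out_path, "w") as out:
            for read in bam:
                if read.is_unmapped or read.is_secondary:
                    continue
                k = read.query_length
                chrom = read.reference_name
                start, end = read.reference_start, read.reference_end
                # For a k-long ungapped read on either strand, uniqueness can
                # be checked against the '+' strand map at the read's start
                # coordinate (see the strand relationship in the Methods).
                # is_unique is a hypothetical stand-in for the uniqueness map.
                if not is_unique(chrom, start, k):
                    continue
                strand = "-" if read.is_reverse else "+"
                out.write(f"{chrom}\t{start}\t{end}\t{'N' * k}\t1000\t{strand}\n")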

Signal Track files:

NOTE: The URLs listed below can take a few seconds (up to half a minute) to load, since they list a large number of bedGraph/bigWig files.

INDIVIDUAL DATASETS

Replicate datasets were combined.
All signal tracks in bigWig format can be downloaded or streamed to the UCSC browser from: http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/integration_data_jan2011/byDataType/signal/jan2011/bigwig/

LABS-COMBINED DATASETS

For a select set of data, datasets from multiple labs were combined. The datasets that were combined are listed in the Google spreadsheet below (see the sheet labeled LabCombinedDatasets).

Methods

The methods used to produce these tracks are explained in the ENCODE companion paper (code GRCP011):

Hoffman M, Ernst J, Wilder S, Kundaje A, Harris R, Libbrecht M, Giardine B, Bilmes J, Birney E, Hardison R, et al. Integrative annotation of chromatin elements from ENCODE data. (In review)

Code

All signal tracks were generated using the Wiggler software: http://code.google.com/p/align2rawsignal/
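
For orientation, an illustrative command line is shown below. The flag names are recalled from the align2rawsignal documentation and should be verified against the tool's own help output; all paths and parameter values are placeholders. It pools two replicate tagAlign files, each with an estimated fragment length of 300 bp, smooths with a 300 bp Tukey kernel, and writes a normalized bedGraph track.

    # Illustrative only; check align2rawsignal's help for exact flag names.
    align2rawsignal \
      -i=rep1.filt.tagAlign.gz -i=rep2.filt.tagAlign.gz \
      -l=300 -l=300 \
      -s=hg19/seq -u=hg19/umap \
      -k=tukey -w=300 \
      -of=bg -o=dataset.norm.bedgraph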

Parameters

- Read-shift parameters and smoothing bandwidths for all individual datasets (and replicates) are listed in the table shown below (see the sheet labeled FragmentLengths). Replicates have the same BAM filename prefix except for the "Rep[0-9]+" term.
- For a select set of data, we had datasets (and replicates) from multiple labs. These were all combined into a single track. The datasets that were combined are listed in the table shown below (see the sheet labeled LabCombinedDatasets).

[Embedded spreadsheet: signaltracks.estimated.fragmentLengths]



BRIEF DESCRIPTION OF METHOD

We used a uniform signal-processing pipeline to generate genome-wide normalized signal coverage tracks for the different types of ENCODE datasets. Different subsets of these tracks were used as input to the segmentation algorithms. We combined signal from multiple replicates of each experiment. Also, for a select subset of experiments that had equivalent datasets from multiple labs, signal was combined across all datasets. Read alignment files downloaded from the ENCODE portal (http://genome.ucsc.edu/ENCODE/downloads.html) were filtered to remove multi-mapping reads. For each of the replicates (datasets) that were combined to produce a unified signal track, we estimated a characteristic read-shift parameter from the data, and reads were shifted by this estimated value in the 5' to 3' direction. The effective fragment coverage was computed for each position in the genome by first computing the read-count coverage and then smoothly aggregating it (using kernel smoothing) in a pre-specified window around each position. The fragment coverage was then normalized for the total number of mapped reads over all replicates, the effective read-extension and smoothing lengths, the local mappability around each location, and the overall mappable size of the genome. The final signal values at each position are expressed as a fold-change over the expected signal from an equivalent uniform distribution of reads over all mappable locations in the genome. Signal values at unreliable locations (genomic positions surrounded by a large number of unmappable locations, and positions in assembly gaps) were explicitly represented as missing values. A complete description of this procedure is in the Supplementary Methods of [1].


DETAILED DESCRIPTION OF METHOD

The ENCODE datasets span five main data types across a variety of cell types and treatment conditions, as summarized in Supplementary Table 2. The complete list of datasets is available at http://genome.ucsc.edu/ENCODE/downloads.html. Each dataset is the result of at least two biological replicates. For a select subset of targets, multiple laboratories generated their own versions of the datasets. Laboratories generated datasets using a variety of protocols and a diversity of sequencing parameters (such as library size, number of mapped reads, and read lengths). Hence, we developed a uniform processing pipeline to generate genome-wide signal coverage tracks by pooling data from multiple replicate experiments and combining appropriate datasets across labs when available (https://sites.google.com/site/anshulkundaje/projects/wiggler). We implemented this pipeline using the Wiggler package (http://code.google.com/p/align2rawsignal). It accounts for the individual characteristics of and differences between specific datasets and data types.

First, we downloaded from the ENCODE Data Portal (http://genome.ucsc.edu/ENCODE/downloads.html) all Binary Alignment/Map (BAM) files containing reads mapped to the GRCh37 human genome assembly submitted by the ENCODE production groups. To eliminate artificial differences between datasets due to different mapping stringency parameters, we created custom unique-mappability tracks for the GRCh37 male and female genomes (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/referenceSequences) for read lengths ranging from 20 to 54. For a particular read length k, we labeled each coordinate on the ‘+’ strand of the genome as unique if the k-mer starting at that position and continuing 3' mapped uniquely to only that position with no mismatches. If a position x on the ‘+’ strand is labeled unique, it implies that the k-mer starting at position x + k − 1 on the ‘−’ strand is also unique. Hence, one can infer mappability values for any position on the ‘−’ strand directly from the ‘+’ strand mappability track. The mappability tracks are available for download at http://code.google.com/p/align2rawsignal. We used the mappability tracks to filter BAM files by discarding all reads that mapped to non-unique locations in the genome.
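
The strand relationship above can be made concrete with a short Python sketch. It assumes the ‘+’ strand track for one chromosome is held as a boolean NumPy array indexed by position, which is an assumed representation; the released tracks use their own binary format.

    import numpy as np

    def minus_strand_unique(plus_unique, k):
        """Given the boolean '+' strand uniqueness track for read length k,
        derive the '-' strand track: if position x is unique on '+', the
        k-mer starting at x + k - 1 on '-' is unique, i.e.
        minus[x + k - 1] = plus[x]."""
        minus = np.zeros_like(plus_unique)
        minus[k - 1:] = plus_unique[: len(plus_unique) - (k - 1)]
        return minus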

The sequenced reads in each dataset represent ends of target DNA fragments isolated in various ways depending on the data type. The experimental protocols result in a variety of characteristic strand-specific distributions of sequenced reads around target sites. For example, ChIP-seq datasets typically show mirror peaks of mapped reads on the ‘+’ and ‘−’ strands around binding sites of the target protein, separated by a characteristic distance equal to the average length of the immunoprecipitated DNA fragments. Therefore, in order to faithfully represent the target of interest and generate signal tracks that show peak signal at the target binding site rather than in flanking regions, it is important to account for this strand-specific read-shift. We used strand cross-correlation analysis to infer the predominant read-shifts for each replicate in each dataset. Kundaje et al. discuss read distribution characteristics of the different data types. The complete set of dataset-specific read-shift parameters is provided at https://sites.google.com/site/anshulkundaje/projects/wiggler. The Duke and University of Washington (UW) production groups used very different protocols for performing DNase-seq, and the read-shift characteristics of their datasets diverged greatly, so we did not pool datasets across the two labs, unlike for ChIP-seq.
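
A minimal sketch of this cross-correlation estimate follows, assuming per-base 5' read-start counts for each strand of one chromosome as NumPy arrays (the production analysis described in Kundaje et al. is more involved):

    import numpy as np

    def estimate_fragment_length(plus_counts, minus_counts, max_shift=400):
        """Scan offsets between '+' and '-' strand 5' read-start coverage
        and return the offset with the highest Pearson correlation. The
        best offset approximates the mean fragment length L, and reads
        are then shifted 3' by the read-shift L / 2."""
        best_shift, best_corr = 0, -np.inf
        for s in range(1, max_shift + 1):
            p, m = plus_counts[:-s], minus_counts[s:]
            if p.std() == 0 or m.std() == 0:
                continue  # degenerate slice; correlation undefined
            corr = np.corrcoef(p, m)[0, 1]
            if corr > best_corr:
                best_shift, best_corr = s, corr
        return best_shift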

Multiple filtered BAM datasets, Bi, corresponding to replicates (or similar experiments from multiple labs), along with their respective estimated fragment lengths (2 × read-shift), Li, were provided as input to the signal processing engine. For each dataset Bi, we performed the following procedure (a condensed code sketch of steps 1-6 appears after the list):

(1) We shifted read starts in the 3' direction by Li / 2.

(2) We computed shifted read-start coverage at each position on both DNA strands.

(3) For each genomic location x and strand s, we then computed a smooth weighted sum of read counts Fi,s(x) using a kernel of width w centered at x. We used different values of w for different data types (https://sites.google.com/site/anshulkundaje/projects/wiggler). For each data type, we selected w based on the maximum estimated fragment length for any dataset of that type (for any dataset i of data type j, we expect that wj ≥ Lij) and the general characteristics of the data type. For example, we set w = 300 for histone mark datasets because we expected that the sonication and size-selection protocols for the corresponding experiments would generate DNA fragments of size ≤ 300 bp. For all data types except nucleosome data, we used the Tukey window kernel, which has a central window of length cw (cw ≤ w) with weights equal to 1. The weight then tapers to 0 on either end following a cosine curve. We typically set cw = Li. This procedure of shifting reads and then aggregating over a window is equivalent to a smooth read extension of length w. Hence, the aggregate signal value at each position represents the approximate number of sequenced fragments that overlap that position. Using a common overall smoothing length for all datasets of a particular data type, while allowing for dataset-specific (and replicate-specific) read-shifts, provides equivalent and comparable resolution across all datasets of a data type while accounting for dataset-specific fragment length distributions. For nucleosome data, which requires finer smoothing to distinguish individual nucleosomes, we used a sharper tri-weight kernel with a window size of 60 bp around each position after shifting reads by the dataset-specific read-shift (typically 148 bp / 2 = 74 bp).

(4) We added together the unnormalized fragment counts from both strands at each position x (Fi(x) = Fi+(x) + Fi−(x)).

(5) Since we used only uniquely mapping reads from each dataset, we needed to distinguish mappable positions with zero signal for a particular dataset from positions that were simply not uniquely mappable (missing data). We also needed to normalize Fi(x) for the total number of mappable locations that contribute signal to any position x. Different datasets also have different library sizes. In order to compute signal on a common scale across all datasets and data types, we needed to normalize for differences in the number of replicates (or pooled datasets) and the total number of mapped reads. We accomplished these normalizations as follows:

(a) Using the binary unique-mappability track corresponding to the read length for Bi, we computed the local cumulative mappability Mi(x) for each position x, using the same read-shift and aggregation smoothing kernel window parameters. Mi(x) represents the effective number of positions in the local kernel window around x that can potentially contribute read counts based on mappability.

(b) For each Bi, we then computed the expected fragment counts at each position x if all the mapped reads were uniformly distributed over all uniquely mappable locations on both strands in the genome: Ei(x) = (Mi(x) × Ri) / Gi, where Ri is the total number of mapped reads in Bi and Gi is the total mappable genome size over both strands.

(c) We then added together the fragment counts from all Bi at each position x: F(x) = Σi Fi(x).

(d) We similarly added the expected fragment counts from all replicates/datasets at each position: E(x) = Σi Ei(x).

(e) We then computed the normalized signal S(x) at each position x as the fold-change of the observed fragment count over the expected fragment count: S(x) = F(x) / E(x). This signal scale is similar to the reads per kilobase per million mapped reads (RPKM) measure often used to represent normalized RNA-seq read counts.

(6) Finally, we discarded signal at positions that lie in regions of low, dispersed mappability, as these locations typically show mapping artifacts and because the signal normalization procedure can over-correct for the small number of locations within the smoothing window that contribute signal. Specifically, we discarded positions x for which E(x) ≤ 0.25 × max_x E(x), including positions lying in completely unmappable windows. We represent the signal at such locations as missing data, but assign a non-negative signal value to all other positions. Therefore, a position with signal value 0 is a reliably mappable location to which no reads happen to map in a particular dataset.
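
The following condensed Python sketch ties steps 1-6 together for a single chromosome. The per-dataset inputs (read-start count vectors, mappability vectors, fragment lengths, read totals) are assumed data structures chosen for illustration; the actual Wiggler package is a compiled MATLAB implementation that handles chromosome boundaries, memory, and output formats properly.

    import numpy as np

    def tukey_window(w, cw):
        """Tukey kernel of total width w: a central window of length cw
        with weights equal to 1, tapering to 0 at both ends along a
        cosine curve (step 3)."""
        taper = (w - cw) // 2
        win = np.ones(w)
        if taper > 0:
            ramp = 0.5 * (1.0 + np.cos(np.pi * (np.arange(taper) / taper - 1.0)))
            win[:taper] = ramp
            win[-taper:] = ramp[::-1]
        return win

    def normalized_signal(datasets, genome_mappable_size, w=300):
        """datasets: list of dicts, one per replicate/pooled dataset Bi,
        with keys 'plus'/'minus' (per-base 5' read-start counts on each
        strand), 'L' (estimated fragment length Li), 'umap' (binary
        mappability vector for Bi's read length) and 'R' (total mapped
        reads Ri). genome_mappable_size is Gi."""
        n = len(datasets[0]["plus"])
        F = np.zeros(n)  # observed smoothed fragment counts (steps 1-4)
        E = np.zeros(n)  # expected counts under uniform placement (step 5)
        for d in datasets:
            shift = d["L"] // 2
            kern = tukey_window(w, cw=d["L"]) if d["L"] < w else np.ones(w)
            # Steps 1-2: shift read starts 3' by L/2 on each strand
            # (np.roll used for brevity; real code would clip at the ends).
            shifted = np.roll(d["plus"], shift) + np.roll(d["minus"], -shift)
            # Steps 3-4: kernel-smooth and sum both strands (convolution is
            # linear, so summing strands before smoothing is equivalent).
            Fi = np.convolve(shifted, kern, mode="same")
            # Step 5a: local cumulative mappability Mi(x), same kernel.
            Mi = np.convolve(d["umap"].astype(float), kern, mode="same")
            F += Fi
            # Step 5b: Ei(x) = Mi(x) * Ri / Gi.
            E += Mi * d["R"] / genome_mappable_size
        # Steps 5c-5e: fold-change of observed over expected counts.
        S = F / np.maximum(E, 1e-12)
        # Step 6: mask unreliable low-mappability positions as missing.
        S[E <= 0.25 * E.max()] = np.nan
        return S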


REFERENCES

[1] Hoffman M, Ernst J, Wilder S, Kundaje A, Harris R, Libbrecht M, Giardine B, Bilmes J, Birney E, Hardison R, et al. Integrative annotation of chromatin elements from ENCODE data. (In review)
