
(2014) Roadmap Epigenomics Project: Chromatin States across 100s of human cell/tissue types

This webpage is now deprecated and has been replaced by the official supplementary website for the Roadmap paper, which hosts all the uniformly processed data and analysis products. Supplementary website: http://compbio.mit.edu/roadmap. The archived version is provided below for posterity only. The download URLs provided here are not guaranteed to be functional or up to date.



The NIH Roadmap Epigenomics Project has been generating high quality comprehensive epigenomic maps of several key histone modifications (ChIP-seq), chromatin accessibility (DNase-seq), DNA methylation (SBS, RRBS, mCRF) and mRNA expression (RNA-seq, exon arrays) across 100s of human cell and tissue types (primary cells and tissues, adult and fetal tissues, embryonic stem cells and derived cells). The ENCODE project has also been generating similar maps, largely in cell-lines (several of which are cancer cell lines). As part of the Analysis Working Group of the Roadmap Project, we have uniformly processed all datasets from the compendium, systematically annotated datasets with relevant metadata and QC information, and generated integrative chromatin state maps jointly across Roadmap and ENCODE samples.

IMPORTANT: When downloading the data please be gentle on our servers! Do not start 1000s of parallel download threads!


Metadata + QC spreadsheet for 127 Standardized unique epigenomes

The primary data compendium at http://www.genboree.org/EdaccData/Release-9/ contains all the primary alignment datasets mapped uniformly to the hg19 version of the human genome. However, the primary data often differ substantially in read length and sequencing depth, which does not allow easy integration without reprocessing. Also, there are often multiple datasets for each data type per cell/tissue type (replicates as well as datasets from multiple centers and in some cases multiple individuals).

For the integrative analysis paper, we have carefully and systematically merged, reprocessed and subsampled datasets to ultimately obtain 127 standardized/equalized epigenomes with specific epigenome IDs (EIDs, e.g. E001). We have also provided useful metadata about age, sex, anatomy etc. for each of the samples, as well as extensive quality control statistics to allow for more informed analyses. These are provided below. The epigenomes include some of the ENCODE cell-lines that have the minimal required set of chromatin marks (E114-E129 are the 16 ENCODE samples). These were processed similarly to the Roadmap datasets to allow for easy integration.

The "Core histone marks" include H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3 (and Input-DNA control) which are present for all 127 epigenomes.

A substantial subset but not all of these 127 epigenomes also have H3K27ac, H3K9ac and DNase-seq. 

A small set of epigenomes have several other histone mark datasets (we refer to these as "Additional marks"). RNA-seq expression data is available for a subset of epigenomes as is DNA methylation data (using various platforms). These are all listed in the metadata table below.

The spreadsheet below contains the mappings from the primary data files at http://www.genboree.org/EdaccData/Release-9/ to the standardized epigenome IDs and associated QC and metadata information.

jul2013.roadmapData.qc



Uniformly processed ChIP-seq and DNase-seq data

Uniformly processed Alignment files

Since read length varied significantly across datasets from different centers (36 bp, 50 bp, 76 bp and 100 bp), reads were filtered using a 36 bp mappability track to retain only reads that map to locations/strands at which the corresponding 36-mers are unique in the genome (no mismatches). Multiple datasets (replicates or datasets from multiple centers) corresponding to the same data type and cell type were then pooled and subsampled where necessary to equalize signal strength across all epigenomes.

The complete set of uniformly processed alignment files (36 bp mappability filtered, pooling appropriate replicates/samples, subsampled to a maximum of 30M reads for chromatin marks and 50M reads for DNase and Class1 additional histone marks) is available here. We recommend using this version of the alignment files for all standard analyses.
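The pooling and depth cap described above can be illustrated with a minimal sketch. The helper name `subsample_reads`, the fixed seed and the use of uniform random sampling without replacement are assumptions made for illustration; the procedure actually used for the compendium is not specified on this page and may differ.

```python
import random

def subsample_reads(reads, max_reads, seed=0):
    """Return at most max_reads reads, sampled uniformly without replacement.

    Hypothetical helper: uniform random sampling is an assumption, not a
    description of the original pipeline.
    """
    reads = list(reads)
    if len(reads) <= max_reads:
        # Datasets at or below the cap keep their original depth.
        return reads
    rng = random.Random(seed)
    # Sample indices and sort them so the original read order is preserved.
    keep = sorted(rng.sample(range(len(reads)), max_reads))
    return [reads[i] for i in keep]

# Pool two made-up replicates, then cap the pooled depth.
rep1 = [("chr1", 100, "+"), ("chr1", 250, "-"), ("chr2", 40, "+")]
rep2 = [("chr1", 900, "+"), ("chr3", 10, "-")]
pooled = rep1 + rep2
capped = subsample_reads(pooled, max_reads=4)
```

With real data the cap would be 30,000,000 or 50,000,000 reads, depending on the data type.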

The uniform alignment files (36 bp mappability filtered, pooling appropriate files, NO subsampling i.e. original seq. depth) are available here

The uniform alignment files for individual replicates/samples (36 bp mappability filtered, NO pooling, NO subsampling) are available here

Uniformly processed and normalized genome-wide signal/coverage tracks

We used the signal processing engine of the MACSv2 peak caller to generate input/control normalized genome-wide signal coverage tracks using the stdnames30M alignment files. We generated 2 types of tracks that use different statistics to represent signal.

-log10(P-value) of enrichment

The p-value is the Poisson p-value relative to background. For ChIP-seq, the background corresponds to the input-DNA control; for DNase, it is a uniform background.

http://www.broadinstitute.org/~anshul/projects/roadmap/signal/stdnames30M/macs2signal/pval/
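To make the -log10(P-value) statistic concrete, here is a minimal sketch of a Poisson enrichment score for a single position. The function name, the integer `observed` count and the upper-tail convention P(X >= observed) are assumptions for illustration; MACS2 computes its own local background and may use a different tail convention, so this will not reproduce the released tracks exactly.

```python
import math

def neg_log10_poisson_pvalue(observed, expected):
    """-log10 of P(X >= observed) for X ~ Poisson(expected).

    observed: integer read count at a position (assumption).
    expected: background rate, e.g. scaled control signal (assumption).
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= 0:
        return 0.0  # P(X >= 0) = 1
    if observed <= expected:
        # The p-value is large here, so 1 - (lower tail) is accurate enough.
        lower = sum(
            math.exp(i * math.log(expected) - expected - math.lgamma(i + 1))
            for i in range(observed)
        )
        p = max(1.0 - lower, 1e-300)
        return max(0.0, -math.log10(p))
    # observed > expected: sum the upper tail in log space to avoid underflow.
    log_first = (observed * math.log(expected) - expected
                 - math.lgamma(observed + 1))
    tail, term, k = 1.0, 1.0, observed
    while term > 1e-17 * tail:
        k += 1
        term *= expected / k  # ratio of consecutive Poisson terms
        tail += term
    return -(log_first + math.log(tail)) / math.log(10)

score = neg_log10_poisson_pvalue(10, 2.0)
```

Working in log space for the enriched case keeps the score finite even when the p-value itself would underflow to zero.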

Fold enrichment

For ChIP-seq, the background corresponds to the input-DNA control; for DNase, it is a uniform background.

NOTE: The p-value tracks have better signal-to-noise contrast. We recommend using these for analyses and visualization. A threshold of 2 can be used to filter noise in both the fold-change and -log10(p-value) tracks.
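As a small illustration of the noise threshold, the sketch below filters bedGraph-like intervals at a value of 2. The tuple layout, the function name `filter_signal` and the choice of ">=" rather than ">" are assumptions; the example values are made up.

```python
def filter_signal(intervals, threshold=2.0):
    """Keep (chrom, start, end, value) intervals whose value is >= threshold."""
    return [iv for iv in intervals if iv[3] >= threshold]

# Made-up signal values for three consecutive intervals.
signal = [
    ("chr1", 0, 200, 0.4),
    ("chr1", 200, 400, 2.0),
    ("chr1", 400, 600, 7.3),
]
kept = filter_signal(signal)

# On the -log10(p-value) scale, a threshold of 2 corresponds to p <= 0.01.
p_cutoff = 10 ** -2.0
```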

Code: The script for generating the signal tracks is at http://www.broadinstitute.org/~anshul/softwareRepo/macs2.signal.lsf.submitscript.sh . (Please note that this is the submit script specifically for our compute cluster. You will need to modify it to run it on your machine. However, it has all the core commands that you need to reproduce the results. We will release more polished code in the near future through a github repository. The fragment length (shift-size*2) parameter for MACS2 is estimated for each dataset by strand cross-correlation analysis using this package. The fragment length estimates for all datasets are in Column I in this spreadsheet.)
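The idea behind strand cross-correlation can be sketched as follows: shift the plus-strand read starts downstream and find the shift at which they best line up with the minus-strand read starts. This is a simplified illustration, not the package referred to above. It uses the raw (unnormalized) cross-correlation of per-position read-start counts, and the function name, the `max_shift` default and the toy coordinates are assumptions.

```python
def estimate_fragment_length(plus_starts, minus_starts, max_shift=300):
    """Return the shift (bp) that maximizes the raw cross-correlation
    between plus-strand and minus-strand read-start counts.

    Simplified sketch: no normalization, binning or smoothing.
    """
    minus_counts = {}
    for pos in minus_starts:
        minus_counts[pos] = minus_counts.get(pos, 0) + 1
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Sum over positions of plus_count(x) * minus_count(x + shift).
        score = sum(minus_counts.get(pos + shift, 0) for pos in plus_starts)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

# Toy data: every minus-strand start lies 150 bp downstream of a plus start.
plus = [100, 500, 900, 1300]
minus = [p + 150 for p in plus]
fragment_length = estimate_fragment_length(plus, minus)
shift_size = fragment_length // 2  # fragment length = shift-size * 2
```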

Uniformly processed peak calls

Histone ChIP-seq data

MACS2 was used to call peaks on all the standardized histone ChIP-seq data. Two types of peak calls are provided
(1) Narrow regions of contiguous enrichment
MACS2 was used in the default narrow peak mode with a p-value threshold of 0.01. This is not an optimized threshold for peak calls. Think of these as relatively relaxed lists of peaks. We are developing IDR-based methods applicable to histone marks, which will be used in the future to optimize the thresholds. These are narrow contiguous regions of enrichment. The files are in the standard narrowPeak format. Use the p-value column to rank peaks.
Narrow Peak calls for all standardized epigenomes can be downloaded here: http://www.broadinstitute.org/~anshul/projects/roadmap/peaks/stdnames30M/combrep/
Code: The script for generating the narrow peaks is here http://www.broadinstitute.org/~anshul/softwareRepo/macs2.peaks.lsf.submitscript.sh . (Please note that this is the submit script specifically for our compute cluster. You will need to modify it to run it on your machine. However, it has all the core commands that you need to reproduce the results. We will release more polished code in the near future through a github repository. The fragment length (shift-size*2) parameter for MACS2 is estimated for each dataset by strand cross-correlation analysis using this package. The fragment length estimates for all datasets are in Column I in this spreadsheet.)
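Ranking peaks by the p-value column can be sketched as below. The parser follows the standard 10-column narrowPeak layout, in which column 8 holds -log10(p-value); the class and function names and the example peaks are made up for illustration.

```python
from typing import NamedTuple

class NarrowPeak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float  # -log10(p-value)
    q_value: float  # -log10(q-value)
    summit: int     # summit offset from start

def parse_narrowpeak(line):
    """Parse one tab-delimited narrowPeak line."""
    f = line.rstrip("\n").split("\t")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))

def rank_by_pvalue(lines):
    """Return peaks sorted from most to least significant."""
    peaks = [parse_narrowpeak(line) for line in lines if line.strip()]
    return sorted(peaks, key=lambda p: p.p_value, reverse=True)

# Made-up example lines, not taken from the released files.
example = [
    "chr1\t100\t300\tpeak_a\t50\t.\t4.2\t6.5\t3.1\t80",
    "chr1\t900\t1200\tpeak_b\t120\t.\t9.8\t15.0\t11.2\t140",
    "chr2\t50\t260\tpeak_c\t30\t.\t2.9\t3.4\t1.0\t60",
]
ranked = rank_by_pvalue(example)
top_names = [p.name for p in ranked]
```

For a downloaded file, the lines could be read with `gzip.open(path, "rt")` and passed to `rank_by_pvalue` in the same way.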
(2) Broad domains of enrichment
MACS2 was used in broadpeak mode with a broadpeak p-value threshold of 0.1 and a narrowpeak threshold of 0.01. This is not an optimized threshold for peak calls. Think of these as relatively relaxed lists of peaks. We are developing IDR-based methods applicable to histone marks, which will be used in the future to optimize the thresholds. These are broad domains of enrichment. There are two types of files: (1) broadPeak.gz and (2) gappedPeak.gz. The broadPeak.gz files are simply domains passing the p-value 0.1 threshold. The gapped peaks are broad domains (passing p-value 0.1) that contain at least one narrow peak passing a p-value of 0.01.
Broad domain calls for all standardized epigenomes can be downloaded here: http://www.broadinstitute.org/~anshul/projects/roadmap/peaks/stdnames30M/broadPeak/
Code: The script for generating the broad peaks is at http://www.broadinstitute.org/~anshul/softwareRepo/macs2.broadpeaks.lsf.submitscript.sh . (Please note that this is the submit script specifically for our compute cluster. You will need to modify it to run it on your machine. However, it has all the core commands that you need to reproduce the results. We will release more polished code in the near future through a github repository. The fragment length (shift-size*2) parameter for MACS2 is estimated for each dataset by strand cross-correlation analysis using this package. The fragment length estimates for all datasets are in Column I in this spreadsheet.)
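The relationship between the two broad-domain file types can be illustrated with a short sketch that keeps only the broad domains containing at least one narrow peak. This is not how MACS2 builds its gappedPeak output internally; the function name, the interval tuples and the requirement of full containment (rather than any overlap) are assumptions for illustration.

```python
def domains_with_narrow_peak(broad, narrow):
    """Return the broad domains that fully contain at least one narrow peak.

    broad, narrow: lists of (chrom, start, end) with 0-based start and
    exclusive end, as in BED.
    """
    narrow_by_chrom = {}
    for chrom, start, end in narrow:
        narrow_by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in broad:
        peaks = narrow_by_chrom.get(chrom, [])
        if any(start <= ps and pe <= end for ps, pe in peaks):
            kept.append((chrom, start, end))
    return kept

# Made-up intervals: only the first broad domain contains a narrow peak.
broad_domains = [("chr1", 1000, 9000), ("chr1", 20000, 26000), ("chr2", 0, 5000)]
narrow_peaks = [("chr1", 1500, 1900), ("chr2", 4800, 5200)]
gapped_like = domains_with_narrow_peak(broad_domains, narrow_peaks)
```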

DNase-seq

(1) Narrow regions of contiguous chromatin accessibility
For DNase data, we called narrow peaks on all datasets using two different peak callers, MACS2 and HOTSPOT.
MACS2 was used with a p-value threshold of 0.01. 
The files are in the standard narrowPeak format
HOTSPOT (by Bob Thurman) was also used to call peaks at FDR of 1% and relaxed peak sets. These are in BED format.
- *DNase*hotspot.all.fdr0.01.pks.bed.gz: narrow Peaks in FDR 1% hotspots (i.e., FDR 1% peaks). 5th column score is peak tag density, 6th column score is z-score.
- *DNase*hotspot.all.pks.bed.gz: Genome-wide tag density peak calls. These are not restricted to hotspots, thresholded or unthresholded. Score column is peak tag density
Narrow Peak calls for all standardized epigenomes can be downloaded here: http://www.broadinstitute.org/~anshul/projects/roadmap/peaks/stdnames30M/combrep/
Code: The script for generating the MACS2 narrow peaks is here http://www.broadinstitute.org/~anshul/softwareRepo/macs2.peaks.lsf.submitscript.sh . (Please note that this is the submit script specifically for our compute cluster. You will need to modify it to run it on your machine. However, it has all the core commands that you need to reproduce the results. We will release more polished code in the near future through a github repository. The fragment length (shift-size*2) parameter for MACS2 is estimated for each dataset by strand cross-correlation analysis using this package. The fragment length estimates for all datasets are in Column I in this spreadsheet.)
(2) Broad domains of chromatin accessibility
The HOTSPOT peak caller was used to call broad domains (hotspots) of chromatin accessibility at FDR of 1% and relaxed thresholds.
- *DNase*hotspot.fdr0.01.broad.bed.gz FDR 1% hotspots. Score column is z-score
- *DNase*hotspot.broad.bed.gz Unthresholded (i.e., no FDR thresholding) hotspots. Score column is z-score
Broad hotspots for all standardized epigenomes can be downloaded here: http://www.broadinstitute.org/~anshul/projects/roadmap/peaks/stdnames30M/broadPeak/


Uniformly processed RNA-seq data

To be added

Uniformly processed DNA methylation data

To be added

Core Integrative chromatin state maps (127 Epigenomes)

NOTE: This is the primary segmentation we recommend to be used for most analyses.

The final segmentation results for all 127 epigenomes (111 Roadmap + 16 ENCODE) using the 5 core marks are now available here:

http://www.broadinstitute.org/~anshul/projects/roadmap/segmentations/models/coreMarks/parallel/set2/final/


Models ranging from 10 to 25 states were trained on data from 60 high quality epigenomes spanning as many diverse tissue types as possible. These are listed in the metadata spreadsheet above. A 15-state model was chosen as the best trade-off between model complexity and interpretability.

Annotation of the states with mnemonics and various overlap and neighborhood enrichments are shown in this figure (computed using auxiliary data from the H1 cell-line)


The states are as follows

| STATE NO. | MNEMONIC | DESCRIPTION | COLOR NAME | COLOR CODE |
|---|---|---|---|---|
| 1 | TssA | Active TSS | Red | 255,0,0 |
| 2 | TssAFlnk | Flanking Active TSS | Orange Red | 255,69,0 |
| 3 | TxFlnk | Transcr. at gene 5' and 3' | LimeGreen | 50,205,50 |
| 4 | Tx | Strong transcription | Green | 0,128,0 |
| 5 | TxWk | Weak transcription | DarkGreen | 0,100,0 |
| 6 | EnhG | Genic enhancers | GreenYellow | 194,225,5 |
| 7 | Enh | Enhancers | Yellow | 255,255,0 |
| 8 | ZNF/Rpts | ZNF genes & repeats | Medium Aquamarine | 102,205,170 |
| 9 | Het | Heterochromatin | PaleTurquoise | 138,145,208 |
| 10 | TssBiv | Bivalent/Poised TSS | IndianRed | 205,92,92 |
| 11 | BivFlnk | Flanking Bivalent TSS/Enh | DarkSalmon | 233,150,122 |
| 12 | EnhBiv | Bivalent Enhancer | DarkKhaki | 189,183,107 |
| 13 | ReprPC | Repressed PolyComb | Silver | 128,128,128 |
| 14 | ReprPCWk | Weak Repressed PolyComb | Gainsboro | 192,192,192 |
| 15 | Quies | Quiescent/Low | White | 255,255,255 |

The following files should be useful to most users:

(1) MNEMONICS BED FILES ( [Epigenome_id]_15_coreMarks_mnemonics.bed.gz files )
- Tab delimited 4 columns
- chromosome, start (0-based), stop (1-based), state_label_mnemonic for that region
You can download an archive containing all the mnemonics.bed files from 
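Because the start coordinate is 0-based and the stop coordinate is 1-based, the length of a region is simply stop minus start. The sketch below uses this to total the number of bases assigned to each state in a mnemonics BED file. The function name and the example lines are made up for illustration.

```python
from collections import Counter

def state_coverage(lines):
    """Total bases per state from 4-column mnemonics BED lines."""
    coverage = Counter()
    for line in lines:
        if not line.strip():
            continue
        chrom, start, stop, state = line.rstrip("\n").split("\t")[:4]
        # 0-based start with 1-based stop gives length = stop - start.
        coverage[state] += int(stop) - int(start)
    return coverage

# Made-up example lines in the 4-column layout described above.
example = [
    "chr1\t0\t9800\tQuies",
    "chr1\t9800\t10800\tTssA",
    "chr1\t10800\t11400\tTssAFlnk",
    "chr1\t11400\t12000\tTssA",
]
coverage = state_coverage(example)
```

For a downloaded `.bed.gz` file, the lines could be read with `gzip.open(path, "rt")`.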

(2) BROWSER FRIENDLY FILES
[Epigenome_id]_15_coreMarks_dense.bb
The dense BIGBED files will allow you to view each epigenome as a single track with regions labeled with state mnemonics and representative colors. You can stream these to UCSC Genome Browser or IGV
You can download an archive containing all the dense BIGBED files from 

[Epigenome_id]_15_coreMarks_dense.bed.gz (Same as above except in text format)
You can download an archive containing all the dense BED files from 

[Epigenome_id]_15_coreMarks_expanded.bed.gz files
The expanded files will allow you to view each epigenome with each state as a separate track labeled with state mnemonics and representative colors
You can download an archive containing all the expanded files from

(3) STATES FOR EACH 200bp BIN
Max. posterior state label for each 200 bp bin in each chromosome for all epigenomes. The difference from the Mnemonic BED files is that in the Mnemonic files contiguous bins with the same state label are merged and a label is assigned to the entire merged regions whereas these files are at a fixed 200 bp resolution.
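The relationship between the per-bin labels and the merged regions can be sketched as follows: consecutive bins with the same label are collapsed into one region. The function name and the example labels are illustrative, and the input is assumed to be the ordered list of labels for a single chromosome, starting at position 0.

```python
from itertools import groupby

def merge_bins(states, bin_size=200):
    """Collapse runs of identical per-bin labels into (start, stop, state)."""
    regions = []
    first_bin = 0
    for state, run in groupby(states):
        n_bins = sum(1 for _ in run)
        start = first_bin * bin_size            # 0-based start
        stop = (first_bin + n_bins) * bin_size  # 1-based stop
        regions.append((start, stop, state))
        first_bin += n_bins
    return regions

# Six 200 bp bins on one chromosome (made-up labels).
bins = ["Quies", "Quies", "TssA", "TssA", "TssA", "TssAFlnk"]
regions = merge_bins(bins)
```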

(4) POSTERIOR PROBABILITY FOR EACH 200bp BIN
Posterior probabilities of each state in each 200 bp bin for all chromosomes in all epigenomes
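The max-posterior label for a bin is the state with the highest posterior probability in that bin. A minimal sketch, assuming each bin is given as a list of probabilities ordered by state number (state 1 first); the function name and the toy 3-state values are made up, and ties are broken in favor of the lowest state number.

```python
def max_posterior_states(posteriors):
    """Return the 1-based state number with the highest posterior per bin."""
    labels = []
    for probs in posteriors:
        best_index = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best_index + 1)  # states are numbered from 1
    return labels

# Toy example with 3 states and 4 bins (real files have 15 or 18 states).
posteriors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.40, 0.40, 0.20],  # tie: the lower state number is chosen
]
labels = max_posterior_states(posteriors)
```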

(5) PDF figures of chromatin state maps for the whole genome split into 1MB chunks

Auxiliary Integrative chromatin state maps (98 Epigenomes)

The auxiliary segmentation results for 98 of the 127 epigenomes (82 Roadmap + 16 ENCODE) using the 5 core marks + H3K27ac are now available here


Models ranging from 10 to 25 states were trained on high quality data from 39 epigenomes that had all 6 marks. These are listed in the metadata spreadsheet above. An 18-state model was chosen as the best trade-off between model complexity and interpretability.

Annotation of the states with mnemonics and various overlap and neighborhood enrichments are shown in this figure (computed using auxiliary data from the H1 cell-line)

The states are as follows

| STATE NO. | MNEMONIC | DESCRIPTION | COLOR NAME | COLOR CODE |
|---|---|---|---|---|
| 1 | TssA | Active TSS | Red | 255,0,0 |
| 2 | TssFlnk | Flanking TSS | Orange Red | 255,69,0 |
| 3 | TssFlnkU | Flanking TSS Upstream | Orange Red | 255,69,0 |
| 4 | TssFlnkD | Flanking TSS Downstream | Orange Red | 255,69,0 |
| 5 | Tx | Strong transcription | Green | 0,128,0 |
| 6 | TxWk | Weak transcription | DarkGreen | 0,100,0 |
| 7 | EnhG1 | Genic enhancer 1 | GreenYellow | 194,225,5 |
| 8 | EnhG2 | Genic enhancer 2 | GreenYellow | 194,225,5 |
| 9 | EnhA1 | Active Enhancer 1 | Orange | 255,195,77 |
| 10 | EnhA2 | Active Enhancer 2 | Orange | 255,195,77 |
| 11 | EnhWk | Weak Enhancer | Yellow | 255,255,0 |
| 12 | ZNF/Rpts | ZNF genes & repeats | Medium Aquamarine | 102,205,170 |
| 13 | Het | Heterochromatin | PaleTurquoise | 138,145,208 |
| 14 | TssBiv | Bivalent/Poised TSS | IndianRed | 205,92,92 |
| 15 | EnhBiv | Bivalent Enhancer | DarkKhaki | 189,183,107 |
| 16 | ReprPC | Repressed PolyComb | Silver | 128,128,128 |
| 17 | ReprPCWk | Weak Repressed PolyComb | Gainsboro | 192,192,192 |
| 18 | Quies | Quiescent/Low | White | 255,255,255 |

The following files should be useful to most users:

(1) MNEMONICS BED FILES ( [Epigenome_id]_18_core_K27ac_mnemonics.bed.gz files )
- Tab delimited 4 columns
- chromosome, start (0-based), stop (1-based), state_label_mnemonic for that region
You can download an archive containing all the mnemonics.bed files from 

(2) BROWSER FRIENDLY FILES
[Epigenome_id]_18_core_K27ac_dense.bb
The dense BIGBED files will allow you to view each epigenome as a single track with regions labeled with state mnemonics and representative colors. You can stream these to UCSC Genome Browser or IGV
You can download an archive containing all the dense BIGBED files from 

[Epigenome_id]_18_core_K27ac_dense.bed.gz (Same as above except in text format)
You can download an archive containing all the dense BED files from 

[Epigenome_id]_18_core_K27ac_expanded.bed.gz files
The expanded files will allow you to view each epigenome with each state as a separate track labeled with state mnemonics and representative colors
You can download an archive containing all the expanded files from

(3) STATES FOR EACH 200bp BIN
Max. posterior state label for each 200 bp bin in each chromosome for all epigenomes. The difference from the Mnemonic BED files is that in the Mnemonic files contiguous bins with the same state label are merged and a label is assigned to the entire merged regions whereas these files are at a fixed 200 bp resolution.

(4) POSTERIOR PROBABILITY FOR EACH 200bp BIN
Posterior probabilities of each state in each 200 bp bin for all chromosomes in all epigenomes