[Note: This webpage is currently under construction and some links are not yet functional. A beta version of tcga_util is currently available. The download website is specified below.]
We believe the TCGA data set is useful enough and sufficiently complex to warrant its own tool for downloading, querying, pre-processing and managing it. tcga_util provides the following basic functionality:
basic summary reports of available TCGA data
download of TCGA clinical and experimental data into a local data store organized by cancer and data type
sample query tools for finding samples of interest based on clinical and mutational criteria; these let you create filtered data matrices, containing only the samples of interest, that may be easier to work with in downstream analysis tools such as R
ability to update the local data store with the latest TCGA data releases
version-tracking of downloaded TCGA data for analysis reproducibility
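To make the idea of a filtered data matrix concrete, here is a minimal sketch in Python 3. It is not tcga_util's implementation. It assumes a tab-delimited matrix whose first column holds row labels (for example gene names) and whose remaining column headers are sample identifiers; the function name filter_matrix_columns and the sample names are hypothetical.

```python
import csv
import io


def filter_matrix_columns(matrix_text, samples_of_interest):
    """Keep the row-label column plus the columns for the wanted samples.

    Assumed layout (not confirmed by tcga_util): tab-delimited text,
    first column = row labels, remaining headers = sample identifiers.
    """
    reader = csv.reader(io.StringIO(matrix_text), delimiter="\t")
    header = next(reader)
    wanted = set(samples_of_interest)
    # Column 0 is always kept; other columns are kept only if requested.
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name in wanted]
    rows = [[header[i] for i in keep]]
    for row in reader:
        rows.append([row[i] for i in keep])
    return rows


# Hypothetical toy matrix: two genes, three samples.
example = (
    "gene\tSAMPLE_A\tSAMPLE_B\tSAMPLE_C\n"
    "BRAF\t1.0\t2.0\t3.0\n"
    "KRAS\t4.0\t5.0\t6.0\n"
)
filtered = filter_matrix_columns(example, ["SAMPLE_A", "SAMPLE_C"])
for row in filtered:
    print("\t".join(row))
```

The smaller matrix keeps the original column order, so it can be written back out and loaded into R or another downstream tool without further reshaping.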
Several existing options for browsing, downloading and exploring TCGA data may better fulfill your needs. Direct TCGA data download through URLs and web forms is available through the NCI's TCGA Data Portal. A more user-friendly web-based alternative for downloading and exploring TCGA data (in addition to other cancer data sets) is the cBioPortal from Memorial Sloan Kettering, which includes visual tools for browsing and analyzing TCGA data.
Another option for bulk TCGA data download is through firehose_get which facilitates retrieval of open-access TCGA data that has been processed through the Broad GDAC Firehose. Firehose is a large-scale data analysis pipeline that automatically performs standard pre-processing of TCGA data, making the data more amenable to downstream analyses.
The main goal of tcga_util is to help users query, download and filter through analysis-ready TCGA data for use in downstream analyses, so tcga_util leverages firehose_get for the majority of its TCGA data retrieval. tcga_util most benefits users who wish to manage and manipulate matrices of TCGA data locally and is designed more specifically for use at the command-line, which more easily allows bioinformaticists to integrate TCGA data into their own repeatable analyses or custom applications and pipelines.
Please adhere to the TCGA publication guidelines when using TCGA data in your publications.
Prerequisites: install the following and make these accessible through your $PATH
Python 2.7 or greater
wget
Download and install tcga_util
Please register at http://cidd.houstonbioinformatics.org and download and unzip tcga_util_{version}.tgz.
Install tcga_util:
tar -xvzf tcga_util_{version}.tgz
cd tcga_util_{version}
sudo python setup.py install
tcga_util installs TCGA data locally into a data store. By default, this data store is created for you in your local directory when you download TCGA data. Alternatively, you can set an environment variable called $DATA_STORE to the full path of a data_store directory; this tells tcga_util where to find your default data store. As a third option, you can specify the location of your data store as a parameter to the tcga_util commands, which works best if you want to manage multiple data stores.
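The three ways of locating a data store can be pictured as a simple lookup order. The sketch below (Python 3) is an illustration only: the precedence shown (command parameter first, then $DATA_STORE, then the local directory), the function name resolve_data_store, and the default directory name data_store are assumptions, not documented tcga_util behavior.

```python
import os


def resolve_data_store(cli_path=None, environ=None, cwd=None):
    """Return the data store path under an ASSUMED precedence:
    explicit parameter > $DATA_STORE > ./data_store in the local directory.
    """
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    if cli_path:
        # A path given on the command line wins (assumption).
        return os.path.abspath(cli_path)
    env_path = environ.get("DATA_STORE")
    if env_path:
        return os.path.abspath(env_path)
    # Fall back to a directory named data_store (hypothetical name).
    return os.path.join(os.path.abspath(cwd), "data_store")


print(resolve_data_store())
```

Passing the environment and working directory as arguments keeps the lookup easy to test and makes it straightforward to switch between several data stores in one script.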
tcga_util desc -h
usage: tcga_util desc [-h] {cohorts,run_dates,data,analyses} ...
This describes data available from the TCGA data portal as well as the Broad
Institute's Firehose pipeline.
optional arguments:
  -h, --help      show this help message and exit
desc subcommands:
  {cohorts,run_dates,data,analyses}
    cohorts       list available TCGA disease cohorts that have data in
                        the Broad GDAC open-access repositories
    run_dates      list available TCGA firehose run dates from the Broad
                        GDAC open-access repositories
    data        list available data types for a TCGA cohort
    analyses      list Firehose analyses available for a TCGA cohort
Install clinical, gene expression and mutation data for the TCGA colon adenocarcinoma (COAD) project.
tcga_util setup -c COAD
Explicitly install the mutation data for coadread (the combined TCGA colon and rectal adenocarcinoma cohort).
tcga_util setup -a mutation_assessor -c coadread
List the coadread samples that carry the BRAF V600E mutation.
tcga_util mutations -g BRAF -aac V600E -hd -c coadread
# Hugo_Symbol ChromChange AAChange Variant_Classification Variant_Type Tumor_Sample_Barcode
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3947-01A-01W-0995-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3966-01A-01W-1073-09
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3877-01A-01W-0995-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-A6-2672-01A-01W-0833-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-A022-01A-21W-A096-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3543-01A-01W-0833-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AG-3578-01A-01W-0831-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-A01D-01A-01W-A00E-09
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-A01P-01A-21W-A096-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3833-01A-01W-0900-09
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-A6-2676-01A-01W-0833-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3516-01A-02W-0833-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3949-01A-01W-0995-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3684-01A-02W-0900-09
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3821-01A-01W-0995-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-A00D-01A-01W-A005-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3525-01A-02W-0833-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-A00J-01A-02W-A005-10
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3715-01A-01W-0900-09
BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3664-01A-01W-0900-09
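The last column of the output above is a TCGA aliquot barcode, which encodes the patient, sample type and other identifiers. As a hedged sketch (Python 3), the helper below splits a barcode into its standard fields and collects the distinct patient identifiers from output like the listing above. The function names are hypothetical, and the code assumes whitespace-delimited lines with the barcode in the last column and a header line starting with #.

```python
from collections import namedtuple

Barcode = namedtuple(
    "Barcode",
    "project tss participant sample vial portion analyte plate center",
)


def parse_barcode(barcode):
    """Split a 7-part TCGA aliquot barcode into named fields."""
    parts = barcode.split("-")
    if len(parts) != 7:
        raise ValueError("expected 7 dash-separated parts: %r" % barcode)
    project, tss, participant, sample_vial, portion_analyte, plate, center = parts
    return Barcode(
        project, tss, participant,
        sample_vial[:2], sample_vial[2:],          # e.g. "01" and "A"
        portion_analyte[:2], portion_analyte[2:],  # e.g. "01" and "W"
        plate, center,
    )


def patient_ids(output_lines):
    """Return distinct patient IDs (project-TSS-participant), in input order."""
    ids = []
    for line in output_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and the header line
        b = parse_barcode(line.split()[-1])
        pid = "-".join([b.project, b.tss, b.participant])
        if pid not in ids:
            ids.append(pid)
    return ids


lines = [
    "# Hugo_Symbol ChromChange AAChange Variant_Classification Variant_Type Tumor_Sample_Barcode",
    "BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-3947-01A-01W-0995-10",
    "BRAF c.T1799A p.V600E Missense_Mutation SNP TCGA-AA-A022-01A-21W-A096-10",
]
print(patient_ids(lines))
```

Patient-level identifiers are useful for joining mutation hits to clinical records, which are typically keyed by patient rather than by aliquot.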
Check what data is available in TCGA.
The following example shows how to list the available data from the Broad Institute Firehose pipeline.
tcga_util desc data -c coadread
running: firehose_get -tasks stddata latest coadread
Clinical_Pick_Tier1
Merge_Clinical
Merge_cna__illuminahiseq_dnaseqc__hms_harvard_edu__Level_3__segmentation__seg
Merge_methylation__humanmethylation27__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data
Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data
Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data
Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data
Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data
Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data
Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__exon_quantification__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__junction_quantification__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data
Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__exon_expression__data
Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__gene_expression__data
Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__splice_junction_expression__data
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg
Merge_transcriptome__agilentg4502a_07_3__unc_edu__Level_3__unc_lowess_normalization_gene_level__data
Methylation_Preprocess
miRseq_Mature_Preprocess
miRseq_Preprocess
mRNAseq_Preprocess
mRNA_Preprocess_Median
Mutation_Packager_Calls
Mutation_Packager_Coverage
RPPA_AnnotateWithGene
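The Merge_ entries in the listing above follow a visible pattern of fields separated by double underscores. The sketch below (Python 3) splits such a name into labelled parts so that entries can be grouped or filtered in a script. The field labels (data_type, platform, center, level, protocol, suffix) are my own reading of the names shown, not an official Firehose specification, and parse_merge_name is a hypothetical helper.

```python
def parse_merge_name(name):
    """Split a 'Merge_<type>__<platform>__...' name into labelled fields.

    Returns None for names that do not have the six-field form
    (for example 'Merge_Clinical' or 'mRNAseq_Preprocess').
    """
    prefix = "Merge_"
    if not name.startswith(prefix):
        return None
    fields = name[len(prefix):].split("__")
    if len(fields) != 6:
        return None
    # Field labels are an interpretation of the listing, not a spec.
    keys = ("data_type", "platform", "center", "level", "protocol", "suffix")
    return dict(zip(keys, fields))


names = [
    "Clinical_Pick_Tier1",
    "Merge_Clinical",
    "Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data",
    "Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data",
    "Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg",
]

# Collect the platforms that provide normalized RSEM gene-level data.
platforms = [
    parsed["platform"]
    for parsed in map(parse_merge_name, names)
    if parsed and parsed["protocol"] == "RSEM_genes_normalized"
]
print(platforms)
```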
Check the available analysis data.
tcga_util desc analyses -c coadread
running: firehose_get -tasks analyses latest coadread
Aggregate_Molecular_Subtype_Clusters
CopyNumberLowPass_Gistic2
CopyNumber_Clustering_CNMF
CopyNumber_Clustering_CNMF_thresholded
CopyNumber_Gistic2
Correlate_Clinical_vs_CopyNumber_Arm
Correlate_Clinical_vs_CopyNumber_Focal
Correlate_Clinical_vs_Methylation
Correlate_Clinical_vs_miRseq
Correlate_Clinical_vs_Molecular_Subtypes
Correlate_Clinical_vs_mRNA
Correlate_Clinical_vs_mRNAseq
Correlate_Clinical_vs_Mutation
Correlate_Clinical_vs_RPPA
Correlate_CopyNumber_vs_mRNA
Correlate_CopyNumber_vs_mRNAseq
Correlate_Methylation_vs_mRNA
Correlate_molecularSubtype_vs_CopyNumber_Arm
Correlate_molecularSubtype_vs_CopyNumber_Focal
Correlate_molecularSubtype_vs_Mutation
Methylation_Clustering_CNMF
miRseq_Clustering_CNMF
miRseq_Clustering_Consensus
miRseq_Mature_Clustering_CNMF
miRseq_Mature_Clustering_Consensus
mRNAseq_Clustering_CNMF
mRNAseq_Clustering_Consensus
mRNA_Clustering_CNMF
mRNA_Clustering_Consensus
Mutation_Assessor
Mutation_CHASM
MutSigNozzleReport1
MutSigNozzleReport2
MutSigNozzleReportCV
MutSigNozzleReportMerged
Pathway_FindEnrichedGenes
Pathway_Hotnet
Pathway_Paradigm_mRNA
Pathway_Paradigm_mRNA_And_Copy_Number
Pathway_Paradigm_RNASeq
Pathway_Paradigm_RNASeq_And_Copy_Number
RPPA_Clustering_CNMF
RPPA_Clustering_Consensus