
cidd

Introduction

Cancer In Silico Drug Discovery (cidd) is a command-line based tool for analyzing TCGA data and other cancer data sets for tumor molecular profiling and candidate drug discovery.  

cidd identifies candidate drugs by analyzing cancer data that has been produced and shared by other cancer research groups.  When using cidd results in your publications, please adhere to the TCGA publication guidelines and cite the CMap, CCLE and drug databases (for any annotations used).

Installation

tcga_util and cidd run on Mac OS X and Linux operating systems.  Installation involves three major steps: prerequisite software installation, data resource installation and R library installation, as described below.

Prerequisite software installation

  • Python 2.7 or greater
  • Python libraries: numpy, scipy, lxml
  • R 3.0 or greater
  • firehose_get

Download and install cidd

Please register at http://cidd.houstonbioinformatics.org and download the most recent versions of tcga_util_{version}.tgz and cidd_{version}.tgz.  Extract each archive and run the install command from its base directory, as shown below:

For tcga_util (you may need to add sudo to the beginning of the python setup.py install command):
tar -xvzf tcga_util_{version}.tgz
cd tcga_util_{version}
python setup.py install

For cidd (you may need to add sudo to the beginning of the python setup.py install command):
tar -xvzf cidd_{version}.tgz
cd cidd_{version}
python setup.py install

Data resource installation

The command below checks that the prerequisite data are available.  Required data include data from the CMap, CCLE and MSigDB.  Other data sources that can be downloaded directly from the web, and those that have been customized for cidd use, will be installed by cidd check if they have not been downloaded previously.  See below for details on these data.  If cidd check is successful, you will get a "resources verified" message.  Otherwise, a message describing what is missing will be displayed.

cidd check

This command sets up the directory structure for a data store and automatically downloads several data resources from the web.  Some data sets require registration at their websites before download; these are listed below.  Note that $DATA_STORE refers to the location of the data_store directory created by cidd check.  Once you have manually installed these data resources, run cidd check again to confirm that everything is set up for use by cidd.  Once the data store is set up, you can reuse it for additional projects and analyses.

Connectivity Map (http://www.broadinstitute.org/cmap)
requires registration: yes
install directory: $DATA_STORE/cmap
data files:
  • instance inventory: cmap_instances_02.xls (1.6 MB) 
  • data matrix: rankMatrix.txt.zip (309 MB) 

MSigDB (http://www.broadinstitute.org/gsea/msigdb/collections.jsp)
requires registration: yes
install directory: $DATA_STORE/msigdb
data files: 
  • C2 curated gene sets by gene symbols: c2.cp.kegg.v4.0.symbols.gmt (87.6 KB) 
  • C2 curated gene sets by entrez ids: c2.cp.kegg.v4.0.entrez.gmt (87.7 KB) 

Cancer Cell Line Encyclopedia (http://www.broadinstitute.org/ccle/data/browseData)
requires registration: yes
install directory: $DATA_STORE/ccle
data files:
  • mRNA expression: CCLE_Expression_Entrez_2012-09-29.gct (167.2 MB) 
  • Cell Line Annotations: CCLE_sample_info_file_2012-10-18.txt (196 KB) 
  • Oncomap mutations: CCLE_Oncomap3_2012-04-09.maf (318 KB) 
  • Hybrid capture sequencing mutations: CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf (56.5 MB)
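
Before re-running cidd check, it can be handy to see which of the manually downloaded files are still missing.  The following is a minimal Python sketch, not part of cidd: the helper name missing_resources is hypothetical, and it assumes that each file is placed directly in the install directory listed above under its original download name (whether cidd expects rankMatrix.txt.zip to remain zipped is not stated here, so treat that as an assumption).

```python
import os

# File names listed above for manual download, keyed by data store subdirectory.
MANUAL_RESOURCES = {
    "cmap": ["cmap_instances_02.xls", "rankMatrix.txt.zip"],
    "msigdb": ["c2.cp.kegg.v4.0.symbols.gmt", "c2.cp.kegg.v4.0.entrez.gmt"],
    "ccle": [
        "CCLE_Expression_Entrez_2012-09-29.gct",
        "CCLE_sample_info_file_2012-10-18.txt",
        "CCLE_Oncomap3_2012-04-09.maf",
        "CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf",
    ],
}


def missing_resources(data_store):
    """Return the expected file paths that do not exist under data_store."""
    missing = []
    for subdir, filenames in sorted(MANUAL_RESOURCES.items()):
        for filename in filenames:
            path = os.path.join(data_store, subdir, filename)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

For example, missing_resources(os.environ.get("DATA_STORE", "data_store")) returns an empty list once all eight files are in place.  cidd check remains the authoritative verification.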

Below are drug annotation resources that are automatically downloaded by cidd.  If you use the drug annotations, please cite the following resources.  These websites make these data sources freely downloadable without the need for user registration.  See the specific websites for inquiries regarding non-academic use.

DrugBank (http://www.drugbank.ca): free for non-commercial uses; please visit their website for commercial license inquiries.

Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014 Jan 1;42(1):D1091-7.

MATADOR (http://matador.embl.de): free for non-commercial uses; please visit their website for commercial license inquiries.

Günther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales EG, Gewiess A, Jensen LJ, Schneider R, Skoblo R, Russell RB, Bourne PE, Bork P, Preissner R. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res. 2008 Jan;36(Database issue):D919-22.

KEGG Medicus (ftp://ftp.genome.jp/pub/kegg/medicus/): free for academic users at the GenomeNet FTP site; please visit http://www.kegg.jp/kegg/download for non-academic users.  KEGG Medicus is a subset of KEGG.  Any other KEGG data (besides KEGG Medicus) requires a data subscription.  See http://www.kegg.jp/kegg/download (KEGG FTP Academic Subscription) if interested in these data sets.

Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M.; Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–D205 (2014).

Kanehisa, M. and Goto, S.; KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000).

Install R package dependencies

Start an R console and install the following packages:

mirror = "http://cran.us.r-project.org"

install.packages("graphics", repos=mirror)
install.packages("amap", repos=mirror)
install.packages("gplots", repos=mirror)
install.packages("data.table", repos=mirror)
install.packages("snowfall", repos=mirror)

source("http://bioconductor.org/biocLite.R")
biocLite("edgeR", dependencies=TRUE)
biocLite("piano", dependencies=TRUE)
biocLite("ktspair", dependencies=TRUE)

Quickstart example

In this example, we will set up a cidd data store, create a gene expression signature and identify candidate drugs for BRAF V600E colorectal cancer.  Run all of these commands in the same directory.  By default, the data store is created in this directory, and cidd commands look for a local data_store directory in the directory where they are run.  Output from the commands below is placed in a project directory named after the project specified in the cidd setup command (i.e., crc_brafv600e).  A log file is written to crc_brafv600e/crc_brafv600e.log (the -v2 verbosity parameter produces highly verbose log messages; remove it to minimize console output), and output reports are written to crc_brafv600e/reports.  Output report filenames are prefixed with the value of the -n (analysis name) parameter, which in this simple example happens to be the same as the project name.

# [1] check if data dependencies have been installed in expected locations;
# manually install any required dependencies that might be missing;
# these should have already been downloaded in the installation step above
cidd check                                                 

# [2] download and install colorectal cancer data
cidd setup -c coadread \
  --data_run_date 2014_07_15 \
  --analysis_run_date 2014_07_15 \
  crc_brafv600e -v2   

# [3] generate a gene expression signature
cidd mutation_signature \
  --data_run_date 2014_07_15 \
  --analysis_run_date 2014_07_15 \
  -c coadread \
  -g BRAF \
  -aac V600E \
  --gsa_num_perm 100 \
  --gsa_num_cpus 20 \
  -lfc 2 -n crc_brafv600e -v2

# [4] identify candidate drugs                        
cidd drugs \
  -np 100 \
  -nt 20 \
  -n crc_brafv600e -v2  

# [5] generate a gene expression classifier
cidd classifier generate \
  -n crc_brafv600e -v2  

# [6] identify candidate cell lines
cidd cell_lines \
  -g BRAF \
  -aac V600E \
  -t LARGE_INTESTINE \
  -n crc_brafv600e -v2
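
If you want to repeat the quickstart for another cohort or mutation, the six steps can be scripted.  The sketch below is one possible wrapper, assuming cidd is on your PATH; the function names quickstart_commands and run_quickstart are hypothetical, and the arguments are exactly those used in the commands above, with nothing added.

```python
import subprocess


def quickstart_commands(project, cohort, gene, aa_change, tissue, run_date):
    """Build the six quickstart command lines as argument lists."""
    dates = ["--data_run_date", run_date, "--analysis_run_date", run_date]
    return [
        ["cidd", "check"],
        ["cidd", "setup", "-c", cohort] + dates + [project, "-v2"],
        ["cidd", "mutation_signature"] + dates + [
            "-c", cohort, "-g", gene, "-aac", aa_change,
            "--gsa_num_perm", "100", "--gsa_num_cpus", "20",
            "-lfc", "2", "-n", project, "-v2"],
        ["cidd", "drugs", "-np", "100", "-nt", "20", "-n", project, "-v2"],
        ["cidd", "classifier", "generate", "-n", project, "-v2"],
        ["cidd", "cell_lines", "-g", gene, "-aac", aa_change,
         "-t", tissue, "-n", project, "-v2"],
    ]


def run_quickstart(commands):
    """Run each step in order; check_call raises if a step exits non-zero."""
    for argv in commands:
        subprocess.check_call(argv)
```

Calling run_quickstart(quickstart_commands("crc_brafv600e", "coadread", "BRAF", "V600E", "LARGE_INTESTINE", "2014_07_15")) reproduces the example above and stops at the first failing step.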
 

cidd commands

check: Verify that required data resources are installed and install resources that can be automatically downloaded

cidd check -h

usage: cidd check [-h] [-d DATA_STORE] [-v {0,1,2}]
This will check for a data store in the specified location. If one doesn't
exist, it will create an empty one for manual population. If resources are
missing, please install them manually and then run check again until it
succeeds before proceeding with running cidd analyses.

optional arguments:
  -h, --help            show this help message and exit
  -d DATA_STORE, --data_store DATA_STORE
                        name of directory where data resources are, or will
                        be, stored. Defaults to environment variable
                        $DATA_STORE.
  -v {0,1,2}, --verbosity {0,1,2}
                        output error and warning (0), info (1) and debug (2)
                        information to standard output (default to 1)

setup: Set up a new project and/or download and install TCGA data

If a cidd project (a project_name.cidd file and a project_name folder) does not exist in the current directory, this command will create one.  The command can also be used to download the necessary TCGA data for a given TCGA project.  For example, specifying coadread for the --cohort parameter will result in cidd downloading clinical, gene expression microarray, RNA-seq and mutation data for the TCGA colorectal cancer project into a TCGA directory in your local data store.

cidd setup -h

usage: cidd setup [-h] -c COHORT [-ar ANALYSIS_RUN_DATE] [-dr DATA_RUN_DATE]
                  [-f] [-d DATA_STORE] [-v {0,1,2}]
                  project

This will create a project directory for storing project artifacts like
expression signatures, classifiers, reports, etc. It will also create a
data_store directory if one doesn't exist already with an appropriate data
structure for cidd projects.  If a project directory already exists, the
command simply adds data to the existing directory.

positional arguments:
  project               name of a new project

optional arguments:
  -h, --help            show this help message and exit
  -c COHORT, --cohort COHORT
                        disease cohort to setup data for (for a list of
                        possible disease cohorts run "tcga_util desc cohorts".
  -ar ANALYSIS_RUN_DATE, --analysis_run_date ANALYSIS_RUN_DATE
                        run date for analyses to describe (defaults to
                        "latest")
  -dr DATA_RUN_DATE, --data_run_date DATA_RUN_DATE
                        run date for data to describe (defaults to "latest")
  -f, --force           force replace a project if it already exists
  -d DATA_STORE, --data_store DATA_STORE
                        name of directory where data resources are, or will
                        be, stored. Defaults to environment variable
                        $DATA_STORE.
  -v {0,1,2}, --verbosity {0,1,2}
                        output error and warning (0), info (1) and debug (2)
                        information to standard output (default to 1)

signature: Generate a gene expression signature

cidd signature -h


usage: cidd signature [-h] [-ar ANALYSIS_RUN_DATE] [-dr DATA_RUN_DATE] -c
                      COHORT
                      [-et {rnaseq,rnaseq_illuminaga,rnaseq_illuminahiseq,agilent}]
                      [--cases CASES] [--controls CONTROLS]
                      [-cg CANDIDATE_GENES] -n NAME
                      [-lcm {euclidean,maximum,manhattan,canberra,binary,pearson,abspearson,correlation,abscorrelation,spearman,kendall}]
                      [-lam {none,BH,BY,holm}] [-lp LIMMA_ADJ_PVAL_THRESH]
                      [-lfc LIMMA_FC_THRESH] [-lperm LIMMA_PERMUTATIONS]
                      [-gsa {fisher,stouffer,reporter,tailStrength,wilcoxon,mean,median,sum,maxmean,gsea,page}]
                      [-gsm {geneSampling,samplePermutation}]
                      [-gam {holm,hochberg,hommel,bonferroni,BH,BY,fdr,none}]
                      [-ggp GSA_GSEA_PARAM] [-gperm GSA_NUM_PERM]
                      [-gnc GSA_NUM_CPUS] [-gs GENE_SETS] [-d DATA_STORE]
                      [-v {0,1,2}]
This command generates an expression signature that represents a class of
samples. In addition, a classifier will be generated to be used in subsequent
class prediction analyses. A heatmap illustrating clustering of samples using
the signature can also be generated.
optional arguments:
  -h, --help            show this help message and exit
  -ar ANALYSIS_RUN_DATE, --analysis_run_date ANALYSIS_RUN_DATE
                        run date for analyses to describe (defaults to
                        "latest")
  -dr DATA_RUN_DATE, --data_run_date DATA_RUN_DATE
                        run date for data to describe (defaults to "latest")
  -c COHORT, --cohort COHORT
                        disease cohort to setup data for (for a list of
                        possible disease cohorts run "tcga_util desc cohorts"
  -et {rnaseq,rnaseq_illuminaga,rnaseq_illuminahiseq,agilent}, --expression_type {rnaseq,rnaseq_illuminaga,rnaseq_illuminahiseq,agilent}
                        the TCGA data type to be analyzed. By default,
                        "rnaseq" is selected and the platform (IlluminaGA or
                        IlluminaHiSeq) that provides the most case samples is
                        selected for analysis.
  --cases CASES         name of collection or file with case patient or sample
                        IDs
  --controls CONTROLS   name of collection or file with control patient or
                        sample IDs
  -cg CANDIDATE_GENES, --candidate_genes CANDIDATE_GENES
                        filename containing a list of genes to limit the
                        signature to (e.g., a set of pathway genes or a set of
                        genes with some prior evidence suggesting that they
                        are related to the phenotype of interest, etc). By
                        default, all genes are considered for inclusion in the
                        signature.
  -n NAME, --name NAME  name of signature - used to prefix output filenames
  -lcm {euclidean,maximum,manhattan,canberra,binary,pearson,abspearson,correlation,abscorrelation,spearman,kendall}, --limma_clust_method {euclidean,maximum,manhattan,canberra,binary,pearson,abspearson,correlation,abscorrelation,spearman,kendall}
                        hierarchical clustering distance method to be used
                        with the R function hcluster {amap}.
  -lam {none,BH,BY,holm}, --limma_adjust_method {none,BH,BY,holm}
                        method used to adjust the differential expression
                        p-values for multiple testing using the R function
                        toptable {limma}. Options, in increasing conservatism,
                        include "none", "BH", "BY" and "holm"
  -lp LIMMA_ADJ_PVAL_THRESH, --limma_adj_pval_thresh LIMMA_ADJ_PVAL_THRESH
                        adjusted p-value threshold at which to define
                        differentially expressed genes for inclusion in the
                        gene signature
  -lfc LIMMA_FC_THRESH, --limma_fc_thresh LIMMA_FC_THRESH
                        fold change threshold at which to define
                        differentially expressed genes for inclusion in the
                        gene signature
  -lperm LIMMA_PERMUTATIONS, --limma_permutations LIMMA_PERMUTATIONS
                        number of results to generate with permuted sample
                        labels (these results can be used downstream for
                        assessing gene set analysis significance)
  -gsa {fisher,stouffer,reporter,tailStrength,wilcoxon,mean,median,sum,maxmean,gsea,page}, --gsa_stat {fisher,stouffer,reporter,tailStrength,wilcoxon,mean,median,sum,maxmean,gsea,page}
                        statistical gene set method to use to identify gene
                        sets associated with cases using the R function runGSA
                        {piano}.
  -gsm {geneSampling,samplePermutation}, --gsa_sig_method {geneSampling,samplePermutation}
                        the method for significance assessment of gene sets as
                        defined by the R function runGSA {piano}. geneSampling
                        permutes gene labels and samplePermutation permutes
                        sample status labels
  -gam {holm,hochberg,hommel,bonferroni,BH,BY,fdr,none}, --gsa_adj_method {holm,hochberg,hommel,bonferroni,BH,BY,fdr,none}
                        the method for adjusting for multiple testing. Can be
                        any of the methods supported by p.adjust, i.e. "holm",
                        "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr"
                        or "none". The exception is for --gsa_stat=gsea, where
                        only the options "fdr" and "none" can be used.
  -ggp GSA_GSEA_PARAM, --gsa_gsea_param GSA_GSEA_PARAM
                        parameter as defined by gsea - recommended to be 1 by
                        http://www.broadinstitute.org/gsea/index.jsp. This
                        parameter is only used if the "gsa_sig_method" is
                        "gsea"
  -gperm GSA_NUM_PERM, --gsa_num_perm GSA_NUM_PERM
                        number permutations for assessing significance of gene
                        set associations with the case status
  -gnc GSA_NUM_CPUS, --gsa_num_cpus GSA_NUM_CPUS
                        number of cpus available for gene set analyses
  -gs GENE_SETS, --gene_sets GENE_SETS
                        gene sets to use for gene set analyses
  -d DATA_STORE, --data_store DATA_STORE
                        name of directory where data resources are, or will
                        be, stored. Defaults to environment variable
                        $DATA_STORE.
  -v {0,1,2}, --verbosity {0,1,2}
                        output error and warning (0), info (1) and debug (2)
                        information to standard output (default to 1)

drugs: Identify and annotate candidate drugs

cidd drugs -h

usage: cidd drugs [-h] [--up UP] [--down DOWN] [--rank_matrix RANK_MATRIX]
                  [--instances INSTANCES]
                  [--background_enrichment_scores BACKGROUND_ENRICHMENT_SCORES]
                  [-np NUM_PERMUTATIONS] [-nt NUM_THREADS]
                  [-cg CANDIDATE_GENES] -n NAME [-d DATA_STORE] [-v {0,1,2}]

This command identifies candidate drugs that, when compared to the provided
gene expression signature, induces a complementary gene expression signature
on cell lines.

optional arguments:
  -h, --help            show this help message and exit
  --up UP               a list of up regulated genes (Entrez IDs or gene symbols)
  --down DOWN           a list of down regulated genes (Entrez IDs or gene symbols)
  --rank_matrix RANK_MATRIX
                        matrix for perturbagen instance effects on cell line
                        gene expression ranks (defaults to an Entrez gene
                        version of the CMAP rank matrix)
  --instances INSTANCES
                        details for the perturbagen instances represented in
                        the rank matrix (defaults to instance details provided
                        by the CMAP)
  --background_enrichment_scores BACKGROUND_ENRICHMENT_SCORES
                        list of enrichment scores obtained by applying random
                        gene signatures to the rank matrix - used to calculate
                        an empirical p-value for the enrichment scores of user
                        signatures (defaults to scores obtained by applying
                        MSigDB gene signatures to the CMAP rank matrix)
  -np NUM_PERMUTATIONS, --num_permutations NUM_PERMUTATIONS
                        number permutations for assessing significance of gene
                        set associations with the case status
  -nt NUM_THREADS, --num_threads NUM_THREADS
                        number of threads for parallel processing
  -cg CANDIDATE_GENES, --candidate_genes CANDIDATE_GENES
                        filename containing a list of genes to limit the
                        signature to (e.g., a set of genes with some prior
                        evidence suggesting that they are differentially
                        expressed between the classes of interest). By
                        default, all signature genes are used in the drug
                        search.
  -n NAME, --name NAME  name of analysis - used to prefix output files
  -d DATA_STORE, --data_store DATA_STORE
                        name of directory where data resources are, or will
                        be, stored. Defaults to environment variable
                        $DATA_STORE.
  -v {0,1,2}, --verbosity {0,1,2}
                        output error and warning (0), info (1) and debug (2)
                        information to standard output (default to 1)

By default, the cidd drugs command uses the output of the cidd signature command and screens that signature against the drug-induced signatures provided by the Connectivity Map.  If you have your own lists of up- and down-regulated genes, you can specify them explicitly (without running a cidd signature command) using the --up and --down parameters of cidd drugs.  The cidd drugs command requires an existing cidd project, so it will fail if you have not run a cidd setup command.  If you have not downloaded any TCGA data, first create a project:
 
cidd setup project_name
 
This creates an empty cidd project called project_name in the current directory, and cidd drugs reports will be generated within the file structure of this project.
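
The --background_enrichment_scores option above describes using scores from random gene signatures to calculate an empirical p-value.  As a rough illustration of that idea only, the Python sketch below computes a one-sided empirical p-value.  The exact formula cidd uses is not documented here, so the direction handling and the +1 correction (a common convention that avoids reporting a p-value of zero) are assumptions, and the function name is hypothetical.

```python
def empirical_p_value(observed, background):
    """Fraction of background scores at least as extreme as the observed score.

    Assumption: "extreme" means at least as far from zero in the same
    direction as the observed score; +1 is added to both counts.
    """
    if not background:
        raise ValueError("background must contain at least one score")
    if observed < 0:
        extreme = sum(1 for score in background if score <= observed)
    else:
        extreme = sum(1 for score in background if score >= observed)
    return (extreme + 1.0) / (len(background) + 1.0)
```

With this convention, the smallest p-value obtainable from N background scores is 1 / (N + 1), so a larger background gives finer resolution.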

cell_lines: Identify candidate cell lines to test drugs on

cidd cell_lines -h


usage: cidd cell_lines [-h] [-t TISSUE] [-r RUN_DATE] [-g GENES]
                       [-aac AMINO_ACID_CHANGES] [-cod CODONS]
                       [-vc VARIANT_CLASSIFICATIONS] [-cg CANDIDATE_GENES]
                       [-c CLASSIFIER] -n NAME [-d DATA_STORE] [-v {0,1,2}]
This command queries ccle data to try to identify cell lines that are most
similar to the samples used to generate a signature.
optional arguments:
  -h, --help            show this help message and exit
  -t TISSUE, --tissue TISSUE
                        cell line tissue of interest
  -r RUN_DATE, --run_date RUN_DATE
                        run date for data to describe (defaults to "latest")
  -g GENES, --genes GENES
                        A quoted list of genes that should contain the
                        mutations to retrieve (e.g., "BRCA2|BRAF|KRAS"). If a
                        sample has mutations in any of these, genes, they will
                        be reported.
  -aac AMINO_ACID_CHANGES, --amino_acid_changes AMINO_ACID_CHANGES
                        A quoted list of amino acid substitutions to search
                        for (e.g., "V600E|G12D"). Each substitution will be
                        searched for within all genes specified through
                        --genes. If a sample has one of these amino acid
                        changes in any of the genes (in --genes), that sample
                        will be reported.
  -cod CODONS, --codons CODONS
                        A quoted list of codon numbers of mutations to search
                        for (e.g., "12|13|146"). Each substitution will be
                        searched for within all genes specified through
                        --genes.
  -vc VARIANT_CLASSIFICATIONS, --variant_classifications VARIANT_CLASSIFICATIONS
                        A quoted list of classifications to search for (e.g.,
                        "Missense_Mutation|Nonstop_Mutation"). Each will be
                        searched for within all genes specified through
                        --genes. Possible values for -vc include: 3'UTR,
                        5'Flank, 5'UTR, De_novo_Start_InFrame,
                        De_novo_Start_OutOfFrame, Frame_Shift_Del,
                        Frame_Shift_Ins, In_Frame_Del,In_Frame_Ins, Intron,
                        Missense_Mutation, Nonsense_Mutation,
                        Nonstop_Mutation, RNA, Silent, Splice_Site and
                        Translation_Start_Site
  -cg CANDIDATE_GENES, --candidate_genes CANDIDATE_GENES
                        filename containing a list of genes to limit the
                        signature to (e.g., a set of pathway genes or a set of
                        genes with some prior evidence suggesting that they
                        are related to the phenotype of interest, etc). By
                        default, all genes are considered for inclusion in the
                        signature.
  -c CLASSIFIER, --classifier CLASSIFIER
                        a signature to identify candidate for
  -n NAME, --name NAME  name of analysis - used to prefix output files
  -d DATA_STORE, --data_store DATA_STORE
                        name of directory where data resources are, or will
                        be, stored. Defaults to environment variable
                        $DATA_STORE.
  -v {0,1,2}, --verbosity {0,1,2}
                        output error and warning (0), info (1) and debug (2)
                        information to standard output (default to 1)


Tutorials

What TCGA data is available for download?

Check the available data for colorectal cancer (e.g., the coadread project):

tcga_util desc data -c coadread

running: firehose_get -tasks stddata latest coadread
Clinical_Pick_Tier1
Merge_Clinical
Merge_cna__illuminahiseq_dnaseqc__hms_harvard_edu__Level_3__segmentation__seg
Merge_methylation__humanmethylation27__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data
Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data
Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data
Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data
Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data
Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data
Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__exon_quantification__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__junction_quantification__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes__data
Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data
Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data
Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__exon_expression__data
Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__gene_expression__data
Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__splice_junction_expression__data
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg
Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg
Merge_transcriptome__agilentg4502a_07_3__unc_edu__Level_3__unc_lowess_normalization_gene_level__data
Methylation_Preprocess
miRseq_Mature_Preprocess
miRseq_Preprocess
mRNAseq_Preprocess
mRNA_Preprocess_Median
Mutation_Packager_Calls
Mutation_Packager_Coverage
RPPA_AnnotateWithGene

Check the available analysis data:

tcga_util desc analyses -c coadread

running: firehose_get -tasks analyses latest coadread
Aggregate_Molecular_Subtype_Clusters
CopyNumberLowPass_Gistic2
CopyNumber_Clustering_CNMF
CopyNumber_Clustering_CNMF_thresholded
CopyNumber_Gistic2
Correlate_Clinical_vs_CopyNumber_Arm
Correlate_Clinical_vs_CopyNumber_Focal
Correlate_Clinical_vs_Methylation
Correlate_Clinical_vs_miRseq
Correlate_Clinical_vs_Molecular_Subtypes
Correlate_Clinical_vs_mRNA
Correlate_Clinical_vs_mRNAseq
Correlate_Clinical_vs_Mutation
Correlate_Clinical_vs_RPPA
Correlate_CopyNumber_vs_mRNA
Correlate_CopyNumber_vs_mRNAseq
Correlate_Methylation_vs_mRNA
Correlate_molecularSubtype_vs_CopyNumber_Arm
Correlate_molecularSubtype_vs_CopyNumber_Focal
Correlate_molecularSubtype_vs_Mutation
Methylation_Clustering_CNMF
miRseq_Clustering_CNMF
miRseq_Clustering_Consensus
miRseq_Mature_Clustering_CNMF
miRseq_Mature_Clustering_Consensus
mRNAseq_Clustering_CNMF
mRNAseq_Clustering_Consensus
mRNA_Clustering_CNMF
mRNA_Clustering_Consensus
Mutation_Assessor
Mutation_CHASM
MutSigNozzleReport1
MutSigNozzleReport2
MutSigNozzleReportCV
MutSigNozzleReportMerged
Pathway_FindEnrichedGenes
Pathway_Hotnet
Pathway_Paradigm_mRNA
Pathway_Paradigm_mRNA_And_Copy_Number
Pathway_Paradigm_RNASeq
Pathway_Paradigm_RNASeq_And_Copy_Number
RPPA_Clustering_CNMF
RPPA_Clustering_Consensus

Downloading and installing TCGA data for colorectal cancer

cidd check
cidd setup -c coadread crc_brafv600e -v2

Generate a gene expression signature based on a mutation

You can specify mutations explicitly, as in the example below:

cidd mutation_signature -c coadread -g BRAF -aac V600E --gsa_num_perm 100 --gsa_num_cpus 20 -lfc 2 -n crc_brafv600e -v2

Generate a gene expression signature by explicitly specifying sample IDs for a list of case samples and a list of control samples

This method is useful for identifying signatures that might be associated with clinical data or non-mutation molecular data.  In such cases, you can identify your own lists of case and control samples (from your own analyses) and run the cidd signature command.  In this example, we regenerate the BRAF V600E signature, but specify the lists of case and control sample IDs explicitly.  These lists were generated by the cidd mutation_signature command described previously.

cidd signature -c coadread --cases crc_brafv600e/reports/crc_brafv600e_cases.samples --controls crc_brafv600e/reports/crc_brafv600e_controls.samples --gsa_num_perm 100 --gsa_num_cpus 20 -lfc 2 -n crc_brafv600e -v2

Identifying candidate drugs for tumors by explicitly specifying a tumor gene expression signature

In this example, we have identified our own gene expression signature outside of cidd.  A list of up-regulated Entrez gene IDs and a list of down-regulated Entrez gene IDs are provided as input to cidd.

cidd drugs --up entrez_ids.up --down entrez_ids.down -n test_analysis -v2
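
One way to produce the two input files from an external differential expression analysis is sketched below in Python.  The file format expected by --up and --down is not specified in this documentation, so writing one ID per line is an assumption, as is the use of log2 fold changes; the function names and the fold change values in the usage note are purely illustrative.

```python
def split_signature(log2_fold_changes, threshold=1.0):
    """Split {entrez_id: log2 fold change} into sorted up and down ID lists."""
    up = sorted(g for g, lfc in log2_fold_changes.items() if lfc >= threshold)
    down = sorted(g for g, lfc in log2_fold_changes.items() if lfc <= -threshold)
    return up, down


def write_gene_list(path, gene_ids):
    """Write one gene ID per line (assumed input format, not confirmed)."""
    with open(path, "w") as handle:
        for gene_id in gene_ids:
            handle.write("%s\n" % gene_id)
```

For example, with made-up values, split_signature({"673": 2.5, "7157": -1.8, "1956": 0.3}) places "673" in the up list and "7157" in the down list, and drops "1956".  The two lists can then be written with write_gene_list("entrez_ids.up", up) and write_gene_list("entrez_ids.down", down).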

Identifying candidate drugs for tumors using a signature generated by cidd

This can be done by simply specifying the analysis name used when generating the signature.  cidd will automatically retrieve the gene expression signature. 

cidd drugs -np 100 -n crc_brafv600e -v2

Generate a gene expression classifier

After generating a gene expression signature in cidd, you can generate a kTSP (k-Top Scoring Pairs) classifier using these signature genes.  This classifier is needed if you want to identify candidate cell lines that resemble your tumor of interest.

cidd classifier generate -n crc_brafv600e -v2
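
For readers unfamiliar with kTSP, the general idea is that each of k gene pairs casts a vote based only on which gene of the pair is more highly expressed in a sample, and the majority vote decides the class.  The Python sketch below illustrates that voting rule in its simplest form.  It is not cidd's implementation; the gene names are placeholders, and mapping "first gene lower than second" to the case class is an arbitrary convention chosen for the example.

```python
def ktsp_predict(expression, gene_pairs):
    """Return True (case) if most pairs satisfy expression[a] < expression[b].

    expression: dict mapping gene name to expression value for one sample.
    gene_pairs: list of (a, b) gene name tuples; an odd number avoids ties.
    """
    votes = sum(1 for a, b in gene_pairs if expression[a] < expression[b])
    return votes * 2 > len(gene_pairs)
```

Because only the relative ordering within each pair matters, this kind of classifier can be applied across platforms with different expression scales, which is one reason it suits comparing tumor samples with cell lines.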

Identifying cell lines to represent tumors with a given mutation

This command retrieves cell lines that represent the tumors characterized by cidd.  It tells cidd to use the gene expression classifier generated in the analysis named crc_brafv600e.  Additionally, the results are filtered for large intestine cell lines, and the candidate cell lines are required to have a BRAF V600E mutation.

cidd cell_lines -g BRAF -aac V600E -t LARGE_INTESTINE -n crc_brafv600e
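
The -g and -aac parameters accept pipe-delimited lists (e.g., "BRAF|KRAS" and "V600E|G12D"), and a cell line is reported if it carries any listed amino acid change in any listed gene.  The Python sketch below illustrates that matching rule as described in the help text.  The record layout (cell_line, gene and aa_change keys) and the cell line names are hypothetical and do not reflect the actual CCLE MAF columns or cidd's internal code.

```python
def matching_cell_lines(mutations, genes, amino_acid_changes):
    """Return sorted cell lines with any listed change in any listed gene.

    mutations: list of dicts with hypothetical keys
               "cell_line", "gene" and "aa_change".
    genes, amino_acid_changes: pipe-delimited strings, as passed to -g and -aac.
    """
    gene_set = set(genes.split("|"))
    change_set = set(amino_acid_changes.split("|"))
    hits = set()
    for record in mutations:
        if record["gene"] in gene_set and record["aa_change"] in change_set:
            hits.add(record["cell_line"])
    return sorted(hits)
```

Note that every amino acid change is searched for in every gene, so passing "BRAF|KRAS" with "V600E|G12D" would also match a KRAS V600E record if one existed.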