cidd

Introduction

Cancer In Silico Drug Discovery (cidd) is a command-line based tool for analyzing TCGA data and other cancer data sets for tumor molecular profiling and candidate drug discovery.

cidd analyzes cancer data that have been produced and shared by other cancer research groups. When you use cidd results in your publications, please adhere to the TCGA publication guidelines and cite the CMap, CCLE and the drug databases whose annotations you used.

Installation

tcga_util and cidd run on Mac OS X and Linux. Installation involves three major steps: prerequisite software installation, data resource installation and R library installation, as described below.

Prerequisite software installation

- Python 2.7 or greater
- Python libraries: numpy, scipy, lxml
- R 3.0 or greater

- firehose_get
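The prerequisites above can be checked from Python before installing. The following is a minimal sketch and not part of cidd or tcga_util. It assumes the executables are named R and firehose_get on your PATH, and the sketch itself needs Python 3.4 or later (for importlib.util.find_spec and shutil.which), even though the tools accept Python 2.7. It does not verify the R version.

```python
import importlib.util
import shutil
import sys

# Module names come from the prerequisite list above; the executable
# names ("R", "firehose_get") are assumptions about what is on your PATH.
REQUIRED_MODULES = ("numpy", "scipy", "lxml")
REQUIRED_EXECUTABLES = ("R", "firehose_get")


def missing_prerequisites(modules=REQUIRED_MODULES,
                          executables=REQUIRED_EXECUTABLES,
                          min_python=(2, 7)):
    """Return descriptions of the prerequisites that were not found."""
    missing = []
    if sys.version_info[:2] < min_python:
        missing.append("Python %d.%d or greater" % min_python)
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append("Python library: " + name)
    for name in executables:
        # which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append("executable on PATH: " + name)
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    for item in problems:
        print("missing: " + item)
    if not problems:
        print("all prerequisites found")
```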

Download and install cidd

Please register at http://cidd.houstonbioinformatics.org, then download the most recent versions of tcga_util_{version}.tgz and cidd_{version}.tgz. Extract each archive and run the install command from its base directory, as shown below:

For tcga_util (you may need to add sudo to the beginning of the python setup.py install command):

tar -xvzf tcga_util_{version}.tgz

cd tcga_util_{version}

python setup.py install

For cidd (you may need to add sudo to the beginning of the python setup.py install command):

tar -xvzf cidd_{version}.tgz

cd cidd_{version}

python setup.py install

Data resource installation

The command below checks that the prerequisite data are available. Required data include data from the CMap, CCLE and MSigDB. Data sources that can be downloaded directly from the web, and those that have been customized for cidd, are installed by cidd check if they have not been downloaded previously; details are given below. If cidd check succeeds, it prints a "resources verified" message. Otherwise, it prints a message describing what is missing.

cidd check

This command will set up the directory structure for a data store and download several data resources automatically from the web. Some data sets require registration at their websites before download; these are listed below. Note that $DATA_STORE refers to the location of your data_store directory created by cidd check. Once you have manually installed the data resources, you can run cidd check again to verify that everything is set up for use by cidd. Once this data store is set up, you can reuse it for additional projects and analyses.

Connectivity Map (http://www.broadinstitute.org/cmap)

requires registration: yes

install directory: $DATA_STORE/cmap

data files:

- instance inventory: cmap_instances_02.xls (1.6 MB)
- data matrix: rankMatrix.txt.zip (309 MB)

MSigDB (http://www.broadinstitute.org/gsea/msigdb/collections.jsp)

requires registration: yes

install directory: $DATA_STORE/msigdb

data files:

- C2 curated gene sets by gene symbols: c2.cp.kegg.v4.0.symbols.gmt (87.6 KB)
- C2 curated gene sets by entrez ids: c2.cp.kegg.v4.0.entrez.gmt (87.7 KB)

Cancer Cell Line Encyclopedia (http://www.broadinstitute.org/ccle/data/browseData)

requires registration: yes

install directory: $DATA_STORE/ccle

data files:

- mRNA expression: CCLE_Expression_Entrez_2012-09-29.gct (167.2 MB)
- Cell Line Annotations: CCLE_sample_info_file_2012-10-18.txt (196 KB)
- Oncomap mutations: CCLE_Oncomap3_2012-04-09.maf (318 KB)
- Hybrid capture sequencing mutations: CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf (56.5 MB)
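Before re-running cidd check, it can help to confirm that the manually downloaded files sit in the directories listed above. The sketch below is an illustration only and is not how cidd check is implemented. The filenames and subdirectories are copied from the lists above; whether cidd expects rankMatrix.txt.zip to remain zipped is not stated here, so the name is checked exactly as listed.

```python
import os

# Subdirectories and filenames as listed in the data resource section.
MANUAL_RESOURCES = {
    "cmap": ["cmap_instances_02.xls", "rankMatrix.txt.zip"],
    "msigdb": ["c2.cp.kegg.v4.0.symbols.gmt", "c2.cp.kegg.v4.0.entrez.gmt"],
    "ccle": [
        "CCLE_Expression_Entrez_2012-09-29.gct",
        "CCLE_sample_info_file_2012-10-18.txt",
        "CCLE_Oncomap3_2012-04-09.maf",
        "CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf",
    ],
}


def missing_resources(data_store):
    """Return relative paths of listed files not present under data_store."""
    missing = []
    for subdir in sorted(MANUAL_RESOURCES):
        for filename in MANUAL_RESOURCES[subdir]:
            if not os.path.isfile(os.path.join(data_store, subdir, filename)):
                missing.append(os.path.join(subdir, filename))
    return missing
```

For example, missing_resources(os.environ.get("DATA_STORE", "data_store")) returns an empty list once all eight files are in place.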

Below are drug annotation resources that are automatically downloaded by cidd. If you use the drug annotations, please cite the following resources. These websites make these data sources freely downloadable without the need for user registration. See the specific websites for inquiries regarding non-academic use.

DrugBank (http://www.drugbank.ca): free for non-commercial uses; please visit their website for commercial license inquiries.

Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014 Jan 1;42(1):D1091-7.

MATADOR (http://matador.embl.de): free for non-commercial uses; please visit their website for commercial license inquiries.

Günther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales EG, Gewiess A, Jensen LJ, Schneider R, Skoblo R, Russell RB, Bourne PE, Bork P, Preissner R. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res. 2008 Jan;36(Database issue):D919-22.

KEGG Medicus (ftp://ftp.genome.jp/pub/kegg/medicus/): free for academic users at the GenomeNet FTP site; please visit http://www.kegg.jp/kegg/download for non-academic users. KEGG Medicus is a subset of KEGG. Any other KEGG data (besides KEGG Medicus) requires a data subscription. See http://www.kegg.jp/kegg/download (KEGG FTP Academic Subscription) if interested in these data sets.

Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M.; Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–D205 (2014).

Kanehisa, M. and Goto, S.; KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000).

Install R package dependencies

Start an R console and install the following packages:

mirror = "http://cran.us.r-project.org"

install.packages("graphics", repos=mirror)

install.packages("amap", repos=mirror)

install.packages("gplots", repos=mirror)

install.packages("data.table", repos=mirror)

install.packages("snowfall", repos=mirror)

source("http://bioconductor.org/biocLite.R")

biocLite("edgeR", dependencies=TRUE)

biocLite("piano", dependencies=TRUE)

biocLite("ktspair", dependencies=TRUE)

Quickstart example

In this example, we will set up a cidd data store, create a gene expression signature and identify candidate drugs for BRAF V600E colorectal cancer. Run all of these commands in the same directory. By default, the data store is created in this directory, and cidd commands look for a local data_store directory in the directory where they are run.

Output of the commands below is placed in a project directory named after the project specified in the cidd setup command (i.e., crc_brafv600e). A log file is written to crc_brafv600e/crc_brafv600e.log, and output reports are written to crc_brafv600e/reports. The -v2 verbosity parameter displays highly verbose log messages; remove it to minimize console output. Output report filenames are prefixed with the value of the -n (analysis name) parameter, which in this simple example happens to be the same as the project name.

# [1] check if data dependencies have been installed in expected locations;
# manually install any required dependencies that might be missing;
# these should have already been downloaded in the installation step above
cidd check

# [2] download and install colorectal cancer data
cidd setup -c coadread \
    --data_run_date 2014_07_15 \
    --analysis_run_date 2014_07_15 \
    crc_brafv600e -v2

# [3] generate a gene expression signature
cidd mutation_signature \
    --data_run_date 2014_07_15 \
    --analysis_run_date 2014_07_15 \
    -c coadread \
    -g BRAF \
    -aac V600E \
    --gsa_num_perm 100 \
    --gsa_num_cpus 20 \
    -lfc 2 -n crc_brafv600e -v2

# [4] identify candidate drugs
cidd drugs \
    -np 100 \
    -nt 20 \
    -n crc_brafv600e -v2

# [5] generate a gene expression classifier
cidd classifier generate \
    -n crc_brafv600e -v2

# [6] identify candidate cell lines
cidd cell_lines \
    -g BRAF \
    -aac V600E \
    -t LARGE_INTESTINE \
    -n crc_brafv600e -v2
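If you prefer to drive the quickstart from a script, the six steps can be assembled as argument lists and run in order. This is a sketch under stated assumptions: it only reproduces the commands and flags shown above, the function names are hypothetical, and it reuses one value for both permutation options and one value for both CPU/thread options because the quickstart happens to do so.

```python
import subprocess

PROJECT = "crc_brafv600e"
RUN_DATE = "2014_07_15"


def quickstart_commands(project=PROJECT, run_date=RUN_DATE, cpus=20, perms=100):
    """Return the quickstart steps as argument lists, in execution order."""
    dates = ["--data_run_date", run_date, "--analysis_run_date", run_date]
    name = ["-n", project, "-v2"]
    return [
        ["cidd", "check"],
        ["cidd", "setup", "-c", "coadread"] + dates + [project, "-v2"],
        ["cidd", "mutation_signature"] + dates
        + ["-c", "coadread", "-g", "BRAF", "-aac", "V600E",
           "--gsa_num_perm", str(perms), "--gsa_num_cpus", str(cpus),
           "-lfc", "2"] + name,
        ["cidd", "drugs", "-np", str(perms), "-nt", str(cpus)] + name,
        ["cidd", "classifier", "generate"] + name,
        ["cidd", "cell_lines", "-g", "BRAF", "-aac", "V600E",
         "-t", "LARGE_INTESTINE"] + name,
    ]


def run_quickstart(dry_run=True, **kwargs):
    """Print each step; also execute it unless dry_run is True."""
    for args in quickstart_commands(**kwargs):
        print(" ".join(args))
        if not dry_run:
            subprocess.check_call(args)  # raises at the first failing step
```

Calling run_quickstart() prints the commands without executing them; run_quickstart(dry_run=False) executes them and stops at the first failure, which is useful because each step depends on the output of the previous one.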

cidd commands

check: Verify that required data resources are installed and install resources that can be automatically downloaded

cidd check -h

usage: cidd check [-h] [-d DATA_STORE] [-v {0,1,2}]

This will check for a data store in the specified location. If one doesn't

exist, it will create an empty one for manual population. If resources are

missing, please install them manually and then run check again until it

succeeds before proceeding with running cidd analyses.

optional arguments:

-h, --help show this help message and exit

-d DATA_STORE, --data_store DATA_STORE

name of directory where data resources are, or will

be, stored. Defaults to environment variable

$DATA_STORE.

-v {0,1,2}, --verbosity {0,1,2}

output error and warning (0), info (1) and debug (2)

information to standard output (default to 1)
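The help text says the data store defaults to the $DATA_STORE environment variable, while the quickstart mentions a local data_store directory. One plausible way to combine the two is sketched below. The precedence order (explicit -d value, then $DATA_STORE, then ./data_store) is an assumption made for illustration and is not confirmed by the cidd documentation; resolve_data_store is a hypothetical helper.

```python
import os


def resolve_data_store(cli_value=None, environ=None, cwd=None):
    """Pick a data store path: -d value, then $DATA_STORE, then ./data_store.

    The precedence is assumed, not taken from cidd itself.
    """
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    if cli_value:
        return os.path.abspath(cli_value)
    if environ.get("DATA_STORE"):
        return os.path.abspath(environ["DATA_STORE"])
    return os.path.join(cwd, "data_store")
```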

setup: Setup a new project and/or download and install TCGA data

If a cidd project (a project_name.cidd file and a project_name folder) does not exist in the current directory, this command will create one. The command can also be used to download the necessary TCGA data for a given TCGA project. For example, specifying coadread for the --cohort parameter will result in cidd downloading clinical, gene expression microarray, RNA-seq and mutation data for the TCGA colorectal cancer project into a TCGA directory in your local data store.

cidd setup -h

usage: cidd setup [-h] -c COHORT [-ar ANALYSIS_RUN_DATE] [-dr DATA_RUN_DATE]

[-f] [-d DATA_STORE] [-v {0,1,2}]

project

This will create a project directory for storing project artifacts like

expression signatures, classifiers, reports, etc. It will also create a

data_store directory if one doesn't exist already with an appropriate data

structure for cidd projects. If a project directory already exists, the

command simply adds data to the existing directory.

positional arguments:

project name of a new project

optional arguments:

-h, --help show this help message and exit

-c COHORT, --cohort COHORT

disease cohort to setup data for (for a list of

possible disease cohorts run "tcga_util desc cohorts".

-ar ANALYSIS_RUN_DATE, --analysis_run_date ANALYSIS_RUN_DATE

run date for analyses to describe (defaults to

"latest")

-dr DATA_RUN_DATE, --data_run_date DATA_RUN_DATE

run date for data to describe (defaults to "latest")

-f, --force force replace a project if it already exists

-d DATA_STORE, --data_store DATA_STORE

name of directory where data resources are, or will

be, stored. Defaults to environment variable

$DATA_STORE.

-v {0,1,2}, --verbosity {0,1,2}

output error and warning (0), info (1) and debug (2)

information to standard output (default to 1)

signature: Generate a gene expression signature

cidd signature -h

usage: cidd signature [-h] [-ar ANALYSIS_RUN_DATE] [-dr DATA_RUN_DATE] -c

COHORT

[-et {rnaseq,rnaseq_illuminaga,rnaseq_illuminahiseq,agilent}]

[--cases CASES] [--controls CONTROLS]

[-cg CANDIDATE_GENES] -n NAME

[-lcm {euclidean,maximum,manhattan,canberra,binary,pearson,abspearson,correlation,abscorrelation,spearman,kendall}]

[-lam {none,BH,BY,holm}] [-lp LIMMA_ADJ_PVAL_THRESH]

[-lfc LIMMA_FC_THRESH] [-lperm LIMMA_PERMUTATIONS]

[-gsa {fisher,stouffer,reporter,tailStrength,wilcoxon,mean,median,sum,maxmean,gsea,page}]

[-gsm {geneSampling,samplePermutation}]

[-gam {holm,hochberg,hommel,bonferroni,BH,BY,fdr,none}]

[-ggp GSA_GSEA_PARAM] [-gperm GSA_NUM_PERM]

[-gnc GSA_NUM_CPUS] [-gs GENE_SETS] [-d DATA_STORE]

[-v {0,1,2}]

This command generates an expression signature that represents a class of

samples. In addition, a classifier will be generated to be used in subsequent

class prediction analyses. A heatmap illustrating clustering of samples using

the signature can also be generated.

optional arguments:

-h, --help show this help message and exit

-ar ANALYSIS_RUN_DATE, --analysis_run_date ANALYSIS_RUN_DATE

run date for analyses to describe (defaults to

"latest")

-dr DATA_RUN_DATE, --data_run_date DATA_RUN_DATE

run date for data to describe (defaults to "latest")

-c COHORT, --cohort COHORT

disease cohort to setup data for (for a list of

possible disease cohorts run "tcga_util desc cohorts"

-et {rnaseq,rnaseq_illuminaga,rnaseq_illuminahiseq,agilent}, --expression_type {rnaseq,rnaseq_illuminaga,rnaseq_illuminahiseq,agilent}

the TCGA data type to be analyzed. By default,

"rnaseq" is selected and the platform (IlluminaGA or

IlluminaHiSeq) that provides the most case samples is

selected for analysis.

--cases CASES name of collection or file with case patient or sample

IDs

--controls CONTROLS name of collection or file with control patient or

sample IDs

-cg CANDIDATE_GENES, --candidate_genes CANDIDATE_GENES

filename containing a list of genes to limit the

signature to (e.g., a set of pathway genes or a set of

genes with some prior evidence suggesting that they

are related to the phenotype of interest, etc). By

default, all genes are considered for inclusion in the

signature.

-n NAME, --name NAME name of signature - used to prefix output filenames

-lcm {euclidean,maximum,manhattan,canberra,binary,pearson,abspearson,correlation,abscorrelation,spearman,kendall}, --limma_clust_method {euclidean,maximum,manhattan,canberra,binary,pearson,abspearson,correlation,abscorrelation,spearman,kendall}

hierarchical clustering distance method to be used

with the R function hcluster {amap}.

-lam {none,BH,BY,holm}, --limma_adjust_method {none,BH,BY,holm}

method used to adjust the differential expression

p-values for multiple testing using the R function

toptable {limma}. Options, in increasing conservatism,

include "none", "BH", "BY" and "holm"

-lp LIMMA_ADJ_PVAL_THRESH, --limma_adj_pval_thresh LIMMA_ADJ_PVAL_THRESH

adjusted p-value threshold at which to define

differentially expressed genes for inclusion in the

gene signature

-lfc LIMMA_FC_THRESH, --limma_fc_thresh LIMMA_FC_THRESH

fold change threshold at which to define

differentially expressed genes for inclusion in the

gene signature

-lperm LIMMA_PERMUTATIONS, --limma_permutations LIMMA_PERMUTATIONS

number of results to generate with permuted sample

labels (these results can be used downstream for

assessing gene set analysis significance)

-gsa {fisher,stouffer,reporter,tailStrength,wilcoxon,mean,median,sum,maxmean,gsea,page}, --gsa_stat {fisher,stouffer,reporter,tailStrength,wilcoxon,mean,median,sum,maxmean,gsea,page}

statistical gene set method to use to identify gene

sets associated with cases using the R function runGSA

{piano}.

-gsm {geneSampling,samplePermutation}, --gsa_sig_method {geneSampling,samplePermutation}

the method for significance assessment of gene sets as

defined by the R function runGSA {piano}. geneSampling

permutes gene labels and samplePermutation permutes

sample status labels

-gam {holm,hochberg,hommel,bonferroni,BH,BY,fdr,none}, --gsa_adj_method {holm,hochberg,hommel,bonferroni,BH,BY,fdr,none}

the method for adjusting for multiple testing. Can be

any of the methods supported by p.adjust, i.e. "holm",

"hochberg", "hommel", "bonferroni", "BH", "BY", "fdr"

or "none". The exception is for --gsa_stat=gsea, where

only the options "fdr" and "none" can be used.

-ggp GSA_GSEA_PARAM, --gsa_gsea_param GSA_GSEA_PARAM

parameter as defined by gsea - recommended to be 1 by

http://www.broadinstitute.org/gsea/index.jsp. This

parameter is only used if the "gsa_sig_method" is

"gsea"

-gperm GSA_NUM_PERM, --gsa_num_perm GSA_NUM_PERM

number permutations for assessing significance of gene

set associations with the case status

-gnc GSA_NUM_CPUS, --gsa_num_cpus GSA_NUM_CPUS

number of cpus available for gene set analyses

-gs GENE_SETS, --gene_sets GENE_SETS

gene sets to use for gene set analyses

-d DATA_STORE, --data_store DATA_STORE

name of directory where data resources are, or will

be, stored. Defaults to environment variable

$DATA_STORE.

-v {0,1,2}, --verbosity {0,1,2}

output error and warning (0), info (1) and debug (2)

information to standard output (default to 1)
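The -lam, -lp and -lfc options above together decide which genes enter the signature. According to the help text, cidd delegates this to limma in R, so the Python sketch below only illustrates the logic: Benjamini-Hochberg ("BH") adjustment followed by thresholding. The default thresholds, the assumption that fold changes are on a log scale, and the function names are all hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * n / rank over all ranks at or above its own.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_signature(genes, log_fold_changes, pvalues,
                     adj_pval_thresh=0.05, lfc_thresh=2.0):
    """Split genes passing both thresholds into up and down lists.

    Threshold defaults are illustrative assumptions, not cidd defaults.
    """
    adjusted = bh_adjust(pvalues)
    up, down = [], []
    for gene, lfc, adj in zip(genes, log_fold_changes, adjusted):
        if adj > adj_pval_thresh or abs(lfc) < lfc_thresh:
            continue
        (up if lfc > 0 else down).append(gene)
    return up, down
```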

drugs: Identify and annotate candidate drugs

cidd drugs -h

usage: cidd drugs [-h] [--up UP] [--down DOWN] [--rank_matrix RANK_MATRIX]

[--instances INSTANCES]

[--background_enrichment_scores BACKGROUND_ENRICHMENT_SCORES]

[-np NUM_PERMUTATIONS] [-nt NUM_THREADS]

[-cg CANDIDATE_GENES] -n NAME [-d DATA_STORE] [-v {0,1,2}]

This command identifies candidate drugs that, when compared to the provided

gene expression signature, induces a complementary gene expression signature

on cell lines.

optional arguments:

-h, --help show this help message and exit

--up UP a list of up regulated genes (Entrez IDs or gene symbols)

--down DOWN a list of down regulated genes (Entrez IDs or gene symbols)

--rank_matrix RANK_MATRIX

matrix for perturbagen instance effects on cell line

gene expression ranks (defaults to an Entrez gene

version of the CMAP rank matrix)

--instances INSTANCES

details for the perturbagen instances represented in

the rank matrix (defaults to instance details provided

by the CMAP)

--background_enrichment_scores BACKGROUND_ENRICHMENT_SCORES

list of enrichment scores obtained by applying random

gene signatures to the rank matrix - used to calculate

an empirical p-value for the enrichment scores of user

signatures (defaults to scores obtained by applying

MSigDB gene signatures to the CMAP rank matrix)

-np NUM_PERMUTATIONS, --num_permutations NUM_PERMUTATIONS

number permutations for assessing significance of gene

set associations with the case status

-nt NUM_THREADS, --num_threads NUM_THREADS

number of threads for parallel processing

-cg CANDIDATE_GENES, --candidate_genes CANDIDATE_GENES

filename containing a list of genes to limit the

signature to (e.g., a set of genes with some prior

evidence suggesting that they are differentially

expressed between the classes of interest). By

default, all signature genes are used in the drug

search.

-n NAME, --name NAME name of analysis - used to prefix output files

-d DATA_STORE, --data_store DATA_STORE

name of directory where data resources are, or will

be, stored. Defaults to environment variable

$DATA_STORE.

-v {0,1,2}, --verbosity {0,1,2}

output error and warning (0), info (1) and debug (2)

information to standard output (default to 1)

By default, the cidd drugs command uses the output of the cidd signature command and screens that signature against the drug-induced signatures provided by the Connectivity Map. If you have your own lists of up- and down-regulated genes, you can pass them directly to cidd drugs with the --up and --down parameters, without running cidd signature. Note that cidd drugs fails if no cidd project exists (for example, if you have not downloaded any TCGA data or run cidd setup), so establish a project first:

cidd setup project_name

This creates an empty cidd project called project_name in the current directory; cidd drugs reports are then generated within this project's directory structure.
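The --background_enrichment_scores option describes an empirical p-value computed against scores from random signatures. A minimal sketch of that idea follows. It is not taken from the cidd source: the two-sided comparison on absolute values and the add-one correction are assumptions, and empirical_pvalue is a hypothetical name.

```python
def empirical_pvalue(score, background_scores):
    """Fraction of background scores at least as extreme as score.

    One is added to the numerator and denominator so that the estimate is
    never exactly zero (a common convention, assumed here).
    """
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    extreme = sum(1 for b in background_scores if abs(b) >= abs(score))
    return (extreme + 1.0) / (len(background_scores) + 1.0)
```

With this convention, the smallest attainable p-value is 1 / (N + 1) for N background scores, so a larger background gives finer resolution.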

cell_lines: Identify candidate cell lines to test drugs on

cidd cell_lines -h

usage: cidd cell_lines [-h] [-t TISSUE] [-r RUN_DATE] [-g GENES]

[-aac AMINO_ACID_CHANGES] [-cod CODONS]

[-vc VARIANT_CLASSIFICATIONS] [-cg CANDIDATE_GENES]

[-c CLASSIFIER] -n NAME [-d DATA_STORE] [-v {0,1,2}]

This command queries ccle data to try to identify cell lines that are most

similar to the samples used to generate a signature.

optional arguments:

-h, --help show this help message and exit

-t TISSUE, --tissue TISSUE

cell line tissue of interest

-r RUN_DATE, --run_date RUN_DATE

run date for data to describe (defaults to "latest")

-g GENES, --genes GENES

A quoted list of genes that should contain the

mutations to retrieve (e.g., "BRCA2|BRAF|KRAS"). If a

sample has mutations in any of these, genes, they will

be reported.

-aac AMINO_ACID_CHANGES, --amino_acid_changes AMINO_ACID_CHANGES

A quoted list of amino acid substitutions to search

for (e.g., "V600E|G12D"). Each substitution will be

searched for within all genes specified through

--genes. If a sample has one of these amino acid

changes in any of the genes (in --genes), that sample

will be reported.

-cod CODONS, --codons CODONS

A quoted list of codon numbers of mutations to search

for (e.g., "12|13|146"). Each substitution will be

searched for within all genes specified through

--genes.

-vc VARIANT_CLASSIFICATIONS, --variant_classifications VARIANT_CLASSIFICATIONS

A quoted list of classifications to search for (e.g.,

"Missense_Mutation|Nonstop_Mutation"). Each will be

searched for within all genes specified through

--genes. Possible values for -vc include: 3'UTR,

5'Flank, 5'UTR, De_novo_Start_InFrame,

De_novo_Start_OutOfFrame, Frame_Shift_Del,

Frame_Shift_Ins, In_Frame_Del,In_Frame_Ins, Intron,

Missense_Mutation, Nonsense_Mutation,

Nonstop_Mutation, RNA, Silent, Splice_Site and

Translation_Start_Site

-cg CANDIDATE_GENES, --candidate_genes CANDIDATE_GENES

filename containing a list of genes to limit the

signature to (e.g., a set of pathway genes or a set of

genes with some prior evidence suggesting that they

are related to the phenotype of interest, etc). By

default, all genes are considered for inclusion in the

signature.

-c CLASSIFIER, --classifier CLASSIFIER

a signature to identify candidate for

-n NAME, --name NAME name of analysis - used to prefix output files

-d DATA_STORE, --data_store DATA_STORE

name of directory where data resources are, or will

be, stored. Defaults to environment variable

$DATA_STORE.

-v {0,1,2}, --verbosity {0,1,2}

output error and warning (0), info (1) and debug (2)

information to standard output (default to 1)

Tutorials

What TCGA data is available for download?

Check the available data for colorectal cancer (i.e., the coadread project):

tcga_util desc data -c coadread

running: firehose_get -tasks stddata latest coadread

Clinical_Pick_Tier1

Merge_Clinical

Merge_cna__illuminahiseq_dnaseqc__hms_harvard_edu__Level_3__segmentation__seg

Merge_methylation__humanmethylation27__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data

Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data

Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data

Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data

Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data

Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data

Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data

Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__exon_quantification__data

Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__junction_quantification__data

Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data

Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes__data

Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data

Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data

Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data

Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data

Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data

Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data

Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__exon_expression__data

Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__gene_expression__data

Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__splice_junction_expression__data

Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg

Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg

Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg

Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg

Merge_transcriptome__agilentg4502a_07_3__unc_edu__Level_3__unc_lowess_normalization_gene_level__data

Methylation_Preprocess

miRseq_Mature_Preprocess

miRseq_Preprocess

mRNAseq_Preprocess

mRNA_Preprocess_Median

Mutation_Packager_Calls

Mutation_Packager_Coverage

RPPA_AnnotateWithGene
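The "Merge_" names in the listing above follow a regular pattern of six fields separated by double underscores, which makes them easy to filter in a script. The sketch below splits such a name into fields. The field labels (data_type, platform, center, level, content, format) are my own reading of the pattern and not official Firehose terminology; names that do not fit the pattern return None.

```python
def parse_merge_name(name):
    """Split a 'Merge_*' data set name on double underscores.

    Returns None for names that do not follow the six-field pattern
    (for example 'Clinical_Pick_Tier1' or 'Merge_Clinical').
    """
    parts = name.split("__")
    if len(parts) != 6 or not parts[0].startswith("Merge_"):
        return None
    # Field labels are assumptions chosen for readability.
    keys = ("data_type", "platform", "center", "level", "content", "format")
    fields = dict(zip(keys, parts))
    fields["data_type"] = fields["data_type"][len("Merge_"):]
    return fields
```

For example, filtering a list of names for entries whose data_type is "rnaseqv2" and whose content is "RSEM_genes_normalized" selects the normalized gene-level RNA-seq sets for both Illumina platforms.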

Check the available analysis data:

tcga_util desc analyses -c coadread

running: firehose_get -tasks analyses latest coadread

Aggregate_Molecular_Subtype_Clusters

CopyNumberLowPass_Gistic2

CopyNumber_Clustering_CNMF

CopyNumber_Clustering_CNMF_thresholded

CopyNumber_Gistic2

Correlate_Clinical_vs_CopyNumber_Arm

Correlate_Clinical_vs_CopyNumber_Focal

Correlate_Clinical_vs_Methylation

Correlate_Clinical_vs_miRseq

Correlate_Clinical_vs_Molecular_Subtypes

Correlate_Clinical_vs_mRNA

Correlate_Clinical_vs_mRNAseq

Correlate_Clinical_vs_Mutation

Correlate_Clinical_vs_RPPA

Correlate_CopyNumber_vs_mRNA

Correlate_CopyNumber_vs_mRNAseq

Correlate_Methylation_vs_mRNA

Correlate_molecularSubtype_vs_CopyNumber_Arm

Correlate_molecularSubtype_vs_CopyNumber_Focal

Correlate_molecularSubtype_vs_Mutation

Methylation_Clustering_CNMF

miRseq_Clustering_CNMF

miRseq_Clustering_Consensus

miRseq_Mature_Clustering_CNMF

miRseq_Mature_Clustering_Consensus

mRNAseq_Clustering_CNMF

mRNAseq_Clustering_Consensus

mRNA_Clustering_CNMF

mRNA_Clustering_Consensus

Mutation_Assessor

Mutation_CHASM

MutSigNozzleReport1

MutSigNozzleReport2

MutSigNozzleReportCV

MutSigNozzleReportMerged

Pathway_FindEnrichedGenes

Pathway_Hotnet

Pathway_Paradigm_mRNA

Pathway_Paradigm_mRNA_And_Copy_Number

Pathway_Paradigm_RNASeq

Pathway_Paradigm_RNASeq_And_Copy_Number

RPPA_Clustering_CNMF

RPPA_Clustering_Consensus

Downloading and installing TCGA data for colorectal cancer

cidd check

cidd setup -c coadread crc_brafv600e -v2

Generate a gene expression signature based on a mutation

You can specify mutations explicitly, as in the example below:

cidd mutation_signature -c coadread -g BRAF -aac V600E --gsa_num_perm 100 --gsa_num_cpus 20 -lfc 2 -n crc_brafv600e -v2

Generate a gene expression signature by explicitly specifying sample IDs for a list of case samples and a list of control samples

This method is useful for identifying signatures that might be associated with clinical data or non-mutation molecular data. In such cases, you can identify your own lists of case and control samples (from your own analyses) and run the cidd signature command. In this example, we regenerate the BRAF V600E signature, but specify the lists of case and control IDs explicitly. Here, these lists were generated by the cidd mutation_signature command described previously.

cidd signature -c coadread --cases crc_brafv600e/reports/crc_brafv600e_cases.samples --controls crc_brafv600e/reports/crc_brafv600e_controls.samples --gsa_num_perm 100 --gsa_num_cpus 20 -lfc 2 -n crc_brafv600e -v2

Identifying candidate drugs for tumors by explicitly specifying a tumor gene expression signature

In this command, we supply a gene expression signature that was identified outside of cidd. A list of up-regulated Entrez gene IDs and a list of down-regulated Entrez gene IDs are input to cidd.

cidd drugs --up entrez_ids.up --down entrez_ids.down -n test_analysis -v2

Identifying candidate drugs for tumors using a signature generated by cidd

This can be done by simply specifying the analysis name used when generating the signature. cidd will automatically retrieve the gene expression signature.

cidd drugs -np 100 -n crc_brafv600e -v2

Generate a gene expression classifier

After generating a gene expression signature in cidd, you can generate a kTSP (k-Top Scoring Pairs) classifier using these signature genes. This classifier is needed if you want to identify candidate cell lines that resemble your tumor of interest.

cidd classifier generate -n crc_brafv600e -v2
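To make the kTSP idea concrete: each gene pair votes according to which of its two genes is more highly expressed in a sample, and the class with the most votes wins. The sketch below is a simplified pure-Python illustration and not cidd's implementation. It scores pairs exhaustively, breaks ties by pair order, takes k as given rather than choosing it by cross-validation, and assumes both classes contain at least one sample.

```python
from itertools import combinations


def pair_scores(expr, labels):
    """Score each gene pair (i, j) by how differently 'gene i < gene j'
    occurs in class 1 versus class 0. expr is a list of per-sample lists."""
    n_genes = len(expr[0])
    n1 = sum(1 for y in labels if y == 1)
    n0 = len(labels) - n1
    scores = []
    for i, j in combinations(range(n_genes), 2):
        p1 = sum(1 for x, y in zip(expr, labels) if y == 1 and x[i] < x[j]) / float(n1)
        p0 = sum(1 for x, y in zip(expr, labels) if y == 0 and x[i] < x[j]) / float(n0)
        # The last field records which ordering points to class 1.
        scores.append((abs(p1 - p0), i, j, p1 > p0))
    return scores


def train_ktsp(expr, labels, k=3):
    """Greedily keep the k best-scoring pairs that share no gene."""
    chosen, used = [], set()
    ranked = sorted(pair_scores(expr, labels), key=lambda s: -s[0])
    for score, i, j, lt_means_class1 in ranked:
        if i in used or j in used:
            continue
        chosen.append((i, j, lt_means_class1))
        used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


def predict_ktsp(pairs, sample):
    """Majority vote over the pairs; ties go to class 0."""
    votes = sum(1 for i, j, lt1 in pairs if (sample[i] < sample[j]) == lt1)
    return 1 if 2 * votes > len(pairs) else 0
```

Because the classifier depends only on within-sample orderings, it can be applied across platforms (for example, from tumor RNA-seq to cell line microarrays) without matching expression scales, which is one reason this family of classifiers suits the cell line step.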

Identifying cell lines to represent tumors with a given mutation

This command retrieves cell lines that represent the tumors characterized by cidd. cidd uses the gene expression classifier generated in the analysis named crc_brafv600e. Additionally, the cell lines are filtered to large intestine, and the candidate cell lines are required to have a BRAF V600E mutation.

cidd cell_lines -g BRAF -aac V600E -t LARGE_INTESTINE -n crc_brafv600e
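The mutation filter expressed by -g and -aac can be pictured as a scan over a tab-separated mutation (MAF) file. The sketch below is illustrative only and is not cidd's code. The column names (Hugo_Symbol, Tumor_Sample_Barcode, Protein_Change) and the "p." prefix on protein changes are assumptions to check against the header of your own MAF file, and the tissue filter (-t) is not reproduced.

```python
import csv


def split_terms(value):
    """Turn a pipe-separated option value such as "V600E|G12D" into a set."""
    return set(term.strip() for term in value.split("|") if term.strip())


def matching_samples(maf_path, genes, amino_acid_changes,
                     gene_col="Hugo_Symbol",
                     sample_col="Tumor_Sample_Barcode",
                     change_col="Protein_Change"):
    """Return sorted sample IDs having one of the changes in one of the genes.

    Column names are assumptions; adjust them to match your MAF header.
    """
    wanted_genes = split_terms(genes)
    wanted_changes = split_terms(amino_acid_changes)
    samples = set()
    with open(maf_path) as handle:
        # Skip comment lines that may precede the header row.
        rows = (line for line in handle if not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            change = row.get(change_col) or ""
            if change.startswith("p."):
                change = change[2:]
            if row.get(gene_col) in wanted_genes and change in wanted_changes:
                samples.add(row[sample_col])
    return sorted(samples)
```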
