MutScape: an analytical toolkit for probing the mutational landscape in cancer genomics
Cancer genomics has been evolving rapidly, fueled by the emergence of numerous studies and public databases through next-generation sequencing technologies. However, the downstream programs used to preprocess and analyze somatic mutation data are scattered across different tools, most of which require specific input formats. Here, we developed a user-friendly Python toolkit, MutScape, which provides a comprehensive pipeline of filtering, combination, transformation, analysis and visualization, so that researchers with somatic mutation data can easily explore cohort-based mutational characteristics in cancer genomics. MutScape can not only preprocess millions of mutation records in a few minutes, but also offers various analyses simultaneously, including driver gene detection, mutational signature, large-scale alteration identification and actionable biomarker annotation. Furthermore, MutScape supports somatic variant data in both variant call format and mutation annotation format, and leverages caller combination strategies to quickly eliminate false positives. With only two simple commands, robust results and publication-quality images are generated automatically. Herein, we demonstrate the ability of MutScape to correctly reproduce known results using breast cancer samples from The Cancer Genome Atlas. More significantly, the advanced features in MutScape enable the discovery of novel results in cancer genomic studies. MutScape is freely available on GitHub, at https://github.com/anitalu724/MutScape.
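The idea of a caller combination strategy can be illustrated with a minimal sketch. This is not MutScape's API: the function name `combine_callers`, the variant key layout and the `min_callers` threshold are all assumptions made for illustration. The sketch keeps only variants reported by at least a given number of callers, which is one common way to reduce false positives.

```python
from collections import defaultdict

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = tuple[str, int, str, str]


def combine_callers(calls: dict[str, set[Variant]], min_callers: int = 2) -> set[Variant]:
    """Keep variants reported by at least `min_callers` different callers."""
    support: dict[Variant, int] = defaultdict(int)
    for variants in calls.values():
        # Each caller contributes at most one vote per variant (sets are deduplicated).
        for variant in variants:
            support[variant] += 1
    return {variant for variant, n in support.items() if n >= min_callers}


# Toy call sets from three hypothetical callers.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}
consensus = combine_callers(calls, min_callers=2)
```

Raising `min_callers` trades sensitivity for specificity: requiring all three callers here would retain only the chr1 variant.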
CNVIntegrate: the first multi-ethnic database for identifying copy number variations associated with cancer
(impact factor: 4.462, journal ranking: 19%)
Human copy number variations (CNVs) and copy number alterations (CNAs) are DNA segments (>1000 base pairs) that are duplicated or deleted with respect to the reference genome, potentially causing genomic imbalance that leads to diseases such as cancer. CNVs also contribute to genetic diversity in healthy populations and are predominant drivers of gene/genome evolution. Initiatives have been taken by the research community to establish large-scale databases to comprehensively characterize CNVs in humans. The Exome Aggregation Consortium (ExAC) is one such endeavor, cataloging CNVs of nearly 60 000 healthy individuals across five demographic clusters. Furthermore, large projects such as the Catalogue of Somatic Mutations in Cancer (COSMIC) and the Cancer Cell Line Encyclopedia (CCLE) combine CNA data from cancer-affected individuals and large panels of human cancer cell lines, respectively. However, we lack a structured and comprehensive CNV/CNA resource that includes both healthy individuals and cancer patients across large populations. CNVIntegrate is the first web-based system that hosts CNV and CNA data from healthy populations and cancer patients, respectively, and concomitantly provides statistical comparisons between the copy number frequencies of multiple ethnic populations. It further includes, for the first time, well-cataloged CNV and CNA data from Taiwanese healthy individuals and Taiwan Breast Cancer data, respectively, along with imported resources from ExAC, COSMIC and CCLE. CNVIntegrate offers a CNV/CNA data hub for structured information retrieval, supporting clinicians and scientists working towards important drug discoveries and precision treatments.
ATTRACTIVE – An Auto-Updating Database for Experimental Protocols in Regenerative Medicine
(impact factor: 3.476, journal ranking: 38%)
Many research articles are published on regenerative medicine every year. However, only a small proportion of these articles provide experimental methods on organ/tissue differentiation. Therefore, we developed a database – ATTRACTIVE (An auTo-updating daTabase foR experimentAl protoCols in regeneraTIVe mEdicine) – that collects journal articles with differentiation methods in regenerative medicine and updates itself automatically on a regular basis. Because the set of available articles in regenerative medicine was small and unbalanced, which limited the performance of supervised learning algorithms, we proposed an algorithm that combines cosine similarity and linear discriminant functions to classify articles based on their titles and abstracts more efficiently. The results show that our proposed method outperformed other machine learning algorithms such as k-nearest neighbors, support vector machine, and long short-term memory methods. The classification accuracy reached 94.62%, even with a small and unbalanced dataset. Lastly, we incorporated our classifier into the database for automatic updates. The database is available at http://attractive.cgm.ntu.edu.tw/.
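As a rough illustration of how cosine similarity and a linear discriminant function can be combined for text classification, consider the sketch below. It is not the ATTRACTIVE algorithm: the abstract does not give the features, weights or training procedure, so the bag-of-words vectors, the class centroids and the hand-set weights (`w_pos`, `w_neg`, `bias`) are all assumptions, as are the toy texts.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts for a title or abstract."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts: list[str]) -> Counter:
    """Summed term counts of a class (cosine similarity ignores the overall scale)."""
    total: Counter = Counter()
    for text in texts:
        total.update(vectorize(text))
    return total


def discriminant(text: str, pos: Counter, neg: Counter,
                 w_pos: float = 1.0, w_neg: float = -1.0, bias: float = 0.0) -> float:
    """Linear function of the two similarities; a positive score means 'protocol article'.

    The weights are hand-set assumptions here; in practice they would be fitted to data.
    """
    vec = vectorize(text)
    return w_pos * cosine(vec, pos) + w_neg * cosine(vec, neg) + bias


# Toy training texts (hypothetical, not from the ATTRACTIVE corpus).
pos_centroid = centroid([
    "differentiation protocol for cardiac tissue from stem cells",
    "stepwise differentiation of stem cells into hepatocytes",
])
neg_centroid = centroid([
    "review of regulatory policy in regenerative medicine",
    "market survey of regenerative medicine companies",
])

score = discriminant("differentiation of stem cells into neurons", pos_centroid, neg_centroid)
is_protocol = score > 0
```

Using similarities to class centroids as the only two features keeps the number of fitted parameters small, which is one plausible way to cope with a small and unbalanced training set.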
VariED: the first integrated database of gene annotation and expression profiles for variants related to human diseases
(impact factor: 4.462, journal ranking: 19%)
Integrated analysis of DNA variants and gene expression profiles may facilitate precise identification of gene regulatory networks involved in disease mechanisms. Despite the widespread availability of public resources, we lack databases that are capable of simultaneously providing gene expression profiles, variant annotations, functional prediction scores and pathogenic analyses. VariED is the first web-based querying system that integrates an annotation database and expression profiles for genetic variants. The database offers a user-friendly platform and locates gene/variant names in the literature by connecting to established online querying tools, biological annotation tools and records from free-text literature. VariED acts as a central hub for organized genome information consisting of gene annotation, variant allele frequency, functional prediction, clinical interpretation and gene expression profiles in three species: human, mouse and zebrafish. VariED also provides a novel scoring scheme to predict the functional impact of a DNA variant. With one single entry, all results regarding queried DNA variants can be downloaded. VariED can potentially serve as an efficient way to obtain comprehensive variant knowledge for clinicians and scientists around the world working on important drug discoveries and precision treatments.
anamiR: integrated analysis of MicroRNA and gene expression profiling
(impact factor: 3.328, journal ranking: 35%)
Background
With advancements in high-throughput technologies, the cost of obtaining expression profiles of both mRNA and microRNA in the same individual has substantially decreased. Integrated analysis of these profiles can help to elucidate the functional effects of RNA expression in complex diseases, such as cancer. However, fundamental discrepancies are observed in the results from microRNA-mRNA target gene prediction algorithms, and few packages can be used to analyze microRNA and mRNA expression levels simultaneously.
Results
To address these issues, an R package, anamiR, was developed. A total of 10 experimental/prediction databases were integrated. Two analytical functions are provided in anamiR, including the single marker test and functional gene set enrichment analysis, and several parameters can be changed by users. Here we demonstrate the potential application of the anamiR package to 2 publicly available microarray datasets.
Conclusion
The anamiR package is effective for an integrated analysis of both RNA and microRNA profiles. By characterizing biological functions and signaling pathways, this package helps identify dysregulated genes/miRNAs from biological and medical experiments. The source code and manual of the anamiR package are freely available at https://bioconductor.org/packages/release/bioc/html/anamiR.html.
CellExpress: a comprehensive microarray-based cancer cell line and clinical sample gene expression analysis online system
(impact factor: 4.462, journal ranking: 19%)
With the advancement of high-throughput technologies, gene expression profiles in cell lines and clinical samples are widely available in the public domain for research. However, a challenge arises when trying to perform a systematic and comprehensive analysis across independent datasets. To address this issue, we developed a web-based system, CellExpress, for analyzing the gene expression levels in more than 4000 cancer cell lines and clinical samples obtained from public datasets and user-submitted data. First, a normalization algorithm can be utilized to reduce the systematic biases across independent datasets. Next, a similarity assessment of gene expression profiles can be achieved through a dynamic dot plot, along with a distance matrix obtained from principal component analysis. Subsequently, differentially expressed genes can be visualized using hierarchical clustering. Several statistical tests and analytical algorithms are implemented in the system for dissecting gene expression changes based on the groupings defined by users. Lastly, users are able to upload their own microarray and/or next-generation sequencing data to perform a comparison of their gene expression patterns, which can help classify user data, such as stem cells, into different tissue types. In conclusion, CellExpress is a user-friendly tool that provides a comprehensive analysis of gene expression levels in both cell lines and clinical samples. The website is freely available at http://cellexpress.cgm.ntu.edu.tw/. Source code is available at https://github.com/LeeYiFang/Carkinos under the MIT License.
iGC—an integrated analysis package of gene expression and copy number alteration
(impact factor: 3.328, journal ranking: 35%)
Background
With the advancement in high-throughput technologies, researchers can simultaneously investigate gene expression and copy number alteration (CNA) data from individual patients at a lower cost. Traditional analysis methods analyze each type of data individually and integrate their results using Venn diagrams. Challenges arise, however, when the results are irreproducible and inconsistent across multiple platforms. To address these issues, one possible approach is to concurrently analyze both gene expression profiling and CNAs in the same individual.
Results
We have developed an open-source R/Bioconductor package (iGC). Multiple input formats are supported and users can define their own criteria for identifying differentially expressed genes driven by CNAs. The analysis of two real microarray datasets demonstrated that the CNA-driven genes identified by the iGC package showed significantly higher Pearson correlation coefficients with their gene expression levels and copy numbers than those genes located in a genomic region with CNA. Compared with the Venn diagram approach, the iGC package showed better performance.
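The comparison above rests on per-gene Pearson correlation between copy number and expression across samples. The sketch below shows that calculation in Python for illustration only; iGC itself is an R/Bioconductor package, and the gene names, sample values and the 0.7 cutoff used here are assumptions rather than the package's defaults.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation; treat as 0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene values across five samples from the same patients.
copy_number = {
    "GENE_A": [2, 3, 4, 2, 5],
    "GENE_B": [2, 2, 3, 4, 2],
}
expression = {
    "GENE_A": [1.0, 1.6, 2.1, 0.9, 2.8],
    "GENE_B": [3.0, 1.2, 2.9, 1.1, 2.0],
}

correlations = {gene: pearson(copy_number[gene], expression[gene]) for gene in copy_number}
# Flag genes whose expression tracks copy number (0.7 is an arbitrary illustrative cutoff).
cna_driven = [gene for gene, r in correlations.items() if r >= 0.7]
```

A gene that merely sits in an altered region, like the hypothetical GENE_B, can show little or even negative correlation, which is the distinction the Results paragraph draws.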
Conclusion
The iGC package is effective and useful for identifying CNA-driven genes. By simultaneously considering both comparative genomic and transcriptomic data, it can provide better understanding of biological and medical questions. The iGC package’s source code and manual are freely available at https://www.bioconductor.org/packages/release/bioc/html/iGC.html.