Genomics

123Genomics - a Genomics, Proteomics and Bioinformatics Knowledge Base

4DXpress - In the major animal model species like mouse, fish or fly, detailed spatial information on gene expression over time can be acquired through whole mount in situ hybridization experiments. In these species, expression patterns of many genes have been studied and data has been integrated into dedicated model organism databases like ZFIN for zebrafish, MEPD for medaka, BDGP for Drosophila or GXD for mouse. However, a central repository that allows users to query and compare gene expression patterns across different species has not yet been established. Therefore, we have integrated expression patterns for zebrafish, Drosophila, medaka and mouse into a central public repository called 4DXpress (expression database in four dimensions). Users can query anatomy ontology-based expression annotations across species and quickly jump from one gene to the orthologues in other species. Genes are linked to public microarray data in ArrayExpress. We have mapped developmental stages between the species to be able to compare developmental time phases. We store the largest collection of gene expression patterns available to date in an individual resource, reflecting 16 505 annotated genes. 4DXpress will be an invaluable tool for developmental as well as for computational biologists interested in gene regulation and evolution.

AnnoJ browser - DNA cytosine methylation is a central epigenetic modification that has essential roles in cellular processes including genome regulation, development and disease. Here we present the first genome-wide, single-base-resolution maps of methylated cytosines in a mammalian genome, from both human embryonic stem cells and fetal fibroblasts, along with comparative analysis of messenger RNA and small RNA components of the transcriptome, several histone modifications, and sites of DNA-protein interaction for several key regulatory factors. Widespread differences were identified in the composition and patterning of cytosine methylation between the two genomes. Nearly one-quarter of all methylation identified in embryonic stem cells was in a non-CG context, suggesting that embryonic stem cells may use different methylation mechanisms to affect gene regulation. Methylation in non-CG contexts showed enrichment in gene bodies and depletion in protein binding sites and enhancers. Non-CG methylation disappeared upon induced differentiation of the embryonic stem cells, and was restored in induced pluripotent stem cells. We identified hundreds of differentially methylated regions proximal to genes involved in pluripotency and differentiation, and widespread reduced methylation levels in fibroblasts associated with lower transcriptional activity. These reference epigenomes provide a foundation for future studies exploring this key epigenetic modification in human disease and development. AnnoJ is a Web 2.0 application designed for visualizing deep sequencing data and other genome annotation data. It is intended to run in modern W3C-compliant browsers, and allows flexible configuration of plugins and data streams from providers located anywhere on the internet.

ArrayTrack - A robust bioinformatics capability is widely acknowledged as central to realizing the promises of toxicogenomics. Successful application of toxicogenomic approaches, such as DNA microarrays, inextricably relies on appropriate data management, the ability to extract knowledge from massive amounts of data, and the availability of functional information for data interpretation. At the FDA's National Center for Toxicological Research (NCTR), we are developing a public microarray data management and analysis software, called ArrayTrack, that is also used in the routine review of genomic data submitted to the FDA. ArrayTrack stores a full range of information related to DNA microarrays and clinical and non-clinical studies as well as the digested data derived from proteomics and metabonomics experiments. In addition, ArrayTrack provides a rich collection of functional information about genes, proteins, and pathways drawn from various public biological databases for facilitating data interpretation. Many data analysis and visualization tools are available with ArrayTrack for individual platform data analysis, multiple omics data integration, and integrated analysis of omics data with study data. Importantly, gene expression data, functional information, and analysis methods are fully integrated so that the data analysis and interpretation process is simplified and enhanced. Using ArrayTrack, users can select an analysis method from the ArrayTrack tool box, apply the method to selected microarray data, and the analysis of results can be directly linked to individual gene, pathway, and Gene Ontology analysis. ArrayTrack is publicly available online ( http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack/index.htm ) and the prospective user can also request a local installation version by contacting the authors.

BIOBASE - TRANSFAC and TRANSPATH

CEBS - CEBS (Chemical Effects in Biological Systems) is an integrated public repository for toxicogenomics data, including the study design and timeline, clinical chemistry and histopathology findings and microarray and proteomics data. CEBS contains data derived from studies of chemicals and of genetic alterations, and is compatible with clinical and environmental studies. CEBS is designed to permit the user to query the data using the study conditions, the subject responses and then, having identified an appropriate set of subjects, to move to the microarray module of CEBS to carry out gene signature and pathway analysis. Scope of CEBS: CEBS currently holds 22 studies of rats, four studies of mice and one study of Caenorhabditis elegans. CEBS can also accommodate data from studies of human subjects. Toxicogenomics studies currently in CEBS comprise over 4000 microarray hybridizations, and 75 2D gel images annotated with protein identification performed by MALDI and MS/MS. CEBS contains raw microarray data collected in accordance with MIAME guidelines and provides tools for data selection, pre-processing and analysis resulting in annotated lists of genes of interest. Additionally, clinical chemistry and histopathology findings from over 1500 animals are included in CEBS. CEBS/BID: The BID (Biomedical Investigation Database) is another component of the CEBS system. BID is a relational database used to load and curate study data prior to export to CEBS, in addition to capturing and displaying novel data types such as PCR data, or additional fields of interest, including those defined by the HESI Toxicogenomics Committee (in preparation). BID has been shared with Health Canada and the US Environmental Protection Agency. CEBS is available at http://cebs.niehs.nih.gov. BID can be accessed via the user interface from https://dir-apps.niehs.nih.gov/arc/.

CellMiner - BACKGROUND: Advances in the high-throughput omic technologies have made it possible to profile cells in a large number of ways at the DNA, RNA, protein, chromosomal, functional, and pharmacological levels. A persistent problem is that some classes of molecular data are labeled with gene identifiers, others with transcript or protein identifiers, and still others with chromosomal locations. What has lagged behind is the ability to integrate the resulting data to uncover complex relationships and patterns. Those issues are reflected in full form by molecular profile data on the panel of 60 diverse human cancer cell lines (the NCI-60) used since 1990 by the U.S. National Cancer Institute to screen compounds for anticancer activity. To our knowledge, CellMiner is the first online database resource for integration of the diverse molecular types of NCI-60 and related meta data. DESCRIPTION: CellMiner enables scientists to perform advanced querying of molecular information on NCI-60 (and additional types) through a single web interface. CellMiner is a freely available tool that organizes and stores raw and normalized data that represent multiple types of molecular characterizations at the DNA, RNA, protein, and pharmacological levels. Annotations for each project, along with associated metadata on the samples and datasets, are stored in a MySQL database and linked to the molecular profile data. Data can be queried and downloaded along with comprehensive information on experimental and analytic methods for each data set. A Data Intersection tool allows selection of a list of genes (proteins) in common between two or more data sets and outputs the data for those genes (proteins) in the respective sets. In addition to its role as an integrative resource for the NCI-60, the CellMiner package also serves as a shell for incorporation of molecular profile data on other cell or tissue sample types. 
CONCLUSION: CellMiner is a relational database tool for storing, querying, integrating, and downloading molecular profile data on the NCI-60 and other cancer cell types. More broadly, it provides a template to use in providing such functionality for other molecular profile data generated by academic institutions, public projects, or the private sector.
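
The Data Intersection idea described above (select the genes common to two or more data sets, then output the data for those genes from each set) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout and the sample values are hypothetical and do not reflect CellMiner's actual implementation or schema.

```python
from typing import Dict, List

# Hypothetical layout: gene symbol -> measured values (one per cell line).
Dataset = Dict[str, List[float]]


def intersect_datasets(datasets: Dict[str, Dataset]) -> Dict[str, Dict[str, List[float]]]:
    """Return, for each gene present in every data set, its values in each set."""
    if not datasets:
        return {}
    # Genes that occur in all of the supplied data sets.
    common = set.intersection(*(set(d) for d in datasets.values()))
    return {
        gene: {name: data[gene] for name, data in datasets.items()}
        for gene in sorted(common)
    }


# Made-up values, for illustration only.
mrna = {"TP53": [1.2, 0.8], "EGFR": [2.1, 3.3], "MYC": [0.5, 0.9]}
protein = {"TP53": [0.4, 0.6], "EGFR": [1.9, 2.2], "KRAS": [1.0, 1.1]}
shared = intersect_datasets({"mrna": mrna, "protein": protein})
print(sorted(shared))  # ['EGFR', 'TP53']
```

Keying the result by gene first and data set second keeps the per-gene values side by side, which is usually what a cross-data-set comparison needs.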

CellMontage - The establishment and rapid expansion of microarray databases have created a need for new search tools. Here we present CellMontage, the first server for expression profile similarity search over a large database: 69 000 microarray experiments derived from NCBI's GEO site. CellMontage provides a novel, content-based search engine for accessing gene expression data. Microarray experiments with similar overall expression to a user-provided expression profile (e.g. microarray experiment) are computed and displayed, usually within 20 s. The core search engine software is downloadable from the site.

ConsensusPathDB - ConsensusPathDB is a database system for the integration of human functional interactions. Current knowledge of these interactions is dispersed in more than 200 databases, each having a specific focus and data format. ConsensusPathDB currently integrates the content of 12 different interaction databases with heterogeneous foci comprising a total of 26,133 distinct physical entities and 74,289 distinct functional interactions (protein-protein interactions, biochemical reactions, gene regulatory interactions), and covering 1738 pathways. We describe the database schema and the methods used for data integration. Furthermore, we describe the functionality of the ConsensusPathDB web interface, where users can search and visualize interaction networks, upload, modify and expand networks in BioPAX, SBML or PSI-MI format, or carry out over-representation analysis with uploaded identifier lists with respect to substructures derived from the integrated interaction network.
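
Over-representation analysis of an identifier list is commonly computed with a one-sided hypergeometric test. The entry above does not say which statistic ConsensusPathDB uses, so the following Python sketch is an assumption about the general technique, not a description of that tool; the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) for a hypergeometric draw without replacement.

    population: number of background identifiers
    successes:  background identifiers that belong to the pathway
    draws:      size of the uploaded identifier list
    observed:   uploaded identifiers that belong to the pathway
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the upper tail: overlaps at least as large as the one observed.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 1000 background genes, a 40-gene pathway,
# a 50-gene uploaded list, 8 of which fall in the pathway.
p = hypergeom_pvalue(population=1000, successes=40, draws=50, observed=8)
print(f"p = {p:.3g}")
```

In practice one p-value is computed per pathway or network substructure, and the resulting set should be corrected for multiple testing.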

CoPub - Medline is a rich information source, from which links between genes and keywords describing biological processes, pathways, drugs, pathologies and diseases can be extracted. We developed a publicly available tool called CoPub that uses the information in the Medline database for the biological interpretation of microarray data. CoPub allows batch input of multiple human, mouse or rat genes and produces lists of keywords from several biomedical thesauri that are significantly correlated with the set of input genes. These lists link to Medline abstracts in which the co-occurring input genes and correlated keywords are highlighted. Furthermore, CoPub can graphically visualize differentially expressed genes and over-represented keywords in a network, providing detailed insight into the relationships between genes and keywords, and revealing the most influential genes as highly connected hubs.

COXPRESdb - A database of coexpressed gene sets can provide valuable information for a wide variety of experimental designs, such as targeting of genes for functional identification, gene regulation and/or protein-protein interactions. Coexpressed gene databases derived from publicly available GeneChip data are widely used in Arabidopsis research, but platforms that examine coexpression for higher mammals are rather limited. Therefore, we have constructed a new database, COXPRESdb (coexpressed gene database) (http://coxpresdb.hgc.jp), for coexpressed gene lists and networks in human and mouse. Coexpression data could be calculated for 19 777 and 21 036 genes in human and mouse, respectively, by using the GeneChip data in NCBI GEO. COXPRESdb enables analysis of the four types of coexpression networks: (i) highly coexpressed genes for every gene, (ii) genes with the same GO annotation, (iii) genes expressed in the same tissue and (iv) user-defined gene sets. When the networks became too big for the static picture on the web in GO networks or in tissue networks, we used Google Maps API to visualize them interactively. COXPRESdb also provides a view to compare the human and mouse coexpression patterns to estimate the conservation between the two species.

COSMIC - The Catalogue of Somatic Mutations in Cancer (COSMIC) (http://www.sanger.ac.uk/cosmic/) is the largest public resource for information on somatically acquired mutations in human cancer and is available freely without restrictions. Currently (v43, August 2009), COSMIC contains details of 1.5 million experiments performed through 13 423 genes in almost 370 000 tumours, describing over 90 000 individual mutations. Data are gathered from two sources: publications in the scientific literature (v43 contains 7797 curated articles) and the full output of the genome-wide screens from the Cancer Genome Project (CGP) at the Sanger Institute, UK. Most of the world's literature on point mutations in human cancer has now been curated into COSMIC and while this is continually updated, a greater emphasis on curating fusion gene mutations is driving the expansion of this information; over 2700 fusion gene mutations are now described. Whole-genome sequencing screens are now identifying large numbers of genomic rearrangements in cancer and COSMIC is now displaying details of these analyses also. Examination of COSMIC's data is primarily web-driven, focused on providing mutation range and frequency statistics based upon a choice of gene and/or cancer phenotype. Graphical views provide easily interpretable summaries of large quantities of data, and export functions can provide precise details of user-selected data.

Database of Genomic Variants - The discovery of an abundance of copy number variants (CNVs; gains and losses of DNA sequences >1 kb) and other structural variants in the human genome is influencing the way research and diagnostic analyses are being designed and interpreted. As such, comprehensive databases with the most relevant information will be critical to fully understand the results and have impact in a diverse range of disciplines ranging from molecular biology to clinical genetics. Here, we describe the development of bioinformatics resources to facilitate these studies. The Database of Genomic Variants (http://projects.tcag.ca/variation/) is a comprehensive catalogue of structural variation in the human genome. The database currently contains 1,267 regions reported to contain copy number variation or inversions in apparently healthy human cases. We describe the current contents of the database and how it can serve as a resource for interpretation of array comparative genomic hybridization (array CGH) and other DNA copy imbalance data. We also present the structure of the database, which was built using a new data modeling methodology termed Cross-Referenced Tables (XRT). This is a generic and easy-to-use platform, which is strong in handling textual data and complex relationships. Web-based presentation tools have been built allowing publication of XRT data to the web immediately along with rapid sharing of files with other databases and genome browsers. We also describe a novel tool named eFISH (electronic fluorescence in situ hybridization) (http://projects.tcag.ca/efish/), a BLAST-based program that was developed to facilitate the choice of appropriate clones for FISH and CGH experiments, as well as interpretation of results in which genomic DNA probes are used in hybridization-based experiments.

dcode.org - Comparative genomics provides the means to demarcate functional regions in anonymous DNA sequences. The successful application of this method to identifying novel genes is currently shifting to deciphering the non-coding encryption of gene regulation across genomes. To facilitate the practical application of comparative sequence analysis to genetics and genomics, we have developed several analytical and visualization tools for the analysis of arbitrary sequences and whole genomes. These tools include two alignment tools, zPicture and Mulan; a phylogenetic shadowing tool, eShadow for identifying lineage- and species-specific functional elements; two evolutionary conserved transcription factor analysis tools, rVista and multiTF; a tool for extracting cis-regulatory modules governing the expression of co-regulated genes, Creme 2.0; and a dynamic portal to multiple vertebrate and invertebrate genome alignments, the ECR Browser. Here, we briefly describe each one of these tools and provide specific examples on their practical applications.

dbDEPC - Cancer-related investigations have long been in the limelight of biomedical research. Years of effort from scientists and doctors worldwide have generated large amounts of data at the genome, transcriptome, proteome and even metabolome level, and DNA and RNA cancer signature databases have been established. Here we present a database of differentially expressed proteins in human cancers (dbDEPC), with the goal of collecting curated cancer proteomics data, providing a resource for information on protein-level expression changes, and exploring protein profile differences among different cancers. dbDEPC currently contains 1803 proteins differentially expressed in 15 cancers, curated from 65 mass spectrometry (MS) experiments in peer-reviewed publications. In addition to MS experiments, low-throughput experiment data from the same publications and cancer-associated genes from external databases were also integrated to provide some validation information. Furthermore, dbDEPC associates differential proteins with important structural variations in the human genome, such as copy number variations or single nucleotide polymorphisms, which might be helpful for explaining changes in protein expression at the DNA level. Data in dbDEPC can be queried by protein identifier, description or sequence; the retrieved protein entry provides the differential expression pattern seen in cancers, along with detailed annotations. dbDEPC is expected to be a reference database for cancer signatures at the protein level.

dbZach - Quantitative risk assessment and the elucidation of mechanisms of toxicity require computational infrastructure and innovative analysis approaches that systematically consider available data at all levels of biological organization. dbZach (http://dbzach.fst.msu.edu) is a modular relational database with associated data insertion, retrieval, and mining tools that manages traditional toxicology and complementary toxicogenomic data to facilitate comprehensive data integration, analysis, and sharing. It consists of four Core Subsystems (i.e., Clones, Genes, Sample Annotation, and Protocols), four Experimental Subsystems (i.e., Microarray, Affymetrix, Real-Time PCR, and Toxicology), and three Computational Subsystems (i.e., Gene Regulation, Pathways, Orthology) that comply with the Minimum Information About a Microarray Experiment (MIAME) standard. It is capable of including emerging technologies and other model systems, including ecologically relevant species. dbZach represents an enterprise toxicogenomic data management system which facilitates data integration and analysis, and reduces uncertainties in the continuum from initial exposure to toxicity while facilitating more comprehensive elucidations of mechanisms of toxicity and supporting mechanistically-based quantitative risk assessment.

ECR Browser - With an increasing number of vertebrate genomes being sequenced in draft or finished form, unique opportunities for decoding the language of DNA sequence through comparative genome alignments have arisen. However, novel tools and strategies are required to accommodate this large volume of genomic information and to facilitate the transfer of predictions generated by comparative sequence alignment to researchers focused on experimental annotation of genome function. Here, we present the ECR Browser, a tool that provides easy and dynamic access to whole genome alignments of human, mouse, rat and fish sequences. This web-based tool (http://ecrbrowser.dcode.org) provides the starting point for discovery of novel genes, identification of distant gene regulatory elements and prediction of transcription factor binding sites. The genome alignment portal of the ECR Browser also permits fast and automated alignments of any user-submitted sequence to the genome of choice. The interconnection of the ECR Browser with other DNA sequence analysis tools creates a unique portal for studying and exploring vertebrate genomes.

Eichler Lab Large Scale Structural Variation Databases - Database of large scale structural variation across multiple species

Enviro-Health Links - Toxicogenomics - National Library of Medicine Toxicogenomics links

GATHER - MOTIVATION: Understanding the full meaning of the biology captured in molecular profiles, within the context of the entire biological system, cannot be achieved with a simple examination of the individual genes in the signature. To facilitate such an understanding, we have developed GATHER, a tool that integrates various forms of available data to elucidate biological context within molecular signatures produced from high-throughput post-genomic assays. RESULTS: Analyzing the Rb/E2F tumor suppressor pathway, we show that GATHER identifies critical features of the pathway. We further show that GATHER identifies common biology in a series of otherwise unrelated gene expression signatures that each predict breast cancer outcome. We quantify the performance of GATHER and find that it successfully predicts 90% of the functions over a broad range of gene groups. We believe that GATHER provides an essential tool for extracting the full value from molecular signatures generated from genome-scale analyses.

GeneCruiser - SUMMARY: GeneCruiser is a web service allowing users to annotate their genomic data by mapping microarray feature identifiers to gene identifiers from databases, such as UniGene, while providing links to web resources, such as the UCSC Genome Browser. It relies on a regularly updated database that retrieves and indexes the mappings between microarray probes and genomic databases. Genes are identified using the Life Sciences Identifier standard.

GeneSigDB - The primary objective of most gene expression studies is the identification of one or more gene signatures; lists of genes whose transcriptional levels are uniquely associated with a specific biological phenotype. Whilst thousands of experimentally derived gene signatures are published, their potential value to the community is limited by their computational inaccessibility. Gene signatures are embedded in published article figures, tables or in supplementary materials, and are frequently presented using non-standard gene or probeset nomenclature. We present GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb) a manually curated database of gene expression signatures. GeneSigDB release 1.0 focuses on cancer and stem cells gene signatures and was constructed from more than 850 publications from which we manually transcribed 575 gene signatures. Most gene signatures (n = 560) were successfully mapped to the genome to extract standardized lists of EnsEMBL gene identifiers. GeneSigDB provides the original gene signature, the standardized gene list and a fully traceable gene mapping history for each gene from the original transcribed data table through to the standardized list of genes. The GeneSigDB web portal is easy to search, allows users to compare their own gene list to those in the database, and download gene signatures in most common gene identifier formats.

Gene Aging Nexus - The recent development of microarray technology provided unprecedented opportunities to understand the genetic basis of aging. So far, many microarray studies have addressed aging-related expression patterns in multiple organisms and under different conditions. The number of relevant studies continues to increase rapidly. However, efficient exploitation of these vast data is frustrated by the lack of an integrated data mining platform or other unifying bioinformatic resource to enable convenient cross-laboratory searches of array signals. To facilitate the integrative analysis of microarray data on aging, we developed a web database and analysis platform 'Gene Aging Nexus' (GAN) that is freely accessible to the research community to query/analyze/visualize cross-platform and cross-species microarray data on aging. By providing the possibility of integrative microarray analysis, GAN should be useful in building the systems-biology understanding of aging.

Gene Expression Atlas - The Gene Expression Atlas (http://www.ebi.ac.uk/gxa) is an added-value database providing information about gene expression in different cell types, organism parts, developmental stages, disease states, sample treatments and other biological/experimental conditions. The content of this database derives from curation, re-annotation and statistical analysis of selected data from the ArrayExpress Archive of Functional Genomics Data. A simple interface allows the user to query for differential gene expression either (i) by gene names or attributes such as Gene Ontology terms, or (ii) by biological conditions, e.g. diseases, organism parts or cell types. The gene queries return the conditions where expression has been reported, while condition queries return which genes are reported to be expressed in these conditions. A combination of both query types is possible. The query results are ranked using various statistical measures and by how many independent studies in the database show the particular gene-condition association. Currently, the database contains information about more than 200 000 genes from nine species and almost 4500 biological conditions studied in over 30 000 assays from over 1000 independent studies.
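
One of the ranking criteria mentioned above, the number of independent studies that report a given gene-condition association, can be illustrated with a short Python sketch. It deliberately ignores the statistical measures the Atlas also uses, and the record layout, function name and identifiers are hypothetical rather than taken from the Atlas itself.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical record: (gene, condition, study identifier).
Record = Tuple[str, str, str]


def rank_associations(records: Iterable[Record]) -> List[Tuple[str, str, int]]:
    """Rank gene-condition pairs by the number of distinct supporting studies."""
    studies = defaultdict(set)
    for gene, condition, study_id in records:
        # A set ensures that repeated assays from one study count only once.
        studies[(gene, condition)].add(study_id)
    return sorted(
        ((gene, cond, len(ids)) for (gene, cond), ids in studies.items()),
        key=lambda row: (-row[2], row[0], row[1]),
    )


# Placeholder identifiers, not real Atlas content.
records = [
    ("GENE_A", "liver", "STUDY_1"),
    ("GENE_A", "liver", "STUDY_2"),
    ("GENE_A", "liver", "STUDY_2"),
    ("GENE_B", "liver", "STUDY_1"),
]
ranked = rank_associations(records)
print(ranked)  # [('GENE_A', 'liver', 2), ('GENE_B', 'liver', 1)]
```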

Gene Expression Omnibus - a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval

GenomeRNAi - The GenomeRNAi database (http://www.genomernai.org/) contains phenotypes from published cell-based RNA interference (RNAi) screens in Drosophila and Homo sapiens. The database connects observed phenotypes with annotations of targeted genes and information about the RNAi reagent used for the perturbation experiment. The availability of phenotypes from Drosophila and human screens also allows for phenotype searches across species. Besides reporting quantitative data from genome-scale screens, the new release of GenomeRNAi also enables reporting of data from microscopy experiments and curated phenotypes from published screens. In addition, the database provides an updated resource of RNAi reagents and their predicted quality that are available for the Drosophila and the human genome. The new version also facilitates the integration with other genomic data sets and contains expression profiling (RNA-Seq) data for several cell lines commonly used in RNAi experiments.

Genomics Portals - A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into functioning of living systems. RESULTS: Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and the tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc), and the integration with an extensive knowledge base that can be used in such analysis. CONCLUSION: The integrated access to primary genomics data, functional knowledge and analytical tools makes Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals backend databases.

GNCPro - The use of computational applications in biological research is significantly lagging behind other scientific research areas such as physics, mathematics, and geology; more in silico tools are needed. The increasing complexity of biological data makes it more and more difficult for scientists to verify their hypotheses and results against existing discoveries. GNCPro is a free data integration and visualization tool for gaining comprehensive overviews of such complicated biological knowledge. In particular, GNCPro warehouses and encodes biological information as binary relationships. When represented graphically, these binary relationships take on the form of edges that connect the genes and proteins, which are represented by nodes. By using distinguishing features such as colors, shape, and opacity, GNCPro provides a stimulating visual experience in which the user can quickly identify groups of genes by annotations and the types of relationships involved. GNCPro integrates human gene expressions, regulations, gene product modifications, and interactions into one platform while delivering a simple and powerful user interface for systems biology study.
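
The representation described above, in which binary relationships become edges between gene and protein nodes, can be sketched as a small typed graph in Python. This is a minimal illustration, not GNCPro's data model: the names, the relationship types and the choice to store each edge in both directions are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# One binary relationship: (entity, relationship type, entity).
Relationship = Tuple[str, str, str]
Graph = Dict[str, Set[Tuple[str, str]]]


def build_graph(relationships: List[Relationship]) -> Graph:
    """Map each node to the set of (relationship type, neighbour) edges touching it."""
    graph: Graph = defaultdict(set)
    for left, kind, right in relationships:
        # Assumption: edges are stored in both directions (undirected view).
        graph[left].add((kind, right))
        graph[right].add((kind, left))
    return dict(graph)


def neighbours_by_type(graph: Graph, node: str, kind: str) -> List[str]:
    """List the neighbours of `node` connected by edges of the given type."""
    return sorted(n for k, n in graph.get(node, set()) if k == kind)


# Placeholder names, not real biological findings.
edges = [
    ("GENE_A", "interaction", "GENE_B"),
    ("GENE_A", "regulation", "GENE_C"),
    ("GENE_B", "interaction", "GENE_C"),
]
graph = build_graph(edges)
print(neighbours_by_type(graph, "GENE_A", "interaction"))  # ['GENE_B']
```

Keeping the relationship type on each edge is what lets a viewer assign distinct colors or shapes per relationship type when drawing the network.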

GNF Expression Atlas - Global gene expression data resource for both mouse and human - multiple tissues and cell lines

GNF SymAtlas - Gene expression atlas containing data from numerous array platforms

GRAIL - Translating a set of disease regions into insight about pathogenic mechanisms requires not only the ability to identify the key disease genes within them, but also the biological relationships among those key genes. Here we describe a statistical method, Gene Relationships Among Implicated Loci (GRAIL), that takes a list of disease regions and automatically assesses the degree of relatedness of implicated genes using 250,000 PubMed abstracts. We first evaluated GRAIL by assessing its ability to identify subsets of highly related genes in common pathways from validated lipid and height SNP associations from recent genome-wide studies. We then tested GRAIL by assessing its ability to separate true disease regions from many false positive disease regions in two separate practical applications in human genetics. First, we took 74 nominally associated Crohn's disease SNPs and applied GRAIL to identify a subset of 13 SNPs with highly related genes. Of these, ten convincingly validated in follow-up genotyping; genotyping results for the remaining three were inconclusive. Next, we applied GRAIL to 165 rare deletion events seen in schizophrenia cases (less than one-third of which are contributing to disease risk). We demonstrate that GRAIL is able to identify a subset of 16 deletions containing highly related genes; many of these genes are expressed in the central nervous system and play a role in neuronal synapses. GRAIL offers a statistically robust approach to identifying functionally related genes, which likely represent key disease pathways, from across multiple disease regions.

HOMGL

HPtaa - Potential Targets for Cancer Diagnosis and Immunotherapy - biomarkers database

Ingenuity Pathway Analysis - a software application that enables biologists and bioinformaticians to identify the biological mechanisms, pathways and functions most relevant to their experimental datasets or genes of interest

L2L Microarray Analysis Tool - A simple tool for discovering the hidden biological significance in microarray expression data

MAVEN - SUMMARY: We describe the features and implementation of a web application tool named MAVEN - for Management, Analysis, Visualization and rEsults shariNg of genome-wide association (GWA) data using cutting edge technologies. Main capabilities include user data uploading and management, queries using a variety of criteria, visualization of results, interactive selections, and seamless integration of users' data with databases at the National Center for Biotechnology Information for functional annotations of SNPs and genes.

Microarray Expression Data Analysis References - review of statistical/bioinformatic analysis of microarray expression data

MIDAW (MIcroarray Data Analysis Web tool) - a web tool for the analysis of microarray data

MouseIndelDB - MouseIndelDB is an integrated database resource containing thousands of previously unreported mouse genomic indel (insertion and deletion) polymorphisms ranging from approximately 100 nt to 10 Kb in size. The database currently includes polymorphisms identified from our alignment of 26 million whole-genome shotgun sequence traces from four laboratory mouse strains mapped against the reference C57BL/6J genome using GMAP. They can be queried on a local level by chromosomal coordinates, nearby gene names or other genomic feature identifiers, or in bulk format using categories including mouse strain(s), class of polymorphism(s) and chromosome number. The results of such queries are presented either as a custom track on the UCSC mouse genome browser or in tabular format. We anticipate that the MouseIndelDB database will be widely useful for research in mammalian genetics, genomics, and evolutionary biology.

MPSS - We have used massively parallel signature sequencing (MPSS) to sample the transcriptomes of 32 normal human tissues to an unprecedented depth, thus documenting the patterns of expression of almost 20,000 genes with high sensitivity and specificity. The data confirm the widely held belief that differences in gene expression between cell and tissue types are largely determined by transcripts derived from a limited number of tissue-specific genes, rather than by combinations of more promiscuously expressed genes. Expression of a little more than half of all known human genes seems to account for both the common requirements and the specific functions of the tissues sampled. A classification of tissues based on patterns of gene expression largely reproduces classifications based on anatomical and biochemical properties. The unbiased sampling of the human transcriptome achieved by MPSS supports the idea that most human genes have been mapped, if not functionally characterized. This data set should prove useful for the identification of tissue-specific genes, for the study of global changes induced by pathological conditions, and for the definition of a minimal set of genes necessary for basic cell maintenance.

M.P. RNA-seq Database - The functional complexity of the human transcriptome is not yet fully elucidated. We report a high-throughput sequence of the human transcriptome from a human embryonic kidney and a B cell line. We used shotgun sequencing of transcripts to generate randomly distributed reads. Of these, 50% mapped to unique genomic locations, of which 80% corresponded to known exons. We found that 66% of the polyadenylated transcriptome mapped to known genes and 34% to nonannotated genomic regions. On the basis of known transcripts, RNA-Seq can detect 25% more genes than can microarrays. A global survey of messenger RNA splicing events identified 94,241 splice junctions (4096 of which were previously unidentified) and showed that exon skipping is the most prevalent form of alternative splicing.

NIA Array Analysis - False discovery rate (FDR), ANOVA with error variance correction, 3D PCA/SVD-biplot, PCA import for experiment comparison, Pattern matching, Optional permutation test, Server-based software

OmicBrowse - OmicBrowse is a genome browser designed as a scalable system for maintaining numerous genome annotation datasets. It is an open source tool capable of regulating multiple user data access to each dataset to allow multiple users to have their own integrative view of both their unpublished and published datasets, so that the maintenance costs related to supplying each collaborator exclusively with their own private data are significantly reduced. OmicBrowse supports DAS1 imports and exports of annotations to Internet site servers worldwide. We also provide a data-download server named OmicDownload that interactively selects datasets and filters the data on the selected datasets.

Oncomine - DNA microarrays have been widely applied to cancer transcriptome analysis; however, the majority of such data are not easily accessible or comparable. Furthermore, several important analytic approaches have been applied to microarray analysis; however, their application is often limited. To overcome these limitations, we have developed Oncomine, a bioinformatics initiative aimed at collecting, standardizing, analyzing, and delivering cancer transcriptome data to the biomedical research community. Our analysis has identified the genes, pathways, and networks deregulated across 18,000 cancer gene expression microarrays, spanning the majority of cancer types and subtypes. Here, we provide an update on the initiative, describe the database and analysis modules, and highlight several notable observations.

Pathway Studio - pathway analysis software that helps you interpret your experimental results in the context of pathways, gene regulation networks and protein interaction maps

PRIDE - The Proteomics Identifications database (PRIDE, http://www.ebi.ac.uk/pride) at the European Bioinformatics Institute has become one of the main repositories of mass spectrometry-derived proteomics data. For the last 2 years, PRIDE data holdings have grown substantially, comprising 60 different species, more than 2.5 million protein identifications, 11.5 million peptides and over 50 million spectra by September 2009. We here describe several new and improved features in PRIDE, including the revised submission process, which now includes direct submission of fragment ion annotations. Correspondingly, it is now possible to visualize spectrum fragmentation annotations on tandem mass spectra, a key feature for compliance with journal data submission requirements. We also describe recent developments in the PRIDE BioMart interface, which now allows integrative queries that can join PRIDE data to a growing number of biological resources such as Reactome, Ensembl, InterPro and UniProt. This ability to perform extremely powerful across-domain queries will certainly be a cornerstone of future bioinformatics analyses. Finally, we highlight the importance of data sharing in the proteomics field, and the corresponding integration of PRIDE with other databases in the ProteomExchange consortium.

PubGene - We have carried out automated extraction of explicit and implicit biomedical knowledge from publicly available gene and text databases to create a gene-to-gene co-citation network for 13,712 named human genes by automated analysis of titles and abstracts in over 10 million MEDLINE records. The associations between genes have been annotated by linking genes to terms from the medical subject heading (MeSH) index and terms from the gene ontology (GO) database. The extracted database and accompanying web tools for gene-expression analysis have collectively been named 'PubGene'. We validated the extracted networks by three large-scale experiments showing that co-occurrence reflects biologically meaningful relationships, thus providing an approach to extract and structure known biology. We validated the applicability of the tools by analyzing two publicly available microarray data sets.
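
The co-citation idea in the PubGene entry can be illustrated with a minimal Python sketch. It is not PubGene's actual pipeline: the function names, the whole-token matching rule, the gene symbol list and the three example records are all assumptions made for illustration. It only shows how counting gene pairs that co-occur in the same title or abstract yields weighted edges of a gene-to-gene network.

```python
from collections import Counter
from itertools import combinations
import re


def genes_in_text(text, gene_symbols):
    """Return the known gene symbols that appear as whole tokens in the text."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", text))
    return tokens & gene_symbols


def cocitation_counts(records, gene_symbols):
    """Count, for each unordered gene pair, the records that mention both genes."""
    counts = Counter()
    for text in records:
        found = sorted(genes_in_text(text, gene_symbols))
        counts.update(combinations(found, 2))
    return counts


# Hypothetical inputs: a tiny symbol list and made-up record texts.
gene_symbols = {"TP53", "MDM2", "BRCA1"}
records = [
    "TP53 is regulated by MDM2 in stressed cells.",
    "MDM2 binds TP53.",
    "BRCA1 and TP53 in DNA repair.",
]

counts = cocitation_counts(records, gene_symbols)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Each pair with a non-zero count corresponds to an edge in the network, and the count can serve as an edge weight. Real gene names are ambiguous in free text, so exact token matching as used here is a simplification.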

Prophet - a web-based application that uses gene expression data to build prediction rules and subsequent sample classification

Reactome - Reactome (http://www.reactome.org) is an expert-authored, peer-reviewed knowledgebase of human reactions and pathways that functions as a data mining resource and electronic textbook. Its current release includes 2975 human proteins, 2907 reactions and 4455 literature citations. A new entity-level pathway viewer and improved search and data mining tools facilitate searching and visualizing pathway data and the analysis of user-supplied high-throughput data sets. Reactome has increased its utility to the model organism communities with improved orthology prediction methods allowing pathway inference for 22 species and through collaborations to create manually curated Reactome pathway datasets for species including Arabidopsis, Oryza sativa (rice), Drosophila and Gallus gallus (chicken). Reactome's data content and software can all be freely used and redistributed under open source terms.

SAGE genie - Serial Analysis of Gene Expression Database

SCAN - MOTIVATION: Genome-wide association studies (GWAS) generate relationships between hundreds of thousands of single nucleotide polymorphisms (SNPs) and complex phenotypes. The contribution of the traditionally overlooked copy number variations (CNVs) to complex traits is also being actively studied. To facilitate the interpretation of the data and the designing of follow-up experimental validations, we have developed a database that enables the sensible prioritization of these variants by combining several approaches, involving not only publicly available physical and functional annotations, but also multilocus linkage disequilibrium (LD) annotations as well as annotations of expression quantitative trait loci (eQTLs). RESULTS: For each SNP, the SCAN database provides: (1) Summary information from eQTL mapping of HapMap SNPs to gene expression (evaluated by the Affymetrix exon array) in the full set of HapMap CEU (Caucasians from Utah, USA) and YRI (Yoruba people from Ibadan, Nigeria) samples; (2) LD information, in the case of a HapMap SNP, including what genes have variation in strong LD (pairwise or multilocus LD) with the variant and how well the SNP is covered by different high-throughput platforms; (3) Summary information available from public databases (e.g., physical and functional annotations); and (4) Summary information from other GWAS. For each gene, SCAN provides annotations on: (1) eQTLs for the gene (both local and distant SNPs); and (2) The coverage of all variants in the HapMap at that gene on each high-throughput platform. For each genomic region, SCAN provides annotations on: (1) Physical and functional annotations of all SNPs, genes, and known CNVs within the region; and (2) All genes regulated by the eQTLs within the region.

SemiBiosphere - A Semantic Web Approach to Recommending Microarray Clustering Services

Software for Genomic Data Analysis - List of genomic data analysis software

SNPster

Stanford Microarray Database - a resource for the entire biological research community that provides unrestricted access to microarray data published by SMD users

Starnet - BACKGROUND: Although expression microarrays have become a standard tool used by biologists, analysis of data produced by microarray experiments may still present challenges. Comparison of data from different platforms, organisms, and labs may involve complicated data processing, and inferring relationships between genes remains difficult. RESULTS: STARNET 2 is a new web-based tool that allows post hoc visual analysis of correlations that are derived from expression microarray data. STARNET 2 facilitates user discovery of putative gene regulatory networks in a variety of species (human, rat, mouse, chicken, zebrafish, Drosophila, C. elegans, S. cerevisiae, Arabidopsis and rice) by graphing networks of genes that are closely co-expressed across a large heterogeneous set of preselected microarray experiments. For each of the represented organisms, raw microarray data were retrieved from NCBI's Gene Expression Omnibus for a selected Affymetrix platform. All pairwise Pearson correlation coefficients were computed for expression profiles measured on each platform, respectively. These precompiled results were stored in a MySQL database, and supplemented by additional data retrieved from NCBI. A web-based tool allows user-specified queries of the database, centered at a gene of interest. The result of a query includes graphs of correlation networks, graphs of known interactions involving genes and gene products that are present in the correlation networks, and initial statistical analyses. Two analyses may be performed in parallel to compare networks, which is facilitated by the new HEATSEEKER module. CONCLUSION: STARNET 2 is a useful tool for developing new hypotheses about regulatory relationships between genes and gene products, and has coverage for 10 species. 
Interpretation of the correlation networks is supported with a database of previously documented interactions, a test for enrichment of Gene Ontology terms, and heat maps of correlation distances that may be used to compare two networks. The list of genes in a STARNET network may be useful in developing a list of candidate genes to use for the inference of causal networks. The tool is freely available at http://vanburenlab.medicine.tamhsc.edu/starnet2.html, and does not require user registration.
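
The pairwise Pearson correlation step described in the Starnet entry can be sketched in a few lines of Python. This is an illustration only, not STARNET's implementation: the function names, the 0.9 threshold and the expression values are hypothetical, and the real tool precomputes correlations over large Affymetrix data sets stored in MySQL.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / norm


def pairwise_correlations(profiles):
    """Map each unordered gene pair to the correlation of its two profiles."""
    return {
        (g1, g2): pearson(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }


def coexpressed_with(gene, correlations, threshold=0.9):
    """Genes whose correlation with `gene` meets the (hypothetical) threshold."""
    partners = []
    for (g1, g2), r in correlations.items():
        if gene in (g1, g2) and r >= threshold:
            partners.append(g2 if g1 == gene else g1)
    return sorted(partners)


# Made-up expression values across four hypothetical experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(profiles)
print(coexpressed_with("geneA", correlations))
```

A query centered on a gene of interest then amounts to filtering the precomputed pairs, which is why storing all pairwise coefficients up front makes interactive lookups cheap.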

String - Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein-protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein-protein interactions currently available.

TAPPA - Extracting biological insight from microarray data is important but challenging. Here we describe TAPPA, a java-based tool, for identification of phenotype-associated genetic pathways utilizing the pathway topological measures. This is achieved by first calculating a Pathway Connectivity Index (PCI) for each pathway, followed by evaluating its correlation to the phenotypic variation. Our PCI definition not only efficiently captures the contributions from genes that show subtle but consistent changes in expression, but also naturally overweighs the hub genes that interact with a large number of other genes in the pathway. TAPPA also allows evaluation of sub-modules within a pathway and their association to phenotypes.

tm4 - Powerful specialized software is essential for managing, quantifying, and ultimately deriving scientific insight from results of a microarray experiment. We have developed a suite of software applications, known as TM4, to support such gene expression studies. The suite consists of open-source tools for data management and reporting, image analysis, normalization and pipeline control, and data mining and visualization. An integrated MIAME-compliant MySQL database is included. This chapter describes each component of the suite and includes a sample analysis walk-through.

TiSGeD - Tissue-specific genes are a group of genes whose function and expression are preferred in one or several tissues/cell types. Identification of these genes aids understanding of tissue-gene relationships and etiology, and the discovery of novel tissue-specific drug targets. In this study, a statistical method is introduced to detect tissue-specific genes from more than 123,125 gene expression profiles over 107 human tissues, 67 mouse tissues and 30 rat tissues. As a result, a novel subject-specialized repository, namely the Tissue-Specific Genes Database (TiSGeD), is developed to represent the analyzed results. Auxiliary information on tissue-specific genes was also collected from the biomedical literature.

ToxExpress (subscription) - a flexible, enabling program that brings the power of toxicity-based gene expression to lead optimization and drug safety studies from GeneLogic

Y.F. Leung's Functional Genomics - tons of links to functional genomics sites

Bioinformatics.net: Human Genetics

CASCAD - Rat strain polymorphism database

CDC’s National Office of Public Health Genomics (NOPHG)

Database of Genomic Variants - A curated catalogue of structural variation in the human genome

dcode.org - Comparative Genomics Center is a publicly available resource for regulatory genome data mining. It provides tools for evolutionary comparisons, sequence alignments, and detection of functional sequence patterns

ECR Browser - a dynamic whole-genome navigation tool for visualizing and studying evolutionary relationships between vertebrate and non-vertebrate genomes

Eichler Lab Large Scale Structural Variation Databases

FastSNP - a web server that allows users to efficiently identify and prioritize high-risk SNPs according to their phenotypic risks and putative functional effects

GenePaint - a digital atlas of gene expression patterns in the mouse

Genetic Analysis Software - Software for genetic linkage analysis for human pedigree data, QTL analysis for animal/plant breeding data, genetic marker ordering, genetic association analysis, haplotype construction, pedigree drawing, and population genetics

GeneSeeker - human genetics association candidate gene identifier

GeneSNPS - This Environmental Genome Project web resource integrates gene, sequence and polymorphism data into individually annotated gene models.

GeneTests - a publicly funded medical genetics information resource

Genetic Association Database - an archive of human genetic association studies of complex diseases and disorders

Genomics at FDA - FDA launch point for the agency's view of genomics in drug development

HAPLOT - A simple program for graphical presentation of haplotype block structures, tagSNP selection and SNP variation

Human Gene Mutation Database - an attempt to collate known (published) gene lesions responsible for human inherited disease

iHAP - Integrated haplotype analysis pipeline for characterizing the haplotype structure of genes

Inherited Arrhythmias Database - database of genetic variants that lead to cardiac arrhythmia

LS-SNP - an annotated database of SNPs. Currently only coding non-synonymous SNPs found in human genes are included

MutDB - annotate human variation data with protein structural information and other functionally relevant information, if available

NHLBI PGA - NHLBI Programs for Genomic Applications Integrated Resource Portal

PAAR - Pharmacogenetics of Anticancer Agents Research Group

PantherDB - Score proteins against the PANTHER HMM library, use PANTHER to do gene expression analyses, and download PANTHER tools and data

PGbase - A unique drug-centered database specializing in pharmacogenomics and personalized medicine

PharmGKB - curates information that establishes knowledge about the relationships among drugs, diseases and genes, including their variations and gene products

PicSNP - a catalog of non-synonymous SNP (Single Nucleotide Polymorphism) in the human genome

PolyDoms - a whole genome database for the identification of non-synonymous coding SNPs with the potential to impact disease

PolyPhen - a tool which predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations

PLINK - a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner

PupasView - a web tool for finding SNPs with putative effect at transcriptional level

SIFT - predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids

SNP@Domain - A web resource of Single Nucleotide Polymorphisms (SNPs) within protein domain structures and sequences

SNPs3D - a website which assigns molecular functional effects of non-synonymous SNPs based on structure and sequence analysis

SNPeffect - analyses the effect of coding, non-synonymous SNPs on 3 categories of functional and physico-chemical properties of the affected proteins

SNPHunter

SNPselector - Linkage disequilibrium and functional SNP selection program for designing human genetic association studies

SNPstats - a simple, ready-to-use software which has been designed to analyze genetic-epidemiology studies of association using SNPs

SNP Function Portal - analyze potential SNP function and LD with other SNPs

SNP-VISTA - graphical interface and use of visual representations, which support interactive exploration and hence better understanding of large-scale SNP data by the user

SUSPECTS - a simple and effective way to identify genes involved in Mendelian and oligogenic disorders

TAMAL - to help the user select single nucleotide polymorphisms (SNPs) in a specified set of candidate genes for genotyping

topoSNP - This site allows for the visualization of disease and non-disease associated non-synonymous single nucleotide polymorphisms (nsSNPs) and displays geometric and relative entropy calculations

Variome - an open, free web resource for worldwide SNP (Single Nucleotide Polymorphism) researchers

VISTA - a comprehensive suite of programs and databases for comparative analysis of genomic sequences

Ensembl - a joint project between EMBL-EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes. This site provides free access to all the data and software from the Ensembl project

GeneCards - an integrated database of human genes that includes automatically-mined genomic, proteomic and transcriptomic information, as well as orthologies, disease relationships, SNPs, gene expression, gene function, and service links for ordering assays and antibodies

Genomes Online Database - a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects around the world

GenomeWeb - Lists of genome sites

PEDANT - exhaustive annotation of 468 genomes by a broad set of bioinformatics algorithms

UCSC Genome Browser - site contains the reference sequence and working draft assemblies for a large collection of genomes

WikiGene - a scientific project that follows a community-based approach to collect data about genes and gene regulatory events
