Genomic Data Extraction and Visualization

Genomics today is an almost entirely computational discipline. For decades, gains in processing power kept pace with the exponentially growing quantity of data. Now that those gains are slowing, better optimization and the use of newer technology are becoming increasingly important. Techniques already employed in other areas of computer science should be adopted in bioinformatics as well.

Techniques such as machine learning and deep learning have opened new avenues of analysis in fields like bioinformatics. Given the rate at which data is generated, bio-scientists need intuitive and fast analysis tools that help them visualize and explore the data being produced. Although bio-scientists can use many automated processes to analyze the data, some of these still require human intervention and judgement. Even where a process is fully automated, the scientist must inspect the output to validate the results and present them to others. Visualizing genomic data is one such case where human judgement is involved.

We intend to investigate the potential of deep learning-based non-linear dimensionality reduction techniques to produce better clustering in interactive genomic data visualization applications.

Metagenomics is the study of the genomic content of microbial organisms extracted from a sample taken in their natural habitat. These unknown collections of genomic data are analyzed without any prior lab-based cultivation. One of the vital aspects of metagenomic analysis is the visualization of the information derived from the genomic sequences of a microbiome sample. In a successful visualization, related reads should appear in clusters that reflect the diversity and taxonomy of the microorganisms in the sequenced sample. When higher-dimensional sequence data is converted into lower-dimensional data for visualization, preserving the genomic characteristics is given the highest priority. This makes precise and efficient dimensionality reduction methods essential. Currently, PCA (a linear technique) and t-SNE (a non-linear technique) are used for dimensionality reduction in metagenomics.
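As a rough illustration of the linear case, the sketch below projects reads into two dimensions with PCA. It makes several assumptions that are not stated on this page: each read is represented by its normalized k-mer frequency vector (a common choice of feature, used here only for illustration), the reads are synthetic, and the helper names `kmer_profile`, `pca_project` and `random_read` are hypothetical. It is not the pipeline used in this project.

```python
from itertools import product

import numpy as np


def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector of one DNA sequence (assumed feature)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with N or other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # PCA requires centered features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


# Synthetic stand-in for two organisms: GC-rich reads and AT-rich reads.
rng = np.random.default_rng(0)


def random_read(length, probs):
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))


reads = [random_read(500, [0.1, 0.4, 0.4, 0.1]) for _ in range(20)]
reads += [random_read(500, [0.4, 0.1, 0.1, 0.4]) for _ in range(20)]

X = np.array([kmer_profile(r) for r in reads])  # shape (40, 256)
coords = pca_project(X)                         # shape (40, 2), ready to scatter-plot
```

On this toy input the two groups fall on opposite sides of the first principal component. Real metagenomic samples are far less cleanly separable, which is where the limits of a linear projection become visible.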

Although these techniques are widely used, they have shortcomings in the accuracy and efficiency of the resulting visualizations. In this project, we explore the possibility of using autoencoders, a deep learning technique, to obtain a richer dimensionality reduction that overcomes the limitations of PCA and t-SNE and outperforms them in metagenomic visualization. Furthermore, we present MetaG, a tool that combines these established dimensionality reduction techniques with the novel autoencoder-based technique. The tool also uses the taxonomic information from the samples to give users unique insight into them.
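To make the autoencoder idea concrete, here is a minimal sketch of one trained with plain gradient descent in NumPy. Everything in it is an assumption made for illustration: a single tanh encoder layer with a two-unit bottleneck, a linear decoder, mean squared reconstruction error, synthetic input data, and the hypothetical function name `train_autoencoder`. It does not describe the architecture or training setup used in MetaG.

```python
import numpy as np


def train_autoencoder(X, n_latent=2, epochs=1000, lr=0.1, seed=0):
    """Fit a one-layer tanh encoder and linear decoder by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_features, n_latent))
    b1 = np.zeros(n_latent)
    W2 = rng.normal(scale=0.1, size=(n_latent, n_features))
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W1 + b1)           # encoder: non-linear bottleneck
        X_hat = Z @ W2 + b2                # decoder: linear reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        G = 2.0 * err / err.size           # gradient of the MSE w.r.t. X_hat
        dA = (G @ W2.T) * (1.0 - Z ** 2)   # backpropagate through tanh
        W2 -= lr * (Z.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dA)
        b1 -= lr * dA.sum(axis=0)

    def encode(A):
        return np.tanh(A @ W1 + b1)

    return encode, losses


# Synthetic stand-in for per-read feature vectors: two clusters in 8 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(50, 8)),
    rng.normal(1.0, 1.0, size=(50, 8)),
])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature

encode, losses = train_autoencoder(X)
latent = encode(X)  # shape (100, 2): coordinates for a 2-D scatter plot
```

The two-dimensional bottleneck plays the same role as the two PCA components or the t-SNE embedding: it supplies the coordinates that are plotted. A practical model would likely use more layers and a deep learning framework, which this sketch deliberately avoids.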
