Project 1:
Metagenomics is the study of genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, eco-genomics or community genomics. Binning plays an important role in metagenomics studies by characterizing the different microbes present in environmental samples. Binning techniques represent a "best-effort" attempt to associate reads or contigs with groups of organisms designated as operational taxonomic units (OTUs).
A popular pipeline followed in metagenomics analysis is to first assemble reads into contigs and then bin these contigs to identify the different taxonomic groups. Several metagenomics contig-binning tools have been developed that make use of different features such as nucleotide composition, sequencing coverage, paired-end reads and assembly graphs, along with different computational techniques such as Expectation Maximisation algorithms, normalised cuts, clustering and probabilistic approaches.
Recent work has shown how to use convex hulls in clustering high-dimensional data. Since a large number of features can be obtained for metagenomic data, the key objective of the project is to investigate how convex hull-based clustering techniques can be applied to improve metagenomic binning.
Project 2:
Metagenomics is the study of genetic material extracted directly from the environment, which mainly comes from microorganisms. Binning the sequences found in an environmental sample into groups helps biologists expand their understanding of the microorganisms present. This project aims to develop a taxonomy-independent binning tool that classifies metagenomic sequences from both previously identified and unknown microorganisms with high accuracy. Although tools for this purpose already exist, they lack the required precision. The project has two phases: developing a classifier that predicts whether each read originates from a chromosome or a plasmid, and developing a clustering tool that groups sequences into microbial bins. The k-mer counts of the sequences, together with other biological features that can be extracted from the microbial genome, are used as features for the model. The results will be evaluated against standard microbial databases. The output will be a tool that can be widely used for metagenomic binning.
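As an illustration of the k-mer count feature mentioned above, the following is a minimal sketch of turning a sequence into a normalized k-mer frequency vector. The function name `kmer_frequencies`, the default `k=3`, and the choice to skip k-mers containing ambiguous bases are assumptions made for this example, not details taken from the project.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Hypothetical helper for illustration only. The vector has 4**k entries,
    ordered lexicographically (AAA, AAC, AAG, ...).
    """
    sequence = sequence.upper()
    # All possible k-mers over A, C, G, T in a fixed order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


features = kmer_frequencies("ACGTACGTTTGA", k=3)
print(len(features))  # 64 entries for k = 3
```

Normalizing by the total count makes sequences of different lengths comparable. This sketch does not merge a k-mer with its reverse complement, which binning tools often do.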
Project 3: Binning Metagenomic Sequences Using Affine or Convex Hulls
Metagenomics has enabled culture-independent analysis of micro-organisms present in environmental samples. Metagenomic binning, which involves the grouping of contigs into bins that represent different taxonomic groups, is an important step of a typical metagenomic workflow and is performed after assembly. The majority of metagenomic binning tools represent the composition and coverage information of contigs as feature vectors with a large number of dimensions. However, these tools use traditional Euclidean or Manhattan distance metrics, which become unreliable in high-dimensional space. In this report, we propose CH-Bin, a binning approach that leverages the benefits of using affine/convex hull distance for binning contigs represented by high-dimensional feature vectors. We demonstrate, using experimental evidence on simulated and real datasets, that the use of high-dimensional feature vectors to represent contigs can preserve additional information and yield improved binning results. We further demonstrate that the affine/convex hull distance-based binning approach can be effectively utilized in binning such high-dimensional data. To the best of our knowledge, this is the first time that composition information from oligonucleotides of multiple sizes has been used to represent the composition of contigs, and the first time that an affine/convex hull distance-based binning algorithm has been used to bin metagenomic contigs. The source code of CH-Bin is available at
https://github.com/kdsuneraavinash/CH-Bin
CH-Bin: A Convex Hull Based Approach for Binning Metagenomic Contigs
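To make the notion of a hull distance concrete, the sketch below computes the distance from a feature vector to the affine hull of the vectors already assigned to a bin, using a least-squares projection. It is a generic illustration under stated assumptions, not the CH-Bin implementation: the function name `affine_hull_distance` is hypothetical, and only the affine case is shown, since the convex hull distance additionally constrains the combination weights to be non-negative and needs a constrained solver.

```python
import numpy as np


def affine_hull_distance(x, points):
    """Distance from vector x to the affine hull of the rows of `points`.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    points = np.asarray(points, dtype=float)
    # Use the first point as the origin of the affine subspace.
    origin = points[0]
    # Columns span the directions of the affine hull, shape (d, n - 1).
    directions = (points[1:] - origin).T
    target = x - origin
    if directions.shape[1] == 0:
        # A single point: the hull is that point itself.
        return float(np.linalg.norm(target))
    # Least squares finds the closest point of the subspace to the target.
    coeffs, *_ = np.linalg.lstsq(directions, target, rcond=None)
    residual = target - directions @ coeffs
    return float(np.linalg.norm(residual))


bin_points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(affine_hull_distance([5.0, 5.0, 2.0], bin_points))
```

In this example the affine hull of the three points is the plane z = 0, so the distance depends only on the third coordinate of the query vector. A contig could then be assigned to the bin whose hull is nearest.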
Project 4: DNA classification using DL (Metagenomic binning)
Chromosomes and plasmids are the major carriers of genetic material in microorganisms such as bacteria. Separating chromosomal and plasmid deoxyribonucleic acid (DNA) in large datasets is important because plasmids and chromosomes affect functions and other environmental adaptations differently. With the advancements in sequencing technologies, bioinformatics methodologies have been developed for plasmid classification. Two approaches have been very popular: using normalized short k-mer counts as features for machine learning models, and using biomarkers from DNA sequences as features. However, both approaches have their strengths and weaknesses. MetaPCbin is a plasmid detection tool that combines computational and genetic approaches into a hybrid method of plasmid prediction. MetaPCbin uses an artificial neural network with k-mer counts as features and a random forest model with biomarkers as features. Evaluations of MetaPCbin on real-world and simulated DNA sequences show that it is capable of performing plasmid classification with greater accuracy than the state of the art. We also introduce our second tool, MetaGraph, a Graph Neural Network (GNN) based tool for enhancing plasmid/chromosome classification. It takes the high-confidence predictions of existing plasmid/chromosome prediction tools and improves the accuracy of the low-confidence predictions, using plasmid probabilities as features for the GNN. We evaluated MetaGraph on a set of real and simulated DNA sequences. The results showed a significant improvement over the state-of-the-art tools that were used for the initial predictions. The source code for MetaPCbin and MetaGraph is freely available at: https://github.com/MetaGSC/MetaPCbin and https://github.com/MetaGSC/MetaGraph