Codes

Research Topics:

Benchmarking protein language models for protein crystallization

The standalone source code and models are available here. Our key contributions are:

Benchmark several ESM2, Ankh, ProtT5, XtrimoPGLM and SaProt models for protein crystallization prediction from raw protein sequences on external balanced, SwissProt and TrEMBL test sets;
Benchmark per-residue feature representations from the three top-performing protein language models (PLMs) as input to CNN and LSTM models for protein crystallization prediction on the same external balanced, SwissProt and TrEMBL test sets;
Fine-tune a protein generator, ProtGPT2, to generate de novo protein sequences from the crystallizable class;
Evaluate, screen and validate the generated proteins to identify a unique set of stable and well-folded proteins.

Network-based Identification of Key Master Regulators associated with an Immune-Silent Cancer Phenotype

The standalone source code and models are available here. Our key contributions are:

Identify master regulators (MRs) of hot and cold tumors in 12 cancer types using multiple master regulator analysis pipelines.
Determine MRs specific to hot tumors across the 12 cancers of interest (and vice versa for cold tumors).
Validate these MRs pan-cancer in 20 cancers from the TCGA and PRECOG datasets.
Show that master regulators such as L3MBTL1, SALL2, BTRC, PRKCZ, KAT2A and SMARCC2 are positively active in the immune-silent cancer phenotype in pan-cancer settings.
Detect, through downstream pathway analysis, NOTCH1, TGF-β, Interleukin-1 and TNF-α signaling pathways that are coherently associated with the absence of a protective immune response and may represent targets for cancer immunologic conversion.

DeepRepurpose: A Modelling Framework for Embedding-based Predictions for Compound-Viral Protein Activity

The standalone source code and models are available here. Our key contributions are:

Collect compound-viral protein activity data from resources such as PubChem and ChEMBL, yielding >60k interactions between >50k compounds and ≈100 viral organisms.
Propose four end-to-end deep learning techniques to predict compound-viral protein activity from the SMILES strings of compounds and the primary structure of viral proteins.
Show that the consensus framework outperforms the individual modeling techniques on the test set.
Identify a ranked list of 47 compounds and validate them using molecular docking simulations.
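
As an illustration of the consensus idea, the sketch below averages the activity predicted by several models for each compound-viral protein pair. Simple averaging, the function name consensus_prediction and the model names in the usage example are assumptions made for illustration; the consensus rule actually used in DeepRepurpose is not described on this page.

```python
from statistics import mean


def consensus_prediction(model_predictions):
    """Average per-pair activity predictions across models.

    model_predictions maps a (hypothetical) model name to a list of
    predicted activities, one per compound-protein pair, in the same order.
    Plain averaging is an assumption, not the documented DeepRepurpose rule.
    """
    columns = list(model_predictions.values())
    if not columns:
        raise ValueError("need predictions from at least one model")
    n_pairs = len(columns[0])
    if any(len(col) != n_pairs for col in columns):
        raise ValueError("all models must score the same pairs")
    # zip(*columns) groups the predictions of every model for one pair.
    return [mean(values) for values in zip(*columns)]


if __name__ == "__main__":
    scores = {
        "model_a": [5.0, 7.0],  # hypothetical predicted activities
        "model_b": [6.0, 8.0],
    }
    print(consensus_prediction(scores))  # [5.5, 7.5]
```

Pairs can then be ranked by the averaged score to produce a candidate list for follow-up validation.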

DeepSol: a deep learning framework for sequence-based protein solubility prediction

The code related to this research is available here. Our key contributions are:

A deep learning model that can directly be applied on protein sequences without extensive feature engineering.
Outperforms state-of-the-art methods such as PaRSnIP, PROSO II and Solpro.
Allows incorporation of additional bio-physical and bio-chemical features to improve the model performance.

RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete gliomas

https://sites.google.com/site/raghvendramallmlresearcher/codes/GliomaMRs_v2.png

The code related to this research is available here. Our key contributions are:

Sparsify the gene regulatory network (GRN) inferred from tree-based ML techniques (GBM/RF) by applying a Tikhonov regularization optimal L-curve criterion to the edge-weight distribution given by the Relative Variable Importance (RVI) scores of a target gene, in order to determine the optimal set of TFs associated with it.
Propose a simple heuristic based on the maximum relative variable importance score over all genes to detect nodes with zero in-degree, i.e. upstream regulators.
Incorporate prior knowledge in the form of a mechanistic active binding network (cis-regulatory motifs).
Show that RGBM outperforms state-of-the-art methods such as ARACNE, GENIE and ENNET by 10-15% in area under the precision-recall curve and area under the receiver operating characteristic curve on various DREAM Challenge datasets.
Show through synthetic RNA-Seq experiments that random-forest-based methods are inferior to gradient boosting machines for inferring GRNs in which very few TFs (hubs) regulate the majority of the genes.
Identify the main regulators of different molecular subtypes of brain tumor, i.e. the master regulators driving the transformation from the G-CIMP-high into the G-CIMP-low subtype and from the PA-like into the LGm6-GBM subtype of glioma.
Identify and validate the main regulators of the mechanism of action of the FGFR3-TACC3 fusion in glioblastomas.

The supplementary material related to this research is available here.

Differential Community Detection in Paired Biological Networks

https://sites.google.com/site/raghvendramallmlresearcher/codes/TCGA_Full_Subgraph_Image.jpg?attredirects=0

The code related to this research is available here.

The supplementary material related to this research is available here.

Differential Sub-network Analysis of paired biological networks:

https://sites.google.com/site/raghvendramallmlresearcher/codes/combined_real.jpg?attredirects=0

The code related to this research is provided here.

Our contribution includes:

Propose an improvement over the state-of-the-art based on Generalized Hamming Distance to identify statistically significant differences (sub-networks) between two labeled topological graphs.
The proposed Closed-Form procedure combines an effective model selection criterion with asymptotic solutions for the p-values used to assess statistical significance.
The figure shows the result of our proposed Closed-Form approach on the regulatory networks of IDH-mutant versus IDH-wildtype glioma.
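
To make the distance concrete, the sketch below computes a Generalized Hamming Distance between two graphs defined on the same labeled nodes. It assumes one common formulation (the mean squared difference of mean-centered edge weights over all ordered node pairs) and applies it directly to raw adjacency matrices; the function name is hypothetical, and the model selection and p-value steps of the Closed-Form procedure are not reproduced here.

```python
def generalized_hamming_distance(adj_a, adj_b):
    """Distance between two graphs given as square adjacency matrices.

    Both matrices must index the same labeled nodes in the same order.
    Assumed formulation: mean over ordered pairs i != j of
    ((a_ij - mean_a) - (b_ij - mean_b)) ** 2.
    """
    n = len(adj_a)
    if n < 2 or len(adj_b) != n:
        raise ValueError("graphs must share the same set of at least 2 nodes")
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mean_a = sum(adj_a[i][j] for i, j in pairs) / len(pairs)
    mean_b = sum(adj_b[i][j] for i, j in pairs) / len(pairs)
    total = sum(
        ((adj_a[i][j] - mean_a) - (adj_b[i][j] - mean_b)) ** 2
        for i, j in pairs
    )
    return total / len(pairs)


if __name__ == "__main__":
    # Two toy 3-node graphs: edge 0-1 in the first, edge 1-2 in the second.
    graph_a = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
    graph_b = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
    print(generalized_hamming_distance(graph_a, graph_a))  # identical graphs
    print(generalized_hamming_distance(graph_a, graph_b))  # differing graphs
```

A distance of zero means the two graphs have the same centered edge weights; assessing whether a non-zero distance is statistically significant is the part addressed by the proposed procedure.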

This work has been developed by Raghvendra Mall under the guidance of Prof. Michele Ceccarelli and the source code is available here.

Kernel Methods for Sparse Classification:

Very Sparse Least Squares Support Vector Machines (LSSVM):

This work is attached to this research.

Our contribution includes:

Very sparse version of primal and dual LSSVMs meant for classification and regression problems.
Applicable to large-scale datasets in the primal using a sparse version of fixed-size LSSVM.
Uses re-weighted L1-norm penalty (convex relaxation of L0-norm).
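
The effect of a re-weighted L1-norm penalty can be seen in a toy setting. The sketch below is not the LSSVM formulation: it assumes the simplest possible problem (denoising a vector, where each weighted L1 subproblem is solved exactly by soft-thresholding) and uses hypothetical names and parameter values. After each solve, the weights are set to 1 / (|x_i| + eps), so small coefficients are penalized more heavily and large ones less, which approximates an L0-norm penalty.

```python
def soft_threshold(value, threshold):
    """Closed-form minimizer of 0.5 * (x - value)**2 + threshold * |x|."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def reweighted_l1_denoise(y, lam=0.5, eps=0.1, n_iter=5):
    """Iteratively re-weighted L1 shrinkage of the vector y.

    Each pass solves min_x 0.5 * sum((x_i - y_i)**2) + lam * sum(w_i * |x_i|)
    and then updates w_i = 1 / (|x_i| + eps). lam, eps and n_iter are
    illustrative values, not settings taken from the original work.
    """
    weights = [1.0] * len(y)  # the first pass is a plain L1 penalty
    x = list(y)
    for _ in range(n_iter):
        x = [soft_threshold(yi, lam * wi) for yi, wi in zip(y, weights)]
        weights = [1.0 / (abs(xi) + eps) for xi in x]
    return x


if __name__ == "__main__":
    noisy = [3.0, 0.4, -0.2, -2.0]
    print(reweighted_l1_denoise(noisy))
```

In this example the two small entries are driven exactly to zero, while the two large entries are shrunk less than a single plain L1 pass would shrink them.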

This work has been developed by Raghvendra Mall under the guidance of Prof. Johan Suykens and the source code is available here.

Kernel Methods for Community Detection:

Multilevel Hierarchical Kernel Spectral Clustering for Large Scale Networks

https://sites.google.com/site/raghvendramallmlresearcher/codes/10.1371-journal.pone.0099966.g009.png?attredirects=0

This work is attached to this research.

This tool can run on a network with up to 10^6-10^7 nodes and 10^8-10^9 edges on a standard machine with 8-16 GB of RAM using MATLAB 2011 or above in under 10 minutes. The main options available with this tool are:

Possibility to extract hierarchical community structure from a large scale network.
Produces good quality clusters at finer as well as coarser levels of granularity and overcomes the resolution limit problem.

Source code is available here.

Kernel Spectral Clustering for Big Data Networks:

https://sites.google.com/site/raghvendramallmlresearcher/codes/block_diagonal.jpg?attredirects=0

This work is attached to this research.

This tool can run on a network with up to 10^6-10^7 nodes and 10^8-10^9 edges on a standard machine with 8-16 GB of RAM using MATLAB 2011 or above in under 4 minutes. The main options available with this tool are:

Possibility to extract flat communities in a given large scale network.
Possibility to use either a "Self" tuned approach or a "Balance Angular Fitting" approach for model selection (estimating the number of communities "k").
Possibility to run on large scale sparse complex networks on both Linux and Windows.

Source code is available here (Linux, Windows).

Source code for FURS sampling technique is available here.

Data Visualization:

Netgram: Visualizing Evolution of Communities in Dynamic Networks

https://sites.google.com/site/raghvendramallmlresearcher/codes/Mergesplit_LineTrack_Louvain_04.jpg

Netgram is a software tool for visualizing the evolution of communities/clusters in time-evolving data.

The release and developer versions can be found here. Some salient features of the software include:

Independent of the evolutionary clustering algorithm used.
Ability to track evolution of clusters and highlight events like birth, death, merge, split, continuation, growth and shrinkage of communities.
Tries to optimally satisfy certain aesthetic qualities like minimization of cross-talk between communities over multiple time-stamps.
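
The kinds of events listed above can be illustrated with a small sketch that compares the communities of two consecutive time-stamps by node overlap. This is not Netgram's own algorithm: the function name, the overlap rule and the min_overlap parameter are assumptions made for illustration.

```python
def community_events(previous, current, min_overlap=1):
    """Label events between two snapshots of a clustering.

    previous and current map a community name to its set of node ids.
    Two communities are linked when they share at least min_overlap nodes
    (an assumed rule, chosen for simplicity).
    """
    def linked(nodes_a, nodes_b):
        return len(nodes_a & nodes_b) >= min_overlap

    successors = {
        p: sorted(c for c, nodes in current.items() if linked(previous[p], nodes))
        for p in previous
    }
    predecessors = {
        c: sorted(p for p, nodes in previous.items() if linked(nodes, current[c]))
        for c in current
    }

    events = []
    for p, succ in successors.items():
        if not succ:
            events.append(("death", p))
        elif len(succ) > 1:
            events.append(("split", p, tuple(succ)))
    for c, pred in predecessors.items():
        if not pred:
            events.append(("birth", c))
        elif len(pred) > 1:
            events.append(("merge", tuple(pred), c))
        elif len(successors[pred[0]]) == 1:
            # One-to-one match: compare sizes to tell the three cases apart.
            before, after = len(previous[pred[0]]), len(current[c])
            if after > before:
                kind = "growth"
            elif after < before:
                kind = "shrinkage"
            else:
                kind = "continuation"
            events.append((kind, pred[0], c))
    return events


if __name__ == "__main__":
    t0 = {"A": {1, 2, 3, 4}, "B": {5, 6}, "C": {7, 8}}
    t1 = {"A1": {1, 2}, "A2": {3, 4}, "BC": {5, 6, 7, 8}, "D": {9, 10}}
    for event in community_events(t0, t1):
        print(event)
```

Because the sketch only needs the community memberships at each time-stamp, it works with the output of any evolutionary clustering algorithm.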
