ARGContextProfiler: extracting and scoring the genomic contexts of antibiotic resistance genes using assembly graphs. (Frontiers in Microbiology, 2025)
Nazifa Ahmed Moumi, Shafayat Ahmed, Connor Brown, Amy Pruden, Liqing Zhang
Overview: Antibiotic resistance (AR) presents a global health challenge, necessitating an improved understanding of the ecology, evolution, and dissemination of antibiotic resistance genes (ARGs). Several tools, databases, and algorithms are now available to facilitate the identification of ARGs in metagenomic sequencing data; however, direct annotation of short-read data provides limited contextual information. Knowledge of whether an ARG is carried in the chromosome or on a specific mobile genetic element (MGE) is critical to understanding mobility, persistence, and potential for co-selection. Here, we developed ARGContextProfiler, a pipeline designed to extract and visualize ARG genomic contexts. By leveraging the assembly graph for genomic neighborhood extraction and validating contexts through read mapping, ARGContextProfiler minimizes chimeric errors that are a common artifact of assembly outputs. Testing on real, synthetic, and semi-synthetic data, including long-read sequencing data from environmental samples, demonstrated that ARGContextProfiler offers superior accuracy, precision, and sensitivity compared to conventional assembly-based methods. ARGContextProfiler thus provides a powerful tool for uncovering the genomic context of ARGs in metagenomic sequencing data, which can be of value to both fundamental and applied research aimed at understanding and stemming the spread of AR. The source code of ARGContextProfiler is publicly available at GitHub.
Resistance genes are distinct in protein-protein interaction networks according to drug class and gene mobility. (International Conference on Computational Advances in Bio and Medical Sciences, 2025)
Nazifa Ahmed Moumi, Connor L Brown, Shafayat Ahmed, Peter J Vikesland, Amy Pruden, Liqing Zhang
Overview: With growing calls for increased surveillance of antibiotic resistance as an escalating global health threat, improved bioinformatic tools are needed to track antibiotic resistance genes (ARGs) across One Health domains. Most studies to date profile ARGs using sequence homology, but such approaches provide limited information about the broader context or function of the ARG in bacterial genomes. Here, we introduce a new pipeline, PPI-ARG-finder, for identifying ARGs in genomic data that employs machine learning analysis of Protein-Protein Interaction Networks (PPINs) as a means to improve predictions of ARGs while also providing vital information about the genetic context, such as gene mobility. A random forest model was trained to effectively differentiate between ARGs and non-ARGs and was validated using the PPINs of ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter cloacae), which represent urgent threats to human health because they tend to be multi-antibiotic resistant. The pipeline exhibited robustness in discriminating ARGs from non-ARGs, achieving an average area under the precision-recall curve of 88%. We further identified that the neighbors of ARGs, i.e., genes connected to ARGs by only one edge, were disproportionately associated with mobile genetic elements, which is consistent with the understanding that ARGs tend to be more mobile compared to randomly sampled genes in the PPINs. PPI-ARG-finder showcases the utility of PPINs in discerning distinctive characteristics of ARGs within a broader genomic context and in differentiating ARGs from non-ARGs through network-based attributes and interaction patterns.
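The one-edge neighbor analysis described above can be illustrated with a minimal Python sketch. This is not the PPI-ARG-finder implementation: the gene names, the toy edge list, and the helper names `one_hop_neighbors` and `mge_fraction` are all hypothetical, chosen only to show how the neighbors of ARGs might be collected and checked against a set of MGE-associated genes.

```python
from collections import defaultdict


def one_hop_neighbors(edges, seeds):
    """Return genes connected to any seed gene by exactly one edge (seeds excluded)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    neighbors = set()
    for gene in seeds:
        neighbors |= adjacency[gene]
    return neighbors - set(seeds)


def mge_fraction(genes, mge_genes):
    """Fraction of the given genes that belong to the MGE-associated set."""
    genes = set(genes)
    if not genes:
        return 0.0
    return len(genes & set(mge_genes)) / len(genes)


# Hypothetical toy network: every identifier below is made up for illustration.
edges = [
    ("argA", "tnpA"),
    ("argA", "intI"),
    ("argB", "tnpA"),
    ("hkX", "hkY"),
    ("hkY", "argB"),
]
args = {"argA", "argB"}
mges = {"tnpA", "intI"}

neighbors = one_hop_neighbors(edges, args)
fraction = mge_fraction(neighbors, mges)
print(sorted(neighbors), round(fraction, 3))
```

In a real analysis, this fraction would be compared against the same statistic for randomly sampled genes to judge whether the association is disproportionate.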
CIWARS: a web server for antibiotic resistance surveillance using longitudinal metagenomic data. (Journal of Molecular Biology, 2025)
Muhit Islam Emon, Yat Fei Cheung, James Stoll, Monjura Afrin Rumi, Connor Brown, Joung Min Choi, Nazifa Ahmed Moumi, Shafayat Ahmed, Haoqiu Song, Justin Sein, Shunyu Yao, Ahmad Khan, Suraj Gupta, Rutwik Kulkarni, Ali Butt, Peter Vikesland, Amy Pruden, Liqing Zhang
Overview: The rise of antibiotic resistance (AR) poses a substantial threat to human and animal health, food security, and economic stability. Wastewater-based surveillance (WBS) has emerged as a powerful strategy for population-level AR monitoring, providing valuable data to guide public health and policy decisions. Metagenomic sequencing is especially promising, as it can yield comprehensive profiles of antibiotic resistance genes (ARGs) and other genes relevant to AR in a single run. However, online analytical platforms to facilitate analysis of longitudinal metagenomic data are lacking. To address this, we introduce CyberInfrastructure for Waterborne Antibiotic Resistance Surveillance (CIWARS), a web server configured for characterizing key AR trends from longitudinal metagenomic WBS data. CIWARS offers comprehensive profiling of ARGs and taxonomic profiling of pathogen-associated bacterial taxonomic groups, identifies potential associations of ARGs with mobile genetic elements (MGEs) and pathogen-containing taxa, and assesses resistome risk based on the co-occurrence of ARGs, MGEs, and pathogen-like sequences. Additionally, it detects anomalous AR indicators over time, aiding in identifying potential events of concern, such as the emergence of resistant strains or outbreaks. Through interactive temporal data visualization, CIWARS enables AR monitoring and can serve as a tool to inform effective and timely interventions to mitigate the spread and transmission of AR. Here, CIWARS is demonstrated using longitudinal metagenomic data from a wastewater treatment plant (WWTP) influent and effluent, but it can be extended to any environment. CIWARS provides a valuable tool to support global efforts to combat the evolution and spread of AR, while also guiding agricultural and public health efforts aimed at optimizing antibiotic use. The web server is freely available at https://ciwars.cs.vt.edu/.
LLMAgent4Bio: LLM Agents for Biological Intelligence Across Genomics, Proteomics, Spatial Biology, and Biomedicine. (Briefings in Bioinformatics, 2026)
Sajib Acharjee Dip, Dipanwita Mallick, Uddip Acharjee Shuvo, Shovito Barua Soummo, Fazle Rafsani, Bikash Kumar Paul, Nazifa Ahmed Moumi, Shafayat Ahmed, Liqing Zhang
Overview: Large language models are evolving from passive predictors into agentic systems capable of planning, tool use, and multimodal reasoning. This shift is especially consequential for biology, where complex, noisy, and multi-scale data require adaptive and integrative computational strategies. In this review, we provide the first systematic synthesis of LLM-based agents across genomics, molecular biology, imaging, biomedical analysis, and automated bioinformatics workflows. We analyze more than fifty emerging systems and organize them within a unifying framework that characterizes agentic traits such as autonomous decision-making, external tool invocation, memory, and self-correction. Across domains, agentic LLMs show early promise in enabling multi-step analysis, linking heterogeneous evidence, and supporting exploratory scientific tasks. At the same time, our comparative assessment highlights consistent challenges, including unstable reasoning, limited biological grounding, retrieval misalignment, and barriers to reproducibility and biosafety. We conclude by outlining opportunities for trustworthy and collaborative biological agents, including multimodal integration, closed-loop experimental design, and robust evaluation practices. This survey aims to clarify the emerging landscape and chart a path toward reliable agentic systems for biological discovery.
HALO: Hybrid Attention Model for Subcellular Localization. (Pacific Symposium on Biocomputing, 2026)
Shafayat Ahmed, Nazifa Ahmed Moumi, Liqing Zhang
Overview: Subcellular localization prediction is critical for understanding protein functions and interactions, providing insights into cellular mechanisms and potential therapeutic targets. We propose HALO (Hybrid Attention model for subcellular LOcalization), a framework that integrates semantic embeddings from fine-tuned protein language models (e.g., ESM) with structural information derived from AlphaFold. HALO uses a graph attention network (GAT) to incorporate biochemical, structural, and sequence-derived features into a unified representation, while dynamically balancing their contributions. Crucially, the design allows HALO to operate in two modes: (i) a sequence-only mode, where predictions are made from the fine-tuned protein language model (PLM) when structural data are unavailable, and (ii) a hybrid mode, where structural adjacency and biochemical features complement PLM predictions, especially in low-confidence regions. We evaluate HALO on multiple datasets with minimal homology between training and test sets, where it achieves competitive performance across key metrics. By flexibly combining sequence-based and structure-informed predictions, HALO addresses the limitations of relying on a single modality and offers an adaptable framework for accurate and generalizable subcellular localization.
ProtAlign-ARG: antibiotic resistance gene characterization integrating protein language models and alignment-based scoring. (Scientific Reports, 2025)
Shafayat Ahmed, Muhit Islam Emon, Nazifa Ahmed Moumi, Lifu Huang, Dawei Zhou, Peter Vikesland, Amy Pruden, Liqing Zhang
Overview: The evolution and spread of antibiotic resistance pose a global health challenge. Whole genome and metagenomic sequencing offer a promising approach to monitoring the spread, but typical alignment-based approaches for antibiotic resistance gene (ARG) detection are inherently limited in the ability to detect new variants. Large protein language models could present a powerful alternative but are limited by databases available for training. Here we introduce ProtAlign-ARG, a novel hybrid model combining a pre-trained protein language model and an alignment scoring-based model to expand the capacity for ARG detection from DNA sequencing data. ProtAlign-ARG learns from vast unannotated protein sequences, utilizing raw protein language model embeddings to improve the accuracy of ARG classification. In instances where the model lacks confidence, ProtAlign-ARG employs an alignment-based scoring method, incorporating bit scores and e-values to classify ARGs according to their corresponding classes of antibiotics. ProtAlign-ARG demonstrated remarkable accuracy in identifying and classifying ARGs, particularly excelling in recall compared to existing ARG identification and classification tools. We also extended ProtAlign-ARG to predict the functionality and mobility of ARGs, highlighting the model’s robustness in various predictive tasks. A comprehensive comparison of ProtAlign-ARG with both the alignment-based scoring model and the pre-trained protein language model demonstrated the superior performance of ProtAlign-ARG.
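The confidence-based hand-off described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the published ProtAlign-ARG code: the function name `classify_with_fallback`, the 0.9 confidence threshold, the 1e-5 e-value cutoff, and the input shapes are all hypothetical.

```python
def classify_with_fallback(plm_probs, alignment_hits,
                           threshold=0.9, max_e_value=1e-5):
    """Pick an antibiotic class, preferring the language model when it is confident.

    plm_probs: dict mapping class name -> predicted probability (assumed input shape).
    alignment_hits: list of (class name, bit score, e-value) tuples (assumed input shape).
    threshold and max_e_value are illustrative values, not taken from the paper.
    """
    best_class = max(plm_probs, key=plm_probs.get)
    if plm_probs[best_class] >= threshold:
        return best_class, "plm"

    # Low confidence: fall back to alignment hits that pass the e-value cutoff.
    usable = [hit for hit in alignment_hits if hit[2] <= max_e_value]
    if not usable:
        return best_class, "plm-low-confidence"

    # Among usable hits, take the class of the hit with the highest bit score.
    best_hit = max(usable, key=lambda hit: hit[1])
    return best_hit[0], "alignment"


confident = classify_with_fallback(
    {"beta-lactam": 0.97, "tetracycline": 0.03}, [])
uncertain = classify_with_fallback(
    {"beta-lactam": 0.55, "tetracycline": 0.45},
    [("tetracycline", 210.0, 1e-50), ("beta-lactam", 80.0, 1e-12)])
print(confident, uncertain)
```

The second call shows the fallback path: the language model is below the threshold, so the class of the strongest alignment hit is returned instead.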
LM-ARG: classification of antibiotic resistance genes using protein language model. (BIBM, 2022)
Shafayat Ahmed, Muhit Islam Emon, Nazifa Ahmed Moumi, Liqing Zhang
Overview: Antibiotic resistance genes (ARGs) enable the bacteria carrying them to survive in the presence of antibiotics, and their increasing prevalence in the environment poses a great threat to human health. Detecting and classifying ARGs is therefore a high priority. Classifying an ARG means determining the antibiotic class to which the gene confers resistance. Language models (LMs) have been used extensively in natural language processing for various prediction tasks. Recently, protein language models have been trained on large numbers of protein sequences. These models should capture distant features of protein sequences, and these features can be used to identify and classify ARGs.
SHONGLAP: A Large Bengali Open-Domain Dialogue Corpus (LREC, 2022)
Syed Mostofa Monsur, Sakib Chowdhury, Md Shahrar Fatemi, Shafayat Ahmed
Overview: We introduce SHONGLAP, a large annotated open-domain dialogue corpus in the Bengali language. Due to the unavailability of high-quality dialogue datasets for low-resource languages like Bengali, existing neural open-domain dialogue systems suffer from data scarcity. We propose a framework that prepares large-scale open-domain dialogue datasets from publicly available multi-party discussion podcasts and talk shows and labels them using weak-supervision techniques, which are particularly suitable for low-resource settings. Using this framework, we prepared our corpus, the first reported Bengali open-domain dialogue corpus (7.7k+ fully annotated dialogues in total), which can serve as a strong baseline for future work. Experimental results show that our corpus improves the performance of large language models (BanglaBERT) on downstream classification tasks during fine-tuning.
Improving End-to-End Bangla Speech Recognition with Semi-supervised Training (Findings of ACL: EMNLP, 2020)
Nafis Sadeq, Nafis Tahmid Chowdhury, Farhan Tanvir Utshaw, Shafayat Ahmed, Muhammad Abdullah Adnan
Overview: Automatic speech recognition systems usually require a large annotated speech corpus for training. Manually annotating a large corpus is very difficult, so it can be helpful to use unsupervised and semi-supervised learning methods in addition to supervised learning. In this work, we focus on a semi-supervised training approach for Bangla speech recognition that can exploit large amounts of unpaired audio and text data. We encode speech and text data in an intermediate domain and propose a novel loss function based on the global encoding distance between encoded data to guide the semi-supervised training. Our proposed method reduces the Word Error Rate (WER) of the system from 37% to 31.9%.
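For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The Python sketch below shows that standard definition together with the relative reduction implied by the reported figures. The function names and the English example sentences are illustrative only; this is not the evaluation code used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + substitution_cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Relative improvement when an error rate drops from `before` to `after`."""
    return (before - after) / before


wer_example = word_error_rate("turn on the light", "turn off the light")
reported_gain = relative_reduction(37.0, 31.9)
print(wer_example, round(reported_gain, 3))
```

One substituted word out of four gives a WER of 0.25, and the drop from 37% to 31.9% corresponds to a relative reduction of roughly 14%.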
Preparation of Bangla Speech Corpus from Publicly Available Audio (LREC, 2020)
Shafayat Ahmed, Nafis Sadeq, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan
Overview: Automatic speech recognition systems require a large annotated speech corpus. Manual annotation of a large corpus is very difficult. In this paper, we focus on the automatic preparation of a speech corpus (512 hours) for Bangla, a low-resource language. We used publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a large amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We also leveraged speaker diarization and gender detection, among other techniques, to prepare the annotated corpus. Our corpus is suitable for training with Kaldi. Experimental results show that the system trained on our corpus outperforms the system trained on the publicly available Google speech dataset (229 hours).
Bangla Voice Command Recognition in End-to-End System Using Topic Modeling Based Contextual Rescoring (ICASSP, 2020)
Nafis Sadeq, Shafayat Ahmed, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan
Overview: In this work, we perform contextual rescoring using multi-label topic modeling to improve the performance of an End-to-End Bangla voice command recognition system. We use a hybrid of Connectionist Temporal Classification (CTC) and an attention mechanism in our End-to-End architecture. We use a Recurrent Neural Network (RNN) as the language model and Labeled LDA (Latent Dirichlet Allocation) for contextual rescoring. Our experiments show that our rescoring method reduces Word Error Rate (WER) from 16.7% to 12.8% in the Bangla voice command recognition task when relevant context is provided. The system does not lose any performance when irrelevant context is provided.
Customizing Grapheme-to-Phoneme System for Non-Trivial Transcription Problems in Bangla Language (NAACL, 2019)
Conference: North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019) At: Minneapolis, USA
Sudipta Saha Shubha, Nafis Sadeq, Shafayat Ahmed, Md. Nahidul Islam, Muhammad Abdullah Adnan, Md. Yasin Ali Khan, and Mohammad Zuberul Islam
NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene clustering algorithm
Nabil Ibtehaz, Shafayat Ahmed, Bishwajit Saha, M. Sohel Rahman, and Md. Shamsuzzoha Bayzid
Overview: We present NORTH, a novel, automated, highly accurate, and scalable machine learning-based orthologous gene clustering method. We have utilized the biological basis of orthologous genes and made an effort to incorporate appropriate ideas from machine learning (ML) and natural language processing (NLP). We studied 1,255,877 genes in the largest 250 orthologous clusters from the KEGG database, across 3,880 organisms comprising the six major groups of life. NORTH is able to cluster them with 98.48% Precision, 98.43% Recall, and 98.44% F1 score, showing that automatic orthologous gene clustering can be both highly accurate and scalable.
This is the first study that maps the orthology identification to the text classification problem and achieves remarkable accuracy and scalability. We believe NORTH will be considered as a potential alternative to the existing phylogenetic tree and BLAST-based methods.
Server Mapping for Protecting Connectivity of VIP Clients in Network
[Undergraduate Thesis]
Supervisor: Dr. Md. Saidur Rahman, Professor, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET).
Predicting Human-Infecting Diseases from Viral Sequences
Overview: A huge number of viral sequences are available, but very few of these viruses are able to cross the animal-human barrier and cause disease in humans. Such diseases are called zoonotic diseases. In this work, I focus on this problem and aim to predict zoonotic diseases directly from viral sequences.
Predicting Protein Functions Using Protein Sequences by Topic Modelling
Shafayat Ahmed, Nafis Sadeq, and Md. Shamsuzzoha Bayzid
Overview: The focus of this research is to develop a model that can efficiently and accurately predict protein function from protein sequence using machine-learning algorithms. Currently, we are using a topic modelling algorithm, Labelled LDA, to perform our experiments and evaluate our results.