Research & Publications

Published Papers

LM-ARG: classification of antibiotic resistance genes using protein language model (BIBM, 2022)

Shafayat Ahmed, Muhit Islam Emon, Nazifa Ahmed Moumi, Liqing Zhang

Overview: Antibiotic resistance genes (ARGs) enable the bacteria that carry them to survive in the presence of antibiotics, and their increasing prevalence in the environment poses a serious threat to human health. Detecting and classifying ARGs is therefore a high priority. Classifying an ARG means determining the antibiotic class to which the gene confers resistance. Language models (LMs) have been used extensively in natural language processing for various prediction tasks. Recently, protein language models have been trained on large numbers of protein sequences. These models are expected to capture distant features of protein sequences, and these features can be used to identify and classify ARGs.

SHONGLAP: A Large Bengali Open-Domain Dialogue Corpus (LREC, 2022)

Syed Mostofa Monsur, Sakib Chowdhury, Md Shahrar Fatemi, Shafayat Ahmed

Overview: We introduce SHONGLAP, a large annotated open-domain dialogue corpus in the Bengali language. Because high-quality dialogue datasets are unavailable for low-resource languages like Bengali, existing neural open-domain dialogue systems suffer from data scarcity. We propose a framework for preparing large-scale open-domain dialogue datasets from publicly available multi-party discussion podcasts and talk shows, and for labeling them with weak-supervision techniques that are particularly suitable for low-resource settings. Using this framework, we prepared our corpus, the first reported Bengali open-domain dialogue corpus (7.7k+ fully annotated dialogues in total), which can serve as a strong baseline for future work. Experimental results show that fine-tuning on our corpus improves the performance of large language models (BanglaBERT) on downstream classification tasks.

Improving End-to-End Bangla Speech Recognition with Semi-supervised Training (Findings of ACL: EMNLP, 2020)

Nafis Sadeq, Nafis Tahmid Chowdhury, Farhan Tanvir Utshaw, Shafayat Ahmed, Muhammad Abdullah Adnan

Overview: Automatic speech recognition systems usually require a large annotated speech corpus for training, and manually annotating a large corpus is very difficult. Unsupervised and semi-supervised learning methods can therefore be a helpful complement to supervised learning. In this work, we focus on a semi-supervised training approach for Bangla speech recognition that can exploit large amounts of unpaired audio and text data. We encode speech and text data in an intermediate domain and propose a novel loss function, based on the global encoding distance between the encoded data, to guide the semi-supervised training. Our proposed method reduces the Word Error Rate (WER) of the system from 37% to 31.9%.
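For readers unfamiliar with the metric, the sketch below shows the standard definition of WER (word-level edit distance divided by the number of reference words). It is an illustration only, not the evaluation code used in the paper; the function names `word_error_rate` and `relative_reduction` are hypothetical. By plain arithmetic, going from 37% to 31.9% is a drop of 5.1 absolute points, or roughly 13.8% relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement between two error rates given in the same unit."""
    return (before - after) / before


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
    print(round(relative_reduction(37.0, 31.9), 3))
```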

Preparation of Bangla Speech Corpus from Publicly Available Audio (LREC, 2020)

Shafayat Ahmed, Nafis Sadeq, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan

Overview: Automatic speech recognition systems require a large annotated speech corpus, and manual annotation of a large corpus is very difficult. In this paper, we focus on the automatic preparation of a speech corpus (512 hours) for Bangla, a low-resource language. We use publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a large amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We also leverage speaker diarization, gender detection, and related techniques to prepare the annotated corpus. Our corpus is suitable for training with Kaldi. Experimental results show that the system trained on our corpus outperforms the system trained on the publicly available Google speech dataset (229 hours).

Bangla Voice Command Recognition in End-to-End System Using Topic Modeling Based Contextual Rescoring (ICASSP, 2020)

Nafis Sadeq, Shafayat Ahmed, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan

Overview: In this work, we perform contextual rescoring using multi-label topic modeling to improve the performance of an end-to-end Bangla voice command recognition system. Our end-to-end architecture uses a hybrid of Connectionist Temporal Classification (CTC) and an attention mechanism. We use a Recurrent Neural Network (RNN) as the language model and Labeled LDA (Latent Dirichlet Allocation) for contextual rescoring. Our experiments show that our rescoring method reduces the Word Error Rate (WER) from 16.7% to 12.8% in the Bangla voice command recognition task when relevant context is provided. The system does not lose any performance when irrelevant context is provided.
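The general idea of contextual rescoring can be sketched as follows: each candidate transcription in an n-best list keeps its base recognizer score, and a topic-match term for the supplied context is added before re-ranking. This is a minimal illustration under stated assumptions, not the paper's method. The keyword table `TOPIC_WORDS`, the smoothed `topic_score`, the log-linear combination, and the `weight` parameter are all hypothetical stand-ins; in particular, the keyword lookup merely takes the place of a trained Labeled LDA model.

```python
import math

# Hypothetical stand-in for a trained topic model: topic -> typical words.
TOPIC_WORDS = {
    "music": {"play", "song", "volume"},
    "call": {"call", "dial", "number"},
}


def topic_score(text: str, topic: str) -> float:
    """Add-one smoothed fraction of words in `text` that match `topic`."""
    words = text.split()
    hits = sum(word in TOPIC_WORDS.get(topic, set()) for word in words)
    return (hits + 1) / (len(words) + 2)


def rescore(nbest, context_topic, weight=1.0):
    """Re-rank (text, base_log_score) pairs by base score plus a weighted
    log topic score for the given context (assumed combination rule)."""
    rescored = [
        (text, base + weight * math.log(topic_score(text, context_topic)))
        for text, base in nbest
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy n-best list in English for readability; scores are made up.
    nbest = [("pay the song", -1.0), ("play the song", -1.2)]
    print(rescore(nbest, "music")[0][0])    # relevant context
    print(rescore(nbest, "weather")[0][0])  # unrelated context
```

In this toy example a relevant context promotes the second hypothesis, while an unrelated context adds the same offset to both candidates and leaves the original order unchanged.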

Customizing Grapheme-to-Phoneme System for Non-Trivial Transcription Problems in Bangla Language (NAACL, 2019)

Conference: North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), Minneapolis, USA

Sudipta Saha Shubha, Nafis Sadeq, Shafayat Ahmed, Md. Nahidul Islam, Muhammad Abdullah Adnan, Md. Yasin Ali Khan, and Mohammad Zuberul Islam

NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene clustering algorithm

Nabil Ibtehaz, Shafayat Ahmed, Bishwajit Saha, M. Sohel Rahman, and Md. Shamsuzzoha Bayzid

Overview: We present NORTH, a novel, automated, highly accurate, and scalable machine learning-based orthologous gene clustering method. We have utilized the biological basis of orthologous genes and made an effort to incorporate appropriate ideas from machine learning (ML) and natural language processing (NLP). We studied 1,255,877 genes in the largest 250 orthologous clusters from the KEGG database, across 3,880 organisms comprising the six major groups of life. NORTH is able to cluster them with 98.48% Precision, 98.43% Recall, and 98.44% F1 score, showing that automatic orthologous gene clustering can be both highly accurate and scalable.

This is the first study that maps orthology identification to a text classification problem, and it achieves remarkable accuracy and scalability. We believe NORTH can serve as a potential alternative to existing phylogenetic tree-based and BLAST-based methods.
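To make the text-classification framing concrete, here is a minimal sketch of a Naive Bayes classifier that treats a gene sequence as a "document" of words. It is not the NORTH implementation: tokenizing into overlapping k-mers, the value of `k`, Laplace smoothing, and the class name `KmerNaiveBayes` are all assumptions made for illustration, and the sequences are toy data.

```python
import math
from collections import Counter, defaultdict


def kmers(sequence: str, k: int = 3):
    """Split a sequence into overlapping k-mers, treated as 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k: int = 3):
        self.k = k
        self.kmer_counts = defaultdict(Counter)  # label -> k-mer counts
        self.label_counts = Counter()            # label -> training sequences
        self.vocab = set()

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            tokens = kmers(seq, self.k)
            self.kmer_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, sequence):
        tokens = kmers(sequence, self.k)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_seqs in self.label_counts.items():
            counts = self.kmer_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_seqs / total)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy sequences with made-up cluster labels.
    train = ["ATGCGTATGCGT", "ATGCGAATGCGT", "TTTAAACCCTTT", "TTTAAACCGTTT"]
    labels = ["A", "A", "B", "B"]
    model = KmerNaiveBayes(k=3).fit(train, labels)
    print(model.predict("ATGCGTAT"), model.predict("TTAAACC"))
```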

Server Mapping for Protecting Connectivity of VIP Clients in Network

[Undergraduate Thesis]

Supervisor: Dr. Md. Saidur Rahman, Professor, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET).

Ongoing Research Projects

Predicting Human-Infecting Diseases from Viral Sequences

Overview: A huge number of viral sequences are available, but very few of these viruses are able to cross the animal-human barrier and cause disease in humans. The diseases they cause are called zoonotic diseases. In this work, I focus on this problem and aim to predict zoonotic diseases from their viral sequences alone.

Predicting Protein Functions Using Protein Sequences by Topic Modelling

Shafayat Ahmed, Nafis Sadeq, and Md. Shamsuzzoha Bayzid

Overview: The focus of this research is to develop a model that can efficiently and accurately predict protein function from protein sequence using machine learning algorithms. We are currently using a topic modelling algorithm, Labelled LDA, to run our experiments and evaluate the results.