I am currently working on a project that uses advanced AI techniques to automatically create short, meaningful summaries from longer videos. The goal is to make it faster and easier to find important moments without watching the entire footage. This approach is designed to work efficiently, even on devices with limited computing power, and can be applied in areas like security monitoring, sports highlights, and event coverage. This work has been submitted to the International Conference on Data Science, Artificial Intelligence, and Applications (ICDSAIA) 2025.
This project focused on developing a robust framework for human activity recognition (HAR) in daily living scenarios, particularly relevant for environments such as assisted living facilities and nursing homes. The motivation stemmed from the need to accurately monitor residents’ movements and actions, which is crucial for providing timely care and updating emergency response plans. The project was undertaken by a collaborative team: Shihab Hossain, Kaushik Deb, Saadman Sakib, and Iqbal H. Sarker, and was completed at the Department of Computer Science & Engineering, Chittagong University of Engineering and Technology (CUET), Bangladesh.
The primary objectives were:
To develop an effective method for classifying daily living activities using deep neural networks.
To overcome the limitations of random or sequential frame sampling in videos, which can cause significant temporal information loss during HAR.
To propose and evaluate a cluster-based video summarization approach for keyframe extraction, ensuring both the retention of essential video content and computational efficiency.
To compare two main deep learning strategies for HAR: (a) pose-based activity recognition and (b) a single hybrid pre-trained CNN-LSTM model.
To validate the approach on challenging datasets, including MSRDailyActivity3D, PRECIS HAR, and UCF11, thereby addressing both indoor and outdoor activity recognition challenges.
Essential considerations included ensuring that the summarization process preserved the temporal context necessary for accurate recognition, and that the proposed framework generalized well across varied datasets and activity types.
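As a concrete illustration of the idea, the sketch below clusters per-frame colour-histogram features with k-means and keeps the frame nearest each cluster centre as a keyframe. The feature choice, cluster count, and library calls are illustrative assumptions, not the paper's exact pipeline.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_keyframes(video_path, n_keyframes=8):
    """Pick one representative frame per cluster of frame features."""
    cap = cv2.VideoCapture(video_path)
    frames, feats = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)  # for long videos, stream or subsample instead
        # Compact per-frame descriptor: a flattened HSV colour histogram.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
    cap.release()

    feats = np.array(feats)
    km = KMeans(n_clusters=n_keyframes, n_init=10, random_state=0).fit(feats)
    # Keyframe = frame nearest each cluster centre, kept in temporal order.
    idx = [int(np.argmin(np.linalg.norm(feats - c, axis=1)))
           for c in km.cluster_centers_]
    return [frames[i] for i in sorted(set(idx))]
```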
The project successfully developed and validated a hybrid CNN-LSTM framework that achieved a mean accuracy of 95.56% for RGB video data on the MSRDailyActivity3D dataset, surpassing many recent multimodal approaches. The cluster-based video summarization method efficiently extracted keyframes, which reduced computational overhead while preserving semantic content. Comparative analysis revealed that the hybrid CNN-LSTM model outperformed the pose-based model and several state-of-the-art methods in terms of precision, recall, and F1 score (with F1 up to 95.03%). The model was further tested on PRECIS HAR and UCF11 datasets, confirming its robustness across different scenarios. Notably, the cluster-based keyframe extraction not only improved recognition accuracy but also highlighted the importance of smart frame selection in video-based HAR. The experience underscored the value of integrating video summarization with deep learning to address practical limitations in real-world surveillance and healthcare monitoring, and set the stage for future research in lightweight, real-time HAR systems.
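For readers who want a starting point, here is a minimal Keras sketch of a hybrid pre-trained CNN-LSTM of the kind described above: a frozen CNN backbone encodes each keyframe, and an LSTM aggregates the sequence. The MobileNetV2 backbone, input sizes, and hyperparameters are placeholder assumptions rather than the published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, NUM_CLASSES = 16, 224, 224, 16  # illustrative sizes

# Frozen pre-trained CNN applied to every keyframe via TimeDistributed.
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(H, W, 3))
backbone.trainable = False

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, 3)),
    layers.TimeDistributed(backbone),            # (frames, feature_dim)
    layers.LSTM(128),                            # temporal aggregation
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```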
This project focused on advancing the efficiency and accuracy of real-time road pavement damage detection, an essential task for modern infrastructure management. The initiative was motivated by the high costs, dangers, and delays inherent in manual inspection processes, and the growing need for automated, scalable, and reliable road maintenance systems. The work was a collaboration among researchers at the Department of Computer Science & Engineering, Chittagong University of Engineering & Technology (CUET), Bangladesh, and Edith Cowan University, Australia. The project team included Abdullah As Sami, Saadman Sakib, Kaushik Deb, and Iqbal H. Sarker.
The main objectives were:
To design and implement an improved, lightweight YOLOv5-based deep learning model for real-time detection and classification of various types of road pavement damage.
To address the limitations of existing YOLO-based and traditional models, particularly regarding detection accuracy, class imbalance, and model complexity.
To incorporate advanced techniques—such as Efficient Channel Attention (ECA-Net), label smoothing, K-means++ anchor box generation, Focal Loss, and an extra prediction layer—to enhance performance.
To validate the model using the comprehensive, multinational RDD 2022 dataset, which includes diverse damage types (vertical cracks, horizontal cracks, alligator cracks, potholes) and images from multiple countries.
Essential considerations throughout the project included optimizing the balance between model size and detection speed, achieving robust generalization across diverse road conditions and countries, and ensuring the solution could be deployed in real-world, resource-constrained settings.
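To make the K-means++ anchor generation concrete, the sketch below clusters ground-truth box widths and heights. Note that YOLO-family implementations often cluster with an IoU-based distance; plain Euclidean k-means++ via scikit-learn is shown here for brevity, and the box data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_anchors(wh, n_anchors=9):
    """Cluster ground-truth box (width, height) pairs into anchor shapes.

    wh: array of shape (n_boxes, 2), box sizes normalised to the input size.
    """
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10,
                random_state=0).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort by box area

# Example with synthetic, normalised box sizes:
boxes = np.abs(np.random.randn(500, 2)) * 0.2 + 0.1
print(generate_anchors(boxes))
```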
The improved YOLOv5 model achieved a significant boost in both accuracy and efficiency. It attained a mean average precision (mAP) of 67.81% and an F1 score of 66.51% on the RDD 2022 dataset, outperforming the baseline YOLOv5s by 1.9% mAP and 1.29% F1 score with only a modest increase in parameters. Compared to YOLOv8s, it achieved similar or better accuracy with fewer parameters and lower computational cost. The results demonstrated the effectiveness of the enhancements, particularly ECA-Net for better feature selection, Focal Loss for class imbalance, and K-means++ for anchor optimization. The ablation studies validated the individual contributions of each technique.
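As an illustration of the ECA-Net mechanism mentioned above, here is a minimal channel-attention block: a global average pool followed by a cheap 1D convolution across channels. It is written in Keras for readability, whereas YOLOv5 itself is PyTorch-based, so this sketches the mechanism rather than the project's actual code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def eca_block(x, k_size=3):
    """Efficient Channel Attention: reweight channels via a 1D conv."""
    squeeze = layers.GlobalAveragePooling2D()(x)        # (B, C)
    attn = layers.Reshape((-1, 1))(squeeze)             # (B, C, 1)
    attn = layers.Conv1D(1, kernel_size=k_size, padding="same",
                         use_bias=False)(attn)          # cross-channel mixing
    attn = layers.Activation("sigmoid")(attn)
    attn = layers.Reshape((1, 1, -1))(attn)             # (B, 1, 1, C)
    return layers.Multiply()([x, attn])                 # channel-wise scaling
```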
The work highlighted the importance of lightweight, high-performance models for road maintenance and management, especially in countries with large-scale, diverse infrastructure. The project established a practical framework that can impact future intelligent transportation systems and inspire further research in instance segmentation and specialized datasets for even more precise damage detection.
This project aimed to advance automated industrial inspection systems by leveraging deep learning, aligning with the goals of Industry 4.0. The motivation stemmed from the need to improve product quality, reduce human labor, and enhance efficiency in manufacturing environments. To this end, the team (Monowar Wadud Hridoy, Mohammad Mizanur Rahman, and Saadman Sakib) developed a novel framework for camera-based inspection and introduced a new hex-nut product dataset, with the research completed at Chittagong University of Engineering & Technology (CUET), Bangladesh.
The main objectives were:
To design a deep learning-based automated inspection system that can accurately classify defective and non-defective products in industrial settings.
To address the lack of high-quality datasets for industrial defect detection by developing a new hex-nut dataset with 4,000 labeled images (2,000 defective, 2,000 non-defective).
To evaluate various CNN architectures (Custom CNN, Inception ResNet v2, Xception, ResNet 101 v2, ResNet 152 v2) using transfer learning and fine-tuning, identifying the optimal solution for defect classification.
Essential considerations included optimizing both CPU and GPU inference times, minimizing computational cost, and ensuring generalizability to other industrial inspection tasks.
The framework achieved 100% accuracy on the newly developed hex-nut dataset and 99.72% accuracy on a public casting material dataset using the fine-tuned Xception model (last 14 layers trainable, excluding the fully connected layer). These results outperformed several state-of-the-art methods and demonstrated robust, transferable defect detection across different product types. The introduction of a high-quality dataset further supports future research in industrial inspection. Key learnings included the effectiveness of transfer learning and careful layer fine-tuning, as well as the significant impact of high-quality, well-labeled data on deep learning performance in industrial contexts. The project’s results are expected to foster broader adoption of deep learning-based inspection systems in smart manufacturing environments.
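A minimal Keras sketch of the fine-tuning recipe reported above (last 14 backbone layers trainable, with a freshly trained classification head in place of the excluded fully connected layer) might look like the following; the input size, optimizer, and learning rate are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.Xception(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(299, 299, 3))

# Freeze everything, then unfreeze the last 14 backbone layers.
base.trainable = True
for layer in base.layers[:-14]:
    layer.trainable = False

model = models.Sequential([
    base,
    layers.Dense(1, activation="sigmoid"),  # defective vs. non-defective
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```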
This project tackled the challenge of recognizing pedestrian attributes in surveillance scenarios, motivated by the growing demand for automated security, soft biometrics, and person retrieval systems. The research was conducted by Saadman Sakib, Kaushik Deb, Pranab Kumar Dhar (CUET, Bangladesh), and Oh-Jin Kwon (Sejong University, Korea).
The main objectives were:
To develop a framework for recognizing pedestrian attributes (e.g., gender, age, clothing) in images containing multiple pedestrians, particularly in real-world surveillance footage.
To leverage Mask R-CNN for pedestrian extraction and apply transfer learning with multiple CNN architectures (Inception ResNet v2, Xception, ResNet 101 v2, ResNet 152 v2).
To fine-tune the ResNet 152 v2 model by experimenting with freezing different layers, optimizing performance for attribute recognition.
To address class imbalance in datasets (RAP v2 and PARSE100K) using oversampling and a weighted binary cross-entropy loss function.
The proposed framework achieved state-of-the-art results, with 93.41% mean accuracy (mA) on the RAP v2 dataset and 89.24% mA on the PARSE100K dataset, outperforming previous methods. The fine-tuned ResNet 152 v2 architecture, enhanced with oversampling and weighted loss, delivered significant improvements, particularly in handling class imbalance and diverse attribute sets. The experiments confirmed the effectiveness of Mask R-CNN for pedestrian localization and demonstrated that careful model tuning and data balancing are crucial for robust performance. The findings underscore the value of transfer learning and tailored CNN architectures in real-world pedestrian attribute recognition and provide a foundation for future advancements in surveillance analytics and smart city applications.
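To illustrate the weighted binary cross-entropy used to counter class imbalance, here is a minimal multi-label version in TensorFlow/Keras. The specific weighting scheme (inverse positive/negative frequency per attribute) is an assumption for illustration; the paper's exact weights may be computed differently.

```python
import tensorflow as tf

def weighted_bce(pos_ratio):
    """Per-attribute weighted binary cross-entropy for multi-label outputs.

    pos_ratio: array of shape (n_attributes,), the fraction of positive
    labels per attribute; rare positives receive larger weights.
    """
    pos_ratio = tf.constant(pos_ratio, dtype=tf.float32)
    w_pos = 1.0 / (pos_ratio + 1e-7)        # illustrative weighting scheme
    w_neg = 1.0 / (1.0 - pos_ratio + 1e-7)

    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        per_label = -(w_pos * y_true * tf.math.log(y_pred)
                      + w_neg * (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(per_label)

    return loss
```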
Worked on MURA (musculoskeletal radiographs), a large-scale dataset of bone X-rays covering seven bone categories
Performed basic exploratory data analysis, particularly on the class-label distribution
Experimented with the Xception CNN architecture, tuning hyperparameters and applying transfer learning with frozen layers
Plotted ROC curves to select the best model and operating point, using AUC as the comparison metric
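As a sketch of the operating-point selection, the snippet below computes a ROC curve with scikit-learn and picks the threshold maximising Youden's J (TPR minus FPR). The selection criterion and the toy labels/scores are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: binary labels, y_score: model probabilities (toy placeholders).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))

# Youden's J picks the threshold maximising (TPR - FPR).
best = np.argmax(tpr - fpr)
print("threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])
```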
Preprocessed the MRI images and labels with interpolation techniques
Experimented with Inception V3 CNN architecture by tuning the hyperparameters
Predicted landmark points to localize the epicardium and endocardium layers
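A minimal sketch of the kind of interpolation-based preprocessing described above is shown below; the target grid size is an assumption, and the split between bicubic interpolation for images and nearest-neighbour for label maps is standard practice rather than a detail taken from the project.

```python
import cv2

def resize_pair(image, mask, size=(256, 256)):
    """Resize an MRI slice and its label map to a common grid.

    Bicubic interpolation suits intensity images; nearest-neighbour keeps
    label maps free of interpolated (invalid) class values.
    """
    img = cv2.resize(image, size, interpolation=cv2.INTER_CUBIC)
    lbl = cv2.resize(mask, size, interpolation=cv2.INTER_NEAREST)
    return img, lbl
```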
Displayed the number of infections by Division, District, and City in Bangladesh
Used Tableau to visualize the data collected from government websites
Preprocessed the images and labels for bounding box prediction
Applied the ResNet 152 v2 model for training, using the Keras framework
Used OpenCV to display the predicted bounding box
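Putting these three steps together, here is a minimal sketch: a ResNet152V2 backbone with a four-unit regression head predicting one normalised box per image, displayed with OpenCV. The single-box head, input size, and preprocessing are illustrative assumptions, and the training loop is omitted.

```python
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# ResNet152V2 backbone with a 4-unit regression head, predicting one
# normalised box (x_min, y_min, x_max, y_max) per image.
base = tf.keras.applications.ResNet152V2(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))
model = models.Sequential([base, layers.Dense(4, activation="sigmoid")])

def draw_prediction(image_bgr):
    """Run the model on one image and draw the predicted box with OpenCV."""
    h, w = image_bgr.shape[:2]
    inp = cv2.resize(image_bgr, (224, 224))[None].astype(np.float32)
    inp = tf.keras.applications.resnet_v2.preprocess_input(inp)
    x1, y1, x2, y2 = model.predict(inp, verbose=0)[0]
    cv2.rectangle(image_bgr, (int(x1 * w), int(y1 * h)),
                  (int(x2 * w), int(y2 * h)), (0, 255, 0), 2)
    return image_bgr
```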