We introduce FSSM, the first state space model with frequency-selective spectral operators, parameterizing a family of stable, causal, band-selective kernels whose spectral weights are conditioned on the end task. This yields a representation that adapts its frequency characteristics to the task domain while retaining linear-time inference and memory. The key novelty is a trainable spectral front-end through which the model can adapt its frequency weighting and inter-bin window size. We show the effectiveness of our learned spectral representations on two independent domains, radar object detection and speech keyword recognition, outperforming state-of-the-art frequency-based methods in both while maintaining competitive throughput and computational overhead. We further show the robustness of our approach under input perturbations, demonstrating the value of stabilized sequential operators in spectral representation learning.
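As an illustration of the core idea, here is a minimal sketch of a trainable spectral front-end in PyTorch: a learnable per-bin weighting applied in the frequency domain before the sequence model. The class name, the exponential weight parameterization, and the shapes are our own assumptions; the band-selective kernels and the inter-bin window adaptation of the full model are not shown.

```python
import torch
import torch.nn as nn

class SpectralFrontEnd(nn.Module):
    """Toy trainable spectral front-end: learns a per-frequency-bin weighting
    applied in the frequency domain before the downstream sequence model."""
    def __init__(self, seq_len: int):
        super().__init__()
        n_bins = seq_len // 2 + 1                          # rFFT bin count
        self.log_weights = nn.Parameter(torch.zeros(n_bins))  # zeros = all-pass init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len); weight each frequency bin, return to time domain
        X = torch.fft.rfft(x, dim=-1)
        X = X * torch.exp(self.log_weights)                # positive weights per bin
        return torch.fft.irfft(X, n=x.shape[-1], dim=-1)

front = SpectralFrontEnd(seq_len=256)
y = front(torch.randn(4, 256))       # (4, 256), differentiable w.r.t. the bin weights
print(y.shape)
```

Because the weights enter through a differentiable FFT, the frequency emphasis is learned end to end with the task loss, which is the sense in which the front-end is "conditioned on the end task."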
We introduce SSMRadNet, the first multi-scale State Space Model (SSM)-based detector for Frequency Modulated Continuous Wave (FMCW) radar that sequentially processes raw ADC samples through two SSMs. One SSM learns a chirp-wise feature by sequentially processing samples from all receiver channels within one chirp, and a second SSM learns a representation of a frame by sequentially processing the chirp-wise features. The latent representations of a radar frame are decoded to perform segmentation and detection tasks. Comprehensive evaluations on the RADIal dataset show SSMRadNet has 10-33× fewer parameters, 60-88× less computation (GFLOPs), and 3.7× faster inference than state-of-the-art transformer- and convolution-based radar detectors at competitive performance on segmentation tasks.
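A minimal sketch of the two-level hierarchy, assuming a toy diagonal linear SSM and a hypothetical ADC cube shape; the actual SSM parameterization and decode heads of SSMRadNet are not specified here.

```python
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Minimal diagonal linear SSM: h_t = a * h_{t-1} + B x_t, output C h_T."""
    def __init__(self, d_in: int, d_state: int):
        super().__init__()
        self.a_logit = nn.Parameter(torch.randn(d_state))   # sigmoid keeps |a| < 1 (stable)
        self.B = nn.Linear(d_in, d_state, bias=False)
        self.C = nn.Linear(d_state, d_state, bias=False)

    def forward(self, x):                       # x: (batch, T, d_in)
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros(x.shape[0], a.shape[0], device=x.device)
        u = self.B(x)
        for t in range(x.shape[1]):             # sequential scan over the sequence
            h = a * h + u[:, t]
        return self.C(h)                        # summary of the whole sequence

# Hierarchy: a chirp-level SSM over ADC samples, a frame-level SSM over chirp features.
batch, chirps, samples, rx = 2, 8, 64, 4
adc = torch.randn(batch, chirps, samples, rx)   # raw ADC cube (hypothetical shape)
chirp_ssm = SimpleSSM(d_in=rx, d_state=32)
frame_ssm = SimpleSSM(d_in=32, d_state=64)
chirp_feats = torch.stack([chirp_ssm(adc[:, c]) for c in range((chirps))], dim=1)  # (B, chirps, 32)
frame_repr = frame_ssm(chirp_feats)             # (B, 64) frame latent for the decode heads
print(frame_repr.shape)
```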
In this work, we introduce an adaptive hierarchical framework for efficient 3D object detection from point cloud data, designed to dynamically balance computational efficiency and detection performance. Our approach employs a shared feature extractor and multiple detector backbones of varying widths, enabling selective activation of models based on the complexity of the input scene. A novel feature gating mechanism dynamically selects the features most relevant to the reduced-width backbones, while a surrogate loss prediction module ranks the backbones in real time, ensuring optimal backbone selection with minimal overhead. This adaptive strategy reduces compute costs by 41.4% while incurring only a 2.44% reduction in detection accuracy across a range of real-world driving scenes (urban, highway, residential, campus, person) from the KITTI dataset. By addressing runtime adaptability, a critical gap in existing 3D detection frameworks, our method provides a significant algorithmic improvement for high-performance detection models in resource-constrained environments.
Find our work in IEEE Robotics and Automation Letters.
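A toy sketch of the selection mechanism, assuming a pooled scene feature, per-backbone sigmoid gates, and "pick the backbone with the lowest predicted loss" as the ranking rule; all module names and dimensions here are illustrative inventions, not the published architecture.

```python
import torch
import torch.nn as nn

class AdaptiveDetector(nn.Module):
    """Shared extractor + backbones of varying width; a surrogate head predicts
    each backbone's loss so a suitable backbone can be chosen per scene."""
    def __init__(self, d_feat=128, widths=(32, 64, 128)):
        super().__init__()
        self.extractor = nn.Sequential(nn.Linear(16, d_feat), nn.ReLU())
        # Feature gating: a per-backbone soft mask over the shared features.
        self.gates = nn.ModuleList(nn.Linear(d_feat, d_feat) for _ in widths)
        self.backbones = nn.ModuleList(
            nn.Sequential(nn.Linear(d_feat, w), nn.ReLU(), nn.Linear(w, 8))
            for w in widths)
        self.surrogate = nn.Linear(d_feat, len(widths))   # predicted loss per backbone

    def forward(self, x):
        f = self.extractor(x).mean(dim=1)                 # pooled scene feature
        pred_losses = self.surrogate(f)                   # (B, n_backbones)
        idx = pred_losses.argmin(dim=-1)                  # rank and select per scene
        outs = []
        for b, i in enumerate(idx.tolist()):
            g = torch.sigmoid(self.gates[i](f[b:b + 1]))  # gate the shared features
            outs.append(self.backbones[i](g * f[b:b + 1]))
        return torch.cat(outs, dim=0), pred_losses

model = AdaptiveDetector()
points = torch.randn(4, 100, 16)      # 4 scenes, 100 points, toy per-point features
dets, est = model(points)
print(dets.shape, est.shape)          # (4, 8) (4, 3)
```

Only the selected backbone needs to run at inference time, which is where the compute saving comes from; the surrogate head adds one small linear layer of overhead.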
Dermoscopic images ideally depict pigmentation attributes on the skin surface, which are highly regarded in the medical community for detecting skin abnormalities, disease, or even cancer. Identifying such abnormalities, however, requires a trained eye, and accurate detection makes the process time-intensive. As such, computerized detection schemes have become essential, especially those that adopt deep learning techniques. In this paper, a convolutional deep neural network, S2C-DeLeNet, is proposed, which (i) segments lesion regions from the unaffected skin tissue in dermoscopic images using a segmentation sub-network, and (ii) classifies each image by its medical condition type using parameters transferred from the inherent segmentation sub-network.
Find the paper in Computers in Biology and Medicine (Elsevier).
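A minimal sketch of the segment-then-classify idea, assuming a tiny shared convolutional encoder; the real S2C-DeLeNet architecture, depths, and class count are not reproduced here. The point is that the classification head consumes the same encoder features that the segmentation decoder was trained on, so its parameters transfer from the segmentation task.

```python
import torch
import torch.nn as nn

class S2CSketch(nn.Module):
    """Hypothetical sketch: a shared encoder drives both a segmentation
    decoder (lesion vs. skin mask) and a classification head."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.seg_decoder = nn.Sequential(                 # upsample back to a mask
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 2, stride=2))
        self.cls_head = nn.Sequential(                    # reuses encoder features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes))

    def forward(self, x):
        z = self.encoder(x)                               # shared representation
        return self.seg_decoder(z), self.cls_head(z)

net = S2CSketch()
mask, logits = net(torch.randn(2, 3, 128, 128))
print(mask.shape, logits.shape)       # (2, 1, 128, 128) (2, 7)
```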
Reduced blood flow due to the blocking of blood vessels is a significant pathological feature of brains affected by Alzheimer's disease. These blocks are identified from Two-Photon Excitation Microscopy (TPEF) of the brain, which yields spatially and depth-time varying image samples of the vessel structures. In this study, we propose preprocessing techniques on such data to help identify stalled and non-stalled brain capillaries. We feed processed image data and point cloud data into two separate streams of state-of-the-art video classification networks. Since the two streams have different modalities and contain complementary information, an early fusion of the two streams allows the combined model to achieve better performance in classifying stalled and non-stalled vessels. Our experimental results on the Clog Loss dataset show that our proposed technique consistently improves the performance of all baseline methods.
Published in ICASSP Workshops 2023.
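A toy sketch of the early-fusion point, with lazy linear layers standing in for the two state-of-the-art backbones (which the paper takes from existing video classification networks); shapes and the embedding size are illustrative.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Toy early fusion: embed each modality, concatenate, classify jointly."""
    def __init__(self, d=64):
        super().__init__()
        self.video_stream = nn.Sequential(nn.Flatten(1), nn.LazyLinear(d), nn.ReLU())
        self.cloud_stream = nn.Sequential(nn.Flatten(1), nn.LazyLinear(d), nn.ReLU())
        self.head = nn.Linear(2 * d, 2)              # stalled vs. non-stalled

    def forward(self, frames, points):
        f = self.video_stream(frames)                # (B, d) video embedding
        p = self.cloud_stream(points)                # (B, d) point-cloud embedding
        return self.head(torch.cat([f, p], dim=-1))  # fuse before the classifier

clf = EarlyFusionClassifier()
logits = clf(torch.randn(2, 8, 3, 32, 32),           # 8-frame toy video clips
             torch.randn(2, 500, 3))                 # 500-point toy clouds
print(logits.shape)                                  # (2, 2)
```

Fusing before the classifier lets the decision layer see both modalities jointly, which is what lets the complementary information interact.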
The dataset provided for our task consists of traffic images at intersections captured by fisheye cameras. The first task is to detect the vehicles present in each frame; the second is to track the detected vehicles and obtain their trajectories. The dataset poses a few challenges: the fisheye lens introduces spherical distortion, and the overhead camera positioning yields vehicles in widely varying orientations. The dataset also contains samples from different times of day, namely day and night. In our detection framework, we first classify each frame as day or night using a SqueezeNet model and then select the model weights appropriate for that lighting condition. We use two state-of-the-art models, UniverseNet and YOLOv5, in parallel to detect the vehicles. After concatenating the detections from the two models, we apply non-maximum suppression (NMS) to generate the final detections, as sketched below. For tracking the detected vehicles, we use the SORT algorithm.
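A sketch of the routing-and-merging logic, assuming detectors are callables returning (N, 4) xyxy boxes and (N,) scores, and that class 0 of the day/night classifier means "day"; the stand-in models below exist only to smoke-test the flow.

```python
import torch
from torchvision.ops import nms

def detect(frame, day_night_net, detectors_day, detectors_night, iou_thr=0.5):
    """Route by lighting, run two detectors in parallel, merge boxes with NMS.
    Detector and classifier interfaces are assumed, not the published ones."""
    is_day = day_night_net(frame).argmax(-1).item() == 0   # 0 = day (assumed label)
    detectors = detectors_day if is_day else detectors_night
    boxes, scores = [], []
    for det in detectors:                    # e.g. UniverseNet and YOLOv5 wrappers
        b, s = det(frame)                    # (N, 4) xyxy boxes and (N,) scores
        boxes.append(b)
        scores.append(s)
    boxes, scores = torch.cat(boxes), torch.cat(scores)
    keep = nms(boxes, scores, iou_thr)       # suppress duplicate detections
    return boxes[keep], scores[keep]

# Minimal smoke test with stand-in models.
dummy = lambda f: (torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]]),
                   torch.tensor([0.9, 0.8]))
day_net = lambda f: torch.tensor([1.0, 0.0])
b, s = detect(torch.zeros(3, 64, 64), day_net, [dummy, dummy], [dummy, dummy])
print(b.shape, s.shape)                      # overlapping duplicates removed
```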
Anomaly detection from drone flight data is a challenging task. In this work, we propose an ensemble of classical and neural-network-based approaches to automatically localize and quantify anomalies in drone flight data in both the time and sensor domains. Our experiments show that using IMU sensor values is the most effective approach for this task.
GitHub link: https://github.com/ClockWorkKid/SPCUP2020-BUET-Synapticans
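A minimal sketch of the ensembling idea on toy IMU data: a z-score detector and a rolling-mean residual detector stand in for the classical members (the neural members are omitted for brevity), combined by majority vote. Thresholds, window size, and the injected anomaly are all illustrative.

```python
import numpy as np

def zscore_anomaly(x, thr=3.0):
    """Classical detector: flag timesteps far from each channel's mean."""
    z = np.abs((x - x.mean(0)) / (x.std(0) + 1e-8))
    return (z > thr).any(axis=1)             # per-timestep anomaly flag

def residual_anomaly(x, win=20, thr=4.0):
    """Flag timesteps deviating strongly from a trailing rolling mean."""
    pad = np.pad(x, ((win, 0), (0, 0)), mode='edge')
    roll = np.stack([pad[i:i + len(x)] for i in range(win)]).mean(0)
    return (np.abs(x - roll) > thr).any(axis=1)

def ensemble(flags_list):
    """Majority vote over detectors (classical and learned alike)."""
    votes = np.stack(flags_list).sum(0)
    return votes > len(flags_list) / 2

imu = np.random.randn(1000, 6)               # toy 6-axis IMU stream
imu[400:410] += 8                            # injected anomaly window
flags = ensemble([zscore_anomaly(imu), residual_anomaly(imu)])
print(np.where(flags)[0][:5])                # localized indices near t = 400
```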
Automatic speech recognition (ASR) converts the human voice into readily understandable and categorized text or words. Although Bengali is one of the most widely spoken languages in the world, there have been very few studies on Bengali ASR, particularly on Bangladeshi-accented Bengali. In this study, audio recordings of spoken digits (0-9) from university students were used to create a Bengali speech digits dataset that may be employed to train artificial neural networks for voice-based digital input systems. This paper also compares the Bengali digit recognition accuracy of several Convolutional Neural Networks (CNNs) using spectrograms and shows that a test accuracy of 98.23% is achievable using parameter-efficient models such as SqueezeNet on our dataset.
Submitted to ICECE 2022. The dataset is available on Kaggle: https://www.kaggle.com/datasets/mirsayeed/banglanum-bengali-number-recognition-from-voice
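A minimal sketch of the spectrogram-plus-CNN pipeline; the 16 kHz rate, the STFT parameters, and the tiny CNN are our own stand-ins (the paper evaluates several published CNNs, such as SqueezeNet).

```python
import torch
import torch.nn as nn

def spectrogram(wave, n_fft=256, hop=128):
    """Log-magnitude spectrogram of a mono waveform batch: (B, T) -> (B, 1, F, frames)."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.log1p(spec.abs()).unsqueeze(1)

# Small CNN classifier over spectrograms for the ten spoken Bengali digits.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 10))

wave = torch.randn(8, 16000)                 # 1 s of audio at an assumed 16 kHz rate
logits = cnn(spectrogram(wave))
print(logits.shape)                          # (8, 10)
```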