1. Medical image segmentation using CT or MRI
Task: lung nodule (cancer) segmentation
Data:
LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative), a reference database of lung nodules on CT scans
BraTS (Brain Tumor Segmentation)
Links:
Background & Significance
Medical image segmentation is the process of partitioning images (from modalities such as CT, MRI, PET, and ultrasound) into meaningful regions or structures.
It plays a crucial role in diagnostics, treatment planning, and quantitative analysis by isolating organs, lesions, or other regions of interest for further study.
Key Challenges
Variability in Modalities and Appearance: Different imaging techniques produce diverse contrast levels, noise characteristics, and artifacts.
Ambiguous Boundaries: Lesions and organs often present with fuzzy or overlapping boundaries due to partial volume effects.
Heterogeneity: Intra- and inter-patient variability demands robust algorithms that generalize well over a wide range of cases.
Deep Learning Approaches
Convolutional neural networks (CNNs) like U-Net and Fully Convolutional Networks (FCNs) perform end-to-end segmentation by learning hierarchical representations directly from the data.
Recent advances incorporate attention mechanisms and transformer-based modules to improve the detection of fine details and contextual information.
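As a concrete illustration, below is a minimal U-Net-style sketch in PyTorch, assuming a single-channel CT slice as input; the TinyUNet name and all layer sizes are illustrative placeholders, not a published architecture.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # two 3x3 convolutions with ReLU, as in U-Net-style encoders/decoders
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    class TinyUNet(nn.Module):
        def __init__(self, in_ch=1, n_classes=1):
            super().__init__()
            self.enc1 = conv_block(in_ch, 16)
            self.enc2 = conv_block(16, 32)
            self.pool = nn.MaxPool2d(2)
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
            self.dec1 = conv_block(32, 16)            # 32 = 16 (skip) + 16 (upsampled)
            self.head = nn.Conv2d(16, n_classes, 1)   # per-pixel logits

        def forward(self, x):
            e1 = self.enc1(x)                  # high-resolution features
            e2 = self.enc2(self.pool(e1))      # downsampled, more abstract features
            d1 = self.up(e2)                   # upsample back to input resolution
            d1 = self.dec1(torch.cat([d1, e1], dim=1))  # skip connection
            return self.head(d1)               # (B, n_classes, H, W) logits

    # toy forward pass on a fake single-channel 128x128 "CT slice"
    model = TinyUNet()
    logits = model(torch.randn(2, 1, 128, 128))
    mask = torch.sigmoid(logits) > 0.5          # binary nodule mask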
Evaluation Metrics and Future Directions
Segmentation accuracy is typically measured using the Dice Similarity Coefficient (DSC), Intersection over Union (IoU), sensitivity, specificity, and boundary-based metrics such as the Hausdorff Distance; a small Dice/IoU sketch follows at the end of this section.
Future research is focused on enhancing model robustness across heterogeneous datasets and integrating multimodal imaging data to support personalized medicine.
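A small NumPy sketch of the Dice and IoU overlap metrics mentioned above, using toy binary masks:

    import numpy as np

    def dice_and_iou(pred, target, eps=1e-7):
        # pred, target: binary masks of the same shape, e.g. 0/1 nodule segmentations
        pred = pred.astype(bool)
        target = target.astype(bool)
        intersection = np.logical_and(pred, target).sum()
        union = np.logical_or(pred, target).sum()
        dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
        iou = (intersection + eps) / (union + eps)
        return dice, iou

    pred = np.zeros((64, 64), dtype=np.uint8); pred[10:30, 10:30] = 1
    gt = np.zeros((64, 64), dtype=np.uint8);   gt[15:35, 15:35] = 1
    print(dice_and_iou(pred, gt))   # partially overlapping squares -> partial Dice/IoU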
2. Language-based audio retrieval
Task: retrieve audio files relevant to a given text query
Data: Clotho dataset
Challenge: Detection and Classification of Acoustic Scenes and Events (DCASE)
Links:
Background & Significance
Language-based audio retrieval refers to the process of using natural language queries to search and retrieve relevant audio content (e.g., speeches, podcasts, music) from large multimedia databases.
This cross-modal retrieval task is vital for enhancing multimedia search engines, enabling users to access specific audio segments based on descriptive language inputs.
Key Challenges
Heterogeneity of Data: Audio recordings vary in quality, language, accent, background noise, and recording conditions, making it challenging to extract consistent semantic cues.
Semantic Gap: Bridging the gap between low-level audio features (like spectrograms or MFCCs) and high-level textual semantics remains a fundamental challenge.
Temporal Dynamics: Audio is inherently time-dependent, so aligning dynamic audio streams with static or temporally varying textual queries is complex.
Deep Learning Techniques
Modern methods embed both audio and text data into a shared latent space using end-to-end deep learning models, enabling more effective cross-modal retrieval.
Architectures often incorporate convolutional or recurrent networks—and increasingly transformer models—to capture both the spectral and temporal features of audio along with textual semantics.
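A minimal sketch of the shared-embedding-space idea, assuming pooled audio and text feature vectors are already available; the DualEncoder module and all dimensions are illustrative assumptions, not a specific published model:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualEncoder(nn.Module):
        # projects audio and text features into one shared embedding space
        def __init__(self, audio_dim=128, text_dim=300, embed_dim=256):
            super().__init__()
            self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(),
                                            nn.Linear(embed_dim, embed_dim))
            self.text_proj = nn.Sequential(nn.Linear(text_dim, embed_dim), nn.ReLU(),
                                           nn.Linear(embed_dim, embed_dim))

        def forward(self, audio_feats, text_feats):
            a = F.normalize(self.audio_proj(audio_feats), dim=-1)
            t = F.normalize(self.text_proj(text_feats), dim=-1)
            return a, t

    model = DualEncoder()
    audio_db = torch.randn(1000, 128)      # pooled features for 1000 audio clips
    query = torch.randn(1, 300)            # pooled embedding of one text query
    a, t = model(audio_db, query)
    scores = t @ a.T                       # cosine similarities (both sides are unit-normalized)
    top5 = scores.topk(5, dim=-1).indices  # indices of the 5 best-matching clips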
Evaluation Metrics & Future Directions
Retrieval performance is commonly assessed using metrics such as mean average precision (mAP), recall, and ranking scores; a recall@k sketch follows at the end of this section.
Future research aims to improve multimodal representation learning, better handle noisy or low-resource audio data, and develop zero-shot retrieval capabilities that can generalize to unseen queries or languages.
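A small sketch of recall@k over a query-to-audio similarity matrix, assuming one relevant clip per query; the scores below are random stand-ins:

    import numpy as np

    def recall_at_k(scores, target_idx, k=10):
        # scores: (n_queries, n_audio) similarity matrix
        # target_idx: index of the single relevant audio clip for each query
        ranked = np.argsort(-scores, axis=1)                       # best match first
        hits = (ranked[:, :k] == target_idx[:, None]).any(axis=1)
        return hits.mean()

    scores = np.random.rand(100, 1000)
    targets = np.random.randint(0, 1000, size=100)
    print(recall_at_k(scores, targets, k=10))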
3. Visual object tracking
Background & Significance
Visual object tracking involves continuously localizing a target object across video frames, based on its appearance and motion.
It is a key technology in surveillance, autonomous driving, robotics, and human–computer interaction, enabling systems to monitor objects in dynamic environments.
Key Challenges
Occlusion & Deformation: Objects may be partially or fully occluded or change shape due to articulation or deformation.
Appearance Changes: Variations in lighting, scale, rotation, and background clutter can significantly alter the target’s visual appearance over time.
Real-Time Processing: Achieving high accuracy while maintaining real-time speed is crucial for many practical applications.
Deep Learning Techniques
Recent advancements employ convolutional neural networks (CNNs) to learn robust feature representations and capture object appearance variations.
Siamese network architectures compare target and candidate regions to determine similarity, while recurrent models help capture temporal dynamics.
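A minimal sketch of Siamese similarity matching via cross-correlation (in the spirit of SiamFC), with a toy backbone; the module name and sizes are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinySiamese(nn.Module):
        # one shared backbone embeds both the target template and the search region
        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )

        def forward(self, template, search):
            z = self.backbone(template)   # (1, C, h, w)  target features
            x = self.backbone(search)     # (1, C, H, W)  search-region features
            # cross-correlate: use the template features as a convolution kernel
            return F.conv2d(x, z)         # (1, 1, H-h+1, W-w+1) response map

    model = TinySiamese()
    template = torch.randn(1, 3, 64, 64)    # crop around the target in frame t
    search = torch.randn(1, 3, 128, 128)    # larger search crop in frame t+1
    response = model(template, search)
    peak = response.flatten(2).argmax(-1)   # peak location ~ new target position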
Evaluation Metrics & Future Directions
Common metrics include precision plots, success rate (overlap ratio), and frames per second (FPS) to gauge both accuracy and real-time capability; a success-rate sketch follows at the end of this section.
Future research aims to improve robustness against occlusions and rapid appearance changes, enhance scale and rotation estimation, and optimize models for real-time performance with minimal computational resources.
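A small sketch of the success-rate curve computed from per-frame overlap (IoU) values; the overlaps below are synthetic stand-ins for a tracker's output:

    import numpy as np

    def success_curve(ious, thresholds=np.linspace(0, 1, 21)):
        # fraction of frames whose overlap with ground truth exceeds each threshold
        ious = np.asarray(ious)
        return np.array([(ious > t).mean() for t in thresholds])

    per_frame_iou = np.random.beta(5, 2, size=500)   # stand-in for per-frame overlaps
    curve = success_curve(per_frame_iou)
    auc = curve.mean()                               # area under the success plot
    print(f"success AUC: {auc:.3f}")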
4. Automated audio captioning
Background & Significance
Automated audio captioning involves generating natural language descriptions of audio clips (e.g., sounds, events, ambiance) without human intervention.
This technology enhances multimedia search, supports accessibility (e.g., for visually impaired users), and aids in content analysis in smart environments and digital media archives.
Key Challenges
Semantic Gap: Bridging low-level acoustic features (e.g., spectrograms, MFCCs) and high-level semantic information described in language remains complex.
Variability and Ambiguity: Diverse audio types, background noise, and overlapping sound events make it difficult to extract consistent descriptive features.
Temporal Dynamics: Audio is inherently time-dependent; aligning dynamic audio events with coherent, temporally ordered captions is challenging.
Deep Learning Techniques
Modern approaches employ encoder-decoder architectures (using CNNs or RNNs) to directly map audio representations (often derived from spectrograms) to textual descriptions.
Attention mechanisms and transformer-based models are increasingly incorporated to better align audio events with the generated words and capture long-range dependencies.
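A compact encoder-decoder sketch in PyTorch, assuming a log-mel spectrogram input and a toy vocabulary; the architecture, names, and sizes are illustrative, not a specific published captioner:

    import torch
    import torch.nn as nn

    class TinyAudioCaptioner(nn.Module):
        def __init__(self, n_mels=64, d_model=128, vocab_size=1000):
            super().__init__()
            # encoder: project each spectrogram frame to d_model (a CNN/CRNN would be typical)
            self.encoder = nn.Linear(n_mels, d_model)
            self.embed = nn.Embedding(vocab_size, d_model)
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, mel, tokens):
            # mel: (B, T, n_mels) log-mel frames; tokens: (B, L) caption generated so far
            memory = self.encoder(mel)                       # audio "memory" for cross-attention
            tgt = self.embed(tokens)
            L = tokens.size(1)
            causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
            h = self.decoder(tgt, memory, tgt_mask=causal)   # attends over audio frames
            return self.out(h)                               # next-token logits

    model = TinyAudioCaptioner()
    mel = torch.randn(2, 200, 64)               # 2 clips, 200 frames of 64 mel bins
    tokens = torch.randint(0, 1000, (2, 12))    # partial captions
    logits = model(mel, tokens)                 # (2, 12, vocab_size)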
Evaluation Metrics & Future Directions
Captioning quality is measured using metrics borrowed from machine translation and image captioning (e.g., BLEU, METEOR, CIDEr, ROUGE); a BLEU sketch follows at the end of this section.
Future research aims to improve multimodal fusion, enhance language coherence and context awareness, and develop models that generalize well across diverse audio domains.
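For instance, BLEU against a reference caption can be computed with NLTK; the tokens below are made up, and METEOR/CIDEr require their own caption-evaluation toolkits:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [["a", "dog", "barks", "while", "cars", "pass", "by"]]   # tokenized reference caption(s)
    hypothesis = ["a", "dog", "is", "barking", "near", "traffic"]         # tokenized model output
    smooth = SmoothingFunction().method1                                  # avoids zero scores on short captions
    score = sentence_bleu(references, hypothesis, smoothing_function=smooth)
    print(f"BLEU: {score:.3f}")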
5. Language-queried audio source separation
Background & Significance
This task involves isolating a specific audio source from a mixture based on a natural language query (e.g., “extract the guitar,” “separate the human speech”).
It bridges audio signal processing with language understanding, enabling user-driven content manipulation and improved accessibility in multimedia applications.
Key Challenges
Cross-Modal Gap: The task must align low-level acoustic signals with high-level textual semantics, which involves learning compatible representations between audio and language modalities.
Ambiguous Queries: Natural language inputs can be vague or diverse in expression, posing difficulties in accurately understanding user intent.
Overlapping Sources & Noise: Audio mixtures often contain interfering sources or background noise, making it challenging to isolate the target signal precisely.
Deep Learning Techniques
Modern approaches leverage end-to-end deep learning architectures that jointly embed audio (e.g., via convolutional or recurrent neural networks processing spectrograms) and language (via transformer or RNN-based encoders) into a shared latent space.
The learned representations are fused—often through attention mechanisms or conditional layers—to generate time–frequency masks that selectively extract the queried source.
Recent models aim to refine alignment between the query and audio features, improving both the quality of separation (using metrics like SI-SDR) and interpretability.
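A minimal sketch of query-conditioned mask estimation, assuming a text embedding of the query is already available; FiLM-style conditioning and all names and sizes are illustrative choices, not a specific published system:

    import torch
    import torch.nn as nn

    class QueryConditionedMasker(nn.Module):
        # predicts a time-frequency mask for the source described by the text query
        def __init__(self, n_freq=257, text_dim=256, hidden=256):
            super().__init__()
            self.audio_net = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
            self.film = nn.Linear(text_dim, 2 * hidden)     # per-channel scale and shift
            self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

        def forward(self, mixture_spec, text_emb):
            # mixture_spec: (B, T, n_freq) magnitude spectrogram; text_emb: (B, text_dim)
            h = self.audio_net(mixture_spec)
            gamma, beta = self.film(text_emb).chunk(2, dim=-1)
            h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)   # condition audio features on the query
            mask = self.mask_head(h)                         # mask values in [0, 1]
            return mask * mixture_spec                       # estimated target magnitude

    model = QueryConditionedMasker()
    mix = torch.rand(2, 100, 257)        # 2 mixtures, 100 spectrogram frames each
    query = torch.randn(2, 256)          # e.g. embedding of "extract the guitar"
    target_est = model(mix, query)       # separated-source magnitude estimate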
Evaluation Metrics & Future Directions
Metrics: Performance is typically assessed using audio separation quality measures such as Signal-to-Distortion Ratio (SDR), Scale-Invariant SDR (SI-SDR), and perceptual metrics (e.g., PESQ), alongside evaluation of query relevance; an SI-SDR sketch follows at the end of this section.
Future Work: Research is directed toward more robust cross-modal representations, handling diverse and ambiguous queries, real-time processing, and integrating additional modalities (e.g., video) to further refine source separation under complex real-world conditions.
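A NumPy sketch of SI-SDR following the standard scale-invariant definition; the waveforms below are synthetic:

    import numpy as np

    def si_sdr(estimate, reference, eps=1e-8):
        # scale-invariant SDR in dB between an estimated and a reference waveform
        estimate = estimate - estimate.mean()
        reference = reference - reference.mean()
        alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = alpha * reference            # optimally scaled reference
        noise = estimate - target
        return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))

    ref = np.sin(np.linspace(0, 100, 16000))          # 1 s "clean" source
    est = ref + 0.1 * np.random.randn(16000)          # noisy estimate of it
    print(f"SI-SDR: {si_sdr(est, ref):.1f} dB")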
6. Unsupervised anomalous sound detection for machine condition monitoring
Background & Significance
Unsupervised anomalous sound detection (ASD) for machine condition monitoring aims to identify abnormal acoustic patterns that may indicate mechanical failures, without requiring labeled anomaly data.
This approach is crucial in industrial environments where collecting diverse and annotated fault data is costly or infeasible.
ASD enables predictive maintenance, improves operational safety, and reduces downtime by detecting faults early based on deviations from normal operating sound.
Key Challenges
Lack of Anomaly Labels: Since anomalies are rare and unpredictable, labeled datasets for training are often unavailable, requiring models to learn normal patterns exclusively.
Environmental Noise: Background noise and varying acoustic conditions can obscure subtle anomalies, especially in real-world factory or outdoor settings.
Domain Generalization: Variability across machine types, individual machines, and operational contexts requires models that generalize well to unseen scenarios.
Temporal Variability: Machine sounds evolve over time due to wear or load changes, making static modeling insufficient.
Deep Learning Techniques
Recent methods leverage autoencoders, variational autoencoders (VAE), and normalizing flows to model the distribution of normal sounds.
Anomalies are detected based on reconstruction errors, likelihood estimation, or embedding distance in learned latent spaces.
Spectrogram-based CNNs are widely used to extract robust features from time–frequency representations.
Self-supervised contrastive learning helps in learning generalizable representations without labels.
Transformer-based temporal models are emerging to capture long-term dependencies in sound sequences.
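A minimal autoencoder sketch of the reconstruction-error approach, assuming flattened spectrogram patches of normal machine sound; the sizes and toy training loop are illustrative:

    import torch
    import torch.nn as nn

    class SpectrogramAE(nn.Module):
        # trained only on normal sounds; anomalous sounds reconstruct poorly
        def __init__(self, n_features=640):   # e.g. 5 frames x 128 mel bins, flattened
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, 8))
            self.decoder = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, n_features))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = SpectrogramAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    normal_batch = torch.randn(64, 640)                  # stand-in for normal-sound features
    for _ in range(10):                                  # toy training loop on normal data only
        recon = model(normal_batch)
        loss = nn.functional.mse_loss(recon, normal_batch)
        opt.zero_grad(); loss.backward(); opt.step()

    test_clip = torch.randn(1, 640)
    score = nn.functional.mse_loss(model(test_clip), test_clip).item()   # reconstruction-error anomaly score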
Evaluation Metrics & Future Directions
Metrics: Area Under the ROC Curve (AUC), Precision-Recall curves, Equal Error Rate (EER), and inference latency are commonly used to assess detection performance and real-time feasibility; an AUC sketch follows at the end of this section.
Future Directions:
Improved domain adaptation to transfer knowledge across different machines and environments.
Integration of multimodal data (e.g., vibration, video) for more robust detection.
Lightweight models for edge deployment in industrial IoT settings.
Generative approaches to simulate rare anomalies for better evaluation and training.
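Given per-clip anomaly scores and ground-truth labels, the AUC mentioned above can be computed with scikit-learn; the data below is synthetic:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    labels = np.array([0] * 90 + [1] * 10)                                    # 0 = normal, 1 = anomalous
    scores = np.concatenate([np.random.rand(90), np.random.rand(10) + 0.5])   # higher = more anomalous
    print(f"AUC: {roc_auc_score(labels, scores):.3f}")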
7. Speech signal-based depression diagnosis using deep learning
8. Toxicity prediction for chemicals based on molecular structure
9. Time series forecasting