Researcher: Shiv Jatinkumar Malvi
The project focuses on developing a deep learning model to accurately localize and separate sound sources in complex auditory environments. Using both spatial and temporal cues, the model distinguishes multiple sound sources by integrating audio and visual information. Neural network architectures incorporating skip connections are employed to support precise localization and separation. This system has potential applications in areas such as surveillance, multimedia, and robotics, where distinguishing individual sound sources is critical to performance.
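A minimal sketch of this idea, assuming a U-Net-style mask estimator over magnitude spectrograms fused with a pooled visual embedding at the bottleneck; the layer sizes, fusion scheme, and two-source setting are illustrative assumptions, not the project's exact architecture.

```python
# Sketch: audio-visual source separation with encoder-decoder skip connections.
# All sizes and the fusion scheme are illustrative, not the project's design.
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    def __init__(self, visual_dim=512, n_sources=2):
        super().__init__()
        # Encoder: downsample the (1, F, T) magnitude spectrogram twice.
        self.enc1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Project the visual embedding so it can be fused at the bottleneck.
        self.vis_proj = nn.Linear(visual_dim, 64)
        # Decoder with a skip connection from the first encoder stage.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.ConvTranspose2d(64, n_sources, 4, stride=2, padding=1)

    def forward(self, spec, visual_feat):
        # spec: (B, 1, F, T) magnitude spectrogram; visual_feat: (B, visual_dim)
        e1 = self.enc1(spec)                          # (B, 32, F/2, T/2)
        e2 = self.enc2(e1)                            # (B, 64, F/4, T/4)
        v = self.vis_proj(visual_feat)                # (B, 64)
        v = v[:, :, None, None].expand_as(e2)         # broadcast over time-frequency
        d1 = self.dec1(torch.cat([e2, v], dim=1))     # fuse audio and visual cues
        d1 = torch.cat([d1, e1], dim=1)               # skip connection preserves detail
        masks = torch.sigmoid(self.dec2(d1))          # (B, n_sources, F, T)
        return masks * spec                           # per-source spectrograms

# Example: separate a two-source mixture given a pooled visual embedding of the scene.
model = AudioVisualSeparator()
spec = torch.randn(4, 1, 256, 128).abs()   # batch of magnitude spectrograms
vis = torch.randn(4, 512)                  # pooled visual features per clip
separated = model(spec, vis)               # (4, 2, 256, 128)
```

The skip connection concatenates early encoder features into the decoder so fine time-frequency detail lost during downsampling is still available when the masks are predicted.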
Researcher: Kazi Ruslan Rahman
The project delves into the Visual-to-Audio (V2A) conversion process, leveraging a pipeline that translates visual cues into high-quality, context-aware audio. Using the SVA V2A approach, this research aims to create a system that interprets visual data and transforms it into audio signals that preserve the nuances of the original content.
The project incorporates classifier models to organize and categorize video content before audio generation. The classifiers improve accuracy by discerning different contexts and content types within the visual data, enabling the audio output to be tailored to the detected content.
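The classify-then-generate flow could look like the sketch below; the `FrameClassifier` and `ClassConditionedAudioGenerator` modules are hypothetical stand-ins, since the SVA V2A pipeline and its classifier models are not detailed here.

```python
# Sketch: classify the clip first, then condition audio generation on the class.
# Both modules are illustrative placeholders, not the SVA V2A implementation.
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Classifies a clip by averaging per-frame features (toy backbone)."""
    def __init__(self, n_classes=10, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1).mean(dim=1)
        return self.head(feats), feats              # class logits + clip embedding

class ClassConditionedAudioGenerator(nn.Module):
    """Placeholder generator that conditions on the predicted content class."""
    def __init__(self, n_classes=10, feat_dim=256, audio_len=16000):
        super().__init__()
        self.class_emb = nn.Embedding(n_classes, feat_dim)
        self.decoder = nn.Sequential(nn.Linear(2 * feat_dim, 512), nn.ReLU(),
                                     nn.Linear(512, audio_len))

    def forward(self, clip_feat, class_id):
        cond = torch.cat([clip_feat, self.class_emb(class_id)], dim=-1)
        return self.decoder(cond)                   # (B, audio_len) waveform sketch

# Pipeline: classify the clip, then tailor the generated audio to its class.
classifier = FrameClassifier()
generator = ClassConditionedAudioGenerator()
frames = torch.rand(2, 8, 3, 64, 64)                # two clips of 8 RGB frames
logits, clip_feat = classifier(frames)
audio = generator(clip_feat, logits.argmax(dim=-1))
```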
This work could have broad applications, such as assisting visually impaired users in understanding video content, advancing real-time content narration, or improving multimedia accessibility tools.
[Figure: taken from the PySlowFast video classifier demo]
Researcher: Panya Sukphranee
The project centers on developing a diffusion-based deep learning model for generating audio from visual data. By learning from temporally and semantically aligned video-frame features paired with the corresponding Mel-spectrogram data, the model generates a Mel spectrogram through a diffusion process. The generated audio captures characteristics that are semantically and temporally consistent with the visual content. This approach has promising applications in fields like immersive media, virtual reality, and content creation, where audio generation for on-screen action can be automated.
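One way to picture the training objective is a DDPM-style step in which the clean Mel spectrogram is noised and a conditional network regresses that noise from the noisy Mel plus aligned video features; the noise schedule, denoiser, and feature alignment below are simplified assumptions, not the project's implementation.

```python
# Sketch: one DDPM-style training step for video-conditioned Mel generation.
# The schedule, denoiser, and alignment of video features are simplified.
import torch
import torch.nn as nn

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)          # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to a Mel spectrogram, given video features."""
    def __init__(self, n_mels=80, video_dim=512, hidden=256):
        super().__init__()
        self.mel_in = nn.Conv1d(n_mels, hidden, 3, padding=1)
        self.video_in = nn.Conv1d(video_dim, hidden, 3, padding=1)
        self.time_emb = nn.Embedding(T_STEPS, hidden)
        self.out = nn.Sequential(nn.ReLU(), nn.Conv1d(hidden, n_mels, 3, padding=1))

    def forward(self, noisy_mel, video_feats, t):
        # noisy_mel: (B, n_mels, frames); video_feats: (B, video_dim, frames)
        h = self.mel_in(noisy_mel) + self.video_in(video_feats)
        h = h + self.time_emb(t)[:, :, None]          # broadcast over frames
        return self.out(h)                            # predicted noise

def training_step(model, mel, video_feats):
    """Forward-diffuse the clean Mel, then regress the injected noise."""
    t = torch.randint(0, T_STEPS, (mel.shape[0],))
    noise = torch.randn_like(mel)
    a_bar = alpha_bars[t][:, None, None]
    noisy = a_bar.sqrt() * mel + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, video_feats, t), noise)

model = ConditionalDenoiser()
mel = torch.randn(4, 80, 256)          # clean Mel spectrograms
video = torch.randn(4, 512, 256)       # video features aligned to Mel frames
loss = training_step(model, mel, video)
```

At sampling time the same denoiser would be applied iteratively from pure noise, with the video features supplying the semantic and temporal conditioning.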
Researcher: Aditya Anil
This project aims to generate captions for videos using a CNN-LSTM model. Recent advances in applying long short-term memory (LSTM) networks to image captioning have inspired exploration of their use for video captioning. Video frames are encoded as a sequence of CNN features, and an LSTM is trained on video-sentence pairs to associate each video with an appropriate caption. Unlike most existing methods, which operate on pre-recorded videos, this approach captures frames from a live camera feed and generates captions dynamically.
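A rough sketch of the capture-and-caption loop, assuming a toy vocabulary, a small per-frame CNN, and greedy decoding; the actual backbone, vocabulary, and training setup are not specified here, and `bos_id` is a hypothetical start-of-sentence token.

```python
# Sketch: CNN-LSTM captioning over frames grabbed live with OpenCV.
# Backbone, vocabulary, and decoding are toy assumptions, not the project's setup.
import cv2
import torch
import torch.nn as nn

class CNNLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=256, hidden=256, max_len=15):
        super().__init__()
        # Per-frame CNN encoder (illustrative, not a pretrained backbone).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.video_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.caption_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
        self.max_len = max_len

    @torch.no_grad()
    def caption(self, frames, bos_id=1):
        # frames: (T, 3, H, W) from the live feed; returns greedy token ids.
        feats = self.cnn(frames).unsqueeze(0)         # (1, T, feat_dim)
        _, state = self.video_lstm(feats)             # summarize the clip
        token = torch.tensor([[bos_id]])
        ids = []
        for _ in range(self.max_len):
            emb = self.embed(token)                   # (1, 1, hidden)
            out, state = self.caption_lstm(emb, state)
            token = self.out(out).argmax(dim=-1)      # greedy next word
            ids.append(token.item())
        return ids

# Grab a short burst of frames from the default camera and caption it.
cap = cv2.VideoCapture(0)
frames = []
for _ in range(16):
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (64, 64))
    frames.append(torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0)
cap.release()
if frames:
    caption_ids = CNNLSTMCaptioner().caption(torch.stack(frames))
```

The decoded token ids would be mapped back to words through whatever vocabulary the model was trained with.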