Creating an agentic AI to extract audio dialogues from movies and store them in a structured dataset
Objective Definition
Extract individual dialogue clips from movie files and store them along with their corresponding text (transcription) in a structured dataset.
Key Requirements
Audio Processing: Split the movie's audio track into dialogue-based segments.
Speech-to-Text Conversion: Transcribe each dialogue to text.
Data Structuring: Store each extracted audio clip and its transcription as one row of a structured table (a schema sketch follows this list).
Automation with an Agentic AI: Build an AI system that orchestrates these steps end to end, using large language models (LLMs) where judgment is required.
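As a concrete target, here is a minimal sketch of the dataset using pandas. The column names (clip_path, start_sec, end_sec, transcript) and the sample rows are illustrative assumptions, not a fixed schema; the real pipeline fills the rows in automatically.

```python
import pandas as pd

# Illustrative rows only; in practice these come from the extraction pipeline.
rows = [
    {"clip_path": "clips/clip_0001.wav", "start_sec": 12.4, "end_sec": 15.1,
     "transcript": "We need to leave now."},
    {"clip_path": "clips/clip_0002.wav", "start_sec": 16.0, "end_sec": 18.7,
     "transcript": "Not without the map."},
]
df = pd.DataFrame(rows)
df.to_csv("dialogues.csv", index=False)   # or to_parquet() for larger corpora
```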
Tools and Libraries
FFmpeg: For splitting and extracting audio from movies (an end-to-end pipeline sketch follows this list).
Speech Recognition Models:
OpenAI Whisper (highly accurate for multilingual transcription)
Google Speech-to-Text or Hugging Face ASR models
Python Libraries:
ffmpeg-python (for audio splitting)
openai (for Whisper API)
pandas (for structured data storage)
numpy (for numerical array processing)
torch (if using Whisper locally)
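Putting these tools together, a minimal end-to-end sketch might look like the following. It assumes a local Whisper install (pip install openai-whisper) and an input file named movie.mp4, and it treats Whisper's timestamped segments as dialogue units.

```python
import os

import ffmpeg            # pip install ffmpeg-python
import pandas as pd
import whisper           # pip install openai-whisper

MOVIE = "movie.mp4"          # assumed input file name
AUDIO = "movie_audio.wav"
os.makedirs("clips", exist_ok=True)

# 1. Extract a mono 16 kHz audio track (the format Whisper expects).
ffmpeg.input(MOVIE).output(AUDIO, ac=1, ar=16000).run(overwrite_output=True)

# 2. Transcribe; Whisper returns timestamped segments we treat as dialogue units.
model = whisper.load_model("base")     # larger checkpoints trade speed for accuracy
result = model.transcribe(AUDIO)

# 3. Cut each segment into its own clip and collect one dataset row per clip.
rows = []
for i, seg in enumerate(result["segments"]):
    clip = f"clips/clip_{i:04d}.wav"
    (ffmpeg
     .input(AUDIO, ss=seg["start"], t=seg["end"] - seg["start"])
     .output(clip)
     .run(overwrite_output=True))
    rows.append({"clip_path": clip,
                 "start_sec": round(seg["start"], 2),
                 "end_sec": round(seg["end"], 2),
                 "transcript": seg["text"].strip()})

pd.DataFrame(rows).to_csv("dialogues.csv", index=False)
```

Treating Whisper's own segments as dialogue units keeps the sketch simple; a production pipeline might instead segment on silence or subtitle timings before transcribing, at the cost of extra steps.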
Agentic AI for Automation
To make the process more autonomous:
Monitor Folder: Automatically detect new movie files as they arrive (a watcher sketch follows this list).
Process in Parallel: Split audio and transcribe simultaneously.
LLM Integration: Use an LLM to verify transcription quality or classify speaker emotions.
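A minimal sketch of the folder-monitoring and LLM-verification pieces, assuming the watchdog package (pip install watchdog), the openai client with OPENAI_API_KEY set in the environment, and a process_movie function wrapping the pipeline above. The folder name, file extensions, and model name are all illustrative assumptions.

```python
import time
from pathlib import Path

from openai import OpenAI                      # pip install openai
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer       # pip install watchdog

WATCH_DIR = "incoming_movies"                  # assumed drop folder
client = OpenAI()                              # reads OPENAI_API_KEY from the environment

def looks_garbled(transcript: str) -> bool:
    """Ask an LLM to sanity-check a transcript (model name is an assumption)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Answer only yes or no: is this transcription "
                              f"garbled or nonsensical?\n\n{transcript}"}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

class MovieHandler(FileSystemEventHandler):
    def on_created(self, event):
        path = Path(str(event.src_path))
        if not event.is_directory and path.suffix.lower() in {".mp4", ".mkv", ".avi"}:
            print(f"New movie detected: {path}")
            # process_movie(path)  # hypothetical wrapper around the pipeline above

observer = Observer()
observer.schedule(MovieHandler(), WATCH_DIR, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)   # keep the main thread alive; the observer runs in the background
finally:
    observer.stop()
    observer.join()
```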
Possible Enhancements
Speaker Identification: Assign unique speaker IDs using a speaker diarization model (a sketch follows this list).
Emotion Detection: Add emotion tags for each dialogue.
Multilingual Support: Whisper handles multiple languages if required.
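For speaker IDs and emotion tags, one hedged sketch combines pyannote.audio diarization with a Hugging Face audio-classification pipeline. The model names are assumptions (the pyannote checkpoint is gated and requires an accepted license plus a Hugging Face token), and the speaker labels still need to be matched to Whisper segments by timestamp overlap.

```python
# pip install pyannote.audio transformers
from pyannote.audio import Pipeline
from transformers import pipeline as hf_pipeline

# Assumption: a gated pyannote checkpoint; needs license acceptance and an HF token.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = diarizer("movie_audio.wav")

# Each track is (segment, track_id, speaker_label); match these to Whisper
# segments by timestamp overlap to attach a speaker ID to every dialogue row.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:6.1f}s - {segment.end:6.1f}s  {speaker}")

# Assumption: an off-the-shelf emotion-recognition model from the SUPERB suite.
emotion = hf_pipeline("audio-classification",
                      model="superb/wav2vec2-base-superb-er")
print(emotion("clips/clip_0000.wav"))  # e.g. [{'label': 'neu', 'score': ...}, ...]
```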