AI Video Processing System

Project Lifecycle

Load Video

The system accepts a wide variety of formats including MP4, MKV, MOV, and more. Simply select your video file to begin the process.

Transcribe Video

Uses Whisper's medium English model for a good balance between speed and accuracy. Optional PyAnnote integration adds speaker segmentation and diarization for multi-speaker videos.
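One way to picture how a transcript and a diarization result can be combined is to label each transcript segment with the speaker whose turn overlaps it the most. The sketch below is only an illustration under that assumption: the Segment class, the (start, end, speaker) turn tuples, and the "UNKNOWN" fallback are hypothetical simplifications, not the actual output formats of Whisper or PyAnnote.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str
    speaker: str = "UNKNOWN"


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it most.

    `turns` is a list of (start, end, speaker) tuples, a hypothetical
    simplification of diarization output. Segments with no overlapping
    turn keep the "UNKNOWN" label.
    """
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(seg.start, seg.end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        seg.speaker = best_speaker
    return segments


segments = [
    Segment(0.0, 4.0, "Welcome to the show."),
    Segment(4.0, 9.0, "Thanks for having me."),
    Segment(20.0, 22.0, "(music)"),
]
turns = [(0.0, 4.5, "SPEAKER_00"), (4.5, 10.0, "SPEAKER_01")]
labeled = assign_speakers(segments, turns)
```

The last segment falls outside every speaker turn, so it stays "UNKNOWN"; a real pipeline would need some fallback for such cases.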

Scene Detection

Choose between TransNetV2 (video-based) or LDA (Latent Dirichlet Allocation, text-based) for scene detection. An additional option merges scenes that fall below a specified threshold for more coherent results.
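The merge option could look roughly like the sketch below. It assumes the threshold is a minimum scene duration in seconds and that scenes are contiguous (start, end) pairs sorted by time; both are assumptions made for illustration, and the function name merge_short_scenes is hypothetical.

```python
def merge_short_scenes(scenes, min_duration):
    """Merge scenes shorter than `min_duration` seconds into a neighbor.

    `scenes` is a list of (start, end) tuples sorted by start time and
    assumed to be contiguous. Short scenes are folded into the previous
    scene; a short first scene is folded into the one that follows it.
    """
    merged = []
    for start, end in scenes:
        if merged and (end - start) < min_duration:
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)  # extend the previous scene
        else:
            merged.append((start, end))
    # Handle a short first scene, which had no previous scene to join.
    if len(merged) > 1 and (merged[0][1] - merged[0][0]) < min_duration:
        first, second = merged[0], merged[1]
        merged[:2] = [(first[0], second[1])]
    return merged


scenes = [(0, 2), (2, 30), (30, 33), (33, 60)]
result = merge_short_scenes(scenes, min_duration=5)
```

Here the 2-second opening scene and the 3-second scene at 30 s are both absorbed, leaving two scenes.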

Result of Scene Detection (LDA)

Scene Scoring

Analyzes scenes using multiple metrics:

Dialogue importance
Sentiment analysis
Contextual relevance
Keyword detection

For scenes without dialogue, the system performs audio analysis to generate equivalent scores based on the same metrics.
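As a minimal sketch, the four metrics can be combined into one scene score with a weighted average. The metric names mirror the list above, but the weight values, the 0 to 1 metric range, and the scene_score function are assumptions for illustration only.

```python
# Hypothetical default weights; the real values are not given in this write-up.
DEFAULT_WEIGHTS = {
    "dialogue": 0.4,
    "sentiment": 0.2,
    "context": 0.25,
    "keywords": 0.15,
}


def scene_score(metrics, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-metric scores, each assumed to lie in [0, 1].

    Metrics missing from `metrics` count as 0. Dividing by the weight
    total keeps the result in [0, 1] even if the weights do not sum to 1.
    """
    total = sum(weights.values())
    weighted = sum(weights[name] * metrics.get(name, 0.0) for name in weights)
    return weighted / total


example = {"dialogue": 0.8, "sentiment": 0.5, "context": 0.6, "keywords": 1.0}
score = scene_score(example)
```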

Scene Selection, Trimming, and Concatenation

Scene Selection

Select scenes based on your desired final video duration. An optional two-phase "overflow process" adds flexibility, with customizable parameters to fine-tune your results.
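Picking the best scenes within a target duration maps naturally onto the 0/1 knapsack problem, which this write-up later mentions solving with dynamic programming. The sketch below shows that general technique, not the tool's actual code: the (scene_id, duration, score) tuple layout and the rounding to whole seconds are assumptions.

```python
def select_scenes(scenes, target_seconds):
    """0/1 knapsack: maximize total score with total duration <= target.

    `scenes` is a list of (scene_id, duration_seconds, score) tuples.
    Durations are rounded to whole seconds so they can index the table.
    """
    n = len(scenes)
    durations = [max(1, round(d)) for _, d, _ in scenes]
    # best[i][t] = best total score using the first i scenes within t seconds
    best = [[0.0] * (target_seconds + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dur, score = durations[i - 1], scenes[i - 1][2]
        for t in range(target_seconds + 1):
            best[i][t] = best[i - 1][t]  # option 1: skip scene i
            if dur <= t and best[i - 1][t - dur] + score > best[i][t]:
                best[i][t] = best[i - 1][t - dur] + score  # option 2: take it
    # Walk the table backwards to recover which scenes were taken.
    chosen, t = [], target_seconds
    for i in range(n, 0, -1):
        if best[i][t] != best[i - 1][t]:
            chosen.append(scenes[i - 1][0])
            t -= durations[i - 1]
    return sorted(chosen)


scenes = [(0, 60, 0.9), (1, 45, 0.4), (2, 30, 0.7), (3, 50, 0.8)]
picked = select_scenes(scenes, target_seconds=120)
```

With a 120-second budget, scenes 0 and 3 (110 seconds in total) give the highest combined score, so those two are picked.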

Final Processing

Once satisfied with your selections, initiate the trim and concatenation process to generate your final video with only the most important content.
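The write-up does not say which tool performs the trimming and concatenation. Assuming the ffmpeg command-line tool is used, the commands could be built as in the sketch below; the helper names and clip file names are hypothetical, and nothing is executed here. Because stream copy ("-c copy") avoids re-encoding, cuts land on keyframes rather than exact timestamps.

```python
def build_trim_command(source, start, end, output):
    """ffmpeg command that copies the [start, end) range of `source`."""
    duration = end - start
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-i", source,
            "-t", f"{duration:.3f}", "-c", "copy", output]


def build_concat_list(clip_paths):
    """Contents of the text file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in clip_paths)


def build_concat_command(list_path, output):
    """ffmpeg command that joins the clips named in `list_path`."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]


# Hypothetical selected scenes as (start, end) in seconds.
selected = [(0.0, 60.0), (150.0, 200.0)]
clip_names = [f"clip_{i:03d}.mp4" for i in range(len(selected))]
trim_commands = [
    build_trim_command("input.mp4", start, end, name)
    for (start, end), name in zip(selected, clip_names)
]
concat_list = build_concat_list(clip_names)
concat_command = build_concat_command("clips.txt", "final.mp4")
```

Each command list could then be passed to subprocess.run after writing concat_list to clips.txt.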

Performance benchmarks:

Reduces manual editing time from 2-3 hours to just 35-40 minutes per video
Achieves 70-80% accuracy in content selection compared to manually edited videos
Processes a 2-hour long-form video in only 35-40 minutes on a single CPU core

Notes:

I used fuzzy logic to classify scenes more flexibly, allowing them to be a mix of "Filler," "Connector," and "Key Scene" instead of forcing them into strict categories.
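To illustrate the idea, a fuzzy classifier returns a degree of membership in each category instead of a single label. In this minimal sketch the breakpoints (0, 0.5, 1) and the linear membership shapes are assumptions chosen for simplicity, not the values used in the tool.

```python
def fuzzy_classify(score):
    """Membership degrees for a scene score assumed to lie in [0, 1].

    Filler is strongest at 0, Connector peaks at 0.5, and Key Scene is
    strongest at 1. The breakpoints are illustrative assumptions.
    """
    filler = max(0.0, min(1.0, (0.5 - score) / 0.5))
    connector = max(0.0, 1.0 - abs(score - 0.5) / 0.5)
    key_scene = max(0.0, min(1.0, (score - 0.5) / 0.5))
    return {"Filler": filler, "Connector": connector, "Key Scene": key_scene}


memberships = fuzzy_classify(0.75)
```

A score of 0.75 comes out as half "Connector" and half "Key Scene", so a scene like this can be treated as a borderline case instead of being forced into one category.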

The adaptive weight system adjusts the importance of dialogue, sentiment, context, and keywords based on the scene. A greedy algorithm helps fine-tune these weights to improve scene selection.

In Phase 2 of the overflow process, a natural stopping mechanism is implemented to prevent too much "overflow".
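A greedy weight search can be sketched as follows: nudge one weight at a time and keep a change only if it improves agreement with a reference selection. Everything here is an assumption for illustration, including the reference set of "important" scene IDs (for example, taken from a manually edited video), the step size, and the function names. It tunes one global set of weights and does not show the per-scene adaptation.

```python
def agreement(weights, scenes, reference, top_k):
    """Fraction of the top_k highest-scoring scenes found in `reference`."""
    def score(scene):
        return sum(weights[m] * scene["metrics"][m] for m in weights)
    ranked = sorted(scenes, key=score, reverse=True)
    picked = {scene["id"] for scene in ranked[:top_k]}
    return len(picked & reference) / top_k


def tune_weights(weights, scenes, reference, top_k, step=0.05, max_rounds=50):
    """Greedy search: apply the single best weight nudge each round."""
    weights = dict(weights)
    best = agreement(weights, scenes, reference, top_k)
    for _ in range(max_rounds):
        best_candidate = None
        for name in weights:
            for delta in (step, -step):
                trial = dict(weights)
                trial[name] = max(0.0, trial[name] + delta)
                result = agreement(trial, scenes, reference, top_k)
                if result > best:
                    best, best_candidate = result, trial
        if best_candidate is None:
            break  # no single nudge helps any more
        weights = best_candidate
    return weights, best


scenes = [
    {"id": 0, "metrics": {"dialogue": 0.7, "keywords": 0.32}},
    {"id": 1, "metrics": {"dialogue": 0.2, "keywords": 0.9}},
    {"id": 2, "metrics": {"dialogue": 0.3, "keywords": 0.3}},
    {"id": 3, "metrics": {"dialogue": 0.3, "keywords": 0.7}},
]
reference = {1, 3}  # hypothetical scenes a human editor kept
start = {"dialogue": 0.5, "keywords": 0.5}
before = agreement(start, scenes, reference, top_k=2)
tuned, after = tune_weights(start, scenes, reference, top_k=2)
```

Like any greedy search, this can stall on a plateau where no single nudge changes the ranking, so the result depends on the starting weights and step size.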

My Journey into being obsessed with Data and AI:

From someone who always said, "I don’t like coding, it’s not for me", to someone who built an automated video processing system.

I created an intelligent video processing tool that automatically identifies and extracts the most important segments from long-form videos, saving hours of manual research and editing work in my content creation process.

The Problem

As a content creator working with long-form videos, I was spending excessive hours manually watching, researching, and identifying important segments. Despite having zero background in coding (I actually hated it in the past!), I decided to tackle this inefficiency head-on with nothing but my will and the vision that if I created this, my life would be easier 😅.

Spoilers… Did it actually get easier? Well, you’ll have to contact me to find out! 😁

The Solution

I developed a tool that:

Loads videos through a simple PySide6 interface
Transcribes content using Whisper's medium English model
Analyzes scenes using either TransNetV2 (video-based) or LDA (text-based)
Scores scenes through diverse metrics to identify importance
Selects top-scoring scenes based on user-defined parameters
Automatically trims and concatenates the final output

My Learning Journey

Starting from absolute zero, this project transformed my relationship with technology:

The Basics: Fumbled through installing packages and system variables without understanding what I was doing half the time.
The Breakthrough: Discovered IDEs (Visual Studio 2022 was a game-changer!) after struggling with Notepad 😂.
Resource Management: Learned to optimize as my increasingly complex processes began hitting hardware limitations.
Architecture Design: Mastered creating system architectures, Mermaid diagrams, and technical plans.
Advanced Algorithms: Implemented sliding window approaches and applied a knapsack problem approach with dynamic programming to achieve optimal scene selection.
Natural Language Processing: Utilized spaCy, NLTK, and named entity recognition (NER), and experimented with various models (j-hartmann's emotion model, RoBERTa, BERT, Flan-T5, etc.).
Multimodal Analysis: Integrated CLIP, CLAP, and other multimodal models for comprehensive video understanding.
Robust Scoring: Developed metrics combining different techniques for audio, video, and text analysis.
Sophisticated Clustering: Implemented HDBSCAN alongside greedy and fuzzy algorithms.

The Challenge

The most difficult aspect wasn't implementing individual components but making them work harmoniously together. Each approach had limitations. For example, processing timestamps with speaker labels sometimes left too little text to work with, which required careful consideration and robust fallback mechanisms.

The Impact

This tool now forms a core part of my content creation workflow, saving me hours of tedious work by automatically identifying important segments without watching entire long-form videos.

The project expanded beyond its original scope, leading me to create various automation scripts for daily tasks and teaching me the valuable lesson that when faced with a problem, there's always a solution. You might not have the skills yet. But with persistence, effort, will, and time, YOU WILL.

As it turns out, creating scripts that save time is incredibly satisfying (or maybe that's just me being lazy! 😂)

The Real Impact in ME

This journey forged a will that always tries, never gives up, and constantly improves. I learned how to FOCUS, THINK deeply, and sustain work for long hours. Essentially, I transformed into a man of work and I'm loving it!

I now carry the unshakable belief that whatever challenge appears, I can conquer it. I may not know how today, but I will tomorrow! If not, then I'll keep going until I do.

The mental hurdles that once seemed so imposing ("this topic is too complex," "this task is impossible," "I can't do this") have vanished completely. I know with certainty that I can accomplish anything; it's merely a matter of time, effort, FULL ATTENTION ON THE WORK, and continuous improvement through each failure. Eventually, you reach that beautiful moment of realization: "Oh, it isn't so hard after all."

I've become someone who genuinely loves the work itself, finding joy in helping others through my skills and empathy. Because honestly, bridging the gap from someone who hated coding to someone who embraced it required overcoming a massive mental barrier. This journey has made me even more empathetic than I naturally was.

Isn't it simply wonderful to help people? To know that your work, regardless of scale, contributes to something greater? This experience has ignited in me an unstoppable drive to LEARN and grow every single day!

I could be a motivational speaker, honestly... Will I do a good job? Tell me what you think!
