Problem Statement
Developing a system that converts lecture video content into concise transcripts and summaries using AI, while also generating relevant questions and answers to support active learning and quick content revision.
Introduction
Students frequently find it difficult to follow lengthy lecture videos in today's fast-paced learning environment, particularly when attempting to review material or concentrate on key ideas. To address this, we developed an AI-powered system that automatically creates transcripts, concise summaries, and interactive quiz questions, making educational videos easier to navigate and revise.
The system integrates three advanced NLP models: Whisper for speech-to-text transcription, BART for summarization, and T5 for question generation, all housed in a clean, simple web interface built with Gradio. With just a video upload, users can instantly access structured, interactive content, making learning faster, more personalized, and more engaging.
Related Work
In recent years, various tools have been introduced to process educational video content using techniques such as speech-to-text conversion and extractive summarization. While these systems offer some level of automation, they often lack semantic depth, coherence, and interactivity. Extractive summarization approaches, for example, tend to copy exact sentences from transcripts, which can result in disjointed summaries lacking flow and contextual meaning. Additionally, many systems rely on pre-existing transcripts, making them unsuitable for raw video inputs without subtitles or captions.
Some earlier works, such as those combining the YouTube Transcript API with BERT-based summarization, addressed parts of the problem but did not deliver a complete, integrated solution. Most systems also lacked interactive learning features like quiz generation, which are essential for student engagement and knowledge retention.
Key research gaps we identified and aimed to fill include:
Limited use of abstractive summarization models like BART, which provide better contextual understanding than extractive methods.
Lack of unified systems that combine transcription, summarization, and quiz generation in one end-to-end solution.
Poor accessibility of many tools to non-technical users due to complex interfaces or dependency on manual steps.
Underutilization of modern transformer-based models such as Whisper, BART, and T5 together in a single educational application.
Objectives
To develop a system that extracts audio from lecture videos.
To transcribe speech to text.
To summarize the transcript and generate quiz questions from it.
Methodology
Our system follows a streamlined AI-driven pipeline:
Input: User uploads a lecture video.
Audio Extraction: Video is processed using FFmpeg to extract audio.
Transcription: Whisper model transcribes the audio into accurate, readable text.
Summarization: The transcript is summarized using Facebook’s BART model for better readability and understanding.
Quiz Generation: Google’s T5 model generates relevant quiz questions from the summarized text.
Output: The final output includes transcript, summary, and quiz questions displayed to the user.
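The pipeline above can be sketched in Python. The specific checkpoints shown here (Whisper `base`, `facebook/bart-large-cnn`, and the community question-generation checkpoint `valhalla/t5-base-qg-hl`) are illustrative choices, not necessarily the exact variants used in our system:

```python
import subprocess

def ffmpeg_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build the FFmpeg command that strips the audio track.

    16 kHz mono WAV matches what Whisper expects as input.
    """
    return [
        "ffmpeg", "-y",        # overwrite output without prompting
        "-i", video_path,      # input lecture video
        "-vn",                 # drop the video stream
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # mono
        audio_path,
    ]

def run_pipeline(video_path: str) -> dict:
    # Heavy imports stay local so the helper above works even
    # without the ML dependencies installed.
    import whisper
    from transformers import pipeline

    subprocess.run(ffmpeg_cmd(video_path, "audio.wav"), check=True)

    # 1. Speech-to-text with Whisper.
    transcript = whisper.load_model("base").transcribe("audio.wav")["text"]

    # 2. Abstractive summarization with BART.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    summary = summarizer(transcript, max_length=150, min_length=40,
                         truncation=True)[0]["summary_text"]

    # 3. Question generation with a T5 checkpoint fine-tuned for the task.
    qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")
    questions = [q["generated_text"]
                 for q in qg("generate questions: " + summary)]

    return {"transcript": transcript, "summary": summary, "questions": questions}
```

In a Gradio front end, `run_pipeline` would be wired directly as the upload handler, so the three outputs appear in separate text boxes.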
Figure: Performance comparison of our model's summarization output with other baseline models.
Learnings:
Gained practical knowledge of chaining multiple NLP models in a single pipeline.
Understood differences between extractive vs. abstractive summarization.
Learned to preprocess and optimize data for large language models.
Challenges:
Handling long lecture videos with poor audio quality.
Managing system performance on limited hardware resources.
Ensuring semantic alignment between summary and generated quiz content.
We tackled these by optimizing audio processing, testing multiple model variants, and manually evaluating outputs throughout development.
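The long-video challenge is concrete: BART's encoder accepts at most 1024 tokens, so a full lecture transcript cannot be summarized in one pass. One common mitigation, sketched below as an assumption rather than our exact implementation, is to split the transcript into overlapping word windows and summarize each chunk separately (the window and overlap sizes are illustrative):

```python
def chunk_text(text: str, max_words: int = 450, overlap: int = 50) -> list[str]:
    """Split a long transcript into overlapping word-window chunks.

    The overlap carries a little context across chunk boundaries so
    chunk-by-chunk summaries stay coherent when concatenated.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap          # slide the window forward
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break                        # last window already covers the tail
    return chunks
```

Each chunk is then fed to the summarizer, and the per-chunk summaries are joined (or summarized once more) to produce the final summary.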
1) 7th International Conference on Communication and Intelligent Systems
2) 3rd International Conference on Artificial Intelligence: Theory and Applications