In the digital learning era, students often struggle to navigate lengthy lecture videos to extract key insights and revise important topics. Manually reviewing hours of video content is inefficient, especially during exams or when dealing with complex subjects. Educators, too, find it time-consuming to create supplementary materials such as summaries and quizzes. To address these challenges, we propose an intelligent NLP-based system that automates transcribing lecture videos, summarizing them, and generating quizzes for enhanced learning and retention.
In recent years, various tools have been introduced to assist in processing educational video content using techniques like speech-to-text conversion and extractive summarization. While these systems offer some level of automation, they often lack semantic depth, coherence, and interactivity. Extractive summarization approaches, for example, tend to copy exact sentences from transcripts, which can result in disjointed summaries lacking flow and contextual meaning. Additionally, many systems rely on pre-existing transcripts, making them unsuitable for raw video inputs without subtitles or captions.
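The verbatim-copying behaviour of extractive methods is easy to demonstrate with a toy frequency-based sentence scorer (a minimal illustrative sketch, not the method of any system discussed here):

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the frequency of its words and return the
    top-scoring sentences, re-emitted in original document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Score = sum of word frequencies, normalised by sentence length.
    scores = {
        s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())) / max(len(s.split()), 1)
        for s in sentences
    }
    top = sorted(sentences, key=scores.get, reverse=True)[:n_sentences]
    # Every selected sentence is copied exactly as it appeared, which is
    # why extractive summaries can read as disjointed and lacking flow.
    return " ".join(s for s in sentences if s in top)
```

Because the output is a concatenation of unmodified transcript sentences, transitions between them are lost; abstractive models such as BART instead generate new phrasing conditioned on the whole input.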
Some earlier works, like those using the YouTube Transcript API with BERT-based summarization, addressed parts of the problem but did not deliver a complete, integrated solution. Most systems also lacked interactive learning features like quiz generation, which are essential for student engagement and knowledge retention.
Key research gaps we identified and aimed to fill include:
Limited use of abstractive summarization models like BART, which provide better contextual understanding than extractive methods.
Lack of unified systems that combine transcription, summarization, and quiz generation in one end-to-end solution.
Poor accessibility of many tools to non-technical users due to complex interfaces or dependency on manual steps.
Underutilization of modern transformer-based models such as Whisper, BART, and T5 together in a single educational application.
To develop a system that extracts audio from lecture videos.
To transcribe speech to text.
To summarize the transcribed text and generate quizzes from it.
Our system follows a streamlined AI-driven pipeline:
Input: User uploads a lecture video.
Audio Extraction: Video is processed using FFmpeg to extract audio.
Transcription: Whisper model transcribes the audio into accurate, readable text.
Summarization: The transcript is summarized using Facebook’s BART model for better readability and understanding.
Quiz Generation: Google’s T5 model generates relevant quiz questions from the summarized text.
Output: The final output includes transcript, summary, and quiz questions displayed to the user.
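The steps above can be sketched as follows. Model names follow the common `openai-whisper` and Hugging Face `transformers` packages; the temporary audio path and the T5 question-generation checkpoint are illustrative assumptions, not fixed choices of our system:

```python
import subprocess

def build_ffmpeg_cmd(video_path, audio_path):
    """FFmpeg invocation for the audio-extraction step: drop the video
    stream (-vn) and emit 16 kHz mono WAV, the rate Whisper expects."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", audio_path]

def run_pipeline(video_path):
    """End-to-end sketch: extract audio, transcribe, summarize, quiz."""
    audio_path = "lecture_audio.wav"  # illustrative temporary path
    subprocess.run(build_ffmpeg_cmd(video_path, audio_path), check=True)

    # Transcription with Whisper (openai-whisper package).
    import whisper
    transcript = whisper.load_model("base").transcribe(audio_path)["text"]

    # Abstractive summarization with BART.
    from transformers import pipeline
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    summary = summarizer(transcript, max_length=150, min_length=40)[0]["summary_text"]

    # Question generation with a T5 checkpoint fine-tuned for question
    # generation (this checkpoint name is an assumption).
    qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")
    quiz = qg("generate question: " + summary)[0]["generated_text"]

    return {"transcript": transcript, "summary": summary, "quiz": quiz}
```

Keeping the heavy imports inside `run_pipeline` lets the module load on machines without the models installed, which also simplified testing the FFmpeg stage in isolation.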
Learnings:
Gained practical knowledge of chaining multiple NLP models in a single pipeline.
Understood differences between extractive vs. abstractive summarization.
Learned to preprocess and optimize data for large language models.
Challenges:
Handling long lecture videos with poor audio quality.
Managing system performance on limited hardware resources.
Ensuring semantic alignment between summary and generated quiz content.
We tackled these challenges by optimizing audio processing, testing multiple model variants, and manually evaluating results during development.
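One mitigation for long lectures is chunking the transcript before summarization, since BART accepts only about 1,024 tokens per input. A simple overlapping word-window splitter (the word-count budget here is a rough stand-in for true tokenization) might look like:

```python
def chunk_transcript(text, max_words=400, overlap=50):
    """Split a long transcript into overlapping word windows so each
    piece fits a summarizer's input limit; the overlap preserves
    context across chunk boundaries."""
    words = text.split()
    if not words:
        return []
    chunks, start = [], 0
    step = max_words - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += step
    return chunks
```

Each chunk is then summarized independently and the partial summaries are concatenated (or summarized again) to produce the final output.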
7th International Conference on Communication and Intelligent Systems.
3rd International Conference on Artificial Intelligence: Theory and Applications.