Utilizing Video as a Knowledge Base in the Development of Large Language Models in Cybersecurity Intelligent Penetration-testing Helper for Ethical Researcher (CIPHER)
by Iqbal Muhammad (이크발 무하마드)
The integration of artificial intelligence (AI) and large language models (LLMs) has transformed numerous fields, offering significant improvements in automation, decision-making, and task management. LLMs, trained on vast amounts of text-based data, have become proficient in understanding and generating human-like responses across a variety of domains. However, when it comes to domain-specific tasks like cybersecurity, LLMs face challenges due to the limited availability of specialized, well-documented knowledge. Developing effective LLMs for niche areas such as penetration testing requires innovative approaches to dataset generation, training, and fine-tuning. This research seeks to address these challenges by introducing a novel framework that utilizes video as a key knowledge base, providing a richer, more contextual source of information for LLM development.
Penetration testing, a critical process in cybersecurity, involves simulating cyberattacks to identify vulnerabilities in systems, networks, or applications. As cyber threats grow more sophisticated, so does the complexity of penetration testing, demanding expertise and up-to-date knowledge. The use of AI to enhance pentesting tools, such as the Cybersecurity Intelligent Penetration-testing Helper for Ethical Researchers (CIPHER), presents an opportunity to automate and assist ethical hackers in identifying vulnerabilities more efficiently. However, traditional LLMs may fall short in this domain due to their reliance on text-based datasets, which often lack the depth, specificity, and practical context needed for effective pentesting. This research explores the use of video content—such as tutorials, live demonstrations, and technical simulations—as a means to fill this knowledge gap, transforming multimedia content into actionable data for penetration testing.
To convert video content into a format suitable for LLM training, this thesis employs Whisper, a powerful AI-based automatic speech recognition (ASR) model developed by OpenAI. Whisper is capable of transcribing video audio into high-quality text, which can then be processed and annotated for use in machine-learning models. This study outlines a detailed methodology for utilizing Whisper to extract valuable insights from cybersecurity videos, ensuring that the transcriptions maintain the technical accuracy and context needed for penetration testing. By combining these transcriptions with additional metadata—such as the specific techniques demonstrated or tools used—the research creates a domain-specific dataset that enriches the LLM’s understanding of pentesting. This framework not only improves the LLM’s ability to assist ethical hackers but also demonstrates how video-derived knowledge can significantly enhance AI-driven penetration testing solutions.
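The dataset-construction step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the names `TranscriptSegment` and `build_record`, and the record's field layout, are assumptions chosen for the example; the transcription itself would come from a Whisper ASR pass over the video's audio track.

```python
# Sketch: pairing a Whisper-style transcription with annotation metadata
# to form a single JSON-serializable training record.
# TranscriptSegment, build_record, and the field names are illustrative.
from dataclasses import dataclass
import json

@dataclass
class TranscriptSegment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcribed speech for this segment

def build_record(video_id, segments, techniques, tools):
    """Join transcription segments and attach metadata (techniques
    demonstrated, tools used) into one training record."""
    transcript = " ".join(s.text.strip() for s in segments)
    return {
        "source_video": video_id,
        "transcript": transcript,
        "techniques": techniques,
        "tools": tools,
    }

# In practice the segments would be produced by an ASR model, e.g. with
# the openai-whisper package: whisper.load_model("base").transcribe(path)
segments = [
    TranscriptSegment(0.0, 4.2, "First we scan the target with nmap."),
    TranscriptSegment(4.2, 9.8, "Then we enumerate the open services."),
]
record = build_record("demo-001", segments, ["network scanning"], ["nmap"])
print(json.dumps(record))
```

Records of this shape can then be aggregated into a domain-specific corpus for fine-tuning.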
For more information about this research, please contact iqbal@pusan.ac.kr.