Scribe
Securing Vision
2024-2025
In surveillance, efficient storage management and fast search are critical, and our project aims to meet both requirements. It addresses the pressing need for a comprehensive solution that optimizes storage while ensuring rapid, precise responses to search queries. At its core, the system converts video data from closed-circuit television cameras into text and stores it in a dynamic database. Building on this approach, the project integrates text-based activity recognition to identify significant events from the textual data and alert users, and it optimizes database storage by retaining captions selectively. A pivotal component is a two-step search engine that offers users a seamless experience: the initial search returns textual results for a quick overview, and, for users who need deeper insight or a specific clip, the system then retrieves the corresponding visual content, ensuring a comprehensive and user-centric approach to exploring surveillance data. Furthermore, an advanced analytics dashboard leverages this data to provide insights, trends, heat maps, and predictive analytics, enhancing foresight and strategic planning. Prioritizing inclusivity, the system offers voice commands, text-to-speech, and high-contrast visual aids so that it remains accessible to all users. This initiative aims to redefine surveillance technology by combining efficiency, precision, and user-friendly functionality, making surveillance data more insightful and universally accessible.
In Pakistan, the crime rate has risen to 3.98 amid inflation, and many incidents in major cities go unreported or unresolved for lack of evidence. Security staff often struggle with inefficient processes, spending hours manually reviewing and exporting CCTV footage, which is stored only temporarily (30-90 days) because of limited capacity.
Our project aims to revolutionize this by integrating advanced technology to convert video footage into textual data. This conversion facilitates text-based activity recognition using natural language processing, enabling efficient identification and alerting of significant events. The textual data not only eases the search process but also conserves storage space by allowing unimportant footage to be discarded while retaining valuable textual records.
We are implementing a multimedia database tailored for managing both text and video data, which supports a two-step search engine that retrieves text descriptions first, followed by video upon request. This method is quicker and less labor-intensive than traditional video searches. Additionally, our system includes an advanced analytics dashboard providing insights through heat maps, event frequency analysis, and predictive analytics to enhance decision-making.
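To make the two-step retrieval concrete, a minimal sketch is given below. It assumes an SQLite store of captions that reference the underlying clips; the table name, columns, and query style are illustrative assumptions, not the project's actual multimedia database design.

```python
# Minimal two-step search sketch: step 1 returns captions only, step 2 resolves
# the video clip on demand. Schema and names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect("surveillance.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS captions (
        id INTEGER PRIMARY KEY,
        camera_id TEXT,
        start_time TEXT,
        caption TEXT,
        video_path TEXT        -- may be NULL once old footage is discarded
    )
""")

def search_text(query: str):
    """Step 1: return matching captions only (fast, no video I/O)."""
    cur = conn.execute(
        "SELECT id, camera_id, start_time, caption FROM captions "
        "WHERE caption LIKE ?", (f"%{query}%",)
    )
    return cur.fetchall()

def fetch_video(caption_id: int):
    """Step 2: resolve the video clip only when the user asks for it."""
    row = conn.execute(
        "SELECT video_path FROM captions WHERE id = ?", (caption_id,)
    ).fetchone()
    return row[0] if row else None
```

Because step 1 touches only text, it stays fast even when the referenced footage has already been purged; step 2 simply reports that no clip is available in that case.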
Key features also include voice command functionality, text-to-speech for results, and high-contrast visual aids to ensure accessibility and compliance with standards, making our surveillance system more robust, responsive, and suited to the challenges faced.
The primary goals and objectives of this project are:
⦁ Real-time transcription of CCTV footage and efficient storage of the textual representation.
⦁ Implementing a user-friendly two-step search engine is a key objective. This approach allows users to quickly retrieve textual information in real time, offering a streamlined and intuitive search experience, and then to request the corresponding video content when needed.
⦁ Integrating text-based activity recognition to automatically identify significant events from the textual data, which strengthens the alert system and improves overall security responsiveness (a minimal sketch of this idea follows the list below).
⦁ Using text-based activity recognition to identify and discard unnecessary video footage over time while retaining the associated textual information for an extended period. This strategy optimizes storage resources by preserving valuable textual data for longer durations, balancing data retention against resource efficiency.
⦁ Developing an advanced analytics dashboard that leverages the textual data for generating insights, activity heat maps, and predictive analytics, thus providing a comprehensive understanding of the surveillance environment.
⦁ Ensuring inclusivity and wider accessibility through the implementation of voice commands for searches, text-to-speech outputs for delivering results, and high-contrast visual aids, catering to users with diverse needs and enhancing the usability of the surveillance system.
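As referenced in the objectives above, significant events are flagged from the generated captions themselves. The snippet below is only an illustrative keyword-based placeholder for that idea; the project's actual recognition uses NLP models, and the keyword list is an assumption.

```python
# Illustrative keyword-based event flagging over generated captions; a stand-in
# for the project's NLP-based activity recognition, not the real model.
ALERT_KEYWORDS = {"fight", "weapon", "robbery", "fire", "accident"}

def flag_significant(caption: str) -> bool:
    """Return True if the caption mentions any alert-worthy activity."""
    words = set(caption.lower().split())
    return bool(words & ALERT_KEYWORDS)

captions = [
    "two people walking across the parking lot",
    "a robbery takes place near the shop counter",
]
alerts = [c for c in captions if flag_significant(c)]   # keeps only the robbery caption
```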
The project's foundation lies in videos enriched with captions. For our project, we will use the UCF-Crime dataset, which consists of surveillance videos with detailed captions. To facilitate analysis, these videos will be split into temporal clips and further into individual frames, providing a granular representation for efficient processing.
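A minimal frame-extraction sketch with OpenCV is shown below; the sampling rate (one frame per second) and file paths are assumptions for illustration, not fixed project parameters.

```python
# Sample frames from a surveillance clip at a fixed interval using OpenCV.
import cv2

def extract_frames(video_path: str, every_n_seconds: float = 1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0          # fall back if FPS is unreported
    step = max(int(fps * every_n_seconds), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                     # BGR HxWx3 array
        idx += 1
    cap.release()
    return frames
```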
Advanced neural networks such as VGG16 or Inception v4 are used for extracting visual features from individual frames of the video. VGG16 excels in recognizing intricate details, while Inception v4 captures contextual information effectively.
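As a hedged sketch of this step, the snippet below extracts per-frame features with a pretrained VGG16 in Keras/TensorFlow (Inception v4 would be wired up analogously with its own preprocessing); layer choice and image size follow the standard VGG16 setup, and the helper name is illustrative.

```python
# Extract a 4096-d visual feature per frame from the fc2 layer of pretrained VGG16.
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")
feature_extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def frame_features(frame_bgr):
    """frame_bgr: HxWx3 uint8 array (e.g. from OpenCV)."""
    rgb = cv2.cvtColor(cv2.resize(frame_bgr, (224, 224)), cv2.COLOR_BGR2RGB)
    batch = preprocess_input(rgb[np.newaxis].astype("float32"))
    return feature_extractor.predict(batch, verbose=0)[0]   # shape (4096,)
```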
Frames are processed through an Encoder-Decoder pipeline, where the encoder, typically a Convolutional Neural Network (CNN), extracts visual features, and the decoder, often a Long Short-Term Memory (LSTM) network, generates captions sequentially.
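One common way to realize such an encoder-decoder captioner is the "merge" pattern sketched below, where an LSTM over the partial caption is combined with the CNN frame features to predict the next word; vocabulary size, sequence length, and hidden dimensions are illustrative assumptions rather than the project's final architecture.

```python
# Minimal CNN-feature + LSTM "merge" captioning decoder in Keras.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Add, Dropout
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_LEN, FEAT_DIM = 5000, 30, 4096     # assumed sizes

feat_in = Input(shape=(FEAT_DIM,))                  # encoder output (e.g. VGG16 fc2)
feat = Dense(256, activation="relu")(Dropout(0.5)(feat_in))

seq_in = Input(shape=(MAX_LEN,))                    # caption words generated so far
seq = Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
seq = LSTM(256)(Dropout(0.5)(seq))

merged = Dense(256, activation="relu")(Add()([feat, seq]))
word_probs = Dense(VOCAB_SIZE, activation="softmax")(merged)   # next-word distribution

caption_model = Model([feat_in, seq_in], word_probs)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

At inference time the decoder is run word by word, feeding each predicted token back into the sequence input until an end-of-caption token is produced.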
Both visual and textual modalities are integrated to build a comprehensive understanding of the video content; visual and textual features are merged at appropriate stages of the model to support effective caption generation.
Performance evaluation metrics such as BLEU (Bilingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit Ordering), and CIDEr (Consensus-based Image Description Evaluation) are employed to assess the quality and coherence of the generated captions against ground truth annotations. These measures help in quantitatively evaluating the effectiveness of the proposed model architecture.
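As a small illustrative example, a generated caption can be scored against a reference with NLTK's BLEU implementation as below; METEOR and CIDEr would be computed analogously with their respective tools (for instance, the coco-caption evaluation package), and the captions shown are made-up placeholders.

```python
# Score a candidate caption against a ground-truth reference with sentence-level BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "breaks", "into", "a", "parked", "car"]]   # ground truth
candidate = ["a", "man", "is", "breaking", "into", "a", "car"]        # model output

smooth = SmoothingFunction().method1                 # avoids zero scores on short texts
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```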
NLP techniques are incorporated for predictive text-based activity recognition, enabling the system to anticipate significant events from video caption data efficiently. NLP also improves accessibility for users with disabilities through voice command functionality, text-to-speech output, and voice-to-text conversion with tools such as Whisper, supporting seamless interaction with the system.
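For the voice-command path, a minimal transcription sketch with the open-source whisper package is shown below; the model size ("base") and audio file name are illustrative assumptions.

```python
# Transcribe a spoken search command to text with Whisper.
import whisper

model = whisper.load_model("base")                   # small pretrained checkpoint
result = model.transcribe("voice_command.wav")       # hypothetical recorded command
query_text = result["text"]                          # e.g. "show all events near gate two"
print(query_text)
```

The resulting text can then be passed directly to the text-first search step described earlier.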
Dr. Usama Ijaz Bajwa
Co-PI, Video Analytics lab, National Centre in Big Data and Cloud Computing,
HEC Approved PhD Supervisor,
Tenured Associate Professor
Department of Computer Science,
COMSATS University Islamabad, Lahore Campus, Pakistan