Camera-Captured Labelled Dataset from Indian Roads for Autonomous Vehicles
Sarita Gautam, Anuj Kumar
Object detection is a core requirement for autonomous driving systems, as real-world road environments contain multiple dynamic and static objects that must be accurately identified to ensure safety and efficiency. Indian road conditions pose additional challenges due to dense traffic, heterogeneous vehicle types, and a high proportion of moving objects, which significantly increase collision risk compared to stationary obstacles. Moreover, publicly available datasets that realistically represent Indian roads are extremely limited.
To address this gap, we introduce a new, high-resolution dataset collected from real traffic scenarios in two major Indian cities—New Delhi and Chandigarh. The data were captured using a 64-megapixel camera mounted securely on a vehicle, enabling stable and continuous recording of road scenes. Traffic videos were first recorded during regular driving conditions, after which individual frames were systematically extracted for dataset construction. Frame extraction was performed using the VLC media player, and the complete procedure is illustrated step by step with visual guidance to ensure reproducibility. This dataset is designed to support robust object detection and classification research tailored to the unique characteristics of Indian road environments.
Videos from Chandigarh roads. Please click on the link to watch the videos.
Reliable object detection is fundamental to the safe operation of autonomous vehicles, as road environments are populated with numerous interacting objects that vary in size, motion, and behavior. The complexity of Indian roads further intensifies this challenge due to heavy traffic density, diverse categories of vehicles, and a predominance of moving road users, all of which elevate the likelihood of collisions. Despite these realities, there is a notable lack of publicly available datasets that accurately capture the visual and contextual diversity of Indian traffic conditions.
To overcome this limitation, we present a newly developed high-resolution dataset acquired from real-world traffic scenes in New Delhi and Chandigarh. Video data were recorded using a 64-megapixel camera rigidly mounted on a moving vehicle to ensure consistent and high-quality capture. From these recordings, individual frames were extracted in a structured manner to form the dataset. The frame extraction process was carried out using the VLC media player, and the complete workflow is documented with illustrative images to facilitate transparency and reproducibility. This dataset aims to provide a realistic and reliable benchmark for advancing object detection and classification methods under authentic Indian road conditions.
Frame Extraction from Videos:
The video sequences recorded with the vehicle-mounted 64-megapixel camera were systematically processed to extract individual frames, which collectively form the dataset. Frame extraction was carried out using the VLC media player, and the entire procedure is documented step by step with illustrative images to support clarity, reproducibility, and ease of adoption.
Frames extracted from video sequences
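The extraction described above is performed manually through the VLC media player. For readers who prefer a scriptable alternative, the minimal Python sketch below performs an equivalent extraction with OpenCV; the video filename, output directory, and sampling interval are illustrative assumptions, not values from the paper.

```python
# Minimal frame-extraction sketch using OpenCV, as a scriptable
# alternative to the VLC-based procedure described above.
# Paths and the sampling step are illustrative assumptions.
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, every_n: int = 30) -> int:
    """Save every `every_n`-th frame of `video_path` as a JPEG in `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if idx % every_n == 0:
            cv2.imwrite(str(out / f"frame_{idx:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

if __name__ == "__main__":
    n = extract_frames("chandigarh_drive.mp4", "frames/", every_n=30)
    print(f"Extracted {n} frames")
```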
SuperAnnotate Tool Welcome Screen:
This interface serves as the initial welcome screen of the SuperAnnotate platform and appears immediately after launching the SuperAnnotate application. From this window, users can initiate a new project by selecting the “New Project” option and assigning an appropriate project name. At this stage, the directory containing the image dataset must also be specified, as this enables the platform to load the images for annotation.
Once the project is successfully created, it opens in a separate workspace. This new window functions as the main annotation environment, where users can perform detailed labeling and manage all annotation-related tasks.
Once a folder is selected on the home screen, all associated images are automatically imported and displayed within the task panel. The annotation workspace provides a comprehensive toolbar on the left side, offering a range of tools for precise labeling, including rectangle, polyline, polygon, bounding box, ellipse, cuboid, zoom controls, eyedropper, and bucket tools.
Subsequently, annotation categories were defined based on the objects of interest, such as car, truck, bus, person, traffic light, green traffic light, and red traffic light. Using the appropriate annotation tools, each object within the images was then carefully labeled according to its class. The annotation process was carried out systematically, as illustrated in the figure below, to ensure accuracy and consistency across the dataset.
The selection of classes is guided by the types of objects visible within the images. Each object is enclosed using a bounding box, after which the corresponding class label is assigned. To clearly distinguish between different object categories, unique color codes are used for each class, ensuring better visual clarity and easier interpretation during annotation.
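Once labelling is complete, the annotations are exported for downstream use. As a quick sanity check on class balance, the following Python sketch reads back the exported boxes and labels; it assumes a COCO-style JSON export (one common export option in annotation tools such as SuperAnnotate) with an illustrative file path.

```python
# Sketch: summarizing a COCO-style annotation export to verify class balance.
# The COCO-format assumption and the file path are illustrative; adapt them
# to the export format actually produced for this dataset.
import json
from collections import Counter

def summarize_annotations(coco_json: str) -> None:
    with open(coco_json) as f:
        data = json.load(f)
    # Map category ids (e.g. car, truck, bus, person, traffic light) to names.
    id_to_name = {c["id"]: c["name"] for c in data["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in data["annotations"])
    print(f"{len(data['images'])} images, {len(data['annotations'])} boxes")
    for name, n in counts.most_common():
        print(f"  {name:20s} {n}")

if __name__ == "__main__":
    summarize_annotations("annotations/instances_train.json")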
To assess the effectiveness of the annotated dataset, a transfer learning–based evaluation strategy was adopted. Transfer learning enables the reuse of knowledge from models pre-trained on large-scale datasets, allowing efficient learning when applied to related but newly introduced data. In this study, the primary focus during evaluation was the detection of traffic lights.
A dedicated traffic-light extraction model was first developed to isolate and crop traffic light regions from the original images. These cropped samples were then used to detect traffic lights along with other relevant objects. For model training, the pre-trained Inception-V3 architecture was employed within a single-shot detection framework, enabling efficient object localization and classification.
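As a concrete illustration of this transfer-learning step, the Keras sketch below fine-tunes an ImageNet-pretrained Inception-V3 backbone on the cropped traffic-light patches. It covers only the classification side, not the full single-shot detection pipeline, and the directory layout, image size, and head architecture are assumptions rather than details taken from the paper.

```python
# Transfer-learning sketch: ImageNet-pretrained Inception-V3 backbone with a
# small classification head, fine-tuned on cropped traffic-light patches.
# Directory layout, image size, and hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = (299, 299)  # Inception-V3's native input resolution

train_ds = tf.keras.utils.image_dataset_from_directory(
    "crops/train", image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "crops/val", image_size=IMG_SIZE, batch_size=32)
num_classes = len(train_ds.class_names)

base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=IMG_SIZE + (3,))
base.trainable = False  # freeze the pretrained backbone initially

inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
x = tf.keras.applications.inception_v3.preprocess_input(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = models.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=10)
```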
Sample cropped traffic light images used during experimentation are presented in Table-5. All experiments were conducted on a system equipped with an NVIDIA GeForce GTX 1650 Ti GPU with 8 GB RAM. The proposed approach achieved a test accuracy of 97.23%, demonstrating strong performance. The corresponding training and evaluation trends are illustrated in the line graph shown below.
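Training and validation curves of the kind shown in that graph can be drawn directly from the history object returned by Keras; a minimal matplotlib sketch follows, with names carried over from the fine-tuning sketch above as assumptions.

```python
# Sketch: plotting training and validation accuracy from the Keras History
# object produced by model.fit(...) above (names are illustrative).
import matplotlib.pyplot as plt

def plot_history(history) -> None:
    epochs = range(1, len(history.history["accuracy"]) + 1)
    plt.plot(epochs, history.history["accuracy"], label="train accuracy")
    plt.plot(epochs, history.history["val_accuracy"], label="validation accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.tight_layout()
    plt.savefig("training_curves.png", dpi=150)

# Usage: plot_history(history)  # `history` returned by model.fit(...)
```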
The Kaggle link to our dataset is provided here:
Click Here: