We prepared EnvoDat to support various computer vision applications, including (a) supervised learning algorithms and (b) language and visual foundation models. We provide the annotated data in several formats, e.g., JSON (COCO, OpenAI, PaliGemma, Florence2, etc.), XML (Pascal VOC), TXT (YOLOv* series and YOLO Darknet), and CSV (TensorFlow Object Detection, RetinaNet Keras, etc.), to suit your intended application.
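To illustrate how two of these formats relate, the sketch below converts a COCO-style bounding box ([x_min, y_min, width, height] in pixels) into a YOLO-style TXT line (class index followed by the normalized centre x, centre y, width, and height). The function names and sample values are hypothetical and are not taken from the EnvoDat tooling; the exported files already ship in each format, so this is only meant to show what the numbers mean.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, w, h] in pixels to
    YOLO (cx, cy, w, h), each normalized to the range [0, 1]."""
    x_min, y_min, w, h = bbox
    cx = (x_min + w / 2) / img_w  # box centre, as a fraction of image width
    cy = (y_min + h / 2) / img_h  # box centre, as a fraction of image height
    return cx, cy, w / img_w, h / img_h


def yolo_line(class_id, bbox, img_w, img_h):
    """Format one annotation as a line of a YOLO TXT label file."""
    cx, cy, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    # Hypothetical box in a 400x200 image.
    print(yolo_line(0, [100, 50, 200, 100], img_w=400, img_h=200))
```

The class index in a YOLO label file refers to the position of the class name in the accompanying class list, so the mapping from COCO category IDs to YOLO indices has to be taken from the exported files rather than assumed.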
Below is an illustrative grid of the annotated images that we used for the object detection and classification tasks. Color-coded circles at the left corner of each image indicate the split to which the image belongs. We used a split ratio of 70% for training (pink), 20% for validation (blue), and 10% for testing (orange). The training set is used to train the models. The validation set is used to tune the model's hyperparameters and to check generalization to unseen data, which helps detect over-fitting. The test set is used to evaluate the model's final performance.
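For readers who want to reproduce a 70/20/10 split on their own images, a minimal sketch follows. It assumes a simple seeded random shuffle; the function name, the seed, and the "train"/"valid"/"test" keys are hypothetical, and the EnvoDat splits themselves may have been produced differently.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items reproducibly and cut them into train/valid/test parts.

    ratios gives the (train, valid, test) fractions; the test part
    receives whatever remains after the first two cuts.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n = len(items)
    n_train = round(n * ratios[0])
    n_valid = round(n * ratios[1])
    return {
        "train": items[:n_train],
        "valid": items[n_train:n_train + n_valid],
        "test": items[n_train + n_valid:],
    }


if __name__ == "__main__":
    parts = split_dataset([f"img_{i:04d}.jpg" for i in range(100)])
    print({name: len(files) for name, files in parts.items()})
```

If images come from continuous video sequences, splitting by sequence rather than by individual frame may be preferable, since near-identical frames in different splits can inflate validation and test scores.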
We provide a structured categorization of various objects and entities into predefined semantic categories based on their functions, physical properties, or contexts of use. We describe each semantic category as follows:
Human: We labelled any visible person as a 'person' regardless of activity or position.
Transport: This category includes vehicles used for transport, e.g., trucks, cars, vans, buses, excavators, tractors, bicycles, scooters, motorcycles, and trains. We distinguish vehicle types based on size and intended use. For example, a four-wheeled vehicle used for moving goods is labelled 'truck', whereas a smaller passenger vehicle is labelled 'car'.
Furniture and Appliances: Labels include furniture and household items, such as tables, chairs, desks, ventilators, and sockets. We labelled these items whenever they were identifiable in the scene, irrespective of orientation, shape, or occlusion.
Lighting: Covers light sources and lighting-related objects, including light bulbs, reflections, shadows, torches, torchlights, etc. Our annotation focused more on highly visible light. We also annotated ambient lighting sources such as ceiling lights.
Architecture and Structure: This category includes architectural elements and structural features, like doors, pillars, staircases, windows (and frames), poles, houses, glass windows, buildings, and glass frames. We labelled these items whenever they are visible in the scene, without distinguishing between orientation, shape, or colour. For windows, however, we distinguish between glass windows and bare window frames ('glass frame'). A single glass window should contain about four window frames (top, left, right, and bottom frames).
Safety: This category encompasses safety-related items such as fire extinguishers, exit signs, warning signs, helmets, and signposts. Again, we did not distinguish between their type, colour, orientation, or shape. As long as they are partially or fully visible in the scene, they are labelled.
Utility: This category consists of general-use items, like trash bins, bags, boxes, buckets, backpacks, handbags, banners, cloths, fans, monitors, personal computers, sofas, and vacuum cleaners. If an item is ambiguous (e.g., similar to an item not in the category), we use the closest matching label, e.g., 'box' for a carton.
Nature and Vegetation: This category contains elements of natural environments, including bushes, grass, flowers, trees, sky, and other similar natural features. We labelled items like 'water', 'bush', and 'grass' based on context, e.g., flowing water as 'river' vs. stagnant water as 'water pool'.
Hazard: Items that can pose a risk, such as 'wet floor' or 'dark room', are labelled if visible, especially if they are relevant to navigation or safety tasks.
Flooring and Hard Surfaces: Covers surfaces and flooring materials. We distinguish between materials like 'concrete' vs. 'asphalt' based on texture. If the texture is indistinct, for example on a wall, we use the label 'textureless wall'.
Construction: These are construction-related items, such as ladders, iron and wooden rails, debris, guard rails, barricades, etc. We labelled each identifiable item individually, even if partially obscured.
Infrastructure and Traffic: Here, we identify infrastructure elements like 'traffic lights' and 'crosswalk'. If multiple infrastructure items appear in the scene, we label each one separately.
Miscellaneous: We labelled any object or terrain class that does not fit into any of the above categories as 'miscellaneous'. However, we used this label sparingly to maintain the dataset's specificity.
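The category scheme above can be sketched as a simple lookup table, which is convenient for grouping per-class results by semantic category. The mapping below covers only the labels quoted in the descriptions above and is illustrative rather than the complete EnvoDat label list; the dictionary and function names are hypothetical, and the exact label strings in the released annotation files may differ.

```python
# Partial, illustrative mapping built from the labels quoted in the text above.
CATEGORY_LABELS = {
    "Human": ["person"],
    "Transport": ["truck", "car"],
    "Nature and Vegetation": ["river", "water pool"],
    "Hazard": ["wet floor", "dark room"],
    "Flooring and Hard Surfaces": ["concrete", "asphalt", "textureless wall"],
    "Infrastructure and Traffic": ["traffic lights", "crosswalk"],
}

# Invert the table so each label points to its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in CATEGORY_LABELS.items()
    for label in labels
}


def category_of(label):
    """Return the semantic category of a label, or 'Miscellaneous' if unknown."""
    return LABEL_TO_CATEGORY.get(label.strip().lower(), "Miscellaneous")


if __name__ == "__main__":
    for name in ["Truck", "wet floor", "unlisted object"]:
        print(name, "->", category_of(name))
```

Falling back to 'Miscellaneous' mirrors the rule stated above for classes that fit no other category; in this partial sketch it also catches every label that has simply not been entered into the table.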
For details on how to replicate the above results, or train the models on your custom dataset, kindly follow the example instructions at the EnvoDat GitHub repository.