We provide an overall description of the proposed competition tracks.
T1. Spatiotemporal agent tracking annotations:
For T1, participants are required to provide two types of annotations. The first type consists of "agent tubes": sequences of bounding boxes that track an active road agent across consecutive frames, in the same format as the training and validation splits. The second type is a detailed textual description of the agent within those frames. For example, for agents classified as pedestrians, the description should cover characteristics such as height, build, age, clothing, and behavioral attributes.
An example description might read: height - approximately 180 cm; build - medium; age - mid-30s; clothing - blue shirt and dark jeans; behavioral attributes - walking at a brisk pace.
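A minimal sketch of what such a submission entry could look like, assuming a JSON-style layout with per-frame boxes given as (x1, y1, x2, y2) pixel coordinates; all field names and values here are illustrative assumptions, not the official submission schema.

```python
# Illustrative agent-tube annotation (hypothetical field names, not the
# official submission schema): per-frame boxes are (x1, y1, x2, y2) in pixels.
agent_tube_annotation = {
    "video_id": "val_video_0001",   # hypothetical identifier
    "agent_id": 17,                 # track id, constant across the tube
    "agent_class": "pedestrian",
    "boxes": {                      # frame index -> bounding box
        120: (410, 215, 455, 360),
        121: (413, 214, 458, 361),
        122: (416, 214, 461, 362),
    },
    "description": (
        "height - approximately 180 cm; build - medium; age - mid-30s; "
        "clothing - blue shirt and dark jeans; "
        "behavioral attributes - walking at a brisk pace"
    ),
}
```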
T2. Spatiotemporal road event detection annotations:
For T2, participants are required to provide two types of annotations. The first type, called a "road event," is the triplet (Agent, Action, Location). Each road event is represented as a tube of frame-level detections, in the same format as the training and validation splits of the datasets. The second type of annotation is a textual description of the triplet.
For example, for the triplet (agent: pedestrian, action: moving_away, location: on_left_pavement), the textual description could be "A pedestrian on the left pavement is walking away". The textual annotations will be evaluated subjectively, and no strict format is required.
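A minimal sketch of a possible road-event entry under the same assumptions as above (JSON-style layout, illustrative field names, per-frame boxes in pixels); this is not the official submission schema.

```python
# Illustrative road-event annotation (hypothetical field names, not the
# official submission schema): the triplet plus a tube of frame-level boxes.
road_event_annotation = {
    "video_id": "val_video_0002",   # hypothetical identifier
    "event_id": 3,
    "triplet": {
        "agent": "pedestrian",
        "action": "moving_away",
        "location": "on_left_pavement",
    },
    "boxes": {                      # frame index -> (x1, y1, x2, y2) in pixels
        540: (88, 190, 132, 330),
        541: (86, 190, 130, 331),
        542: (84, 191, 128, 331),
    },
    "description": "A pedestrian on the left pavement is walking away",
}
```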
T1. Spatiotemporal agent tracking annotations.
Quantitative Results (conventional annotations):
Qualitative Results (textual annotations):
HIL (no submission) - 0
PCIE_RoadPP (textual annotations retrieved from submission 473378 are not aligned with the test video frames) - 0
dingling3 (no submission) - 0
gro (provided textual annotations are not aligned with the test video frames) - 0
Track 1 winner (by highest score): HIL
T2. Spatiotemporal road event detection annotations.
Quantitative Results (conventional annotations):
Qualitative Results (textual annotations):
gro (provided textual annotations are not aligned with the test video frames) - 0
PCIE_RoadPP (each per-video score is averaged over 5 randomly sampled tubes, three judges, and 5 metrics, each scored on a 0-2 scale):
randomly sampled video 1: 1.42
randomly sampled video 2: 1.35
randomly sampled video 3: 1.18
randomly sampled video 4: 0.90
randomly sampled video 5: 1.21
randomly sampled video 6: 1.33
Scaled average (rescaled to the 0-1 range): 0.61
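As a rough sketch of the arithmetic, assuming the scaled average simply maps the mean per-video score from the 0-2 judge scale onto 0-1 (this mapping is our reading of the numbers, not a stated rule):

```python
# Per-video qualitative scores for PCIE_RoadPP, each already averaged over
# 5 sampled tubes, three judges, and 5 metrics on a 0-2 scale.
per_video = [1.42, 1.35, 1.18, 0.90, 1.21, 1.33]

mean_score = sum(per_video) / len(per_video)  # ~1.23 on the 0-2 scale
scaled = mean_score / 2.0                     # mapped onto 0-1: ~0.62 from these
                                              # rounded per-video values, vs. the
                                              # reported 0.61, which was presumably
                                              # computed from unrounded scores
print(f"mean = {mean_score:.2f}, scaled = {scaled:.2f}")
```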
Combined Results (quantitative + scaled qualitative):
PCIE_RoadPP: 1.04 (0.43 + 0.61)
gro: 0.46 (0.46 + 0)
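A minimal sketch of the assumed combination rule, inferred from the figures above rather than from an official formula: each entry's final Track 2 score is the sum of its quantitative score and its scaled qualitative score, both on a 0-1 scale, so the maximum combined score is 2.

```python
# Assumed combination rule (inferred from the reported figures, not an
# official formula): combined = quantitative + scaled qualitative, max 2.
def combined_score(quantitative: float, qualitative_scaled: float) -> float:
    return quantitative + qualitative_scaled

print(f"PCIE_RoadPP: {combined_score(0.43, 0.61):.2f}")  # 1.04
print(f"gro:         {combined_score(0.46, 0.00):.2f}")  # 0.46
```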
Track 2 winner (by highest score): PCIE_RoadPP