The goal of this competition is three-fold; the overall aim is to propose machine learning models able to determine, within each frame:
the presence/absence of objects
the number and location of any objects
the type of each object
The images are taken under varying light conditions, leading to darker or over-exposed frames, background colors that change with water conditions and depth, and different levels of blurriness (Figure 2). Each video frame is either “empty” (Figure 2) or contains one or more “objects”, such as the head of the penguin carrying the bio-logging device, prey items such as fish, krill and jellyfish, krill aggregations, and other penguins (Figure 3).
Figure 2: Examples of frames labelled as “empty”, without objects of interest
Figure 3: Examples of frames obtained from video data collected from penguins. Objects that can be identified within the frames include, for example, the head of the penguin carrying the bio-logging device, prey items such as fish, krill and jellyfish, prey aggregations, and other penguins. Color, exposure and blurriness vary within and between videos.
Researchers at the Centre d’Etudes Biologiques de Chizé (CEBC, France; see also the organisers below) have accumulated many hours of video-recording data from 20 Adélie penguins (Pygoscelis adeliae) at the Dumont d'Urville French Antarctic station. The cameras recorded continuously at 30 fps for 5-6 hours. To build our dataset, we extracted frames from all videos and performed image labelling using the open-source platform Label Studio (https://labelstud.io/). Specifically, we labelled four main classes, "Carrier”, “Prey”, “Jellyfish” and “Other penguin”, respectively indicating the presence within the frame of the penguin carrying the camera (visible in the frames in Figures 1 and 3), prey, jellyfish, and other penguins (Table 1).
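As a rough illustration of the frame-extraction step, the sketch below uses OpenCV to save one frame every N frames of a video as PNG images. The file names, sampling rate and output layout are hypothetical and do not describe the organisers' actual extraction pipeline.

import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, every_n=30):
    """Save every `every_n`-th frame of a video as a PNG image (illustrative only)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:              # end of video
            break
        if idx % every_n == 0:  # keep one frame out of every `every_n`
            name = f"{Path(video_path).stem}_{idx:06d}.png"
            cv2.imwrite(str(out_dir / name), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example: roughly one PNG per second for a 30 fps video (hypothetical file name)
extract_frames("penguin_video.mp4", "frames/", every_n=30)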
Table 1: Examples of object classes identified across video frames
We provide 50,000 images, comprising both frames containing these objects and empty frames. To reflect the realistic structure of data obtained from a video, the dataset is imbalanced, with a higher number of empty frames than frames containing objects. It is worth noting that the dataset with its ground truth has already been compiled, and all evaluation scripts and formats are already defined. The dataset will be made available through this website after the competition.
The three main tasks of the competition are as follows:
1. Binary classification: This task aims to separate video frames into “empty” (containing only background) and “object” (containing objects of interest). It returns a binary (0/1) output indicating the presence or absence of detected objects.
2. Object detection: This task aims to detect objects within video frames. It returns the number and location (e.g. object center pixel coordinates or bounding boxes) of the objects detected within each frame.
3. Object recognition: This task returns the same output as object detection, with an additional class label representing the type of each recognized object (a purely illustrative per-frame output structure is sketched after the input/output specification below).
a) Input for participants: PNG images
b) Output from participants: Python code used to manipulate and analyse the video images provided, together with a model description and the reasoning behind its use.
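To make the three task outputs concrete, the sketch below shows one possible per-frame prediction record in Python. It is purely illustrative: the field names, the 0/1 convention and the bounding-box format are assumptions, and the official submission format is the one defined by the organisers' evaluation scripts.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FramePrediction:
    # Hypothetical per-frame record covering the three tasks
    frame_id: str                     # e.g. file name of the PNG frame
    has_object: int                   # Task 1: assumed convention 0 = empty, 1 = object present
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)
    # Task 2: one (x_min, y_min, x_max, y_max) bounding box per detected object
    labels: List[str] = field(default_factory=list)
    # Task 3: one class label per box, e.g. "Carrier", "Prey", "Jellyfish", "Other penguin"

# Example: a frame with the camera carrier's head and one prey item detected
pred = FramePrediction(
    frame_id="frame_000123.png",
    has_object=1,
    boxes=[(120.0, 80.0, 260.0, 210.0), (300.0, 150.0, 340.0, 190.0)],
    labels=["Carrier", "Prey"],
)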
Our proposed evaluation assesses how close participants’ results are to the manual image labelling and classification. Results submitted by participants will be processed in the following way:
1. compute measures (presence/absence, location, number and type of object detected/classified) for each task performed;
2. evaluate the confusion matrix to understand the misclassification patterns (a minimal sketch is given after this list);
3. rank models separately by accuracy for each task.
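As a minimal sketch of how accuracy and the confusion matrix could be computed for the binary task, assuming ground-truth and predicted labels are available as lists of 0/1 values (the labels below are made up), scikit-learn can be used as follows; the actual evaluation scripts and metrics are those defined by the organisers.

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]   # hypothetical ground-truth labels (0 = empty, 1 = object)
y_pred = [0, 1, 1, 1, 0, 0]   # hypothetical model predictions

cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class
acc = accuracy_score(y_true, y_pred)    # fraction of correctly classified frames

print("Confusion matrix:\n", cm)
print("Accuracy:", acc)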