Track 3: Affordance-Centric Question-driven Task Completion Challenge

  • This track aims to encourage participants to advance the state of the art on the Affordance-Centric Question-driven Task Completion (AQTC) task.

  • Our collected AssistQ dataset is the first dataset for the AQTC task. Participants should use it to train and test their algorithms.

  • The competition is based on our AssistQ test set only.

  • The top 2 winners will be mentioned at the workshop and formally recognized.

For more details, please refer to our Challenge White Paper. For any questions about CodaLab, please post in its forum.

AssistQ Dataset

Due to the small data size, we release the training set first. The testing set will be released on May 01, 2022 (11:59 PM Pacific Time).

  • Download the training set and testing set (without ground-truth labels): we will send you the download link after you fill in the AssistQ Downloading Agreement.

  • In each data folder, there are several files:

(1) video.mp4 / video.mov: instructional video;

(2) script.txt: the video script with timestamps. For example,

0:00:00-0:00:04 How to start, stop, start and stop airfryer? Turn the temperature knob anticlockwise to 120 degrees.

0:00:04-0:00:07 Turn the time knob clockwise to 10 minutes.

...

The fields of each line, from left to right, are: start time-end time, then the text script. The time format follows HH:MM:SS.

(3) buttons.csv: button bounding-box annotation. For example,

button1,362,86,185,72,airfryer-user.jpg,960,1280

button2,378,330,185,170,airfryer-user.jpg,960,1280

...

The fields of each row, from left to right, are: button name, top-left x, top-left y, width, height, image filename, image width, image height (see the parsing sketch after this file list).

(4) images/ folder: contains the image files referenced in buttons.csv.
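To make the file layouts above concrete, here is a minimal parsing sketch in Python. The data-folder path in the usage note is illustrative; only the field layouts follow the descriptions of script.txt and buttons.csv above.

import csv
import re

def parse_script(path):
    """Parse script.txt lines of the form 'H:MM:SS-H:MM:SS text script'."""
    segments = []
    pattern = re.compile(r"^(\d+:\d{2}:\d{2})-(\d+:\d{2}:\d{2})\s+(.*)$")
    with open(path, encoding="utf-8") as f:
        for line in f:
            match = pattern.match(line.strip())
            if match:
                start, end, text = match.groups()
                segments.append({"start": start, "end": end, "text": text})
    return segments

def parse_buttons(path):
    """Parse buttons.csv rows: name, x, y, w, h, image filename, image width, image height."""
    buttons = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            name, x, y, w, h, image, img_w, img_h = row
            buttons.append({
                "name": name,
                "bbox": (int(x), int(y), int(w), int(h)),  # top-left x, top-left y, width, height
                "image": image,
                "image_size": (int(img_w), int(img_h)),
            })
    return buttons

For example, parse_buttons("airfryer_gye82/buttons.csv") (assuming the per-video folder layout above) returns one entry per row, with "bbox" holding the top-left corner and size of each button.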

  • Question-Answer Annotations: we aggregate the annotations of all data samples in train.json. A video can have multiple questions, and each question must be answered in multiple steps and across multiple modalities. Specifically, each data index (e.g., coffeemachine_d2stw, diffuser_lxcd4) corresponds to a list containing multiple question-answer pairs:

{'aircon_utr3b': [{...}, {...}, {...}, {...}, {...}, {...}], 'airfryer_gye82': [{...}, {...}, {...}, {...}, {...}], 'airfryer_pe2j7': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}], 'airfryer_w9rzm': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, ...], 'bicycle_g8h94': [{...}, {...}, {...}], ...}

For each data sample, there are multiple question-answer pairs (a minimal loading sketch follows this example):

[
  {
    "question": "How to bake a cake at 120 degrees for 15 minutes?",
    "answers": [
      ["Turn <button1> clockwise", "Turn <button1> anticlockwise", "Turn <button2> clockwise", "Turn <button2> anticlockwise to 0 minutes", "Turn <button1> to 200 degrees", "Turn <button1> to 120 degrees", "Turn <button1> to 180 degrees", "Turn <button2> clockwise to 3 minutes", "Turn <button2> clockwise to 10 minutes", "Turn <button2> clockwise to 15 minutes"],
      ["Turn <button1> clockwise", "Turn <button1> anticlockwise", "Turn <button2> clockwise", "Turn <button2> anticlockwise to 0 minutes", "Turn <button1> to 200 degrees", "Turn <button1> to 120 degrees", "Turn <button1> to 180 degrees", "Turn <button2> clockwise to 3 minutes", "Turn <button2> clockwise to 10 minutes", "Turn <button2> clockwise to 15 minutes"]
    ], # candidate answers for each step (2 steps in this case)
    "correct": [6, 10], # correct answer index for each step (starting from 1); not released in the testing set
    "images": ["airfryer-user.jpg", "airfryer-user.jpg"] # user-view image for each step, referenced in buttons.csv
  },
  {...},
  ...
]
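As a quick check of this structure, the following minimal sketch (assuming train.json is in the working directory) loads the annotations and iterates over every question-answer pair; the field names follow the example above.

import json

# Load the aggregated question-answer annotations.
with open("train.json", encoding="utf-8") as f:
    annotations = json.load(f)

for data_index, qa_pairs in annotations.items():  # e.g. "airfryer_gye82"
    for qa in qa_pairs:
        question = qa["question"]  # natural-language question
        answers = qa["answers"]    # one candidate-answer list per step
        correct = qa["correct"]    # 1-based correct index per step (training set only)
        images = qa["images"]      # user-view image per step, referenced in buttons.csv
        assert len(answers) == len(correct) == len(images)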

Baseline

  • For the starter baseline code, see GitHub.

Evaluation Protocol

  • The "multi-step answers'" in AQTC is similar to "multi-round dialogues'' in Visual Dialog. Therefore, we follow the evaluation metrics in Visual Dialog:

(1) Recall@K measures how often the ground-truth answer is ranked within the top-K choices. Higher Recall@K denotes better performance. We evaluate Recall@1 and Recall@3 in our experiments.

(2) Mean rank (MR) is the mean predicted ranking position of the correct answer. Lower MR is better.

(3) Mean reciprocal rank (MRR) is the mean of the reciprocal of the predicted ranking position of the correct answer. Higher MRR is better.

  • The evaluation metrics (Recall@1, Recall@3, MR, MRR) are averaged over all answer steps. They are computed on the AssistQ testing set.

  • The code for evaluation can also be found on GitHub.
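For reference, the sketch below computes the three metrics per answer step directly from candidate scores and the 1-based correct indices. It assumes higher scores are better and does not handle ties specially, so defer to the official evaluation code above for the authoritative implementation.

import numpy as np

def rank_of_correct(scores, correct_index):
    # 1-based rank of the correct candidate when sorted by descending score;
    # correct_index is 1-based, matching the "correct" field in the annotations.
    order = np.argsort(-np.asarray(scores))
    return int(np.where(order == correct_index - 1)[0][0]) + 1

def evaluate(all_scores, all_correct, k_values=(1, 3)):
    # Average Recall@K, MR, and MRR over all answer steps.
    ranks = np.array([rank_of_correct(s, c) for s, c in zip(all_scores, all_correct)])
    metrics = {f"Recall@{k}": float(np.mean(ranks <= k)) for k in k_values}
    metrics["MR"] = float(ranks.mean())            # lower is better
    metrics["MRR"] = float((1.0 / ranks).mean())   # higher is better
    return metrics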

Submission

  • Codalab competition with leaderboard: https://codalab.lisn.upsaclay.fr/competitions/4642

  • To submit your results to the leaderboard, you must construct a submission zip file containing submit_test.json for the test data. Use the following command to generate the submission file:

zip -r submit_test.zip submit_test.json

  • The format of submit_test.json is simple: you only need to organize the scores of the candidate answers as a dictionary (an assembly sketch follows this example):

{
  "blender_92uto": # data index
  [
    {
      "question": "How to bake a cake at 120 degrees for 15 minutes?", # do not change the question
      "scores": # scores of candidate answers
      [
        [0.1, 0.2, 0.3, ...],
        [0.3, 0.2, 0.1, ...]
      ]
    },
    {...},
    ...
  ],
  "...": {...},
  ...
}
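The sketch below shows one way to assemble and zip such a file. The test annotation filename (test.json) and the model_scores placeholder are assumptions to be replaced with your own data paths and model outputs.

import json
import zipfile

def model_scores(qa):
    # Placeholder: replace with your model's per-step candidate scores.
    return [[0.0] * len(step_candidates) for step_candidates in qa["answers"]]

# Test annotations are shipped without the "correct" field (filename assumed to be test.json).
with open("test.json", encoding="utf-8") as f:
    test_annotations = json.load(f)

# Data index -> list of per-question entries, mirroring the format above.
submission = {
    data_index: [
        {"question": qa["question"],   # keep the question text unchanged
         "scores": model_scores(qa)}   # one score list per step, aligned with "answers"
        for qa in qa_pairs
    ]
    for data_index, qa_pairs in test_annotations.items()
}

with open("submit_test.json", "w", encoding="utf-8") as f:
    json.dump(submission, f)

# Equivalent to: zip -r submit_test.zip submit_test.json
with zipfile.ZipFile("submit_test.zip", "w") as zf:
    zf.write("submit_test.json")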

  • If you have questions about the submission format or problems with your submission, please create a topic in the competition forum (rather than contacting the organizers directly by e-mail) and we will answer as soon as possible.

Registration & Report Submission Portal

Please send an email to loveu.cvpr22@gmail.com.

  • Email subject format: “YourName-Submission-LOVEU22-Track3”.

  • Please include metadata such as your team members, institution, etc.

  • Attach your technical report and other relevant materials to the email.

For more details, please refer to our Challenge White Paper.

Timeline:

Note: due to the small data size, there is no separate validation set. Participants may construct one themselves.


Communication & QA