Statement Verification and Evidence Finding with Tables is a shared task at SemEval 2021 (Task 9).
Tables are ubiquitous in documents and presentations for conveying important information concisely. This is true in many domains, from scientific to government documents. In fact, the text surrounding tables in these articles often contains statements that summarize or highlight information derived from the primary source data in the tables. Describing all of the information in a table in prose would be lengthy and considerably harder to understand. We present a task for statement verification and evidence finding using tables from scientific articles. This important task promotes proper interpretation of the surrounding article.
The task will have two subtasks to explore table understanding:
A: Table Statement Support
Does the table support the given statement?
B: Relevant Cell Selection
Which cells in the table provide evidence for the statement?
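To make the output shape of the two subtasks concrete, the sketch below shows one way a system prediction could be represented in Python: a three-way label (entailed/refuted/unknown, the labels used in this task) for Subtask A, plus a set of relevant cell identifiers for Subtask B. The class and the identifier strings are purely illustrative; the actual data and submission formats are defined in the README and schema linked below.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical in-memory representation of one system prediction; the real
# submission format is the XML described in the README and schema below.
@dataclass
class StatementPrediction:
    table_id: str                 # table the statement refers to
    statement_id: str             # statement being verified
    label: str                    # Subtask A: "entailed", "refuted", or "unknown"
    evidence_cells: Set[str] = field(default_factory=set)  # Subtask B: ids of relevant cells

# Example: a supported statement with two cells as evidence (ids are made up).
pred = StatementPrediction(
    table_id="table_1",
    statement_id="s3",
    label="entailed",
    evidence_cells={"row2_col1", "row2_col3"},
)
```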
Read Terms and Conditions below.
Download Trial Data
We have released two kinds of Training Data:
Manual annotations (Spreadsheet with natural statement IDs)
The Development Data is in the same format as the future test data. It is similar to the manual annotations in the training data, but has been manually corrected by the organizers and annotated with evidence wherever possible.
The README for understanding the data format is available here, and the schema here. The changelog is available here.
We also update this FAQ with questions as they come in. Join our Google Group for questions and updates.
Submission policy: Please read!!!
Test data is used to simulate how your model would behave on previously unseen input:
Trained model + single test input = single output
If there existed one single “perfect” test example to evaluate the quality of any model, we would just give that example. However, this is never the case, in any ML experiment. One test input is insufficient to draw a full conclusion about any system. Therefore, we (and everyone else) test on a set of inputs and report the average performance, to provide a reasonable representation of expected model behavior.
Each input from the test dataset should be treated identically: the model is a pre-trained black box, and all it can do is ingest the input datapoint and return values for the pre-determined output schema (a label, a set of table cell ids, etc.).
In general, DO NOT change your model/system after seeing the test data.
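As a purely illustrative picture of this setup, the sketch below assumes a pre-trained model object with a predict method (not part of any provided code): each test datapoint is processed on its own, the model is never updated, and nothing is aggregated across the test set.

```python
# Hypothetical per-input inference loop: a frozen, pre-trained model sees one
# test datapoint at a time and returns one output matching the task's schema.
# No statistics are computed over the test set and the model is never updated.
def run_on_test_set(model, test_inputs):
    outputs = []
    for datapoint in test_inputs:                 # each input treated identically
        outputs.append(model.predict(datapoint))  # single input -> single output
    return outputs
```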
Acceptable use of test data:
Taking into account only part of a single input. For example, if your model predicts the entailed/refuted/unknown label and the supporting evidence at the same time, it is up to you whether to also take into account the gold label provided for Task B.
Calculating any other characteristics of the single input and taking them into account. For example, you might compute the length of the statement provided in the input and make use of that information, perhaps as a feature your model learned to exploit from the training and dev data; a minimal sketch of this is shown below.
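For instance, a statement-length feature of this kind is derived from the single input alone, never from the test set as a whole. The sketch below is only an illustration; the datapoint field name and the model's predict signature are hypothetical.

```python
# Hypothetical example of an acceptable per-input feature: the length of the
# statement is computed from this one datapoint only, not from any aggregate
# statistics over the test set.
def predict_with_length_feature(model, datapoint):
    statement_text = datapoint["statement"]         # hypothetical field name
    length_feature = len(statement_text.split())    # number of tokens in this statement
    return model.predict(datapoint, extra_features={"length": length_feature})
```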
Unacceptable use of test data:
Treating the test dataset in any kind of batch fashion that yields information about the aggregate set which can then be used to retrain your model. For example, running any statistical analysis on the test dataset and using the results of that analysis as input to your model (or to re-train your model); thus, comparing the distribution of ANYTHING in the test dataset as a whole with the training dataset is not permitted.
Altering your model in any way after one or more test datapoints have been evaluated. This is not an online task: we are not simulating learning on data after having tested on it. We are simulating the one-time behavior of your model on a single previously unseen input, which is the most common setup in ML experiments.
Phase B (Note that this is a subset of the Phase A data; only the statement placeholders in each cell need evidence)
NEW!! Ground Truth
The official competitions for Task A and Task B are hosted on Codalab (link).
Before the evaluation period, participants are invited to submit their results to the development phase in order to check that their results do not contain any errors. The test evaluation phase will allow up to 10 submissions, but only the last submission will be considered. See the FAQ related to this.
The evaluation code is also available here (link)
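The linked script is the authoritative scorer. Purely as an illustration, and assuming Subtask A is scored with a standard F1 over the entailed/refuted/unknown labels (consult the evaluation code for the exact metric and for Subtask B), a quick sanity check on development predictions might look like this:

```python
from sklearn.metrics import f1_score

# Illustrative only; the official metric is defined by the linked evaluation
# code. Labels follow the entailed/refuted/unknown scheme used in Task A.
gold = ["entailed", "refuted", "unknown", "entailed"]   # gold development labels
pred = ["entailed", "unknown", "unknown", "entailed"]   # system outputs

print("micro-F1:", f1_score(gold, pred, average="micro"))
print("macro-F1:", f1_score(gold, pred, average="macro"))
```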
After submission, please submit your system details to this questionnaire https://forms.gle/AuKozN62CRQjoRG46 by Feb 1, 2021 (Anywhere on Earth). You will only receive your score and be a part of the leaderboard if you submit the questionnaire by this date.
Trial data ready: July 31, 2020
Training data ready: October 20, 2020
Development data ready: December 1, 2020
Task A Evaluation Period: Jan 20 - Jan 22, 2021 (Noon UTC)
Task B Evaluation Period: Jan 27 - Jan 29, 2021 (Noon UTC)
System Questionnaire: Feb 1, 2021
Paper submission due: February 23, 2021
Notification to authors: March 29, 2021
Camera ready due: April 5, 2021
SemEval workshop: Summer 2021
By submitting results to this competition, you consent to the public release of your scores at the SemEval workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
By downloading the data or by accessing it in any manner, you agree to abide by the CC BY-4.0 license, as described here.
Organizers
Nancy Wang - IBM Research, USA
Sara Rosenthal - IBM Research, USA
Marina Danilevsky - IBM Research, USA
Diwakar Mahajan - IBM Research, USA