If you want to submit your system to our Codalab leaderboard, you'll want to:
1. Get our baseline system (either to extend it or to use as an example).
2. Follow the submission instructions on our example worksheet.
We have a large, complicated system under active development. This is the system that we have used in our exhibition matches.
We have a simplified system that we encourage users to inspect or extend for submitting systems to the QANTA competition leaderboard.
The QANTA tossup dataset is updated annually, with this year's version referred to as "QANTA 2018". The dataset is described in our arXiv preprint, and these data are useful for training systems for the QANTA shared task. You can download the dataset at the links below, or use a Python script to fetch the files (a minimal download sketch follows the file list).
QANTA 2021
QANTA 2018
Train: qanta.train.2018.04.18.json
Sqlite Version: qanta.2018.04.18.sqlite3
Wikipedia Title Set: wikipedia-titles.2018.04.18.json
Gameplay Data: protobowl-042818.log
Dataset README/Documentation
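As one way to script the download, the sketch below streams the QANTA 2018 files listed above to disk with `requests`. The `BASE_URL` is a placeholder, not the real host; point it at wherever the links above resolve.

```python
import requests
from pathlib import Path

# Placeholder host -- substitute the actual location of the links above.
BASE_URL = "https://example.com/qanta-datasets"

# File names as listed for QANTA 2018 (extend with the sqlite/gameplay files as needed).
FILES = [
    "qanta.train.2018.04.18.json",
    "wikipedia-titles.2018.04.18.json",
]

def download(name: str, out_dir: Path = Path("data")) -> Path:
    """Stream one dataset file to disk, skipping files that already exist."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / name
    if target.exists():
        return target
    with requests.get(f"{BASE_URL}/{name}", stream=True) as resp:
        resp.raise_for_status()
        with open(target, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return target

if __name__ == "__main__":
    for name in FILES:
        print("downloaded", download(name))
```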
The QANTA dataset is based on the Wikipedia dumps from 4/18/2018. Since they are no longer available at the regular dumps location, we also provide a copy below. For convenience, we also provide a JSON file that contains only the Wikipedia pages for answers in the dataset.
We also provide preprocessed datasets to help build Machine Reading Comprehension (MRC) models. We split each question into individual sentences, and for each sentence we provide the top-5 sentences retrieved from all of Wikipedia using TF-IDF scoring.
For each of the train, dev, and test splits, there are two 'evidence' files (rough descriptions below; a loading sketch follows the list):
1. *evidence.json: For every sentence of every question, the top-5 evidence sentences, each identified by its Wikipedia page name along with the paragraph and sentence index within that page. Each entry also records the correct answer span within these evidence sentences (an empty list if the answer is not present).
2. *evidence.text.json: Same as above, except the actual sentence text is included for every instance.
Train: qanta.train.evidence.json and qanta.train.evidence.text.json
Dev: qanta.dev.evidence.json and qanta.dev.evidence.text.json
Test: qanta.test.evidence.json and qanta.test.evidence.text.json
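As a rough sketch of how one might read these files, the snippet below loads `qanta.dev.evidence.text.json` and prints the evidence for the first question sentence. The key names used here (`evidence`, `page`, `paragraph_idx`, `sentence_idx`, `answer_spans`, `text`) are assumptions for illustration only; inspect the actual files for the real schema.

```python
import json

# NOTE: all field names below are hypothetical -- check the files for the real schema.
with open("qanta.dev.evidence.text.json") as f:
    evidence_data = json.load(f)

first = evidence_data[0]  # evidence record for one question sentence (assumed layout)
for hit in first["evidence"][:5]:
    print(
        hit["page"],               # Wikipedia page name
        hit["paragraph_idx"],      # paragraph index within that page
        hit["sentence_idx"],       # sentence index within that paragraph
        hit["answer_spans"],       # answer span(s); empty list if absent
        hit.get("text", "")[:80],  # sentence text (present only in *.text.json)
    )
```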
The training, development, and testing splits of QBLink, the QANTA sequential question-answering dataset, are available at the links below.
The data are described in an EMNLP 2018 paper.
A test set (~1,000 questions) that challenges both humans and computers, created using the adversarial writing process described in our TACL paper: "Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples". The data have exactly the same format as the QANTA 2018 data posted above.
Data for TACL Publication: qanta.tacl-trick.json
README describing how the additional fields in this dataset map to the explanations used in the interface (IR vs. neural)
Prior Version: qanta.advtest.2018.04.18.json
Edit histories of questions as authors created adversarial questions (data, README)
There are human-readable versions of the preliminary and final questions used in the Dec 15 event.
Past data released by Mohit Iyyer.
We also provide much larger paragraph-based context files to help build Machine Reading Comprehension (MRC) models. We split each question into individual sentences. For each sentence, we first retrieve the top-10 Wikipedia articles over all of Wikipedia using TF-IDF scoring, and then retrieve the top-10 paragraphs within those articles, again by TF-IDF, as candidates. We use TAGME to extract all entities linked to a Wikipedia page in each retrieved paragraph.
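The sketch below shows the flavor of this two-stage TF-IDF retrieval using scikit-learn; it is not the script we used, and the toy corpus stands in for the full Wikipedia article and paragraph collections.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_tfidf(query: str, docs: list[str], k: int = 10) -> list[int]:
    """Return indices of the k documents most similar to the query under TF-IDF."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(docs)
    query_vec = vectorizer.transform([query])
    scores = linear_kernel(query_vec, doc_matrix).ravel()
    return scores.argsort()[::-1][:k].tolist()

# Toy stand-in for Wikipedia; the real pipeline runs this twice:
# once over whole articles, then over the paragraphs of the top articles.
articles = [
    "Albert Einstein developed the theory of relativity.",
    "The Eiffel Tower is a wrought-iron landmark in Paris.",
    "Quantum mechanics describes nature at the smallest scales.",
]
print(top_k_tfidf("who proposed relativity", articles, k=2))
```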
In the provided JSON file, each question has an "annotated_paras" property containing the list of extracted paragraphs. Each paragraph carries its text ("paragraph") and its TAGME-annotated entities ("entities"), where each entity is a (Wikipedia title, start character index, end character index, score, Wikipedia ID) tuple.
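To make that layout concrete, here is a minimal sketch that walks the "annotated_paras" of one question. The file name and the assumption that the top level is a list of question records are hypothetical; the "annotated_paras", "paragraph", and "entities" keys and the entity tuple come from the description above.

```python
import json

# Hypothetical file name and top-level layout (a list of question records).
with open("qanta.train.paragraphs.json") as f:
    questions = json.load(f)

question = questions[0]
for para in question["annotated_paras"]:
    text = para["paragraph"]          # paragraph text
    for entity in para["entities"]:   # TAGME annotations
        title, start, end, score, wiki_id = entity
        print(f"{title} (id={wiki_id}, score={score:.2f}): {text[start:end]}")
```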