Quiz Bowl Dataset
The QANTA tossup dataset is updated annually; the current version is referred to as "QANTA 2018". These data are useful for training systems for the QANTA shared task. You can download the dataset from the links below, or use a Python script to download it.
- QANTA 2018
The QANTA dataset is based on the Wikipedia dumps from 4/18/2018. Since they are no longer available at the regular dumps location, we also provide a copy below. For convenience, we also provide a JSON file that contains only the Wikipedia pages for answers in the dataset.
- Filtered Wikipedia JSON
We also provide preprocessed datasets to help build Machine Reading Comprehension (MRC) based models. We split the questions into individual sentences. For each sentence, we first retrieve the top-10 Wikipedia articles from all of Wikipedia using TF-IDF scoring. Then, within these articles, we retrieve the top-10 paragraphs as candidates, again using TF-IDF scoring. We use TAGME to extract all entities linked to Wikipedia pages for each retrieved paragraph.
- Train: qanta.train.paragraphs.2018.04.18.jsonl.zip
- Dev: qanta.dev.paragraphs.2018.04.18.jsonl.zip
- Test: qanta.test.paragraphs.2018.04.18.jsonl.zip
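The two-stage retrieval described above (top articles first, then top paragraphs within those articles) can be sketched roughly as follows. This is only an illustration, not the code used to build the released files: the exact TF-IDF variant is not specified here, so the sketch assumes smoothed IDF with cosine similarity, and the names `tokenize`, `tfidf_rank`, and `retrieve_paragraphs` as well as the toy articles are hypothetical. The TAGME entity-linking step is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, docs, k=10):
    """Return indices of up to k docs, ranked by TF-IDF cosine similarity.

    Assumption: smoothed IDF and cosine similarity; documents with a
    score of zero are dropped.
    """
    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)

    # Document frequency: number of docs each term appears in.
    df = Counter()
    for toks in doc_tokens:
        df.update(set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(tokens):
        # Terms unseen in the collection carry no weight.
        tf = Counter(t for t in tokens if t in idf)
        return {t: f * idf[t] for t, f in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(tokenize(query))
    scores = [(cosine(q, vec(toks)), i) for i, toks in enumerate(doc_tokens)]
    # Highest score first; ties broken by original position.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scores[:k] if score > 0]


def retrieve_paragraphs(sentence, articles, k_articles=10, k_paragraphs=10):
    """Two-stage retrieval: rank articles, then paragraphs inside them.

    `articles` maps a page title to its list of paragraphs.
    Returns a list of (title, paragraph) candidates.
    """
    titles = list(articles)
    full_texts = [" ".join(articles[t]) for t in titles]
    top = [titles[i] for i in tfidf_rank(sentence, full_texts, k_articles)]

    candidates = [(t, p) for t in top for p in articles[t]]
    order = tfidf_rank(sentence, [p for _, p in candidates], k_paragraphs)
    return [candidates[i] for i in order]


# Toy stand-in for Wikipedia (hypothetical data).
articles = {
    "Albert Einstein": [
        "Albert Einstein developed the theory of relativity.",
        "He received the Nobel Prize in Physics in 1921.",
    ],
    "Isaac Newton": [
        "Isaac Newton formulated the laws of motion and universal gravitation.",
        "He also built the first practical reflecting telescope.",
    ],
    "Marie Curie": [
        "Marie Curie conducted pioneering research on radioactivity.",
        "She won Nobel Prizes in both physics and chemistry.",
    ],
}

sentence = "This physicist developed the theory of relativity."
results = retrieve_paragraphs(sentence, articles, k_articles=2, k_paragraphs=2)
for title, paragraph in results:
    print(title, "|", paragraph)
```

At full Wikipedia scale, a sparse-matrix TF-IDF implementation with an inverted index would be needed; the dictionary-based vectors here are only meant to make the two stages easy to follow.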
Qanta QBLink Dataset
The training, development, and testing splits of the Qanta sequential question-answering dataset, QBLink, are available at
The data are described in an EMNLP 2018 paper.
Qanta Adversarial Dataset
A small test set (~1000 questions) of tossup questions crafted to challenge both humans and computers, created using the process described in this preprint. The data have exactly the same format as the QANTA 2018 data posted above.
Past data released by Mohit Iyyer.