Dec. 15, 2018

Human-Computer Question Answering Competition at UMD

Event completed: read our paper and download the data!

On Dec. 15, the University of Maryland hosted a series of human-computer question answering competitions. The event recognized the best human quiz bowl teams, the best computer quiz bowl systems that can defeat other computer systems and the top human teams, and the best quiz bowl question writers, who crafted high-quality questions that entertain and challenge humans while stumping existing computer systems.

Recap

Full statistics are available, as are human-readable versions of the prelim and final questions used in the Dec. 15 event.

Human Quiz Bowl Teams

We invited human quiz bowl teams to join us on Dec. 15 to compete against other human teams and computer systems. We offered prizes at the high school, collegiate, and open levels:

    • Overall Champion / Open Division: Rage Against the Machine

    • College Team: University of Maryland

    • High School: Monticello

Computer Quiz Bowl Systems

Computer teams submitted question answering systems for the quiz bowl task. Systems competed against each other, and the top systems got the unique opportunity to compete against top human teams. The computer winner was FYY, a BERT-based system (which lost to a human team in the single-elimination playoffs).

Quiz Bowl Question Writers

Quiz bowl question writers crafted high-quality quiz bowl questions that challenged human teams and computer systems. We awarded prizes for the best questions and best packets.

In June 2018 at PACE-NSC, we had our first exhibition match on adversarial questions. In the paper, this corresponds to our Round 1-IR data.

Building on this, we moved to our December 2018 event. This video introduces the video series.

Here we describe how authors create the adversarial questions.

Examples of the questions and tactics trivia experts used to stump computers.

With the questions in place, the competition began in earnest. Meet the teams and see how they fared in the tournament.

See how the semifinal teams compared to the top computer team.

See how the two top human teams compared to the top computer team.

Finally, we end with a discussion of the symbiosis between computer scientists and the trivia community.

Event schedule and setup

The event took place at the University of Maryland, College Park on December 15 and had three components:

    • A morning round-robin tournament to determine the best human teams

    • Concurrently in the morning, a workshop for the creators of computer systems to discuss their systems

    • In the afternoon, an eight-way final to decide the overall champion

Organizers

Organizing the tournament is Jordan Boyd-Graber with logistical support from MAQT; Eric Wallace is supporting the question writing competition; Chen Zhao and Ahmed Elgohary are supporting the computer competition; and Shi Feng and Pedro Rodriguez are helping in the background. Daniel Jensen is leading question compilation, and Kurtis Droge is providing freelance questions.

Contact us and stay informed

Get in touch

FAQ

Q: Why pyramidal questions? You're the only idiot in the machine learning / natural language processing community using these questions.

A: The primary goal of a dataset/task in machine learning is to distinguish good systems from bad systems. Single-clue datasets (such as TriviaQA or SQuAD) require many more questions to discriminate between top question answerers. While it's true that quiz bowl questions are longer, it is easier to write a good pyramidal question on a single topic than five good single-clue questions on five different topics.

Moreover, because quiz bowl is played word by word, it offers far more opportunities to discriminate between question answerers. Because the questions are pyramidal, the answerer with the deeper knowledge can answer first.

Q: Why have computers and humans play against each other? Is this a gimmick?

A: Quiz bowl is designed to be interactive. Humans play against each other in real time (for fun). While machine learning tasks need not be interactive with humans, interruptible questions allow easy comparison against human performance and enable opportunities to teach humans about machine learning, natural language processing, and question answering.

But hopefully it will be fun!

Q: What's the format of the games? Why is it structured like that?

A: We'll have 40 questions, tossups only. If there's a tie after 40 questions, we'll break the tie with three tie-breaker questions. The team with the higher score after those three questions will win. If the score is still tied after the tie-breaker questions, we'll read questions until the score changes. The first change in the score will decide the game.
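The procedure above can be sketched in a few lines of Python. This is only an illustration, not the software used at the event: the function names and the per-tossup point pairs are hypothetical, and the sketch assumes that a missing tossup counts as a dead question worth no points to either team.

```python
from typing import Iterable, Iterator, Optional, Tuple

Delta = Tuple[int, int]  # points gained by (team A, team B) on one tossup
DEAD: Delta = (0, 0)     # assumption: a tossup nobody answers scores nothing


def leader(a: int, b: int) -> Optional[str]:
    """Return "A" or "B" for the team in front, or None if the score is tied."""
    if a == b:
        return None
    return "A" if a > b else "B"


def play_game(tossups: Iterable[Delta],
              regulation: int = 40,
              tiebreakers: int = 3) -> Optional[str]:
    """Return the winner, or None if the questions run out while still tied."""
    stream: Iterator[Delta] = iter(tossups)
    a = b = 0

    # Regulation: 40 tossups, no bonus questions.
    for _ in range(regulation):
        da, db = next(stream, DEAD)
        a, b = a + da, b + db
    if a != b:
        return leader(a, b)

    # Tie-breaker: three more tossups; the higher score afterwards wins.
    for _ in range(tiebreakers):
        da, db = next(stream, DEAD)
        a, b = a + da, b + db
    if a != b:
        return leader(a, b)

    # Still tied: keep reading until the score changes and breaks the tie.
    for da, db in stream:
        a, b = a + da, b + db
        if a != b:
            return leader(a, b)
    return None
```

For example, play_game([(10, 0)] + [(0, 0)] * 39) returns "A", because team A leads after regulation and no tie-breaker is needed.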

We are not using bonus questions because tossup questions are more interesting (based on their pyramidal structure) to decide whether humans or computers are smarter. What would make bonus questions interesting would be to emphasize collaboration in human-computer hybrid teams. We're working on figuring out how to make that both interesting and fun, but we're not quite there yet.

Q: How will the questions be judged?

A: The questions will be judged by the following criteria:

    1. Questions should be accurate and only contain true information.

    2. Questions should be interesting; even if you know nothing about the subject, the question should be engaging and leave the listener wanting to find out more about the topic.

    3. Questions should contain a variety of clues / information that reflect study and knowledge of a subject.

    4. Questions should be appropriately pyramidal for humans, effectively separating skill levels.

    5. Questions should be appropriately pyramidal for computers, effectively separating knowledge / comprehension.

Q: Computers involved in a trivia tournament ... will the questions suck?

A: The goal of having computers in the loop is to improve the questions from both a scientific perspective and a quiz bowl perspective. Here's how computers can help you write better questions:

  • Avoid stock clues in the lead-in

  • Automatically find similar questions (to find other interesting clues or avoid repetition)

  • Avoid hoses (if the computer thinks the answer might be X, so might a human ... perhaps you can rephrase)

  • Automate tasks like pronunciation guides and alternate answer lines (don't know if this will make it in for this iteration, but it's on our todo list!)

Again, we hope that the resulting questions are high quality from a human perspective. We hope that we'll attract good question writers, and that a reasonable reader will recognize them as good questions. We're not going for gimmicks and quirks in terms of the questions.

Q: Can we blend subcategories/categories?

A: Feel free to blend subcategories and categories. Our distribution requirements aren't super strict. But don't use blending to avoid writing about topics (e.g., "real" science) that should be covered in the set somewhere.

Q: Should the computer be able to answer all of the questions by the end?

A: No. While your answers have to be in our answer set, if the system on write.qanta.org cannot answer the question, that's fine too! It probably means that you're doing something unique and interesting. If humans will like and convert the question, then you're doing everything exactly right.

Q: Is this harder than one-line, single sentence QA?

A: From Dwight Wynne:

The pyramidal question is not inherently an easy question. Like any other question, its difficulty is determined by both the answer and clues selected. However, a one-line question is never easier than a pyramidal question containing the same one line. A question will be answered by the union of the sets of [answerers] who recognize and buzz correctly from each clue contained in the question. Since the pyramidal question contains all clues already present in the one-line question, plus additional clues, it must follow that at least as many [answerers] can answer the pyramidal question as the one-line question.
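The set argument in the quote can be made concrete with a toy example. The clue names and player names below are invented for illustration; the only point is that the union over all clues always contains the set for any single clue.

```python
from typing import Dict, Iterable, Set

# Hypothetical data: which players can buzz correctly on each clue.
clue_solvers: Dict[str, Set[str]] = {
    "lead-in": {"specialist"},
    "middle": {"specialist", "regular"},
    "giveaway": {"regular", "novice"},
}


def answerers(clues: Iterable[str]) -> Set[str]:
    """Players who can answer a question made of the given clues."""
    return set().union(*(clue_solvers[c] for c in clues))


one_line = answerers(["giveaway"])
pyramidal = answerers(["lead-in", "middle", "giveaway"])

# Every player who gets the one-line question also gets the pyramidal one.
assert one_line <= pyramidal
```

In this toy data the specialist misses the one-line question but answers the pyramidal one, so the pyramidal version is answered by strictly more players.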

Q: What if we're interested only in complete sentences? (Or our systems can only answer complete sentences ...)

A: It's possible to look only at individual sentences; i.e., to provide answers only after you have a complete sentence. If you're interested only in single sentences, you can answer just after the first sentence (concrete questions must uniquely identify the answer immediately). If you can always answer the question after the first sentence, you'll likely do quite well at the overall task.

Thus, quiz bowl is a superset of single-sentence QA. While some questions do require reasoning across sentences, the vast majority of the time it's possible to answer based only on individual, complete sentences in isolation (each sentence in a question getting easier).
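As a rough sketch of what sentence-only answering could look like, the Python below reveals a question word by word but calls a guesser only when a new sentence has just been completed. Everything here is an assumption made for illustration: the function names are hypothetical, the guesser is whatever single-sentence system you already have, and the sentence splitter is deliberately naive (it would break on abbreviations such as "William W. Belknap").

```python
import re
from typing import Callable, List, Tuple


def complete_sentences(partial_text: str) -> List[str]:
    """Return the sentences in partial_text that have already ended."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", partial_text.strip())
    # Drop the trailing fragment if it is not a finished sentence yet.
    if pieces and not pieces[-1].endswith((".", "!", "?")):
        pieces = pieces[:-1]
    return [p for p in pieces if p]


def answer_on_sentence_boundaries(
    words: List[str], guess: Callable[[str], str]
) -> List[Tuple[int, str]]:
    """Reveal words one at a time; guess only when a sentence completes.

    Returns (number of words revealed, guess) pairs, one per sentence.
    """
    seen = 0
    guesses: List[Tuple[int, str]] = []
    for i in range(1, len(words) + 1):
        sentences = complete_sentences(" ".join(words[:i]))
        if len(sentences) > seen:
            seen = len(sentences)
            # The guesser sees only the newest sentence, in isolation.
            guesses.append((i, guess(sentences[-1])))
    return guesses
```

A system wrapped this way never buzzes mid-sentence, so it gives up some of the word-by-word resolution of the task in exchange for reusing a single-sentence model unchanged.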

Q: I can't find my favorite answer in the system. Why is this?

A: This is a design decision that we made. These are the answers for which there have been at least three questions in mainstream quiz bowl tournaments. This is a tradeoff that we made to keep things relatively fair. We want questions that are challenging for computers not because they lack data but because they cannot understand English. By excluding rare answers and only focusing on frequently asked answers, if a computer gets it wrong, it's not because it lacks information to work off of ... it's because it didn't understand the question. We realize it's a little frustrating, so it's useful to check whether the answer is in bounds before writing the question. In many cases, you can tweak the question to ask about something more general (instead of asking about "William W. Belknap", ask about Grant, focusing on members of his cabinet).
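To make the idea of an "in bounds" answer concrete, here is a minimal Python sketch of a frequency filter. It is not the code behind write.qanta.org: the function name and the sample data are hypothetical, and it reads the threshold above as "at least three past questions", which is our interpretation.

```python
from collections import Counter
from typing import Iterable, Set


def in_bounds_answers(past_answers: Iterable[str],
                      min_questions: int = 3) -> Set[str]:
    """Keep answers that appear as the answer to enough past questions."""
    counts = Counter(past_answers)
    # Assumption: "enough" means min_questions or more.
    return {answer for answer, n in counts.items() if n >= min_questions}


# Hypothetical answer lines, one entry per past question.
history = [
    "Ulysses S. Grant", "Ulysses S. Grant", "Ulysses S. Grant",
    "William W. Belknap",
]
allowed = in_bounds_answers(history)
```

With this toy history, "Ulysses S. Grant" is in bounds and "William W. Belknap" is not, which matches the advice above to ask about Grant and use Belknap as a clue.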

Q: What if I already have some questions written that aren’t in the interface?

A: Due to the limitations of this competition, it will be necessary to restrict answer lines to the ones already in the answer set at write.qanta.org. It isn’t recommended, but if you have already written questions with answers outside of the set, you can email them to qanta@googlegroups.com to submit them. Otherwise, I would recommend that you check if an answer is available at write.qanta.org before writing a question.

Q: What is the best way to write questions?

A: We suggest the following procedure for writing questions. First, as you’re deciding on answer lines, check write.qanta.org to make sure they’re in bounds. Then, draft the question in whatever editor you’re most comfortable with. When you have a first draft, copy and paste it into the interface, edit to make sure the lead-in (at least) isn’t trivially answerable by our baseline system, and then submit. We also suggest keeping your own backup just in case something goes wrong (we hope it doesn’t, but better safe than sorry). We realize this is slightly more hassle than normal question writing, but this will hopefully lead to better questions and also advance the state of the art in natural language processing.

Q: How do I submit questions? Why this craziness?

A: You submit your questions through a web form. We describe the system in this paper, and a tutorial video is below.

Q: How do I make an account on the interface?

A: Just log in with a new email and password. This will create a new account. There can only be one account per email.

Q: What if I forget my password?

A: Send us an email at qanta@googlegroups.com.

Q: What are the strange highlighted colors in the interface?

A: Highlighted words are "important" for our quiz bowl AI system when it makes its predictions. If you modify those words (e.g., rephrase that sentence), there is a high chance the system will get more confused.

Instructional video on how to write adversarial questions.

We gratefully acknowledge AWS Credits for Research for supporting the competition: running the server for authoring questions and comparing computer results.