NeurIPS 2020 Efficient QA

For full details on the computer part of the competition, visit the competition's webpage!

The questions in NQ are posed by humans to computers, and the competition attracted some of the strongest and most efficient QA systems available today. However, humans also answer questions for fun and recreation, and the ultimate goal of artificial intelligence is to create machines that answer questions as well as humans (known as the Turing test). Moreover, existing comparisons of human question answering ability often use unskilled humans, leading to claims of computers "putting millions of jobs at risk". Or, in competitions with trivia experts, arcane rules can tilt the playing field toward computers. We wanted a fair competition that could determine whether humans or computers were better at question answering.

We advertised our competition to trivia enthusiasts on social media. Teams of up to eight players applied to be part of the competition. We selected five teams to participate in the preliminary competition.

To create a fair competition and to showcase all of the tiers of the efficient QA competition, we offered three ways to answer each question, each giving humans and computers a different level of resources.

To complement the 500MB systems, humans had to instantly signal when they knew the answer to a question. This reflects instant recall of a fact by a single individual. In the next phase, in competition with the 6GB systems, both humans and computers had more resources: the human team could discuss the answer for thirty seconds, arguing why they believed their answer was correct, and the computers had over ten times the memory. Finally, to focus on reading comprehension, the unlimited systems faced off against human teams who also had access to snippets from search results using the question as a query. As in the previous phase, the teams had thirty seconds to discuss their answer.

While we were conducting our preliminary competition, researchers were building question answering systems. Meet the computer competitors and the winners of the evaluation!

The final competition is split into three parts. We selected questions based on the following criteria:

  • Diverse over topic, ensuring there were questions about history, literature, philosophy, sports, and popular culture. This results in fewer questions about sports and popular culture than the standard NQ distribution.

  • Not tied to 2018. Because the NQ dataset came from 2017–2019 and it is difficult for humans to forget the last three years, we excluded questions that depend on the current date.

  • Interesting questions. While not strictly adversarial, we wanted to showcase both human and computer ability, so we excluded questions that many humans would not know (e.g., "how many us states are there") or questions with answers that are difficult to evaluate in the NQ framework ("how many words are in Les Misérables?").

  • We avoided questions that were overly ambiguous, or where the gold answers had the issues mentioned above (the answer changes based on when the question was asked, unclear answer type, mismatch with the question's intention, etc.).

In the first video, we meet the players and the humans take an early lead.

The computers make a comeback, however, in the second #NeurIPS human-computer match. This features my favorite computer answer. Q: "What's the insertion time for a BST?" A: 24 Hours or less

And the conclusion! Can the computers retain their momentum and overtake the humans?