2024 Competition: For Computer Teams

Important Dates

Prizes

Registration

Register here before you submit your model to the leaderboard: https://forms.gle/oxRmPHtndEYMUsVD9

Development Tutorial / Example Code

Visit the tutorial here: https://colab.research.google.com/drive/1bCt2870SdY6tI4uE3JPG8_3nLmNJXX6 (baseline source code available!)

We provide two example systems that you can build on to make your development process more straightforward. Feel free to use them as a foundation for your own system, or simply follow the JSON format below without using our code.

Generative QA

This type of QA system aims to generate an answer to a given question directly.

INPUT: (1) question string

E.g. qa_pipe(question)

OUTPUT:

Return in JSON format: (1) guess, a string; (2) confidence, a float between 0 and 1 representing the probability of your guess

E.g. {'guess': 'Apple', 'confidence': 0.02}

Reminder: see the tutorial above for how to calculate the probability of the generated tokens.
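Below is a minimal sketch of such a generative qa_pipe. The model name (google/flan-t5-small) is only a placeholder, and the confidence here is the product of the generated tokens' probabilities; the tutorial may compute it differently.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder model choice; swap in whatever model you are developing.
MODEL_NAME = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def qa_pipe(question: str) -> dict:
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=16,
            output_scores=True,
            return_dict_in_generate=True,
        )
    guess = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    # Log-probabilities of each generated token; their exponentiated sum is
    # the product of token probabilities, one simple estimate of P(guess).
    transition_scores = model.compute_transition_scores(
        outputs.sequences, outputs.scores, normalize_logits=True
    )
    confidence = float(torch.exp(transition_scores[0].sum()))
    return {"guess": guess, "confidence": confidence}

print(qa_pipe("Which company makes the iPhone?"))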

Extractive QA

This type of QA system aims to extract an answer span from a context passage for a given question.

INPUT: (1) question string, (2) context string

E.g. qa_pipe(question=question, context=context)

OUTPUT (same as Generative QA):

Return in JSON format: (1) guess, a string; (2) confidence, a float between 0 and 1 representing the probability of your guess

E.g. {'guess': 'Apple', 'confidence': 0.02}

Reminder: if you are already working with an extractive QA model, Hugging Face QA pipelines output a score directly, so you can use that score as your confidence.
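Below is a minimal sketch of an extractive qa_pipe built on the Hugging Face question-answering pipeline. The model name is again only a placeholder; the pipeline's score field is passed through as the confidence.

from transformers import pipeline

# Placeholder model choice; any extractive QA checkpoint works here.
hf_pipe = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

def qa_pipe(question: str, context: str) -> dict:
    result = hf_pipe(question=question, context=context)
    # The pipeline already returns a probability-like `score`,
    # which we pass through as the confidence.
    return {"guess": result["answer"], "confidence": float(result["score"])}

print(qa_pipe(
    question="Which company makes the iPhone?",
    context="The iPhone is a line of smartphones designed by Apple.",
))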

How to Submit

Submit your QA system on the Hugging Face leaderboard: AdvCalibration QA Leaderboard


Calibration Evaluation Metric

In our Adversarial Calibration QA task, we evaluate a QA model's reliability by measuring how well calibrated its confidence estimates are, i.e., how closely the confidence attached to each guess matches the model's actual accuracy. To make this concrete, we adopt the notion of a "buzz" from trivia quizzes: a buzz happens whenever a player is confident enough to give the correct guess in the middle of a question. The same idea applies to measuring model calibration, since we focus on whether the model's prediction probability matches its prediction accuracy. Our evaluation metric, Average Expected Buzz, quantifies this expected buzz over all questions.
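For intuition only, here is a hypothetical sketch of a buzz-style calibration score. It is NOT the official Average Expected Buzz implementation; it simply assumes that confidence on correct guesses is rewarded, confidence on wrong guesses is penalized, and the result is averaged over questions. Refer to the organizers' evaluation code for the actual metric.

def expected_buzz(predictions, answers):
    # Reward confidence on correct guesses, penalize it on wrong ones,
    # then average over all questions. (Illustrative scoring rule only.)
    scores = []
    for pred, answer in zip(predictions, answers):
        correct = pred["guess"].strip().lower() == answer.strip().lower()
        scores.append(pred["confidence"] if correct else -pred["confidence"])
    return sum(scores) / len(scores)

preds = [
    {"guess": "Apple", "confidence": 0.9},  # correct, confident
    {"guess": "Pear", "confidence": 0.4},   # wrong, somewhat confident
]
print(expected_buzz(preds, ["Apple", "Apple"]))  # (0.9 - 0.4) / 2 = 0.25

Under a rule like this, an overconfident wrong guess hurts more than a cautious one, which is why calibrated confidence values matter as much as raw accuracy.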


FAQ

Contact

If you have any questions, please contact us at qanta@googlegroups.com.