2024 Competition:
For Computer Teams
Important Dates
October 11, 2024: Deadline for submitting systems for the in-person tournament
November 15, 2024: Deadline for submitting systems for the online tournament
December 1, 2024: Deadline for submitting systems for the computer-only tournament (last call for submitting systems!)
Prizes
First place: $200
Second place: $150
Third place: $100
Fourth place: $50
Registration
Register here before you submit your model to the leaderboard: https://forms.gle/oxRmPHtndEYMUsVD9
Development Tutorial / Example Code
Visit the tutorial here: https://colab.research.google.com/drive/1bCt2870SdY6tI4uE3JPG8_3nLmNJXX6 (baseline source code available!)
We provide two example systems that you can build on to make your development process more straightforward. Feel free to use them as a foundation for your own systems, or simply follow the JSON interface below without using our code.
Generative QA
This type of QA system aims to generate an answer to a given question directly.
INPUT: (1) question string
E.g. qa_pipe(question)
OUTPUT:
Return a JSON object with: (1) a guess string, (2) a confidence score: a float between 0 and 1 representing the probability of your guess
E.g. {'guess': 'Apple', 'confidence': 0.02}
Reminder: see the tutorial for how to compute the probability of the generated tokens; a minimal sketch is also shown below.
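Here is a minimal sketch of a generative QA wrapper with this interface, assuming the Hugging Face transformers library and a small seq2seq model (google/flan-t5-small is a placeholder choice, not the official baseline):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def qa_pipe(question):
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=16,
        output_scores=True,
        return_dict_in_generate=True,
    )
    guess = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    # Confidence: joint probability of the generated answer, i.e. the
    # exponential of the sum of per-token log-probabilities.
    transition_scores = model.compute_transition_scores(
        outputs.sequences, outputs.scores, normalize_logits=True
    )
    confidence = float(torch.exp(transition_scores[0].sum()))
    return {"guess": guess, "confidence": confidence}

print(qa_pipe("Which company makes the iPhone?"))  # e.g. {'guess': 'Apple', 'confidence': ...}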
Extractive QA
This type of QA system aims to extract an answer span from a context passage for a given question.
INPUT: (1) question string, (2) context string
E.g. qa_pipe(question=question, context=context)
OUTPUT (same as Generative QA):
Return a JSON object with: (1) a guess string, (2) a confidence score: a float between 0 and 1 representing the probability of your guess
E.g. {'guess': 'Apple', 'confidence': 0.02}
Reminder: if you are already experimenting with an extractive QA model, note that Hugging Face QA pipelines return a score, which you can use directly as the confidence (see the sketch below).
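Here is a minimal sketch of an extractive QA wrapper with this interface, using the Hugging Face question-answering pipeline (the model name is a placeholder, not the official baseline):

from transformers import pipeline

extractive_pipe = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

def qa_pipe(question, context):
    # The pipeline returns {'score', 'start', 'end', 'answer'}; the score
    # is already a probability, so we reuse it as the confidence.
    result = extractive_pipe(question=question, context=context)
    return {"guess": result["answer"], "confidence": float(result["score"])}

print(qa_pipe(
    question="Which company makes the iPhone?",
    context="The iPhone is a line of smartphones designed by Apple.",
))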
How to Submit
Submit your QA system to our Hugging Face leaderboard: AdvCalibration QA Leaderboard
Calibration Evaluation Metric
In our Adversarial Calibration QA task, we evaluate a QA model's reliability by measuring its calibration: how well the confidence values attached to its guesses match its actual accuracy. We borrow the concept of a "buzz" from trivia quizzes, where a player buzzes whenever they are confident enough to answer correctly in the middle of a question. The same idea carries over to model calibration: we check whether the model's prediction probability matches its prediction accuracy. Our evaluation metric, Average Expected Buzz, quantifies the expected buzz confidence; a toy illustration follows the link below.
Read about evaluation metric here: https://drive.google.com/file/d/1byJ0_HYFBa-4y6SWHMf5JYC_cshE2JeG/view
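To build intuition, here is a toy illustration of why calibrated confidences matter. It is NOT the official Average Expected Buzz formula (see the linked document for the exact definition); it merely shows that confidence on correct guesses helps while confidence on wrong guesses hurts:

# Toy calibration illustration, not the official metric: rewards
# confidence on correct guesses, penalizes it on wrong ones.
def toy_buzz_score(predictions):
    # predictions: list of (confidence, is_correct) pairs
    return sum(c if ok else -c for c, ok in predictions) / len(predictions)

good = [(0.9, True), (0.1, False)]  # confident when right, hesitant when wrong
bad = [(0.1, True), (0.9, False)]   # hesitant when right, confident when wrong
print(toy_buzz_score(good))  # 0.4
print(toy_buzz_score(bad))   # -0.4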
FAQ
What if my system type is not specified here or not supported yet?
Please send a message to qanta@googlegroups.com so we can see how to adapt the leaderboard for your purpose.
I don't know where to start building a QA system for submission. Has the tutorial for submitting QA systems been updated from last year?
Please check our submission tutorial: https://colab.research.google.com/drive/1bCt2870SdY6tI4uE3JPG8_3nLmNJXX6_?usp=sharing. From there, you can fine-tune the base models or build anything on top of them.
I want to submit an API-based QA system, such as GPT-4. What should I do?
We do not currently support API-based models, but you can train your model with the GPT cache we provide: https://github.com/Pinafore/nlp-hw/tree/master/models. A hypothetical loading sketch is shown below.
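If you go this route, loading the cache might look like the following; the file name and JSON layout here are assumptions for illustration, so check the linked repository for the actual cache format:

import json

# Placeholder path and layout; assumed to map question text to a cached GPT response.
with open("gpt_cache.json") as f:
    cache = json.load(f)

for question, cached_answer in list(cache.items())[:3]:
    print(question, "->", cached_answer)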
What are the prizes for successful model submissions?
First place: $200
Second place: $150
Third place: $100
Fourth place: $50
Submitted models will be assessed and ranked by the evaluation metric: https://drive.google.com/file/d/1byJ0_HYFBa-4y6SWHMf5JYC_cshE2JeG/view
How will the models "play" each other?
Once all systems are collected, we will run them on our secret test set and score them with our evaluation metric. The system with the highest score wins the tournament!
Could I know more about my system performance other than the score on the leaderboard?
Yes. Once your system is evaluated, a log file will be generated automatically: https://huggingface.co/datasets/umdclip/qanta_leaderboard_logs
Contact
If you have any questions, please contact us at qanta@googlegroups.com.