Three Numeral-Aware Tasks
Task 1: Quantitative Understanding (English)
Task 1 focuses on quantitative understanding and comprises three subtasks: Quantitative Prediction (QP), Quantitative Natural Language Inference (QNLI), and Quantitative Question Answering (QQA). Experiments use the Quantitative 101 dataset [1], a compilation of Numeracy-600K [2], EQUATE [3], and NumGLUE Task 3 [4].
Because all of these datasets are publicly available, there is no separate private test set for this task. We invite participants to share their insights and discoveries collaboratively within NumEval.
[2] Chen, Chung-Chi, et al. "Numeracy-600K: Learning numeracy for detecting exaggerated information in market comments." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019.
[3] Ravichander, Abhilasha, et al. "EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference." Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). 2019.
[4] Mishra, Swaroop, et al. "NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks." Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). 2022.
Task 1 Examples
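As a rough illustration of the QNLI subtask, a model must decide whether a quantitative hypothesis follows from a premise. The instance fields, labels, and baseline below are illustrative assumptions, not the official Quantitative 101 schema or metric:

```python
# Hypothetical QNLI-style instance; field names and label set are
# illustrative, not the official Quantitative 101 schema.
instance = {
    "premise": "The company's revenue grew from $2.1B to $4.5B last year.",
    "hypothesis": "The company's revenue more than doubled last year.",
    "label": "entailment",  # e.g. entailment / contradiction / neutral
}

def accuracy(gold, predicted):
    """Fraction of instances where the predicted label matches the gold label."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# A trivial majority-class baseline for comparison.
gold = ["entailment", "neutral", "contradiction", "entailment"]
majority = ["entailment"] * len(gold)
print(accuracy(gold, majority))  # 0.5
```

Beating such trivial baselines requires the model to actually reason over the numbers, which is what makes these subtasks challenging.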
Task 2: Reading Comprehension of the Numerals in Text (Chinese)
Task 2 centers on a cloze test: given a news article, models must identify the correct numerical value from four options. The task uses the NQuAD dataset [5], which is in Chinese. Because all of these datasets are publicly accessible, there is no separate private test set for this task. We invite participants to share their findings and observations, fostering collaborative learning within NumEval.
Task 2 Examples
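To make the cloze format concrete, here is a hedged sketch; the instance (translated to English, whereas NQuAD itself is in Chinese) and the deliberately weak heuristic are illustrative, not part of NQuAD. The heuristic picks the candidate closest in magnitude to the numerals already in the passage:

```python
import re

def extract_numbers(text):
    """Pull all numeric tokens out of a passage (integers and decimals)."""
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]

def naive_cloze_baseline(passage, options):
    """Pick the option closest to the mean of numerals in the passage.
    A deliberately weak heuristic, only to illustrate the task format."""
    numbers = extract_numbers(passage)
    if not numbers:
        return options[0]
    mean = sum(numbers) / len(numbers)
    return min(options, key=lambda opt: abs(opt - mean))

# Illustrative (translated) instance.
passage = "The index closed at 17250 points, up 120 points, or ____ percent."
options = [0.7, 7.0, 12.0, 120.0]
print(naive_cloze_baseline(passage, options))  # 120.0
```

The heuristic misses the correct answer (0.7, since 120 / 17130 is roughly 0.7 percent), which is exactly why the task demands genuine numeral comprehension rather than surface statistics.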
Task 3: Numeral-Aware Headline Generation (English)
Task 3 consists of two subtasks. The first, numerical reasoning, requires models to compute the correct number to fill a blank in a news headline. The second, headline generation, requires models to generate a headline from the provided news article. Unlike Tasks 1 and 2, Task 3 uses a separate private test set; participants must submit their models' output for evaluation.
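As a hedged sketch of how numerical-reasoning outputs might be compared (the normalization below is an assumption for illustration, not the official Task 3 metric), numeric answers can be parsed before matching so that "3", "3.0", and "1,200" compare by value:

```python
def parse_number(s):
    """Parse an answer string into a float, so '3', '3.0', and '1,200'
    compare by value; returns None if the string is not numeric."""
    try:
        return float(s.strip().replace(",", ""))
    except ValueError:
        return None

def exact_match(gold_answers, predictions):
    """Share of predictions whose parsed value equals the gold value."""
    hits = 0
    for gold, pred in zip(gold_answers, predictions):
        g, p = parse_number(gold), parse_number(pred)
        hits += g is not None and g == p
    return hits / len(gold_answers)

print(exact_match(["3", "1,200", "4.5"], ["3.0", "1200", "5"]))  # ≈ 0.667
```

For the headline-generation subtask, surface-overlap metrics alone can reward fluent headlines with wrong numbers, so value-level checks like the one above complement text-similarity scoring.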