Evaluation

Task 1: Quantitative Understanding (English)

We use the Quantitative-101 Score [1] to rank overall performance. The Quantitative-101 Score is the average of three values: the macro-F1 score on the QP task and the micro-F1 scores on the QNLI and QQA tasks.

The sklearn.metrics.f1_score function [2] will be used for evaluation.
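To make the metric concrete, the following is a minimal sketch of the Quantitative-101 Score in pure Python. The per-class and micro-averaged F1 computations below are intended to match sklearn.metrics.f1_score with average='macro' and average='micro' respectively; the toy label lists are illustrative only and are not drawn from the task data.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for a single class (one-vs-rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1; for single-label classification this equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def quantitative_101(qp_true, qp_pred, qnli_true, qnli_pred, qqa_true, qqa_pred):
    """Average of macro-F1 on QP and micro-F1 on QNLI and QQA."""
    return (macro_f1(qp_true, qp_pred)
            + micro_f1(qnli_true, qnli_pred)
            + micro_f1(qqa_true, qqa_pred)) / 3

# Toy example (labels are made up for illustration):
score = quantitative_101([0, 0, 1, 1], [0, 1, 1, 1],   # QP
                         [0, 1, 1],    [0, 1, 0],      # QNLI
                         [1, 1],       [1, 1])         # QQA
```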

[1] Chen, Chung-Chi, et al. "Improving Numeracy by Input Reframing and Quantitative Pre-Finetuning Task." Findings of the Association for Computational Linguistics: EACL 2023. 2023.

[2] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Task 3: Numeral-Aware Headline Generation (English)

In the first subtask, which focuses on numerical reasoning, models must compute the correct number to fill the blank in a news headline. As in Task 2 [3], accuracy will be used to evaluate the results.

The second subtask centers on headline generation: models must generate a headline from the provided news article. We will use ROUGE, BERTScore, and MoverScore to evaluate the results [4]. Please refer to [5] for the evaluation code.
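As a rough illustration of one of these metrics, the sketch below computes the ROUGE-1 F-measure (unigram-overlap F1) in pure Python. This is a simplified approximation for intuition only; the official scores come from the evaluation code in [5], and the example headline strings are invented.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F-measure: F1 over unigram overlap between candidate and reference.

    Overlap counts each shared token at most min(count_in_candidate,
    count_in_reference) times, following the standard clipped-count definition.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Toy example with invented headlines:
score = rouge1_f("police kill 3 gunmen", "police kill 3 armed men")
```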

[4] Jian-Tao Huang, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen. "NumHG: A Dataset for Number-Focused Headline Generation", arXiv, 2023.

[5] https://github.com/ChunJiChen/NumEval_Evaluation