Evaluation
Task 1: Quantitative Understanding (English)
We use the Quantitative-101 Score [1] to rank overall performance. The Quantitative-101 Score is the average of the macro-F1 score on the QP task and the micro-F1 scores on the QNLI and QQA tasks.
The scikit-learn function sklearn.metrics.f1_score [2] will be used for evaluation.
[2] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
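The averaging described above can be sketched as follows. This is an illustrative implementation, not the official scoring script; the helper name `quantitative_101_score` and the toy labels are assumptions, while the `f1_score` calls match the cited scikit-learn function.

```python
from sklearn.metrics import f1_score

def quantitative_101_score(qp, qnli, qqa):
    """Average of QP macro-F1 and QNLI/QQA micro-F1 (illustrative sketch).

    Each argument is a (y_true, y_pred) pair of label sequences.
    """
    qp_f1 = f1_score(*qp, average="macro")    # macro-F1 for QP
    qnli_f1 = f1_score(*qnli, average="micro")  # micro-F1 for QNLI
    qqa_f1 = f1_score(*qqa, average="micro")    # micro-F1 for QQA
    return (qp_f1 + qnli_f1 + qqa_f1) / 3.0

# Toy predictions with made-up labels, for illustration only
qp = ([0, 1, 2, 1], [0, 1, 2, 2])
qnli = ([0, 1, 1, 0], [0, 1, 0, 0])
qqa = ([1, 0, 1], [1, 0, 1])
print(round(quantitative_101_score(qp, qnli, qqa), 4))
```

Note that with single-label predictions, micro-F1 coincides with accuracy, so the QNLI and QQA terms reward overall correctness, while the macro-F1 term for QP weights each class equally.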
Task 2: Reading Comprehension of the Numerals in Text (Chinese)
Accuracy will be used to evaluate the results [3].
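Accuracy here is simply the fraction of questions answered correctly. A minimal sketch, assuming the gold and predicted answers are aligned sequences of option labels (the labels below are invented for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the gold answers."""
    assert len(y_true) == len(y_pred), "sequences must be aligned"
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy multiple-choice answers (illustrative only)
gold = ["B", "A", "C", "B"]
pred = ["B", "A", "A", "B"]
print(accuracy(gold, pred))  # 0.75
```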
Task 3: Numeral-Aware Headline Generation (English)
In the first subtask, focused on numerical reasoning, models are required to compute the correct number to fill the blank in a news headline. Accuracy will be used to evaluate the results, as in Task 2 [3].
The second subtask centers on headline generation, wherein models must construct a headline based on the provided news. We will use ROUGE, BERTScore, and MoverScore to evaluate the results [4]. Please refer to [5] for the evaluation code.
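To make the generation metrics concrete, the following is a plain-Python sketch of ROUGE-N F1 with whitespace tokenization. It is a simplification for illustration only (no stemming or other preprocessing); the official evaluation uses the code referenced in [5], and the example headlines are invented.

```python
from collections import Counter

def rouge_n_f1(reference, candidate, n=1):
    """Sketch of ROUGE-N F1: n-gram overlap between reference and candidate."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy reference vs. generated headline (illustrative only)
print(rouge_n_f1("oil prices rise 5 percent", "oil prices up 5 percent"))
```

BERTScore and MoverScore, by contrast, compare contextual embeddings rather than surface n-grams, so they can credit paraphrases that ROUGE would miss.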