Statement

Native Chinese Reader (NCR) is a new machine reading comprehension (MRC) dataset with particularly long articles in both modern and classical Chinese. It is collected from exam questions for the Chinese course in China’s high schools, which are designed to evaluate the language proficiency of native Chinese youth.

CURATION RATIONALE

To support building a native-level Chinese comprehension system, we collect documents and their questions from exams for the Chinese course in China’s high schools, which are designed to evaluate the language proficiency of native Chinese youth. These questions are challenging even for native Chinese speakers, and the dataset aims to push the frontier of native-level Chinese MRC models.

COLLECTION PROCESS

All the questions and documents are collected from online open-access high-school education materials. After data cleaning, we obtain 8315 documents with 20284 questions. We randomly split the dataset at the document level, with 6315 documents for training, 1000 for validation and 1000 for testing. Furthermore, to make sure our test set has sufficient novel questions that never appear online, we also invited a few high-school Chinese teachers to manually write 193 questions for a total of 73 additional documents to augment the test set. In total, NCR consists of 6315 documents with 15419 questions for training, 1000 documents with 2443 questions for validation and 1073 documents with 2615 questions for testing.
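The split sizes above can be cross-checked with a short sketch. It assumes nothing about the released files: the document ids are placeholder integers, and the function name `split_documents` and the seed are hypothetical choices for illustration, not the procedure actually used to build NCR. Only the counts come from the text.

```python
import random

# Counts reported in the text above.
TOTAL_DOCS = 8315           # documents after data cleaning
TOTAL_QUESTIONS = 20284     # questions after data cleaning
EXTRA_DOCS = 73             # teacher-written documents added to the test set
EXTRA_QUESTIONS = 193       # teacher-written questions added to the test set

# Final (documents, questions) per split, as reported.
FINAL = {
    "train": (6315, 15419),
    "validation": (1000, 2443),
    "test": (1073, 2615),
}


def split_documents(doc_ids, n_val=1000, n_test=1000, seed=0):
    """Shuffle document ids and cut them into train/validation/test.

    Splitting by document keeps all questions of one document in the
    same split. The seed is an arbitrary assumption.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


# Placeholder ids stand in for the real documents.
train, val, test = split_documents(range(TOTAL_DOCS))

# The reported totals should equal the cleaned data plus the augmentation.
docs_consistent = sum(d for d, _ in FINAL.values()) == TOTAL_DOCS + EXTRA_DOCS
questions_consistent = (
    sum(q for _, q in FINAL.values()) == TOTAL_QUESTIONS + EXTRA_QUESTIONS
)
print(len(train), len(val), len(test), docs_consistent, questions_consistent)
```

Both consistency flags depend only on the reported numbers, so they hold regardless of the seed or the ids used.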

LANGUAGE VARIETY

The data are in simplified Chinese (zh-Hans).

WRITING STYLE

This dataset contains both modern Chinese and classical Chinese. Classical Chinese is a writing style that was used in almost all formal writing until the early 20th century, when it was gradually replaced by modern Chinese. It plays a critical role in Chinese culture and has led to numerous idioms and proverbs. Even today, classical literature and poetry are still widely taught and examined in China’s education system. Around one third of the documents in NCR are written in classical Chinese.

ANNOTATOR DEMOGRAPHIC

Since all the materials are collected from online education materials or written by qualified high-school teachers, the annotators are all teachers engaged in high-school Chinese education.