Datasheets for Datasets

MOTIVATION

  • For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

We seek to build a native-level Chinese machine reading comprehension (MRC) system, so we collect documents and associated questions from the exam questions for the Chinese course in China's high schools, which are designed to evaluate the language proficiency of native Chinese youth. These questions are not easy even for native Chinese speakers and aim to push the frontier of building native-level Chinese MRC models.


  • Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

It was created by Haihua Institute for Frontier Information Technology.


  • Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

Haihua Institute for Frontier Information Technology.

COMPOSITION

  • What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

Each instance contains a document and a list of multiple-choice questions about that document. Each question consists of a question text and 2~4 options, of which exactly one is correct; the correct answer is also included. An example instance is as follows:

{
  "ID": 1,
  "Content": "书家和善书者沈尹默“古之善书者,往往不知笔法。xxxxxx",
  "Questions": [
    {
      "Question": "根据原文内容,下列说法不正确的一项是",
      "Choices": [
        "A.善书而不知笔法,这一现象出现在写字初期,当时笔法还未被充分发现和利用。",
        "B.唐代爱好写字的人渐多,有一批人好奇立异,自创规则,经生体就是这么产生的。",
        "C.二王、欧、虞、褚、颜诸家都是严格遵守笔法的典型,他们都属于书家的行列。",
        "D.元明清三代,书画创作每况愈下,优秀作品越来越少,与守法度的习惯被破坏有关。"
      ],
      "Answer": "B",
      "Q_id": "000101"
    },
    {
      "Question": "下列关于原文内容的理解和分析,不正确的一项是",
      "Choices": [
        "A.在写字过程中,那些与实际不能完全切合的人为的规则,不具有普遍的永久的活动性,因而不能称之为笔法。",
        "B.书与画相似,书家之书正如画师之画,谨严而不失法度,而善书者之书正如文人的写意,别有风致。",
        "C.苏东坡天分高,修养深,意造的书画自有天然之趣,但率先破法,放任不羁,成为后世不守法度的借口。",
        "D.一味从心所欲做事是不可取的,但写字的人如能做到“从心所欲不逾矩”,却能达到最高的境界。"
      ],
      "Answer": "C",
      "Q_id": "000102"
    }
  ],
  "Type": "00"
}

  • ID: Document ID

  • Content: Document text

  • Questions: A list of questions

  • Type: Writing style of the document; 00 for modern-style (without poetry), 11 for classical-style (without poetry), 22 for classical poetry, 33 for modern poetry

  • Questions:Question: Question text

  • Questions:Choices: A list of options

  • Questions:Answer: The ground-truth answer

  • Questions:Q_id: Question ID
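
As a minimal sketch of how instances can be consumed (assuming each split is released as a JSON file containing a list of such instances; the file name train.json is a placeholder, not the actual release path):

import json

# Placeholder file name; substitute the actual path of the released split.
with open("train.json", encoding="utf-8") as f:
    instances = json.load(f)  # assumed to be a list of instance dicts

num_questions = 0
for instance in instances:
    document = instance["Content"]      # document text
    for q in instance["Questions"]:
        num_questions += 1
        choices = q["Choices"]          # 2~4 options, e.g. "A.…"
        answer = q["Answer"]            # ground-truth label, e.g. "B"

print(f"{len(instances)} documents, {num_questions} questions")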


  • How many instances are there in total (of each type, if appropriate)?

NCR (Native Chinese Reader) consists of 6315 documents with 15419 questions for training, 1000 documents with 2443 questions for validation, and 1073 documents with 2615 questions for testing.


  • Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).

The dataset is a sample of instances from the larger set of all exam questions for the Chinese course in China's high schools. It is impossible to collect all such questions, since there are too many online and new ones are generated every day.


  • What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

All data is in the form of raw text.


  • Is there a label or target associated with each instance? If so, please provide a description.

We provide each question with the correct answer. For data in the validation and test set, we also provide the writing style of the document.


  • Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

There is no information missing.


  • Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

Each document is associated with several questions; the questions that share the same document are grouped together in a single instance.


  • Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

We randomly split the dataset collected online at the document level, with 6315 documents for training, 1000 for validation, and 1000 for testing. To make sure our test set has sufficient novel questions that never appear online, we also invited a few high-school Chinese teachers to manually write 193 questions for 73 additional documents to augment the test set. Finally, NCR consists of 6315 documents with 15419 questions for training, 1000 documents with 2443 questions for validation, and 1073 documents with 2615 questions for testing.
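
A minimal sketch of such a document-level split (the seed, file name, and combined-file format are illustrative assumptions, not our exact procedure); splitting at the document level keeps every question about a document in a single split:

import json
import random

random.seed(0)  # illustrative seed
with open("ncr_all.json", encoding="utf-8") as f:  # hypothetical combined file
    docs = json.load(f)

random.shuffle(docs)
train = docs[:6315]
valid = docs[6315:7315]
test = docs[7315:8315]  # later augmented with 73 teacher-written documents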


  • Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

There are no errors, sources of noise, or redundancies in the dataset, because we filtered such data out during curation.


  • Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

The dataset is self-contained.


  • Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description.

No, all data is public.


  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

The dataset does not contain any data that may be offensive, insulting, threatening, or might otherwise cause anxiety.


  • Does the dataset relate to people? If not, you may skip the remaining questions in this section.

This dataset does not relate to people.


COLLECTION

  • How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

Each instance consists of a document and several questions that share this document; all of the data is directly observable raw text.


  • What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?

The data was collected by manual human curation, and the collected data was validated by another person to ensure its quality.


  • Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

We contracted out the data collection process to 海天瑞声 (SpeechOcean, http://en.speechocean.com/).


  • Over what timeframe was the data collected? Does this timeframe match the creation time frame of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

The data collection process lasted around 40 days. This timeframe does not match the creation timeframe of the data: the dataset contains many articles in classical Chinese, some of which date back thousands of years.


  • Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

No ethical review processes were conducted.

PREPROCESSING

  • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

We cleaned the data by filtering out questions that depend on formatting, which is lost in plain text. For example, some questions refer to marked or boldface words in the document.
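
As a sketch of this kind of filter (the keyword list below is hypothetical, chosen for illustration; it is not our actual filtering criterion):

# Hypothetical keywords indicating format-dependent questions; not the actual criteria.
# "加点" = marked/dotted, "画线" = underlined, "黑体" = boldface.
FORMAT_KEYWORDS = ("加点", "画线", "黑体")

def keep_question(question_text: str) -> bool:
    """Drop questions that refer to visual formatting lost in plain text."""
    return not any(kw in question_text for kw in FORMAT_KEYWORDS)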


  • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

No.


  • Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.

No.

USES

  • Has the dataset been used for any tasks already? If so, please provide a description.

To examine the limits of current MRC methods, we organized a 3-month-long online competition using NCR, with the training and validation sets released. Participants were allowed to use any open-access pre-trained model or any open-access unlabeled data, but the use of any external MRC supervision was forbidden, since a portion of the test questions may be accessible online; this prevents human annotations from overlapping with our held-out data and keeps the competition fair. A total of 141 teams participated, and the submission with the highest test accuracy is taken as the competition model. The winning team is from an industry lab. They first pre-trained an XLNet-based model on a large company-collected corpus. For each question, they use the information-retrieval method Okapi BM25 to extract the most relevant parts of the document and then run the pre-trained model for answer selection on the extracted text. The final model with the highest accuracy is released at: https://github.com/xssstory/NCR_competition_model
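
To illustrate the retrieval step, here is a minimal sketch of BM25-based passage extraction (assuming the rank_bm25 and jieba Python packages; the sentence-level granularity and the top-k value are illustrative choices, not the winning team's exact setup):

import re

import jieba                     # Chinese word segmentation
from rank_bm25 import BM25Okapi  # Okapi BM25 implementation

def extract_relevant(document: str, question: str, top_k: int = 3) -> str:
    """Return the top_k sentences of `document` most relevant to `question`."""
    # Split on Chinese end-of-sentence punctuation.
    sentences = [s for s in re.split(r"[。！？]", document) if s.strip()]
    bm25 = BM25Okapi([jieba.lcut(s) for s in sentences])
    scores = bm25.get_scores(jieba.lcut(question))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Re-join the selected sentences in their original document order.
    return "。".join(sentences[i] for i in sorted(ranked[:top_k]))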


  • Is there a repository that links to any or all papers or systems that use the dataset?

The competition website: https://www.biendata.xyz/competition/haihua_2021/

GitHub for the competition model with the highest accuracy: https://github.com/xssstory/NCR_competition_model

GitHub for baselines: https://github.com/xssstory/NCR_baseline


  • What (other) tasks could the dataset be used for?

The dataset may also be used for question and answer generation tasks.


  • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

Users should keep in mind that the questions and their answers come from different teachers.


  • Are there tasks for which the dataset should not be used? If so, please provide a description.

Unknown.

DISTRIBUTION

  • Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

Yes, the dataset is public now.


  • How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

The dataset is available on GitHub (https://github.com/xssstory/NCR_competition_model) and Google Drive (https://drive.google.com/drive/folders/1Ci-KLHKk-yP-y5fWX4_cU8bA2fL_q76e?usp=sharing).


  • When will the dataset be distributed?

It is available now.


  • Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

This dataset is released under the CC BY-SA 4.0 license for general research purposes.


  • Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

No.


  • Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

No.

MAINTENANCE

  • Who is supporting/hosting/maintaining the dataset?

The dataset is hosted on GitHub and Google Drive and will be maintained by Shusheng Xu and Yichen Liu.


  • How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

The maintainers can be contacted at xuss20@mails.tsinghua.edu.cn and y17043@nyu.edu.


  • Is there an erratum? If so, please provide a link or other access point.

No.


  • Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?

The dataset will be updated if necessary. Updates will be communicated via the project website (https://sites.google.com/view/native-chinese-reader/) and via GitHub (https://github.com/xssstory/NCR_competition_model and https://github.com/xssstory/NCR_baseline).


  • If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

N/A


  • Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.

Yes. Any changes will be communicated to users via the project website (https://sites.google.com/view/native-chinese-reader/) and via GitHub (https://github.com/xssstory/NCR_competition_model and https://github.com/xssstory/NCR_baseline).


  • If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.

This is not currently supported.