Data Statements for NLP: Towards Best Practices
Call for Participation
We invite participants who are currently developing NLP datasets to join us for a one-day working meeting at LREC 2020. In this open collaboration session, participants will develop data statements (Bender & Friedman 2018) for specific datasets and, in the process, refine a set of best practices for creating data statements. Specifically, workshop participants will: (1) be introduced to the concept, structure, and uses of data statements; (2) draft a data statement for the dataset(s) they bring to the workshop; (3) work in small groups to critique and refine their data statements; and (4) reflect on best practices for writing and disseminating data statements.
This event will be organized differently from typical workshops. It is an open collaboration session providing a structured opportunity for a diverse range of participants in our community to help shape and codify best practices. The deliverables from this workshop will be (a) a data statement for each participant's dataset and (b) a preliminary best practices document. These will be disseminated online, together with the overview materials provided by the workshop organizers; the data statements will serve as examples of what following the preliminary best practices produces.
There will be no reviewing process ahead of this workshop, nor any proceedings. All participants are welcome, and we especially encourage attendance by people who are currently developing datasets for NLP. We have a small amount of funding available to support participation in this workshop. The application deadline for that funding is January 15, 2020. For details, see “financial support” below.
We will work towards best practices for creating data statements, exploring questions like the following:
- How can the information required be efficiently collected?
- What steps can be taken in the planning for a dataset to facilitate the collection of relevant metadata about speakers and annotators?
- What heuristics are there for writing data statements that are concise and informative?
- How can we incorporate material from institutional review board/ethics committee applications into the data statement schema?
- How can we best settle on an appropriate level of detail given privacy concerns, especially for small or vulnerable populations?
- How can we produce data statements for older datasets that predate this practice?
- Finally, how can data statements be incorporated into metadata already associated with datasets, such as that called for by the CLARIN or META-SHARE schemas?
To ensure that the best practices developed are as broadly applicable as possible, we especially encourage participation from developers of datasets for low-resource languages and/or dataset developers from countries not well represented at major NLP conferences.
In order for these best practices to be responsive to the needs of researchers around the world, and not just those in the most well-resourced communities, it is critical that they be designed with a broad range of input. We have already secured funding to bring two invited participants from underrepresented communities to LREC to participate in this workshop plus the main conference, and are currently seeking additional funding. To be considered for this support, please email the following information to ebender-at-uw.edu by January 15, 2020:
- Name, country of residence, affiliation
- A brief description of a current or near-future dataset creation project you are involved with for which you’d like to work on a data statement at the workshop (including language(s) in the dataset, intended use case, a brief description of any annotations provided, and other details you would like to share)
- In what ways would your participation in this workshop broaden the perspectives likely to be represented at our workshop and at LREC?
Emily M. Bender, University of Washington, Department of Linguistics
Batya Friedman, University of Washington, Information School
Angelina McMillan-Major, University of Washington, Department of Linguistics
We thank the Tech Policy Lab at the University of Washington for its support of this workshop.