Developing Standards for Data Citation and Attribution for Reproducible Research in Linguistics

A project of the National Science Foundation (NSF 1447886, PIs Berez-Kroeker, Holton, Kung, Pulsifer)

This 2-year project (2015-2017) supports a series of three workshops and one panel presentation bringing together relevant stakeholders to develop and promote standards for data citation and attribution for linguistic data.

Linguistics is a data-driven social science in which inferences about human cognition and social structure are drawn from observations of linguistic practice. These observations, in the form of recordings and associated annotations, represent the primary data sets that underlie the field. This practice has its roots in philology, which relies on texts as a primary data source. However, three recent and inter-related factors make the data-oriented model of linguistics particularly relevant to the field at the current time. First, a major shift in technology has resulted in rapidly growing volumes of digital language data. Second, more than half of the world's languages are critically endangered, so that in the not-so-distant future archival data will be the only source of information on those languages. Third, the emergence of Documentary Linguistics as a recognized sub-field has led to an increased focus on data curation and management.

While linguists have always relied on language data, they have not always facilitated access to those data. Linguistic publications typically include short excerpts from data sets, ordinarily consisting of fewer than five words, and often without citation. Where citations are provided, the connection to the data set is usually only vaguely identified. An excerpt might be given a citation which refers to the name of the text from which it was extracted, but in practice the reader has no way to access that text. That is, in spite of the potential generated by recent shifts in the field, a great deal of linguistic research produced today is not reproducible, either in principle or in practice. The workshops and panel presentation will facilitate development of standards for the curation and citation of linguistic data that are responsive to these changing conditions and that shift the field of linguistics toward a more scientific, data-driven model which results in reproducible research.

A primary factor hindering the development of reproducible research in linguistics is the lack of standards for data citation and attribution. Although language data are increasingly recognized as important, there are no widely established guidelines for the citation of these data. Equally important, there are no standards for attribution. Lacking such standards, journals, academic tenure and promotion committees, and peer review processes continue to emphasize linguistic analyses over linguistic data, and as a result linguists have little incentive to make data accessible. A data-driven linguistic science has the potential to substantiate scientific claims by promoting attention to the care and structuring of language data.

By the end of the project, the researchers will have held three workshops to research and develop a model for data citation and attribution in linguistics; facilitated discipline-wide discussion on these topics at the 2017 annual meeting of the Linguistic Society of America; written a position paper on standards for citation and attribution in linguistics; and submitted a proposal for a Resolution on citation and attribution to the LSA.

This project is one of seven funded under a cross-directorate initiative, Supporting Scientific Discovery through Norms and Practices for Software and Data Citation and Attribution. The NSF Workshop on Data and Software Citation (June 6-7, 2016) will bring together many of these projects to work toward data citation and attribution standards across scientific disciplines.