Survey of Current Reproducibility Practices in Linguistics Journals, 2003-2012
[UNDER DEVELOPMENT - 11 August 2017]
Please cite this document as:
Berez-Kroeker, Andrea L., Lauren Gawne, Barbara F. Kelly & Tyler Heston. 2017. A survey of current reproducibility practices in linguistics journals, 2003-2012. https://sites.google.com/a/hawaii.edu/data-citation/survey.
1 Introduction
This is a survey of current practices in linguistics journals with regard to the two parameters for transparency that we deem important to reproducibility in linguistics publications: transparency about the methods used in creating, collecting, and analyzing source data, and transparency about the source data for linguistic research. We also discuss transparency in the citation of individual examples and transparency with regard to the storage and archiving of linguistic data. The aim of this study is to encourage discussion, although the results will probably come as no surprise. If we are to gain greater clarity about the extent to which the research community values reproducibility in linguistics, we need a deeper understanding of the barriers present in current practice.
Building on the work by Gawne and colleagues (2017) on transparency in descriptive grammars, this survey covers a sampling of linguistics journal articles intended to be representative of practices in the field more broadly. We surveyed nine linguistics journals, aiming for broad coverage along a number of dimensions. These included four journals with areal foci (International Journal of American Linguistics, Journal of African Languages and Linguistics, Oceanic Linguistics, Linguistics of the Tibeto-Burman Area); two targeting particular subfields (Journal of Sociolinguistics, Studies in Second Language Acquisition); two representing divergent theoretical persuasions (Natural Language and Linguistic Theory, Studies in Language); and the top journal in the discipline (Language).
All items included in our survey span a ten-year period starting in 2003.[1] We selected this period because it begins five years after the publication of Himmelmann's 1998 paper on the development of language documentation as a way of increasing accountability in linguistics research, allowing authors reasonable time to assimilate new methods for increasing the transparency of data and claims. Articles were selected from the ten-year span as follows. First, we narrowed our survey to the first non-special issue of each year (or, for journals with more than four issues per year, the first two issues). Next, we wrote a randomizing script to select 33 articles from each journal (a sketch of this procedure appears after Table 1); the exception is Linguistics of the Tibeto-Burman Area, for which some issues were unavailable to us during data collection. We did not exclude articles whose authors were represented more than once in our survey. Some articles were removed from the study either for not being written in English (to avoid errors on our part) or for not being data-driven in a way that would merit inclusion here. Our final data set includes 270 articles. Table 1 shows the number of articles per journal included in our study, and the abbreviations used henceforth to refer to those journals.
Table 1. Journal articles included in the survey
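The following is a minimal sketch, in Python, of the kind of randomized selection just described. It is illustrative only: the file layout, the 'journal' column name, and the seed are our assumptions here, not a reproduction of the original script.

import csv
import random

def select_articles(path, per_journal=33, seed=2013):
    """Randomly select up to per_journal articles from each journal."""
    random.seed(seed)  # fixed seed (an assumption) so the selection can be re-run
    by_journal = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):  # assumes a 'journal' column in the CSV
            by_journal.setdefault(row['journal'], []).append(row)
    selected = {}
    for journal, articles in by_journal.items():
        # a journal with fewer available articles (e.g. LTBA) contributes them all
        k = min(per_journal, len(articles))
        selected[journal] = random.sample(articles, k)
    return selected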
Articles were then coded for a range of variables, which are described and exemplified below. The variables were selected after exploratory coding of a subset of the articles and are based in part on theoretical discussions in the language documentation literature, including but by no means limited to Gippert et al. 2006, Bowern 2008, Lüpke 2010, Chelliah & de Reuse 2011, Thieberger 2012, Austin 2013, and Nakayama & Rice 2014. We take the final set of coding variables as a baseline of acceptability for reproducibility. We do not argue that this survey accounts definitively for all linguistic research; however, we hope that by making our methodology clear in this paper we will encourage others to critically review the state of research in their own subfields of linguistics.
2 Variables
2.1 Methodological variables.
Variables in this section measure transparency of research methods on a binary (YES/NO) scale or on a binary scale with a not-applicable option (YES/NO/NA).
Description of data collection methods (YES/NO). This variable tracks whether the author explicitly describes the methods used in data collection. Because journal articles are limited in space, even brief summary statements appearing in footnotes counted as instances of methodological description. An example from an article about complementation in Chimariko from IJAL shows a typical brief methods statement:
For this paper, I examined six oral narratives and a small set of elicited sentences. (The titles and sources of the narratives appear in table 2.) These materials were glossed and translated using all sources available, including Dixon (1910) and Berman (2001). Some words and morphemes remain unclear. In particular, the tense and aspect system coded in the verbal morphology needs further analysis. (Jany 2007:95, fn. 2)
Some articles have entire sections devoted to descriptions of methods used, not only for data collection, but also for coding and analysis. Longer methodological descriptions were found mostly in experiment-based journals like S2LA.
Information about participants in the study (YES/NO). This variable tracks whether the author gives any demographic information about the people from whom data were collected. Some are brief; for example, in an article in JS on the Texas Hmong-speaking community, the author describes interviews with participants as arising from ‘home visits to 10 Hmong households in the Dallas/Fort Worth metropolitan area, which included recorded interviews of 18 people’ (Stanford 2010:97).
Other participant descriptions are lengthier, providing more information about relevant aspects of the population studied. For example, in a JS article on the effect of background noise on classroom conversation, McKellin et al. (2011) describe the broader student population from which participants were selected, and go on to mention the participants’ grade levels, their native languages, and even their hearing abilities:
The school’s total population was composed of 505 students, of whom 52 percent had first languages other than English. The study itself was conducted in four classes, including one split grade 1–3 class (from which only grade 3 students participated in our study), two grade 5 classes, and one grade 7 class. Six students in each class were selected by the teachers for the study from among the volunteers. Each was a native speaker of English who had no educationally disabling conditions. Their hearing levels were checked and found to be normal. Our research took place near the end of the school year; thus, students knew each other well and had established routines for classroom activities. (McKellin et al. 2011:72-73)
Often information about participants appears in a statement whose purpose is to thank those participants, rather than describe them for the readers. Even when brief, we still considered such statements—like this one from a LTBA article on ergativity in a Tibeto-Burman language—to be mentions of information about participants, in that if nothing else they alert the audience to the number of language speakers the researcher worked with:
I am indebted to my Kurtöp speaking friends for enthusiastically sharing their language with me and patiently answering all my questions. In particular these are Kuenga Lhendup, Pema Chhophyel, Kezang Wangchuk, Karma Tenzin, Tshering Yangdzom, Tenzin Wangchuk and Jurme Tenzin. (Hyslop 2010:1, fn. *)
We also considered statements about the language community as a whole to be mentions of participant information. For example, the author of an article in IJAL about theticity in the Cariban language Trio provides information on location and growing multilingualism in the community as a whole, although demographic details about the individual language consultants are not included:
Trio is a member of the Cariban language family and is spoken by approximately 2,000 speakers in the dense rainforests of the south of Suriname and across the border in Brazil. The Trio community is predominantly monolingual, although with increasing contact several speakers now have some competence in one of the national languages: Dutch, Portuguese, and the Surinamese lingua franca, Sranantongo. The data presented in this paper, unless otherwise specified, are taken from my own corpus of data collected over the past 12 years among the Trio in Suriname. (Carlin 2011:1, fn. 1).
Mention of data collection equipment (YES/NO/NA). Some authors, especially authors of phonetics articles, provide details about the hardware used to collect data; for example, the author of an article in LANG on final obstruent voicing in Lezgian writes, ‘[t]he recordings were made using a SONY MZ-R70 minidisk recorder and a Sky Tronic 173.623 microphone’ (Yu 2004:78). But we also counted other data collection apparatus (experimental stimuli, questionnaires, production or comprehension tasks, and the like) as research equipment here. An example is the following description of a picture task in a study of L2 and Deaf learners’ knowledge of English quantification that appeared in S2LA:
The ESPT-Q (Berent et al., 2008) was developed to assess students’ knowledge of English quantifier sentences containing the universal quantifiers each, every, and all (10 items each) as well as sentences containing Num QP (10 items). There were an additional 10 filler items that contained only NPs and no universal or Num QPs. The picture task contained a total of 50 items and 250 depictions. Four randomized versions of the task were developed […]. (Berent et al. 2012:47)
Mention of data analysis tools or software (YES/NO/NA). This variable tracks whether the author mentions the use of any tools that assist in analyzing data. For example, this reference to the statistical package Goldvarb appears in a paper on language use in Aotearoa in JS: “[t]o probe for this possibility I turn to multivariate analysis, using the multiple regression package Goldvarb X (Sankoff, Tagliamonte and Smith 2005)” (D’Arcy 2010:69).
The results of these methodological variables are shown in Table 2.
Table 2. Results of binary variables, expressed as a percentage of non-NA articles from each journal
In terms of authors who at least minimally describe their methods for data collection, the clear stand-out here is S2LA: all of the S2LA authors we surveyed give information on data collection methods. This likely reflects the journal’s experimental focus and norms in that field. On the other end of the spectrum, NLLT and OL have the lowest percentage of collection-methods reporting by authors. From our observations, authors in both journals take the position that the methods used are evident from the type of study presented in the paper. For example, many articles in OL are historical-comparative in nature; the understanding in that discipline is that data are collected by mining field notebooks, dictionaries, and wordlists. Among the areal journals, IJAL authors provide methods discussions most frequently, likely drawing on the Americanist tradition of self-reflective fieldwork à la Sapir (1924) and Boas (1928). In the mid-range are JS and LANG; LANG draws on authors from a range of linguistic sub-disciplines, who bring with them the practices of their own fields.
Reference to the research participants is one of the most frequently provided pieces of methodological information. S2LA and JS both excel in this area, although with different foci: S2LA authors are more likely to describe sets of participants in terms of the variables tested in the study, while JS authors are more likely to include broader social demographic information. At the low end, NLLT authors give participant information least often; in many cases a reader may guess that the data were provided by the author, but this is often not confirmed.
Overt discussion of data collection equipment is not standard practice in any of the areal journals, nor in NLLT. S2LA authors mostly provide information about questionnaires used in experiments, while JS authors tend to describe the recording equipment used to collect spoken data. Across the areal journals, those articles that do contain information about equipment tend to be papers on phonetics, which has a tradition of including mentions of equipment in methodology sections.
Most mentions of data analysis tools or software are references to statistical software (e.g. SPSS (IBM Corp 2013), R (R Core Team 2013)) or to Praat (Boersma & Weenink 2015).
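For readers who wish to re-run the Table 2 tabulation against the coded data set released with this paper (see note 1), the following Python sketch shows the computation we describe: the percentage of YES codes among non-NA articles, per journal. The column names are assumptions about the file layout, not documented facts about the CSV.

import csv
from collections import defaultdict

def percent_yes(path, variable, journal_col='journal'):
    """Percentage of articles coded YES for a variable, per journal, excluding NA."""
    yes = defaultdict(int)
    non_na = defaultdict(int)
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            code = row[variable].strip().upper()
            if code == 'NA':
                continue  # NA articles are excluded from the denominator
            non_na[row[journal_col]] += 1
            if code == 'YES':
                yes[row[journal_col]] += 1
    return {j: round(100 * yes[j] / non_na[j], 1) for j in non_na}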
2.2 Source of data.
We coded for the source of the data used in each article; multiple sources were allowed per article (see the tallying sketch after this list). Sources include:
· INTRO: introspection
· OFN: fieldnotes collected by someone other than the author
· OWN: data collected by the author
· PC: personal communication
· PUBD: published
· UNST: not stated
· UNPUBD: unpublished data other than another researcher’s fieldnotes
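Because a single article can carry several source codes, frequencies of this kind count codes rather than articles. A minimal sketch of the tally, assuming (hypothetically) that the coded CSV stores multiple codes in one semicolon-separated field:

import csv
from collections import Counter

def tally_codes(path, column='source'):
    """Count each source code; one article can contribute several codes."""
    counts = Counter()
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            for code in row[column].split(';'):  # assumed multi-value separator
                if code.strip():
                    counts[code.strip().upper()] += 1
    return counts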
The frequencies of mention of data sources in the different journals are given in the Pareto chart in Figure 1 (each Pareto chart in this paper contains a bar graph showing frequencies, and a line graph above it showing cumulative totals to 100%). Results from individual journals are shown in similar charts in the Supplemental Materials in Appendix 1.
Figure 1. Mention of source of data across all journals
Across all journals, authors show a strong preference for using data they collect themselves (OWN), followed by the use of previously published data (PUBD), followed by the use of data of unstated origin (UNST).[2]
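A chart like Figure 1 can be reproduced from such a tally with a few lines of matplotlib; the sketch below implements the bar-plus-cumulative-line layout described above. The counts passed in at the bottom are invented for illustration (ordered as in Figure 1: OWN > PUBD > UNST) and are not our survey results.

import matplotlib.pyplot as plt

def pareto(counts):
    """Bar graph of frequencies with a cumulative-percentage line above it."""
    labels, freqs = zip(*sorted(counts.items(), key=lambda kv: -kv[1]))
    total = sum(freqs)
    cumulative, running = [], 0
    for f in freqs:
        running += f
        cumulative.append(100 * running / total)
    fig, ax = plt.subplots()
    ax.bar(labels, freqs)
    ax.set_ylabel('frequency')
    ax2 = ax.twinx()  # second y-axis for the cumulative line
    ax2.plot(labels, cumulative, marker='o', color='black')
    ax2.set_ylim(0, 105)
    ax2.set_ylabel('cumulative %')
    plt.show()

# Invented counts, for illustration only:
pareto({'OWN': 120, 'PUBD': 60, 'UNST': 40, 'INTRO': 25, 'OFN': 10, 'PC': 8, 'UNPUBD': 5})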
2.3 Citation conventions used in numbered examples from the source data.
The use of numbered examples is a hallmark of linguistics writing, and these are usually drawn from collected or ad hoc data. We discovered a broad range of methods for citing numbered examples back to their sources. Sources could be data sets (both publicly accessible and privately held), published texts such as Bible translations, or other academic publications. Articles with no numbered examples were coded as NA, and those with examples but no citation were coded as NONE. The full range of conventions we found is described in detail in the Supplemental Materials in Appendix 2, but only one is used frequently enough to include here, namely the ubiquitous reference to a previously printed publication:
· STANDARD: citation appears in ‘standard’ format for published examples, including the author’s name, year, and page number, as in this example from SL (Mauri 2008:23), in an article on the typology of disjunction:
Wari’, Chapacura-Wanam (Everett and Kern 1997:162)
mo ta pa’ ta’ hwam ca,
cond realis.fut kill 1sg:realis.fut fish 3sg.m
mo ta pa’ ta’ carawa ca
cond realis.fut kill 1sg:realis.fut animal 3sg.m
‘Either he will kill fish or he will hunt.’
In sum, 56.7% of the articles we surveyed contain numbered examples with no citation of any kind. In every journal, and across all journals combined, the STANDARD citation format for printed matter is the most common, but even this format accounts for only 23.4% of all examples. Each of the other citation formats we found covers only <1%-3% of the data. The frequencies of the citation conventions used across the journals, and in individual journals, are shown in the Supplemental Materials in Appendix 1.
2.4 Where the data are now.
We coded for where the data are currently located, if stated by the author. Options included:
· ARCH: archived in an institutional repository, either digital or physical
· HERE: the article itself contains the data and is their main source
· HERESUMMARY: data are summarized in the article, using descriptive statistics, tables, graphs, or other presentation
· PUBD: published
· ONL: online (a website or other non-archive internet-based storage)
· UNST: not stated
Figure 2 shows the location of data used in all journals; results from individual journals are shown in the Supplemental Materials in Appendix 1.
Figure 2. Described location of data across all journals
Over half the authors of articles we surveyed do not explicitly state where the data upon which their study is based can be located by a curious reader. There are some genres of linguistic data that may not lend themselves as naturally to a description of their location as, say, a set of experimental stimuli posted to the Open Science Framework.[3] As we discussed in §2.2, introspection is still a useful linguistic tool for many researchers. A move towards greater transparency in methodology and data source does not mean that researchers must abandon such methods, only that they mention, even briefly, that this is the process by which the data were acquired.
3 Where are we, and where can we go from here?
This survey of our discipline reveals both good news and bad for the current state of reproducibility in linguistics. First, the bad news: with regard to the metrics we examined here, there is a lot of room for improvement in order for linguistics publications to provide the kind of transparency of data and methods that we discuss in Section 1. We found that readers are implicitly asked to make assumptions about aspects of the research process: that data are collected in an appropriate manner, that data sets are locatable and verifiable, and that examples of linguistic phenomena are representative of the context(s) from which they are drawn. Few among us advertise in our publications that we have taken responsibility for the longevity and accessibility of our data sets, which means that precious endangered language data can disappear, and expensive experiments may be recreated out of ignorance, rather than from a spirit of scientific reproducibility. In short, we are in danger of being a social science asking its audience to take our word for it.
But our study also reveals some very good news, which holds promise for linguistics becoming more transparent in the future. We found that different subfields do indeed have strengths in particular facets of research transparency, as represented by the publications we surveyed. Practitioners in different subfields ‘do transparency’ well in different ways, and these practices could serve as models for an eventual amalgamated standard. For example, S2LA authors describe research methods exceptionally well; the strong experimental focus of the journal means that a methods section is a normalized expectation. Authors in S2LA, JS, and IJAL frequently provide information about research participants. Authors in S2LA and JS usually provide information about tools, hardware, and software, as do authors of phonetics papers across all journals.
Differences across subfields account for our findings: some journal authors omit the explication of some factors because those factors are generally understood, while others include them by tradition. Claims about introspective data are generally understood to have been made by fluent speakers, and historical-comparative data are understood to come from unpublished wordlists and published dictionaries. Field linguists describe the speech community and their fieldwork conditions by tradition; phoneticians have a tradition of describing their equipment.
All authors in the survey include standard citations of published material, which precisely illustrates our point: because there is a disciplinary expectation to cite published material correctly, and a standard format for doing so, all authors in all the journals we surveyed do it consistently. Linguistics has no such expectations, recommendations, or formats for the other factors we examined here.
[1] A list of all articles included in our survey is found in the Supplemental Materials in Appendix 3, and the anonymized coded data set is included as a CSV file.
[2] Observations about tendencies in different subfields may be drawn from the results in the Appendix in the Supplemental Materials.
[3] http://osf.io/