Lesson 3

Synopsis

There is tremendous diversity in the kinds of data used in the study of language, which reflects the unusual position of linguistics as a discipline where scholars frequently adopt methods associated with the humanities, social sciences, cognitive sciences, and computer science, among other areas. While this inevitably presents challenges for any attempt to describe the full scope of linguistic data, it can also be seen as an under-recognized opportunity for the field to be at the forefront of questions of data management, because linguists are familiar with a much wider range of data types than scholars in many other areas. This chapter provides an overview of the different kinds of data that are used within linguistics, considering both data that can be viewed as a direct representation of observable linguistic behavior and secondary data types used to support linguistic analysis, such as data on language users, annotations made on primary data, and linguistic metadata. The chapter also considers the different ways that linguists structure data so that it can be used to justify abstract analyses.

Core concepts & keywords

Observable linguistic data: Broad category which includes naturalistic and elicited data collected for linguistic analysis.

Lexical data: Data used to describe the lexicon of a language and to create lexicographic resources such as dictionaries.

Specialized data from observable behavior: Linguistic data collected specifically to facilitate linguistic research rather than being produced through natural language use, e.g., the collection of grammaticality judgments on constructed sentences.

Metadata: "Data about data" that serves more of a book-keeping than analytic function.

Conventional corpora: Corpora which focus on linguistic data associated with a standardized writing system, e.g. the British National Corpus.

Unconventional corpora: Corpora containing data from heterogeneous writing systems, e.g. creole language corpora.

Syntagmatic structure: Relations that hold among linguistic elements comprising some kind of linguistic constituent, such as those represented by a syntactic constituency tree.

Paradigmatic structure: Relations that hold across linguistic elements due to some kind of connection that they have with each other in a grammatical system, such as those found among words which comprise an inflectional paradigm.

Annotation: Description or analysis that is associated with "raw data" or with other annotations. Examples include transcription, glossing, and part-of-speech annotation.

Inline annotation: Annotations which are embedded within the primary data file, e.g. XML annotation via tags found within the primary text document itself.

Stand-off annotation: Annotations which are not included within the file containing primary data but in a separate file which references the original file, e.g. the time-aligned annotations produced using ELAN.
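
The contrast between inline and stand-off annotation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence, the tag and attribute names (s, w, pos), and the character-offset scheme are hypothetical and are not taken from the chapter. ELAN itself aligns annotations to time rather than to character offsets, and its file format is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Inline annotation: part-of-speech tags are embedded in the text itself.
# The element and attribute names (s, w, pos) are made up for illustration.
INLINE_XML = (
    '<s><w pos="DET">the</w> <w pos="NOUN">dog</w> '
    '<w pos="VERB">barks</w></s>'
)

# Stand-off annotation: the primary text is left untouched, and the
# annotations are kept separately, pointing back to it by character offsets.
# The offset scheme is an assumption for this sketch.
PRIMARY_TEXT = "the dog barks"
STANDOFF = [
    {"start": 0, "end": 3, "pos": "DET"},
    {"start": 4, "end": 7, "pos": "NOUN"},
    {"start": 8, "end": 13, "pos": "VERB"},
]


def read_inline(xml_text):
    """Return (token, tag) pairs from an inline-annotated sentence."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("pos")) for w in root.iter("w")]


def read_standoff(text, annotations):
    """Return (token, tag) pairs by resolving offsets against the text."""
    return [(text[a["start"]:a["end"]], a["pos"]) for a in annotations]


def strip_inline(xml_text):
    """Recover the plain primary text from the inline-annotated version."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


if __name__ == "__main__":
    print(read_inline(INLINE_XML))
    print(read_standoff(PRIMARY_TEXT, STANDOFF))
    print(strip_inline(INLINE_XML))
```

Both representations carry the same information. The practical difference is that the inline version has to be parsed to recover the unannotated text, while the stand-off version leaves the primary data as it was and can hold several independent annotation layers that reference the same file.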

Activities

Exercises - Practice what you've learned

  • Find a corpus online and determine (1) if it is conventional or unconventional; (2) if it contains naturalistic or elicited data; and (3) if it contains primary and/or secondary data.

  • Consider a research topic in linguistics that interests you and decide on a hypothetical research question. If you were to explore your question, what kind of data would you need to collect? Would you look at text corpora, language user judgments, instrumental data, or another data type? Would you build on your data with Interlinear Glossed Text, treebanks, XML annotations, feature structures, or another form of analysis? Think about why the data and analysis types you chose would be most appropriate for your research question.

Implement these practices in your career

  • Using your own data set, determine the following: Is the data considered primary or secondary? Does the data contain annotations? If so, are they inline or stand-off annotations?

  • Try to develop a system for inline XML markup of one of your text data sets (see end of section 3.2).

  • Think about where you may want to archive your data and make contact with the archive. Find out what metadata standards the archive requires (e.g. OLAC, CMDI). Research the required metadata standard, and plan how to collect the appropriate metadata.
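
For the metadata planning step, one simple approach is to keep a draft record for each item and check it against the list of fields the archive asks for. The sketch below assumes a hypothetical list of required fields and made-up record values; it does not reproduce the OLAC or CMDI requirements, so the actual field list should come from the archive you contact.

```python
# Hypothetical checklist: replace with the fields your archive requires.
REQUIRED_FIELDS = ["title", "creator", "date", "language", "rights"]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a draft record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Made-up draft record for illustration only.
draft = {
    "title": "Example recording session",
    "creator": "A. Researcher",
    "date": "",          # not yet filled in
    "language": "eng",
}

if __name__ == "__main__":
    print(missing_fields(draft))
```

Running a check like this before each collection session makes it easier to notice which metadata still needs to be gathered while the information is readily available.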

Quiz - Test yourself!

Related readings

Berez-Kroeker, Andrea L.; Gawne, Lauren; Smythe Kung, Susan; Kelly, Barbara F.; Heston, Tyler; Holton, Gary; Pulsifer, Peter; Beaver, David I.; Chelliah, Shobhana; Dubinsky, Stanley; Meier, Richard P.; Thieberger, Nick; Rice, Keren; and Anthony C. Woodbury. 2017. Reproducible research in linguistics: A position statement on data citation and attribution in our field. Linguistics 56. 1–18.

Share your thoughts on this article or topic

Use #LingData #Annotation #Metadata on your favorite social media platform!

About the author:

Jeff Good

Jeff Good is Professor in the Department of Linguistics at the University at Buffalo. His research interests include morphosyntactic typology, language documentation, and comparative Niger-Congo linguistics. His documentary work focuses on endangered languages of the Lower Fungom region of Cameroon and includes significant interdisciplinary data collection components.

Citations

Cite this chapter:

Good, Jeff. 2022. The scope of linguistic data. In The Open Handbook of Linguistic Data Management, edited by Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, and Lauren B. Collister, 27-48. doi.org/10.7551/mitpress/12200.003.0007. Cambridge, MA: MIT Press Open.

Cite this online lesson:

Gabber, Shirley, Danielle Yarbrough, Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, Lauren B. Collister, and Jeff Good. 2022. "Lesson 3." Linguistic Data Management: Online companion course to The Open Handbook of Linguistic Data Management. Website: https://sites.google.com/hawaii.edu/linguisticdatamanagement/course-lessons/03-the-scope-of-linguistic-data [Date accessed].