Jesse Egbert

The Lancaster-Northern Arizona Corpus of American Spoken English (LANA-CASE)

Collaborators at NAU: Jesse Egbert, Tove Larsson, Doug Biber, Randi Reppen, Lizzy Hanks

Collaborators at Lancaster University: Tony McEnery, Paul Baker, Vaclav Brezina, Gavin Brookes, Isobelle Clarke, Raffaella Bottini

The goal of this project is to compile a comparable American English counterpart to the widely known Spoken BNC2014 (Love et al., 2017). While there are several spoken corpora that represent specific subsets of the United States population, this corpus will be the first publicly available, large-scale corpus that represents general conversational American English. More details are available on our website and Twitter, @LANA_corpus.


Statutory interpretation and linguistic canons of construction

Collaborators: Thomas Lee, Lee | Nielsen and Corpus Juris Advisors; Margaret Wood, NAU

There is a movement to incorporate linguistics into statutory interpretation, or the interpretation of laws (i.e. statutes) by judges. In order to interpret the meaning of ambiguous words and grammatical structures used in a law, textualist judges often refer to linguistic prescriptions called linguistic 'canons of construction'. However, these canons have never been subjected to scrutiny based on linguistic theory or patterns of actual language use. We propose that the field of linguistics can aid in judicial interpretation, particularly with a method for investigating (and possibly improving) the validity of the linguistic canons of construction. We are currently investigating several of the canons of construction, including the Surplusage Canon, the Last Antecedent Canon, the Series Qualifier Canon, and the Nearest-Reasonable Referent Canon.


Language development among children with Down syndrome

Collaborator: Elizabeth Kay-Raining Bird, Dalhousie University

Verbal communication is a major challenge for individuals with Down syndrome (DS) (Chapman et al., 1991). While much is known about the language development of monolinguals with DS (Martin et al., 2009), less research has focused on development among bilinguals in this population (Cleave et al., 2012) and how this development varies across registers. We are exploring longitudinal language development in children and adolescents with DS in two registers (oral narratives and conversation) compared to the language of typically developing (TD) children. Our research questions are:

1. To what extent does the grammar of children and adolescents with DS change over time and vary across registers?

2. To what extent does the spoken grammar of children and adolescents with DS approach the patterns of TD children matched for nonverbal mental age?


Exploring variation among web registers at the intersection of continuous linguistic and situational spaces

Main researcher: Jesse Egbert

Collaborator(s): Doug Biber, NAU; Daniel Keller, NAU

In previous research, we have explored the multi-dimensional patterns of variation among web registers in a continuous space of linguistic variation (see Biber & Egbert, 2018) and in a continuous space of situational variation (see Biber, Egbert & Keller, 2020). In this ongoing project, we are bringing those analyses together, exploring how web registers can be described simultaneously with respect to both continuous linguistic and situational parameters. The theoretical contribution of this project is to illustrate how overall descriptions of register variation are more informative when they integrate situational and linguistic analyses, rather than treating the two as sequential steps in the analysis.

Biber, D. & Egbert, J. (2018). Register Variation Online. Cambridge: Cambridge University Press.

Biber, D., Egbert, J., & Keller, D. (2020). Reconceptualizing register in a continuous situational space. Corpus Linguistics and Linguistic Theory, 16(3), 581-616.


Functional units of conversational discourse

Collaborator(s): Stacey Wizner, NAU; Daniel Keller, NAU; Doug Biber, NAU; Tony McEnery, Lancaster University; Paul Baker, Lancaster University; Frazer Heritage, Lancaster University; Gill Phillips, Lancaster University; Ed Finegan, University of Southern California

Conversations can typically be segmented into multiple parts or units that are each characterized by an overarching communicative purpose (e.g. telling a story, expressing an opinion, figuring things out). In order to investigate linguistic variation across these functional units, as well as possible interactions with demographic speaker variables, we developed a new method to manually segment transcribed conversations into conversation units and code those units for one or more communicative functions. We have completed the development, piloting, and validation phases and are now coding a large sub-sample of the conversational files in the British National Corpus Spoken 2014 (see Egbert, Wizner, Keller, Biber, McEnery & Baker, 2021).

Egbert, J., Wizner, S., Keller, D., Biber, D., McEnery, T., & Baker, P. (2021). Identifying and describing functional discourse units in the BNC Spoken 2014. Text & Talk, 41(5-6), 715-737.


Investigating lexical prevalence through frequency and dispersion

Collaborators: Brent Burch, Mathematics and Statistics, NAU; Doug Biber, NAU

State-of-the-art research on lexis is founded upon many unwarranted assumptions about the nature of word prevalence and the best way of measuring it. In this series of related research projects we question those assumptions and, where necessary, propose new methods for measuring constructs such as frequency and dispersion. Burch, Egbert & Biber (2017) introduces a new measure of lexical dispersion (DA). Egbert, Burch & Biber (2020) proposes a modification to DA that accounts for dispersion across unequally sized parts and illustrates the importance of measuring dispersion across linguistically meaningful parts. Burch & Egbert (2019) introduces the zero-inflated beta distribution as a method for modeling word frequency and dispersion across texts. We are currently combining what we have learned to date to develop a new measure of lexical prevalence for vocabulary list creation (Egbert & Burch, in press) and the creation of hierarchical word tiers/classes (Burch & Egbert, 2022; in press).
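As a rough illustration of what a dispersion measure captures, the sketch below computes a pairwise-difference form of DA. The formula is not given on this page: it is our reading of the measure described in Burch, Egbert & Biber (2017), and it is not the modified version from Egbert, Burch & Biber (2020). The function name `dispersion_da` and the example counts are hypothetical.

```python
from itertools import combinations


def dispersion_da(counts, part_sizes):
    """Pairwise-difference dispersion for one word (assumed form of DA).

    counts      -- occurrences of the word in each corpus part
    part_sizes  -- total number of words in each corpus part
    Returns a value between 0 (all occurrences in one part)
    and 1 (perfectly even distribution across parts).
    """
    if len(counts) != len(part_sizes) or len(counts) < 2:
        raise ValueError("need matching counts and sizes for at least two parts")

    # Normalized frequency of the word in each part.
    rates = [c / s for c, s in zip(counts, part_sizes)]
    mean_rate = sum(rates) / len(rates)
    if mean_rate == 0:
        raise ValueError("the word does not occur in any part")

    # Mean absolute difference over all unordered pairs of parts.
    pairs = list(combinations(rates, 2))
    mean_abs_diff = sum(abs(a - b) for a, b in pairs) / len(pairs)

    # Scale by twice the mean rate so the result falls between 0 and 1.
    return 1 - mean_abs_diff / (2 * mean_rate)


# Hypothetical counts in four equally sized parts of 1,000 words each.
even = dispersion_da([10, 10, 10, 10], [1000] * 4)
clumped = dispersion_da([40, 0, 0, 0], [1000] * 4)
print(even, clumped)
```

Both hypothetical words occur 40 times in total, so frequency alone cannot tell them apart; the dispersion value separates the evenly spread word from the one concentrated in a single part.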

Burch, B. & Egbert, J. (2019). Zero-inflated beta distribution applied to word frequency and lexical dispersion in corpus linguistics. Journal of Applied Statistics.

Burch, B., & Egbert, J. (2022). Confidence intervals for ratios of means applied to corpus-based word frequency classes. Journal of Applied Statistics, 1-19.

Burch, B., & Egbert, J. (in press). Word Use Equivalence and Hierarchical Word Tiers. Journal of Quantitative Linguistics.

Burch, B., Egbert, J., & Biber, D. (2017). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3(2), 189-216.

Egbert, J., & Burch, B. (in press). Which words matter most? Operationalizing lexical prevalence for rank-ordered word lists. Applied Linguistics.

Egbert, J., Burch, B., & Biber, D. (2020). Lexical dispersion and corpus design. International Journal of Corpus Linguistics, 25(1), 89-115.

Designing and evaluating language corpora

Collaborators: Bethany Gray, Iowa State University; Doug Biber, NAU

We define a corpus as a sample of natural texts drawn from a larger population, or target discourse domain. Whereas many other fields have established methods for sampling, most corpus compilers have ignored such methods and focused instead on collecting very large convenience samples. Drawing on theory and methods from other disciplines, as well as from extensive empirical research, we introduce methods and best practices for designing, collecting, and evaluating corpora for the extent to which they are situationally and linguistically representative of the target discourse domain.

Egbert, J., Biber, D., & Gray, B. (2022). Designing and Evaluating Language Corpora: A Practical Framework for Corpus Representativeness. Cambridge University Press.

Reconceptualizing register in a continuous situational space

Collaborators: Doug Biber, NAU; Daniel Keller, NAU

Corpus-based methods for the quantitative linguistic description of registers are well established. In contrast, situational analyses of registers have been based on qualitative descriptions of categorical situational characteristics. We address this inconsistency by describing the variation among texts and registers in a continuous (quantitative) situational space. We describe ‘registers’ as categorical constructs – culturally-recognized categories of texts – but propose that they should be described in continuous terms. Such descriptions allow quantitative comparisons of registers, as well as analysis of the extent to which a register is well-delimited in terms of its situational characteristics. These ideas were first introduced in Biber & Egbert (2018). In Biber, Egbert & Keller (2020), we describe how the situational characteristics of texts and registers can be analyzed in a continuous multi-dimensional space. Finally, we propose analysis of situational text types – categories that are statistically well-defined in their situational characteristics – as an approach to describing all texts, including texts that do not belong to a culturally-recognized register category. We have now turned our attention to exploring the quantitative relationships between continuous linguistic variables and continuous situational variables using correlations, multiple regression, and canonical correlation analysis.

Biber, D. & Egbert, J. (2018). Register Variation Online. Cambridge: Cambridge University Press.


Biber, D., Egbert, J., & Keller, D. (2020). Reconceptualizing register in a continuous situational space. Corpus Linguistics and Linguistic Theory, 16(3), 581-616.