9:00-11:30am: Pre-conference Workshop Session 1 (Straub 245)
1:00-4:00pm: Pre-conference Workshop Session 2 (Straub 245)
7:45am: Conference Registration Opens
8:30am: Opening Remarks (Straub 156)
9:00am: Presentation Sessions 1-2
10:00am: Coffee Break 1
10:30am: Plenary Session 1 (Tove Larsson)
12:00pm: LUNCH
1:30pm: Presentation Sessions 3-5
3:00pm: Coffee Break/Poster Session
4:00pm: Presentation Sessions 6-8
5:30pm: Dinner/Pub Crawl
8:00am: Presentation Sessions 9-12
10:00am: Coffee Break
10:30am: Plenary Session 2 (Scott Crossley)
12:00pm: LUNCH
1:30pm: Presentation Sessions 13-15
3:00pm: Coffee Break/Poster Session
4:00pm: Presentation Sessions 16-18
5:30pm: Plenary 3 (Jesse Egbert)
6:45pm: Catered Conference Reception
8:30am: Presentation Sessions 19-21
10:00am: Coffee Break
10:30am: Presentation Sessions 22-23
11:30am: Closing Remarks
11:45am: Optional Coast Trip to Florence, Oregon
Given all the data that we have access to as corpus linguists, it is perhaps not surprising that exploring uncharted territory is such a predominant goal in our field. In fact, the stated motivation for our studies very often rests almost solely on arguments of novelty, of how “no previous study has ever looked at [this topic] before”. However, as will be argued in this talk, we have a great deal to gain as a field if we instead were to start building more systematically on previous research. Doing so will enable us to formulate and test increasingly specific hypotheses, thus continually furthering and refining our cumulative knowledge within a given domain.
Specific hypotheses, while central to the scientific method, are not formulated or tested very frequently in corpus linguistics. One possible reason for this pertains to commonly used statistical methods, such as two-tailed t-tests, regression, and ANOVA. Somewhat simplified, the traditional use of these techniques limits us to asking “Is there a difference/relation between these groups/variables?” in an agnostic manner, over and over. That is, we tend not to formally incorporate information about previously observed differences/relations into our analyses (Larsson, Biber, & Hancock, forthcoming).
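As a rough illustration of this contrast (a minimal sketch, not drawn from the talk itself; the data and the previously observed difference are invented for the example), the following Python snippet compares an agnostic two-tailed t-test with a directional test that builds in a hypothesized direction from prior findings:

import numpy as np
from scipy import stats

# Hypothetical rates of some linguistic feature (per 1,000 words) in two registers
rng = np.random.default_rng(42)
register_a = rng.normal(loc=12.0, scale=3.0, size=50)
register_b = rng.normal(loc=10.5, scale=3.0, size=50)

# Agnostic question: "Is there any difference between these groups?"
t_two, p_two = stats.ttest_ind(register_a, register_b)

# Specific hypothesis: prior studies suggest register A uses the feature more,
# so the test is directional (one-sided) rather than agnostic.
t_one, p_one = stats.ttest_ind(register_a, register_b, alternative='greater')

print(f"Two-sided p = {p_two:.3f}; one-sided (informed) p = {p_one:.3f}")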
In this talk, I outline what a cumulative approach to knowledge building might look like and illustrate the role of specific hypotheses in this endeavor. I also use case studies to give a non-technical introduction to how specific hypotheses informed by previous findings can be tested through minimally sufficient statistical techniques. The goal of the talk is to show how cumulative knowledge building can help move the field and its state-of-the-art forward in a way that keeps our focus firmly on the language data of interest.
There has been recent interest among philanthropic foundations in funding high-impact, open-science educational research that has low start-up costs and high return. Much of this work has involved student writing and broadly fits under the umbrella of corpus linguistics, although it is generally described as dataset or benchmark development. The basic principles of the research follow the Common Task Method (CTM) developed by the Defense Advanced Research Projects Agency (DARPA) in the 1970s (Liberman & Wayne, 2020). The CTM is a cyclical approach to data analysis that includes shared data, objectives, and competitive evaluations. Philanthropic foundations, seeing the success of CTM-inspired datasets like the ImageNet project (Deng et al., 2009), envisioned educational datasets that focused on addressing large-scale concerns within American schools, including poor performance on literacy and math assessments, educational inequality, and curriculum and instructional approaches that are unresponsive to technological enhancements.
This talk introduces the educational problem spaces that recent dataset competitions have addressed, along with the important role that corpus linguistics played in benchmark development and competitive evaluation. Four recently developed corpora that focus on literacy interventions are presented: the CommonLit Ease of Readability (CLEAR) corpus (Crossley et al., 2023), the English Language Learner Insight, Proficiency, and Skills Evaluation (ELLIPSE) corpus (Crossley et al., 2023), the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus (Crossley et al., 2022), and the Automated Student Assessment Prize (ASAP) 2 corpus. The talk includes guidelines on how to structure datasets, annotate them, and manage data science competitions. The steps presented include selecting/building robust, novel, and diverse datasets to enhance outcomes, designing competitions and evaluation metrics, and creating collaborative environments for dataset modeling and sharing. Results and models from recent Kaggle competitions are presented and discussed, along with next steps.
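As a rough illustration of the evaluation-metric step (a minimal sketch; quadratic weighted kappa is a metric commonly used in essay-scoring competitions such as ASAP, and the scores below are invented for the example), the following Python snippet scores hypothetical model predictions against human ratings:

from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 3, 4, 1, 2]   # hypothetical human rater scores
model_scores = [3, 4, 3, 5, 2, 4, 1, 2]   # hypothetical model predictions

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")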
Texts are the fundamental unit of discourse, and their importance in naturally occurring language cannot be overstated (Egbert & Schnur, 2018; Anthony, 2022). To date, most linguistic research on texts has been carried out within the register studies tradition, or the “Text-Linguistic Approach to Register Variation” (TxtLx RV; Biber, 2019). The primary goal of the TxtLx RV is to describe—and compare—the situational and linguistic characteristics of registers. In these studies, texts typically serve as the sampling unit for corpus compilation, and often serve as the unit of observation for quantitative analyses. TxtLx RV studies use texts as a means to the end of describing and comparing culturally-recognized register categories. This has produced an abundance of cumulative evidence about general patterns of register variation, as well as the characteristics of specific registers.
Recently, there has been a shift to a different research goal: accounting for the variation among texts (as opposed to accounting for the variation among registers). Researchers who have adopted this goal have made important—and unanticipated—discoveries about the nature of texts and variation among them. For example, we now know that (1) texts within a register can vary from one another situationally and linguistically, and (2) some texts do not belong to any register at all (Biber & Egbert, 2023). Curiosity about these findings has led to a series of related studies aimed at better understanding text-to-text variation within online registers (Biber & Egbert, 2018; Biber et al., 2020), conversation (Egbert et al., 2021), legal statutes (Wood, 2024), student essays (Goulart et al., 2024), fiction, political memoirs, and presidential speeches (Egbert & Gracheva, 2022). These studies have served to reinforce the reality of functional correspondence between situational context and language use, while at the same time raising questions about the role of register categories (Egbert et al., 2024).
In this talk, I describe the evolution of these research findings and propose a new theoretical and methodological approach to studying texts: the Text-Linguistic Approach to Functional Correspondence (TxtLx FC). The aim of TxtLx FC is to account for variation among texts (instead of registers), in terms of situational variables (including, but not limited to, register), linguistic features, and the functional correspondence between them. I summarize the implications of the text-linguistics (r)evolution and present an agenda for future TxtLx FC research.