Three pre-conference workshops will be held on Thursday, September 8, 2022, and are open to registered conference attendees. Seats are limited.
Digital tools for building and analyzing spoken corpora: Moving beyond lexicogrammar to pronunciation and fluency
Amanda Huensch, Idée Edalatishams, Romy Ghanem, Karin Puga, Shelley Staples, Mariana Centanin Bertho and Kevin Hirschi
Corpus researchers now have access to a wide variety of digital tools for easier and more accurate description of spoken data and their features. However, many spoken corpora allow researchers to focus only on lexicogrammar and minimal fluency features (e.g., hesitation markers, if transcribed); they typically do not allow researchers to examine pronunciation or fluency features. One factor is that many corpora do not provide sound files. Even when sound files are provided, researchers may still be impeded by the time needed to annotate corpora for these features and by a lack of training in how to do so. Some digital corpus tools afford partial automation of these processes and can make annotation considerably more efficient. Tools such as CLAN, ELAN, the Montreal Forced Aligner, and Praat have improved annotation and analysis processes for spoken data, with some being more suitable for certain aspects of oral language analysis than others. Our aim is to offer a hands-on, practice-based workshop that demonstrates the particular strengths of these four digital tools and how they can be most efficiently combined in the steps of data preparation (including transcription and segmentation), coding (including prosodic and fluency features), and analysis (including extraction of suprasegmental feature values). Using a 30-second dialogic L2 English speech file from the Corpus of Collaborative Oral Tasks (Crawford & McDonough, 2014), workshop attendees will be guided step by step through these processes. In using multiple digital tools, we take advantage of the best features of each and highlight the transfer of data from one program to another. Workshop attendees will receive a detailed protocol handout for each program. Issues related to intercoder reliability will also be discussed. The workshop is designed both for researchers who are new to using these digital tools with spoken corpora and for those with previous experience.
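To give a flavor of the extraction step described above, the sketch below uses the parselmouth Python interface to Praat to pull a mean F0 value (a suprasegmental feature) and a silent-pause count (a simple fluency feature) from a sound file. This is our own illustration, not part of the workshop protocol; the filename and the silence-detection thresholds are placeholders.

```python
# A minimal sketch, assuming the parselmouth Python interface to Praat
# (pip install praat-parselmouth). Filename and silence-detection
# parameters below are illustrative placeholders.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("dialogue_30s.wav")  # hypothetical sound file

# Suprasegmental feature: mean fundamental frequency (F0) in Hz.
pitch = snd.to_pitch()
mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")

# Simple fluency feature: number of silent pauses, via Praat's
# built-in silence detection.
textgrid = call(snd, "To TextGrid (silences)",
                100,    # minimum pitch (Hz)
                0.0,    # time step (s); 0 = automatic
                -25.0,  # silence threshold (dB)
                0.25,   # minimum silent interval duration (s)
                0.1,    # minimum sounding interval duration (s)
                "silent", "sounding")
n_intervals = call(textgrid, "Get number of intervals", 1)
n_pauses = sum(
    1 for i in range(1, n_intervals + 1)
    if call(textgrid, "Get label of interval", 1, i) == "silent"
)

print(f"Mean F0: {mean_f0:.1f} Hz; silent pauses: {n_pauses}")
```

In practice, values like these would be extracted per speaker or per annotated interval rather than over the whole file, which is where tiered annotation in ELAN or CLAN comes in.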
Corpus In A Box: Automated Tools, Tutorials, & Advising (CIABATTA) tool
Adriana Picoral, Larissa Goulart, Aleks Novikov and Shelley Staples
The purpose of this workshop is to familiarize participants with the Corpus In A Box: Automated Tools, Tutorials, & Advising (CIABATTA) tool. CIABATTA is a compilation of guides and templates for corpus building developed by the Corpus & Repository of Writing (Crow) team; it provides a starting point for researchers developing new corpora.
This workshop will be divided into two parts. No previous coding experience is required for Part 1, which includes an overview of CIABATTA and a discussion of best practices and ethical issues in corpus building. We will then demonstrate how to organize, convert, encode, standardize, and de-identify data using two automated tools, primarily the Corpus Text Processor. The corpus used in this part of the workshop will be compiled by the participants themselves, who will submit conference abstracts through a Google Form. This will allow participants to deal with the “messy” parts of corpus building, such as organizing corpus files. The first part of the workshop will end with a demonstration of how to use our interactive interface for de-identifying the corpus texts.
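The Corpus Text Processor handles these steps through a point-and-click interface. For readers curious what a conversion step of this kind does under the hood, here is a minimal, hypothetical sketch (not the Crow team's actual code) that converts a folder of .docx submissions to UTF-8 plain text, assuming the python-docx package; folder names are placeholders.

```python
# A minimal, hypothetical sketch of the kind of conversion step the
# Corpus Text Processor automates; this is NOT the Crow team's code.
# Assumes the python-docx package (pip install python-docx).
from pathlib import Path
from docx import Document

src = Path("submitted_abstracts")   # placeholder folder of .docx files
dst = Path("corpus_txt")
dst.mkdir(exist_ok=True)

for docx_path in src.glob("*.docx"):
    doc = Document(str(docx_path))
    text = "\n".join(p.text for p in doc.paragraphs)
    out = dst / (docx_path.stem + ".txt")
    out.write_text(text, encoding="utf-8")  # standardize to UTF-8
    print(f"Converted {docx_path.name} -> {out.name}")
```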
Part 2 requires some familiarity with coding in Python to follow along, but those with no experience are welcome to stay and gain more exposure to this programming language. We will be using Jupyter notebooks in Google Colab, which requires no local installation of Python. In this part of the workshop, we will walk participants through the processes of organizing the metadata in the CIABATTA demonstration corpus, adding headers to the individual text files, and standardizing filenames. We will use this pre-designed demonstration corpus because its metadata has already been prepared, which allows us to focus on illustrating the logic of the scripts. The workshop will conclude with a discussion of next steps in corpus building and of which tools to use for corpus analysis once your offline corpus is built.
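The sketch below illustrates the kind of script Part 2 walks through: reading a metadata table, prepending a header block to each text file, and renaming files to a consistent scheme. The CSV columns, header format, and naming scheme are invented for illustration and are not CIABATTA's actual conventions.

```python
# A minimal sketch of the steps covered in Part 2: metadata headers and
# filename standardization. Column names and header format are invented
# for illustration; they are not CIABATTA's actual conventions.
import csv
from pathlib import Path

corpus = Path("demo_corpus")  # placeholder corpus folder

# Hypothetical metadata file: one row per text, keyed by filename.
with open("metadata.csv", newline="", encoding="utf-8") as f:
    metadata = {row["filename"]: row for row in csv.DictReader(f)}

for old_path in corpus.glob("*.txt"):
    meta = metadata.get(old_path.name)
    if meta is None:
        continue  # no metadata row for this file; skip it

    # Prepend a simple key/value header block to the text.
    header = "\n".join(f"<{k}: {v}>" for k, v in meta.items())
    body = old_path.read_text(encoding="utf-8")
    old_path.write_text(header + "\n\n" + body, encoding="utf-8")

    # Standardize the filename, e.g. course_assignment_id.txt.
    new_name = f"{meta['course']}_{meta['assignment']}_{meta['id']}.txt"
    old_path.rename(corpus / new_name)
```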
Using structural equation modeling to answer previously unanswerable research questions
Tove Larsson and Gregory R. Hancock
This workshop offers an accessible, non-technical introduction to structural equation modeling (SEM), specifically measured-variable path analysis, for researchers in corpus linguistics. SEM is a powerful family of statistical techniques covering, for example, confirmatory factor analysis, latent variable path analysis, and mixture models (see, e.g., Hancock & Schoonen, 2015). As outlined in Larsson, Plonsky, and Hancock (2021), techniques from the SEM family have great potential for corpus linguistics in that they allow us to answer research questions that more commonly applied techniques (e.g., multiple regression) cannot. For example, path models enable us to test hypothesized relations among variables in models with multiple dependent variables, and thus to answer questions such as “What is the relative importance of register and discipline on phrasal complexity, as measured through adjectival, nominal, and prepositional modification in a noun phrase?” (see Larsson et al., 2021).
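To make the quoted question concrete, the path model it implies can be written as a set of simultaneous regression equations with three dependent variables and residuals that are allowed to covary across equations. The notation below is ours, a sketch of the model structure rather than Larsson et al.'s exact specification:

```latex
% Measured-variable path model implied by the example question:
% three complexity outcomes regressed on register and discipline.
\begin{aligned}
\text{adjectival}_i    &= \beta_{11}\,\text{register}_i + \beta_{12}\,\text{discipline}_i + \varepsilon_{1i}\\
\text{nominal}_i       &= \beta_{21}\,\text{register}_i + \beta_{22}\,\text{discipline}_i + \varepsilon_{2i}\\
\text{prepositional}_i &= \beta_{31}\,\text{register}_i + \beta_{32}\,\text{discipline}_i + \varepsilon_{3i}
\end{aligned}
```

Fitting all three equations jointly, rather than as separate regressions, is what lets the model compare the relative contributions of register and discipline across the three outcomes.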
During the workshop, we will introduce the main components of SEM (focusing on measured-variable path analysis) and walk participants through the steps of this kind of analysis. By the end of the workshop, we hope that participants will (i) have the tools to carry out simple analyses on their own data and (ii) be aware of resources for continuing their training afterwards. Our intent is not to introduce techniques that add unnecessary complexity to already sophisticated analyses (see the discussion of minimally sufficient statistical methods in Egbert, Larsson, & Biber, 2020), but rather to introduce tools that allow us to answer research questions that are beyond the reach of more commonly applied statistical methods.
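The workshop description does not name software. As one illustration only, a path model like the one sketched above can be fit in Python with the semopy package (our choice for this example; lavaan in R and Mplus are common alternatives). Variable names and the input file are invented.

```python
# A minimal sketch using the semopy package (pip install semopy); the
# workshop does not specify software, and all names here are invented.
import pandas as pd
import semopy

# Hypothetical per-text data: three complexity measures plus
# dummy-coded register and discipline predictors.
data = pd.read_csv("complexity_by_text.csv")  # placeholder file

# lavaan-style model description: three outcomes, two predictors,
# residual covariances among the outcomes.
desc = """
adjectival ~ register + discipline
nominal ~ register + discipline
prepositional ~ register + discipline
adjectival ~~ nominal
adjectival ~~ prepositional
nominal ~~ prepositional
"""

model = semopy.Model(desc)
model.fit(data)
print(model.inspect())            # path coefficients and covariances
print(semopy.calc_stats(model))   # fit indices (e.g., CFI, RMSEA)
```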