COLING 2025 On-Site Tutorial
Usage-based theories of human language, such as Construction Grammar, have been compelling theoretical lenses through which to view and evaluate what LLMs know and understand of language because of the parallels between usage-based learning and the data-driven "learning" of pre-trained models. However, a key difference between a usage-based learning account for humans and that of LLMs is in embodiment and multimodality—for the most part, LLMs use text alone, whereas usage-based theories posit that each token of linguistic experience is stored with a wealth of experiential information that enriches the symbol through cross-modal association. Therefore, the first goal of this tutorial is to provide a summary of language acquisition and second language learning from a usage-based theoretical linguistic perspective. With this understanding of human usage-based learning, we will turn to evidence demonstrating the ways in which machine learning, primarily via large, pre-trained vision and language models, does and does not parallel human learning. The overarching goal of this is not to say that the two processes are similar or dissimilar in order to conclude that dissimilarity denotes inferiority (if the knowledge arrived at is the same, then it may not matter how it was learned). Rather, we explore the resulting differences in what is known and understood about the world, and take this as a starting point for considering how to supplement and improve natural language understanding (NLU), particularly for physically situated applications. Our target audience is those interested in the intersection of linguistic theory and NLU implementations, such as human-robot interaction.
Unlike our past NLP tools, such as syntactic parsers and automatic semantic role labeling, LLMs lack grounding in linguistic theory. Instead, their development is based on the encoder-decoder architecture, which was originally designed for sequence-to-sequence tasks, specifically translation (Bahdanau et al., 2016). This dichotomy impedes methods for evaluating LLMs, as their performance on meta-linguistic tasks such as semantic role labeling, which previously served as benchmarks for the individual components of an NLP pipeline, is a poor predictor of their fluency on downstream applications. However, the fact that LLMs, designed primarily to meet information-theoretic needs, can capture any linguistic information at all is fascinating (Rogers et al., 2020). Additionally, it offers a novel foundation for exploring what can be achieved through exposure to information alone.
Therefore, it has been compelling to turn to usage-based theories of language, such as Construction Grammar, to establish experimentally validated structures of language that speakers of a given language consistently recognize and are able to generalize over. We can then compare such structures to the linguistic structure that we can probe for within LLMs.
The takeaways of this tutorial, which we intend to hold in-person, will be an overview of the shared and divergent aspects of human usage-based learning and machine data-driven learning, outlined from the theoretical perspective of usage-based psycholinguistic theory, with an emphasis on how this can shed light on the capabilities and limitations of LLMs, including multimodal models. This will serve as the bedrock for guiding participants and the NLP community towards more informed evaluation of large, pre-trained models, as well as energizing solutions drawing upon the multi-modal information and linguistic theory that enrich language and many dimensions of interaction.
This tutorial is fundamentally different from previous ones on grounding (Kordjamshidi et al., 2024; Fei et al., 2024; Alikhani and Stone, 2020; Krishnaswamy and Pustejovsky, 2022), as we do not focus solely on how grounding language models might be achieved. This tutorial is also distinct from previous ones that focus on the shortcomings of language models (Rawte et al., 2024) or on comparing cognitive development in humans and AI models (Tayyar Madabushi et al., 2022). Instead, we discuss how combining the elements of grounding and linguistic theory can lead to better evaluation and, more importantly, to the development of fundamentally different and improved models. This novel perspective, combining theoretical insights from linguistics, concepts of grounding, and the capabilities and limitations of multimodal LLMs, makes this tutorial timely and highly relevant.
This cross-disciplinary tutorial will be of interest to researchers working on (or interested in) language models, linguistics, and grounding. The tutorial will contrast the state of the art in language models on the one hand with traditional NLP as well as usage-based linguistic theories and grounding on the other. This will enable researchers in all of these fields to become aware of the achievements, limitations, and opportunities available in the others so as to enable collaboration. Specifically, this tutorial will be of value to students and NLP practitioners interested in:
Using and evaluating large, pre-trained models in applications that rely upon nuanced natural language understanding, and in particular, natural language understanding within complex, physically situated tasks
Building theoretically-motivated NLP resources, such as Frame Semantic resources and meaning representations
Multi-modality, multi-modal embedding spaces, and vision-and-language models
This half-day, in-person tutorial will include four 45-minute segments. During each of these segments, we will introduce participants to both foundational and cutting-edge research associated with four distinct themes:
The theoretical aspects of usage-based linguistics, exploring how language is learned through increased exposure.
The corresponding learning in LLMs, with an emphasis on their abilities and shortcomings.
Experiments in grounding LLMs to bridge the gap in input modalities.
The latest methods of integrating these three elements for improved LLM evaluation and development.
In the first segment, we will present an overview of usage-based approaches to language learning and human grammar. Usage-based approaches to grammar emphasize acquisition of language through its use in everyday, physically situated dialogue (Tomasello, 2009). We will detail aspects of language acquisition from the perspective of a usage-based approach, specifically Construction Grammar (Goldberg, 2003; Hoffmann and Trousdale, 2013). Construction Grammar is a usage-based approach positing that our grammar is built up through usage and usage alone—there are no grammatical rules that are memorized and no separate "syntax" and "lexicon" modules of our knowledge of language. Instead, there is only the "constructicon," a taxonomically related set of the constructions to which a speaker has been exposed during language acquisition and over which they generalize to extend to novel usages (Bybee, 2006). Usage-based theories like this have been an interesting theoretical lens through which to view LLMs because LLMs are also built up through exposure to text and are very sensitive to frequency in building up a grammar. Thus, there has been motivated interest in evaluating what LLMs seem to "understand" as far as constructions are concerned (Tayyar Madabushi et al., 2020; Weissweiler et al., 2022; Bonial and Tayyar Madabushi, 2024).
In the second segment, we will present research on the linguistic structures and reasoning capabilities present and absent in LLMs. We will begin with a brief overview of LLMs, including a discussion of the linguistic information these models capture, as demonstrated by methods now broadly known as BERTology (Rogers et al., 2020). We will then shift to the functional aspects of language use, such as "reasoning", and how well LLMs capture such capabilities. We will start with fundamental questions, such as why we would even expect models trained on language modeling tasks, akin to the cloze task, to perform reasoning at all (Brown et al., 2020). We will provide evidence for (Wei et al., 2022b; Srivastava et al., 2023) and against (Schaeffer et al., 2023; Lu et al., 2024; Huang et al., 2024) advanced reasoning in LLMs, including a quick overview of other relevant aspects, such as instruction tuning (Wei et al., 2022a) and scale (Kaplan et al., 2020). Importantly, we will discuss how usage-based linguistic approaches can help explain the seemingly idiosyncratic performance gains and failures of LLMs. We will wrap up this section by exploring promising paths for better harnessing the potential of LLMs while mitigating their weaknesses.
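To make the notion of cloze-style probing concrete, the following is a minimal sketch of how one might query a masked language model for its preferred fillers in a constructional slot. It assumes the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint; the probe sentence, which targets a light verb construction, is an illustrative choice rather than actual tutorial material.

```python
# Minimal sketch of a cloze-style probe of a masked language model.
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-uncased` checkpoint; the probe sentence is illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# A cloze item targeting a light verb construction ("take a walk"):
# we ask the model which verbs it prefers in the masked slot.
probe = "After dinner, she decided to [MASK] a walk around the park."

for prediction in fill_mask(probe, top_k=5):
    # Each prediction carries the filled token and the model's score.
    print(f"{prediction['token_str']:>10s}  {prediction['score']:.3f}")
```

Ranking the fillers the model assigns to such slots, across many constructions, is one simple way to inspect what frequency-driven exposure has and has not taught it.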
In the third segment, we will present an overview of research on the promise and pitfalls of leveraging LLMs in complex, physically situated tasks. A key difference that we see between a usage-based account of how humans acquire a language and how an LLM does so is in embodiment and multimodality—LLMs use text alone, whereas Construction Grammar posits that each token of linguistic experience is stored with a wealth of experiential information and cross-modal associations. Additionally, LLMs' inability to interact with the world makes their learning unidirectional. The lack of multimodality and interaction thus significantly differentiates LLM learning from that of children (Bender and Koller, 2020). Consequently, it is unsurprising that current models struggle with fundamental limitations, such as hallucinations, despite multimodal representations (Bai et al., 2024). We will show how creating agent experiences, e.g., through multimodal simulations and situated learning, allows agents to learn the necessary features of objects through interaction and use, and how knowledge acquired in this fashion enables faster acquisition of language and linguistic knowledge, through processes like analogical reasoning, than traditional language model learning through passive exposure to data (Ghaffari and Krishnaswamy, 2023, 2024). We will also show where such knowledge happens to already be present in language models and how it may be exploited.
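As an illustration of the kind of cross-modal association a vision-and-language model can be probed for, the sketch below scores an image of an object against candidate affordance descriptions. It assumes the Hugging Face transformers library, the public openai/clip-vit-base-patch32 checkpoint, and a hypothetical image file cup.jpg; it is not drawn from the experiments cited above.

```python
# Minimal sketch of probing cross-modal association with a vision-and-language
# model (CLIP). Assumes the Hugging Face `transformers` library and the public
# `openai/clip-vit-base-patch32` checkpoint; `cup.jpg` is a hypothetical image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cup.jpg")
# Candidate affordance descriptions to associate with the depicted object.
texts = [
    "an object you can drink from",
    "an object you can sit on",
    "an object you can write with",
]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, softmaxed into a distribution over captions.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for text, p in zip(texts, probs.tolist()):
    print(f"{p:.3f}  {text}")
```

Such similarity scores give a static, exposure-based notion of object knowledge, which can then be contrasted with what an agent learns about the same objects through interaction and use in simulation.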
In the fourth segment, we turn to lessons learned from our own research in situated, human-agent dialogue. Multimodal dialogue involving multiple participants presents complex computational challenges, primarily due to the rich interplay of diverse communicative modalities, including speech, gestures, actions, and gaze. These modalities interact in nuanced ways that traditional dialogue systems often struggle to accurately track and interpret. To address these challenges, we extend the textual enrichment strategy of Dense Paraphrasing (Tu et al., 2023; Rim et al., 2023), translating each nonverbal modality into linguistic expressions. By normalizing multimodal information into a language-based form, we can create cross-modal coreference links and bind these references with action or gesture representations (annotations) derived from computer vision recognition algorithms.
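The sketch below is a hypothetical, simplified illustration of the underlying idea of rendering a nonverbal signal as text so that it can participate in coreference alongside the spoken utterance. It is not the actual Dense Paraphrasing implementation, and all names in it are invented for illustration.

```python
# Hypothetical sketch (not the actual Dense Paraphrasing implementation) of
# rendering a nonverbal signal as text so it can enter ordinary coreference
# resolution alongside the spoken utterance. All names are illustrative.
from dataclasses import dataclass

@dataclass
class GestureAnnotation:
    """A recognized gesture event, e.g. output by a CV recognition pipeline."""
    kind: str           # e.g. "deictic_point"
    target_object: str  # object identifier resolved from the scene, e.g. "block_3"
    timestamp: float    # seconds into the interaction

def paraphrase_gesture(gesture: GestureAnnotation) -> str:
    """Translate a gesture annotation into a linguistic expression."""
    if gesture.kind == "deictic_point":
        return f"[the speaker points at {gesture.target_object}]"
    return f"[the speaker performs a {gesture.kind} gesture]"

# The spoken utterance plus the paraphrased gesture form a single text stream
# in which "that one" and the pointed-at object can be linked by a standard
# coreference resolver.
utterance = "put that one over there"
gesture = GestureAnnotation("deictic_point", "block_3", 12.4)
print(f"{utterance} {paraphrase_gesture(gesture)}")
```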
Claire Bonial is a computational linguist specializing in the murky world of event semantics. In her efforts to make this world computationally tractable, she has collaborated on and been a foundational part of several important NLP lexical resources, including PropBank, VerbNet, and Abstract Meaning Representation. A focused contribution to these projects has been her theoretical and psycholinguistic research on both the syntax and semantics of English light verb constructions (e.g., “take a walk”, “make a mistake”). Bonial received her Ph.D. in Linguistics and Cognitive Science in 2014 from the University of Colorado Boulder and began her current position in the Content Understanding Branch at the Army Research Laboratory (ARL) in 2015. Since joining ARL, she has expanded her research portfolio to include multi-modal representations of events (text and imagery/video), human-robot dialogue, and misinformation detection.
Dr. Tayyar Madabushi's research focuses on understanding the fundamental mechanisms that underpin the performance and functioning of Large Language Models such as ChatGPT. His work was included in the discussion paper on the Capabilities and Risks of Frontier AI, which was used as one of the foundational research works for discussions at the UK AI Safety Summit held at Bletchley Park. He has worked to bridge the fields of construction grammar and pre-trained language models through the exploration of constructional information encoded in language models. His work on language models also includes collaborative industrial research aimed at rectifying biases in speech-to-text systems widely utilised across the UK. Before starting his PhD in automated question answering at the University of Birmingham, Dr. Tayyar Madabushi founded and headed a social media data analytics company based in Singapore.
Nikhil Krishnaswamy is Assistant Professor of Computer Science at Colorado State University and director of the Situated Grounding and Natural Language Lab (www.signallab.ai). He received his Ph.D. from Brandeis University in 2017. His primary research is in situated grounding and natural language semantics, using computational, formal, and simulation methods to study how language works and how humans use it. He is the co-creator of VoxML. He has taught courses on machine learning and NLP, and previously taught tutorials at EACL 2017, ESSLLI 2022, and AACL 2022 on the multimodal semantics of affordances and actions. He has routinely received positive feedback as an instructor, including "always willing to engage in in-depth discussions regarding class material." He has served as senior area chair for LREC-COLING, area chair for COLING and EMNLP, and as a PC member for ACL, EACL, NAACL, EMNLP, AAAI, AACL, etc.
James Pustejovsky is the TJX Feldberg Chair in Computer Science at Brandeis University, where he is also Chair of the Linguistics Program, Chair of the Computational Linguistics M.S. Program, and Director of the Lab for Linguistics and Computation. He received his B.S. from MIT and his Ph.D. from UMass Amherst. He has worked on computational and lexical semantics for 25 years and is chief developer of Generative Lexicon Theory; the TARSQI platform for temporal reasoning in language; TimeML and ISO-TimeML, a recently adopted ISO standard for temporal information in language; the recently adopted standard ISO-Space, a specification for spatial information in language; and VoxML (co-created with N. Krishnaswamy), a modeling framework for representing linguistic expressions and interactions as multimodal simulations. VoxML enables real-time communication between humans and computers or robots for joint tasks, utilizing speech, gesture, gaze, and action. He is currently working with robotics researchers in HRI to allow the VoxML platform to act as both a dialogue management system and a simulation environment that reveals real-time epistemic state and perceptual input to a computational agent.
For more information on the tutorial, contact:
htm43@bath.ac.uk