Psychological, Cognitive and Linguistic BERTology: An Idiomatic Multiword Expression Perspective


COLING 2022 Tutorial

Contents

Introduction and Motivation

The success of BERT and similar pre-trained language models (PLMs) has led to what might be described as an existential crisis for certain aspects of Natural Language Processing: PLMs can now do better than other models on numerous tasks in multiple evaluation scenarios and are argued to outperform human performance on some benchmarks (Wang et al., 2018; Sun et al., 2020; Hassan et al., 2018). In addition, PLMs also seem to have access to a variety of linguistic information as diverse as parse trees (Hewitt and Manning, 2019), entity types, relations, semantic roles (Tenney et al., 2019a), and constructional information (Tayyar Madabushi et al., 2020).

Does this mean that there is no longer a need to tap into the decades of progress that was made in traditional NLP and related fields including corpus and cognitive linguistics? In short, can deep(er) models replace linguistically motivated (layered) models and systematic engineering as we work towards high-level symbolic artificial intelligence systems?

This tutorial will explore these questions through the lens of a linguistically and cognitively important phenomenon that PLMs do not (yet) handle very well: Idiomatic Multiword Expressions (MWEs) (Yu and Ettinger, 2020; Garcia et al., 2021; Tayyar Madabushi et al., 2021).

Tutorial Themes

In this tutorial we will introduce participants to exciting research associated with four themes: a) we will provide an overview of BERTology (Rogers et al., 2020) from a linguistic and cognitive standpoint, b) introduce participants to highlights from research on idiomaticity in cognitive and corpus linguistics, along with studies that show the need for cultural, world and common sense knowledge in handling idiomaticity and related problems, c) present traditional methods of handling idiomaticity, which provide explicit information regarding idiomatic phrases to models, and d) survey the state of the art in idiomaticity detection and representation.

The first of these four themes – the capabilities of PLMs – will include an overview of BERTology and what PLMs' capabilities (or the lack thereof) mean from a linguistic and cognitive standpoint, with an emphasis on phenomena such as understanding idiomatic expressions. For example, it has been shown that PLMs are good at syntax (Hewitt and Manning, 2019) and semantic roles (Tenney et al., 2019b), while being less effective at pragmatic inference, role-based event knowledge (Ettinger, 2020) and abstract attributes of objects (Da and Kasai, 2019). Importantly, PLMs are particularly bad at representing numbers (Wallace et al., 2019) and at reasoning based on the world knowledge they have access to (Forbes et al., 2019). We argue that this shows that PLMs are good at “low-level” linguistic tasks but struggle with “high-level” tasks associated with reasoning and understanding. These high-level cognitive tasks, such as the ability to make use of world and common sense knowledge, are of particular relevance to idiomaticity. Consider the sentences “. . . cultivated land in this study accounts areas used for paddy fields and dry land” and “. . . It’s a great feeling to be back on dry land”. ‘Dry land’ literally refers to ‘dry ground’ in the first, but refers to the more abstract ‘terra firma’ in the second.

In addressing the second theme, the tutorial will cover elements of MWE research that originate in linguistics. For example, the highly influential work by Nunberg et al. (1994) discusses how, in some cases, parts of an idiom might be modified (e.g. “Your remark touched a nerve that I didn’t even know existed”), quantified (e.g. “touch a couple of nerves”), or emphasized (e.g. “Those strings, he wouldn’t pull for you”). Such an exploration would serve to highlight the kind of nuanced understanding of language and the world that is required to completely understand these utterances. We will also explore studies on how humans process idiomaticity (Geeraert et al., 2020; Chanturia et al., 2011).

The third theme will deal with methods of identifying and representing idiomaticity using the traditional approach in NLP wherein a phenomenon is explicitly modeled more or less independently of other levels of analysis. Such work, as is the case with much of MWE research, hypothesizes that such explicit information will be useful to models on downstream tasks.

Finally, the tutorial will address the state of the art in identifying and representing idiomaticity. This is made possible by the fact that some of the proposed presenters of this tutorial are also involved in the organisation of the related task “Multilingual Idiomaticity Detection and Sentence Embedding” at SemEval 2022.

Tutorial Overview

Part 0 – Introduction

An introduction to what can be expected in the tutorial including the motivation and an overview of the tutorial structure and schedule.

Part 1 -- Psychological, Cognitive and Linguistic BERTology

A description of what language models can do well and what they fail at, with an emphasis on the cognitive and linguistic complexity of these tasks and what that might mean for language understanding from the perspectives of corpus, cognitive and computational linguistics. In particular, we aim to highlight the fact that PLMs are good at “low-level” linguistic tasks but struggle with “high-level” tasks associated with reasoning and understanding. The section will include methods of understanding what language models “know” through probing. We will highlight the shortcomings of language models in dealing with problems that require the understanding of nuanced language and the use of cultural, world and common sense knowledge. We will also discuss how reliable probes themselves are, and the fact that having access to certain information is not the same as being able to use it on downstream tasks.
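To make the idea of probing concrete, the sketch below (not drawn from the tutorial materials) shows the basic recipe of a diagnostic probe: a simple linear classifier is trained on frozen representations to test whether a property of interest — here, a toy literal-vs-idiomatic label — is linearly decodable from them. The “embeddings” are synthetic stand-ins for PLM hidden states; real probing studies would extract them from a model such as BERT.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 400, 32

# Synthetic "contextual embeddings": the (toy) label is noisily encoded
# along one direction in the representation space.
labels = rng.integers(0, 2, size=n)               # 0 = literal, 1 = idiomatic
signal = rng.normal(0, 1, size=dim)
X = rng.normal(0, 1, size=(n, dim)) + np.outer(labels * 2 - 1, signal)

def train_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained with plain gradient descent.
    The representations X stay frozen; only w and b are learned."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))        # sigmoid
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_probe(X, labels)
preds = (X @ w + b > 0).astype(int)
accuracy = float(np.mean(preds == labels))
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy is usually read as evidence that the property is encoded in the representations — though, as noted above, this does not guarantee the model can *use* that information on downstream tasks.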

Part 2 -- Linguistic and Cognitive aspects of MWEs

MWEs have been studied extensively in both linguistics and cognitive linguistics for several decades. This section will give an overview of some of the more nuanced aspects of MWEs as seen from these perspectives, thus providing a deeper understanding of what is required in handling them and what models will need to get right. This part will contrast these requirements with the strengths and weaknesses of PLMs discussed in Part 1.

Part 3 -- Traditional Methods of Identifying and Representing MWEs

This section will provide an overview of the various approaches, rooted in traditional methods of NLP, used in identifying and representing potentially idiomatic MWEs. Such methods, still common in MWE research, explicitly model linguistic phenomena (in this case, idiomaticity) based on the hypothesis that such explicit information will be useful to NLP models and downstream tasks. Additionally, this section will contrast these methods with the use of PLMs, which challenge the need for such explicit information as they seem to have access to syntax, semantics, and so on (Part 1 of the tutorial).
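One classic unsupervised approach of this kind scores the compositionality of an expression by comparing the distributional vector of the whole phrase with a vector composed from its parts (e.g. their average): literal combinations tend to stay close to their components, while idioms drift away. The sketch below illustrates the idea with hand-built toy vectors; real systems derive the vectors from corpora, and the specific numbers here are purely illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def compositionality(phrase_vec, word_vecs):
    """Cosine between the phrase vector and the average of its word vectors.
    High scores suggest compositional (literal) use; low scores suggest
    idiomatic use."""
    composed = np.mean(word_vecs, axis=0)
    return cosine(phrase_vec, composed)

# Toy vectors: a literal combination like "dry land" stays close to
# dry + land, while an idiom like "couch potato" drifts away from its parts.
dry, land = np.array([1.0, 0.2, 0.0]), np.array([0.8, 0.0, 0.3])
dry_land = np.array([0.9, 0.1, 0.2])          # near its parts
couch, potato = np.array([0.0, 1.0, 0.1]), np.array([0.1, 0.9, 0.0])
couch_potato = np.array([0.9, 0.0, 0.8])      # far from its parts

literal_score = compositionality(dry_land, [dry, land])
idiom_score = compositionality(couch_potato, [couch, potato])
print(f"dry land: {literal_score:.2f}  couch potato: {idiom_score:.2f}")
```

In practice such scores are computed from corpus-derived embeddings and compared against human compositionality judgements; the toy setup above only shows the shape of the computation.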

Part 4 -- The State-of-the-Art in Identification and Representation of MWEs

This section will deal with the state-of-the-art methods available for the identification and representation of MWEs and how they fare in downstream applications. In particular, we will discuss the various methods used by participants of SemEval 2022 Task 2, a task that several of the proposed presenters of this tutorial are involved in organizing. As such, we will discuss the very latest methods used to address this problem.

Part 5 -- Conclusions and Avenues of Future Research

We will conclude the tutorial by highlighting what PLMs can do well, as well as the linguistic phenomena and non-linguistic knowledge that models might still need to make use of so as to move towards language understanding, as exemplified by the problem of representing MWEs. Simultaneously, we will address the question of whether or not linguistic phenomena such as literal/idiomatic ambiguity (Savary et al., 2019) are worth addressing using powerful machine learning methods, especially given that (in the case of idiomaticity) literal/idiomatic distributions are highly skewed in corpora, even though both readings are theoretically possible (Pasquer et al., 2020). We will then discuss possible avenues for further research, such as the use of knowledge-aware pre-training. This section will be discussion-driven.

Tutorial Outline

Part 0: Introduction

  • Motivation and background

  • Tutorial Structure and what to expect.


Part 1: Psychological, Cognitive and Linguistic BERTology

  • Probing, what it tells us and what it might miss.

  • What Language Models are obviously good at, less good at and bad at, and what this says about their ability to simulate human-like thought.

  • Language Modelling Fast and Slow: “Low-level” vs. “high-level” cognitive and linguistic processes in Language Models from a Psychological perspective.

  • The need for developing high level cognitive capabilities in language models.


Part 2: Linguistic and Cognitive aspects of MWEs

  • MWEs in the Mind

  • L2 Language Learners and Idioms

  • Idioms: A linguistic perspective

  • Characteristics of MWEs that make them both Interesting and Hard for NLP.

  • The need for “Deeper” Linguistic Information and Knowledge in MWE understanding.

  • The Relation between what is needed for Understanding Idioms and what PLMs have access to.


Part 3: Traditional methods of identifying, discovering and representing MWEs.

  • MWE identification resources: annotated corpora

  • MWE discovery and unsupervised compositionality prediction

  • Downstream applications of MWE identification and discovery

  • Easy and hard MWEs to identify and discover

  • MWE identification models, a comparison: Traditional methods, CRFs and PLMs.


Part 4: The State-of-the-Art in Identification and Representation of MWEs

  • Identification: Motivation, methods and effectiveness

  • Representation: Motivation, methods and effectiveness

  • SemEval 2022 Task 2 -- Findings

  • Cognitive and Linguistic Considerations


Part 5: Conclusions and Avenues of Future Research

  • PLMs, BERTology, the Linguistic Nuances of Idioms, and World Knowledge: A Quick Recap.

  • Are Uncommon Linguistic Phenomena Worth Studying?

  • How and why these Fields need each other.

  • Discussion

Instructors

Harish Tayyar Madabushi

University of Bath, UK. Has worked extensively on research related to MWEs and language models. Is the principal organiser of the SemEval 2022 Task on MWEs. Research interests include the integration of cognitive linguistics and deep learning for effective NLP systems.

https://www.harishtayyarmadabushi.com/

Carlos Ramisch

Aix-Marseille University, France. Has been working with MWEs for more than 15 years and is the SIGLEX-MWE Section representative and co-organiser of numerous editions of the MWE workshop since 2010. He is the main developer and maintainer of the MWEtoolkit. He is also one of the main co-organisers of the PARSEME shared tasks (editions 2017, 2018 and 2020) focusing on the automatic identification of MWEs in 14-20 languages.

https://pageperso.lis-lab.fr/carlos.ramisch/

Marco Idiart

Federal University of Rio Grande do Sul (Brazil). Research interests include Textual Simplification of Complex Expressions, Cognitive Computational Models of Natural Language, and the Analysis and Integration of Multiword Expressions in Speech and Translation.

http://www.if.ufrgs.br/~idiart/

Aline Villavicencio

University of Sheffield, UK. Interests include Lexical semantics, multilinguality, and cognitively motivated NLP. This work includes techniques for Multiword Expression treatment using statistical methods and distributional semantic models, and applications like Text Simplification and Question Answering.

https://sites.google.com/view/alinev

Questions?

For more information on the tutorial, contact:

htm43@bath.ac.uk