Publications

preprints

Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences can be overcome by increasing model size. This work investigates the critical role of model scaling, asking whether increases in size can make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best-performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that increased model size may lead to better performance, but LLMs are still not as sensitive to (un)grammaticality as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.

The search surface is a foundational concept in the visual search literature. It describes the impact of target-distractor (TD) and distractor-distractor (DD) similarity on search efficiency. However, the shape of the search surface lacks direct quantitative support: it is a summary approximation of a wide range of lab-based results that generalise poorly to real-world scenarios. This study exploits convolutional neural networks to quantitatively assess the similarity effects in search tasks using real images as stimuli and to determine which levels of feature complexity the similarity effects rely on. Besides providing ecological converging evidence supporting the established search surface, our results reveal that TD and DD similarity mainly operate at two distinct layers of the network: DD similarity at the layer of coarse object features, and TD similarity at the layer of complex features used for classification. This suggests that these forms of similarity exert their major effects at two distinct levels of perceptual processing.

Data, Material, & Scripts

Computational models of semantic representations have long assumed and produced a single static representation for each word type, ignoring the influence of linguistic context on semantic representations. Recent Large Language Models (LLMs) introduced in Natural Language Processing, however, learn token-level contextualised representations, holding promise for studying how semantic representations change in different contexts. In this study we probe type- and token-level representations learned using a prominent example of such models, Bidirectional Encoder Representations from Transformers (BERT), for their ability to i) explain semantic effects found for isolated words (semantic relatedness and similarity ratings, lexical decision, and semantic priming), but critically also to ii) exhibit systematic interactions between lexical semantics and context, and iii) explain meaning modulations in context. Across a wide range of empirical studies on each of these topics, we show that BERT representations satisfy two desiderata for psychologically valid semantic representations: i) they have a stable semantic core which allows people to interpret words in isolation and prevents words from being used arbitrarily, and ii) they interact with sentence context in systematic ways, with representations shifting as a function of their semantic core and the context. This demonstrates that a single, comprehensive model which simultaneously learns abstract, type-level prototype representations as well as mechanisms of how these interact with context can explain both isolated word effects and context-dependent variations. Notably, these variations are not limited to discrete word senses, eschewing a strict dichotomy between exemplar and prototype models and re-framing traditional notions of polysemy.


in press

Pseudowords such as “knackets” or “spechy” – letter strings that are consistent with the orthotactical rules of a language but do not appear in its lexicon – are traditionally considered to be meaningless, and employed as such in empirical studies. However, recent studies that show specific semantic patterns associated with these words as well as semantic effects on human pseudoword processing have cast doubt on this view. While these studies suggest that pseudowords have meanings, they provide only extremely limited insight as to whether humans are able to ascribe explicit and declarative semantic content to unfamiliar word forms. In the present study, we employed an exploratory-confirmatory study design to examine this question. In a first exploratory study, we started from a pre-existing dataset of words and pseudowords alongside human-generated definitions for these items. Employing 18 different language models, we showed that the definitions actually produced for (pseudo)words were closer to their respective (pseudo)words than the definitions for the other items. Based on these initial results, we conducted a second, pre-registered, high-powered confirmatory study collecting a new, controlled set of (pseudo)word interpretations. This second study confirmed the results of the first one. Taken together, these findings support the idea that meaning construction is supported by a flexible form-to-meaning mapping system based on statistical regularities in the language environment that can accommodate novel lexical entries as soon as they are encountered.

2024

Link     Download     Data, Material, & Scripts

We investigate the onomasiological question of which words speakers actually use and produce when trying to convey an intended meaning. This is not limited to selecting the best-fitting available existing word, but also includes word formation, the coinage of novel words. In the first two experiments, we introduce the taboo game paradigm in which participants were instructed to produce a single-word substitution for different words so that others can later identify them. Using distributional semantic models with the capability to produce quantitative representations for existing and novel word responses, we find that (a) responses tend to be semantically close to the targets and (b) existing words were represented closer than novel words, but (c) even novel compounds were often closer than the targets’ free associates. In a final third experiment, we find that other participants are more likely to guess the correct original word (a) for responses closer to the original targets, and (b) for novel compound responses as compared to existing word responses. This shows that the production of both existing and novel words can be accurately captured in a unified computational framework of the semantic mechanisms driving word choice.


*shared first authorship    

Link/Download     Data, Material, & Scripts


The use of taboo words represents one of the most common and arguably universal linguistic behaviors, fulfilling a wide range of psychological and social functions. However, in the scientific literature, taboo language is poorly characterized, and how it is realized in different languages and populations remains largely unexplored. Here we provide a database of taboo words, collected from different linguistic communities (Study 1, N = 1,046), along with their speaker-centered semantic characterization (Study 2, N = 455 for each of six rating dimensions), covering 13 languages and 17 countries from all five permanently inhabited continents. Our results show that, in all languages, taboo words are mainly characterized by extremely low valence and high arousal, and very low written frequency. However, a significant amount of cross-country variability in words’ tabooness and offensiveness proves the importance of community-specific sociocultural knowledge in the study of taboo language.


Link/Download    Data, Material, & Scripts

Valence is a dominant semantic dimension, and it is fundamentally linked to basic approach-avoidance behavior within a broad range of contexts. Previous studies have shown that it is possible to approximate the valence of existing words based on several surface-level and semantic components of the stimuli. In parallel, recent studies have shown that even completely novel and (apparently) meaningless stimuli, like pseudowords, can be informative of meaning based on the information that they carry at the sub-word level. Here, we aimed to further extend this evidence by investigating whether humans can reliably assign valence to pseudowords and, additionally, to identify the factors explaining such valence judgments. In Experiment 1, we trained several models to predict valence judgments for existing words from their combined form and meaning information. Then, in Experiment 2 and Experiment 3, we extended the results by predicting participants’ valence judgments for pseudowords, using a set of models indexing different (possible) sources of valence, and selected the best-performing model in a completely data-driven procedure. Results showed that the model including basic surface-level information (i.e., the letters composing the pseudoword) and orthographic-neighbor information performed best, thus tracing pseudoword valence back to these components. These findings support perspectives on the non-arbitrariness of language and provide insights regarding how humans process the valence of novel stimuli.

Link/Download     Data, Material, & Scripts

In recent years, advancements in deep learning models for computer vision have led to a dramatic improvement in their image classification accuracy. However, models with a higher accuracy in the task they were trained on do not necessarily develop better image representations that allow them to also perform better in other tasks they were not trained on. To assess the representation learning capabilities of prominent high-performing computer vision models, we investigated how well they capture various indices of perceptual similarity from large-scale behavioral datasets. We find that higher image classification accuracy rates are not associated with better performance on these datasets; in fact, we observe no improvement in performance since GoogLeNet (released 2015) and VGG-M (released 2014). We speculate that more accurate classification may result from hyper-engineering towards very fine-grained distinctions between highly similar classes, which does not incentivize the models to capture overall perceptual similarities.

Link     Download     Data, Material, & Scripts

The human body is perhaps the most ubiquitous and salient visual stimulus that we encounter in our daily lives. Given the prevalence of images of human bodies in natural scene statistics, it is no surprise that our mental representations of the body are thought to strongly originate from visual experience. Yet, little is known about high-level cognitive representations of the body. Here, we retrieved a body map from natural language, taking this as a window into high-level cognitive processes. We first extracted a matrix of distances between body parts from natural language data and employed this matrix to extrapolate a body map. To test the effectiveness of this high-level body map, we then conducted a series of experiments in which participants were asked to classify the distance between pairs of body parts, presented either as words or images. We found that the high-level body map was systematically activated when participants were making these distance judgments. Crucially, the linguistic map explained participants’ performance over and above the visual body map, indicating that the former cannot be simply conceived as a by-product of perceptual experience. These findings, therefore, establish the existence of a behaviorally relevant, high-level representation of the human body.

Link     Download

We identify and analyze three caveats that may arise when analyzing the linguistic abilities of Large Language Models. The problem of unlicensed generalizations refers to the danger of interpreting performance in one task as predictive of the models’ overall capabilities, based on the assumption that because a specific task performance is indicative of certain underlying capabilities in humans, the same association holds for models. The human-like paradox refers to the problem of lacking human comparisons, while at the same time attributing human-like abilities to the models. Lastly, the problem of double standards refers to the use of tasks and methodologies that either cannot be applied to humans or are evaluated differently in models and humans. While we recognize the impressive linguistic abilities of LLMs, we conclude that specific claims about the models’ human-likeness in the grammatical domain are premature.

Link     Download     Data, Material, & Scripts

Word frequency is one of the best predictors of language processing. Typically, word frequency norms are entirely based on natural-language text data, thus representing what the literature typically refers to as purely linguistic experience. This study presents Flickr frequency norms as a novel word frequency measure from a domain-specific corpus inherently tied to extra-linguistic information: words used as image tags on social media. To obtain Flickr frequency measures, we exploited the photo-sharing platform Flickr (containing billions of photos) and extracted the number of uploaded images tagged with each of the words considered in the lexicon. Here we systematically examine the peculiarities of Flickr frequency norms and show that Flickr frequency is a hybrid metric, lying at the intersection between language and visual experience and with specific biases induced by being based on image-focused social media. Moreover, regression analyses indicate that Flickr frequency captures additional information beyond what is already encoded in existing norms of linguistic, sensorimotor, and affective experience. Therefore, these new norms capture aspects of language usage that are missing from traditional frequency measures: a portion of language usage capturing the interplay between language and vision, which, as this study demonstrates, has its own impact on word processing. The Flickr frequency norms are openly available on the Open Science Framework (https://osf.io/2zfs3/).

Link/Download     Data, Material, & Scripts     

Five experiments investigated the association between time and valence. In the first experiment, participants classified temporal expressions (e.g., past, future) and positively or negatively connoted words (e.g., glorious, nasty) based on temporal reference or valence. They responded slower and made more errors in the mismatched condition (positive/past mapped to one hand, negative/future to the other) compared with the matched condition (positive/future to one hand, negative/past to the other hand). Experiment 2 confirmed the generalization of the match effect to nonspatial responses, while Experiment 3 found no reversal of this effect for left-handers. Overall, the results of the three experiments indicate a robust match effect, associating the past with negative valence and the future with positive valence. Experiment 4 involved rating the valence of time-related words, showing higher ratings for future-related words. Additionally, Experiment 5 employed latent semantic analysis and revealed that linguistic experiences are unlikely to be the source of this time–valence association. An interactive activation model offers a quantitative explanation of the match effect, potentially arising from a favorable perception of the future over the past.

2023

Link     Download     Data, Material, & Scripts

Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers, when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into eight linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, a significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition helps the models converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs’ performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.

Link     Download     Data, Material, & Scripts

Quantitative, data-driven models for mental representations have long enjoyed popularity and success in psychology (for example, distributional semantic models in the language domain), but have largely been missing for the visual domain. To overcome this, we present ViSpa (Vision Spaces), high-dimensional vector spaces that include vision-based representation for naturalistic images as well as concept prototypes. These vectors are derived directly from visual stimuli through a deep convolutional neural network (DCNN) trained to classify images, and allow us to compute vision-based similarity scores between any pair of images and/or concept prototypes. We successfully evaluate these similarities against human behavioral data in a series of large-scale studies, including off-line judgments – visual similarity judgments for the referents of word pairs (Study 1) and for image pairs (Study 2), and typicality judgments for images given a label (Study 3) – as well as on-line processing times and error rates in a discrimination (Study 4) and priming task (Study 5) with naturalistic image material. ViSpa similarities predict behavioral data across all tasks, which renders ViSpa a theoretically appealing model for vision-based representations and a valuable research tool for data analysis and the construction of experimental material: ViSpa allows for precise control over experimental material consisting of images (also in combination with words), and introduces a specifically vision-based similarity for word pairs. To make ViSpa available to a wide audience, this article a) includes (video) tutorials on how to use ViSpa in R, and b) presents a user-friendly web interface at http://vispa.fritzguenther.de.  
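The core idea behind ViSpa-style similarity scores can be sketched in a few lines. In the sketch below, random vectors stand in for DCNN-derived image representations, and a concept prototype is assumed to be the average of the concept's image vectors; all names and numbers are illustrative, not the actual ViSpa pipeline.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-ins for DCNN-derived vectors of individual images
# of two concepts (real ViSpa vectors come from a trained network).
dog_images = rng.normal(loc=1.0, size=(50, 8))
cat_images = rng.normal(loc=-1.0, size=(50, 8))

# Assumption: a concept prototype is the average of its image vectors.
dog_prototype = dog_images.mean(axis=0)

def cosine(u, v):
    """Vision-based similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity of an image to a prototype, within vs. between concepts:
sim_within = cosine(dog_images[0], dog_prototype)
sim_between = cosine(cat_images[0], dog_prototype)
```

On this toy setup, an image of a concept is more similar to its own prototype than an image of a different concept is, which is the kind of typicality-style comparison the studies above evaluate against behavioral data.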

Link     Download     Data, Material, & Scripts

Many theories on the role of semantics in morphological representation and processing focus on the interplay between the lexicalized meaning of the complex word on the one hand, and the individual constituent meanings on the other hand. However, the constituent meaning representations at play do not necessarily correspond to the free-word meanings of the constituents: Role-dependent constituent meanings can be subject to sometimes substantial semantic shift from their corresponding free-word meanings (such as -bill in hornbill and razorbill, or step- in stepmother and stepson). While this phenomenon is extremely difficult to operationalize using the standard psycholinguistic toolkit, we demonstrate how these as-constituent meanings can be represented in a quantitative manner using a data-driven computational model. After a qualitative exploration, we validate the model against a large database of human ratings of the meaning retention of constituents in compounds. With this model at hand, we then proceed to investigate the internal semantic structure of compounds, focussing on differences in semantic shift and semantic transparency between the two constituents.

Link/Download

Language processing is influenced by sensorimotor experiences. Here, we review behavioral evidence for embodied and grounded influences in language processing across six linguistic levels of granularity. We examine (a) sub-word features, discussing grounded influences on iconicity (systematic associations between word form and meaning); (b) words, discussing boundary conditions and generalizations for the simulation of color, sensory modality, and spatial position; (c) sentences, discussing boundary conditions and applications of action direction simulation; (d) texts, discussing how the teaching of simulation can improve comprehension in beginning readers; (e) conversations, discussing how multi-modal cues improve turn taking and alignment; and (f) text corpora, discussing how distributional semantic models can reveal how grounded and embodied knowledge is encoded in texts. These approaches are converging on a convincing account of the psychology of language, but at the same time, there are important criticisms of the embodied approach and of specific experimental paradigms. The surest way forward requires the adoption of a wide array of scientific methods. By providing complementary evidence, a combination of multiple methods on various levels of granularity can help us gain a more complete understanding of the role of embodiment and grounding in language processing.

Link/Download     

Outdoor recreation provides vital interactions between humans and ecological systems with a range of mental and physical benefits for people. Despite the increased number of studies using crowdsourced online data to assess how people interact with the landscape during recreational activities, the focus remains largely on mapping the spatial distribution of visitors or analyzing the content of shared images, and little work has been done to quantify the perceptions and emotions people assign to the landscape. In this study, we used crowdsourced textual data from an outdoor activity-sharing platform (Wikiloc), and applied Natural Language Processing (NLP) methods and correlation analysis to capture hikers' perceptions associated with landscape features and physical outdoor activities. Our results indicate eight clusters based on the semantic similarity between words, ranging from four clusters describing landscape features (“ecosystems, animals & plants”, “geodiversity”, “climate & weather”, and “built cultural heritage”), to one cluster describing the range of physical outdoor activities and three clusters indicating hikers' perceptions and emotions (“aesthetics”, “joy & restoration” and “physical effort sensation”). The association analysis revealed that the cluster “ecosystems, animals & plants” is likely to stimulate all three identified perceptions, suggesting that these natural features are important for hikers during their outdoor experience. Moreover, hikers strongly associate the cluster “outdoor physical activities” with both “joy & restoration” and “physical effort sensation” perceptions, highlighting the health and well-being benefits of physical activities in natural landscapes. Our study shows the potential of Wikiloc as a valuable data source to assess human-nature interactions and how textual data can provide significant advances in understanding people's preferences and perceptions while recreating.
These findings can help inform outdoor recreation planners in the study region by focusing on the elements of the landscape that people perceive to be important (i.e. “ecosystems, animals & plants”).

2022

Link     Download     Data, Material, & Scripts

Theories of grounded cognition assume that conceptual representations are grounded in sensorimotor experience. However, abstract concepts such as jealousy or childhood have no directly associated referents with which such sensorimotor experience can be made; therefore, the grounding of abstract concepts has long been a topic of debate. Here, we propose (a) that systematic relations exist between semantic representations learned from language on the one hand and perceptual experience on the other hand, (b) that these relations can be learned in a bottom-up fashion, and (c) that it is possible to extrapolate from this learning experience to predict expected perceptual representations for words even where direct experience is missing. To test this, we implement a data-driven computational model that is trained to map language-based representations (obtained from text corpora, representing language experience) onto vision-based representations (obtained from an image database, representing perceptual experience), and apply its mapping function onto language-based representations for abstract and concrete words outside the training set. In three experiments, we present participants with these words, accompanied by two images: the image predicted by the model and a random control image. Results show that participants’ judgements were in line with model predictions even for the most abstract words. This preference was stronger for more concrete items and decreased for the more abstract ones. Taken together, our findings have substantial implications in support of the grounding of abstract words, suggesting that we can tap into our previous experience to create possible visual representations we don’t have.
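The logic of such a language-to-vision mapping can be sketched with simulated vectors: learn a linear map from paired text and vision vectors, then, for a word outside the training set, compare the predicted vision vector against a matching image and a random control. All vectors, dimensionalities, and the choice of plain least squares are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_text, dim_vis, n_train = 16, 12, 60

# Simulated paired training data: text- and vision-based vectors
# for concrete training words (all values illustrative).
M_true = rng.normal(size=(dim_text, dim_vis))
X_text = rng.normal(size=(n_train, dim_text))
Y_vis = X_text @ M_true + 0.05 * rng.normal(size=(n_train, dim_vis))

# Learn the text -> vision mapping by ordinary least squares.
M_hat, *_ = np.linalg.lstsq(X_text, Y_vis, rcond=None)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# For a word outside the training set, predict its expected vision
# vector and compare it with two candidate images.
x_word = rng.normal(size=dim_text)
predicted = x_word @ M_hat
matching_image = x_word @ M_true          # image that truly corresponds
random_image = rng.normal(size=dim_vis)   # random control image
```

In this toy setup the predicted vision vector ends up far more similar to the matching image than to the random control, mirroring the two-image choice participants faced in the experiments.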

Link     Download     Data, Material, & Scripts

Large-scale linguistic data is nowadays available in abundance. Using this source of data, previous research has identified redundancies between the statistical structure of natural language and properties of the (physical) world we live in. For example, it has been shown that we can gauge city sizes by analyzing their respective word frequencies in corpora. However, since natural language is always produced by human speakers, we point out that such redundancies can only come about indirectly and should necessarily be restricted to cases where human representations largely retain characteristics of the physical world. To demonstrate this, we examine the statistical occurrence of words referring to body parts in very different languages, covering nearly 4 billion native speakers. This is because the convergence between language and physical properties of the stimuli clearly breaks down for the human body (i.e., more relevant and functional body parts are not necessarily larger in size). Our findings indicate that the human body as extracted from language does not retain its actual physical proportions; instead, it resembles the distorted human-like figure known as the sensory homunculus, whose form depicts the amount of cortical area dedicated to sensorimotor functions of each body part (and, thus, their relative functional relevance). This demonstrates that the surface-level statistical structure of language opens a window into how humans represent the world they live in, rather than into the world itself.

Link     Download     Data, Material, & Scripts

While distributional semantic models that represent word meanings as high-dimensional vectors induced from large text corpora have been shown to successfully predict human behavior across a wide range of tasks, they have also received criticism from different directions. These include concerns over their interpretability (how can numbers specifying abstract, latent dimensions represent meaning?) and their ability to capture variation in meaning (how can a single vector representation capture multiple different interpretations for the same expression?). Here, we demonstrate that semantic vectors can indeed rise to these challenges, by training a mapping system (a simple linear regression) that predicts inter-individual variation in relational interpretations for compounds such as wood brush (for example brush FOR wood, or brush MADE OF wood) from (compositional) semantic vectors representing the meanings of these compounds. These predictions consistently beat different random baselines, both for familiar compounds (moon light, Experiment 1) as well as novel compounds (wood brush, Experiment 2), demonstrating that distributional semantic vectors encode variations in qualitative interpretations that can be decoded using techniques as simple as linear regression.
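The "techniques as simple as linear regression" point can be made concrete with a toy sketch: fit a least-squares map from compound vectors to a vector of relation-interpretation ratings, then predict interpretations for a new compound. The dimensionalities, the number of relation categories, and all data here are simulated placeholders, not the study's materials.

```python
import numpy as np

rng = np.random.default_rng(0)
n_compounds, dim, n_relations = 40, 20, 4

# Simulated stand-ins: each row of X is a (compositional) semantic
# vector for a compound; each row of Y holds ratings for relation
# interpretations (e.g. FOR, MADE OF, ...). Values are illustrative.
X = rng.normal(size=(n_compounds, dim))
W_true = rng.normal(size=(dim, n_relations))
Y = X @ W_true + 0.1 * rng.normal(size=(n_compounds, n_relations))

# Decode interpretations from semantic vectors with plain least squares.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predicted interpretation profile for a held-out novel compound.
x_new = rng.normal(size=(1, dim))
pred = x_new @ W_hat
```

Because the toy ratings really are a (noisy) linear function of the vectors, the fitted map recovers them well; the empirical question the paper answers is whether real human interpretation variation is linearly decodable in the same way.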

Link     Download     Data, Material, & Scripts

While a number of studies have repeatedly demonstrated an automatic activation of sensorimotor experience during language processing in the form of action-congruency effects, as predicted by theories of grounded cognition, more recent research has not found these effects for words that were just learned from linguistic input alone, without sensorimotor experience with their referents. In the present study, we investigate whether this absence of effects can be attributed to a lack of repeated experience and consolidation of the associations between words and sensorimotor experience in memory. To address these issues, we conducted four experiments in which (1 and 2) participants engaged in two separate learning phases in which they learned novel words from language alone, with an intervening period of memory-consolidating sleep, and (3 and 4) we employed familiar words whose referents speakers have no direct experience with (such as plankton). However, we again did not observe action-congruency effects in subsequent test phases in any of the experiments. This indicates that direct sensorimotor experience with word referents is a necessary requirement for automatic sensorimotor activation during word processing.

2021

Link     Download     Data, Material, & Scripts

In their strongest formulation, theories of grounded cognition claim that concepts are made up of sensorimotor information. Following such equivalence, perceptual properties of objects should consistently influence processing, even in purely linguistic tasks, where perceptual information is neither solicited nor required. Previous studies have tested this prediction in semantic priming tasks, but they have not observed perceptual influences on participants’ performances. However, those findings suffer from critical shortcomings, which may have prevented potential visually grounded/perceptual effects from being detected. Here, we investigate this topic by applying an innovative method expected to increase the sensitivity in detecting such perceptual effects. Specifically, we adopt an objective, data-driven, computational approach to independently quantify vision-based and language-based similarities for prime-target pairs on a continuous scale. We test whether these measures predict behavioural performance in a semantic priming mega-study with various experimental settings. Vision-based similarity is found to facilitate performance, but a dissociation between vision-based and language-based effects was also observed. Thus, in line with theories of grounded cognition, perceptual properties can facilitate word processing even in purely linguistic tasks, but the behavioural dissociation at the same time challenges strong claims of sensorimotor and conceptual equivalence.

Link/Download     Data, Material, & Scripts

In the current state-of-the-art distributional semantics model of the meaning of noun-noun compounds (such as chainsaw, butterfly, home phone), CAOSS (Marelli et al. 2017), the semantic vectors of the individual constituents are combined, and enriched by position-specific information for each constituent in its role as either modifier or head. Most recently, there have been attempts to include vision-based embeddings in these models (Günther et al., 2020b), using the linear architecture implemented in the CAOSS model. In the present paper, we extend this line of research and demonstrate that moving to nonlinear models improves the results for vision while linear models are a good choice for text. Simply concatenating text and vision vectors does not yet improve the prediction of human behavioral data over models using text- and vision-based measures separately.

Link     Data, Material, & Scripts

Conversational negation often behaves differently from negation as a logical operator: when rejecting a state of affairs, it does not present all members of the complement set as equally plausible alternatives, but it rather suggests some of them as more plausible than others (e.g., “This is not a dog, it is a wolf/*screwdriver”). Entities that are semantically similar to a negated entity tend to be judged as better alternatives (Kruszewski et al., 2016). In fact, Kruszewski et al. (2016) show that the cosine similarity scores between the distributional semantics representations of a negated noun and its potential alternatives are highly correlated with the negated noun-alternatives human plausibility ratings. In a series of cloze tasks, we show that negation likewise restricts the production of plausible alternatives to similar entities. Furthermore, completions to negative sentences appear to be even more restricted than completions to an affirmative conjunctive context, hinting at a peculiarity of negation.

2020

Link/Download

*shared first authorship

While morphemes are theoretically defined as linguistic units linking form and meaning, semantic effects in morphological processing are not reported consistently in the literature on derived and compound words. The lack of consistency in this line of research has often been attributed to methodological differences between studies or contextual effects. In this paper, we advance a different proposal where semantic effects emerge quite consistently if semantics is defined in a dynamic and flexible way, relying on distributional semantics approaches. In this light, we revisit morphological processing, taking a markedly cognitive perspective, as allowed by models that focus on morphology as systematic meaning transformation or that focus on the mapping between the orthographic form of words and their meanings. 

Link     Download     Data, Material, & Scripts

Theories of grounded cognition postulate that concepts are grounded in sensorimotor experience. But how can that be for concepts like Atlantis for which we do not have that experience? We claim that such concepts obtain their sensorimotor grounding indirectly, via already-known concepts used to describe them. Participants learned novel words referring to up or down concepts (mende = enhanced head or mende = bionic foot). In a first experiment, participants then judged the sensibility of sentences implying up or down actions (e.g., “You scratch your bionic foot”) by performing up or down hand movements. Reactions were faster when the hand movement matched the direction of the implied movement. In the second experiment, we observed the same congruency effect for sentences like, “You scratch your mende”, whose implied direction depended entirely on the learning phase. This offers a perspective on how concepts learned without direct experience can nonetheless be grounded in sensorimotor experience.

Link     Download     Data, Material, & Scripts

Previous studies found that an automatic meaning-composition process affects the processing of morphologically complex words, and related this operation to conceptual combination. However, research on embodied cognition demonstrates that concepts are more than just lexical meanings, rather being also grounded in perceptual experience. Therefore, perception-based information should also be involved in mental operations on concepts, such as conceptual combination. Consequently, we should expect to find perceptual effects in the processing of morphologically complex words. In order to investigate this hypothesis, we present the first fully implemented and data-driven model of perception-based (more specifically, vision-based) conceptual combination, and use the predictions of such a model to investigate processing times for compound words in four large-scale behavioral experiments employing three paradigms (naming, lexical decision, and timed sensibility judgments). We observe facilitatory effects of vision-based compositionality in all three paradigms, over and above a strong language-based (lexical and semantic) baseline, thus demonstrating for the first time perceptually grounded effects at the sub-lexical level. This suggests that perceptually grounded information is not only utilized according to specific task demands but rather automatically activated when available.

Link/Download     Data, Material, & Scripts

In the present study, we provide a comprehensive analysis and a multi-dimensional dataset of semantic transparency measures for 1,810 German compound words. Compound words are considered semantically transparent when the contribution of the constituents’ meaning to the compound meaning is clear (as in airport), but the degree of semantic transparency varies between compounds (compare strawberry or sandman). Our dataset includes both compositional and relatedness-based semantic transparency measures, also differentiated by constituents. The measures are obtained from a computational and fully implemented semantic model based on distributional semantics. We validate the measures using data from four behavioral experiments: Explicit transparency ratings, two different lexical decision tasks using different nonwords, and an eye-tracking study. We demonstrate that different semantic effects emerge in different behavioral tasks, which can only be captured using a multi-dimensional approach to semantic transparency. We further provide the semantic transparency measures derived from the model for a dataset of 40,475 additional German compounds, as well as for 2,061 novel German compounds.

Link     Download     Data, Material, & Scripts

Speakers of languages with synchronically productive compounding systems, such as English, are likely to encounter new compounds on a daily basis. These can only be useful for communication if speakers are able to rapidly compose their meanings. However, while compositional meanings can be obtained for some novel compounds such as bridgemill, this is far harder for others such as radiosauce; accordingly, processing speed should be affected by the ease of such a compositional process. To rigorously test this hypothesis, we employed a fully implemented computational model based on distributional semantics to quantitatively measure the degree of semantic compositionality of novel compounds. In two large-scale studies, we collected timed sensibility judgements and lexical decisions for hundreds of morphologically structured nonwords in English. Response times were predicted by the constituents’ semantic contribution to the compositional process, with slower rejections for more compositional nonwords. We found no indication of a difference in these compositional effects between the tasks, suggesting that speakers automatically engage in a compositional process whenever they encounter morphologically structured stimuli, even when it is not required by the task at hand. Such compositional effects in the processing of novel compounds have important implications for studies that employ such stimuli as filler material or “nonwords,” as response times for these items can differ greatly depending on their compositionality.

2019

Link     Download

Models that represent meaning as high-dimensional numerical vectors—such as latent semantic analysis (LSA), hyperspace analogue to language (HAL), bound encoding of the aggregate language environment (BEAGLE), topic models, global vectors (GloVe), and word2vec—have been introduced as extremely powerful machine-learning proxies for human semantic representations and have seen an explosive rise in popularity over the past 2 decades. However, despite their considerable advancements and spread in the cognitive sciences, one can observe problems associated with the adequate presentation and understanding of some of their features. Indeed, when these models are examined from a cognitive perspective, a number of unfounded arguments tend to appear in the psychological literature. In this article, we review the most common of these arguments and discuss (a) what exactly these models represent at the implementational level and their plausibility as a cognitive theory, (b) how they deal with various aspects of meaning such as polysemy or compositionality, and (c) how they relate to the debate on embodied and grounded cognition. We identify common misconceptions that arise as a result of incomplete descriptions, outdated arguments, and unclear distinctions between theory and implementation of the models. We clarify and amend these points to provide a theoretical basis for future research and discussions on vector models of semantic representation.

Link     Download     Data, Material, & Scripts

In morphological processing, research has repeatedly found different priming effects for English and German native speakers in the overt priming paradigm. In English, priming effects were found for word pairs with a morphological and semantic relation (SUCCESSFUL-success), but not for pairs without a semantic relation (SUCCESSOR-success). By contrast, morphological priming effects in German occurred for pairs both with a semantic relation (AUFSTEHEN-stehen, ‘stand up’-‘stand’) and without (VERSTEHEN-stehen, ‘understand’-‘stand’). These behavioural differences have been taken to indicate differential language processing and memory representations in these languages. We examine whether these behavioural differences can be explained by differences in language structure between English and German. To this end, we employed new developments in distributional semantics as a computational method to obtain both observed and compositional representations for transparent and opaque complex word meanings that can in turn be used to quantify the degree of semantic predictability of the morphological system of a language. We compared the similarities between transparent and opaque words and their stems, and observed a difference between German and English, with German showing a higher morphological systematicity. The present results indicate that the investigated cross-linguistic effect can be attributed to quantitatively-characterized differences in the speakers' language experience, as approximated by linguistic corpora.

Link     Download     Data, Material, & Scripts

Effects of semantic transparency, reflected in processing differences between semantically transparent (teabag) and opaque (ladybird) compounds, have received considerable attention in the investigation of the role of constituents in compound processing. However, previous studies have yielded inconsistent results. In the present article, we argue that this is due to semantic transparency’s often being conceptualized only as the semantic relatedness between the compound and constituent meanings as separate units. This neglects the fact that compounds are inherently productive constructions. We argue that compound processing is routinely impacted by a compositional process aimed at computing a compositional meaning, which would cause compositional semantic transparency effects to emerge in compound processing. We employ recent developments in compositional distributional semantics to quantify relatedness- as well as composition-based semantic transparency measures and use these to predict lexical decision times in a large-scale data set. We observed semantic transparency effects on compound processing that are not captured in relatedness terms but only by adopting a compositional perspective.

Link

Scoring divergent-thinking response sets has always been challenging because such responses are not only open-ended in terms of number of ideas, but each idea may also be expressed by a varying number of concepts and, thus, by a varying number of words (elaboration). While many current studies have attempted to score the semantic distance in divergent-thinking responses by applying latent semantic analysis (LSA), it is known from other areas of research that LSA-based approaches are biased according to the number of words in a response. Thus, the current article aimed to identify and demonstrate this elaboration bias in LSA-based divergent-thinking scores by means of a simulation. In addition, we show that this elaboration bias can be reduced by removing stop words (for example, “and”, “or”, and “for”) prior to analysis. Furthermore, the residual bias after stop word removal can be reduced by simulation-based corrections. Finally, we give an empirical illustration for alternate uses and consequences tasks. Results suggest that when both stop word removal and simulation-based bias correction are applied, convergent validity should be expected to be highest.
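The stop word removal step described above can be sketched as follows: a response is scored by averaging the vectors of its content words only, so that function words added by elaboration do not shift the semantic-distance score. The stop word list and the tiny embedding space are illustrative assumptions, not the article's actual materials:

```python
import numpy as np

STOP_WORDS = {"and", "or", "for", "the", "a", "an", "of", "to"}  # toy list

def response_vector(response, vectors):
    """Average the word vectors of a response after dropping stop words,
    so that longer, more elaborated answers are not scored differently
    merely because of their function words."""
    content = [w for w in response.lower().split() if w not in STOP_WORDS]
    return np.mean([vectors[w] for w in content if w in vectors], axis=0)

def semantic_distance(response, cue, vectors):
    """Distance between a response and the task cue: 1 - cosine similarity."""
    r, c = response_vector(response, vectors), vectors[cue]
    return 1.0 - (r @ c) / (np.linalg.norm(r) * np.linalg.norm(c))

# toy embedding space (illustrative values only)
vectors = {"brick": np.array([1.0, 0.0]),
           "paperweight": np.array([0.8, 0.6]),
           "doorstop": np.array([0.6, 0.8])}

# an alternate-uses response for the cue "brick"
d = semantic_distance("a paperweight and a doorstop", "brick", vectors)
```

The residual length bias that survives this step is what the article's simulation-based correction targets.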

2018

Link     Download     Data, Material, & Scripts

Theories of embodied cognition assume that concepts are grounded in non-linguistic, sensorimotor experience. In support of this assumption, previous studies have shown that upwards response movements are faster than downwards movements after participants have been presented with words whose referents are typically located in the upper vertical space (and vice versa for downwards responses). This is taken as evidence that processing these words reactivates sensorimotor experiential traces. This congruency effect was also found for novel words, after participants learned these words as labels for novel objects that they encountered either in their upper or lower visual field. While this indicates that direct experience with a word’s referent is sufficient to evoke said congruency effects, the present study investigates whether this direct experience is also a necessary condition. To this end, we conducted five experiments in which participants learned novel words from purely linguistic input: Novel words were presented in pairs with real up- or down-words (Experiment 1); they were presented in natural sentences where they replaced these real words (Experiment 2); they were presented as new labels for these real words (Experiment 3); and they were presented as labels for novel combined concepts based on these real words (Experiments 4 and 5). In all five experiments, we did not find any congruency effects elicited by the novel words; however, participants were always able to make correct explicit judgements about the vertical dimension associated with the novel words. These results suggest that direct experience is necessary for reactivating experiential traces, but this reactivation is not a necessary condition for understanding (in the sense of storing and accessing) the corresponding aspects of word meaning.

Link     Download     Data, Material, & Scripts

In the present study, we investigated to what extent compounding involves general-level cognitive abilities related to conceptual combination. If that were the case, the compounding mechanism should be largely invariant across different languages. Under this assumption, a compositional model trained on word representations in one language should be able to predict compound meanings in other languages. We investigated this hypothesis by training a word embedding-based compositional model on a set of English compounds, and subsequently applying this model to German and Italian test compounds. The model partially predicted compound meanings in German, but not in Italian.

2016

Link/Download     Data, Material, & Scripts

Noun compounds, consisting of two nouns (the head and the modifier) that are combined into a single concept, differ in terms of their plausibility: school bus is a more plausible compound than saddle olive. The present study investigates which factors influence the plausibility of attested and novel noun compounds. Distributional Semantic Models (DSMs) are used to obtain formal (vector) representations of word meanings, and compositional methods in DSMs are employed to obtain such representations for noun compounds. From these representations, different plausibility measures are computed. Three of those measures contribute to predicting the plausibility of noun compounds: the relatedness between the meaning of the head noun and the compound (Head Proximity), the relatedness between the meaning of the modifier noun and the compound (Modifier Proximity), and the similarity between the head noun and the modifier noun (Constituent Similarity). We find nonlinear interactions between Head Proximity and Modifier Proximity, as well as between Modifier Proximity and Constituent Similarity. Furthermore, Constituent Similarity interacts nonlinearly with the familiarity with the compound. These results suggest that a compound is perceived as more plausible if it can be categorized as an instance of the category denoted by the head noun, if the contribution of the modifier to the compound meaning is clear but not redundant, and if the constituents are sufficiently similar in cases where this contribution is not clear. Furthermore, compounds are perceived to be more plausible if they are more familiar, but mostly for cases where the relation between the constituents is less clear.
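The three predictive measures named above are all cosine similarities computed from the constituent vectors and the composed compound vector. A minimal sketch, using toy random vectors and simple additive composition as a stand-in for the actual DSM composition method:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two semantic vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def plausibility_measures(mod, head, compound):
    """The three measures found to predict compound plausibility:
    relatedness of the compound to the head (Head Proximity) and to the
    modifier (Modifier Proximity), plus the similarity between the two
    constituents themselves (Constituent Similarity)."""
    return {"head_proximity": cosine(head, compound),
            "modifier_proximity": cosine(mod, compound),
            "constituent_similarity": cosine(mod, head)}

# toy vectors; a real DSM would supply trained, high-dimensional embeddings
rng = np.random.default_rng(1)
mod, head = rng.normal(size=5), rng.normal(size=5)   # e.g., "school", "bus"
compound = 0.5 * mod + 0.5 * head  # additive composition as a placeholder

m = plausibility_measures(mod, head, compound)
```

The nonlinear interactions reported in the study are between these scalar measures, estimated over many compounds in a regression on plausibility ratings.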

Link/Download     Data, Material, & Scripts

In two experiments, we attempted to replicate findings by Günther, Dudschig & Kaup (2016) that word similarity measures obtained from distributional semantics models - Latent Semantic Analysis (LSA) and Hyperspace Analogue to Language (HAL) - predict lexical priming effects. To this end, we used the pseudo-random method introduced by Günther et al. to generate item material while systematically controlling for word similarities, based on LSA cosine similarities (Experiment 1) and HAL cosine similarities (Experiment 2). Contrary to the original study, we used semantic spaces created from far larger corpora, and implemented several additional methodological improvements. In Experiment 1, we only found a significant effect of HAL cosines on lexical decision times, while we found significant effects for both LSA and HAL cosines in Experiment 2. As further supported by an analysis of the pooled data from both experiments, this indicates that HAL cosines are a better predictor of priming effects than LSA cosines. Taken together, the results replicate the finding that priming effects can be predicted from distributional semantic similarity measures.

Link     Download     Data, Material, & Scripts

In distributional semantics models (DSMs) such as latent semantic analysis (LSA), words are represented as vectors in a high-dimensional vector space. This allows for computing word similarities as the cosine of the angle between two such vectors. In two experiments, we investigated whether LSA cosine similarities predict priming effects, in that higher cosine similarities are associated with shorter reaction times (RTs). Critically, we applied a pseudo-random procedure in generating the item material to ensure that we directly manipulated LSA cosines as an independent variable. We employed two lexical priming experiments with lexical decision tasks (LDTs). In Experiment 1 we presented participants with 200 different prime words, each paired with one unique target. We found a significant effect of cosine similarities on RTs. The same was true for Experiment 2, where we reversed the prime-target order (primes of Experiment 1 were targets in Experiment 2, and vice versa). The results of these experiments confirm that LSA cosine similarities can predict priming effects, supporting the view that they are psychologically relevant. The present study thereby provides evidence for qualifying LSA cosine similarities not only as a linguistic measure, but also as a cognitive similarity measure. However, it is also shown that other DSMs can outperform LSA as a predictor of priming effects.
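The core computation these priming studies rely on is the cosine between prime and target vectors, with higher values predicting shorter RTs. A minimal sketch with hypothetical three-dimensional "LSA" vectors (illustrative values, not a real semantic space):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two word vectors: the similarity
    measure whose effect on reaction times the experiments test."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy 3-dimensional vectors (illustrative values only)
vectors = {"doctor": np.array([0.9, 0.3, 0.1]),
           "nurse":  np.array([0.8, 0.4, 0.2]),
           "carrot": np.array([0.1, 0.2, 0.9])}

related   = cosine(vectors["doctor"], vectors["nurse"])   # high cosine: faster RTs predicted
unrelated = cosine(vectors["doctor"], vectors["carrot"])  # low cosine: slower RTs predicted
```

The pseudo-random item-generation procedure treats this cosine as an independent variable, sampling prime-target pairs across its full range rather than contrasting only "related" and "unrelated" conditions.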

2015

Link     Download

In this article, the R package LSAfun is presented. This package provides a variety of functions and computations based on Vector Semantic Models such as Latent Semantic Analysis (LSA; Landauer, Foltz, & Laham, Discourse Processes 25:259–284, 1998), which are procedures to obtain a high-dimensional vector representation for words (and documents) from a text corpus. Such representations are thought to capture the semantic meaning of a word (or document) and allow for semantic similarity comparisons between words to be calculated as the cosine of the angle between their associated vectors. LSAfun uses pre-created LSA spaces and provides functions for (a) Similarity Computations between words, word lists, and documents; (b) Neighborhood Computations, such as obtaining a word’s or document’s most similar words; (c) plotting such a neighborhood, as well as similarity structures for any word lists, in a two- or three-dimensional approximation using Multidimensional Scaling; (d) Applied Functions, such as computing the coherence of a text, answering multiple choice questions, and producing generic text summaries; and (e) Composition Methods for obtaining vector representations for two-word phrases. The purpose of this package is to allow convenient access to computations based on LSA.