preprints
Cassani, G., Günther, F., Attanasio, G., Bianchi, F., & Marelli, M. (2023). Meaning Modulations and Stability in Large Language Models: An Analysis of BERT Embeddings for Psycholinguistic Research. psyArXiv preprint, https://psyarxiv.com/b45ys
Computational models of semantic representations have long assumed and produced a single static representation for each word type, ignoring the influence of linguistic context on semantic representations. Recent Large Language Models (LLMs) introduced in Natural Language Processing, however, learn token-level contextualised representations, holding promise for studying how semantic representations change in different contexts. In this study, we probe type- and token-level representations learned using a prominent example of such models, Bidirectional Encoder Representations from Transformers (BERT), for their ability to i) explain semantic effects found for isolated words (semantic relatedness and similarity ratings, lexical decision, and semantic priming), but critically also to ii) exhibit systematic interactions between lexical semantics and context, and iii) explain meaning modulations in context. Across a wide range of empirical studies on each of these topics, we show that BERT representations satisfy two desiderata for psychologically valid semantic representations: i) they have a stable semantic core which allows people to interpret words in isolation and prevents words from being used arbitrarily, and ii) they interact with sentence context in systematic ways, with representations shifting as a function of their semantic core and the context. This demonstrates that a single, comprehensive model which simultaneously learns abstract, type-level prototype representations as well as mechanisms of how these interact with context can explain both isolated word effects and context-dependent variations. Notably, these variations are not limited to discrete word senses, eschewing a strict dichotomy between exemplar and prototype models and re-framing traditional notions of polysemy.
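As a pointer for readers who want to extract such representations themselves, here is a minimal sketch of pulling a token-level contextualised vector out of BERT with the Hugging Face transformers library; the model name, example sentence, and target word are illustrative choices, not taken from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Minimal sketch, not the study's exact pipeline: extract a token-level,
# contextualised embedding for one word occurrence from BERT's last layer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "She sat down on the river bank."   # illustrative example sentence
encoding = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state.squeeze(0)  # (n_tokens, 768)

# locate the word piece(s) belonging to "bank" and average them
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
indices = [i for i, t in enumerate(tokens) if t == "bank"]
token_vector = hidden[indices].mean(dim=0)   # token-level (contextualised) vector

# a type-level vector can then be approximated by averaging token vectors
# of the same word across many different sentences
```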
Günther, F., Bell, M. J., & Schäfer, M. (2025). Gold student meets star model: Predicting the interpretational diversity of novel compounds in an exploratory-confirmatory approach. psyArXiv preprint, https://psyarxiv.com/2ypfs_v1
Almost all linguistic expressions are ambiguous to some extent and can be interpreted in various ways. This is especially the case for novel expressions a speaker has never encountered before, in particular combined concepts expressed via compounds such as /gold student/ or /monkey ring/. Although previous studies have shown that word embeddings (meaning representations derived from text-based language models) can encode the interpretational diversity of such expressions, these previous studies have been limited to a small, rigid, and high-level closed set of relational interpretations (e.g., 'student MADE OF gold', 'student ABOUT gold'). In contrast, the present study uses more ecologically valid open-format interpretations provided by human participants, which are afterwards classified in a bottom-up manner in order to compute quantitative estimates of interpretational diversity. In an exploratory study on pre-existing data, we first investigate which measures derived from word embeddings capture interpretational diversity, with the vector norm of the embeddings emerging as the best predictor. In a subsequent high-powered confirmatory study, we then systematically select new items for maximal variation of this vector norm, and replicate the same pattern. This is the first study to show that text-based language models encode the unconstrained interpretational diversity of linguistic expressions, even within a single vector representation, and even for novel expressions that have never been observed in their training data.
Günther, F., Petrenco, A., & Gatti, D. (2025). Cross-linguistic zero-shot communication via ad-hoc pseudowords. psyArXiv preprint, https://psyarxiv.com/g9d4j_v1
In verbal communication, speakers must encode meanings into signs such as words. Within a given language community, the correspondence between word forms and meanings can become conventionalized. However, speakers from different language communities cannot rely on these shared conventions. Here, we investigate whether purely verbal communication using single words is still possible in such a context, enabled by generalized form-meaning mappings. In a pre-registered experiment, we presented Italian speakers with words and instructed them to come up with corresponding German translations. The resulting German-like pseudowords were then shown to German speakers, who were asked to guess the original words. Supporting our hypotheses, results showed that the German participants' guesses were semantically closer to the original words than to randomly selected control words. These findings highlight the remarkable human ability to spontaneously create and interpret meaningful signals, even in the absence of shared linguistic conventions and across language boundaries.
Leivada, E., Günther, F., Masullo, C., Duñabeitia, J. A., Westergaard, M., & Rothman, J. (2024). A multi-metric analysis of 50,000 linguistic profiles provides sparse evidence that language distance modulates bilingual cognition. psyArXiv preprint, https://psyarxiv.com/9uqbm
Similarity in mental, linguistic representations modulates the degree of recruitment of cognitive control mechanisms, which have been linked to neurocognitive adaptations in bilingual populations. While ample evidence exists for this claim, its coverage is limited, as testing is geared towards WEIRD communities that use sizeable, Indo-European languages, thus potentially providing a biased view of bilingual cognition. We assess the role of distance as a key moderator of bilingual adaptations through a large-scale aggregation analysis of 510 experiments. To measure distance, we develop a multi-metric approach, using state-of-the-art databases, such as Grambank. Analyzing data from 56,122 participants who speak 79 different languages, spanning 11 language families and a language isolate, we find sparse evidence for a distance effect. Our results suggest that moderators such as language distance can shed light on the cognitive divide between language and dialects in a way that addresses the perennial question of what makes bilinguals distinct.
Leivada, E., Marcus, G., Günther, F., & Murphy, E. (2023). A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds? arXiv preprint, https://arxiv.org/abs/2308.00109
Modern Artificial Intelligence applications show great potential for language-related tasks that rely on next-word prediction. The current generation of Large Language Models (LLMs) has been linked to claims about human-like linguistic performance, and their applications are hailed both as a step towards artificial general intelligence and as a major advance in understanding the cognitive, and even neural, basis of human language. To assess these claims, we first analyze the contribution of LLMs as theoretically informative representations of a target cognitive system vs. atheoretical mechanistic tools. Second, we evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing, which requires grounding in previous expectations and past world experience. We hypothesize that since models lack grounded cognition, they cannot take advantage of these features and instead solely rely on fixed associations between represented words and word vectors. To assess this, we designed and ran a novel 'leet task' (l33t t4sk), which requires decoding sentences in which letters are systematically replaced by numbers. The results suggest that humans excel in this task whereas models struggle, confirming our hypothesis. We interpret the results by identifying the key abilities that are still missing from the current state of development of these models, which require solutions that go beyond increased system scaling.
Martinez-Tomás, C., Günther, F., Hinojosa, J. A., & Gatti, D. (2025). Conveying (discrete) emotionality with novel words. psyArXiv preprint, https://psyarxiv.com/gcnkx_v1
Affective dimensions (i.e., valence and arousal) and discrete basic emotions (i.e., anger, disgust, fear, happiness, sadness) are the main affective sources of information that explain the semantic features of words. Recent studies suggest that humans are able to assign emotionality even to pseudowords, plausible verbal stimuli that do not belong to a given language and which serve as proxies for never-encountered, real words (i.e., novel words). So far, the evidence at our disposal is mainly limited to valence (i.e., the hedonic tone of a word's reference, from pleasant to unpleasant), while investigating discrete emotionality is required for a more refined understanding of the processes at hand. Here, across three experiments, we probed i) humans' ability to convey discrete emotions when generating novel word stimuli to express the meanings of given emotional words, and ii) humans' ability to decode or understand such emotionality when processing these human-generated novel words. Leveraging estimates from a word embedding model, results showed that individuals can reliably encode and decode novel conceptual information carrying emotional information, with better performance for anger and happiness stimuli. Theoretically, these processes can be interpreted from an evolutionary perspective and, more broadly, they can be traced back to humans' ability to process systematic, non-arbitrary form-meaning information.
Murphy, E., Leivada, E., Dentella, V., Günther, F., & Marcus, G. (2025). Fundamental Principles of Linguistic Structure are Not Represented by o3. arXiv preprint, https://arxiv.org/abs/2502.10934
A core component of a successful artificial general intelligence would be the rapid creation and manipulation of grounded compositional abstractions and the demonstration of expertise in the family of recursive hierarchical syntactic objects necessary for the creative use of human language. We evaluated the recently released o3 model (OpenAI; o3-mini-high) and discovered that while it succeeds on some basic linguistic tests relying on linear, surface statistics (e.g., the Strawberry Test), it fails to generalize basic phrase structure rules; it fails with comparative sentences involving semantically illegal cardinality comparisons ('Escher sentences'); it fails to correctly rate and explain acceptability dynamics; and it fails to distinguish between instructions to generate unacceptable semantic vs. unacceptable syntactic outputs. When tasked with generating simple violations of grammatical rules, it is seemingly incapable of representing multiple parses to evaluate against various possible semantic interpretations. In stark contrast to many recent claims that artificial language models are on the verge of replacing the field of linguistics, our results suggest not only that deep learning is hitting a wall with respect to compositionality (Marcus 2022), but that it is hitting [a [stubbornly [resilient wall]]] that cannot readily be surmounted to reach human-like compositional reasoning simply through more compute.
The present research proposes and evaluates a novel method - centroid analysis - for measuring representations and concepts at both individual and group levels by mapping open-ended responses onto a pre-existing semantic vector space. Centroid analysis makes it possible to retrace the target concept as the geometric center of the semantic vectors of the responses generated by this concept. At the group level, centroid analysis enables researchers to compare conceptual structures across different populations to investigate how factors such as language, culture, cognitive differences, educational background, or exposure to specific narratives shape shared representations. At the individual level, centroid analysis allows for fine-grained assessments of how personal experiences, expertise, cognitive styles, or even temporary contextual influences affect conceptual representations. We evaluate this method using two distributional semantic models across several calculation methods, reference lexicon sizes, response types, and datasets with tasks ranging from single word substitutions to single and multiple free associations and multiple feature generation. We conclude that at the group level, the best method to retrace the response-generating concept as a vector in a multi-dimensional semantic space from the averaged vectors of participant responses is to collect multiple free associations (70 unique and 245 total responses per cue), to use fastText for meaning-to-vector mapping for responses and cues, and to consider each response in the centroid calculation as often as it occurred in the data. At the individual level, the best results are achieved by employing fastText and considering at least 8 responses per item per participant in the centroid calculation.
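A minimal sketch of the core computation behind centroid analysis; the embed() lookup (e.g., a fastText model) and the example responses are hypothetical placeholders, not the study's data.

```python
import numpy as np

def centroid(vectors, counts=None):
    """Geometric center of the response vectors; if counts are given,
    each response enters the average as often as it occurred."""
    V = np.asarray(vectors, dtype=float)
    if counts is None:
        return V.mean(axis=0)
    w = np.asarray(counts, dtype=float)
    return (V * w[:, None]).sum(axis=0) / w.sum()

def cosine(a, b):
    """Cosine similarity between two semantic vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage, with embed() standing in for a fastText lookup:
#   responses = ["bark", "leash", "pet"]          # free associations to a cue
#   c = centroid([embed(r) for r in responses], counts=[12, 7, 3])
#   cosine(c, embed("dog"))   # how closely the centroid retraces the cue concept
```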
Raveling, L., & Günther, F. (2025). Predicting the Rate of Novel Words from Word-Level Semantic Measures in a Taboo Game Setting. psyArXiv preprint, https://doi.org/10.31219/osf.io/4cxtw_v1
The present study investigates novel word production rates for existing concepts, expressed by single words. To capture a wide variety of semantic nuances that influence individual processes of word creation, we quantify the distributional, categorical and psychological properties of single words. These features are tested for their correlation with novel word response rates in an online study employing the Taboo Game Paradigm. Speakers are presented with a word whose meaning they must express as accurately as possible with a single word without using the target word itself. Based on our experimental results, we conclude that words with higher distributional vector norms have a higher likelihood of being expressed through a novel word. We also find that participants produce a higher rate of novel words for more concrete items, as well as for items that have lower connectivity within a network of taxonomically related words. These results are interpreted in the light of theories about the production of words.
Schoenegger, P.*, Salvi, F.*, Liu, J.*, Nan, X.*, Debnath, R.*, Fasolo, B.**, Leivada, E.**, Recchia, G.**, Günther, F.**, Zarifhonarvar, A., Kwon, J., Islam, Z. U., Dehnert, M., Lee, D. Y. H., Reinecke, M. G., Kamper, D. G., Kobaş, M., Sandford, A., Kgomo, J., Hewitt, L., Kapoor, S., Oktar, K., Kucuk, E. E., Feng, B., Jones, C. R., Gainsburg, I., Olschewski, S., Heinzelmann, N., Cruz, F., Tappin, B. M., Ma, T., Park, P. S., Onyonka, R., Hjorth, A., Slattery, P., Zeng, Q., Finke, L., Grossmann, I., Salatiello, A., Karger, E. (2025). Large Language Models Are More Persuasive Than Incentivized Human Persuaders. arXiv preprint, https://arxiv.org/abs/2505.09662
* equal contribution
** equal contribution
We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans who have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.
accepted
Dudschig, C., Günther, F., & Mackenzie, I. G. (accepted). Cognitive plausibility of count-based versus prediction-based word embeddings: A large-scale N400 study. Biological Psychology.
The N400 is a central electrophysiological event-related potential (ERP) marker thought to reflect meaning comprehension in the human brain. Typically, the N400 is larger when a word does not fit into a specific context (e.g., I drink coffee with cream and dog). Thus, one core factor determining the N400 amplitude is thought to be the predictability of a word within its context. Here, both long-term memory associations and short-term discourse context influence the N400 amplitude. In the present study, we used the N400 as a marker to investigate the cognitive plausibility of semantic similarity measures. Specifically, we compared traditional count-based measures to modern machine learning tools such as prediction-based word embeddings to assess whether prediction-based techniques potentially encapsulate learning mechanisms that align more closely with psychological plausibility. To do so, we examined the relationship between different similarity measures (LSA, HAL, and word2vec) and the N400 amplitude in a large-scale re-analysis of previously published EEG data. Model comparison suggested a superiority of HAL over LSA as a predictor in explaining single-trial N400 amplitudes, and also a benefit of prediction-based methods over count-based methods. This result aligns with the notion that such models might in the future provide further insights into how the brain navigates language understanding.
Günther, F., Raveling, L., Baier, F., & Petrenco, A. (accepted). "This is a monkeylope!" - A registered report on the factors of novel word creation. Journal of Experimental Psychology: Learning, Memory, and Cognition.
While native speakers of a language have tens of thousands of words at their disposal, they still regularly create and use novel words. Since this comes with communicative risks and costs – most importantly, not being understood by recipients – there have to be (perceived) advantages to creating novel words that outweigh these issues. We systematically explore these factors in a controlled experimental setting. We present participants with images of existing and new animals with increasing degrees of conceptual distance to existing animals. Participants have to refer to these images either in an open format or with single words. The images are presented either in isolation (Experiment 1), or with two distractors created from different (Experiment 2) or the same base animals (Experiment 3). Each image is repeatedly presented to investigate effects of repeated reference and presentation. We measure how often participants create novel word labels for the stimuli, and derive hypotheses for the main effects and key interactions of conceptual distance, target frequency, distractor frequency, response format, and task setting from a theoretical framework based on Gricean pragmatic principles as well as the concept of common ground. Conceptual distance, response format, and experimental task setting have clear effects on the number of novel word responses. However, we observe no effects for target or distractor frequency, suggesting no effect of common ground in our experimental setting. Participants also tend to over-specify their responses, rather than strictly adhering to pragmatic principles.
Petilli, M. A., & Günther, F. (accepted). Image-based word frequency norms. In Reference Module in Social Sciences, Elsevier.
Traditional word frequency norms are derived from text corpora. This article discusses word frequency derived from domain-specific visual corpora, where words are used to denote or label content in real-world scene images from large-scale datasets. These image-based frequency estimates capture aspects of language usage missing from traditional frequency measures, reflecting their nature at the intersection between language and vision. The article reviews the main approaches for creating these metrics (as well as measures derived from them), discusses studies validating their role as hybrid measures, and highlights their utility in complementing traditional word frequency norms to better address theoretical questions empirically.
2025
Dentella, V., Günther, F., & Leivada, E. (2025). Language in vivo vs. in silico: Size matters but Larger Language Models still do not comprehend language on a par with humans due to impenetrable semantic reference. PLoS One, 20(7), e0327794.
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N = 1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n = 80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality in the same way as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
Günther, F., & Cassani, G. (2025). Large Language Models in psycholinguistic studies. In Reference Module in Social Sciences, Elsevier.
We are currently witnessing a veritable explosion of studies employing Large Language Models (LLMs) in the cognitive sciences. Here, we focus on their use in psycholinguistics, that is, for the study of human language processing. LLMs are primarily trained to predict upcoming or masked words in a given context. We briefly describe the transformer architecture which endows LLMs with impressive abilities to achieve this objective, and review how the components of this architecture are of interest to psycholinguistics. We then review how LLMs are applied in research, focusing on (1) measuring surprisal/probabilities of a word given a context; (2) extracting the representations/embeddings these models produce; and (3) prompting/probing these models to produce an output, treating them similarly to human participants.
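As an illustration of use case (1), here is a minimal sketch of obtaining word surprisal given a context from an autoregressive model via the transformers library; GPT-2 and the example sentence are stand-ins, not a recommendation from the article.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal sketch: surprisal (in bits) of a target word given its left context.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal(context: str, word: str) -> float:
    context_ids = tokenizer(context, return_tensors="pt")["input_ids"]
    word_ids = tokenizer(" " + word, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([context_ids, word_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # each target token is scored by the prediction made at the preceding position
    nats = 0.0
    for i, tok in enumerate(word_ids[0]):
        nats -= log_probs[0, context_ids.shape[1] + i - 1, tok].item()
    return nats / math.log(2)

# a predictable continuation should be less surprising than an anomalous one
print(surprisal("I drink coffee with cream and", "sugar"))
print(surprisal("I drink coffee with cream and", "dog"))
```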
The lexicon of a language is subject to constant change, and new words constantly enter the lexicon. In principle, any word form that is not currently in the lexicon but adheres to the orthotactic rules of a language can be a novel word, including morphologically complex words but also pseudowords. However, such novel words differ in their semantic interpretability—how easily speakers can come up with an interpretation for them—which is of interest as both an independent and dependent variable in theory-building, computational modeling, and empirical studies. Here, we provide an overview of studies that make available (large) norms of semantic interpretability ratings and judgments, which will serve as a useful resource for future research.
2024
De Varda, A., Gatti, D., Marelli, M., & Günther, F. (2024). Meaning Beyond Lexicality: Capturing Pseudoword Definitions with Language Models. Computational Linguistics, 50, 1313-1343.
Pseudowords such as “knackets” or “spechy” – letter strings that are consistent with the orthotactical rules of a language but do not appear in its lexicon – are traditionally considered to be meaningless, and employed as such in empirical studies. However, recent studies that show specific semantic patterns associated with these words as well as semantic effects on human pseudoword processing have cast doubt on this view. While these studies suggest that pseudowords have meanings, they provide only extremely limited insight as to whether humans are able to ascribe explicit and declarative semantic content to unfamiliar word forms. In the present study, we employed an exploratory-confirmatory study design to examine this question. In a first exploratory study, we started from a pre-existing dataset of words and pseudowords alongside human-generated definitions for these items. Employing 18 different language models, we showed that the definitions actually produced for (pseudo)words were closer to their respective (pseudo)words than the definitions for the other items. Based on these initial results, we conducted a second, pre-registered, high-powered confirmatory study collecting a new, controlled set of (pseudo)word interpretations. This second study confirmed the results of the first one. Taken together, these findings support the idea that meaning construction is supported by a flexible form-to-meaning mapping system based on statistical regularities in the language environment that can accommodate novel lexical entries as soon as they are encountered.
Dentella, V., Günther, F., Murphy, E., Marcus, G., & Leivada, E. (2024). Testing AI on language comprehension tasks reveals insensitivity to underlying meaning. Scientific Reports, 14, 28083.
Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec’s Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n = 26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors in language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.
Gatti, D., Günther, F., & Rinaldi, L. (2024). A body map beyond perceptual experience. Journal of Cognition, 7, 22.
The human body is perhaps the most ubiquitous and salient visual stimulus that we encounter in our daily lives. Given the prevalence of images of human bodies in natural scene statistics, it is no surprise that our mental representations of the body are thought to strongly originate from visual experience. Yet, little is still known about high-level cognitive representations of the body. Here, we retrieved a body map from natural language, taking this as a window into high-level cognitive processes. We first extracted a matrix of distances between body parts from natural language data and employed this matrix to extrapolate a body map. To test the effectiveness of this high-level body map, we then conducted a series of experiments in which participants were asked to classify the distance between pairs of body parts, presented either as words or images. We found that the high-level body map was systematically activated when participants were making these distance judgments. Crucially, the linguistic map explained participants’ performance over and above the visual body map, indicating that the former cannot be simply conceived as a by-product of perceptual experience. These findings, therefore, establish the existence of a behaviorally relevant, high-level representation of the human body.
Gatti, D., Raveling, L., Petrenco, A., & Günther, F. (2024). Valence without meaning: Investigating form and semantic components in pseudowords valence. Psychonomic Bulletin & Review, 31, 2357-2369.
Valence is a dominant semantic dimension, and it is fundamentally linked to basic approach-avoidance behavior within a broad range of contexts. Previous studies have shown that it is possible to approximate the valence of existing words based on several surface-level and semantic components of the stimuli. In parallel, recent studies have shown that even completely novel and (apparently) meaningless stimuli, like pseudowords, can be informative of meaning based on the information that they carry at the sub-word level. Here, we aimed to further extend this evidence by investigating whether humans can reliably assign valence to pseudowords and, additionally, to identify the factors explaining such valence judgments. In Experiment 1, we trained several models to predict valence judgments for existing words from their combined form and meaning information. Then, in Experiments 2 and 3, we extended the results by predicting participants’ valence judgments for pseudowords, using a set of models indexing different (possible) sources of valence and selecting the best-performing model in a completely data-driven procedure. Results showed that the model including basic surface-level information (i.e., the letters composing the pseudoword) and orthographic-neighbor information performed best, thus tracing pseudoword valence back to these components. These findings support perspectives on the non-arbitrariness of language and provide insights regarding how humans process the valence of novel stimuli.
Günther, F., Marelli, M., & Petilli, M. A. (2024). The challenge of representation learning: Improved accuracy in deep vision models does not come with better predictions of perceptual similarity. In L. K. Samuelson, S. L. Frank, M. Toneva, A. Mackey, & E. Hazeltine (Eds.), Proceedings of the 46th Annual Meeting of the Cognitive Science Society (CogSci 2024) (p. 5236-5243).
Over the last few years, advancements in deep learning models for computer vision have led to a dramatic improvement in their image classification accuracy. However, models with a higher accuracy in the task they were trained on do not necessarily develop better image representations that allow them to also perform better in other tasks they were not trained on. In order to investigate the representation learning capabilities of prominent high-performing computer vision models, we investigated how well they capture various indices of perceptual similarity from large-scale behavioral datasets. We find that higher image classification accuracy rates are not associated with a better performance on these datasets, and in fact we observe no improvement in performance since GoogLeNet (released 2015) and VGG-M (released 2014). We speculate that more accurate classification may result from hyper-engineering towards very fine-grained distinctions between highly similar classes, which does not incentivize the models to capture overall perceptual similarities.
According to Frege's principle of compositionality, the meaning of a complex expression is determined as a function of its constituents and the type of construction that combines the constituents. For a given expression, compositionality refers to the degree to which the expression fulfills this principle, in particular when determined for complex words such as blackbird or globalize. Here, we present an overview of studies providing compositionality estimates for complex words, by defining a classification system that includes (1) the type of expression (compound nouns, particle verbs, derivations), (2) the language, (3) the level of description (i.e., focusing on individual constituents vs. the entire complex word), and (4) the information source providing the estimate (human judgments vs. computational models). Typical applications for compositionality estimates are discussed.
We identify and analyze three caveats that may arise when analyzing the linguistic abilities of Large Language Models. The problem of unlicensed generalizations refers to the danger of interpreting performance in one task as predictive of the models’ overall capabilities, based on the assumption that because a specific task performance is indicative of certain underlying capabilities in humans, the same association holds for models. The human-like paradox refers to the problem of lacking human comparisons, while at the same time attributing human-like abilities to the models. Last, the problem of double standards refers to the use of tasks and methodologies that either cannot be applied to humans or are evaluated differently in models vs. humans. While we recognize the impressive linguistic abilities of LLMs, we conclude that specific claims about the models’ human-likeness in the grammatical domain are premature.
Petilli, M. A., & Günther, F. (2024). Vision Spaces (ViSpa) in Language Sciences. In Reference Module in Social Sciences, Elsevier.
A Vision Space (ViSpa) is a mathematical structure in which concepts' visual features are represented in a high-dimensional vector space, enabling quantitative analyses of the relationships between them. By representing concepts as numeric vectors, ViSpa integrates visual knowledge into semantic models of concept representation, enriching how language research characterizes word meaning. The article presents the main approach for creating ViSpa systems, discusses studies demonstrating their validity as psychological models of concept representations with a special focus on language processing, and highlights their utility in addressing theoretical questions empirically. Additionally, it indicates some accessible resources and tools facilitating the creation and use of ViSpa.
Petilli, M. A., Günther, F., & Marelli, M. (2024). The Flickr frequency norms: what 17 years of images tagged online tell us about lexical processing. Behavior Research Methods, 56, 126-147.
Word frequency is one of the best predictors of language processing. Typically, word frequency norms are entirely based on natural-language text data, thus representing what the literature typically refers to as purely linguistic experience. This study presents Flickr frequency norms as a novel word frequency measure from a domain-specific corpus inherently tied to extra-linguistic information: words used as image tags on social media. To obtain Flickr frequency measures, we exploited the photo-sharing platform Flickr (containing billions of photos) and extracted the number of uploaded images tagged with each of the words considered in the lexicon. Here we systematically examine the peculiarities of Flickr frequency norms and show that Flickr frequency is a hybrid metric, lying at the intersection between language and visual experience and with specific biases induced by being based on image-focused social media. Moreover, regression analyses indicate that Flickr frequency captures additional information beyond what is already encoded in existing norms of linguistic, sensorimotor, and affective experience. Therefore, these new norms capture aspects of language usage that are missing from traditional frequency measures: a portion of language usage capturing the interplay between language and vision, which – this study demonstrates – has its own impact on word processing. The Flickr frequency norms are openly available on the Open Science Framework (https://osf.io/2zfs3/).
The search surface is a foundational concept in the visual search literature. It describes the impact of target-distractor (TD) and distractor-distractor (DD) similarity on search efficiency. However, the shape of the search surface lacks direct quantitative support, being a summary approximation of a wide range of lab-based results that generalise poorly to real-world scenarios. This study exploits convolutional neural networks to quantitatively assess the similarity effects in search tasks using real images as stimuli and to determine which levels of feature complexity the similarity effects rely on. Besides providing ecological converging evidence supporting the established search surface, our results reveal that TD and DD similarity mainly operate at two distinct layers of the network: DD similarity at the layer of coarse object features, and TD similarity at the layer of complex features used for classification. This suggests that these forms of similarity exert their major effects at two distinct levels of perceptual processing.
Pugacheva, V., & Günther, F. (2024). Lexical choice and word formation in a taboo game paradigm. Journal of Memory and Language, 135, 104477.
We investigate the onomasiological question of which words speakers actually use and produce when trying to convey an intended meaning. This is not limited to selecting the best-fitting available existing word, but also includes word formation, the coinage of novel words. In the first two experiments, we introduce the taboo game paradigm in which participants were instructed to produce a single-word substitution for different words so that others can later identify them. Using distributional semantic models with the capability to produce quantitative representations for existing and novel word responses, we find that (a) responses tend to be semantically close to the targets and (b) existing words were represented closer than novel words, but (c) even novel compounds were often closer than the targets’ free associates. In a final third experiment, we find that other participants are more likely to guess the correct original word (a) for responses closer to the original targets, and (b) for novel compound responses as compared to existing word responses. This shows that the production of both existing and novel words can be accurately captured in a unified computational framework of the semantic mechanisms driving word choice.
Sulpizio, S.*, Günther, F.*, Badan, L., Basclain, B., Brysbaert, M., Chan, Y. L., Ciaccio, L. A., Dudschig, C., Duñabeitia, J. A., Fasoli, F., Ferrand, L., Filipović Đurđević, D., Guerra, E., Hollis, G., Job, R., Jornkokgoud, K., Kahraman, H., Kgolo-Lotshwao, N., Kinoshita, S., Kos, J., Lee, L., Lee, N. H., Mackenzie, I. G., Manojlović, M., Manouilidou, C., Martinic, M., del Carmen Méndez, M., Mišić, K., Na Chiangmai, N., Nikolaev, A., Oganyan, M., Rusconi, P., Samo, G., Tse, C.-S., Westbury, C., Wongupparaj, P., Yap, M. J. & Marelli, M.* (2024). Taboo language across the globe: A multi-lab study. Behavior Research Methods, 56, 3794-3813.
*shared first authorship
The use of taboo words represents one of the most common and arguably universal linguistic behaviors, fulfilling a wide range of psychological and social functions. However, in the scientific literature, taboo language is poorly characterized, and how it is realized in different languages and populations remains largely unexplored. Here we provide a database of taboo words, collected from different linguistic communities (Study 1, N = 1,046), along with their speaker-centered semantic characterization (Study 2, N = 455 for each of six rating dimensions), covering 13 languages and 17 countries from all five permanently inhabited continents. Our results show that, in all languages, taboo words are mainly characterized by extremely low valence and high arousal, and very low written frequency. However, a significant amount of cross-country variability in words’ tabooness and offensiveness proves the importance of community-specific sociocultural knowledge in the study of taboo language.
Ulrich, R., de la Vega, I., Eikmeier, V., Günther, F., & Kaup, B. (2024). Mental association of time and valence. Memory & Cognition, 52, 444-458.
Five experiments investigated the association between time and valence. In the first experiment, participants classified temporal expressions (e.g., past, future) and positively or negatively connotated words (e.g., glorious, nasty) based on temporal reference or valence. They responded slower and made more errors in the mismatched condition (positive/past mapped to one hand, negative/future to the other) compared with the matched condition (positive/future to one hand, negative/past to the other hand). Experiment 2 confirmed the generalization of the match effect to nonspatial responses, while Experiment 3 found no reversal of this effect for left-handers. Overall, the results of the three experiments indicate a robust match effect, associating the past with negative valence and the future with positive valence. Experiment 4 involved rating the valence of time-related words, showing higher ratings for future-related words. Additionally, Experiment 5 employed latent semantic analysis and revealed that linguistic experiences are unlikely to be the source of this time–valence association. An interactive activation model offers a quantitative explanation of the match effect, potentially arising from a favorable perception of the future over the past.
2023
Chai-allah, A., Fox, N., Günther, F., Bentayeb, F., Brunschwig, G., Bimonte, S., & Joly, F. (2023). Mining crowdsourced text to capture hikers' perceptions associated with landscape features and outdoor physical activities. Ecological Informatics, 78, 102332.
Outdoor recreation provides vital interactions between humans and ecological systems, with a range of mental and physical benefits for people. Despite the increased number of studies using crowdsourced online data to assess how people interact with the landscape during recreational activities, the focus remains largely on mapping the spatial distribution of visitors or analyzing the content of shared images, and little work has been done to quantify the perceptions and emotions people assign to the landscape. In this study, we used crowdsourced textual data from an outdoor activity-sharing platform (Wikiloc), and applied Natural Language Processing (NLP) methods and correlation analysis to capture hikers' perceptions associated with landscape features and physical outdoor activities. Our results indicate eight clusters based on the semantic similarity between words, ranging from four clusters describing landscape features (“ecosystems, animals & plants”, “geodiversity”, “climate & weather”, and “built cultural heritage”), to one cluster describing the range of physical outdoor activities, and three clusters indicating hikers' perceptions and emotions (“aesthetics”, “joy & restoration” and “physical effort sensation”). The association analysis revealed that the cluster “ecosystems, animals & plants” is likely to stimulate all three identified perceptions, suggesting that these natural features are important for hikers during their outdoor experience. Moreover, hikers strongly associate the cluster “outdoor physical activities” with both the “joy & restoration” and “physical effort sensation” perceptions, highlighting the health and well-being benefits of physical activities in natural landscapes. Our study shows the potential of Wikiloc as a valuable data source to assess human-nature interactions and how textual data can provide significant advances in understanding people's preferences and perceptions while recreating. These findings can help inform outdoor recreation planners in the study region by focusing on the elements of the landscape that people perceive to be important (i.e., “ecosystems, animals & plants”).
Dentella, V., Günther, F., & Leivada, E. (2023). Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. PNAS, 120(51), e2309583120.
Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap on 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, a significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition aids the models to converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs’ performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.
Günther, F., & Marelli, M. (2023). CAOSS and Transcendence: Modeling role-dependent constituent meanings in compounds. Morphology, 33, 409–432.
Many theories on the role of semantics in morphological representation and processing focus on the interplay between the lexicalized meaning of the complex word on the one hand, and the individual constituent meanings on the other hand. However, the constituent meaning representations at play do not necessarily correspond to the free-word meanings of the constituents: Role-dependent constituent meanings can be subject to sometimes substantial semantic shift from their corresponding free-word meanings (such as -bill in hornbill and razorbill, or step- in stepmother and stepson). While this phenomenon is extremely difficult to operationalize using the standard psycholinguistic toolkit, we demonstrate how these as-constituent meanings can be represented in a quantitative manner using a data-driven computational model. After a qualitative exploration, we validate the model against a large database of human ratings of the meaning retention of constituents in compounds. With this model at hand, we then proceed to investigate the internal semantic structure of compounds, focussing on differences in semantic shift and semantic transparency between the two constituents.
Günther, F., Marelli, M., Tureski, S., & Petilli, M. A. (2023). ViSpa (Vision Spaces): A computer-vision-based representation system for individual images and concept prototypes, with large-scale evaluation. Psychological Review, 130, 896-934.
Quantitative, data-driven models for mental representations have long enjoyed popularity and success in psychology (for example, distributional semantic models in the language domain), but have largely been missing for the visual domain. To overcome this, we present ViSpa (Vision Spaces), high-dimensional vector spaces that include vision-based representation for naturalistic images as well as concept prototypes. These vectors are derived directly from visual stimuli through a deep convolutional neural network (DCNN) trained to classify images, and allow us to compute vision-based similarity scores between any pair of images and/or concept prototypes. We successfully evaluate these similarities against human behavioral data in a series of large-scale studies, including off-line judgments – visual similarity judgments for the referents of word pairs (Study 1) and for image pairs (Study 2), and typicality judgments for images given a label (Study 3) – as well as on-line processing times and error rates in a discrimination (Study 4) and priming task (Study 5) with naturalistic image material. ViSpa similarities predict behavioral data across all tasks, which renders ViSpa a theoretically appealing model for vision-based representations and a valuable research tool for data analysis and the construction of experimental material: ViSpa allows for precise control over experimental material consisting of images (also in combination with words), and introduces a specifically vision-based similarity for word pairs. To make ViSpa available to a wide audience, this article a) includes (video) tutorials on how to use ViSpa in R, and b) presents a user-friendly web interface at http://vispa.fritzguenther.de.
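For readers unfamiliar with the general approach, here is a minimal sketch of deriving image vectors from a pretrained DCNN and computing vision-based similarities; it uses a generic torchvision network and placeholder file names, not the specific network or pipeline behind ViSpa.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Generic sketch, not the ViSpa pipeline itself: represent images by the
# activations of a pretrained CNN's penultimate layer and compare them.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier = model.classifier[:-1]   # drop the final classification layer
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def image_vector(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)           # 4096-dimensional image vector

def vision_similarity(path_a: str, path_b: str) -> float:
    a, b = image_vector(path_a), image_vector(path_b)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# vision_similarity("dog_1.jpg", "dog_2.jpg")   # placeholder file names
# a concept prototype can be approximated by averaging the vectors of many exemplars
```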
Körner, A., Castillo, M., Drijvers, L., Fischer, M. H., Günther, F., Marelli, M., Platonova, O., Rinaldi, L., Shaki, S., Trujillo, J. P., Tsaregorodtseva, O., & Glenberg, A. M. (2023). Embodied Processing at Six Linguistic Granularity Levels: A Consensus Paper. Journal of Cognition, 6(1), 60.
Language processing is influenced by sensorimotor experiences. Here, we review behavioral evidence for embodied and grounded influences in language processing across six linguistic levels of granularity. We examine (a) sub-word features, discussing grounded influences on iconicity (systematic associations between word form and meaning); (b) words, discussing boundary conditions and generalizations for the simulation of color, sensory modality, and spatial position; (c) sentences, discussing boundary conditions and applications of action direction simulation; (d) texts, discussing how the teaching of simulation can improve comprehension in beginning readers; (e) conversations, discussing how multi-modal cues improve turn taking and alignment; and (f) text corpora, discussing how distributional semantic models can reveal how grounded and embodied knowledge is encoded in texts. These approaches are converging on a convincing account of the psychology of language, but at the same time, there are important criticisms of the embodied approach and of specific experimental paradigms. The surest way forward requires the adoption of a wide array of scientific methods. By providing complementary evidence, a combination of multiple methods on various levels of granularity can help us gain a more complete understanding of the role of embodiment and grounding in language processing.
2022
Günther, F., & Marelli, M. (2022). Patterns in CAOSS: Distributed representations predict variation in relational interpretations for familiar and novel compound words. Cognitive Psychology, 134, 101471.
While distributional semantic models that represent word meanings as high-dimensional vectors induced from large text corpora have been shown to successfully predict human behavior across a wide range of tasks, they have also received criticism from different directions. These include concerns over their interpretability (how can numbers specifying abstract, latent dimensions represent meaning?) and their ability to capture variation in meaning (how can a single vector representation capture multiple different interpretations for the same expression?). Here, we demonstrate that semantic vectors can indeed rise up to these challenges, by training a mapping system (a simple linear regression) that predicts inter-individual variation in relational interpretations for compounds such as wood brush (for example brush FOR wood, or brush MADE OF wood) from (compositional) semantic vectors representing the meanings of these compounds. These predictions consistently beat different random baselines, both for familiar compounds (moon light, Experiment 1) as well as novel compounds (wood brush, Experiment 2), demonstrating that distributional semantic vectors encode variations in qualitative interpretations that can be decoded using techniques as simple as linear regression.
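A minimal sketch of this kind of mapping, using randomly generated stand-in data in place of the CAOSS compound vectors and relational interpretation ratings used in the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: 200 compounds, 300-dimensional (compositional) embeddings,
# and the observed proportions of three relational interpretations per compound
# (e.g., FOR / MADE OF / ABOUT); the actual study uses more relations and items.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))            # compound vectors (stand-ins)
y = rng.dirichlet(np.ones(3), size=200)    # interpretation proportions (stand-ins)

mapping = LinearRegression().fit(X, y)     # one linear map, multiple outputs
predicted_profile = mapping.predict(X[:1]) # predicted interpretation profile
print(predicted_profile)
```

In practice, such a mapping is evaluated by comparing its predicted interpretation profiles for held-out compounds against random baselines, which is the comparison the abstract describes.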
Günther, F., Petilli, M. A., Vergallito, A., & Marelli, M. (2022). Images of the unseen: Extrapolating visual representations for abstract and concrete words in a data-driven computational model. Psychological Research, 86, 2512–2532.
Theories of grounded cognition assume that conceptual representations are grounded in sensorimotor experience. However, abstract concepts such as jealousy or childhood have no directly associated referents with which such sensorimotor experience can be gained; therefore, the grounding of abstract concepts has long been a topic of debate. Here, we propose (a) that systematic relations exist between semantic representations learned from language on the one hand and perceptual experience on the other hand, (b) that these relations can be learned in a bottom-up fashion, and (c) that it is possible to extrapolate from this learning experience to predict expected perceptual representations for words even where direct experience is missing. To test this, we implement a data-driven computational model that is trained to map language-based representations (obtained from text corpora, representing language experience) onto vision-based representations (obtained from an image database, representing perceptual experience), and apply its mapping function onto language-based representations for abstract and concrete words outside the training set. In three experiments, we present participants with these words, accompanied by two images: the image predicted by the model and a random control image. Results show that participants’ judgements were in line with model predictions even for the most abstract words. This preference was stronger for more concrete items and decreased for the more abstract ones. Taken together, our findings have substantial implications in support of the grounding of abstract words, suggesting that we can tap into our previous experience to create possible visual representations we don’t have.
Günther, F., Press, S. A., Dudschig, C., & Kaup, B. (2022). The limits of automatic sensorimotor processing during word processing: Investigations with repeated linguistic experience, memory consolidation during sleep, and rich linguistic learning contexts. Psychological Research, 86, 1792-1803.
While a number of studies have repeatedly demonstrated an automatic activation of sensorimotor experience during language processing in the form of action-congruency effects, as predicted by theories of grounded cognition, more recent research has not found these effects for words that were just learned from linguistic input alone, without sensorimotor experience with their referents. In the present study, we investigate whether this absence of effects can be attributed to a lack of repeated experience and consolidation of the associations between words and sensorimotor experience in memory. To address these issues, we conducted four experiments in which (1 and 2) participants engaged in two separate learning phases in which they learned novel words from language alone, with an intervening period of memory-consolidating sleep, and (3 and 4) we employed familiar words whose referents speakers have no direct experience with (such as plankton). However, we again did not observe action-congruency effects in subsequent test phases in any of the experiments. This indicates that direct sensorimotor experience with word referents is a necessary requirement for automatic sensorimotor activation during word processing.
Günther, F., & Rinaldi, L. (2022). Language statistics as a window into mental representations. Scientific Reports, 12, 8043.
Large-scale linguistic data is nowadays available in abundance. Using this source of data, previous research has identified redundancies between the statistical structure of natural language and properties of the (physical) world we live in. For example, it has been shown that we can gauge city sizes by analyzing their respective word frequencies in corpora. However, since natural language is always produced by human speakers, we point out that such redundancies can only come about indirectly and should necessarily be restricted to cases where human representations largely retain characteristics of the physical world. To demonstrate this, we examine the statistical occurrence of words referring to body parts in very different languages, covering nearly 4 billion native speakers. This is because the convergence between language and physical properties of the stimuli clearly breaks down for the human body (i.e., more relevant and functional body parts are not necessarily larger in size). Our findings indicate that the human body as extracted from language does not retain its actual physical proportions; instead, it resembles the distorted human-like figure known as the sensory homunculus, whose form depicts the amount of cortical area dedicated to sensorimotor functions of each body part (and, thus, their relative functional relevance). This demonstrates that the surface-level statistical structure of language opens a window into how humans represent the world they live in, rather than into the world itself.
2021
Capuano, F., Dudschig, C., Günther, F., & Kaup, B. (2021). Semantic Similarity of Alternatives fostered by Conversational Negation. Cognitive Science, 45, e13015.
Conversational negation often behaves differently from negation as a logical operator: when rejecting a state of affairs, it does not present all members of the complement set as equally plausible alternatives, but it rather suggests some of them as more plausible than others (e.g., “This is not a dog, it is a wolf/*screwdriver”). Entities that are semantically similar to a negated entity tend to be judged as better alternatives (Kruszewski et al., 2016). In fact, Kruszewski et al. (2016) show that the cosine similarity scores between the distributional semantics representations of a negated noun and its potential alternatives are highly correlated with the negated noun-alternatives human plausibility ratings. In a series of cloze tasks, we show that negation likewise restricts the production of plausible alternatives to similar entities. Furthermore, completions to negative sentences appear to be even more restricted than completions to an affirmative conjunctive context, hinting at a peculiarity of negation.
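A toy Python sketch of the underlying similarity logic (hand-made vectors standing in for real distributional representations): alternatives closer to the negated noun in semantic space should rank as more plausible:

# Alternatives that are closer to the negated noun in semantic space
# ("wolf" for "dog") should be rated/produced more often than distant
# ones ("screwdriver"). Tiny hand-made vectors replace real embeddings.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# dimensions (invented): [animal, domestic, tool, metallic]
vectors = {
    "dog":         [0.9, 0.8, 0.0, 0.0],
    "wolf":        [0.9, 0.1, 0.0, 0.0],
    "cat":         [0.9, 0.7, 0.0, 0.1],
    "screwdriver": [0.0, 0.1, 0.9, 0.8],
}

negated = "dog"
alternatives = ["wolf", "cat", "screwdriver"]
ranked = sorted(alternatives,
                key=lambda w: cosine(vectors[negated], vectors[w]),
                reverse=True)
for w in ranked:
    print(f"{w:12s} {cosine(vectors[negated], vectors[w]):.2f}")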
Gupta, A., Günther, F., Plag, I., Kallmeyer, L., & Conrad, S. (2021). Combining text and vision in compound semantics: Towards a cognitively plausible multimodal model. In K. Evang, L. Kallmeyer, R. Osswald, J. Waszczuk, & T. Zesch (Eds.), Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021) (p. 218-222). Düsseldorf, Germany: KONVENS 2021 Organizers.
In the current state-of-the-art distributional semantics model of the meaning of noun-noun compounds (such as chainsaw, butterfly, home phone), CAOSS (Marelli et al. 2017), the semantic vectors of the individual constituents are combined, and enriched by position-specific information for each constituent in its role as either modifier or head. More recently, there have been attempts to include vision-based embeddings in these models (Günther et al., 2020b), using the linear architecture implemented in the CAOSS model. In the present paper, we extend this line of research and demonstrate that moving to nonlinear models improves the results for vision, while linear models remain a good choice for text. Simply concatenating text and vision vectors does not yet improve the prediction of human behavioral data over models using text- and vision-based measures separately.
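For readers unfamiliar with CAOSS-style composition, the following is a minimal sketch under simplifying assumptions (toy random vectors, least-squares estimation): the compound vector is approximated as c = M·u + H·v with position-specific matrices for modifier and head; a nonlinear variant would replace the linear map with, for example, a small feed-forward network:

# Minimal sketch of a CAOSS-style compositional model. Toy random data
# replaces real text- or vision-based embeddings.
import numpy as np

rng = np.random.default_rng(1)
n_compounds, d = 1000, 50

U = rng.normal(size=(n_compounds, d))     # modifier vectors
V = rng.normal(size=(n_compounds, d))     # head vectors
M_true = rng.normal(size=(d, d)) * 0.1
H_true = rng.normal(size=(d, d)) * 0.1
C = U @ M_true + V @ H_true               # "observed" compound vectors (toy)

# Estimate [M; H] jointly by least squares on the concatenated constituents
X = np.hstack([U, V])                          # shape (n, 2d)
W, *_ = np.linalg.lstsq(X, C, rcond=None)      # shape (2d, d)
M_hat, H_hat = W[:d], W[d:]

# Compose a new (unattested) compound from its constituents
u_new, v_new = rng.normal(size=d), rng.normal(size=d)
c_pred = u_new @ M_hat + v_new @ H_hat
print(c_pred.shape)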
Petilli, M. A., Günther, F., Vergallito, A., Ciapparelli, M., & Marelli, M. (2021). Data-driven computational models reveal perceptual simulation in word processing. Journal of Memory and Language, 117, 104194.
In their strongest formulation, theories of grounded cognition claim that concepts are made up of sensorimotor information. Given this equivalence, perceptual properties of objects should consistently influence processing, even in purely linguistic tasks, where perceptual information is neither solicited nor required. Previous studies have tested this prediction in semantic priming tasks, but they have not observed perceptual influences on participants’ performance. However, those findings suffer from critical shortcomings, which may have prevented potential visually grounded/perceptual effects from being detected. Here, we investigate this topic by applying an innovative method expected to increase the sensitivity in detecting such perceptual effects. Specifically, we adopt an objective, data-driven, computational approach to independently quantify vision-based and language-based similarities for prime-target pairs on a continuous scale. We test whether these measures predict behavioural performance in a semantic priming mega-study with various experimental settings. Vision-based similarity was found to facilitate performance, but a dissociation between vision-based and language-based effects was also observed. Thus, in line with theories of grounded cognition, perceptual properties can facilitate word processing even in purely linguistic tasks, but the behavioural dissociation at the same time challenges strong claims of sensorimotor and conceptual equivalence.
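A schematic Python sketch of the analysis logic, with simulated similarities and response times in place of the real mega-study data: both a language-based and a vision-based similarity are entered as predictors of RTs in a single regression:

# For each prime-target pair, a language-based and a vision-based similarity
# (cosines between embeddings) are used as joint predictors of response
# times. All data below are simulated placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_pairs = 400
lang_sim = rng.uniform(0, 1, n_pairs)    # language-based cosine per pair
vis_sim = rng.uniform(0, 1, n_pairs)     # vision-based cosine per pair

# Simulated RTs: both similarities facilitate (negative slopes), plus noise
rt = 650 - 40 * lang_sim - 25 * vis_sim + rng.normal(0, 30, n_pairs)

X = sm.add_constant(np.column_stack([lang_sim, vis_sim]))
model = sm.OLS(rt, X).fit()
print(model.params)   # separate estimates for language- and vision-based effects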
2020
While morphemes are theoretically defined as linguistic units linking form and meaning, semantic effects in morphological processing are not reported consistently in the literature on derived and compound words. The lack of consistency in this line of research has often been attributed to methodological differences between studies or contextual effects. In this paper, we advance a different proposal where semantic effects emerge quite consistently if semantics is defined in a dynamic and flexible way, relying on distributional semantics approaches. In this light, we revisit morphological processing, taking a markedly cognitive perspective, as allowed by models that focus on morphology as systematic meaning transformation or that focus on the mapping between the orthographic form of words and their meanings.
Günther, F., & Marelli, M. (2020). Trying to make it work: Compositional effects in the processing of compound "nonwords". Quarterly Journal of Experimental Psychology, 73, 1082-1091.
Speakers of languages with synchronically productive compounding systems, such as English, are likely to encounter new compounds on a daily basis. These can only be useful for communication if speakers are able to rapidly compose their meanings. However, while compositional meanings can be obtained for some novel compounds such as bridgemill, this is far harder for others such as radiosauce; accordingly, processing speed should be affected by the ease of such a compositional process. To rigorously test this hypothesis, we employed a fully implemented computational model based on distributional semantics to quantitatively measure the degree of semantic compositionality of novel compounds. In two large-scale studies, we collected timed sensibility judgements and lexical decisions for hundreds of morphologically structured nonwords in English. Response times were predicted by the constituents’ semantic contribution to the compositional process, with slower rejections for more compositional nonwords. We found no indication of a difference in these compositional effects between the tasks, suggesting that speakers automatically engage in a compositional process whenever they encounter morphologically structured stimuli, even when it is not required by the task at hand. Such compositional effects in the processing of novel compounds have important implications for studies that employ such stimuli as filler material or “nonwords,” as response times for these items can differ greatly depending on their compositionality.
Günther, F., Marelli, M., & Bölte, J. (2020). Semantic transparency effects in German compounds: A large dataset and multiple-task investigation. Behavior Research Methods, 52, 1208-1224.
In the present study, we provide a comprehensive analysis and a multi-dimensional dataset of semantic transparency measures for 1,810 German compound words. Compound words are considered semantically transparent when the contribution of the constituents’ meaning to the compound meaning is clear (as in airport), but the degree of semantic transparency varies between compounds (compare strawberry or sandman). Our dataset includes both compositional and relatedness-based semantic transparency measures, also differentiated by constituents. The measures are obtained from a computational and fully implemented semantic model based on distributional semantics. We validate the measures using data from four behavioral experiments: Explicit transparency ratings, two different lexical decision tasks using different nonwords, and an eye-tracking study. We demonstrate that different semantic effects emerge in different behavioral tasks, which can only be captured using a multi-dimensional approach to semantic transparency. We further provide the semantic transparency measures derived from the model for a dataset of 40,475 additional German compounds, as well as for 2,061 novel German compounds.
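The two families of measures can be illustrated with a minimal sketch using toy vectors (the actual measures are derived from a fully implemented distributional model): relatedness-based transparency compares the compound vector with each constituent, while composition-based transparency compares it with a composed vector:

# Sketch of the two transparency families for a single compound, with toy
# vectors and a placeholder composition function.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
d = 50
modifier, head = rng.normal(size=d), rng.normal(size=d)
observed_compound = 0.4 * modifier + 0.5 * head + rng.normal(scale=0.3, size=d)
composed_compound = 0.5 * (modifier + head)    # placeholder composition

transparency = {
    "relatedness_modifier": cosine(observed_compound, modifier),
    "relatedness_head":     cosine(observed_compound, head),
    "composition_based":    cosine(observed_compound, composed_compound),
}
print(transparency)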
Günther, F., Nguyen, T., Chen, L., Dudschig, C., Kaup, B., & Glenberg, A. M. (2020). Immediate sensorimotor grounding of novel concepts learned from language alone. Journal of Memory and Language, 115, 104172.
Theories of grounded cognition postulate that concepts are grounded in sensorimotor experience. But how can that be for concepts like Atlantis for which we do not have that experience? We claim that such concepts obtain their sensorimotor grounding indirectly, via already-known concepts used to describe them. Participants learned novel words referring to up or down concepts (mende = enhanced head or mende = bionic foot). In a first experiment, participants then judged the sensibility of sentences implying up or down actions (e.g., “You scratch your bionic foot”) by performing up or down hand movements. Reactions were faster when the hand movement matched the direction of the implied movement. In the second experiment, we observed the same congruency effect for sentences like, “You scratch your mende”, whose implied direction depended entirely on the learning phase. This offers a perspective on how concepts learned without direct experience can nonetheless be grounded in sensorimotor experience.
Günther, F., Petilli, M. A., & Marelli, M. (2020). Semantic transparency is not invisibility: A computational model of perceptually-grounded conceptual combination in word processing. Journal of Memory and Language, 112, 104104.
Previous studies found that an automatic meaning-composition process affects the processing of morphologically complex words, and related this operation to conceptual combination. However, research on embodied cognition demonstrates that concepts are more than just lexical meanings, rather being also grounded in perceptual experience. Therefore, perception-based information should also be involved in mental operations on concepts, such as conceptual combination. Consequently, we should expect to find perceptual effects in the processing of morphologically complex words. In order to investigate this hypothesis, we present the first fully-implemented and data-driven model of perception-based (more specifically, vision-based) conceptual combination, and use the predictions of such a model to investigate processing times for compound words in four large-scale behavioral experiments employing three paradigms (naming, lexical decision, and timed sensibility judgments). We observe facilitatory effects of vision-based compositionality in all three paradigms, over and above a strong language-based (lexical and semantic) baseline, thus demonstrating for the first time perceptually grounded effects at the sub-lexical level. This suggests that perceptually-grounded information is not only utilized according to specific task demands but rather automatically activated when available.
2019
Forthmann, B., Oyebade, O., Ojo, A., Günther, F., & Holling, H. (2019). Application of latent semantic analysis to divergent thinking is biased by elaboration. Journal of Creative Behavior, 53, 559-575.
Scoring divergent-thinking response sets has always been challenging because such responses are not only open-ended in terms of number of ideas, but each idea may also be expressed by a varying number of concepts and, thus, by a varying number of words (elaboration). While many current studies have attempted to score the semantic distance in divergent-thinking responses by applying latent semantic analysis (LSA), it is known from other areas of research that LSA-based approaches are biased according to the number of words in a response. Thus, the current article aimed to identify and demonstrate this elaboration bias in LSA-based divergent-thinking scores by means of a simulation. In addition, we show that this elaboration bias can be reduced by removing stop words (for example, 'and', 'or', and 'for') prior to analysis. Furthermore, the residual bias after stop word removal can be reduced by simulation-based corrections. Finally, we give an empirical illustration for alternate uses and consequences tasks. Results suggest that when both stop word removal and simulation-based bias correction are applied, convergent validity should be expected to be highest.
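A minimal Python sketch of the preprocessing step (a toy random space and an ad-hoc stop-word list, not the materials used in the study): stop words are stripped from a response before its semantic distance to the prompt is computed:

# Strip stop words from a divergent-thinking response before computing its
# semantic distance to the prompt, so that longer (more elaborated) answers
# are not biased merely by containing more function words.
import numpy as np

rng = np.random.default_rng(4)
STOP_WORDS = {"a", "the", "and", "or", "for", "to", "of", "it", "you", "can"}
vocab_vec = {}   # lazily created toy vectors standing in for an LSA space

def vector(word):
    return vocab_vec.setdefault(word, rng.normal(size=100))

def response_vector(text):
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return np.mean([vector(w) for w in words], axis=0)

def semantic_distance(prompt, response):
    p, r = response_vector(prompt), response_vector(response)
    return 1 - float(p @ r / (np.linalg.norm(p) * np.linalg.norm(r)))

print(semantic_distance("brick", "you can use it to build a wall for a house"))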
Günther, F., & Marelli, M. (2019). Enter sand-man: Compound processing and semantic transparency in a compositional perspective. Journal of Experimental Psychology: Learning, Memory, and Cognition, 45, 1872–1882.
Effects of semantic transparency, reflected in processing differences between semantically transparent (teabag) and opaque (ladybird) compounds, have received considerable attention in the investigation of the role of constituents in compound processing. However, previous studies have yielded inconsistent results. In the present article, we argue that this is because semantic transparency is often conceptualized only as the semantic relatedness between the compound and constituent meanings as separate units. This neglects the fact that compounds are inherently productive constructions. We argue that compound processing is routinely impacted by a compositional process aimed at computing a compositional meaning, which would cause compositional semantic transparency effects to emerge in compound processing. We employ recent developments in compositional distributional semantics to quantify relatedness-based as well as composition-based semantic transparency measures and use these to predict lexical decision times in a large-scale data set. We observed semantic transparency effects on compound processing that are not captured in relatedness terms but only by adopting a compositional perspective.
Models that represent meaning as high-dimensional numerical vectors—such as latent semantic analysis (LSA), hyperspace analogue to language (HAL), bound encoding of the aggregate language environment (BEAGLE), topic models, global vectors (GloVe), and word2vec—have been introduced as extremely powerful machine-learning proxies for human semantic representations and have seen an explosive rise in popularity over the past 2 decades. However, despite their considerable advancements and spread in the cognitive sciences, one can observe problems associated with the adequate presentation and understanding of some of their features. Indeed, when these models are examined from a cognitive perspective, a number of unfounded arguments tend to appear in the psychological literature. In this article, we review the most common of these arguments and discuss (a) what exactly these models represent at the implementational level and their plausibility as a cognitive theory, (b) how they deal with various aspects of meaning such as polysemy or compositionality, and (c) how they relate to the debate on embodied and grounded cognition. We identify common misconceptions that arise as a result of incomplete descriptions, outdated arguments, and unclear distinctions between theory and implementation of the models. We clarify and amend these points to provide a theoretical basis for future research and discussions on vector models of semantic representation.
Günther, F., Smolka, E., & Marelli, M. (2019). 'Understanding' differs between English and German: Capturing Systematic Language Differences of Complex Words. Cortex, 116, 168-175.
In morphological processing, research has repeatedly found different priming effects for English and German native speakers in the overt priming paradigm. In English, priming effects were found for word pairs with a morphological and semantic relation (SUCCESSFUL-success), but not for pairs without a semantic relation (SUCCESSOR-success). By contrast, morphological priming effects in German occurred for pairs both with a semantic relation (AUFSTEHEN-stehen, ‘stand up’-‘stand’) and without (VERSTEHEN-stehen, ‘understand’-‘stand’). These behavioural differences have been taken to indicate differential language processing and memory representations in these languages. We examine whether these behavioural differences can be explained by differences in the language structure between English and German. To this end, we employed new developments in distributional semantics as a computational method to obtain both observed and compositional representations for transparent and opaque complex word meanings, which can in turn be used to quantify the degree of semantic predictability of the morphological system of a language. We compared the similarities between transparent and opaque words and their stems, and observed a difference between German and English, with German showing a higher morphological systematicity. The present results indicate that the investigated cross-linguistic effect can be attributed to quantitatively characterized differences in the speakers' language experience, as approximated by linguistic corpora.
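A toy sketch of the comparison logic (invented vectors; the study used corpus-derived representations): average the similarity between complex words and their stems separately for transparent and opaque pairs, and compare the gap across languages:

# For each language, average the cosine similarity between complex words and
# their stems, separately for transparent and opaque pairs; a smaller
# transparent-opaque gap indicates a more systematic morphology.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_stem_similarity(pairs, vectors):
    return np.mean([cosine(vectors[complex_w], vectors[stem])
                    for complex_w, stem in pairs])

rng = np.random.default_rng(5)
stem = rng.normal(size=50)
vectors = {
    "stehen": stem,
    "aufstehen": stem + rng.normal(scale=0.3, size=50),   # transparent (toy)
    "verstehen": stem + rng.normal(scale=0.8, size=50),   # opaque (toy)
}
transparent = [("aufstehen", "stehen")]
opaque = [("verstehen", "stehen")]
gap = mean_stem_similarity(transparent, vectors) - mean_stem_similarity(opaque, vectors)
print(f"transparent-opaque gap: {gap:.2f}")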
2018
Günther, F., Dudschig, C., & Kaup, B. (2018). Symbol grounding without direct experience: Do words inherit sensorimotor activation from purely linguistic context? Cognitive Science, 42, 336-374.
Theories of embodied cognition assume that concepts are grounded in non-linguistic, sensorimotor experience. In support of this assumption, previous studies have shown that upwards response movements are faster than downwards movements after participants have been presented with words whose referents are typically located in the upper vertical space (and vice versa for downwards responses). This is taken as evidence that processing these words reactivates sensorimotor experiential traces. This congruency effect was also found for novel words, after participants learned these words as labels for novel objects that they encountered either in their upper or lower visual field. While this indicates that direct experience with a word’s referent is sufficient to evoke said congruency effects, the present study investigates whether this direct experience is also a necessary condition. To this end, we conducted five experiments in which participants learned novel words from purely linguistic input: Novel words were presented in pairs with real up- or down-words (Experiment 1); they were presented in natural sentences where they replaced these real words (Experiment 2); they were presented as new labels for these real words (Experiment 3); and they were presented as labels for novel combined concepts based on these real words (Experiments 4 and 5). In all five experiments, we did not find any congruency effects elicited by the novel words; however, participants were always able to make correct explicit judgements about the vertical dimension associated with the novel words. These results suggest that direct experience is necessary for reactivating experiential traces, but this reactivation is not a necessary condition for understanding (in the sense of storing and accessing) the corresponding aspects of word meaning.
Günther, F., & Marelli, M. (2018). The language-invariant aspect of compounding: Predicting compound meanings across languages. In E. Cabrio, A. Mazzei, & F. Tamburini (Eds.), Proceedings of the Fifth Italian Conference on Computational Linguistics (pp. 230-234). Turin, Italy: Accademia University Press.
In the present study, we investigated to what extent compounding involves general-level cognitive abilities related to conceptual combination. If that were the case, the compounding mechanism should be largely invariant across different languages. Under this assumption, a compositional model trained on word representations in one language should be able to predict compound meanings in other languages. We investigated this hypothesis by training a word embedding-based compositional model on a set of English compounds, and subsequently applied this model to German and Italian test compounds. The model partially predicted compound meanings in German, but not in Italian.
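A compact sketch of the transfer step under a simplifying assumption (that the languages' constituent vectors live in a shared, aligned space): composition matrices estimated on English compounds are applied to German constituents; the matrices below are random placeholders rather than fitted ones:

# Cross-lingual application of a CAOSS-style composition model: matrices
# estimated on English compounds (see the earlier sketch) are applied to
# constituents of a German test compound. Random placeholders throughout.
import numpy as np

rng = np.random.default_rng(6)
d = 50
M_hat, H_hat = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # stand-ins for fitted matrices

def compose(modifier_vec, head_vec):
    return modifier_vec @ M_hat + head_vec @ H_hat

# Apply to a German test compound, e.g. "Haus" + "Tür" -> "Haustür"
haus, tuer = rng.normal(size=d), rng.normal(size=d)
predicted_haustuer = compose(haus, tuer)
# Evaluation would compare predicted_haustuer to the observed "Haustür"
# vector, e.g. via cosine similarity or neighbour rank.
print(predicted_haustuer[:5])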
2016
Günther, F., Dudschig, C., & Kaup, B. (2016). Predicting lexical priming effects from distributional semantic similarities: A replication with extension. Frontiers in Psychology, 7, 1646.
In two experiments, we attempted to replicate findings by Günther, Dudschig, and Kaup (2016) that word similarity measures obtained from distributional semantics models - Latent Semantic Analysis (LSA) and Hyperspace Analogue to Language (HAL) - predict lexical priming effects. To this end, we used the pseudo-random method introduced by Günther et al. for generating item material while systematically controlling for word similarities, basing the similarities on LSA cosines (Experiment 1) and HAL cosines (Experiment 2). Contrary to the original study, we used semantic spaces created from far larger corpora, and implemented several additional methodological improvements. In Experiment 1, we only found a significant effect of HAL cosines on lexical decision times, while we found significant effects for both LSA and HAL cosines in Experiment 2. As further supported by an analysis of the pooled data from both experiments, this indicates that HAL cosines are a better predictor of priming effects than LSA cosines. Taken together, the results replicate the finding that priming effects can be predicted from distributional semantic similarity measures.
Günther, F., Dudschig, C., & Kaup, B. (2016). Latent Semantic Analysis cosines as a cognitive similarity measure: Evidence from priming studies. Quarterly Journal of Experimental Psychology, 69, 626-653.
In distributional semantics models (DSMs) such as latent semantic analysis (LSA), words are represented as vectors in a high-dimensional vector space. This allows for computing word similarities as the cosine of the angle between two such vectors. In two experiments, we investigated whether LSA cosine similarities predict priming effects, in that higher cosine similarities are associated with shorter reaction times (RTs). Critically, we applied a pseudo-random procedure in generating the item material to ensure that we directly manipulated LSA cosines as an independent variable. We employed two lexical priming experiments with lexical decision tasks (LDTs). In Experiment 1 we presented participants with 200 different prime words, each paired with one unique target. We found a significant effect of cosine similarities on RTs. The same was true for Experiment 2, where we reversed the prime-target order (primes of Experiment 1 were targets in Experiment 2, and vice versa). The results of these experiments confirm that LSA cosine similarities can predict priming effects, supporting the view that they are psychologically relevant. The present study thereby provides evidence for qualifying LSA cosine similarities not only as a linguistic measure, but also as a cognitive similarity measure. However, it is also shown that other DSMs can outperform LSA as a predictor of priming effects.
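The core prediction can be illustrated with simulated data (not the experimental items): higher prime-target cosine similarity should go with shorter reaction times:

# Higher prime-target cosine similarity should be associated with shorter
# lexical-decision RTs. Similarities and RTs below are simulated.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n_items = 200
cosine_sim = rng.uniform(0, 0.8, n_items)                  # prime-target cosines
rt = 620 - 60 * cosine_sim + rng.normal(0, 40, n_items)    # facilitation + noise

r, p = pearsonr(cosine_sim, rt)
print(f"r = {r:.2f}, p = {p:.3f}")   # expected: negative correlation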
Günther, F., & Marelli, M. (2016). Understanding Karma Police: The Perceived Plausibility of Noun Compounds as Predicted by Distributional Models of Semantic Representation. PLoS One, 11(10), e0163200.
Noun compounds, consisting of two nouns (the head and the modifier) that are combined into a single concept, differ in terms of their plausibility: school bus is a more plausible compound than saddle olive. The present study investigates which factors influence the plausibility of attested and novel noun compounds. Distributional Semantic Models (DSMs) are used to obtain formal (vector) representations of word meanings, and compositional methods in DSMs are employed to obtain such representations for noun compounds. From these representations, different plausibility measures are computed. Three of those measures contribute to predicting the plausibility of noun compounds: the relatedness between the meaning of the head noun and the compound (Head Proximity), the relatedness between the meaning of the modifier noun and the compound (Modifier Proximity), and the similarity between the head noun and the modifier noun (Constituent Similarity). We find nonlinear interactions between Head Proximity and Modifier Proximity, as well as between Modifier Proximity and Constituent Similarity. Furthermore, Constituent Similarity interacts nonlinearly with the familiarity of the compound. These results suggest that a compound is perceived as more plausible if it can be categorized as an instance of the category denoted by the head noun, if the contribution of the modifier to the compound meaning is clear but not redundant, and if the constituents are sufficiently similar in cases where this contribution is not clear. Furthermore, compounds are perceived to be more plausible if they are more familiar, but mostly in cases where the relation between the constituents is less clear.
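A toy sketch of the three measures for a single compound (placeholder vectors and a placeholder composition; the study derives the compound vector from a compositional DSM):

# Head Proximity = cosine(compound, head), Modifier Proximity =
# cosine(compound, modifier), Constituent Similarity = cosine(modifier, head).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(8)
d = 50
modifier, head = rng.normal(size=d), rng.normal(size=d)   # e.g. "school", "bus"
compound = 0.4 * modifier + 0.6 * head                    # placeholder composition

measures = {
    "head_proximity":         cosine(compound, head),
    "modifier_proximity":     cosine(compound, modifier),
    "constituent_similarity": cosine(modifier, head),
}
print(measures)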
2015
In this article, the R package LSAfun is presented. This package provides a variety of functions and computations based on Vector Semantic Models such as Latent Semantic Analysis (LSA; Landauer, Foltz, & Laham, 1998, Discourse Processes, 25, 259–284), which are procedures for obtaining a high-dimensional vector representation for words (and documents) from a text corpus. Such representations are thought to capture the semantic meaning of a word (or document) and allow for semantic similarity comparisons between words to be calculated as the cosine of the angle between their associated vectors. LSAfun uses pre-created LSA spaces and provides functions for (a) Similarity Computations between words, word lists, and documents; (b) Neighborhood Computations, such as obtaining a word’s or document’s most similar words; (c) plotting such a neighborhood, as well as similarity structures for any word lists, in a two- or three-dimensional approximation using Multidimensional Scaling; (d) Applied Functions, such as computing the coherence of a text, answering multiple-choice questions, and producing generic text summaries; and (e) Composition Methods for obtaining vector representations for two-word phrases. The purpose of this package is to allow convenient access to computations based on LSA.
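As a rough Python analogue of the kind of neighborhood computation described (this is not the LSAfun R API, just an illustration with a tiny random space):

# Given a pre-computed semantic space, return a word's nearest neighbours by
# cosine similarity. The tiny random space stands in for a real LSA space.
import numpy as np

rng = np.random.default_rng(9)
space = {w: rng.normal(size=100) for w in
         ["dog", "cat", "wolf", "car", "truck", "apple", "pear"]}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def neighbours(word, space, n=3):
    target = space[word]
    sims = {w: cosine(target, v) for w, v in space.items() if w != word}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(neighbours("dog", space))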