Languages constantly invent new words by combining existing parts. For example, the concept “a machine that can be programmed to automatically carry out sequences of arithmetic or logical operations” is expressed as computer in English, “电脑” (electric-brain) in Chinese, and tietokone (knowledge-machine) in Finnish.
None of these is "wrong." But why do languages prefer one combination over others? Is it just a historical accident, or is there a deeper logic at play?
Our Hypothesis
Words are a negotiation between speaker and listener
The core idea: The morpheme combination a language "chooses" should be the one that best balances how well a listener can recover the meaning with how easy it is for a speaker to produce.
Listener: Be Understandable
Needs to infer meaning from morphemes
Prefers clarity and recoverability
Speaker: Be Efficient
Prefers low effort, short forms
Prefers common, familiar morphemes
The Study
200 years of English word-making
We tested this idea at scale. Using a time-indexed lexicon constructed from COHA and COCA, we collected 4323 naturally occurring English compounds and derivations spanning 1820–2019.
For each word, we generated a large set of alternative morpheme combinations that were available in the lexicon at the same historical moment, and tested whether our model ranks the actual attested form above those alternatives.
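The candidate-generation step can be pictured with a minimal sketch. The snippet below is an illustration only, not the procedure used in the study: the `first_seen` mapping, the function names, and the first-attestation years are all hypothetical, and it assumes that a candidate is simply any ordered pair of distinct morphemes already attested by a given year.

```python
from itertools import permutations

def available_morphemes(first_seen, year):
    """Morphemes whose (hypothetical) first attestation is no later than `year`."""
    return sorted(m for m, y in first_seen.items() if y <= year)

def candidate_set(first_seen, year, size=2):
    """All ordered combinations of `size` distinct morphemes available at `year`."""
    return list(permutations(available_morphemes(first_seen, year), size))

# Toy lexicon with made-up attestation years, for illustration only.
first_seen = {"book": 1800, "shelf": 1800, "net": 1990}
print(candidate_set(first_seen, 1850))  # "net" is excluded before 1990
```

Restricting candidates to morphemes available at the same historical moment keeps the comparison fair: the attested form only competes against alternatives a speaker of that period could actually have produced.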
We built a semantic model that infers a concept from its morphemes (using modern language model embeddings) and a cost model that estimates production cost (using historical corpus statistics). We then integrated these components to model the pragmatic speaker's choice, as predicted by the rational communication hypothesis.
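As a rough illustration of how a semantic score and a production cost can be combined, here is a sketch of the textbook RSA speaker, S1(form | concept) ∝ exp(α · (log L0(concept | form) − cost(form))). It is not the paper's implementation: the similarity scores, the costs, the softmax literal listener, and the parameter `alpha` are all assumptions chosen for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def literal_listener(sim_row):
    """L0(concept | form): normalize a form's similarity scores over concepts."""
    return softmax(sim_row)

def pragmatic_speaker(sims, costs, concept_idx, alpha=1.0):
    """S1(form | concept), proportional to exp(alpha * (log L0 - cost)).

    sims[i][j] is a hypothetical similarity between candidate form i and
    concept j; costs[i] is a hypothetical production cost for form i.
    """
    utilities = []
    for sim_row, cost in zip(sims, costs):
        l0 = literal_listener(sim_row)
        utilities.append(alpha * (math.log(l0[concept_idx]) - cost))
    return softmax(utilities)

# Toy values: forms 0 and 1 are equally informative about concept 0,
# but form 1 is costlier; form 2 is cheap but uninformative.
sims = [[2.0, 0.0], [2.0, 0.0], [0.5, 0.5]]
costs = [0.2, 1.0, 0.1]
probs = pragmatic_speaker(sims, costs, concept_idx=0)
print(probs)
```

In this toy setup the speaker prefers form 0, which is both informative and cheap. Setting every cost to zero makes the speaker depend on the semantic score alone, which gives a picture of what a semantic-only baseline looks like.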
Within the Rational Speech Act (RSA) framework, we show that attested compositions are systematically ranked above unattested alternatives generated from contemporaneously available morphemes. Models that integrate semantic informativeness with production cost outperform semantic-only and cost-only baselines on Mean Reciprocal Rank (MRR) and top-k accuracy (Acc@k). The advantage of the Pragmatic Speaker model (𝑆1) over the semantic-only baseline grows as the candidate set expands, the regime in which meaning alone leaves morphological choice underdetermined.
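For readers unfamiliar with the two metrics, the sketch below computes MRR and Acc@k from the rank of each attested form within its candidate set. The function names and the toy scores are hypothetical; only the metric definitions are standard.

```python
def rank_of_attested(scores, attested):
    """1-based rank of the attested form when candidates are sorted by
    descending model score (ties keep their insertion order)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(attested) + 1

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test words; 1.0 means always ranked first."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def accuracy_at_k(ranks, k):
    """Fraction of test words whose attested form is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Toy scores for a single word, for illustration only.
scores = {"book-shelf": 0.7, "shelf-book": 0.2, "book-net": 0.1}
ranks = [rank_of_attested(scores, "book-shelf"), 2, 4]
print(mean_reciprocal_rank(ranks), accuracy_at_k(ranks, 1), accuracy_at_k(ranks, 3))
```

MRR rewards placing the attested form near the top even when it is not ranked first, while Acc@k only checks whether it falls within the top k candidates.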
These findings suggest that lexicalization reflects a communicative trade-off between expressiveness and efficiency, extending rational accounts of communication from utterance-level choice to the internal structure of words.
BibTeX
@inproceedings{yang2026rational,
title={Rational Communication Shapes Morphological Composition},
author={Yang, Fengyuan and Peng, Yongqian and Ma, Yuxi and Xu, Chenheng and Zhu, Yixin},
booktitle={CogSci},
year={2026}
}