It’s easy to assume that large language models understand ordinary language. They write fluent emails, summarize documents, and answer questions with confidence. But human language is full of traps: we often use figurative language such as metaphor and simile to illustrate a point, and sometimes we use irony, meaning the opposite of what our words literally say. In our new paper, “As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language,” we investigate what happens when two challenges collide: figurative language and negation.
Figurative language asks a model to go beyond literal meaning. When someone says a task is “a walk in the park,” they do not mean it involves trees and benches. Negation adds another layer: “not a walk in the park” reverses the message, but not in a simple word-by-word way. For humans, this kind of interpretation feels natural, but for language models, it can be surprisingly fragile.
We build on Fig-QA, a dataset of creative metaphors and similes paired with literal interpretations. We add new annotations for metaphor, simile, negation, tense, and concreteness, then test a range of models, from embedding-based systems to recent LLMs. We also create a small literal negation dataset to ask whether models struggle more with negation, more with figurative language, or whether it is the combination that is particularly problematic.
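To make the comparison concrete, here is a minimal Python sketch of how accuracy might be broken down by condition (literal or figurative, negated or not). It is an illustration under stated assumptions, not the paper's evaluation code: the `Item` fields, the four example sentences, and the `ignores_negation` baseline are all invented here and are not taken from Fig-QA or from our annotations.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Item:
    """One two-choice interpretation item (hypothetical schema)."""
    sentence: str
    options: Tuple[str, str]  # two candidate interpretations
    label: int                # index of the correct interpretation
    figurative: bool          # annotation: figurative vs. literal
    negated: bool             # annotation: contains negation


def accuracy_by_condition(
    items: List[Item], predict: Callable[[Item], int]
) -> Dict[Tuple[bool, bool], float]:
    """Return accuracy keyed by (figurative, negated)."""
    correct: Dict[Tuple[bool, bool], int] = defaultdict(int)
    total: Dict[Tuple[bool, bool], int] = defaultdict(int)
    for item in items:
        key = (item.figurative, item.negated)
        total[key] += 1
        if predict(item) == item.label:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Invented examples for illustration only (not drawn from any dataset).
OPTIONS = ("The exam was easy.", "The exam was hard.")
ITEMS = [
    Item("The exam was a walk in the park.", OPTIONS, 0, True, False),
    Item("The exam was not a walk in the park.", OPTIONS, 1, True, True),
    Item("The exam was simple.", OPTIONS, 0, False, False),
    Item("The exam was not simple.", OPTIONS, 1, False, True),
]


def ignores_negation(item: Item) -> int:
    """Stand-in for a model that reads every sentence as if 'not' were absent."""
    return 0


if __name__ == "__main__":
    for condition, acc in accuracy_by_condition(ITEMS, ignores_negation).items():
        print(condition, acc)
```

A real evaluation would replace `ignores_negation` with a call to the model under test. Splitting the scores this way is what lets one ask whether errors come from negation alone, from figurative language alone, or from the two together.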
Overall, we find that the combination of negation and figurative language presents a challenge even to modern LLMs, and that prompting methods have a large effect on performance. One response would be to fine-tune LLMs on the combination of negation and figurative language, but this is a piecemeal solution. A truly flexible and creative language user, such as a human, can understand these novel combinations of phenomena without special training. It is also important to consider how language models may be used in downstream applications.
Prompting style makes a big difference to performance, but the highest-performing prompting style is arguably less true to how linguistic phenomena occur in the real world. We therefore argue that more thought should be given to how LLM performance is assessed with respect to real-world use.