This piece, by Onno Berkan, was published on 03/04/25. The original article, by Anna Ivanova, was published in Nature Human Behaviour on 01/15/25.
The article discusses how to properly evaluate large language models (LLMs) when testing their cognitive abilities. As AI systems become more capable of understanding and generating human language, researchers have begun testing them with methods originally designed for humans, assessing everything from memory and logical reasoning to creativity and personality traits.
The author presents 14 key guidelines for researchers to follow when studying AI psychology. One major concern is that AI models might simply memorize test answers rather than truly understand the concepts being tested. This concern was illustrated by the Winograd Schema Challenge, a test of common-sense reasoning: when AI systems scored over 90% accuracy, it was initially hailed as a breakthrough, but researchers later discovered that the models were relying on simple shortcuts rather than genuine understanding.
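To make the shortcut problem concrete, here is a minimal, purely illustrative sketch (my own, not from the article): a heuristic that answers a Winograd-style item by word association alone. The co-occurrence counts below are made-up assumptions standing in for statistics a model might absorb from its training corpus.

```python
# Illustrative sketch: a surface-level association heuristic "solving" a
# Winograd-style item without any common-sense reasoning.
# Classic item: "The trophy doesn't fit in the suitcase because it is too big."
# Which noun does "it" refer to?

candidates = ["trophy", "suitcase"]

# Hypothetical co-occurrence counts (how often each noun appears near the
# cue word in some corpus). These numbers are invented for illustration.
cooccurrence = {
    ("trophy", "big"): 120,
    ("suitcase", "big"): 45,
    ("trophy", "small"): 60,
    ("suitcase", "small"): 140,
}

def shortcut_guess(candidates, cue_word):
    """Pick the candidate most associated with the cue word,
    ignoring the sentence's actual causal structure."""
    return max(candidates, key=lambda noun: cooccurrence.get((noun, cue_word), 0))

print(shortcut_guess(candidates, "big"))    # -> "trophy"   (right answer, wrong reason)
print(shortcut_guess(candidates, "small"))  # -> "suitcase" (also right, still no reasoning)
```

The point is that a high score on such items can reflect statistical association rather than the common-sense reasoning the benchmark was meant to measure.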
The study emphasizes several important cautions for researchers. First, they shouldn't rely on well-known test questions, since AI models may have encountered them in their training data. Researchers should also be cautious about using automatically generated or crowdsourced test questions, as these might contain errors or biases that undermine the validity of the results.
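As one deliberately simplified illustration of what checking for familiarity might look like, the Python sketch below flags a test item when most of its word n-grams appear verbatim in a reference text. The function names, n-gram length, and threshold are my own assumptions, and the "corpus" here is a toy stand-in; this is not a procedure from the article, and real contamination checks would need the model's actual training data, which is often unavailable.

```python
# Toy contamination check: flag test items whose word n-grams overlap
# heavily with a reference text.

def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def likely_contaminated(test_item, corpus_text, n=8, threshold=0.5):
    """Flag a test item if a large fraction of its n-grams
    appear verbatim in the reference corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return False
    overlap = len(item_grams & ngrams(corpus_text, n)) / len(item_grams)
    return overlap >= threshold

# Usage: a question copied verbatim into the "corpus" gets flagged.
question = "The trophy doesn't fit in the suitcase because it is too big. What is too big?"
corpus = "benchmark dump: The trophy doesn't fit in the suitcase because it is too big. What is too big?"
print(likely_contaminated(question, corpus))  # True
```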
Another crucial point is the need to consider cultural and linguistic biases. Most AI models are trained primarily on English text and data from Western, educated, industrialized, rich, and democratic (WEIRD) societies, which means their responses might not represent universal human behavior or work equally well in other languages or cultural contexts.
The author also addresses the challenges of evaluating commercial AI models, noting that it is nearly impossible to know exactly what data these models were trained on or whether they have been explicitly tuned for specific tasks. While some researchers argue against testing closed commercial models altogether, the author acknowledges that such testing still has practical value for understanding what these systems can and cannot do safely.
The paper concludes by calling for a balanced approach to AI evaluation, one that avoids both overenthusiasm and extreme skepticism. It emphasizes careful, systematic evaluation of AI capabilities alongside transparency about these systems' advances and limitations. The author hopes these guidelines will help improve the quality and validity of future research in AI psychology.
Want to submit a piece? Or trying to write a piece and struggling? Check out the guides here!
Thank you for reading. Reminder: Byte Sized is open to everyone! Feel free to submit your piece. Please read the guides first, though.
Please send all submissions to berkan@usc.edu as a Word document with the subject line “Byte Sized Submission.” Thank you!