Humanity's Last Exam Has Been Published in Nature: An Unexpected Consortium Authorship
Published in the journal Nature in early 2026, Humanity's Last Exam (HLE) is a rigorous, multi-modal benchmark designed to evaluate large language models (LLMs) at the frontier of human knowledge. The dataset was introduced in response to the rapid saturation of popular existing benchmarks, such as Measuring Massive Multitask Language Understanding (MMLU), on which modern LLMs routinely score over 90%. This saturation has made it increasingly difficult to gauge the true extent of state-of-the-art AI capabilities.
To address this measurement gap, HLE features 2,500 expert-level, closed-ended academic questions. The benchmark covers dozens of subjects across the natural sciences and humanities, with a strong emphasis on world-class mathematics problems aimed at testing deep reasoning. It was collaboratively developed by nearly 1,000 subject-matter experts, primarily professors and researchers from over 500 institutions worldwide. The questions are diverse, encompassing both text-only and multi-modal formats, and require either exact-match or multiple-choice answers to allow automated verification. Crucially, they are designed to be precise, unambiguous, and highly resistant to simple internet retrieval.
During the dataset creation pipeline, every proposed question was pre-tested against frontier LLMs; only questions that current models failed to answer correctly were passed on for human expert review. Consequently, evaluations show that state-of-the-art models exhibit notably low accuracy and high calibration error on HLE, frequently delivering incorrect answers with high confidence. By establishing this challenging new baseline, HLE provides a vital, standardized reference point for scientists and policymakers to track genuine AI progress and to better understand the current gap between machine capabilities and human expertise.
On a personal note, I contributed two questions in the field of transportation engineering to this very dataset back in 2025, both of which were accepted. At the time of my submission, I had no idea that the organizing team intended to compile the project into a formal academic paper. It wasn't until early 2026, upon receiving an unexpected email from the HLE team, that I was astonished to discover the work had been published in Nature, with myself listed as a consortium author on the publication. Looking back, it has been an incredible experience to inadvertently become part of such a significant milestone in AI research.
Fast forward to February 2026, and the landscape has already shifted dramatically. The most powerful models of today, such as Gemini 3.1 Pro and GPT-5.4, are already cracking the 40% mark on the HLE leaderboard. Given this blistering pace of innovation, it is highly probable that frontier AI will saturate this formidable question bank within the next year or two. I have spent a lot of time reflecting on this explosive growth and the broader paradigm shift it brings to human society; you can read some of my personal essays on AI's rapid evolution here.
On the Liberation of Creativity by Artificial Intelligence
On the AI Reviewer in Academia
Recent Advances in AI (November 2025)
Subjective Experience and the Force of Life