Speaker: Dr. Joel Niklaus, Hugging Face
Time: October 8, 2025, 10:00 am - 11:30 am
Room: E297L, Discovery Park, UNT
Coordinator: Dr. Haihua Chen
Abstract: Long-form legal reasoning remains a key challenge for large language models (LLMs) despite recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. In addition to reference answers, the open-ended questions are accompanied by explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application. Our evaluation shows that both the open-ended and multiple-choice questions pose significant challenges for current LLMs; in particular, models struggle with open-ended questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models of varying capability. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/
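For attendees unfamiliar with the LLM-as-a-Judge paradigm mentioned in the abstract, the sketch below illustrates the general idea: a judge model scores a candidate answer against a reference along the reasoning dimensions named above (issue spotting, rule recall, rule application). This is a minimal, hypothetical illustration, not the LEXam evaluation code; the judge model is abstracted as a pluggable `judge_fn` callable, and the rubric, prompt, and 0-2 scale are assumptions for demonstration only.

```python
# Illustrative LLM-as-a-Judge scoring loop (hypothetical, not the authors' implementation).
# The judge model is abstracted as `judge_fn`, a callable that maps a prompt to raw text.
import json
from typing import Callable, Dict

# Rubric dimensions mirror the reasoning steps named in the abstract.
RUBRIC = ["issue_spotting", "rule_recall", "rule_application"]

JUDGE_PROMPT = """You are grading a law exam answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score each dimension from 0 to 2 and reply with JSON only, e.g.
{{"issue_spotting": 1, "rule_recall": 2, "rule_application": 0}}"""


def grade_answer(question: str, reference: str, candidate: str,
                 judge_fn: Callable[[str], str]) -> Dict[str, int]:
    """Ask the judge model for per-dimension scores and validate the result."""
    raw = judge_fn(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    scores = json.loads(raw)
    # Keep only the expected rubric keys and clamp each score to the 0-2 range.
    return {k: max(0, min(2, int(scores.get(k, 0)))) for k in RUBRIC}


if __name__ == "__main__":
    # Stub judge for demonstration; a real setup would call an LLM API here.
    stub = lambda prompt: '{"issue_spotting": 2, "rule_recall": 1, "rule_application": 1}'
    print(grade_answer("Is the contract voidable?",
                       "Yes, due to duress.",
                       "The contract can be voided because of duress.",
                       stub))
```

In practice, such per-dimension scores would be validated against human expert grades, which is the role the abstract assigns to rigorous human validation.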
Bio of the speaker: Dr. Joel Niklaus is a Machine Learning Engineer at Hugging Face, working on synthetic pretraining data. He also serves as an advisor and angel investor to various AI companies. Previously, Joel was a Research Scientist at Harvey, specializing in large language model systems for legal applications. Before that, he was an AI Resident at (Google) X, where he trained multi-billion parameter models on hundreds of TPUs and achieved state-of-the-art results on the LegalBench evaluation dataset. He also conducted research at Thomson Reuters Labs on efficient domain-specific pretraining approaches.
Dr. Niklaus conducted research on LLMs at Stanford University under the supervision of Prof. Dan Ho and Prof. Percy Liang, and has led research projects for the Swiss Federal Supreme Court. With extensive experience in pretraining and fine-tuning LLMs across various compute environments, his research focuses on dataset curation for multilingual legal language models. His datasets have established the foundation for legal NLP in Switzerland. Joel has contributed to open-source projects including lighteval and Marin. His research has been published at leading NLP and machine learning conferences, covered by Anthropic and Swiss National Radio & Television, and honored with an Outstanding Paper Award at ACL. He holds a PhD in Natural Language Processing, a Master's in Data Science, and a Bachelor's in Computer Science, all from the University of Bern.
He has lectured at the University of Bern and the Bern University of Applied Sciences, delivering continuing education courses in natural language processing. Previously, he taught computer science at several Swiss high schools. He also brings experience in delivering corporate courses and talks.