Accepted Papers

→ Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering

Manan Gupta ⋅ Dhruv Kumar

→ BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

Patrick Knab ⋅ Orgest Xhelili ⋅ Inis Buzi ⋅ Drago A Nilo ⋅ Mohd S Khan ⋅ Lorenz Kolb ⋅ Manuel Scherzer ⋅ Kerem Yildirir ⋅ Christian Bartelt ⋅ Philipp J Schubert

→ Where’s the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

Nicole H. ⋅ Nick Rui

→ Perplexity Cannot Always Tell Right from Wrong

Petar Veličković ⋅ Federico Barbero ⋅ Christos Perivolaropoulos ⋅ Simon Osindero ⋅ Razvan Pascanu

→ Evaluator Failure Modes in Agentic Uncertainty Quantification

Suresh Raghu ⋅ Satwik Pandey ⋅ Shashwat Pandey

→ CellARC: An Oracle-Calibrated Benchmark for Few-Shot Rule Induction

Miroslav Lžičař

→ YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Muyu He ⋅ Vincent Tu ⋅ Adit Jain ⋅ Anand Kumar ⋅ Sachin Patro ⋅ Soumyadeep Bakshi ⋅ Nazneen Rajani

→ From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for VLM Weak Supervision Across Three Medical-Imaging Benchmarks

Bruce C Xu ⋅ Jose James ⋅ Alexander J Ryu

→ Hidden Sensitivity in Spatial Reasoning Evaluation: Diagnosis and Re-ranking with VSI-Bench

Phillip Y Lee ⋅ Jin Yoo ⋅ Minseo Kim ⋅ Leonidas Guibas ⋅ Minhyuk Sung

→ Toward Trustworthy LLM–GNN Fusion: A Fusion-Aware Evaluation and Reporting Framework

Zhifei Hu ⋅ Alexandra I Cristea

→ Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

Xing Zhang ⋅ Guanghui Wang ⋅ Yanwei Cui ⋅ Qucy W Qiu ⋅ Ziyuan Li ⋅ Bing Zhu ⋅ Peiyang He

→ The Propagation Field: A Geometric Substrate Theory of Deep Learning

Xingrui Gu

→ EditCLEVR: A Paired-Scene Intervention Benchmark for Compositional Faithfulness of Object-Centric Representations

Anuraag Gadehothur Karnam ⋅ Tarunesh Sathish

→ Combining Theory and Benchmarks for Length Generalisation: Formal Certificates Meet Large-Scale Evaluation

Zacharie Bugaud

→ PromptSplit: Revealing Prompt-Level Disagreement in Generative Models

Mehdi Lotfian ⋅ Mohammad Jalali ⋅ Farzan Farnia

→ SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

Yanhang Li ⋅ Zhichao Fan ⋅ Zexin Zhuang

→ From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA

Sena Korkut ⋅ Maria A Bravo ⋅ Sanghwan Kim ⋅ Zeynep Akata

→ Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

Everett Richards

→ Simulating Field Experiments for Method Testing

Enoch H. Kang

→ Executable Ground Truth: A Closed-Loop Benchmark for Evaluating LLM Agents on Microservice Incident Remediation

Dhatri C ⋅ Tadisetty S Yashwanth

→ Universality, Composition Generalization, and Algorithm Emulation All In-Context

Jerry Yao-Chieh Hu ⋅ Hong-Yu Chen ⋅ Po-Chiao Lin ⋅ Maojiang Su ⋅ Han Liu

→ Conformalized Scaling Laws: Distribution-Free Prediction Intervals for Out-of-Distribution Compute Regimes

Kaustubh Bukkapatnam ⋅ Siddharth Karuturi

→ Constructing Korean Benchmark Suite for Reliable Evaluation of Foundation Models

Yeonkyoung So ⋅ Jongmin Kim ⋅ Sungmok Jung ⋅ Gyuseong Lee ⋅ Sangho Kim ⋅ Jongyeon Park ⋅ Joonhak Lee ⋅ Seho Pyo ⋅ Gyeongje Cho ⋅ Seorin Kim ⋅ Jisoo Kim ⋅ Suyoung Park ⋅ Hyunji M Park ⋅ Yelim Ahn ⋅ Yeongho Seo ⋅ Jaejin Lee

→ Context Saturation in Zero-Shot Time-Series Foundation Models

Miguel Nogales ⋅ Luca Butera ⋅ Alberto Ferrante ⋅ Cesare Alippi

→ MultiVulnBench: A Large-Scale Benchmark for Count Bias in LLM-Based Multi-Vulnerability Detection

Manan Gupta ⋅ Chinmay Pushkar ⋅ Sanchit Kabra ⋅ Dhruv Kumar ⋅ Jagat S Challa

→ A Controlled Benchmark for Lag-Structured Dependency Motifs

Bowen Qi

→ When Does Polynomial Attention Concentrate? A Relative-Margin Diagnostic for Zero-Shot Softmax Substitution

Sanny Kim

→ Certifiable Evaluation: A Low-Rank Framework for Foundation Model Benchmarking with Formal Performance Guarantees

Siddharth Karuturi ⋅ Kaustubh Bukkapatnam ⋅ Laksh Patel ⋅ Tanush A Shastry ⋅ Akshath Sharma ⋅ Mithil Shah ⋅ Matthew Park

→ Fuzzy-Clustered Mixture-of-Experts with Relational Regularization for Interpretable Subgroup Modeling under Data Scarcity

Chien-Hung Lai ⋅ Yuh-Shyan Hwang ⋅ Yi Lin

→ Frontier Inference Under Repeated Partial Reporting

Yanan Long

→ The Shape of Noise: Layer-Wise Perturbation Profiles for Diagnosing Vision Robustness

Son Nguyen ⋅ V. G Bao ⋅ Quang M Phan ⋅ Trong P Le

→ Scale Dependent Data Duplication

Joshua Kazdan ⋅ Noam Levi ⋅ Rylan Schaeffer ⋅ Jessica Chudnovsky ⋅ Abhay Puri ⋅ Bo He ⋅ Mehmet Donmez ⋅ Sanmi Koyejo ⋅ David Donoho

→ You're reading LLM leaderboards wrong: Disentangling models from pipelines in engineering benchmarks

Marius Tacke ⋅ Shivam Suri ⋅ Matthias Busch ⋅ Mahish K Guru ⋅ Christian J Cyron ⋅ Roland Aydin

→ Internal Data Repetition Destroys Language Models

Jessica Chudnovsky ⋅ Joshua Kazdan ⋅ Noam Levi ⋅ Rylan Schaeffer ⋅ Yegor Denisov-Blanch ⋅ Sanmi Koyejo ⋅ David Donoho

→ Certified Evaluation for LLMs in Optimization Modeling: From Graph Isomorphism to Formulation Isomorphism

Zhuohan Wang ⋅ Ziwei Zhu ⋅ Ziniu Li ⋅ Congliang Chen ⋅ Zhihang Lin ⋅ MingZhe Yang ⋅ Yizhou Han ⋅ Yufeng Lin ⋅ Angyang Gu ⋅ Xinglin Hu ⋅ Ruoyu Sun ⋅ Tian Ding

→ Joint Evaluation of Compliance, Planning, and Consistency under Paraphrase: A Relational-Complexity View of Frontier LLMs

Shivansh Bibra ⋅ Dhruv Kumar ⋅ Murari Mandal ⋅ Yash Sinha

→ LoopNav: Benchmarking Spatial Consistency in World Models

Kewei Lian ⋅ Shaofei Cai ⋅ Yitao Liang ⋅ Anji Liu

→ Symmetries of Functional Processes under Label Noise

Abhra Chaudhuri ⋅ Pedro Gomes

→ Null-Calibrated Evaluation of Sparse Autoencoder Decoder Reproducibility

Bright Liu

→ Instruction Bleed: A Theory-Anchored Benchmark for Cross-Module Interference in Prompt-Composed Agents

Ching-Yu Lin ⋅ Yifan Liu

→ Collaborative Adaptive Labeling with Imperfect Labelers and Selective Expert Escalation

Xinrui Ruan ⋅ Nanshan Jia ⋅ Waverly Wei ⋅ Sui Huang ⋅ Zhenyu Zhao ⋅ Zeyu Zheng ⋅ Jingshen Wang

→ The Prompt Is the Analytic Choice: Specification Curve Analysis for LLM-Based Social Science

Jacob Crainic ⋅ Brandon Yee ⋅ Pairie Koh

→ Identifying Efficient Queries for Black-Box Model Classification

Merrick Ohata ⋅ Carey Priebe ⋅ Hayden Helm

→ AIE-Bench: Benchmarking Agents That Build Agents

Abhishek Mishra ⋅ Selvam Palanimalai ⋅ Yogendra Manawat ⋅ Samuel Verboomen ⋅ Prannay Hebbar ⋅ Damir Vrabac ⋅ Deepak Nathani ⋅ Sumeet Motwani ⋅ Kunal Bhatia ⋅ Vignesh Baskaran

→ Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

Zexin Zhuang ⋅ Yanhang Li ⋅ Zhichao Fan

→ A Benchmarked Diagnostic for Sparse Decomposability of Dense Causal Subspaces

Socrates Osorio ⋅ Joy Z. Yang

→ Efficient Safety Benchmarking via Item Response Theory

Fabio Spagliardi ⋅ Mírian Silva ⋅ Ayan Datta ⋅ Aiden Zhou ⋅ Vamshi Krishna Bonagiri ⋅ Diogo Cruz

→ A Numerical Study of Robustness Verification for Lightning Self-Attention

Yulia Alexandr ⋅ Hao Duan ⋅ Guido Montufar

→ When Agreement Becomes Unsafe: Loss-Aware Energy Control for Diagnostic Deliberation

Yuting Yan ⋅ Yinghao Fu ⋅ Haozhou Gao ⋅ Tianjian Zhang ⋅ Aoxi Liu ⋅ Shuang Li

→ Bounding Compositional Incoherence in Foundation Models

Anany Kotawala

→ Contextual Observability and Grammar Singularity for Compositional Task Families

Manoj Saravanan ⋅ Rohit Kumar Salla ⋅ Shrikar R Kota

→ Selective Perturbations as a Diagnostic for Benchmark-Based LLM Comparisons

Ivan Dubrovsky ⋅ Anastasia Orlova ⋅ Nina Gubina ⋅ Illarion Iov ⋅ Irena Gureeva ⋅ Nikolay Nikitin ⋅ Alexey Zaytsev

→ Correcting Optimizer Selection Bias via Large Deviation Hazards

Andrea Zerio ⋅ Andres R Masegosa

→ Stress-Testing Neural Network Verifiers with Provably Robust Instances

David Troxell ⋅ Yulia Alexandr ⋅ Sofia Hunt ⋅ Stephanie Lei ⋅ Guido Montufar

→ Functional Subspace, where language models can use vector algebra to solve problems

Jung H Lee ⋅ Sujith Vijayan

→ Uplifting Human Decision Making in AI Evaluation by Automating Benchmark Validity Analysis

Rodolfo Corona ⋅ Sang Truong ⋅ Ritwik Gupta ⋅ Nhi N Truong ⋅ Atnafu Lambebo Tonja ⋅ Mena Attia ⋅ Fahim Faisal ⋅ Kaushal K Maurya ⋅ Fred Philippy ⋅ Belu Ticona ⋅ Sumaya N Adan ⋅ Fazl Barez ⋅ Omar Florez ⋅ Supheakmungkol Sarin ⋅ Aseem Srivastava ⋅ Xiaoyuan Yi ⋅ Nick Haber ⋅ Dan Klein ⋅ Thamar Solorio ⋅ Xing Xie ⋅ Sanmi Koyejo ⋅ Robert Trager

→ Context Over Content: Exposing Evaluation Faking in Automated Judges

Manan Gupta

→ DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs

Art Kanke

→ Active probabilistic reasoning in humans and LLMs

Gonçalo Guiomar ⋅ Elia Torre ⋅ Pehuen Moure ⋅ Victoria Shavina ⋅ Mario Giulianelli ⋅ Shih-Chii Liu ⋅ Valerio Mante

→ Evaluating LLM Reasoning on Operating System Algorithms via Step-Level Verification

Jalluri Mahesh Kumar ⋅ Junjunoori S Chakri ⋅ Yash Kothari ⋅ Murari Mandal ⋅ Yash Sinha ⋅ Dhruv Kumar

→ Operads for compositional reasoning in LLMs

Nathaniel Bottman ⋅ Kyle Richardson

→ Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Manan Gupta ⋅ Dhruv Kumar

→ Ensuring Calibration Robustness in Split Conformal Prediction Under Adversarial Attacks

Xunlei Qian ⋅ Yue Xing

→ Rethinking LLM Confidence: From Calibration to Coherence

Krish Matta ⋅ Atharv Naphade ⋅ Andy Zou

→ Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

Sethuraman T V ⋅ Savya Khosla ⋅ Aditi Tiwari ⋅ Vidya Ganesh ⋅ Rakshana Jayaprakash ⋅ Aditya Jain ⋅ Vignesh Srinivasakumar ⋅ Onkar Susladkar ⋅ Joey Wang ⋅ Srinidhi Sunkara ⋅ Aditya Shanmugham ⋅ Abbaas A Nishar ⋅ Rakesh Vaideeswaran ⋅ Simon Jenni ⋅ Derek Hoiem

→ Aggregate Metrics Hide Shortcut Regimes: A Complexity-Stratified Benchmark for Novel View Synthesis

Han Lee ⋅ Rohan K Dalal ⋅ Irene Tang

→ How good is your harness?

Jiwoo Han ⋅ Yuekai Sun

→ Rethinking FID Through the Geometry of the Reference Dataset

Yunghee Lee ⋅ Byeonghyun Pak

→ A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation

Hosna Oyarhoseini ⋅ Jimmy Lin ⋅ Amir-Hossein Karimi

→ SemanticSRJudge: Spatially-Grounded VLM Evaluation for Super-Resolution Quality Assessment

Vishwajeet Shukla ⋅ Ankit Dhankhar ⋅ Ajay Bedi

→ Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench

Vikhyath Kothamasu ⋅ Virginia Smith ⋅ Chhavi Yadav

→ Choosing Training-Time Calibration Objectives for Frozen Foundation-Model Features: A Linear-Probing Benchmark

Heejin Choi

→ Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension

Amanda Bertsch ⋅ Luca Soldaini ⋅ Matthew Gormley ⋅ Graham Neubig ⋅ Hannaneh Hajishirzi ⋅ Kyle Lo ⋅ Dirk Groeneveld

→ From Forecast Scores to Auditable Benchmarks: WorldFork for LLM Forecasting Evaluation

Hanson Wen ⋅ Shing C Gui

→ FRAME: Framework for Robotic Action and Motion Evaluation

Ameya Wagh ⋅ Vishnu Rudrasamudram

→ Benchmark Scores Rank Methods, Not Capabilities: Theory, Evidence, and Protocols for the Saturation-Collapse Cycle

Dipam Paul

→ Instance-Optimal Estimation with Multiple LLM Judges on a Budget

Junghyun Lee ⋅ Sanghwa Kim ⋅ Yassir Jedra ⋅ Alexandre Proutiere ⋅ Se-Young Yun

→ How long is a piece of string? A brief empirical analysis of tokenizers

Jonathan Roberts ⋅ Kai Han ⋅ Samuel Albanie

→ ShiftBench: A Benchmark for Per-Cohort Certify-or-Abstain Decisions on Positive Predictive Value Under Covariate Shift

Ananya Salian

→ Beyond Answer Correctness: Measuring and Reducing Explanation Faithfulness Gaps in Chart Understanding VLMs

Kshitij Dahiya ⋅ Vinay K Saini

→ Estimating Pass@k from Fewer Samples with Hierarchical Bayesian Priors

Alexandre Verine ⋅ Florian Le Bronnec ⋅ Benjamin Negrevergne ⋅ Alexandre Allauzen

→ A Cognitive Battery for Foundation Models: Theory-Grounded Benchmarks for Attention, Learning, Metacognition, Executive Function, and Social Cognition

Zacharie Bugaud

→ On the Rotation-Equivariance Geometry of Tabular Foundation Models

Mert Ogul

→ Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Xingyu Ren ⋅ Youran Sun ⋅ Haoyu Liang

→ Benchmarks Are Not Atomic: Composition-Aware LLM Evaluation using BenchHub

Eunsu Kim ⋅ Haneul Yoo ⋅ Guijin Son ⋅ Hitesh Patel ⋅ Amit Agarwal ⋅ Alice Oh

→ Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Mark Vero ⋅ Fabian Kaczmarczyck ⋅ Ivan Petrov ⋅ Ilia Shumailov ⋅ Niels Heinen ⋅ Jamie Hayes ⋅ Tianqi Fan ⋅ Luca Invernizzi ⋅ Martin Vechev

→ Trace-Aware Routing for Cost-Effective Human–AI Collaborative Labeling

Waverly Wei ⋅ Xinrui Ruan ⋅ Zhenyu Zhao ⋅ Sui Huang ⋅ Zeyu Zheng ⋅ Jingshen Wang

→ Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

Yiding Song ⋅ Hanming Ye

→ Generalized Priority-Aware Shapley Value

Kiljae Lee ⋅ Ziqi Liu ⋅ Weijing Tang ⋅ Yuan Zhang

→ Generative vs Discriminative? Revisiting the shortcut learning debate in text classification

Siva Rajesh Kasa ⋅ Karthik Raavi ⋅ Sumegh Roychowdhury ⋅ Pattisapu Priyatam ⋅ Ashutosh Kumar ⋅ Yaswanth Biruduraju ⋅ Santhosh Kasa ⋅ Ankith M S ⋅ Sumit Negi

→ Measuring the Limits of Continual Learning for LLMs

Nimit Kalra ⋅ Narutatsu Ri ⋅ Zerzar Bukhari ⋅ Ang Li ⋅ Sanae Lotfi ⋅ Liam Fowl ⋅ Micah Goldblum

→ Atomic Chess as a Counterfactual Benchmark for Quantifying Rule-Conditioned Generalization

Ryan Co ⋅ Karthik R Konuganti

→ Cross-Language Evaluation of Prompt Inversion: Similarity Metrics, Decoding Strategies, and Prefix Sensitivity in Japanese and English

Yusei Kitamura ⋅ Ahmad A Kamal ⋅ Masaya Fujisawa

→ Capacity-Gated Forgetting in LoRA Fine-Tuning: Rank, Proximity, and Endogenous Replay in Medical LLMs

Akanksha Narula ⋅ Aaditya Sharma ⋅ Dharya Jasuja ⋅ Aditya Dhawan

→ m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Yosub Shin ⋅ Michael Buriek ⋅ Igor Molybog

→ Interactive Evaluation Requires a Design Science

Keyang Xuan ⋅ Peiyang Song ⋅ Pan Lu ⋅ Katie Collins ⋅ Pengrui Han ⋅ Wenkai Li ⋅ Zhenyu Zhang ⋅ Zexue He ⋅ Wenyue Hua ⋅ Manling Li ⋅ Jiaxuan You ⋅ Adrian Weller ⋅ Yizhong Wang ⋅ Jiaxin Pei

→ Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Rafal Kocielnik ⋅ Pengrui Han ⋅ Peiyang Song ⋅ Myrl G Marmarelis ⋅ Ramit Debnath ⋅ Dean Mobbs ⋅ Anima Anandkumar ⋅ R. Michael Alvarez

→ Feedforward Mixing is as Sharp as it is Slow in Reverse

Benedict Aaron Tjandra ⋅ Avi Wigderson ⋅ João Madeira Araujo ⋅ Oleksandr Vitvitskyi ⋅ Federico Barbero ⋅ Petar Veličković

→ On Cost-Effective LLM-as-a-Judge Improvement Techniques

Ryan Lail ⋅ Luke Markham

→ Style Conventions Override Performance Predictions in Coding LLMs

Matthew Kotzbauer

→ Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

Ryo Mitsuhashi ⋅ Patrick Chen ⋅ Isabelle Tseng ⋅ Jasin Cekinmez ⋅ Addison J. Wu

→ Spectral Signatures of Large Language Models

Zhuoying Zhang ⋅ Ishan V Prasad ⋅ Zihang Liu ⋅ Yuanzhe Hu ⋅ Hengrui Luo ⋅ Pu Ren ⋅ Yaoqing Yang