Manan Gupta ⋅ Dhruv Kumar
→ BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding
Patrick Knab ⋅ Orgest Xhelili ⋅ Inis Buzi ⋅ Drago A Nilo ⋅ Mohd S Khan ⋅ Lorenz Kolb ⋅ Manuel Scherzer ⋅ Kerem Yildirir ⋅ Christian Bartelt ⋅ Philipp J Schubert
Nicole H. ⋅ Nick Rui
→ Perplexity Cannot Always Tell Right from Wrong
Petar Veličković ⋅ Federico Barbero ⋅ Christos Perivolaropoulos ⋅ Simon Osindero ⋅ Razvan Pascanu
→ Evaluator Failure Modes in Agentic Uncertainty Quantification
Suresh Raghu ⋅ Satwik Pandey ⋅ Shashwat Pandey
→ CellARC: An Oracle-Calibrated Benchmark for Few-Shot Rule Induction
Miroslav Lžičař
→ YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Muyu He ⋅ Vincent Tu ⋅ Adit Jain ⋅ Anand Kumar ⋅ Sachin Patro ⋅ Soumyadeep Bakshi ⋅ Nazneen Rajani
Bruce C Xu ⋅ Jose James ⋅ Alexander J Ryu
→ Hidden Sensitivity in Spatial Reasoning Evaluation: Diagnosis and Re-ranking with VSI-Bench
Phillip Y Lee ⋅ Jin Yoo ⋅ Minseo Kim ⋅ Leonidas Guibas ⋅ Minhyuk Sung
→ Toward Trustworthy LLM–GNN Fusion: A Fusion-Aware Evaluation and Reporting Framework
Zhifei Hu ⋅ Alexandra I Cristea
→ Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
Xing Zhang ⋅ Guanghui Wang ⋅ Yanwei Cui ⋅ Qucy W Qiu ⋅ Ziyuan Li ⋅ Bing Zhu ⋅ Peiyang He
→ The Propagation Field: A Geometric Substrate Theory of Deep Learning
Xingrui Gu
Anuraag Gadehothur Karnam ⋅ Tarunesh Sathish
Zacharie Bugaud
→ PromptSplit: Revealing Prompt-Level Disagreement in Generative Models
Mehdi Lotfian ⋅ Mohammad Jalali ⋅ Farzan Farnia
→ SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks
Yanhang Li ⋅ Zhichao Fan ⋅ Zexin Zhuang
→ From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
Sena Korkut ⋅ Maria A Bravo ⋅ Sanghwan Kim ⋅ Zeynep Akata
→ Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection
Everett Richards
→ Simulating Field Experiments for Method Testing
Enoch H. Kang
Dhatri C ⋅ Tadisetty S Yashwanth
→ Universality, Composition Generalization, and Algorithm Emulation All In-Context
Jerry Yao-Chieh Hu ⋅ Hong-Yu Chen ⋅ Po-Chiao Lin ⋅ Maojiang Su ⋅ Han Liu
Kaustubh Bukkapatnam ⋅ Siddharth Karuturi
→ Constructing Korean Benchmark Suite for Reliable Evaluation of Foundation Models
Yeonkyoung So ⋅ Jongmin Kim ⋅ Sungmok Jung ⋅ Gyuseong Lee ⋅ Sangho Kim ⋅ Jongyeon Park ⋅ Joonhak Lee ⋅ Seho Pyo ⋅ Gyeongje Cho ⋅ Seorin Kim ⋅ Jisoo Kim ⋅ Suyoung Park ⋅ Hyunji M Park ⋅ Yelim Ahn ⋅ Yeongho Seo ⋅ Jaejin Lee
→ Context Saturation in Zero-Shot Time-Series Foundation Models
Miguel Nogales ⋅ Luca Butera ⋅ Alberto Ferrante ⋅ Cesare Alippi
→ MultiVulnBench: A Large-Scale Benchmark for Count Bias in LLM-Based Multi-Vulnerability Detection
Manan Gupta ⋅ Chinmay Pushkar ⋅ Sanchit Kabra ⋅ Dhruv Kumar ⋅ Jagat S Challa
→ A Controlled Benchmark for Lag-Structured Dependency Motifs
Bowen Qi
Sanny Kim
Siddharth Karuturi ⋅ Kaustubh Bukkapatnam ⋅ Laksh Patel ⋅ Tanush A Shastry ⋅ Akshath Sharma ⋅ Mithil Shah ⋅ Matthew Park
Chien-Hung Lai ⋅ Yuh-Shyan Hwang ⋅ Yi Lin
→ Frontier Inference Under Repeated Partial Reporting
Yanan Long
→ The Shape of Noise: Layer-Wise Perturbation Profiles for Diagnosing Vision Robustness
Son Nguyen ⋅ V. G Bao ⋅ Quang M Phan ⋅ Trong P Le
→ Scale Dependent Data Duplication
Joshua Kazdan ⋅ Noam Levi ⋅ Rylan Schaeffer ⋅ Jessica Chudnovsky ⋅ Abhay Puri ⋅ Bo He ⋅ Mehmet Donmez ⋅ Sanmi Koyejo ⋅ David Donoho
Marius Tacke ⋅ Shivam Suri ⋅ Matthias Busch ⋅ Mahish K Guru ⋅ Christian J Cyron ⋅ Roland Aydin
→ Internal Data Repetition Destroys Language Models
Jessica Chudnovsky ⋅ Joshua Kazdan ⋅ Noam Levi ⋅ Rylan Schaeffer ⋅ Yegor Denisov-Blanch ⋅ Sanmi Koyejo ⋅ David Donoho
Zhuohan Wang ⋅ Ziwei Zhu ⋅ Ziniu Li ⋅ Congliang Chen ⋅ Zhihang Lin ⋅ MingZhe Yang ⋅ Yizhou Han ⋅ Yufeng Lin ⋅ Angyang Gu ⋅ Xinglin Hu ⋅ Ruoyu Sun ⋅ Tian Ding
Shivansh Bibra ⋅ Dhruv Kumar ⋅ Murari Mandal ⋅ Yash Sinha
→ LoopNav: Benchmarking Spatial Consistency in World Models
Kewei Lian ⋅ Shaofei Cai ⋅ Yitao Liang ⋅ Anji Liu
→ Symmetries of Functional Processes under Label Noise
Abhra Chaudhuri ⋅ Pedro Gomes
→ Null-Calibrated Evaluation of Sparse Autoencoder Decoder Reproducibility
Bright Liu
Ching-Yu Lin ⋅ Yifan Liu
→ Collaborative Adaptive Labeling with Imperfect Labelers and Selective Expert Escalation
Xinrui Ruan ⋅ Nanshan Jia ⋅ Waverly Wei ⋅ Sui Huang ⋅ Zhenyu Zhao ⋅ Zeyu Zheng ⋅ Jingshen Wang
→ The Prompt Is the Analytic Choice: Specification Curve Analysis for LLM-Based Social Science
Jacob Crainic ⋅ Brandon Yee ⋅ Pairie Koh
→ Identifying Efficient Queries for Black-Box Model Classification
Merrick Ohata ⋅ Carey Priebe ⋅ Hayden Helm
→ AIE-Bench: Benchmarking Agents That Build Agents
Abhishek Mishra ⋅ Selvam Palanimalai ⋅ Yogendra Manawat ⋅ Samuel Verboomen ⋅ Prannay Hebbar ⋅ Damir Vrabac ⋅ Deepak Nathani ⋅ Sumeet Motwani ⋅ Kunal Bhatia ⋅ Vignesh Baskaran
Zexin Zhuang ⋅ Yanhang Li ⋅ Zhichao Fan
→ A Benchmarked Diagnostic for Sparse Decomposability of Dense Causal Subspaces
Socrates Osorio ⋅ Joy Z. Yang
→ Efficient Safety Benchmarking via Item Response Theory
Fabio Spagliardi ⋅ Mírian Silva ⋅ Ayan Datta ⋅ Aiden Zhou ⋅ Vamshi Krishna Bonagiri ⋅ Diogo Cruz
→ A Numerical Study of Robustness Verification for Lightning Self-Attention
Yulia Alexandr ⋅ Hao Duan ⋅ Guido Montufar
→ When Agreement Becomes Unsafe: Loss-Aware Energy Control for Diagnostic Deliberation
Yuting Yan ⋅ Yinghao Fu ⋅ Haozhou Gao ⋅ Tianjian Zhang ⋅ Aoxi Liu ⋅ Shuang Li
→ Bounding Compositional Incoherence in Foundation Models
Anany Kotawala
→ Contextual Observability and Grammar Singularity for Compositional Task Families
Manoj Saravanan ⋅ Rohit Kumar Salla ⋅ Shrikar R Kota
→ Selective Perturbations as a Diagnostic for Benchmark-Based LLM Comparisons
Ivan Dubrovsky ⋅ Anastasia Orlova ⋅ Nina Gubina ⋅ Illarion Iov ⋅ Irena Gureeva ⋅ Nikolay Nikitin ⋅ Alexey Zaytsev
→ Correcting Optimizer Selection Bias via Large Deviation Hazards
Andrea Zerio ⋅ Andres R Masegosa
→ Stress-Testing Neural Network Verifiers with Provably Robust Instances
David Troxell ⋅ Yulia Alexandr ⋅ Sofia Hunt ⋅ Stephanie Lei ⋅ Guido Montufar
→ Functional Subspace, where language models can use vector algebra to solve problems
Jung H Lee ⋅ Sujith Vijayan
→ Uplifting Human Decision Making in AI Evaluation by Automating Benchmark Validity Analysis
Rodolfo Corona ⋅ Sang Truong ⋅ Ritwik Gupta ⋅ Nhi N Truong ⋅ Atnafu Lambebo Tonja ⋅ Mena Attia ⋅ Fahim Faisal ⋅ Kaushal K Maurya ⋅ Fred Philippy ⋅ Belu Ticona ⋅ Sumaya N Adan ⋅ Fazl Barez ⋅ Omar Florez ⋅ Supheakmungkol Sarin ⋅ Aseem Srivastava ⋅ Xiaoyuan Yi ⋅ Nick Haber ⋅ Dan Klein ⋅ Thamar Solorio ⋅ Xing Xie ⋅ Sanmi Koyejo ⋅ Robert Trager
→ Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta
→ DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs
Art Kanke
→ Active probabilistic reasoning in humans and LLMs
Gonçalo Guiomar ⋅ Elia Torre ⋅ Pehuen Moure ⋅ Victoria Shavina ⋅ Mario Giulianelli ⋅ Shih-Chii Liu ⋅ Valerio Mante
→ Evaluating LLM Reasoning on Operating System Algorithms via Step-Level Verification
Jalluri Mahesh Kumar ⋅ Junjunoori S Chakri ⋅ Yash Kothari ⋅ Murari Mandal ⋅ Yash Sinha ⋅ Dhruv Kumar
→ Operads for compositional reasoning in LLMs
Nathaniel Bottman ⋅ Kyle Richardson
→ Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Manan Gupta ⋅ Dhruv Kumar
→ Ensuring Calibration Robustness in Split Conformal Prediction Under Adversarial Attacks
Xunlei Qian ⋅ Yue Xing
→ Rethinking LLM Confidence: From Calibration to Coherence
Krish Matta ⋅ Atharv Naphade ⋅ Andy Zou
→ Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
Sethuraman T V ⋅ Savya Khosla ⋅ Aditi Tiwari ⋅ Vidya Ganesh ⋅ Rakshana Jayaprakash ⋅ Aditya Jain ⋅ Vignesh Srinivasakumar ⋅ Onkar Susladkar ⋅ Joey Wang ⋅ Srinidhi Sunkara ⋅ Aditya Shanmugham ⋅ Abbaas A Nishar ⋅ Rakesh Vaideeswaran ⋅ Simon Jenni ⋅ Derek Hoiem
Han Lee ⋅ Rohan K Dalal ⋅ Irene Tang
Jiwoo Han ⋅ Yuekai Sun
→ Rethinking FID Through the Geometry of the Reference Dataset
Yunghee Lee ⋅ Byeonghyun Pak
→ A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation
Hosna Oyarhoseini ⋅ Jimmy Lin ⋅ Amir-Hossein Karimi
→ SemanticSRJudge: Spatially-Grounded VLM Evaluation for Super-Resolution Quality Assessment
Vishwajeet Shukla ⋅ Ankit Dhankhar ⋅ Ajay Bedi
→ Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench
Vikhyath Kothamasu ⋅ Virginia Smith ⋅ Chhavi Yadav
Heejin Choi
→ Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension
Amanda Bertsch ⋅ Luca Soldaini ⋅ Matthew Gormley ⋅ Graham Neubig ⋅ Hannaneh Hajishirzi ⋅ Kyle Lo ⋅ Dirk Groeneveld
→ From Forecast Scores to Auditable Benchmarks: WorldFork for LLM Forecasting Evaluation
Hanson Wen ⋅ Shing C Gui
→ FRAME: Framework for Robotic Action and Motion Evaluation
Ameya Wagh ⋅ Vishnu Rudrasamudram
Dipam Paul
→ Instance-Optimal Estimation with Multiple LLM Judges on a Budget
Junghyun Lee ⋅ Sanghwa Kim ⋅ Yassir Jedra ⋅ Alexandre Proutiere ⋅ Se-Young Yun
→ How long is a piece of string? A brief empirical analysis of tokenizers
Jonathan Roberts ⋅ Kai Han ⋅ Samuel Albanie
Ananya Salian
Kshitij Dahiya ⋅ Vinay K Saini
→ Estimating Pass@k from Fewer Samples with Hierarchical Bayesian Priors
Alexandre Verine ⋅ Florian Le Bronnec ⋅ Benjamin Negrevergne ⋅ Alexandre Allauzen
Zacharie Bugaud
→ On the Rotation-Equivariance Geometry of Tabular Foundation Models
Mert Ogul
Xingyu Ren ⋅ Youran Sun ⋅ Haoyu Liang
→ Benchmarks Are Not Atomic: Composition-Aware LLM Evaluation using BenchHub
Eunsu Kim ⋅ Haneul Yoo ⋅ Guijin Son ⋅ Hitesh Patel ⋅ Amit Agarwal ⋅ Alice Oh
→ Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots
Mark Vero ⋅ Fabian Kaczmarczyck ⋅ Ivan Petrov ⋅ Ilia Shumailov ⋅ Niels Heinen ⋅ Jamie Hayes ⋅ Tianqi Fan ⋅ Luca Invernizzi ⋅ Martin Vechev
→ Trace-Aware Routing for Cost-Effective Human–AI Collaborative Labeling
Waverly Wei ⋅ Xinrui Ruan ⋅ Zhenyu Zhao ⋅ Sui Huang ⋅ Zeyu Zheng ⋅ Jingshen Wang
→ Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Yiding Song ⋅ Hanming Ye
→ Generalized Priority-Aware Shapley Value
Kiljae Lee ⋅ Ziqi Liu ⋅ Weijing Tang ⋅ Yuan Zhang
→ Generative vs Discriminative? Revisiting the shortcut learning debate in text classification
Siva Rajesh Kasa ⋅ Karthik Raavi ⋅ Sumegh Roychowdhury ⋅ Pattisapu Priyatam ⋅ Ashutosh Kumar ⋅ Yaswanth Biruduraju ⋅ Santhosh Kasa ⋅ Ankith M S ⋅ Sumit Negi
→ Measuring the Limits of Continual Learning for LLMs
Nimit Kalra ⋅ Narutatsu Ri ⋅ Zerzar Bukhari ⋅ Ang Li ⋅ Sanae Lotfi ⋅ Liam Fowl ⋅ Micah Goldblum
→ Atomic Chess as a Counterfactual Benchmark for Quantifying Rule-Conditioned Generalization
Ryan Co ⋅ Karthik R Konuganti
Yusei Kitamura ⋅ Ahmad A Kamal ⋅ Masaya Fujisawa
Akanksha Narula ⋅ Aaditya Sharma ⋅ Dharya Jasuja ⋅ Aditya Dhawan
→ m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
Yosub Shin ⋅ Michael Buriek ⋅ Igor Molybog
→ Interactive Evaluation Requires a Design Science
Keyang Xuan ⋅ Peiyang Song ⋅ Pan Lu ⋅ Katie Collins ⋅ Pengrui Han ⋅ Wenkai Li ⋅ Zhenyu Zhang ⋅ Zexue He ⋅ Wenyue Hua ⋅ Manling Li ⋅ Jiaxuan You ⋅ Adrian Weller ⋅ Yizhong Wang ⋅ Jiaxin Pei
→ Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Rafal Kocielnik ⋅ Pengrui Han ⋅ Peiyang Song ⋅ Myrl G Marmarelis ⋅ Ramit Debnath ⋅ Dean Mobbs ⋅ Anima Anandkumar ⋅ R. Michael Alvarez
→ Feedforward Mixing is as Sharp as it is Slow in Reverse
Benedict Aaron Tjandra ⋅ Avi Wigderson ⋅ João Madeira Araujo ⋅ Oleksandr Vitvitskyi ⋅ Federico Barbero ⋅ Petar Veličković
→ On Cost-Effective LLM-as-a-Judge Improvement Techniques
Ryan Lail ⋅ Luke Markham
→ Style Conventions Override Performance Predictions in Coding LLMs
Matthew Kotzbauer
→ Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
Ryo Mitsuhashi ⋅ Patrick Chen ⋅ Isabelle Tseng ⋅ Jasin Cekinmez ⋅ Addison J. Wu
→ Spectral Signatures of Large Language Models
Zhuoying Zhang ⋅ Ishan V Prasad ⋅ Zihang Liu ⋅ Yuanzhe Hu ⋅ Hengrui Luo ⋅ Pu Ren ⋅ Yaoqing Yang
→ FormalImG: Evaluating Structural Compositional Generalization for T2I Models
Hong-Jie You ⋅ Jie-Jing Shao ⋅ Xiao-Wen Yang ⋅ Zhi-Fan Wu ⋅ Lin-Han Jia ⋅ Lan-Zhe Guo ⋅ Yu-Feng Li
→ Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
Yash G Sawant
Mary Llewellyn ⋅ Annie Gray
→ AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs
Pranay Goel ⋅ Aahana Basappa ⋅ Anusri Karra ⋅ Anish Karra ⋅ Kevin Zhu
→ Probabilistic Chain-of-Thought: Sequential Bayesian Inference over Latent Reasoning Correctness
Suriya D Saravanakumar ⋅ Ezra Wesenie ⋅ Kishore Nuthalapati ⋅ Laksh Patel
→ ContinuityBench: A Framework and Taxonomy for Evaluating Agent Recovery from Interrupted State
Aryan Gulati
→ Retrieval Dwelling: A Principled Sampling Strategy for Exploiting Spurious State Exploration
Rohit Sinha ⋅ Saroj Kumar
Kanav Kapoor ⋅ Dhruv Kumar ⋅ Jagat S Challa ⋅ Murari Mandal ⋅ Yash Sinha
→ CLIP Models Generalize Less Than Compositional Benchmarks Suggest
Shuman Peng ⋅ Arnas Uselis ⋅ Darina Koishigarina ⋅ Martin Ester ⋅ Seong Joon Oh
→ Fast Inference via Hierarchical Speculative Decoding
Clara Mohri ⋅ Amir Globerson ⋅ Haim Kaplan ⋅ Yishay Mansour ⋅ Tal Schuster
→ GapPO: Gradient-Adaptive Pairwise Preference Optimization
Michelle Chang ⋅ Xiaodi Sun ⋅ Ethan C Chau ⋅ Zhaoqiong Huang ⋅ Arpita Das ⋅ Izzie Lau ⋅ Liyuan Zheng ⋅ Huancheng Chen ⋅ Jingwen Lu
Shubh Chapra ⋅ Dhruv Kumar ⋅ Murari Mandal ⋅ Yash Sinha