Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval Aarush Sinha
Query Timing Produces Opposite Positional Biases Between LLMs and Humans Jasin Cekinmez, Addison J. Wu, Thomas L. Griffiths
Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata, Kranthi Kiran GV, Wesley Tam, Bala Krishna S Vegesna
The $\Psi$ Paradox in Extreme Superposition: When ETF Alignment Does Not Predict Language Model Generalization Hyunjun Kim
When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making Nazia Riasat
Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure Viliana Devbunova
The Limits of Long-Context Reasoning in Automated Bug Fixing Ravi Shanker Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker
Evaluating Ill-Defined Tasks in Large Language Models Yi Zhou, Basel Shbita
Probing and Steering Chain-of-Thought Unfaithfulness in Language Models Giovanni Maria Occhipinti, Alessandro Abate, Nandi Schoots
Style over Substance: LLM-as-a-Judge Fails to Evaluate Multi-Party Social Dialogue Kunal Samanta, Faisal Tareque Shohan, Amine Trabelsi, Richard Khoury
Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG Martin Asenov, Kenza Benkirane, Daniel Goldwater, Aneiss Ghodsi
The Continuous Space Gap: Why VLMs Fail in Continuous Geometric Reasoning Yikun Zong, Cheston Tan
Not All Time Is Gregorian: Evaluating LLMs on Cultural Calendar Systems Deepon Halder, Adish Pandya, Raj Dabre
Lost in Translation: Why SOTA LLMs Struggle with French NLU Frontiers David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs Aditya Sinha, Harald Steck, Vito Claudio Ostuni, Matteo Rinaldi
EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
Knowing Is Not Seeing. Limits of Physical Problem Solving in VLMs Karim Elmaaroufi, Kevin Chon, Justin Svegliato, Lakshya A Agrawal, Matei Zaharia, Sanjit A. Seshia
Retrieval-Augmented Generation Still Hallucinates under Partial Evidence Mahule Roy, Subhas Roy
Improving Proxy Transfer via Intermediate Proxy Tuning Kevin Kuo, Ayush Sehgal, Robert Pare, Virginia Smith
When can you TRUST Large Language Models? Radu Paradovschi, Darvin Yi, Andrew Rabinovich, Zhao Chen
One Step Forward, Two Steps Back: Regression Errors and Cost Inefficiencies in LLM Iterative Refinement for Code Generation Lucas Teixeira Borges, Ricardo Rios
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic
Non-Monotonicity and Catastrophic Risk of Prompt Interventions in Adversarial LLM Control Koki Inoue, Naoya Takashima, Hayato Fujihara, Shuya Higuchi, Kota Shimomura, Ryuta Shimogauchi, Takayoshi Yamashita
The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries Nora Petrova, John Burden
Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan
EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages Aman Sharma, Paras Chopra
Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs Giyeong Oh, Junghyun Lee, Jaehyun Park, Youngjae Yu, Wonho Bae, Junhyug Noh
A Pilot Study on Doubt Robustness of LLMs in Clinical Prediction Explanation Juhwan Choi, Sangchul Hahn, Eunho Yang
I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha
Reasoning Without Structural Priors: Limits of Synthetic CoT for Molecules Deepa Mal Korani, Mohammad Madani, Lawrence Phillips, Josefa Lia Stoisser, Marc Boubnovski Martell, Kristine Deibler
Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs Suraj Yadav, Siddharth Yadav, Parth Goyal
AI-rithmetic Alex Bie, Travis Dick, Alex Kulesza, Prabhakar Raghavan, Vinod Raman, Sergei Vassilvitskii
Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search Jacopo Minniti, Neil Band, Tim G. J. Rudner
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, Aditi Raghunathan
The Cost of Consistency: Why Cross-Plane Contrastive Learning Fails to Bridge the Gap Between MedSAM-3 and nnU-Net Madhu Shree Aravindan, Aaditi V Bajpai, Ramamoorthy Sriramulu
Why Large Language Models Fail for Hausa Educational Content: Cascading Errors from Translation to Speech to Comprehension Honour-Jesus Bezaleel, Pearse Jim, Moses Daudu
Barriers to Pareto Steerability in Preference-Conditioned LLM Alignment Fatemeh Nourzad, Daouda Sow, Yingbin Liang, Ming Shi, Ming Zhang, Yunxuan Li, Eylem Ekici, Ness Shroff
Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models Jakub Binkowski, Kamil Adamczewski, Tomasz Jan Kajdanowicz
Learning State-Tracking from Code: REPL Traces and Probabilistic Automata Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, Babak Rahmani
The Selective Safety Trap: How LLMs Scaling and Alignment Fail to Generalize Across Minority Demographics Iago Alves Brito, Walcy Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
When Rubrics Backfire: Systematic Preference Drift in LLM Judges Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Steven Wu, Zhun Deng
Synthetic Error Injection Fails to Elicit Self-Correction In Language Models David Xing Wu, Shreyas Kapur, Anant Sahai, Stuart Russell
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Maggie Ziyu Huan, Yuetai Li, Tianyu Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen
Bigger Is Not Better Under Differential Privacy: Optimization Failure at Eleven-Billion Scale in Vision–Language Model Fine-Tuning Tzuen Su, Li-Hong Guo, Yangmi Su, Cheng-Yen Li
Evaluation-Conditioned Trojan Attack Zihan Zhu, Hanlin Zhang, Giovanni D'Antonio, Anton Tsitsulin, Sham M. Kakade, Vahab Mirrokni
FluffInjector: Diagnosing Logical Consistency Failures in Chain-of-Thought Reward Models Varshith Vijjapu, Krishiv Ray, Archana Vaidheeswaran
I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation Fay Elhassan, David Sasu, Lars Henning Klein, Alexandra V. Kulinkina, Mary-Anne Hartley
I Can't Believe It Can't Count: Vision-Language Models Fail at Basic Enumeration Beyond the Subitizing Range Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, Sanjay Lall
The Anatomy of Uncertainty in LLMs Aditya Taparia, Ransalu Senanayake, Kowshik Thopalli, Vivek Narayanaswamy
More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression Aryan Sood, Tanvi Sharma, Vansh Agrawal
Language-Dependent Miscalibration in Multilingual LLM Evaluators Ej Zhou, Lucas Resck, Zheng Hui, Anna Korhonen
Fairness Failure Modes of Multimodal LLMs Canyu Chen, Anglin Cai, Joan Nwatu, Yale Li, Han Liu, Jessica Hullman, Rada Mihalcea, Kathleen McKeown, Manling Li
I Can't Believe LLMs Still Can't Write Drama: Multi-Dimensional Failures in Script Continuation Shijian Ma, Yunqi Huang, Lin Yan
The Low-Frequency Trap: Why Scaling Doesn't Solve Simple Temporal Counting Sarvesh Baskar, Muhammad R. Islam, Zikui Cai, Ankit Nakhawa, Anirudh Satheesh, Tom Goldstein, Furong Huang
QuanBench Plus: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Hasan Abed Al Kader Hammoud, Ammar Mohanna, Bernard Ghanem
Can LLMs Perceive Time? An Empirical Investigation Aniketh Garikaparthi
When Lie Detectors Learn Model Identity: Confounds in Black-Box Sandbagging Detection Lin Yulong, Pablo Bernabeu-Perez, Benjamin Arnav, Lennie Wells, Mary Phuong