Yuan Yuan*, Muyu He*, Adil Shahid, Jiani Huang, Ziyang Li, Li Zhang
Ongoing (aiming for EMNLP 2025)
We collected and cleaned transcripts of the detective text game Ace Attorney to construct a long-context reasoning benchmark (over 50k tokens per question). Our initial findings revealed that most prominent LLMs were unable to answer these questions properly (accuracy below 0.4).
Young-Min Cho, Yuan Yuan, Lyle Ungar
Ongoing (aiming for EMNLP 2025)
This research project investigates the effects of persona in conversational agents, focusing on how defining personas through gradients—such as varying levels of empathy—shapes agent behavior and whether persona traits interact across dimensions. It explores whether adjusting traits like empathy produces meaningful shifts in how agents respond, and whether combinations of traits, such as empathy and conciseness, influence each other—for example, questioning if being more empathetic leads to greater wordiness or if conciseness and informativeness are inherently linked. The project aims to deepen understanding of how personality traits shape conversational dynamics and how agents can better align with human perceptions of persona.
Bowen Jiang*, Yuan Yuan*, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
Under Review at ACL 2025
We presented a new text rendering and editing algorithm for diffusion models that improves text generation in images and allows users to specify the fonts to generate. Our approach preserves font features by using a segmentation model and additional image filtering, eliminating the need for any ground-truth font labels.
Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
Under Review at COLM 2025
We introduce PersonaMem, a personalization benchmark that features scalable and persona-oriented multi-session user-LLM conversations, as well as fine-grained in-situ user query types designed to evaluate LLM capabilities in memorizing, tracking, and incorporating users’ dynamic profiles into personalized responses across diverse scenarios.
The paradigm of retrieval-augmented generation (RAG) helps mitigate hallucinations of large language models (LLMs). However, RAG also introduces biases contained within the retrieved documents. These biases can be amplified in multilingual and culturally sensitive scenarios, such as territorial disputes. In this paper, we introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. To evaluate LLMs' cross-lingual robustness on this task, we formalize several modes of multilingual retrieval. Our experiments on several LLMs reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias compared to using purely in-language documents, showing how incorporating diverse perspectives improves robustness. In addition, querying in low-resource languages displays a much wider variance in the linguistic distribution of response citations. Our further experiments and case studies investigate how cross-lingual RAG is affected by factors ranging from IR to document contents. We release our benchmark and code to support further research toward ensuring equitable information access across languages at this https URL.
Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Yuan Yuan, Zhuoqun Hao, Xinyi Bai, Weijie J. Su, Camillo J. Taylor, Tanwi Mallick
NAACL 2025 main
Unlike reasoning, which aims to draw conclusions from premises, rationality ensures that those conclusions are reliably consistent, have an orderability of preference, and are aligned with evidence from various sources and logical principles. This survey is the first to comprehensively explore the notion of rationality in language and multimodal agents, drawing inspiration from cognitive science.
First Open Mic in NYC!