Hybrid Post-training for Creative Writing: Integrating GDPO and Expert Opinion Distillation
Yuan Yuan*, Bowen Jiang*, Ziyi Liu, Zhuoqun Hao
Work in Progress
We developed a hybrid post-training framework to enhance LLM creative writing by combining Group reward-Decoupled Normalization Policy Optimization (GDPO) with Supervised Fine-Tuning (SFT) on expert feedback. We implemented a multi-reward RL pipeline using different writing criteria to optimize the model via GDPO, enabling the model to balance diverse stylistic and structural reward signals. Then, we integrated LLM expert opinions into model parameters using SFT, transitioning from purely numeric reward-based optimization to a more nuanced, qualitative-informed distillation of expert critiques.
Bowen Jiang, Yuan Yuan, Zhuoqun Hao, Zhangchen Xu, Anvesh Rao Vijjini, Ziyi Liu, Radha Poovendran, Dan Roth, Camillo J. Taylor, Sihao Chen.
Under Review at ARR January
We built a synthetic dataset to evaluate and improve LLMs’ ability to detect implicit user preferences in long-term conversations; using Reinforcement Fine-Tuning (RFT), we designed reward signals that encourage models to infer subtle, unspoken preferences and adapt responses accordingly. Our results show that RFT-trained models achieve stronger personalization and alignment with evolving user needs, advancing beyond memory evaluation toward more adaptive, context-aware AI assistants.
Young-Min Cho*, Yuan Yuan*, Sharath Chandra Guntuku, Lyle Ungar
Under Review at ARR January
We studied how defining personas with style features (e.g., “empathetic”) shapes agent behavior and whether traits interact across dimensions. Our experiments test whether adjusting empathy meaningfully shifts responses and explore cross-trait effects, such as whether higher empathy decreases helpfulness or whether conciseness and informativeness are inherently coupled. This work advances understanding of how personality traits influence conversational dynamics and how agents can better align with human perceptions of persona.
Sunny Rai, Jeffrey Cho, Yuan Yuan, Neil Sehgal, Sharath Chandra Guntuku, Lyle Ungar
Work in Progress
We investigated cross-cultural style variations between Chinese and American advice-seeking platforms (Zhihu and Reddit) by developing a question-matching pipeline using M3 embeddings and GPT-4o-mini to align semantically similar queries across languages. This work advanced our understanding of LLMs’ (GPT, Qwen) ability to replicate cultural nuances through multilingual advice generation and culturally informed prompting, by analyzing dimensions of style and human preferences.
AgentWorld: A Multi-Agent Research Platform for Collaborative AI Systems
Raphael Zhu, Jeffrey Cho, Yusen Zhang, Wenliang Zheng, Yuan Yuan, Jin Mo Yang, Chi Wang, Rui Zhang
Work in Progress
We developed AgentWorld, an open-source research platform built on the Kaetram engine and designed to study multi-agent coordination within a persistent 2D multiplayer environment. The project addresses the need for a high-fidelity sandbox where AI agents can move beyond simple logic to perform complex, long-horizon tasks such as crafting, trading, and combat. By providing a rich observation and action space via a unified API, AgentWorld enables the transition from isolated LLM benchmarking to dynamic, multi-agent behavioral research, offering significant implications for understanding how autonomous agents collaborate or compete in resource-constrained environments.
Yuan Yuan*, Muyu He*, Adil Shahid, Jiani Huang, Ziyang Li, Li Zhang
EMNLP 2025 Main
We collected and cleaned a detective text game, Ace Attorney, to construct a long-context reasoning benchmark (over 50k tokens per question); our initial findings revealed that most prominent LLMs could not answer these questions properly (accuracy below 0.4).
Bowen Jiang*, Yuan Yuan*, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
EMNLP 2025 Findings
We presented a new text rendering and editing algorithm for diffusion models that improves text generation and allows users to specify fonts to generate in images; our approach preserves font features by using a segmentation model and additional image filtering, eliminating the need for any ground-truth font labels.
Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
COLM 2025 Main
We introduce PersonaMem, a personalization benchmark that features scalable and persona-oriented multi-session user-LLM conversations, as well as fine-grained in-situ user query types designed to evaluate LLM capabilities in memorizing, tracking, and incorporating users’ dynamic profiles into personalized responses across diverse scenarios.
The paradigm of retrieval-augmented generation (RAG) helps mitigate hallucinations of large language models (LLMs). However, RAG also introduces biases contained within the retrieved documents. These biases can be amplified in scenarios which are multilingual and culturally sensitive, such as territorial disputes. In this paper, we introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. To evaluate LLMs' cross-lingual robustness for this task, we formalize several modes for multilingual retrieval. Our experiments on several LLMs reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents, showing how incorporating diverse perspectives improves robustness. Also, querying in low-resource languages displays a much wider variance in the linguistic distribution of response citations. Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents. We release our benchmark and code to support further research towards ensuring equitable information access across languages at this https URL.
Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Yuan Yuan, Zhuoqun Hao, Xinyi Bai, Weijie J. Su, Camillo J. Taylor, Tanwi Mallick
NAACL 2025 main
Unlike reasoning, which aims to draw conclusions from premises, rationality ensures that those conclusions are reliably consistent, have an orderability of preference, and are aligned with evidence from various sources and logical principles. This survey is the first to comprehensively explore the notion of rationality in language and multimodal agents, inspired by cognitive science.