Yihang Yao yihangya[at]andrew.cmu.edu

Hi, welcome to my website! I am a 4th-year Ph.D. candidate in the Safe AI Lab at Carnegie Mellon University, advised by Prof. Ding Zhao. I received my Bachelor's degree from Shanghai Jiao Tong University in 2022 and spent a wonderful year as a visiting student in the Intelligent Control Lab at CMU, working with Prof. Changliu Liu. I previously interned at the MIT-IBM Watson AI Lab, IBM Research, and have also been working with Google DeepMind since 2023.

My research focuses on reinforcement learning and large language models, with a current emphasis on training proactive agents for user-centric applications.

News

Talks
Selected Publications / Preprints
(* indicates equal contribution)
Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization

Yihang Yao, Zhepeng Cen, Haohong Lin, Shiqi Liu, Zuxin Liu, Jiacheng Zhu, Zhang-Wei Hong, Laixi Shi, Ding Zhao

ICLR 2026 LLA Workshop, Preprint

TL;DR: BAO is an agentic RL framework that advances the Pareto frontier between task performance and user engagement in proactive agent training.

Paper / Website / Code

Tailored Primitive Initialization is the Secret Key to Reinforcement Learning

Yihang Yao, Guangtao Zeng, Raina Wu, Yang Zhang, Ding Zhao, Zhang-Wei Hong, Chuang Gan

2025, Preprint

TL;DR: Tailor is a mid-training data pipeline that automatically discovers and curates diverse, high-quality reasoning primitives to warm-start LLMs, leading to more stable and sample-efficient RL with higher downstream performance.

Paper

Behavior Injection: Preparing Language Models for Reinforcement Learning

Zhepeng Cen*, Yihang Yao*, William Han, Zuxin Liu, Ding Zhao

NeurIPS 2025

TL;DR: This paper proposes behavior injection, a mid-training approach that augments demonstration data with exploratory and exploitative behaviors to improve RL data co-influence and sampling efficiency.

Paper / Website / Code

Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training

Yihang Yao, Zhepeng Cen, Miao Li, William Han, Yuyou Zhang, Emerson Liu, Zuxin Liu, Chuang Gan, Ding Zhao

ACL 2025 Findings

TL;DR: MEND enhances LLM robustness by improving query-symmetry awareness, boosting reasoning consistency and performance through structured dataset curation.

Paper

OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning

Yihang Yao*, Zhepeng Cen*, Wenhao Ding, Haohong Lin, Shiqi Liu, Tingnan Zhang, Wenhao Yu, Ding Zhao

NeurIPS 2024

TL;DR: We investigate offline RL from a data-centric perspective and propose a diffusion model-based data generator to curate training datasets aligned with user preferences.

Paper / Website / Code

Feasibility Consistent Representation Learning for Safe Reinforcement Learning

Zhepeng Cen, Yihang Yao, Zuxin Liu, Ding Zhao

ICML 2024

TL;DR: We introduce FCSRL, a framework that improves safety constraint estimation in RL through representation learning and self-supervised techniques.

Paper / Website / Code

Gradient Shaping for Multi-Constraint Safe Reinforcement Learning

Yihang Yao, Zuxin Liu, Zhepeng Cen, Peide Huang, Tingnan Zhang, Wenhao Yu, Ding Zhao

L4DC 2024

TL;DR: We introduce GradS, a gradient-shaping method that improves training efficiency in multi-constraint safe RL by rebalancing reward and constraint gradients, optimizing both task performance and constraint satisfaction.

Paper / Website / Code

Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning

Yihang Yao*, Zuxin Liu*, Zhepeng Cen, Jiacheng Zhu, Wenhao Yu, Tingnan Zhang, Ding Zhao

NeurIPS 2023

TL;DR: We introduce CCPO, a framework for versatile/adaptive safe RL that enables efficient training and zero-shot adaptation to varying safety constraints.

Paper

Constrained Decision Transformer for Offline Safe Reinforcement Learning

Zuxin Liu*, Zijian Guo*, Yihang Yao, Zhepeng Cen, Wenhao Yu, Tingnan Zhang, Ding Zhao

ICML 2023

TL;DR: We propose CDT for offline safe RL, a multi-objective sequence-modeling approach that balances safety and task performance, yielding adaptable, robust, high-reward policies with zero-shot adaptation to new constraint thresholds.

Paper / Code (DSRL)

Services