Reinforcement Learning is the Ultimate Answer!
Short Bio
Born in 2000, I grew up alongside the rise of intelligent machines. I received my B.Eng. degree from the School of Vehicle and Mobility at Tsinghua University, Beijing, China, in 2021. I am currently pursuing a Ph.D. in Mechanical Engineering at Tsinghua University. From 2025 to 2026, I am also a visiting student researcher in the Department of Mechanical Engineering at UC Berkeley, California, United States. My research focuses on reinforcement learning post-training, with an emphasis on reasoning, planning, and decision-making.
- 6 papers accepted at top-tier (A-level) venues, including an ICLR 2026 Oral (top 1%).
- 4 additional papers currently under review at top-tier (A-level) venues.
- Actively seeking industry opportunities in LLM post-training, particularly in reinforcement learning, reasoning, and agentic systems.
Curriculum Vitae
My CV is available here: Guojian_Zhan_Resume.pdf
Research Intern Experience
Didi Voyager AI Lab
- Worked on reinforcement learning post-training for a multimodal planning model, improving MPCI from 30 to 50.
- Developed Stable RLVR methods to enhance the reasoning capabilities of large language models.
ByteDance Seed Robotics
- Worked on RLVR post-training of vision-language models (VLMs) for planning.
- Explored Agentic RL post-training to strengthen the planning capabilities of VLM-based agents.
GPA & Awards
- Received my B.Eng. degree from Tsinghua University with a GPA of 3.78/4.0.
- Maintaining a graduate GPA of 3.93/4.0 during my Ph.D. studies.
- Received the Comprehensive Excellence Award and was named an Outstanding Graduate of Tsinghua University.
- Awarded the 2025 National Graduate Scholarship (研究生国家奖学金; top 0.2% nationwide).
News & Updates
- 2026-05: SAFLOW and STAPO were submitted to NeurIPS 2026.
- 2026-04: BOOM-H, BOOM-L, DSAC-E, DSAC-AID, and DADP were accepted to ICML 2026.
- 2026-02: MVP was selected for an oral presentation at ICLR 2026 (top 1%).
- 2026-01: MVP was accepted to ICLR 2026.
- 2025-12: A revision of CTPG was submitted to IEEE TPAMI.
- 2025-09: BOOM and MPGE were accepted to NeurIPS 2025.
- 2025-08: BPO was accepted by IEEE TNNLS.
- 2025-05: PINPE was accepted by IEEE RAL.
Research Highlights
Stable RL for LLMs
STAPO (arXiv), submitted to NeurIPS 2026.
Reinforcement learning holds great promise for improving LLM reasoning, but training can be unstable, with abrupt entropy shifts and gradient spikes. Through token-level analysis, we found that filtering a small fraction of spurious tokens can stabilize entropy dynamics and enable steady, sustained performance gains throughout RL training.
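As a rough illustration of what token-level filtering can look like in a loss function, the sketch below averages a per-token policy-gradient loss over only the tokens that a boolean mask keeps. This is not the STAPO criterion itself: the function name `masked_pg_loss` and the `keep` mask are hypothetical, and the rule for deciding which tokens are spurious is assumed to be supplied from outside.

```python
from typing import Sequence


def masked_pg_loss(logprobs: Sequence[float],
                   advantages: Sequence[float],
                   keep: Sequence[bool]) -> float:
    """Mean policy-gradient loss over the tokens where keep[i] is True.

    Hypothetical helper for illustration only. `keep` marks tokens that
    survive filtering; how spurious tokens are detected is an assumption
    left to the caller.
    """
    if not (len(logprobs) == len(advantages) == len(keep)):
        raise ValueError("logprobs, advantages, and keep must have equal length")
    # Per-token loss is -logprob * advantage; filtered tokens are skipped.
    kept = [-(lp * adv) for lp, adv, k in zip(logprobs, advantages, keep) if k]
    # If every token is filtered out, return zero instead of dividing by zero.
    return sum(kept) / len(kept) if kept else 0.0
```

Normalizing by the number of kept tokens, rather than by the full sequence length, is one possible design choice here; it keeps the loss scale comparable when the filtered fraction varies between batches.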
Mean Velocity Policy
MVP, ICLR 2026 Oral (top 1%).
MVP advances generative reinforcement learning toward single-step inference, enabling both high-fidelity action generation and fast execution. We identify a theoretical limitation of existing generative mean velocity field learning: the absence of boundary conditions. To address it, we introduce instantaneous state-wise velocity constraints and provide a rigorous proof of completeness. Combined with rejection sampling in an online RL setting, MVP achieves state-of-the-art performance on RoboMimic and OGBench.
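To make the rejection-sampling step more concrete, here is a minimal sketch of a generic best-of-N scheme: draw several candidate actions from a single-step sampler and keep the one a critic scores highest. How MVP actually scores or selects candidates is not specified above, so `best_of_n`, `sample_action`, and `q_value` are hypothetical names and the selection rule is an assumption.

```python
from typing import Callable, List, Sequence

Action = List[float]


def best_of_n(state: Sequence[float],
              sample_action: Callable[[Sequence[float]], Action],
              q_value: Callable[[Sequence[float], Action], float],
              n: int = 8) -> Action:
    """Draw n candidate actions and return the one with the highest score.

    Hypothetical sketch: `sample_action` stands in for a single-step
    generative policy and `q_value` for a learned critic.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_action(state) for _ in range(n)]
    # Keep only the best-scoring candidate; the rest are rejected.
    return max(candidates, key=lambda a: q_value(state, a))
```

Because each candidate costs one sampler call, a scheme like this only stays cheap when the policy generates an action in a single step, which is why it pairs naturally with single-step inference.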
Bootstrap Off-Policy Learning with a World Model
BOOM, NeurIPS 2025.
BOOM is a world-model-based RL framework that creates a bootstrap loop between a policy and a model-based planner. The policy warm-starts planning, while the planner generates stronger trajectories that improve the policy through likelihood-free alignment. BOOM achieves state-of-the-art performance on the DeepMind Control Suite and HumanoidBench.
Academic Service
I serve as a reviewer for conferences and journals in AI and robotics, including NeurIPS, ICML, ICLR, CoRL, ICRA, IROS, IEEE TNNLS, IEEE RAL, and IEEE T-ITS.
Contact
- Email: zgj21@mails.tsinghua.edu.cn
- Email: zgj21@berkeley.edu
- Email: zishangzhan@gmail.com
