From Group-based to Temporal-based: Single-stream Policy Optimization by Liao Chang
Abstract
Group-based advantage estimators (such as GRPO) used in LLM policy gradient training are inefficient, wasting rollouts with zero advantage and introducing "synchronization barriers" that stall training. This presentation introduces Single-stream Policy Optimization (SPO), a method that replaces this inefficient grouping with temporal advantage estimation. SPO leverages a KL-Adaptive Value Tracker to maintain a robust online baseline, enabling efficient single-stream processing. This approach improves training stability by reducing advantage variance and achieves a 4.35x wallclock speedup by eliminating group-based bottlenecks, while matching or even outperforming GRPO on mathematical reasoning tasks.
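To make the idea of a temporal, per-sample baseline concrete, below is a minimal Python sketch of a KL-adaptive value tracker. The class name, parameters (base_lr, kl_scale, max_lr), and update rule are illustrative assumptions rather than the paper's actual formulation; the intent is only to show how a single rollout's advantage can be computed against a running baseline whose update rate grows when the policy has drifted (larger KL), with no group of sibling rollouts required.

```python
class KLAdaptiveValueTracker:
    """Hypothetical sketch (not the paper's implementation): an online baseline
    for single-stream advantage estimation. The tracked value follows observed
    rewards, and the update rate increases with the KL divergence from the
    policy that produced earlier estimates, so the baseline adapts faster when
    the policy has changed more."""

    def __init__(self, init_value=0.0, base_lr=0.05, kl_scale=1.0, max_lr=0.5):
        self.value = init_value      # running estimate of expected reward
        self.base_lr = base_lr       # update rate when the policy is unchanged
        self.kl_scale = kl_scale     # how strongly KL drift raises the rate
        self.max_lr = max_lr         # cap to keep the tracker stable

    def update(self, reward, kl_to_old_policy):
        # Larger KL => staler baseline => track the new reward more aggressively.
        lr = min(self.max_lr, self.base_lr * (1.0 + self.kl_scale * kl_to_old_policy))
        advantage = reward - self.value   # per-sample (temporal) advantage
        self.value += lr * advantage
        return advantage


# Single-stream usage: one rollout per prompt, no group synchronization barrier.
tracker = KLAdaptiveValueTracker()
for reward, kl in [(1.0, 0.02), (0.0, 0.10), (1.0, 0.05)]:
    adv = tracker.update(reward, kl)
    print(f"advantage={adv:+.3f}, baseline={tracker.value:.3f}")
```

Because each rollout is scored against the tracker rather than against a group mean, no advantage is forced to zero when all sibling rollouts agree, and training need not wait for an entire group to finish.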
Biography
Liao Chang is a Ph.D. student in the CCDS at NTU, advised by Prof. Ke Yiping. His research focuses on applying Reinforcement Learning to scientific discovery, particularly multistep retrosynthesis and de novo molecule design. He graduated from NTU with a Bachelor of Engineering (Highest Distinction) in 2021. During his undergraduate studies, he gained industry experience as an Algorithm Engineer at Huawei Shield Lab and as a Data Scientist at Shopee. Beyond his research, he previously served as Vice President of the NTU Da Vinci 3D Printing and Robomaster Club. He is also a keen gym-goer, gamer, and rock music lover.