The Evolution of Policy Optimization: Understanding GRPO, DAPO, and Dr. GRPO's Theoretical Foundations
This article serves as the theoretical companion to "Bridging Theory and Practice: Understanding GRPO Implementation Details in Hugging Face's TRL Library." While the companion piece focuses on implementation specifics, here we'll explore the mathematical foundations and conceptual evolution of these cutting-edge reinforcement learning algorithms for language models.
I'll examine three key algorithms that represent the rapid advancement in this field:
- GRPO (Group Relative Policy Optimization): The pioneering approach from DeepSeek that established a new paradigm for training reasoning capabilities in LLMs
- DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization): An open-source system that scaled reinforcement learning for LLMs while addressing key limitations in GRPO
- Dr. GRPO (GRPO Done Right): A critical revisiting of GRPO that identified and corrected optimization biases in the original algorithm
Beyond mathematical formulations, I'll analyze the theoretical implications of design choices in each algorithm, particularly focusing on KL divergence terms, advantage estimation, and normalization strategies. As these developments continue at breakneck speed, I'll share insights on experimental design considerations and the importance of critical evaluation when implementing these techniques.
To ground this theoretical exploration, I'll conclude with key code snippets showcasing how the Hugging Face community has implemented these algorithms, highlighting where theory meets practice in production-ready systems.
GRPO (Group Relative Policy Optimization) is an online learning algorithm introduced by DeepSeek that improves iteratively by using data generated by the model itself during training. The core insight behind GRPO is maximizing sample efficiency through clever design choices that eliminate the need for a separate value function.
The computational efficiency of GRPO is achieved through several key design principles:
- Value Function Elimination: Unlike PPO, GRPO doesn't require a separate value network, significantly reducing parameter count and memory requirements
- Single Policy Sampling: Only one policy is used for exploration, saving inference time
- Batch Amortization: Multiple optimization steps per batch of samples, reducing the overall sampling cost
- Group Normalization: Stabilizing learning signals across varied problem difficulties
To understand how GRPO works, let's break it down into four main components:
At each training step, GRPO samples a batch of prompts and generates a set of G completions for each prompt (denoted as o_i). These completions form a "group" that will be used for relative advantage estimation.
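To make the sampling step concrete, here is a minimal sketch of group generation using the Hugging Face `transformers` API. The model name, prompt, and generation hyperparameters are illustrative placeholders, not values from any of the papers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM works the same way.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

G = 8  # completions per prompt (the "group" size)
prompt = "Solve: 12 * 7 = ?"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample G completions o_1, ..., o_G from the current policy.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    max_new_tokens=64,
    num_return_sequences=G,
)
completions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```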
For each of the G sequences, rewards are computed using a reward model. What distinguishes GRPO is how it normalizes these rewards within each group:
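$$
\hat{A}_{i} = \frac{r_i - \operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}, \qquad \mathbf{r} = (r_1, \ldots, r_G)
$$

where $r_i$ is the reward of completion $o_i$; the mean and standard deviation are taken over the $G$ rewards in the group, and the resulting $\hat{A}_i$ serves as the advantage for every token of $o_i$.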
This group-relative normalization gives the method its name and helps stabilize training by automatically adapting to the difficulty of each prompt. Easier prompts (where most outputs earn similar rewards) won't dominate the learning signal over harder prompts.
To prevent the policy from diverging too far from its previous state, GRPO incorporates KL divergence. This is estimated using the approximator introduced by Schulman et al. (2020):
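$$
D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \approx \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1
$$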
This formulation provides an unbiased estimate of the KL divergence using only individual token probabilities.
The full GRPO objective function balances advantage maximization with KL divergence constraints:
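$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \big) - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \right]
$$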
where $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the per-token probability ratio, $\hat{A}_{i,t}$ is the group-relative advantage, $\epsilon$ is the clipping range, and $\beta$ weights the KL penalty against the reference policy $\pi_{\mathrm{ref}}$.
This loss function incorporates two critical elements:
- Clipped Surrogate Objective: The `min` and `clip` operations ensure that policy updates don't deviate excessively from the previous policy, helping to stabilize training
- KL Regularization: The β coefficient controls how strongly the model is anchored to the reference policy
One of GRPO's most significant innovations is its approach to amortizing the high cost of LLM generation. Rather than the traditional RL cycle of "sample, update, repeat," GRPO introduces a multi-iteration update strategy:
- Generate G completions for each prompt using the current policy (the expensive step)
- Update the policy using these completions
- Without generating new completions, perform additional policy updates using the same data
- Repeat for μ iterations before generating new samples
This approach is remarkably efficient because in LLM reinforcement learning, generating samples through model inference is typically the most computationally expensive operation. By extracting multiple policy updates from a single batch of generated completions, GRPO significantly improves training efficiency.
During these iterations, the algorithm carefully manages the reference policy to prevent instability. Each update uses the policy from the previous iteration as its reference point, creating a chain of gradually evolving policies while working with the same fixed set of samples.
When μ = 1 (the default in many implementations), the algorithm reduces to the standard single-update version. However, production deployments often use higher values (μ = 3-5) to maximize computational efficiency. This multi-iteration capability represents a practical but powerful engineering insight that has been adopted by subsequent algorithms in the GRPO family.
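As a minimal, self-contained sketch of this sample-then-reuse pattern (toy random tensors stand in for real per-token log-probabilities, and the KL term is omitted; this is not TRL's internal code), the loop below reuses one batch of cached sampling-time log-probs for μ clipped-surrogate updates:

```python
import torch

torch.manual_seed(0)

G, T = 4, 6                                       # group size, tokens per completion (toy sizes)
token_logps = torch.randn(G, T, requires_grad=True)  # stand-in for current-policy log-probs
old_logps = token_logps.detach().clone()          # frozen log-probs from the sampling policy
advantages = torch.randn(G, 1)                    # group-relative advantages (precomputed)

optimizer = torch.optim.SGD([token_logps], lr=0.1)
mu, eps = 3, 0.2                                  # mu updates per batch of generations

for _ in range(mu):                               # several updates, no new generation
    ratio = torch.exp(token_logps - old_logps)    # pi_theta / pi_theta_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()  # clipped surrogate objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```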
The DAPO paper highlights the limitations of the GRPO algorithm’s sample-level loss in long-CoT scenarios, where longer responses are under-penalized, leading to poorer quality outputs. The proposed solution is a token-level normalization, which better handles longer sequences by assigning more balanced rewards to individual tokens, regardless of response length:
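$$
\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \ell_{i,t} \;\;\longrightarrow\;\; \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \ell_{i,t}
$$

where $\ell_{i,t}$ denotes the per-token clipped-surrogate term from the objective above. Instead of first averaging within each sequence and then across the group, DAPO averages over all tokens in the group, so every token contributes equally regardless of which completion it belongs to.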
Additionally, the DAPO paper proposed the following innovations:
- Raise the Ceiling: Clip-Higher

  DAPO introduces asymmetric clipping ranges ($\epsilon_\text{low}$ and $\epsilon_\text{high}$) to address entropy collapse. With traditional symmetric clipping ($\epsilon = 0.2$), high-probability tokens can easily be reinforced, but low-probability "exploration tokens" struggle to increase significantly. By using a higher upper bound, DAPO enables better exploration while maintaining training stability (see the sketch after this list).

- The More the Merrier: Dynamic Sampling

  As training progresses, more prompts achieve perfect accuracy, leading to zero advantage and thus no gradient signal. DAPO addresses this by intelligently filtering the training batch, over-sampling to ensure all prompts have accuracies between 0 and 1. This maintains consistent learning signals throughout training, improving sample efficiency without sacrificing performance.
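A minimal sketch of both ideas using plain NumPy arrays of token-level ratios, advantages, and per-prompt accuracies; the function names and the $\epsilon$ values are illustrative placeholders, not code from the DAPO system.

```python
import numpy as np

def clip_higher_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with decoupled lower/upper bounds (Clip-Higher).

    A larger upper bound lets low-probability "exploration tokens" grow
    more before the update is clipped; the values here are illustrative.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

def dynamic_sampling_filter(prompt_accuracies):
    """Keep only prompts whose group accuracy is strictly between 0 and 1,
    i.e. prompts that still yield a non-zero advantage signal."""
    acc = np.asarray(prompt_accuracies)
    return np.where((acc > 0.0) & (acc < 1.0))[0]

# Example: prompts with accuracy 0.0 or 1.0 are dropped and would be re-sampled.
keep = dynamic_sampling_filter([0.0, 0.25, 1.0, 0.75])
print(keep)  # [1 3]
```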
The paper Understanding R1-Zero-Like Training: A Critical Perspective identified two key optimization biases in the original GRPO algorithm:
- Response Length Bias: GRPO's normalization by sequence length $|o_i|$ causes longer incorrect responses to be under-penalized. While DAPO's token-level approach reduces this bias, it doesn't eliminate it completely. Dr. GRPO solves this by replacing the per-sequence $\frac{1}{|o_i|}$ factor with a constant $\frac{1}{L}$, where $L$ is typically set to the maximum completion length, ensuring consistent normalization regardless of response length.
- Question-Level Difficulty Bias: Normalizing advantages by \( \text{std}(\mathbf{r}) \) gives disproportionate weight to questions with low standard deviations (typically very easy or very hard questions). Dr. GRPO eliminates this scaling, computing \( \hat{A}_i = r_i - \text{mean}(\mathbf{r}) \) and treating all questions equally during optimization.
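A small NumPy sketch of the two advantage conventions (function names and the epsilon guard are illustrative, not library code):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in the original GRPO: mean-centered and std-scaled."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-4)  # small epsilon guards against zero std

def dr_grpo_advantages(rewards):
    """Dr. GRPO drops the std scaling: advantages are only mean-centered."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. rule-based correctness rewards for a group of G=4
print(grpo_advantages(rewards))     # scaled by the group's reward spread
print(dr_grpo_advantages(rewards))  # [ 0.5 -0.5 -0.5  0.5 ]
```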
Both DAPO and Dr. GRPO exclude the KL divergence term (the β-weighted penalty in the GRPO objective), for two main reasons:
- During reasoning training, model distribution can and should diverge significantly from the initial model
- Rule-based verifiers provide accurate rewards regardless of distribution shift, unlike learned reward models
This design choice not only reduces computational requirements but also allows for more effective exploration of solution strategies.
The TRL library implements all three algorithm variants:
```python
# How TRL aggregates the masked per-token loss under each loss_type.
if self.loss_type == "grpo":
    # Sequence-level: average tokens within each completion, then average over the batch.
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    # Token-level: average over all valid tokens in the batch (DAPO-style normalization).
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    # Constant normalization: divide by batch size times the maximum completion length.
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
else:
    raise ValueError(f"Unknown loss type: {self.loss_type}")
```

By default, TRL scales relative rewards by \( \text{std}(\mathbf{r}) \). To implement the Dr. GRPO approach, set `scale_rewards=False` in `GRPOConfig`.
Select the appropriate loss type by setting `loss_type` in `GRPOConfig`:

- `"grpo"`: Original sequence-level normalization (DeepSeekMath paper)
- `"bnpo"`: Token-level normalization similar to DAPO
- `"dr_grpo"`: Constant normalization (Dr. GRPO paper)
To remove the KL term as recommended for rule-based rewards, set `beta=0.0` in `GRPOConfig`.
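Putting these options together, a configuration sketch for a Dr. GRPO-style run might look as follows. The model, dataset, and reward function are placeholders; only `loss_type`, `scale_rewards`, and `beta` are the settings discussed above.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Dr. GRPO-style settings: constant-length normalization, no std scaling, no KL penalty.
config = GRPOConfig(
    output_dir="grpo-demo",        # placeholder output path
    loss_type="dr_grpo",           # or "grpo" / "bnpo" for the other variants
    scale_rewards=False,           # skip division by std(r), as in Dr. GRPO
    beta=0.0,                      # drop the KL term, as DAPO and Dr. GRPO recommend
    num_generations=8,             # G: completions sampled per prompt
    max_completion_length=256,     # the constant L used by the dr_grpo normalization
)

# Placeholder rule-based reward: 1.0 if the completion contains "42", else 0.0.
def reward_exact_match(completions, **kwargs):
    return [1.0 if "42" in c else 0.0 for c in completions]

# Tiny placeholder dataset with the "prompt" column GRPOTrainer expects.
train_dataset = Dataset.from_dict({"prompt": ["What is 6 * 7?", "Name a prime greater than 40."]})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    reward_funcs=reward_exact_match,
    args=config,
    train_dataset=train_dataset,
)
# trainer.train()
```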
- DeepSeek AI. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. [https://arxiv.org/abs/2402.03300]
- ByteDance Seed. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. [https://arxiv.org/abs/2503.14476]
- Zichen Liu et al. (2025). Understanding R1-Zero-Like Training: A Critical Perspective. [https://arxiv.org/abs/2503.20783]
- Hugging Face. (2025). Transformer Reinforcement Learning (TRL) Library. [https://github.com/huggingface/trl/tree/main]
