The Evolution of Policy Optimization: Understanding GRPO, DAPO, and Dr. GRPO's Theoretical Foundations

Introduction

This article serves as the theoretical companion to "Bridging Theory and Practice: Understanding GRPO Implementation Details in Hugging Face's TRL Library." While the companion piece focuses on implementation specifics, here we'll explore the mathematical foundations and conceptual evolution of these cutting-edge reinforcement learning algorithms for language models.

I'll examine three key algorithms that represent the rapid advancement in this field:

  • GRPO (Group Relative Policy Optimization): The pioneering approach from DeepSeek that established a new paradigm for training reasoning capabilities in LLMs

  • DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization): An open-source system that scaled reinforcement learning for LLMs while addressing key limitations in GRPO

  • Dr. GRPO (GRPO Done Right): A critical revisiting of GRPO that identified and corrected optimization biases in the original algorithm

Beyond mathematical formulations, I'll analyze the theoretical implications of design choices in each algorithm, particularly focusing on KL divergence terms, advantage estimation, and normalization strategies. As these developments continue at breakneck speed, I'll share insights on experimental design considerations and the importance of critical evaluation when implementing these techniques.

To ground this theoretical exploration, I'll conclude with key code snippets showcasing how the Hugging Face community has implemented these algorithms, highlighting where theory meets practice in production-ready systems.

The GRPO Family: Evolution of Policy Optimization for LLM Reasoning

GRPO: The Original Formulation

GRPO (Group Relative Policy Optimization) is an online learning algorithm introduced by DeepSeek that improves iteratively by using data generated by the model itself during training. The core insight behind GRPO is maximizing sample efficiency through clever design choices that eliminate the need for a separate value function.

The computational efficiency of GRPO is achieved through several key design principles:

  1. Value Function Elimination: Unlike PPO, GRPO doesn't require a separate value network, significantly reducing parameter count and memory requirements
  2. Single Policy Sampling: Only one policy is used for exploration, saving inference time
  3. Batch Amortization: Multiple optimization steps per batch of samples, reducing the overall sampling cost
  4. Group Normalization: Stabilizing learning signals across varied problem difficulties

To understand how GRPO works, let's break it down into four main components:

1. Generating Completions

At each training step, GRPO samples a batch of prompts and generates a set of G completions for each prompt (denoted as o_i). These completions form a "group" that will be used for relative advantage estimation.
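
As a concrete illustration, this sampling step can be reproduced with a plain transformers generate call. This is only a sketch: the model name is a placeholder, and inside TRL the generation is handled by the trainer itself.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM works for illustration.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is 12 * 7? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a group of G completions o_1, ..., o_G for the same prompt
G = 8
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=G,
    max_new_tokens=128,
    temperature=1.0,
)
completions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```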

2. Computing the Advantage

For each of the G sequences, rewards are computed using a reward model. What distinguishes GRPO is how it normalizes these rewards within each group:

$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

This group-relative normalization gives the method its name and helps stabilize training by automatically adapting to the difficulty of each prompt. Easier prompts (where most outputs earn similar rewards) won't dominate the learning signal over harder prompts.
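
A minimal sketch of this normalization, assuming one scalar reward per completion arranged as a (num_prompts, G) tensor; the small epsilon guarding against zero-variance groups is an implementation detail of the sketch, not part of the formula:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within each group of G completions.

    rewards: (num_prompts, G), one scalar reward per completion.
    Returns advantages of the same shape, centered and scaled per group.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # eps guards against zero-variance groups

# Example: 2 prompts, G = 4 completions each
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.2, 0.9, 0.4, 0.7]])
print(group_relative_advantages(rewards))
```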

3. Estimating the KL Divergence

To prevent the policy from diverging too far from its previous state, GRPO incorporates KL divergence. This is estimated using the approximator introduced by Schulman et al. (2020):

$$\mathbb{D}_{\text{KL}}\left[\pi_\theta |\pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$$

This formulation provides an unbiased estimate of the KL divergence using only individual token probabilities.
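
A minimal sketch of this estimator, assuming per-token log-probabilities of the sampled tokens under the current and reference policies:

```python
import torch

def per_token_kl(ref_logprobs: torch.Tensor, policy_logprobs: torch.Tensor) -> torch.Tensor:
    """k3 estimator: exp(log pi_ref - log pi_theta) - (log pi_ref - log pi_theta) - 1.

    Both inputs hold log-probabilities of the sampled tokens, shape (batch, seq_len).
    The result is non-negative and, in expectation over samples from pi_theta,
    equals KL(pi_theta || pi_ref).
    """
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```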

4. Computing the Loss

The full GRPO objective function balances advantage maximization with KL divergence constraints:

$$ \mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ l_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta | \pi_{\text{ref}}\right] \right] $$

where:

$$ l_{i,t} = \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) $$

This loss function incorporates two critical elements:

  • Clipped Surrogate Objective: The min and clip operations ensure that policy updates don't deviate excessively from the previous policy, helping to stabilize training
  • KL Regularization: The β coefficient controls how strongly the model is anchored to the reference policy
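
Putting these pieces together, here is a minimal sketch of the per-token term (clipped surrogate minus the KL penalty) before masking and per-sequence averaging; the tensor shapes and the broadcasting of one advantage per sequence are assumptions of the sketch:

```python
import torch

def grpo_per_token_loss(policy_logprobs, old_logprobs, ref_logprobs,
                        advantages, eps=0.2, beta=0.04):
    """Per-token GRPO objective, to be masked and averaged per sequence.

    policy_logprobs / old_logprobs / ref_logprobs: (batch, seq_len) log-probs of sampled tokens.
    advantages: (batch,) group-relative advantages, one per sequence.
    """
    ratio = torch.exp(policy_logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)                     # same advantage for every token of a sequence
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    surrogate = torch.min(unclipped, clipped)          # clipped surrogate objective

    log_ratio_ref = ref_logprobs - policy_logprobs     # k3 KL estimate per token
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return -(surrogate - beta * kl)                    # negate because we minimize the loss
```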

Multi-iteration Updates: A Key Efficiency Innovation

One of GRPO's most significant innovations is its approach to amortizing the high cost of LLM generation. Rather than the traditional RL cycle of "sample, update, repeat," GRPO introduces a multi-iteration update strategy:

  1. Generate G completions for each prompt using the current policy (the expensive step)
  2. Update the policy using these completions
  3. Without generating new completions, perform additional policy updates using the same data
  4. Repeat for μ iterations before generating new samples

This approach is remarkably efficient because in LLM reinforcement learning, generating samples through model inference is typically the most computationally expensive operation. By extracting multiple policy updates from a single batch of generated completions, GRPO significantly improves training efficiency.

During these iterations, the clipped objective is what keeps the updates stable: the importance ratio is always measured against $\pi_{\theta_{\text{old}}}$, the policy that generated the samples, which stays fixed across the μ updates. Clipping keeps each successive policy close to that sampling policy even though the same fixed set of samples is reused.

When μ = 1 (the default in many implementations), the algorithm reduces to the standard single-update version. However, production deployments often use higher values (μ = 3-5) to maximize computational efficiency. This multi-iteration capability represents a practical but powerful engineering insight that has been adopted by subsequent algorithms in the GRPO family.
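
To make the pattern concrete, here is a deliberately tiny, runnable toy that mirrors only the loop structure (generate once, update μ times); the random batch and MSE loss are stand-ins, not a GRPO objective:

```python
import torch

# Toy illustration of the mu-iteration pattern: one expensive "generation" per step,
# followed by mu cheap gradient updates that reuse the same batch.
policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)
mu = 4

for step in range(3):
    # Expensive step (stand-in): produce a fixed batch of "samples"
    batch = torch.randn(16, 4)
    targets = torch.randn(16, 2)

    # Cheap steps: reuse the same batch for mu gradient updates
    for _ in range(mu):
        loss = torch.nn.functional.mse_loss(policy(batch), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```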




DAPO: Addressing Length Bias and KL Constraints

The DAPO paper highlights a limitation of GRPO's sample-level loss in long chain-of-thought (CoT) scenarios: because each sequence's token losses are averaged before being combined across the group, tokens in longer responses carry less individual weight, so low-quality long responses end up under-penalized. The proposed solution is token-level normalization, which gives every completion token equal weight in the batch loss, regardless of response length:

$$ \mathcal{L}_{\text{DAPO}}(\theta) = - \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} l_{i,t}, $$
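
The difference between the two aggregations is easiest to see side by side. A minimal sketch with a placeholder per-token loss and completion mask (shapes and values are illustrative):

```python
import torch

per_token_loss = torch.randn(4, 10)                   # (batch, seq_len) placeholder losses
completion_mask = (torch.rand(4, 10) > 0.3).float()   # 1 where a token belongs to the completion

# GRPO: average tokens within each sequence, then average across sequences.
# Every sequence contributes equally, so tokens in long responses are down-weighted.
grpo_loss = ((per_token_loss * completion_mask).sum(-1)
             / completion_mask.sum(-1).clamp(min=1.0)).mean()

# DAPO-style token-level normalization: every completion token carries equal weight,
# regardless of which response it belongs to.
dapo_loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
```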

Additionally, the DAPO paper proposed the following innovations:

  1. Raise the Ceiling: Clip-Higher

    DAPO introduces asymmetric clipping ranges ($\epsilon_\text{low}$ and $\epsilon_\text{high}$) to address entropy collapse. With traditional symmetric clipping ($\epsilon$ = 0.2), high-probability tokens can easily be reinforced, but low-probability "exploration tokens" struggle to increase significantly. By using a higher upper bound, DAPO enables better exploration while maintaining training stability.

  2. The More the Merrier: Dynamic Sampling

    As training progresses, more prompts achieve perfect accuracy, leading to zero advantage and thus no gradient signal. DAPO addresses this by intelligently filtering the training batch, over-sampling to ensure all prompts have accuracies between 0 and 1. This maintains consistent learning signals throughout training, improving sample efficiency without sacrificing performance.
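
A minimal sketch of this dynamic-sampling filter, assuming binary correctness rewards of shape (num_prompts, G) from a rule-based verifier:

```python
import torch

def keep_informative_prompts(rewards: torch.Tensor) -> torch.Tensor:
    """Boolean mask over prompts whose group accuracy lies strictly between 0 and 1.

    rewards: (num_prompts, G) binary correctness rewards.
    Prompts where every completion is right (or every one is wrong) yield zero
    advantage and no gradient signal, so they are filtered out and replaced
    by over-sampling new prompts.
    """
    accuracy = rewards.mean(dim=1)
    return (accuracy > 0.0) & (accuracy < 1.0)

# Example: the second prompt is already solved by all G completions and gets dropped.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [1.0, 1.0, 1.0, 1.0]])
print(keep_informative_prompts(rewards))  # tensor([ True, False])
```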




Dr. GRPO: Optimization Done Right

The paper "Understanding R1-Zero-Like Training: A Critical Perspective" identified two key optimization biases in the original GRPO algorithm:

  1. Response Length Bias: GRPO's normalization by sequence length causes longer incorrect responses to be under-penalized. While DAPO's token-level approach reduces this bias, it doesn't eliminate it completely. Dr. GRPO solves this by normalizing with a constant instead:

$$ \mathcal{L}_{\text{Dr. GRPO}}(\theta) = - \frac{1}{LG} \sum_{i=1}^G \sum_{t=1}^{|o_i|} l_{i,t}, $$

Where L is typically set to the maximum completion length, ensuring consistent normalization regardless of response length.

  2. Question-Level Difficulty Bias: Normalizing advantages by \( \text{std}(\mathbf{r}) \) gives disproportionate weight to questions with low standard deviations (typically very easy or very hard questions). Dr. GRPO eliminates this scaling, treating all questions equally during optimization.
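
A minimal sketch of the resulting bias-free advantage: the group mean is still subtracted, but the division by std(r) is dropped (the function name is illustrative):

```python
import torch

def dr_grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Mean-center rewards within each group without dividing by the group std.

    rewards: (num_prompts, G). Dropping the std division removes the question-level
    difficulty bias: near-zero-variance groups (very easy or very hard prompts) no
    longer have their advantages inflated by a tiny denominator.
    """
    return rewards - rewards.mean(dim=1, keepdim=True)
```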

To KL Or Not To KL

Both DAPO and Dr. GRPO exclude the KL divergence term ($\beta = 0$) based on two key insights:

  1. During reasoning training, model distribution can and should diverge significantly from the initial model
  2. Rule-based verifiers provide accurate rewards regardless of distribution shift, unlike learned reward models

This design choice not only reduces computational requirements but also allows for more effective exploration of solution strategies.




Implementation Details

Loss Types in Hugging Face TRL

The TRL library implements all three algorithm variants:

if self.loss_type == "grpo":
    # Original GRPO: average tokens within each sequence, then average across the group
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    # Token-level normalization over all completion tokens in the batch (DAPO-style)
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    # Dr. GRPO: constant normalization by batch size times the maximum completion length
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
else:
    raise ValueError(f"Unknown loss type: {self.loss_type}")

Configuration Tips

By default, TRL scales relative rewards by \( \text{std}(\mathbf{r}) \). To implement the Dr. GRPO approach, set scale_rewards=False in GRPOConfig.

Select the appropriate loss type by setting loss_type in GRPOConfig:

  • "grpo": Original sequence-level normalization (DeepSeekMath paper)
  • "bnpo": Token-level normalization similar to DAPO
  • "dr_grpo": Constant normalization (Dr. GRPO paper)

To remove the KL term as recommended for rule-based rewards, set beta = 0.0 in GRPOConfig.
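
Putting these tips together, a minimal configuration sketch for a Dr. GRPO-style setup; parameter names follow the snippets above, and exact defaults may vary across TRL versions:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-output",    # standard TrainingArguments field
    loss_type="dr_grpo",         # constant normalization by max_completion_length
    scale_rewards=False,         # mean-center rewards without dividing by std(r)
    beta=0.0,                    # drop the KL penalty, as DAPO and Dr. GRPO recommend
    max_completion_length=1024,  # the constant L used by the dr_grpo loss
)
```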



