Group Relative Policy Optimization (GRPO) is an innovative reinforcement learning algorithm aimed at enhancing large language models (LLMs) for reasoning tasks. This guide explains the GRPO process with detailed diagrams and step-by-step explanations.
The core GRPO process is depicted as a circular workflow with five key stages: