This lecture transitions from imitation learning to the foundations of Reinforcement Learning from Human Feedback (RLHF) and its modern extensions like Direct Preference Optimization (DPO). The instructor begins with a recap of behavior cloning and DAGGER, connecting them to recent advancements where reinforcement learning (RL) methods enable systems such as ChatGPT to learn from human preferences.
Key points:
- Behavior Cloning (BC): Reduces RL to supervised learning by learning direct mappings from states to actions using expert demonstrations.
- DAGGER (Dataset Aggregation): Improves upon BC by incorporating expert feedback iteratively to correct policy drift due to distribution mismatch.
- RLHF: Combines supervised fine-tuning with human feedback to align large models like ChatGPT.
RLHF enables models to learn complex tasks by leveraging human evaluations of policy outputs instead of explicit reward functions. The field is rapidly evolving:
- DPO (Direct Preference Optimization) has begun to outperform RLHF on several benchmarks.
- These approaches extend imitation learning principles by directly incorporating human preferences or demonstration data into the learning process.
Imitation learning uses demonstrations—sequences of state-action pairs—without explicit rewards. Two primary data sources:
- Explicit demonstrations: Human-guided examples (e.g., a robot shown how to pick up a cup).
- Natural trajectories: Observational data from expert behavior (e.g., doctor decisions in medical records).
Motivations:
- Rewards are often difficult to define explicitly.
- Expert demonstrations may already reflect the desired objective.
Main techniques:
- Behavior Cloning (BC):
- Treats imitation as a supervised learning problem.
- Learns a policy that minimizes prediction error against expert actions.
- DAGGER:
- Addresses distribution mismatch when learned policies deviate from training data.
- Iteratively queries the expert to label states encountered by the current policy.
Illustration Example:
A race car trained through BC may drive off-track due to unseen states; DAGGER mitigates this by asking an expert what to do when the policy fails.
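To make the supervised-learning view concrete, here is a minimal behavior cloning sketch (an illustration only; the synthetic `expert_states`/`expert_actions` arrays stand in for real demonstration data). It fits a linear softmax policy to expert state–action pairs by minimizing cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder expert demonstrations: 500 states (4-dim features) and
# discrete expert actions in {0, 1, 2}. Real data would come from a human.
expert_states = rng.normal(size=(500, 4))
expert_actions = (expert_states[:, 0] > 0).astype(int) + (expert_states[:, 1] > 0).astype(int)

n_actions = 3
W = np.zeros((4, n_actions))  # linear softmax policy parameters

def policy_probs(states, W):
    logits = states @ W
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Behavior cloning = supervised learning: minimize cross-entropy between
# the policy's action distribution and the expert's chosen actions.
lr = 0.5
for _ in range(200):
    probs = policy_probs(expert_states, W)
    onehot = np.eye(n_actions)[expert_actions]
    grad = expert_states.T @ (probs - onehot) / len(expert_states)
    W -= lr * grad

accuracy = (policy_probs(expert_states, W).argmax(axis=1) == expert_actions).mean()
print(f"imitation accuracy on demonstrations: {accuracy:.2f}")
```

DAGGER would wrap a loop around this fit: roll out the learned policy, ask the expert to label the visited states, aggregate them into the dataset, and refit.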
The central question:
Can we recover the underlying reward function from expert demonstrations?
- Two policies are equivalent if they induce the same distribution over states and actions.
- If rewards depend only on states and actions, such policies will yield identical rewards.
Formally: $$ p_\pi(s, a) = p_{\pi^*}(s, a) \implies \mathbb{E}_{p_\pi}[r(s,a)] = \mathbb{E}_{p_{\pi^*}}[r(s,a)] $$
Thus, learning to match the distribution of expert trajectories indirectly recovers expert-level performance.
Assume the reward function is a linear combination of features: $$ r(s,a) = w^\top \mu(s,a) $$
where:
- $\mu(s,a)$: feature vector representing state-action attributes (e.g., speed, sentiment, collisions).
- $w$: weight vector encoding feature importance.
- If a policy induces similar expected features as the expert, then: $$ \| \mathbb{E}_{\pi}[\mu] - \mathbb{E}_{\pi^*}[\mu] \| \to 0 \implies r_{\pi} \approx r_{\pi^*} $$
This formulation allows imitation learning to become a reward matching problem via feature distributions.
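As a brief worked check of this claim (an illustration, with the feature expectations chosen arbitrarily): because the return of a linear reward depends on the policy only through its expected features,
$$ J(\pi; w) = \mathbb{E}_\pi\big[w^\top \mu(s,a)\big] = w^\top \mathbb{E}_\pi[\mu(s,a)], \qquad J(\pi; w) - J(\pi^*; w) = w^\top\big(\mathbb{E}_\pi[\mu] - \mathbb{E}_{\pi^*}[\mu]\big), $$
so if, say, both expectations equal $(0.3, 1.2)^\top$, the difference is $w^\top \mathbf{0} = 0$ for every weight vector $w$, and the imitator matches the expert's return without ever knowing $w$.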
There is no unique reward function consistent with observed optimal behavior:
- Even a zero reward could be compatible with any trajectory.
- Therefore, reward inference is ill-posed without additional constraints.
To resolve this, we introduce a disambiguation principle: the Maximum Entropy principle.
Definition:
Given known constraints, choose the probability distribution with the highest entropy among all distributions that satisfy those constraints.
Principle:
The optimal distribution is the least-committal one: it encodes exactly the known constraints and nothing more.
Interpretation in imitation learning:
- We aim to find the most uncertain (highest entropy) distribution over trajectories that still matches the expert’s observed statistics.
- This ensures no extra assumptions are imposed beyond what demonstrations reveal.
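As a quick illustration of the principle (a standard textbook example rather than one from the lecture): with only a normalization constraint, the maximum entropy distribution over a finite set is uniform; adding a feature-expectation constraint $\mathbb{E}_p[f(x)] = c$ yields an exponential-family form,
$$ p(x) \propto \exp\big(\lambda^\top f(x)\big), $$
which is exactly the structure that reappears below for distributions over trajectories.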
The method originated from research on modeling taxi driver behavior in Pittsburgh (Ziebart et al., 2008).
Goal:
- Infer the reward structure motivating expert trajectories (e.g., minimizing time, tolls, or distance).
- Learn a policy that matches expert performance under maximum uncertainty.
We define a distribution over trajectories $p(\tau)$, where a trajectory $\tau$ is a sequence of states and actions.
We seek:
- A valid probability distribution: $$ \sum_\tau p(\tau) = 1 $$
- That matches the expected feature statistics from expert demonstrations $D$: $$ \mathbb{E}_{p(\tau)}[\mu(\tau)] = \mathbb{E}_{\tau \sim D}[\mu(\tau)] $$
Objective:
Maximize the entropy $H(p) = -\sum_\tau p(\tau)\log p(\tau)$ subject to the two constraints above.
- Trajectories can be generated by an implicit policy $\pi$.
- Thus, learning $p(\tau)$ is equivalent to learning a policy that produces those trajectories.
If we had access to the rewards: $$ \mathbb{E}_{p(\tau)}[r(\tau)] = \mathbb{E}_{p_{\text{expert}}(\tau)}[r(\tau)] $$
Since the expert’s policy is optimal, matching its trajectory distribution achieves expert-level rewards.
The iterative approach:
- Initialize a candidate reward function $r_\phi(s,a)$.
- Learn an optimal policy $\pi^*$ under this reward.
- Generate trajectories using $\pi^*$ and compute feature expectations.
- Update the reward parameters $\phi$ to better align feature expectations with expert data.
- Repeat until convergence.
This cyclical optimization captures both reward inference and policy learning, forming the foundation for Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL).
Summary of Part 1:
- Reviewed behavior cloning and DAGGER as forms of imitation learning.
- Introduced reward ambiguity and the need for maximum entropy regularization.
- Formulated the MaxEnt IRL objective: maximizing trajectory entropy subject to feature matching constraints.
- Set up the foundation for subsequent exploration into algorithmic solutions and applications in modern RLHF frameworks.
Maximum Entropy Inverse Reinforcement Learning — Derivation and Algorithm (part 2/4, REF 21:16–48:19)
This section formalizes the relationship between reward functions, optimal policies, and distributions over trajectories.
The key goal is to derive how to update the reward function parameters $\phi$ given expert demonstrations.
We study:
- The relation between reward functions and optimal policies.
- The mapping from policies to trajectory distributions.
- The update of reward parameters based on trajectory likelihood.
In the original MaxEnt IRL formulation (Ziebart et al., 2008), the dynamics model is assumed to be known.
We start from the entropy maximization problem: $$ \max_{p} \; -\sum_{\tau} p(\tau)\log p(\tau) \quad \text{s.t.} \quad \sum_{\tau} p(\tau)\mu(\tau) = \hat{\mu}_D, \qquad \sum_{\tau} p(\tau) = 1. $$
To analyze the structural form of the solution, we introduce Lagrange multipliers.
Define the Lagrangian:
$$ \mathcal{L}(p, \lambda, \eta) = -\sum_{\tau} p(\tau)\log p(\tau)
- \lambda^\top \left( \sum_{\tau} p(\tau)\mu(\tau) - \hat{\mu}_D \right)
- \eta \left( \sum_{\tau} p(\tau) - 1 \right) $$
We take the derivative of $\mathcal{L}$ with respect to $p(\tau)$ and set it to zero: $$ \frac{\partial \mathcal{L}}{\partial p(\tau)} = -\log p(\tau) - 1 - \lambda^\top \mu(\tau) - \eta = 0. $$
Solving for $p(\tau)$ (the sign of $\lambda$ can be absorbed into its definition) yields an exponential-family form: $$ p(\tau) = \frac{1}{Z}\exp\big(\lambda^\top \mu(\tau)\big), $$
where the partition function (normalizing constant) is: $$ Z = \sum_{\tau} \exp\big(\lambda^\top \mu(\tau)\big). $$
This shows that the distribution over trajectories maximizing entropy under feature constraints is exponential in the reward (or feature) function.
Interpretation:
- Trajectories with higher reward have exponentially higher probability.
- This defines an exponential family distribution over trajectories.
Since we do not know the reward function, we parameterize it as $r_\phi(\tau)$ and learn $\phi$ from data.
The probability of observing a trajectory $\tau$ is then $$ p(\tau \mid \phi) = \frac{1}{Z(\phi)}\exp\big(r_\phi(\tau)\big), $$
where $Z(\phi) = \sum_\tau \exp\big(r_\phi(\tau)\big)$ is the partition function.
Thus, learning $\phi$ reduces to maximum likelihood estimation over the expert trajectories.
Given dataset $D = \{\tau_i\}$, the log-likelihood objective is $$ \mathcal{J}(\phi) = \sum_{\tau_i \in D} \log p(\tau_i \mid \phi). $$
Substituting the exponential form: $$ \mathcal{J}(\phi) = \sum_{\tau_i \in D} r_\phi(\tau_i) - |D|\log Z(\phi). $$
Since $Z(\phi)$ depends on $\phi$, its gradient contributes a second term.
Taking the derivative:
$$ \frac{\partial \mathcal{J}(\phi)}{\partial \phi} = \sum_{\tau_i \in D} \frac{\partial r_\phi(\tau_i)}{\partial \phi}
- |D| \sum_{\tau} p(\tau \mid \phi) \frac{\partial r_\phi(\tau)}{\partial \phi} $$
where the second term comes from differentiating the log partition function. The gradient has two competing terms:
- Empirical expectation under expert trajectories.
- Expected value under the model distribution.
Thus, the update rule aligns the model’s expected reward features with those of the expert.
The trajectory probability factorizes over the initial state, policy, and dynamics, and the trajectory reward decomposes as $r_\phi(\tau) = \sum_t r_\phi(s_t)$.
Using this decomposition, the gradient can be rewritten in terms of state visitation expectations:
$$ \frac{\partial \mathcal{J}(\phi)}{\partial \phi} = \sum_{s \in D} \frac{\partial r_\phi(s)}{\partial \phi}
- |D| \sum_{s} p(s \mid \phi) \frac{\partial r_\phi(s)}{\partial \phi} $$
If the reward is linear in features ($r_\phi(s) = \phi^\top \mu(s)$), then $\partial r_\phi(s)/\partial \phi = \mu(s)$.
Hence, the gradient simplifies to:
$$ \frac{\partial \mathcal{J}(\phi)}{\partial \phi} = \sum_{s \in D} \mu(s)
- |D| \sum_{s} p(s \mid \phi)\mu(s) $$
This aligns feature expectations from the expert and the model.
To evaluate the model term $\sum_s p(s \mid \phi)\,\mu(s)$, we compute state visitation frequencies by dynamic programming:
- Initialization: $$ \rho_1(s) = P(s_1 = s) $$
- Recurrence: for each time step $t$: $$ \rho_{t+1}(s') = \sum_{s, a} \rho_t(s)\,\pi(a \mid s)\, P(s' \mid s, a) $$
- Expected state distribution: $$ \bar{\rho}(s) = \sum_{t=1}^{T} \rho_t(s) $$
These visitation frequencies $\bar{\rho}(s)$ supply the model-side feature expectations needed in the gradient; a minimal implementation sketch follows.
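Below is a minimal sketch of this recursion under stated assumptions: a small tabular MDP with known dynamics `P` (shape `[S, A, S]`), a stochastic policy `pi` (shape `[S, A]`), and an initial state distribution `d0`; all numeric values are illustrative placeholders.

```python
import numpy as np

def state_visitation(P, pi, d0, T):
    """Expected state visitation counts over a horizon of T steps.

    P  : [S, A, S] transition probabilities P[s, a, s']
    pi : [S, A]    policy probabilities pi[s, a]
    d0 : [S]       initial state distribution
    """
    rho_t = d0.copy()          # rho_1(s) = P(s_1 = s)
    rho_bar = rho_t.copy()     # running sum over time steps
    for _ in range(T - 1):
        # rho_{t+1}(s') = sum_{s,a} rho_t(s) pi(a|s) P(s'|s,a)
        rho_t = np.einsum("s,sa,sap->p", rho_t, pi, P)
        rho_bar += rho_t
    return rho_bar

# Tiny illustrative MDP: 3 states, 2 actions (placeholder numbers).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))      # P[s, a, :] sums to 1
pi = np.full((3, 2), 0.5)                       # uniform policy
d0 = np.array([1.0, 0.0, 0.0])
print(state_visitation(P, pi, d0, T=5))         # entries sum to T = 5
```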
Algorithm: Maximum Entropy Inverse Reinforcement Learning
Inputs:
- Expert demonstrations $D = \{\tau_i\}$
- Feature function $\mu(s)$
- Known dynamics model
- Initial reward parameters $\phi_0$
Procedure:
- Initialize $\phi = \phi_0$.
- Repeat until convergence:
  - Compute the optimal policy $\pi_\phi$ under the current reward using value iteration.
  - Compute state visitation frequencies $\bar{\rho}_\phi(s)$ via dynamic programming.
  - Compute the gradient: $$ \nabla_\phi \mathcal{J} = \sum_{s \in D} \mu(s) - \sum_{s} \bar{\rho}_\phi(s) \mu(s) $$
  - Update the reward parameters: $$ \phi \leftarrow \phi + \alpha \nabla_\phi \mathcal{J} $$
- Output: estimated reward function $r_\phi(s)$ and policy $\pi_\phi$ (a compact end-to-end sketch in code follows).
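The following is a self-contained sketch of this loop under strong simplifying assumptions: a tiny tabular MDP with known random dynamics, one-hot state features, a linear reward $r(s) = \phi^\top \mu(s)$, and a soft (softmax) value-iteration step standing in for the exact policy-optimization step. The `expert_mu` feature counts are synthetic placeholders, not statistics from real demonstrations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny synthetic MDP (all numbers are placeholders).
S, A, T = 4, 2, 10
P = rng.dirichlet(np.ones(S), size=(S, A))       # known dynamics P[s, a, s']
d0 = np.array([1.0, 0.0, 0.0, 0.0])              # start-state distribution
mu = np.eye(S)                                   # one-hot state features mu(s)

# Synthetic "expert" feature counts (stand-in for demonstration statistics):
# the expert spends most of its time in states 2 and 3.
expert_mu = np.array([1.0, 1.0, 4.0, 4.0])

def soft_policy(r, horizon=T):
    """Finite-horizon soft (log-sum-exp) backups; the resulting softmax policy
    stands in for the 'compute optimal policy via value iteration' step."""
    V = np.zeros(S)
    for _ in range(horizon):
        Q = r[:, None] + P @ V                   # Q[s, a] = r(s) + E_{s'}[V(s')]
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
    return np.exp(Q - V[:, None])                # pi(a|s) = exp(Q - V)

def visitation(pi, horizon=T):
    """Expected state visitation counts (the dynamic-programming step)."""
    rho_t, rho_bar = d0.copy(), d0.copy()
    for _ in range(horizon - 1):
        rho_t = np.einsum("s,sa,sap->p", rho_t, pi, P)
        rho_bar += rho_t
    return rho_bar

# MaxEnt IRL loop: gradient ascent on the log-likelihood of expert features.
phi, alpha = np.zeros(S), 0.1
for _ in range(150):
    pi = soft_policy(mu @ phi)                   # reward r(s) = phi . mu(s)
    grad = expert_mu - visitation(pi) @ mu       # expert counts minus model counts
    phi += alpha * grad

print("learned reward per state :", np.round(mu @ phi, 2))
print("model visitation counts  :", np.round(visitation(soft_policy(mu @ phi)), 2))
print("expert feature counts    :", expert_mu)
```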
- Computing the Optimal Policy:
  - Requires transition probabilities $P(s' \mid s, a)$.
- Computing State Visitation Frequencies:
  - Requires the same model for propagating state probabilities.
Gradient computation itself does not require the model directly but depends on the previously computed state distributions.
Figure: Flow of the MaxEnt IRL Process
- MaxEnt Principle → Distribution proportional to $\exp(r_\phi(\tau))$.
- Learning $\phi$ → Maximize likelihood of expert trajectories.
- Gradient → Difference between expert and model feature expectations.
- Dynamic Programming → Used for computing state visitation frequencies.
- Assumption → Dynamics model is known (essential for policy optimization and frequency estimation).
Final Equation Summary: $$ p(\tau \mid \phi) = \frac{\exp\big(r_\phi(\tau)\big)}{Z(\phi)}, \qquad \nabla_\phi \mathcal{J} = \sum_{s \in D} \mu(s) - |D| \sum_{s} p(s \mid \phi)\,\mu(s). $$
These form the mathematical backbone of Maximum Entropy Inverse Reinforcement Learning.
While computing optimal policies and state visitation frequencies in Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL), one requires access to the dynamics model, i.e., the transition probabilities $P(s' \mid s, a)$.
- The gradient update does not directly require the dynamics model after frequencies are computed.
- However, dynamic programming methods—used to compute these frequencies—depend heavily on it.
- This is a strong assumption, especially for human-centered systems (e.g., medical decision-making), where the true dynamics are rarely known.
- For physical simulations (like MuJoCo), the dynamics are known.
- For human or real-world environments, this assumption breaks down.
- Nonlinear Reward Models: Chelsea Finn and colleagues (2016) extended MaxEnt IRL to general reward and cost functions such as deep neural networks.
- Unknown Dynamics Models: the same work removed the requirement to know transition dynamics.
- Complex State Spaces: enabled applications in high-dimensional, continuous domains where dynamic programming is infeasible.
The MaxEnt IRL framework resolves reward ambiguity—many possible rewards can explain expert behavior—by selecting the maximum entropy distribution consistent with demonstrations.
Applications include:
- Modeling taxi driver routes.
- Learning control policies from demonstrations.
Imitation learning enables policy learning from demonstrations without needing explicit reward definitions.
- Reduced data requirements: Learns efficiently from limited demonstrations.
- Simplified objective: Converts RL to a supervised learning problem when using behavior cloning.
- Interpretability: Provides insight into human decision strategies.
- Behavior Cloning (BC): directly maps states to expert actions using supervised learning.
- Maximum Entropy Principle: ensures unbiased trajectory distributions by maximizing entropy under constraints.
- Reward Recovery: reformulates imitation as a maximum likelihood estimation of reward parameters.
The learned reward is not necessarily the human’s true reward—it is the reward most compatible with observed behavior under maximum entropy.
We now extend the concept of human input beyond demonstrations to interactive feedback for training RL agents.
Two main contexts:
- Task-specific guidance: humans provide corrective feedback to train agents for individual tasks (e.g., teaching a robot to clean a kitchen).
- Value alignment: feedback helps align large models (like LLMs) with human intentions and ethical norms.
- A system where humans teach an agent to follow recipes.
- Demonstrated interactive teaching via real-time feedback.
- Highlighted the advantage of human guidance over exploration-only methods (like $\epsilon$-greedy).
- Environmental rewards: task-defined, e.g., “+1” for completing an objective.
- Human rewards: direct feedback like “good” or “bad,” representing subjective approval.
Analogy: Similar to DAGGER, where the human acts as a constant coach guiding the policy’s corrections.
The TAMER framework, developed by Brad Knox & Peter Stone (UT Austin):
- Human provides explicit scalar feedback while observing the agent.
- The agent learns an explicit reward model from these human signals.
- Applied to Tetris, showing faster performance improvement compared to policy-only methods.
- Requires continuous human presence (like DAGGER).
- May not outperform fully autonomous RL over long training periods.
There exists a continuum of human involvement in RL training:
| Level of Human Input | Description | Example |
|---|---|---|
| Minimal | Passive demonstrations only | Behavior Cloning |
| Moderate | Occasional preference judgments | Pairwise Comparison |
| High | Continuous feedback and correction | DAGGER / TAMER |
Pairwise preference models fall into the middle ground, balancing feasibility and data richness.
- Asking humans for exact reward values is cognitively difficult.
- Easier to ask for comparisons:
“Which behavior do you prefer—A or B?”
- Search Ranking Systems: early work by Yisong Yue & Thorsten Joachims (Cornell), where users compare ranking results (A vs. B) instead of assigning numeric ratings.
- Robotics and Driving: Dorsa Sadigh's group used human comparisons to guide autonomous driving behavior (e.g., safe vs. unsafe lane changes).
Pairwise comparison is easier and faster for humans while providing enough signal to infer latent reward preferences.
A statistical model connecting latent rewards to pairwise preferences.
For items $B_i$ and $B_j$ with latent rewards $r(B_i)$ and $r(B_j)$, the probability that $B_i$ is preferred is: $$ P(B_i \succ B_j) = \frac{\exp\big(r(B_i)\big)}{\exp\big(r(B_i)\big) + \exp\big(r(B_j)\big)} $$
- If $r(B_i) = r(B_j)$ → $P = 0.5$ (equal preference).
- If $r(B_i) \gg r(B_j)$ → $P \approx 1$ (strong preference).
- Captures noisy decision-making: preferences are probabilistic rather than deterministic.
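A quick numeric check of these limiting cases (a minimal sketch; the reward values are arbitrary):

```python
import math

def bt_prob(r_i, r_j):
    """Bradley–Terry probability that item i is preferred over item j."""
    return math.exp(r_i) / (math.exp(r_i) + math.exp(r_j))

print(bt_prob(1.0, 1.0))   # 0.5    -> equal latent rewards, no preference
print(bt_prob(5.0, 0.0))   # ~0.993 -> much higher reward, near-certain preference
```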
- Assumes a latent internal reward function that is not directly observable.
- Humans provide noisy preference signals that approximate these internal rewards.
- Can be combined with RLHF (Reinforcement Learning from Human Feedback) to train models using pairwise comparisons instead of explicit reward values.
The Bradley–Terry model is transitive:
If $r(B_i) > r(B_j)$ and $r(B_j) > r(B_k)$, then $P(B_i \succ B_j) > 0.5$, $P(B_j \succ B_k) > 0.5$, and consequently $P(B_i \succ B_k) > 0.5$.
- There are K actions (arms), each with an unknown reward.
- Humans provide pairwise comparisons instead of scalar feedback.
Estimate which action has the highest expected latent reward based on observed comparisons.
An action can be identified as “best” under several criteria, for example:
- The action that wins the largest number of pairwise comparisons: $$ a^* = \arg\max_i \sum_{j \neq i} \mathbb{I}\big[P(a_i \succ a_j) > 0.5\big] $$
- The action with the highest average score: $$ \text{Score}(a_i) = \sum_{j \neq i} \begin{cases} 1 & \text{if } a_i \succ a_j \\ 0.5 & \text{if } a_i = a_j \\ 0 & \text{otherwise} \end{cases} $$
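A minimal sketch of these two scoring rules on a synthetic pairwise win-probability matrix (the matrix values are placeholders, not data from the lecture):

```python
import numpy as np

# p[i, j] = estimated probability that arm i beats arm j (synthetic example).
p = np.array([
    [0.5, 0.7, 0.6],
    [0.3, 0.5, 0.8],
    [0.4, 0.2, 0.5],
])

# Rule 1: count how many opponents each arm beats with probability > 0.5.
wins = (p > 0.5).sum(axis=1)
print("wins per arm:", wins, "-> best arm:", wins.argmax())

# Rule 2: average win probability against the other arms
# (a probabilistic version of the 1 / 0.5 / 0 score above).
scores = (p.sum(axis=1) - 0.5) / (len(p) - 1)   # drop the self-comparison of 0.5
print("average scores:", scores, "-> best arm:", scores.argmax())
```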
These measures form the theoretical foundations for preference-based reinforcement learning and dueling bandits algorithms, which optimize agents based on relative judgments rather than absolute reward values.
Figure: Spectrum of Human Feedback in RL
- MaxEnt IRL Assumptions: Known dynamics and linear rewards are restrictive but foundational.
- Modern Extensions: Deep reward networks and model-free methods generalize IRL.
- Human Feedback Spectrum: Ranges from passive demonstrations to continuous coaching.
- Pairwise Preference Learning: Offers a practical middle ground for human input.
- Bradley–Terry Model: Connects latent rewards to observed preferences.
- Bandit Preference Models: Extend these ideas to learning optimal actions under uncertainty.
Together, these ideas bridge imitation learning and human preference-based reinforcement learning, forming the conceptual groundwork for RLHF used in systems like ChatGPT.
Given noisy pairwise comparisons, infer an underlying reward function that explains the observed preferences.
Once a latent reward model is inferred:
- It can be used to determine which action or arm is best in a multi-armed bandit setting.
- In reinforcement learning (RL), it can be optimized to learn an optimal policy.
We assume each human comparison is recorded as a tuple $(i, j, \mu)$, where:
- $i$ = item 1
- $j$ = item 2
- $\mu = 1$ → item $i$ is preferred
- $\mu = 0$ → item $j$ is preferred
- $\mu = 0.5$ → no preference
This converts the problem into a binary (or ternary) classification task, analogous to logistic regression.
To fit the model, we maximize the likelihood (equivalently, minimize the cross-entropy loss).
Loss function: $$ \mathcal{L}(\phi) = -\sum_{(i,j,\mu)} \Big[ \mu \log P_\phi(i \succ j) + (1-\mu)\log\big(1 - P_\phi(i \succ j)\big) \Big], \qquad P_\phi(i \succ j) = \frac{\exp\big(r_\phi(i)\big)}{\exp\big(r_\phi(i)\big) + \exp\big(r_\phi(j)\big)} $$
where $r_\phi$ can be:
- a linear function,
- a neural network,
- any differentiable model of rewards.
- Collect preference tuples $(i, j, \mu)$ from human comparisons.
- Define a parametric reward model $r_\phi$.
- Compute preference probabilities using the Bradley–Terry model.
- Optimize $\phi$ via gradient descent on $\mathcal{L}(\phi)$.
- Use the resulting $r_\phi$ as the learned human-aligned reward (a minimal fitting sketch appears below).
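Here is a minimal sketch of that procedure under simple assumptions: a linear reward model over item feature vectors and plain gradient descent; the synthetic features and preference labels are placeholders for real human comparison data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 8 items with 3-dim feature vectors, and synthetic
# preference tuples (i, j, mu) where mu = 1 means item i was preferred.
items = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])              # hidden "true" preference weights
pairs = [(i, j) for i in range(8) for j in range(8) if i != j]
prefs = [(i, j, float(items[i] @ true_w > items[j] @ true_w)) for (i, j) in pairs]

phi = np.zeros(3)                                # linear reward: r(x) = phi . x
lr = 0.1

def bt_prob(ri, rj):
    return 1.0 / (1.0 + np.exp(-(ri - rj)))      # Bradley–Terry = logistic in reward gap

for _ in range(500):
    grad = np.zeros(3)
    for i, j, mu in prefs:
        p = bt_prob(items[i] @ phi, items[j] @ phi)
        # Gradient of the cross-entropy loss w.r.t. phi for one comparison.
        grad += (p - mu) * (items[i] - items[j])
    phi -= lr * grad / len(prefs)

# The learned reward ranks items; compare against the hidden ground-truth ranking.
print("learned ranking:", np.argsort(-(items @ phi)))
print("true ranking   :", np.argsort(-(items @ true_w)))
```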
In RL, each trajectory $\tau$ receives a trajectory-level reward $r_\phi(\tau) = \sum_t r_\phi(s_t, a_t)$.
If we have two trajectories, $\tau_A$ and $\tau_B$, a human indicates which one they prefer.
We then assume the Bradley–Terry model at the trajectory level: $$ P(\tau_A \succ \tau_B) = \frac{\exp\big(r_\phi(\tau_A)\big)}{\exp\big(r_\phi(\tau_A)\big) + \exp\big(r_\phi(\tau_B)\big)} $$
Training the reward model proceeds just as in the bandit case, but using trajectory-level comparisons.
After training $r_\phi$, we use it as the reward signal for standard policy optimization (e.g., PPO).
Pipeline Summary:
- Collect trajectory pairs and human preference labels.
- Train $r_\phi$ to predict preferences.
- Use $r_\phi$ as a reward signal in RL.
- Train the policy $\pi_\theta$ via PPO or another optimizer (a simplified sketch follows).
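Below is a minimal sketch of the last two steps under simplifying assumptions: the "learned" reward model is a placeholder table, the environment is a tiny random MDP, and a REINFORCE-style update stands in for PPO; none of this is the lecture's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 states, 2 actions, short episodes; r_phi is a placeholder table
# standing in for a reward model already trained on preference data.
S, A, T = 5, 2, 8
P = rng.dirichlet(np.ones(S), size=(S, A))       # environment dynamics
r_phi = rng.normal(size=(S, A))                  # "learned" reward r_phi(s, a)

theta = np.zeros((S, A))                         # softmax policy parameters

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def rollout(theta):
    """Sample one episode; return states, actions, and the learned-reward return."""
    s, states, actions, R = 0, [], [], 0.0
    for _ in range(T):
        a = rng.choice(A, p=softmax(theta[s]))
        states.append(s); actions.append(a)
        R += r_phi[s, a]                         # reward comes from the learned model
        s = rng.choice(S, p=P[s, a])
    return states, actions, R

# REINFORCE-style updates (a simple stand-in for PPO) against the learned reward.
lr, baseline = 0.05, 0.0
for _ in range(2000):
    states, actions, R = rollout(theta)
    baseline = 0.99 * baseline + 0.01 * R        # running-average baseline
    for s, a in zip(states, actions):
        grad_logp = -softmax(theta[s])
        grad_logp[a] += 1.0                      # d/dtheta log pi(a|s) for softmax
        theta[s] += lr * (R - baseline) * grad_logp

print("average learned-reward return after training:",
      round(float(np.mean([rollout(theta)[2] for _ in range(200)])), 2))
```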
Objective: Train an agent in the MuJoCo simulator to perform a backflip.
- Show human labelers short video clips of two trajectories.
- Ask: “Which looks more like a successful backflip?”
- Collect $\approx 900$ preference labels.
- The model learned a reward function purely from human comparisons.
- The agent successfully learned to perform a backflip using orders of magnitude less data than deep Q-learning (which requires millions of samples).
Key Takeaway:
Preference-based feedback can guide RL systems to learn complex behaviors efficiently.
Figure: Human feedback for preference-based RL
In the corresponding assignment:
- Students implement RLHF and Direct Preference Optimization (DPO).
- Preference datasets are pre-provided (no manual labeling required).
- Tasks involve understanding how reward models are trained from preferences and used in PPO-style updates.
- Supervised Fine-Tuning (SFT): train the model on expert demonstrations (akin to behavior cloning).
- Reward Model Training: gather human comparison data between model outputs (rankings, not just pairs).
- Policy Optimization: use PPO to maximize the learned reward function while maintaining alignment with human intent.
While early RLHF used pairwise comparisons, modern implementations may use multi-way rankings (e.g., rank 1–4 responses) to provide richer feedback signals.
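One common way to use such rankings, shown here as a hedged sketch (the response ids and ranks are invented), is to expand a ranking of K responses into all pairwise comparisons and train the same Bradley–Terry reward model on them:

```python
from itertools import combinations

# A labeler ranks four model responses from best (rank 1) to worst (rank 4).
# Entries are (response_id, rank); the ids and ranks are made up.
ranking = [("resp_a", 1), ("resp_c", 2), ("resp_d", 3), ("resp_b", 4)]

# Expand into pairwise preference tuples (winner, loser) for reward-model training.
pairs = [
    (x_id, y_id) if x_rank < y_rank else (y_id, x_id)
    for (x_id, x_rank), (y_id, y_rank) in combinations(ranking, 2)
]
print(pairs)
# [('resp_a', 'resp_c'), ('resp_a', 'resp_d'), ('resp_a', 'resp_b'),
#  ('resp_c', 'resp_d'), ('resp_c', 'resp_b'), ('resp_d', 'resp_b')]
```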
- In standard RL, the agent learns one specific task (e.g., backflip).
- In RLHF for LLMs, the reward model must generalize across many tasks—writing, reasoning, coding, summarization, etc.
The reward model functions as a meta-objective:
It learns what humans value across tasks, enabling the agent to generalize to unseen prompts.
When given a new instruction like “Write a story about frogs,” the LLM’s policy must perform well according to human-aligned reward signals, even though it has never seen that specific task during training.
- Collect human demonstrations → Supervised learning (SFT).
- Collect human comparisons → Train reward model $r_\phi$.
- Optimize policy → PPO to align model outputs with human values.
- Iterate → Update the dataset and retrain periodically for robustness.
- The 2017 paper on RL from Human Preferences laid the foundational groundwork for modern RLHF.
- Subsequent advances, especially in language model alignment, have transformed it into a meta-RL problem.
- The framework unifies imitation learning, preference learning, and policy optimization under a single principle:
Use human judgments—directly or indirectly—as the optimization signal.
The Bradley–Terry preference model and the reward-model training objective underpin both preference-based RL and modern RLHF systems.
- Pairwise comparisons can be leveraged to recover latent reward models.
- Human preference data serves as a powerful alternative to hand-crafted rewards.
- RLHF applies this idea at scale to align large models with human values.
- The meta-RL viewpoint recognizes RLHF as learning a general-purpose, human-aligned reward model.
- MDP with discount $\gamma \in [0,1)$, start-state distribution $d_0$, dynamics $P(s' \mid s,a)$, policy $\pi(a \mid s)$.
- Feature maps $f(s)$ or $f(s,a)$.
- A (finite or infinite) trajectory $\tau=(s_0,a_0,s_1,a_1,\dots)$ is drawn from the trajectory distribution $$ P_\pi(\tau)=d_0(s_0)\prod_{t\ge 0}\pi(a_t\mid s_t)P(s_{t+1}\mid s_t,a_t). $$
- Trajectory-level (discounted) feature statistic: $$ \mu(\tau)=\sum_{t=0}^{\infty}\gamma^t f(s_t) \quad\text{or}\quad \mu(\tau)=\sum_{t=0}^{\infty}\gamma^t f(s_t,a_t). $$
Define the (unnormalized) discounted occupancy of states and state-actions: $$ \begin{align} \rho_\pi(s) &\triangleq \sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s), \tag{1}\\ \rho_\pi(s,a) &\triangleq \sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s, a_t=a). \tag{2} \end{align} $$
(If you prefer the normalized convention, multiply by $(1-\gamma)$ so the occupancy sums to one; see (14) below.)
These occupancies satisfy the flow constraints: $$ \begin{align} \rho_\pi(s) &= d_0(s) + \gamma\sum_{s',a'} \rho_\pi(s',a')\,P(s\mid s',a'), \tag{3}\\ \rho_\pi(s,a) &= \pi(a\mid s)\,\rho_\pi(s). \tag{4} \end{align} $$
Take expectations of the trajectory-level statistic under the trajectory distribution $P_\pi(\tau)$ and exchange the order of summation:
- State-feature case $$ \begin{align} \mathbb{E}_{\tau\sim P_\pi}\!\left[\sum_{t=0}^{\infty}\gamma^t f(s_t)\right] &= \sum_{t=0}^{\infty}\gamma^t \sum_s \Pr_\pi(s_t=s)\, f(s) \tag{5}\\ &= \sum_s \left(\sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s)\right) f(s) \tag{6}\\ &= \sum_s \rho_\pi(s)\, f(s). \tag{7} \end{align} $$
- State–action feature case $$ \begin{align} \mathbb{E}_{\tau\sim P_\pi}\!\left[\sum_{t=0}^{\infty}\gamma^t f(s_t,a_t)\right] &= \sum_{t=0}^{\infty}\gamma^t \sum_{s,a} \Pr_\pi(s_t=s,a_t=a)\, f(s,a) \tag{8}\\ &= \sum_{s,a} \left(\sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s,a_t=a)\right) f(s,a) \tag{9}\\ &= \sum_{s,a} \rho_\pi(s,a)\, f(s,a). \tag{10} \end{align} $$
Conclusion: Matching expected discounted trajectory features is equivalent to matching the corresponding discounted occupancy–weighted feature sums. This is the precise bridge between the “distribution over trajectories” view and the earlier “discounted accumulated features per state or state–action” view.
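A quick numerical check of identity (7), sketched on a small synthetic MDP: the occupancy is computed by truncated dynamic programming and compared against a Monte Carlo estimate of the discounted feature sum (all numbers below are placeholders).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, T = 4, 2, 0.9, 200                  # truncate the infinite sums at T

P = rng.dirichlet(np.ones(S), size=(S, A))       # dynamics P[s, a, s']
pi = rng.dirichlet(np.ones(A), size=S)           # policy pi[s, a]
d0 = np.array([1.0, 0.0, 0.0, 0.0])
f = rng.normal(size=S)                           # state features f(s)

# Left-hand side of (7): sum_s rho_pi(s) f(s), with rho built by forward DP.
p_t, rho = d0.copy(), np.zeros(S)
for t in range(T):
    rho += (gamma ** t) * p_t                    # rho_pi(s) = sum_t gamma^t Pr(s_t = s)
    p_t = np.einsum("s,sa,sap->p", p_t, pi, P)
lhs = rho @ f

# Right-hand side: Monte Carlo estimate of E_tau[ sum_t gamma^t f(s_t) ].
def sampled_return():
    s, total = rng.choice(S, p=d0), 0.0
    for t in range(T):
        total += (gamma ** t) * f[s]
        a = rng.choice(A, p=pi[s])
        s = rng.choice(S, p=P[s, a])
    return total

rhs = np.mean([sampled_return() for _ in range(5000)])
print(f"occupancy-weighted sum: {lhs:.3f}   Monte Carlo estimate: {rhs:.3f}")
```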
- Apprenticeship / feature expectation matching (Abbeel–Ng style)
  With linear rewards $r_w(s)=w^\top f(s)$ or $r_w(s,a)=w^\top f(s,a)$ and $\|w\|\le 1$: $$ J(\pi;w)=\mathbb{E}_\pi\Big[\sum_{t\ge 0}\gamma^t r_w(\cdot)\Big] = \sum_s \rho_\pi(s)\, w^\top f(s) \quad\text{or}\quad \sum_{s,a} \rho_\pi(s,a)\, w^\top f(s,a). \tag{11} $$ Therefore, if $\sum_s \rho_\pi(s) f(s) \approx \sum_s \rho_{\pi_E}(s) f(s)$ (or the state–action analogue), then $J(\pi;w)\approx J(\pi_E;w)$ for all bounded $w$. Feature-expectation matching is thus occupancy matching in disguise.
- MaxEnt IRL constraints are the same thing
  MaxEnt IRL writes an optimization over trajectory distributions $P(\tau)$: $$ \max_{P} H(P)\quad\text{s.t.}\quad \sum_{\tau}P(\tau)\,\mu(\tau)=\hat\mu_E,\;\; \sum_{\tau}P(\tau)=1,\;\; P(\tau)\ge 0, \tag{12} $$ where $\mu(\tau)=\sum_t \gamma^t f(s_t)$ or $\sum_t \gamma^t f(s_t,a_t)$. Using the identities in (5)–(10), $$ \sum_{\tau}P(\tau)\,\mu(\tau) = \sum_s \rho_{P}(s)\, f(s) \quad\text{or}\quad \sum_{s,a} \rho_{P}(s,a)\, f(s,a). \tag{13} $$ Thus the trajectory-level constraint is exactly the discounted occupancy–weighted feature matching constraint. MaxEnt merely chooses, among all $P(\tau)$ that satisfy this linear constraint, the one with maximum entropy.
- Finite horizon $T$: replace $\sum_{t=0}^{\infty}\gamma^t$ by $\sum_{t=0}^{T-1}\gamma^t$ and define $\rho_\pi^T$ accordingly. All identities hold with $T$-limited sums.
- Normalized occupancy $\tilde\rho_\pi=(1-\gamma)\rho_\pi$ makes $\sum_s \tilde\rho_\pi(s)=1$ under stationary starts; then $$ \mathbb{E}_{\tau}\!\left[\sum_{t\ge 0}\gamma^t f(\cdot)\right] = \frac{1}{1-\gamma}\sum \tilde\rho_\pi(\cdot)\, f(\cdot). \tag{14} $$
- The “distribution over trajectories” language is a convenient probabilistic wrapper.
- The constraints it imposes (matching expected discounted features) are identical to the earlier imitation learning formulation based on discounted occupancy measures.
- Linear rewards make this equivalence operational: matching feature expectations $\Longleftrightarrow$ matching returns for any bounded linear reward.
We now revisit the derivation from the ground up, this time using Lagrange multipliers to enforce reward equivalence rather than feature equivalence.
We want to recover the reward function in the maximum entropy inverse reinforcement learning (MaxEnt IRL) framework by using Lagrange multipliers to match the expected rewards of the learned policy with the expert's expected rewards. The goal is to show how the trajectory distribution becomes exponential in terms of the reward function.
The fundamental goal of MaxEnt IRL is to recover a reward function such that the expected reward under the learned policy matches the expected reward of the expert's behavior. Formally: $$ \mathbb{E}_{\tau_\pi}\Big[\sum_t \gamma^t r_\phi(s_t, a_t)\Big] = \mathbb{E}_{\tau_E}\Big[\sum_t \gamma^t r_\phi(s_t, a_t)\Big] $$
where:
- $\tau_E$ is the expert’s trajectory,
- $\tau_\pi$ is the trajectory generated by the learned policy $\pi$,
- $r_\phi(s_t, a_t)$ is the reward function.
MaxEnt IRL maximizes the entropy of the trajectory distribution while enforcing a constraint on the feature expectations. However, in the current context, we are interested in enforcing a reward equivalence constraint, not just feature equivalence.
Let the trajectory distribution of the expert be denoted by $p_E(\tau)$ and the learned trajectory distribution by $p(\tau)$.
We aim to maximize the entropy of the trajectory distribution $p(\tau)$ subject to the reward-matching constraint.
The entropy of the trajectory distribution is: $$ H(p) = -\sum_\tau p(\tau)\log p(\tau). $$
The constraint is that the expected reward under the learned policy should match the expected reward under the expert's policy. In other words: $$ \sum_\tau p(\tau)\, r_\phi(\tau) = \mathbb{E}_{\tau \sim p_E}\big[r_\phi(\tau)\big], \qquad r_\phi(\tau) = \sum_t \gamma^t r_\phi(s_t, a_t). $$
We now introduce a Lagrange multiplier $\lambda$ for the reward constraint and $\eta$ for normalization, and form the Lagrangian: $$ \mathcal{L}(p, \lambda, \eta) = -\sum_\tau p(\tau)\log p(\tau) + \lambda\Big(\sum_\tau p(\tau)\, r_\phi(\tau) - \mathbb{E}_{p_E}[r_\phi(\tau)]\Big) - \eta\Big(\sum_\tau p(\tau) - 1\Big). $$
This Lagrangian captures both the entropy maximization and the reward equivalence constraint.
Now, we take the derivative of $\mathcal{L}$ with respect to $p(\tau)$.
The derivative is: $$ \frac{\partial \mathcal{L}}{\partial p(\tau)} = -\log p(\tau) - 1 + \lambda\, r_\phi(\tau) - \eta. $$
Setting the derivative equal to zero gives us the equation: $$ \log p(\tau) = \lambda\, r_\phi(\tau) - 1 - \eta. $$
Solving for $p(\tau)$: $$ p(\tau) = \exp\big(\lambda\, r_\phi(\tau) - 1 - \eta\big). $$
We can fold the constant factor $\exp(-1-\eta)$ into a normalizing constant $Z$ and absorb $\lambda$ into the scale of the reward, giving: $$ p(\tau) = \frac{1}{Z}\exp\big(r_\phi(\tau)\big). $$
This is the desired result: the trajectory distribution is now expressed as being exponential in the reward function $r_\phi$.
We can conclude that: $$ p(\tau) \propto \exp\Big(\sum_t \gamma^t r_\phi(s_t, a_t)\Big), $$
which shows that the trajectory distribution is proportional to the exponential of the sum of discounted rewards.
- The key idea is that the trajectory distribution is chosen to maximize entropy while ensuring that the expected rewards under the learned policy match the expected rewards of the expert's behavior.
- The Lagrange multiplier $\lambda$ enforces this constraint, and the result is that the trajectory distribution is exponential in the reward function, where the reward function is assumed to be linear in the feature function.
This derivation shows how the trajectory distribution becomes exponential in the reward when entropy is maximized subject to a reward-matching (rather than feature-matching) constraint.