
Imitation Learning and Introduction to Max Entropy IRL (part 1/4, REF 00:00–20:47)

Overview (REF 00:00–04:30)

This lecture transitions from imitation learning to the foundations of Reinforcement Learning from Human Feedback (RLHF) and its modern extensions like Direct Preference Optimization (DPO). The instructor begins with a recap of behavior cloning and DAGGER, connecting them to recent advancements where reinforcement learning (RL) methods enable systems such as ChatGPT to learn from human preferences.

Key points:

  • Behavior Cloning (BC): Reduces RL to supervised learning by learning direct mappings from states to actions using expert demonstrations.
  • DAGGER (Dataset Aggregation): Improves upon BC by incorporating expert feedback iteratively to correct policy drift due to distribution mismatch.
  • RLHF: Combines supervised fine-tuning with human feedback to align large models like ChatGPT.

Motivation for Reinforcement Learning from Human Feedback (REF 04:30–05:54)

RLHF enables models to learn complex tasks by leveraging human evaluations of policy outputs instead of explicit reward functions. The field is rapidly evolving:

  • DPO (Direct Preference Optimization) has begun to outperform RLHF on several benchmarks.
  • These approaches extend imitation learning principles by directly incorporating human preferences or demonstration data into the learning process.

Imitation Learning Recap (REF 05:54–07:49)

Imitation learning uses demonstrations—sequences of state-action pairs—without explicit rewards. Two primary data sources:

  1. Explicit demonstrations: Human-guided examples (e.g., a robot shown how to pick up a cup).
  2. Natural trajectories: Observational data from expert behavior (e.g., doctor decisions in medical records).

Motivations:

  • Rewards are often difficult to define explicitly.
  • Expert demonstrations may already reflect the desired objective.

Main techniques:

  1. Behavior Cloning (BC):
    • Treats imitation as a supervised learning problem.
    • Learns a policy that minimizes prediction error against expert actions.
  2. DAGGER:
    • Addresses distribution mismatch when learned policies deviate from training data.
    • Iteratively queries the expert to label states encountered by the current policy.

Illustration Example:
A race car trained through BC may drive off-track due to unseen states; DAGGER mitigates this by asking an expert what to do when the policy fails.


From Imitation to Reward Recovery (REF 08:17–09:44)

The central question:
Can we recover the underlying reward function from expert demonstrations?

Key Concept:

  • Two policies are equivalent if they induce the same distribution over states and actions.
  • If rewards depend only on states and actions, such policies will yield identical rewards.

Formally: $$ p_\pi(s, a) = p_{\pi^*}(s, a) \implies \mathbb{E}_{p_\pi}[r(s,a)] = \mathbb{E}_{p_{\pi^*}}[r(s,a)] $$

Thus, learning to match the distribution of expert trajectories indirectly recovers expert-level performance.


Feature-based Linear Reward Model (REF 09:44–11:41)

Assume the reward function is a linear combination of features: $$ r(s,a) = w^\top \mu(s,a) $$

where:

  • $\mu(s,a)$: feature vector representing state-action attributes (e.g., speed, sentiment, collisions).
  • $w$: weight vector encoding feature importance.

Intuition:

  • If a policy induces expected features similar to the expert’s, then: $$ \| \mathbb{E}_{\pi}[\mu] - \mathbb{E}_{\pi^*}[\mu] \| \to 0 \implies r_{\pi} \approx r_{\pi^*} $$

This formulation allows imitation learning to become a reward matching problem via feature distributions.
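
To make feature matching concrete, here is a minimal sketch (not from the lecture) that estimates discounted feature expectations from sampled trajectories and compares expert and learner; the feature map `mu`, the trajectory format, and the toy numbers are all assumptions.

```python
import numpy as np

def feature_expectation(trajectories, mu, gamma=0.99):
    """Average discounted feature sum over a list of trajectories.

    Each trajectory is a list of (state, action) pairs, and mu(s, a)
    returns a fixed-length feature vector (assumed interface).
    """
    total = np.zeros_like(mu(*trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            total += (gamma ** t) * mu(s, a)
    return total / len(trajectories)

# Toy example: 2-D features and hand-made trajectories.
mu = lambda s, a: np.array([float(s), float(a)])
expert = [[(1, 0), (2, 1)], [(1, 1), (3, 1)]]
learner = [[(0, 0), (1, 0)], [(1, 0), (2, 0)]]

gap = np.linalg.norm(feature_expectation(expert, mu) - feature_expectation(learner, mu))
print("feature expectation gap:", gap)  # small gap => similar return under any linear w
```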


Challenge: Non-uniqueness of Reward Functions (REF 11:41–12:38)

There is no unique reward function consistent with observed optimal behavior:

  • Even a zero reward could be compatible with any trajectory.
  • Therefore, reward inference is ill-posed without additional constraints.

To resolve this, we introduce a disambiguation principle: the Maximum Entropy principle.


Principle of Maximum Entropy (REF 12:38–14:34)

Definition: Given known constraints, choose the probability distribution $p$ with the maximum entropy: $$ H(p) = - \sum_x p(x) \log p(x) $$

Principle:
The optimal $p$ represents the least-biased estimate consistent with known data.

Interpretation in imitation learning:

  • We aim to find the most uncertain (highest entropy) distribution over trajectories that still matches the expert’s observed statistics.
  • This ensures no extra assumptions are imposed beyond what demonstrations reveal.
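
As a standalone illustration of the principle (the classic constrained-die example, not part of the lecture), the sketch below finds the maximum-entropy distribution over the faces 1–6 whose mean is pinned to an assumed value of 4.5; the solution is exponential in the constrained statistic, mirroring the trajectory case developed below.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5  # assumed constraint E[x] = 4.5

def mean_under(lam):
    p = np.exp(lam * faces)
    p /= p.sum()
    return p @ faces

# Solve for the Lagrange multiplier so the constraint holds; the resulting
# distribution p(x) proportional to exp(lam * x) is the max-entropy solution.
lam = brentq(lambda l: mean_under(l) - target_mean, -5.0, 5.0)
p = np.exp(lam * faces)
p /= p.sum()
print(np.round(p, 3), "mean =", round(float(p @ faces), 3))
```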

Motivation and Example: Taxi Driver Behavior (REF 16:01–16:58)

The method originated from research on modeling taxi driver behavior in Pittsburgh (Ziebart et al., 2008).
Goal:

  • Infer the reward structure motivating expert trajectories (e.g., minimizing time, tolls, or distance).
  • Learn a policy that matches expert performance under maximum uncertainty.

Formulating Max Entropy IRL (REF 16:58–18:52)

We define a distribution over trajectories $\tau$:

$$ p(\tau) $$

We seek:

  1. A valid probability distribution: $$ \sum_\tau p(\tau) = 1 $$
  2. That matches the expected feature statistics from expert demonstrations $D$: $$ \mathbb{E}_{p(\tau)}[\mu(\tau)] = \mathbb{E}_{\tau \sim D}[\mu(\tau)] $$

Objective: Maximize entropy $H(p(\tau))$ subject to these constraints: $$ \max_{p(\tau)} H(p(\tau)) \quad \text{s.t.} \quad \sum_\tau p(\tau) = 1, \; \mathbb{E}_{p(\tau)}[\mu(\tau)] = \hat{\mu}_D $$


Connection to Policies and Rewards (REF 18:52–20:47)

Distribution over Trajectories vs Policies

  • Trajectories can be generated by an implicit policy $\pi$.
  • Thus, learning $p(\tau)$ is equivalent to learning a policy that produces those trajectories.

Reward Matching Condition

If we had rewards: $$ \mathbb{E}_{p(\tau)}[r(\tau)] = \mathbb{E}_{p_{\text{expert}}(\tau)}[r(\tau)] $$

Since the expert’s policy is optimal, matching its trajectory distribution achieves expert-level rewards.


Algorithmic Framework for MaxEnt IRL

The iterative approach:

  1. Initialize a candidate reward function $r_\phi(s,a)$.
  2. Learn an optimal policy $\pi^*$ under this reward.
  3. Generate trajectories using $\pi^*$ and compute feature expectations.
  4. Update the reward parameters $\phi$ to better align feature expectations with expert data.
  5. Repeat until convergence.

This cyclical optimization captures both reward inference and policy learning, forming the foundation for Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL).


Figure: Conceptual flow of the MaxEnt IRL process


Summary of Part 1:

  • Reviewed behavior cloning and DAGGER as forms of imitation learning.
  • Introduced reward ambiguity and the need for maximum entropy regularization.
  • Formulated the MaxEnt IRL objective: maximizing trajectory entropy subject to feature matching constraints.
  • Set up the foundation for subsequent exploration into algorithmic solutions and applications in modern RLHF frameworks.

Maximum Entropy Inverse Reinforcement Learning — Derivation and Algorithm (part 2/4, REF 21:16–48:19)

1. Overview of the MaxEnt IRL Optimization Framework (REF 21:16)

This section formalizes the relationship between reward functions, optimal policies, and distributions over trajectories.
The key goal is to derive how to update the reward function parameters ($\phi$) using maximum entropy principles and gradient optimization.

We study:

  1. The relation between reward functions and optimal policies.
  2. The mapping from policies to trajectory distributions.
  3. The update of reward parameters based on trajectory likelihood.

In the original MaxEnt IRL formulation (Ziebart et al., 2008), the dynamics model is assumed to be known.


2. Revisiting the Constrained Optimization Problem (REF 22:07)

We start from the entropy maximization problem:

$$ \max_{p(\tau)} H(p(\tau)) \quad \text{s.t.} \quad \sum_{\tau} p(\tau) = 1, \quad \mathbb{E}_{p(\tau)}[\mu(\tau)] = \hat{\mu}_D $$

To analyze the structural form of the solution, we introduce Lagrange multipliers.


3. Lagrangian Formulation (REF 23:25–24:23)

Define the Lagrangian:

$$ \mathcal{L}(p, \lambda, \eta) = -\sum_{\tau} p(\tau)\log p(\tau) + \lambda^\top \left( \sum_{\tau} p(\tau)\mu(\tau) - \hat{\mu}_D \right) + \eta \left( \sum_{\tau} p(\tau) - 1 \right) $$

We take the derivative of $\mathcal{L}$ with respect to $p(\tau)$ for each trajectory $\tau$:

$$ \frac{\partial \mathcal{L}}{\partial p(\tau)} = - (1 + \log p(\tau)) + \lambda^\top \mu(\tau) + \eta = 0 $$


4. Exponential Family Form (REF 25:42–27:01)

Solving for $p(\tau)$ and exponentiating both sides gives:

$$ p(\tau) = \frac{1}{Z(\lambda)} \exp(\lambda^\top \mu(\tau)) $$

where the partition function (normalizing constant) is:

$$ Z(\lambda) = \sum_{\tau} \exp(\lambda^\top \mu(\tau)) $$

This shows that the distribution over trajectories maximizing entropy under feature constraints is exponential in the reward (or feature) function.

Interpretation:

  • Trajectories with higher reward have exponentially higher probability.
  • This defines an exponential family distribution over trajectories.
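
A tiny numeric sketch of this form (all numbers hypothetical): with an enumerable set of trajectories, $p(\tau)$ is just a softmax of $\lambda^\top \mu(\tau)$.

```python
import numpy as np

# Hypothetical feature vectors for four enumerable trajectories.
mu_taus = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 1.0],
                    [0.2, 0.2]])
lam = np.array([2.0, -1.0])  # assumed multipliers (reward weights)

logits = mu_taus @ lam
p = np.exp(logits - logits.max())   # subtract max for numerical stability
p /= p.sum()                        # the 1/Z normalization
print(p)  # trajectories with higher lambda^T mu get exponentially more mass
```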

5. From Entropy Maximization to Likelihood Maximization (REF 27:28–29:48)

Since we do not know the reward function $r_\phi(\tau)$, we treat it as parameterized by $\phi$ and estimate it using maximum likelihood.

The probability of observing a trajectory $\tau_i$ given parameters $\phi$ is:

$$ p(\tau_i \mid \phi) = \frac{1}{Z(\phi)} \exp(r_\phi(\tau_i)) $$

where

$$ Z(\phi) = \sum_{\tau} \exp(r_\phi(\tau)) $$

Thus, learning $\phi$ becomes equivalent to maximizing the log-likelihood of the expert trajectories under this model.


6. Log-Likelihood Objective and Gradient (REF 31:44–36:42)

6.1 Log-Likelihood

Given dataset $D = \{\tau_1, \dots, \tau_N\}$, we define:

$$ \mathcal{J}(\phi) = \sum_{\tau_i \in D} \log p(\tau_i \mid \phi) $$

Substitute the exponential form:

$$ \mathcal{J}(\phi) = \sum_{\tau_i \in D} \left[ r_\phi(\tau_i) - \log Z(\phi) \right] $$

Since $\log Z(\phi)$ is independent of any single trajectory, it simplifies to:

$$ \mathcal{J}(\phi) = \sum_{\tau_i \in D} r_\phi(\tau_i) - |D| \cdot \log \sum_{\tau} e^{r_\phi(\tau)} $$


6.2 Gradient of the Log-Likelihood

Taking the derivative:

$$ \frac{\partial \mathcal{J}(\phi)}{\partial \phi} = \sum_{\tau_i \in D} \frac{\partial r_\phi(\tau_i)}{\partial \phi} - |D| \sum_{\tau} p(\tau \mid \phi) \frac{\partial r_\phi(\tau)}{\partial \phi} $$

where:

$$ p(\tau \mid \phi) = \frac{e^{r_\phi(\tau)}}{Z(\phi)} $$


6.3 Interpretation

The gradient has two competing terms:

  1. Empirical expectation under expert trajectories.
  2. Expected value under the model distribution.

Thus, the update rule pushes the model’s expected reward (equivalently, its expected feature counts under a linear reward) toward that of the expert demonstrations.
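
The sketch below (an assumed toy setup with an enumerable trajectory space and a linear reward) evaluates this gradient as the gap between the empirical and model feature expectations; the sum over $D$ is written as a mean, so the $|D|$ factor drops out.

```python
import numpy as np

def maxent_gradient(phi, expert_feats, all_traj_feats):
    """Gradient of the per-trajectory log-likelihood for a linear reward
    r_phi(tau) = phi . mu(tau), with an enumerable trajectory space.

    expert_feats:   (N, d) feature vectors of demonstrated trajectories
    all_traj_feats: (M, d) feature vectors of every possible trajectory
    """
    rewards = all_traj_feats @ phi
    p = np.exp(rewards - rewards.max())
    p /= p.sum()                             # p(tau | phi) = e^{r_phi(tau)} / Z
    expert_term = expert_feats.mean(axis=0)  # empirical expectation
    model_term = p @ all_traj_feats          # expectation under the model
    return expert_term - model_term

# Made-up trajectory features.
phi = np.zeros(2)
expert = np.array([[1.0, 0.0], [0.8, 0.2]])
all_taus = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.5, 0.5]])
print(maxent_gradient(phi, expert, all_taus))
```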


7. Transition from Trajectories to States (REF 37:11–39:03)

The trajectory probability can be factorized as:

$$ p(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t) $$

Using this decomposition, the gradient can be rewritten in terms of state expectations:

$$ \frac{\partial \mathcal{J}(\phi)}{\partial \phi} = \sum_{s \in D} \frac{\partial r_\phi(s)}{\partial \phi} - |D| \sum_{s} p(s \mid \phi) \frac{\partial r_\phi(s)}{\partial \phi} $$

If the reward is linear in features ($r_\phi(s) = \phi^\top \mu(s)$), then:

$$ \frac{\partial r_\phi(s)}{\partial \phi} = \mu(s) $$

Hence, the gradient simplifies to:

$$ \frac{\partial \mathcal{J}(\phi)}{\partial \phi} = \sum_{s \in D} \mu(s) - |D| \sum_{s} p(s \mid \phi)\mu(s) $$

This aligns feature expectations from the expert and the model.


8. Computing State Visitation Frequencies (REF 39:31–42:43)

To evaluate $p(s \mid \phi)$, we compute state-action visitation frequencies using dynamic programming, assuming the dynamics model is known and the problem is tabular.

  1. Initialization: $$ \rho_1(s) = P(s_1 = s) $$

  2. Recurrence: For each time step $t$: $$ \rho_{t+1}(s') = \sum_{s, a} \rho_t(s)\pi(a \mid s) P(s' \mid s, a) $$

  3. Expected State Distribution: $$ \bar{\rho}(s) = \sum_{t=1}^{T} \rho_t(s) $$

These $\bar{\rho}(s)$ values represent the state visitation frequencies, used in computing the gradient and updating $\phi$.
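
A direct tabular implementation of this recurrence, assuming the dynamics tensor `P`, the policy `pi`, and the start distribution are given as arrays (toy numbers below):

```python
import numpy as np

def visitation_frequencies(P, pi, rho1, T):
    """Tabular state visitation frequencies via the recurrence above.

    P:    (S, A, S) transition probabilities P(s' | s, a)  (assumed known)
    pi:   (S, A) policy pi(a | s)
    rho1: (S,) initial state distribution
    T:    horizon
    """
    rho_t = rho1.copy()
    rho_bar = rho1.copy()
    for _ in range(T - 1):
        # rho_{t+1}(s') = sum_{s,a} rho_t(s) pi(a|s) P(s'|s,a)
        rho_t = np.einsum('sa,sap->p', rho_t[:, None] * pi, P)
        rho_bar += rho_t
    return rho_bar

# Tiny 2-state, 2-action example (made-up numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
print(visitation_frequencies(P, pi, np.array([1.0, 0.0]), T=5))
```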


9. Full MaxEnt IRL Algorithm (REF 43:13–44:37)

Algorithm: Maximum Entropy Inverse Reinforcement Learning

Inputs:

  • Expert demonstrations $D = \{\tau_i\}$
  • Feature function $\mu(s)$
  • Known dynamics model
  • Initial reward parameters $\phi_0$

Procedure:

  1. Initialize $\phi = \phi_0$

  2. Repeat until convergence:

    1. Compute optimal policy $\pi_\phi$ under current reward using value iteration.
    2. Compute state visitation frequencies $\bar{\rho}_\phi(s)$ via dynamic programming.
    3. Compute gradient: $$ \nabla_\phi \mathcal{J} = \sum_{s \in D} \mu(s) - \sum_{s} \bar{\rho}_\phi(s) \mu(s) $$
    4. Update reward parameters: $$ \phi \leftarrow \phi + \alpha \nabla_\phi \mathcal{J} $$
  3. Output: Estimated reward function $r_\phi(s)$ and policy $\pi_\phi$
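
Below is a simplified, self-contained sketch of this loop for a tabular problem (assumptions: known dynamics `P`, a state feature matrix standing in for $\mu(s)$, expert data given as a flat list of visited states, a soft value iteration in place of the exact policy step, and both feature expectations normalized so the $|D|$ factor cancels).

```python
import numpy as np

def softmax_policy_from_reward(r, P, T, temp=1.0):
    """Soft value iteration over a finite horizon; returns pi with shape (S, A).
    r: (S,) state reward; P: (S, A, S) known dynamics."""
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(T):
        Q = r[:, None] + P @ V                       # Q(s, a)
        V = temp * np.log(np.exp(Q / temp).sum(axis=1))
    pi = np.exp((Q - V[:, None]) / temp)
    return pi / pi.sum(axis=1, keepdims=True)

def visitation(P, pi, rho1, T):
    """State visitation frequencies, as in Section 8."""
    rho_t, rho_bar = rho1.copy(), rho1.copy()
    for _ in range(T - 1):
        rho_t = np.einsum('sa,sap->p', rho_t[:, None] * pi, P)
        rho_bar += rho_t
    return rho_bar

def maxent_irl(features, P, rho1, expert_states, T=10, lr=0.1, iters=200):
    """features: (S, d) matrix of mu(s); expert_states: flat list of states
    visited in the demonstrations. Returns the reward weights phi."""
    phi = np.zeros(features.shape[1])
    expert_mu = features[np.array(expert_states)].mean(axis=0)  # expert feature counts
    for _ in range(iters):
        r = features @ phi                           # r_phi(s) = phi . mu(s)
        pi = softmax_policy_from_reward(r, P, T)     # policy under current reward
        rho = visitation(P, pi, rho1, T)             # state visitation frequencies
        model_mu = (rho / rho.sum()) @ features      # model feature expectation
        phi += lr * (expert_mu - model_mu)           # gradient ascent step
    return phi
```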


10. Dependencies and Assumptions (REF 44:37–48:19)

Steps Requiring the Dynamics Model:

  1. Computing the Optimal Policy:
    • Requires transition probabilities $P(s' \mid s, a)$.
  2. Computing State Visitation Frequencies:
    • Requires the same model for propagating state probabilities.

Gradient computation itself does not require the model directly but depends on the previously computed state distributions.


Visualization

Figure: Flow of the MaxEnt IRL process


11. Summary of Key Insights

  1. MaxEnt Principle → Distribution proportional to $\exp(r_\phi(\tau))$.
  2. Learning $\phi$ → Maximize likelihood of expert trajectories.
  3. Gradient → Difference between expert and model feature expectations.
  4. Dynamic Programming → Used for computing state visitation frequencies.
  5. Assumption → Dynamics model is known (essential for policy optimization and frequency estimation).

Final Equation Summary:

$$ \begin{align} \nabla_\phi \mathcal{J} &= \mathbb{E}_{\tau \sim p_{\text{expert}}}[\mu(\tau)] - \mathbb{E}_{\tau \sim p_\phi}[\mu(\tau)] \tag{1} \\ p_\phi(\tau) &= \frac{1}{Z(\phi)} \exp(r_\phi(\tau)) \tag{2} \\ r_\phi(s) &= \phi^\top \mu(s) \tag{3} \end{align} $$

These form the mathematical backbone of Maximum Entropy Inverse Reinforcement Learning.

From MaxEnt IRL to Human Feedback and Preference Learning (part 3/4, REF 48:48–1:06:48)

1. Revisiting Assumptions in MaxEnt IRL (REF 48:48–50:12)

Key Observation:

While computing optimal policies and state visitation frequencies in Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL), one requires access to the dynamics model—the transition probabilities $P(s' \mid s, a)$.

Notes:

  • The gradient update does not directly require the dynamics model after frequencies are computed.
  • However, dynamic programming methods—used to compute these frequencies—depend heavily on it.
  • This is a strong assumption, especially for human-centered systems (e.g., medical decision-making), where the true dynamics are rarely known.

Limitation:

  • For physical simulations (like MuJoCo), the dynamics are known.
  • For human or real-world environments, this assumption breaks down.

2. Advances Beyond Linear and Known Dynamics Models (REF 49:15–50:41)

Extensions of MaxEnt IRL:

  1. Nonlinear Reward Models:
    Chelsea Finn and colleagues (2016) extended MaxEnt IRL to general reward and cost functions such as deep neural networks.
  2. Unknown Dynamics Models:
    The same work removed the requirement to know transition dynamics.
  3. Complex State Spaces:
    Enabled applications in high-dimensional, continuous domains where dynamic programming is infeasible.

Key Insight:

The MaxEnt IRL framework resolves reward ambiguity—many possible rewards can explain expert behavior—by selecting the maximum entropy distribution consistent with demonstrations.

Applications include:

  • Modeling taxi driver routes.
  • Learning control policies from demonstrations.

3. Recap: Why Imitation Learning Matters (REF 50:41–52:31)

Purpose:

Imitation learning enables policy learning from demonstrations without needing explicit reward definitions.

Benefits:

  • Reduced data requirements: Learns efficiently from limited demonstrations.
  • Simplified objective: Converts RL to a supervised learning problem when using behavior cloning.
  • Interpretability: Provides insight into human decision strategies.

Core Concepts to Master:

  1. Behavior Cloning (BC):
    Directly maps states to expert actions using supervised learning.
  2. Maximum Entropy Principle:
    Ensures unbiased trajectory distributions by maximizing entropy under constraints.
  3. Reward Recovery:
    Reformulates imitation as a maximum likelihood estimation of reward parameters.

Note:

The learned reward is not necessarily the human’s true reward—it is the reward most compatible with observed behavior under maximum entropy.


4. Transition: From Imitation Learning to Human Feedback (REF 52:31–53:29)

We now extend the concept of human input beyond demonstrations to interactive feedback for training RL agents.

Human Feedback in RL:

Two main contexts:

  1. Task-specific guidance:
    Humans provide corrective feedback to train agents for individual tasks (e.g., teaching a robot to clean a kitchen).
  2. Value alignment:
    Feedback helps align large models (like LLMs) with human intentions and ethical norms.

5. Early Work: Human-in-the-Loop Learning (REF 53:29–54:55)

Example 1: Sophie's Kitchen (Thomaz & Breazeal, MIT)

  • A system where humans teach an agent to follow recipes.
  • Demonstrated interactive teaching via real-time feedback.
  • Highlighted the advantage of human guidance over exploration-only methods (like $\epsilon$-greedy).

Dual Feedback Channels:

  1. Environmental rewards:
    Task-defined, e.g., “+1” for completing an objective.
  2. Human rewards:
    Direct feedback like “good” or “bad,” representing subjective approval.

Analogy: Similar to DAGGER, where the human acts as a constant coach guiding the policy’s corrections.


6. TAMER Framework (REF 55:24–56:22)

Developed by: Brad Knox & Peter Stone (UT Austin)

Concept:

  • Human provides explicit scalar feedback while observing the agent.
  • The agent learns an explicit reward model from these human signals.
  • Applied to Tetris, showing faster performance improvement compared to policy-only methods.

Limitations:

  • Requires continuous human presence (like DAGGER).
  • May not outperform fully autonomous RL over long training periods.

7. Spectrum of Human Feedback (REF 56:50–57:18)

There exists a continuum of human involvement in RL training:

| Level of Human Input | Description | Example |
| --- | --- | --- |
| Minimal | Passive demonstrations only | Behavior Cloning |
| Moderate | Occasional preference judgments | Pairwise comparison |
| High | Continuous feedback and correction | DAGGER / TAMER |

Pairwise preference models fall into the middle ground, balancing feasibility and data richness.


8. Pairwise Preference Learning (REF 57:18–58:44)

Motivation:

  • Asking humans for exact reward values is cognitively difficult.
  • Easier to ask for comparisons:
    “Which behavior do you prefer—A or B?”

Example Domains:

  1. Search Ranking Systems:
    Early work by Yisong Yue & Thorsten Joachims (Cornell) — users compare ranking results (A vs. B) instead of assigning numeric ratings.
  2. Robotics and Driving:
    Dorsa Sadigh’s group used human comparisons to guide autonomous driving behavior (e.g., safe vs. unsafe lane changes).

Insight:

Pairwise comparison is easier and faster for humans while providing enough signal to infer latent reward preferences.


9. The Bradley–Terry Preference Model (REF 1:01:08–1:02:35)

A statistical model connecting latent rewards to pairwise preferences.

Model Definition:

For items $B_i$ and $B_j$ with latent rewards $r(B_i)$ and $r(B_j)$, the probability that a human prefers $B_i$ over $B_j$ is:

$$ P(B_i \succ B_j) = \frac{e^{r(B_i)}}{e^{r(B_i)} + e^{r(B_j)}} $$

Properties:

  • If $r(B_i) = r(B_j)$, then $P = 0.5$ (equal preference).
  • If $r(B_i) \gg r(B_j)$, then $P \approx 1$ (strong preference).
  • Captures noisy decision-making—not deterministic, but probabilistic.
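
In code, the Bradley–Terry probability is just a logistic function of the reward difference (a one-line identity, shown here with made-up reward values):

```python
import math

def bt_prob(r_i, r_j):
    """Bradley-Terry probability that item i is preferred to item j."""
    # e^{r_i} / (e^{r_i} + e^{r_j}) == sigmoid(r_i - r_j)
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

print(bt_prob(1.0, 1.0))   # 0.5  (equal latent rewards)
print(bt_prob(5.0, 0.0))   # ~0.993 (strong preference for i)
```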

10. Relation to Human Reward Inference (REF 1:02:35–1:03:58)

  • Assumes a latent internal reward function that is not directly observable.
  • Humans provide noisy preference signals that approximate these internal rewards.
  • Can be combined with RLHF (Reinforcement Learning from Human Feedback) to train models using pairwise comparisons instead of explicit reward values.

Important Note:

The Bradley–Terry model implies stochastic transitivity: if $i$ tends to be preferred to $j$ and $j$ to $k$, then $i$ tends to be preferred to $k$, which keeps the inferred preferences internally consistent.


11. Preference Modeling in K-Armed Bandits (REF 1:02:06–1:06:18)

Setup:

  • There are K actions (arms), each with an unknown reward.
  • Humans provide pairwise comparisons instead of scalar feedback.

Objective:

Estimate which action has the highest expected latent reward based on observed comparisons.

Probabilistic Preference Function:

$$ P(a_i \succ a_j) = \frac{e^{r(a_i)}}{e^{r(a_i)} + e^{r(a_j)}} $$


12. Ranking-Based Optimality Criteria (REF 1:04:55–1:06:48)

1. Condorcet Winner:

An action $a_i$ is a Condorcet winner if it is preferred to every other action with probability greater than 0.5: $$ P(a_i \succ a_j) > 0.5 \quad \forall j \neq i $$

2. Copeland Winner:

The action that wins the largest number of pairwise comparisons: $$ a_i = \arg\max_i \sum_{j \neq i} \mathbb{I}[P(a_i \succ a_j) > 0.5] $$

3. Borda Winner:

The action with the highest average score: $$ \text{Score}(a_i) = \sum_{j \neq i} \begin{cases} 1 & \text{if } a_i \succ a_j \\ 0.5 & \text{if } a_i = a_j \\ 0 & \text{otherwise} \end{cases} $$

These measures form the theoretical foundations for preference-based reinforcement learning and dueling bandits algorithms, which optimize agents based on relative judgments rather than absolute reward values.
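
A small sketch computing the three winners from a hypothetical pairwise preference matrix; note the Borda score below sums preference probabilities, a common continuous variant of the indicator-based definition above.

```python
import numpy as np

# P[i, j] = assumed probability that arm i is preferred to arm j (made-up numbers).
P = np.array([[0.5, 0.6, 0.7],
              [0.4, 0.5, 0.55],
              [0.3, 0.45, 0.5]])
K = P.shape[0]

# Condorcet: beats every other arm with probability > 0.5.
condorcet = [i for i in range(K) if all(P[i, j] > 0.5 for j in range(K) if j != i)]
# Copeland: most pairwise wins (the diagonal is exactly 0.5, so it never counts).
copeland = int(np.argmax((P > 0.5).sum(axis=1)))
# Borda-style: highest average preference probability.
borda = int(np.argmax(P.sum(axis=1)))

print("Condorcet winner(s):", condorcet)  # [0]
print("Copeland winner:", copeland)       # 0
print("Borda winner:", borda)             # 0
```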


13. Conceptual Figure

Figure: Spectrum of human feedback in RL


14. Summary of Key Insights

  1. MaxEnt IRL Assumptions: Known dynamics and linear rewards are restrictive but foundational.
  2. Modern Extensions: Deep reward networks and model-free methods generalize IRL.
  3. Human Feedback Spectrum: Ranges from passive demonstrations to continuous coaching.
  4. Pairwise Preference Learning: Offers a practical middle ground for human input.
  5. Bradley–Terry Model: Connects latent rewards to observed preferences.
  6. Bandit Preference Models: Extend these ideas to learning optimal actions under uncertainty.

Together, these ideas bridge imitation learning and human preference-based reinforcement learning, forming the conceptual groundwork for RLHF used in systems like ChatGPT.

Learning from Pairwise Preferences to RLHF and Meta-RL (part 4/4, REF 1:07:13–1:13:47)

1. From Preferences to Latent Reward Models (REF 1:07:13–1:08:35)

Goal:

Given noisy pairwise comparisons, infer an underlying reward function $r_\phi$ that represents human preferences.

Motivation:

Once a latent reward model is inferred:

  1. It can be used to determine which action or arm is best in a multi-armed bandit setting.
  2. In reinforcement learning (RL), it can be optimized to learn an optimal policy.

Data Representation:

We assume $N$ tuples of the form $(i, j, \mu)$:

  • $i$ = item 1
  • $j$ = item 2
  • $\mu = 1$ → item $i$ is preferred
  • $\mu = 0$ → item $j$ is preferred
  • $\mu = 0.5$ → no preference

This converts the problem into a binary (or ternary) classification task, analogous to logistic regression.


Likelihood Maximization:

To fit the model, we maximize the likelihood (or minimize cross-entropy loss):

$$ P(i \succ j \mid \phi) = \frac{e^{r_\phi(i)}}{e^{r_\phi(i)} + e^{r_\phi(j)}} $$

Loss function:

$$ \mathcal{L}(\phi) = -\sum_{n=1}^N \Big[ \mu_n \log P(i_n \succ j_n \mid \phi) + (1 - \mu_n)\log P(j_n \succ i_n \mid \phi) \Big] $$

where $r_\phi(\cdot)$ may be a:

  • Linear function
  • Neural network
  • Any differentiable model of rewards

Algorithm: Learning Reward from Preferences

  1. Collect preference tuples $(i, j, \mu)$ from human comparisons.
  2. Define a parametric reward model $r_\phi$.
  3. Compute preference probabilities using the Bradley–Terry model.
  4. Optimize $\phi$ via gradient descent on $\mathcal{L}(\phi)$.
  5. Use the resulting $r_\phi$ as the learned human-aligned reward.
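
A minimal sketch of this algorithm, assuming the simplest possible reward model (one scalar per item, i.e., a lookup table) and plain gradient descent on the Bradley–Terry cross-entropy; the toy preference tuples are made up.

```python
import numpy as np

def fit_bt_rewards(pairs, num_items, lr=0.5, iters=500):
    """Fit per-item latent rewards from preference tuples (i, j, mu) by
    gradient descent on the Bradley-Terry cross-entropy loss.

    pairs: list of (i, j, mu) with mu in {1, 0, 0.5} as defined above.
    A neural network could replace the per-item table r.
    """
    r = np.zeros(num_items)
    for _ in range(iters):
        grad = np.zeros(num_items)
        for i, j, mu in pairs:
            p_ij = 1.0 / (1.0 + np.exp(-(r[i] - r[j])))   # P(i > j)
            # Gradient of the negative loss: (mu - p_ij) w.r.t. r_i, -(mu - p_ij) w.r.t. r_j.
            grad[i] += mu - p_ij
            grad[j] -= mu - p_ij
        r += lr * grad / len(pairs)
    return r - r.mean()   # rewards are identifiable only up to a constant shift

# Toy preferences: item 0 usually beats 1, item 1 usually beats 2.
data = [(0, 1, 1), (0, 1, 1), (1, 2, 1), (0, 2, 1), (1, 0, 0), (2, 1, 0.5)]
print(np.round(fit_bt_rewards(data, 3), 2))
```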

2. Extending to RL with States and Trajectories (REF 1:08:35–1:09:30)

In RL, each trajectory $\tau = (s_1, a_1, s_2, a_2, \dots, s_T)$ produces cumulative reward:

$$ R_\phi(\tau) = \sum_{t=1}^{T} r_\phi(s_t, a_t) $$

Preference-based Training:

If we have two trajectories, $\tau_i$ and $\tau_j$, humans can indicate which one is preferred.
We then assume:

$$ P(\tau_i \succ \tau_j \mid \phi) = \frac{e^{R_\phi(\tau_i)}}{e^{R_\phi(\tau_i)} + e^{R_\phi(\tau_j)}} $$

Training the reward model proceeds just as in the bandit case, but using trajectory-level comparisons.
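
A short sketch of the trajectory-level version, with a hypothetical reward model `r_phi` and two hand-made trajectories of (state, action) pairs:

```python
import numpy as np

def trajectory_return(traj, r_phi):
    """R_phi(tau): sum of per-step rewards over a trajectory of (s, a) pairs."""
    return sum(r_phi(s, a) for s, a in traj)

def pref_prob(tau_i, tau_j, r_phi):
    """Bradley-Terry preference over whole trajectories via their returns."""
    d = trajectory_return(tau_i, r_phi) - trajectory_return(tau_j, r_phi)
    return 1.0 / (1.0 + np.exp(-d))

# Hypothetical reward model and two short trajectories.
r_phi = lambda s, a: 0.3 * s - 0.1 * a
tau_a = [(1, 0), (2, 0), (3, 1)]
tau_b = [(0, 1), (1, 1), (1, 0)]
print(pref_prob(tau_a, tau_b, r_phi))   # > 0.5: tau_a preferred under this r_phi
```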


Reinforcement Learning from Human Preferences (Deep RLHF)

After training $r_\phi$, we can use any policy optimization method—e.g., Proximal Policy Optimization (PPO)—to learn an optimal policy with respect to the learned reward.

Pipeline Summary:

  1. Collect trajectory pairs and human preference labels.
  2. Train $r_\phi$ to predict preferences.
  3. Use $r_\phi$ as a reward signal in RL.
  4. Train the policy $\pi_\theta$ via PPO or another optimizer.

3. Case Study: Learning to Do a Backflip (REF 1:09:30–1:10:56)

Experiment: “Deep Reinforcement Learning from Human Preferences” (Christiano et al., 2017)

Objective: Train an agent in the MuJoCo simulator to perform a backflip.

Procedure:

  1. Show human labelers short video clips of two trajectories.
  2. Ask: “Which looks more like a successful backflip?”
  3. Collect $\approx 900$ preference labels.

Result:

  • The model learned a reward function purely from human comparisons.
  • The agent successfully learned to perform a backflip using orders of magnitude less data than deep Q-learning (which requires millions of samples).

Key Takeaway:
Preference-based feedback can guide RL systems to learn complex behaviors efficiently.


Illustration

Figure: Human feedback for preference-based RL


4. Practical Implementation in Coursework (REF 1:10:56–1:11:25)

In the corresponding assignment:

  • Students implement RLHF and Direct Preference Optimization (DPO).
  • Preference datasets are pre-provided (no manual labeling required).
  • Tasks involve understanding how reward models are trained from preferences and used in PPO-style updates.

5. Connection to RLHF for Large Language Models (REF 1:11:53–1:12:22)

Modern RLHF Pipeline (as used in ChatGPT):

  1. Supervised Fine-Tuning (SFT):
    Train the model on expert demonstrations (akin to behavior cloning).
  2. Reward Model Training:
    Gather human comparison data between model outputs (rankings, not just pairs).
  3. Policy Optimization:
    Use PPO to maximize the learned reward function while maintaining alignment with human intent.

Extended Ranking Framework:

While early RLHF used pairwise comparisons, modern implementations may use multi-way rankings (e.g., rank 1–4 responses) to provide richer feedback signals.


6. Meta-Reinforcement Learning Perspective (REF 1:12:50–1:13:20)

Distinction from Standard RL:

  • In standard RL, the agent learns one specific task (e.g., backflip).
  • In RLHF for LLMs, the reward model must generalize across many tasks—writing, reasoning, coding, summarization, etc.

Meta-RL Interpretation:

The reward model functions as a meta-objective:
It learns what humans value across tasks, enabling the agent to generalize to unseen prompts.


Example:

When given a new instruction like “Write a story about frogs,” the LLM’s policy must perform well according to human-aligned reward signals, even though it has never seen that specific task during training.


7. Conceptual Summary of RLHF Pipeline

Pipeline Diagram: RLHF Pipeline

  1. Collect human demonstrations → Supervised learning (SFT).
  2. Collect human comparisons → Train reward model $r_\phi$.
  3. Optimize policy → PPO to align model outputs with human values.
  4. Iterate → Update dataset and retrain periodically for robustness.

8. Concluding Insights (REF 1:13:20–1:13:47)

  • The 2017 paper on RL from Human Preferences laid the foundational groundwork for modern RLHF.
  • Subsequent advances, especially in language model alignment, have transformed it into a meta-RL problem.
  • The framework unifies imitation learning, preference learning, and policy optimization under a single principle:
    Use human judgments—directly or indirectly—as the optimization signal.

Core Equation Summary

$$ \begin{align} P(\tau_i \succ \tau_j \mid \phi) &= \frac{e^{R_\phi(\tau_i)}}{e^{R_\phi(\tau_i)} + e^{R_\phi(\tau_j)}} \tag{1} \\ R_\phi(\tau) &= \sum_{t=1}^{T} r_\phi(s_t, a_t) \tag{2} \\ \mathcal{L}(\phi) &= -\sum_{n=1}^N \Big[ \mu_n \log P(i_n \succ j_n \mid \phi) + (1-\mu_n)\log P(j_n \succ i_n \mid \phi) \Big] \tag{3} \end{align} $$

These equations underpin both preference-based RL and modern RLHF systems.


Key Takeaways

  1. Pairwise comparisons can be leveraged to recover latent reward models.
  2. Human preference data serves as a powerful alternative to hand-crafted rewards.
  3. RLHF applies this idea at scale to align large models with human values.
  4. The meta-RL viewpoint recognizes RLHF as learning a general-purpose, human-aligned reward model.

Appendixes.

How “trajectory distributions” reduce to discounted feature expectations

Setup

  • MDP with discount $\gamma \in [0,1)$, start-state distribution $d_0$, dynamics $P(s' \mid s,a)$, policy $\pi(a \mid s)$.
  • Feature maps $f(s)$ or $f(s,a)$.
  • A (finite or infinite) trajectory $\tau=(s_0,a_0,s_1,a_1,\dots)$ is drawn from the trajectory distribution $$ P_\pi(\tau)=d_0(s_0)\prod_{t\ge 0}\pi(a_t\mid s_t)P(s_{t+1}\mid s_t,a_t). $$
  • Trajectory-level (discounted) feature statistic: $$ \mu(\tau)=\sum_{t=0}^{\infty}\gamma^t f(s_t) \quad\text{or}\quad \mu(\tau)=\sum_{t=0}^{\infty}\gamma^t f(s_t,a_t). $$

Discounted occupancy measures

Define the (unnormalized) discounted occupancy of states and state-actions: $$ \begin{align} \rho_\pi(s) &\triangleq \sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s), \tag{1}\\ \rho_\pi(s,a) &\triangleq \sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s, a_t=a). \tag{2} \end{align} $$

(If you prefer the normalized convention $\tilde\rho_\pi=(1-\gamma)\rho_\pi$, multiply results by $\tfrac{1}{1-\gamma}$ accordingly.)

These occupancies satisfy the flow constraints: $$ \begin{align} \rho_\pi(s) &= d_0(s) + \gamma\sum_{s',a'} \rho_\pi(s',a')\,P(s\mid s',a'), \tag{3}\\ \rho_\pi(s,a) &= \pi(a\mid s)\,\rho_\pi(s). \tag{4} \end{align} $$

Equivalence: expectations over trajectories vs. occupancies

Take expectations of the trajectory-level statistic under the trajectory distribution $P_\pi$ and exchange sums:

  • State-feature case $$ \begin{align} \mathbb{E}_{\tau\sim P_\pi}\!\left[\sum_{t=0}^{\infty}\gamma^t f(s_t)\right] &= \sum_{t=0}^{\infty}\gamma^t \sum_s \Pr_\pi(s_t=s)\, f(s) \tag{5}\\ &= \sum_s \left(\sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s)\right) f(s) \tag{6}\\ &= \sum_s \rho_\pi(s)\, f(s). \tag{7} \end{align} $$

  • State–action feature case $$ \begin{align} \mathbb{E}_{\tau\sim P_\pi}\!\left[\sum_{t=0}^{\infty}\gamma^t f(s_t,a_t)\right] &= \sum_{t=0}^{\infty}\gamma^t \sum_{s,a} \Pr_\pi(s_t=s,a_t=a)\, f(s,a) \tag{8}\\ &= \sum_{s,a} \left(\sum_{t=0}^{\infty}\gamma^t \Pr_\pi(s_t=s,a_t=a)\right) f(s,a) \tag{9}\\ &= \sum_{s,a} \rho_\pi(s,a)\, f(s,a). \tag{10} \end{align} $$

Conclusion: Matching expected discounted trajectory features is equivalent to matching the corresponding discounted occupancy–weighted feature sums. This is the precise bridge between the “distribution over trajectories” view and the earlier “discounted accumulated features per state or state–action” view.
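
The equivalence can be checked numerically on a small randomly generated MDP (everything below is a made-up toy instance): a Monte Carlo estimate of the expected discounted feature sum should approximately agree with the occupancy-weighted sum computed from the flow recursion.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.9, 80                            # long horizon approximates the infinite sum
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))    # made-up dynamics P(s'|s,a), shape (S, A, S)
pi = rng.dirichlet(np.ones(A), size=S)        # made-up policy pi(a|s), shape (S, A)
d0 = np.array([1.0, 0.0, 0.0])
f = rng.normal(size=S)                        # state features f(s)

# (a) Monte Carlo estimate of E_tau[sum_t gamma^t f(s_t)].
def rollout_value(n=3000):
    total = 0.0
    for _ in range(n):
        s, disc = rng.choice(S, p=d0), 1.0
        for _ in range(T):
            total += disc * f[s]
            a = rng.choice(A, p=pi[s])
            s = rng.choice(S, p=P[s, a])
            disc *= gamma
    return total / n

# (b) Occupancy-based value: sum_s rho_pi(s) f(s), with rho from the flow recursion.
rho_t, rho = d0.copy(), d0.copy()
for t in range(1, T):
    rho_t = np.einsum('s,sa,sap->p', rho_t, pi, P)
    rho += (gamma ** t) * rho_t

print(rollout_value(), rho @ f)   # the two numbers should agree closely
```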

How this plugs into imitation learning objectives

  1. Apprenticeship / feature expectation matching (Abbeel–Ng style)
    With linear rewards $r_w(s)=w^\top f(s)$ or $r_w(s,a)=w^\top f(s,a)$ and $|w|\le 1$: $$ J(\pi;w)=\mathbb{E}\pi\Big[\sum{t\ge 0}\gamma^t r_w(\cdot)\Big] = \sum_s \rho_\pi(s), w^\top f(s) \quad\text{or}\quad \sum_{s,a} \rho_\pi(s,a), w^\top f(s,a). \tag{11} $$ Therefore, if $\sum_s \rho_\pi(s) f(s) \approx \sum_s \rho_{\pi_E}(s) f(s)$ (or the state–action analogue), then $J(\pi;w)\approx J(\pi_E;w)$ for all bounded $w$. Feature-expectation matching is thus occupancy matching in disguise.

  2. MaxEnt IRL constraints are the same thing
    MaxEnt IRL writes an optimization over trajectory distributions $P(\tau)$: $$ \max_{P} H(P)\quad\text{s.t.}\quad \sum_{\tau}P(\tau),\mu(\tau)=\hat\mu_E,;; \sum_{\tau}P(\tau)=1,;; P(\tau)\ge 0, \tag{12} $$ where $\mu(\tau)=\sum_t \gamma^t f(s_t)$ or $\sum_t \gamma^t f(s_t,a_t)$. Using the identities in (5)–(10), $$ \sum_{\tau}P(\tau),\mu(\tau) = \sum_s \rho_{P}(s), f(s) \quad\text{or}\quad \sum_{s,a} \rho_{P}(s,a), f(s,a). \tag{13} $$ Thus the trajectory-level constraint is exactly the discounted occupancy–weighted feature matching constraint. MaxEnt merely chooses, among all $P(\tau)$ that satisfy this linear constraint, the one with maximum entropy.

Notes on horizon and normalization

  • Finite horizon $T$: replace $\sum_{t=0}^{\infty}\gamma^t$ by $\sum_{t=0}^{T-1}\gamma^t$ and define $\rho_\pi^T$ accordingly. All identities hold with $T$-limited sums.
  • Normalized occupancy $\tilde\rho_\pi=(1-\gamma)\rho_\pi$ makes $\sum_s \tilde\rho_\pi(s)=1$ under stationary starts; then $$ \mathbb{E}_{\tau}\!\left[\sum_{t\ge 0}\gamma^t f(\cdot)\right] = \frac{1}{1-\gamma}\sum \tilde\rho_\pi(\cdot)\, f(\cdot). \tag{14} $$

Takeaway

  • The “distribution over trajectories” language is a convenient probabilistic wrapper.
  • The constraints it imposes (matching expected discounted features) are identical to the earlier imitation learning formulation based on discounted occupancy measures.
  • Linear rewards make this equivalence operational: matching feature expectations $\Longleftrightarrow$ matching returns for any bounded linear reward.

The following derivation walks through the use of Lagrange multipliers to enforce reward equivalence (rather than feature equivalence) from the ground up.

Goal:

We want to recover the reward function in the maximum entropy inverse reinforcement learning (MaxEnt IRL) framework by using Lagrange multipliers to match the expected rewards of the learned policy with the expert's expected rewards. The goal is to show how the trajectory distribution becomes exponential in terms of the reward function.

Step-by-Step Derivation (Correctly Applied)

1. Fundamental Setup: Matching Expert Rewards

The fundamental goal of MaxEnt IRL is to recover a reward function such that the expected reward under the learned policy matches the expected reward of the expert's behavior. Formally:

$$ \mathbb{E}_{\tau_E}\left[ \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right] = \mathbb{E}_{\tau_\pi}\left[ \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right], $$

Where:

  • $\tau_E$ is the expert’s trajectory,
  • $\tau_\pi$ is the trajectory generated by the learned policy $\pi$,
  • $r_\phi(s_t, a_t)$ is the reward function.

2. Maximum Entropy Formulation

MaxEnt IRL maximizes the entropy of the trajectory distribution while enforcing a constraint on the feature expectations. However, in the current context, we are interested in enforcing a reward equivalence constraint, not just feature equivalence.

Let the trajectory distribution of the expert be denoted by $P_E(\tau)$, and the trajectory distribution of the learned policy by $P_\pi(\tau)$.

2.1. The Objective: Maximize Entropy

We aim to maximize the entropy of the trajectory distribution $P_\pi(\tau)$, which is subject to the constraint that the expected reward matches that of the expert.

The entropy of the trajectory distribution is:

$$ H(P_\pi) = -\sum_{\tau} P_\pi(\tau) \log P_\pi(\tau) $$

The constraint is that the expected reward under the learned policy should match the expected reward under the expert's policy. In other words:

$$ \sum_{\tau} P_\pi(\tau) \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) = \sum_{\tau} P_E(\tau) \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t). $$

3. Setting Up the Lagrangian

We now introduce a Lagrange multiplier $\lambda$ to enforce the reward equivalence constraint. The Lagrangian for this optimization problem is:

$$ \mathcal{L}(P_\pi, \lambda) = -\sum_{\tau} P_\pi(\tau) \log P_\pi(\tau) + \lambda \left( \sum_{\tau} P_\pi(\tau) \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) - \sum_{\tau} P_E(\tau) \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right). $$

This Lagrangian captures both the entropy maximization and the reward equivalence constraint.

4. Taking the Derivative of the Lagrangian

Now, we take the derivative of $\mathcal{L}(P_\pi, \lambda)$ with respect to $P_\pi(\tau)$ and set it equal to zero in order to maximize the Lagrangian.

The derivative is:

$$ \frac{\partial \mathcal{L}}{\partial P_\pi(\tau)} = -\log P_\pi(\tau) - 1 + \lambda \left( \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right). $$

5. Solving for the Optimal Trajectory Distribution

Setting the derivative equal to zero gives us the equation:

$$ \log P_\pi(\tau) + 1 = \lambda \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t). $$

Solving for $P_\pi(\tau)$, we obtain:

$$ P_\pi(\tau) = \exp\left( \lambda \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) - 1 \right). $$

We can drop the constant $-1$ since it doesn’t affect the distribution (it will just change the normalizing constant). Therefore, we have:

$$ P_\pi(\tau) \propto \exp\left( \lambda \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right). $$

6. Interpreting the Result

This is the desired result! The trajectory distribution is now expressed as being exponential in terms of the reward function $r_\phi(s_t, a_t)$.

We can conclude that:

$$ P_\pi(\tau) \propto \exp\left( \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right), $$

which shows that the trajectory distribution is proportional to the exponential of the sum of discounted rewards.
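
For completeness, restoring the dropped constant as an explicit normalizer gives the familiar partition-function form (a restatement of the result above):

$$ P_\pi(\tau) = \frac{1}{Z(\lambda)} \exp\left( \lambda \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right), \qquad Z(\lambda) = \sum_{\tau} \exp\left( \lambda \sum_{t=0}^{T} \gamma^t r_\phi(s_t, a_t) \right). $$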

7. Final Remarks

  • The key idea is that the trajectory distribution is chosen to maximize entropy while ensuring that the expected rewards under the learned policy match the expected rewards of the expert's behavior.
  • The Lagrange multiplier $\lambda$ enforces this constraint, and the result is that the trajectory distribution is exponential in terms of the reward function, where the reward function is assumed to be linear in the feature function.

Conclusion

This derivation correctly shows how the trajectory distribution $P(\tau)$ in MaxEnt IRL becomes exponential with respect to the reward function, and it does so by applying the Lagrange multiplier method to enforce the reward equivalence constraint.
