This lecture transitions from imitation learning to the foundations of Reinforcement Learning from Human Feedback (RLHF) and its modern extensions like Direct Preference Optimization (DPO). The instructor begins with a recap of behavior cloning and DAGGER, connecting them to recent advancements where reinforcement learning (RL) methods enable systems such as ChatGPT to learn from human preferences.
Key points:
- Behavior Cloning (BC): Reduces RL to supervised learning by fitting a direct mapping from states to expert actions on a dataset of expert demonstrations.
- DAGGER (Dataset Aggregation): Improves on BC by iteratively querying the expert for the correct actions on states the learned policy actually visits and retraining on the aggregated dataset, which corrects the compounding errors caused by the mismatch between the expert's state distribution and the learner's (see the sketch after this list).
- RLHF: Starts from a supervised fine-tuned model and further optimizes it against a reward signal learned from human preference comparisons, aligning large models such as ChatGPT (see the preference-loss sketch below).
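
The contrast between BC and DAGGER is easiest to see in code. Below is a minimal sketch, not taken from the lecture: the `env` object with a classic gym-style `reset`/`step` interface, the `expert` callable returning the expert's action for a state, and the use of an sklearn classifier as the policy are all illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def rollout(env, policy, horizon):
    """Collect the states visited when `policy` controls the (hypothetical) environment."""
    states = []
    s = env.reset()
    for _ in range(horizon):
        states.append(s)
        a = policy(s)
        s, _reward, done, _info = env.step(a)  # assumed classic gym 4-tuple API
        if done:
            s = env.reset()
    return states


def behavior_cloning(states, actions):
    """BC: plain supervised learning from states to (discrete) expert actions."""
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    clf.fit(np.asarray(states), np.asarray(actions))
    return clf


def dagger(env, expert, init_states, init_actions, iters=5, horizon=200):
    """DAGGER: repeatedly label the learner's own visited states with the expert
    and retrain on the aggregated dataset, shrinking the distribution mismatch."""
    states, actions = list(init_states), list(init_actions)
    policy = behavior_cloning(states, actions)
    for _ in range(iters):
        # Roll out the *current learned* policy, not the expert...
        visited = rollout(env, lambda s: policy.predict([s])[0], horizon)
        # ...ask the expert what it would have done in those states...
        labels = [expert(s) for s in visited]
        # ...then aggregate and retrain.
        states += visited
        actions += labels
        policy = behavior_cloning(states, actions)
    return policy
```

The key difference is visible in the rollout line: BC only ever trains on states the expert visited, while DAGGER trains on states the learner itself reaches, which is where BC's compounding errors come from.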
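
To connect the RLHF bullet to the DPO extension mentioned above, here is a minimal sketch of the DPO preference loss (Rafailov et al., 2023), which optimizes the policy directly on human preference pairs without a separate reward model. The function and argument names are my own; each argument is assumed to be a tensor of summed per-response log-probabilities.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective on a batch of (chosen, rejected) response pairs.

    Log-probabilities come from the trainable policy and a frozen
    reference (the supervised fine-tuned model); beta controls how far
    the policy may drift from that reference."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: push the implicit reward of the
    # human-preferred response above that of the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Classic RLHF would instead train an explicit reward model on the same preference pairs and then optimize the policy against it with an RL algorithm such as PPO; DPO collapses those two stages into the single loss above.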