Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology, "instruction fine-tuning": learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case for RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

neubig / dispatch_openai_requests.py

Last active February 19, 2024 17:55

A simple script to get results from the OpenAI Asynchronous API

	# NOTE:
	# You can find an updated, more robust and feature-rich implementation
	# in Zeno Build
	# - Zeno Build: https://github.com/zeno-ml/zeno-build/
	# - Implementation: https://github.com/zeno-ml/zeno-build/blob/main/zeno_build/models/providers/openai_utils.py

	import openai
	import asyncio
	from typing import Any

akshaychawla / cscheduler.py

Last active December 22, 2023 11:56

Learning rate schedulers for PyTorch. (1) Cosine annealing with warmup and (2) Linear with warmup

	"""
	Useful learning rate schedulers
	Warmup
	CosineAnnealingLRWarmup
	"""
	import torch
	import math
	import functools

	def _cosine_decay_warmup(iteration, warmup_iterations, total_iterations):
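
The preview stops at the function signature. As a rough illustration of what a cosine-with-warmup schedule computes, here is a minimal sketch. It is not the gist's actual implementation: the name `cosine_warmup_multiplier` and the exact formula (linear ramp from 0 to 1, then half-cosine decay to 0) are assumptions.

```python
import math


def cosine_warmup_multiplier(iteration, warmup_iterations, total_iterations):
    """Hypothetical learning-rate multiplier in [0, 1].

    Assumed behaviour: a linear ramp from 0 to 1 over the warmup phase,
    followed by a half-cosine decay from 1 down to 0.
    """
    if iteration < warmup_iterations:
        # Linear warmup: 0 at iteration 0, approaching 1 at the end of warmup.
        return iteration / warmup_iterations
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_iterations - warmup_iterations)
    progress = min((iteration - warmup_iterations) / decay_steps, 1.0)
    # Half cosine: 1 at progress 0, 0 at progress 1.
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

A function of this shape could be bound with `functools.partial` and passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned value at each step.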

hosackm / colorlog.py

Created July 28, 2020 01:39

Colored logger module using Colorama

	import logging
	from colorama import init, Fore, Back

	init(autoreset=True)


	class ColorFormatter(logging.Formatter):
	    # Change this dictionary to suit your coloring needs!
	    COLORS = {
	        "WARNING": Fore.RED,

donglixp / rouge_perl_setup.sh

Last active May 10, 2022 04:39

Setup ROUGE-1.5.5

	# install XML::DOM plugin, instructions https://web.archive.org/web/20171107220839/www.summarizerman.com/post/42675198985/figuring-out-rouge
	cpan App::cpanminus
	cpanm XML::DOM
	# test fails with XML::Parser missing error, install it
	sudo apt-get install libexpat1-dev
	cpanm XML::Parser

	cd /mnt/data/
	git clone https://github.com/andersjo/pyrouge.git

thomwolf / top-k-top-p.py

Last active October 25, 2025 20:25

Sample the next token from a probability distribution using top-k and/or nucleus (top-p) sampling

	def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
	    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
	        Args:
	            logits: logits distribution shape (vocabulary size)
	            top_k >0: keep only top k tokens with highest probability (top-k filtering).
	            top_p >0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
	                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
	    """
	    assert logits.dim() == 1  # batch size 1 for now - could be updated for more but the code would be less clear
	    top_k = min(top_k, logits.size(-1))  # Safety check
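
The preview ends before the filtering logic. The docstring's description of top-k and nucleus filtering can be sketched in plain Python as follows. This is an illustrative version over a list of floats, not the gist's PyTorch code: the names `filter_logits` and `sample_index` are hypothetical, and ties at the top-k cut-off are kept rather than broken.

```python
import math
import random


def filter_logits(logits, top_k=0, top_p=0.0, filter_value=-math.inf):
    """Return a copy of `logits` with filtered-out entries set to `filter_value`."""
    filtered = list(logits)
    n = len(filtered)
    if top_k > 0:
        k = min(top_k, n)  # same safety check as in the preview
        # The k-th largest logit is the cut-off; smaller logits are removed.
        threshold = sorted(filtered, reverse=True)[k - 1]
        filtered = [x if x >= threshold else filter_value for x in filtered]
    if top_p > 0.0:
        # Softmax over the surviving logits (removed entries get probability 0).
        peak = max(filtered)
        exps = [math.exp(x - peak) for x in filtered]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Keep the most probable tokens until their cumulative mass reaches top_p.
        keep, cumulative = set(), 0.0
        for i in sorted(range(n), key=lambda j: probs[j], reverse=True):
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        filtered = [x if i in keep else filter_value for i, x in enumerate(filtered)]
    return filtered


def sample_index(filtered_logits, rng=random):
    """Sample a token index in proportion to softmax(filtered_logits)."""
    peak = max(filtered_logits)
    weights = [math.exp(x - peak) for x in filtered_logits]
    return rng.choices(range(len(filtered_logits)), weights=weights, k=1)[0]
```

For example, with token probabilities 0.5, 0.3, 0.15 and 0.05, `top_k=2` keeps the first two tokens, and `top_p=0.9` keeps the first three, since 0.5 + 0.3 falls short of 0.9 and adding 0.15 crosses it.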

HarshTrivedi / pad_packed_demo.py

Last active November 7, 2025 15:47 — forked from Tushar-N/pad_packed_demo.py

Minimal tutorial on packing (pack_padded_sequence) and unpacking (pad_packed_sequence) sequences in pytorch.

	import torch
	from torch import LongTensor
	from torch.nn import Embedding, LSTM
	from torch.autograd import Variable
	from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

	## We want to run LSTM on a batch of 3 character sequences ['long_str', 'tiny', 'medium']
	#
	# Step 1: Construct Vocabulary
	# Step 2: Load indexed data (list of instances, where each instance is list of character indices)

W4ngatang / download_glue_data.py

Last active October 21, 2025 02:22

Script for downloading data of the GLUE benchmark (gluebenchmark.com)

	''' Script for downloading all GLUE data.

	Note: for legal reasons, we are unable to host MRPC.
	You can either use the version hosted by the SentEval team, which is already tokenized,
	or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
	For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
	You should then rename and place specific files in a folder (see below for an example).

	mkdir MRPC
	cabextract MSRParaphraseCorpus.msi -d MRPC

aculich / wikipedia-infoboxes-in-pandas.ipynb

Last active March 21, 2024 04:51

How to extract Wikipedia infoboxes and wikitables using Pandas

simme / Install_tmux

Created October 19, 2011 07:55

Install and configure tmux on Mac OS X

	# First install tmux
	brew install tmux

	# For mouse support (for switching panes and windows)
	# Only needed if you are using Terminal.app (iTerm has mouse support)
	Install http://www.culater.net/software/SIMBL/SIMBL.php
	Then install https://bitheap.org/mouseterm/

	# More on mouse support http://floriancrouzat.net/2010/07/run-tmux-with-mouse-support-in-mac-os-x-terminal-app/

Haozhe Ji haozheji
