@willccbb
willccbb / grpo_demo.py
Last active December 14, 2025 11:02
GRPO Llama-1B
# train_grpo.py
#
# See https://github.com/willccbb/verifiers for ongoing developments
#
"""
citation:
@misc{brown2025grpodemo,
title={Granular Format Rewards for Eliciting Mathematical Reasoning Capabilities in Small Language Models},
author={Brown, William},
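The citation title points at granular format rewards; a minimal sketch of what such reward functions can look like under GRPO is below. The <reasoning>/<answer> tags and the partial-credit weights are illustrative assumptions, not necessarily the gist's exact scheme.

# Minimal sketch of granular format rewards for GRPO (tags and weights are illustrative assumptions).
import re

def format_reward(completion: str) -> float:
    """Give partial credit for each piece of the expected <reasoning>/<answer> scaffold."""
    score = 0.0
    if re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL):
        score += 0.25
    if re.search(r"<answer>.*?</answer>", completion, re.DOTALL):
        score += 0.25
    # Extra credit when the completion is exactly the scaffold and nothing else.
    if re.fullmatch(r"\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*", completion, re.DOTALL):
        score += 0.5
    return score

def correctness_reward(completion: str, answer: str) -> float:
    """Reward an extracted answer that matches the reference; assumed scheme, not the gist's exact one."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 2.0 if m and m.group(1).strip() == answer.strip() else 0.0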
@vgel
vgel / r1.py
Last active August 14, 2025 13:13
script to run deepseek-r1 with a min-thinking-tokens parameter, replacing </think> with a random continuation string to extend the model's chain of thought
import argparse
import random
import sys
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
import torch
parser = argparse.ArgumentParser()
parser.add_argument("question", type=str)
parser.add_argument(
    "--min-thinking-tokens", type=int, default=512
)  # flag name and default assumed from the description above; the full gist defines more arguments
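The description above is the whole trick: watch for the token that closes the chain of thought and, if the model has not "thought" long enough yet, splice in a continuation string and keep sampling. A minimal sketch of that loop follows, reusing the imports above; the threshold, the continuation phrases, and the single-token "</think>" assumption are illustrative, not the gist's exact values.

# Minimal sketch of the min-thinking-tokens sampling loop (illustrative only).
# It recomputes the full forward pass each step (no KV cache) and assumes
# "</think>" encodes to a single token, as it does for the R1 tokenizers.
def generate_with_min_thinking(model, tokenizer, prompt, min_thinking_tokens=512, max_new_tokens=4096):
    replacements = ["\nWait, let me double-check that.", "\nHmm, thinking about it more,"]  # assumed phrases
    end_think_id = tokenizer.encode("</think>", add_special_tokens=False)[0]
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    generated, thinking_tokens = [], 0
    while len(generated) < max_new_tokens:
        logits = model(input_ids).logits[0, -1]
        next_id = torch.multinomial(torch.softmax(logits.float(), dim=-1), 1).item()
        if next_id == end_think_id and thinking_tokens < min_thinking_tokens:
            # The model tried to stop thinking too early: splice in a continuation string instead.
            cont_ids = tokenizer.encode(random.choice(replacements), add_special_tokens=False)
            generated += cont_ids
            thinking_tokens += len(cont_ids)
            next_tensor = torch.tensor([cont_ids], device=model.device)
        else:
            generated.append(next_id)
            thinking_tokens += 1
            if next_id == tokenizer.eos_token_id:
                break
            next_tensor = torch.tensor([[next_id]], device=model.device)
        input_ids = torch.cat([input_ids, next_tensor], dim=1)
    return tokenizer.decode(generated)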
@egeozcan
egeozcan / index.html
Created October 18, 2024 08:40
image viewer
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Fullscreen Image Viewer</title>
<style>
body {
font-family: Arial, sans-serif;
}
@scpedicini
scpedicini / transcribe.py
Last active September 23, 2025 21:15
Python Dictation Transcription Application
# This script will transcribe an audio file (mp3, wav, etc.) to text and then clean the text using a local LLM model via Ollama. Technically, this script will work with any LLM that supports the standard OpenAI bindings with minor adjustments.
# GETTING STARTED:
# 1. Install required python packages (pip install openai python-dotenv)
# 2. Git clone a copy of ggerganov/whisper (https://github.com/ggerganov/whisper.cpp)
# 3. Build the whisper binary (see the whisper.cpp README for instructions)
# 4. Download one of the whisper models (large-v2 is the most accurate across languages, though the base model works reasonably well for English).
# 5. Install ffmpeg (brew install ffmpeg on macOS, apt-get install ffmpeg)
# 6. Install ollama (https://ollama.com/download)
# 7. Download an LLM model (https://ollama.com/library)
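Putting the steps together, a minimal sketch of the pipeline looks like the following: convert the audio with ffmpeg, transcribe it with the whisper.cpp binary, then clean the transcript through Ollama's OpenAI-compatible endpoint. The binary path, model paths, and model name are assumptions; adjust them to your setup.

# Minimal sketch of the transcribe-then-clean pipeline (paths and model names are assumptions).
import subprocess
from openai import OpenAI

WHISPER_BIN = "./whisper.cpp/main"                        # assumed path to the built whisper.cpp binary
WHISPER_MODEL = "./whisper.cpp/models/ggml-base.en.bin"   # assumed model path

def transcribe(audio_path: str) -> str:
    # whisper.cpp expects 16 kHz mono WAV input.
    subprocess.run(["ffmpeg", "-y", "-i", audio_path, "-ar", "16000", "-ac", "1", "audio.wav"], check=True)
    subprocess.run([WHISPER_BIN, "-m", WHISPER_MODEL, "-f", "audio.wav", "-otxt", "-of", "transcript"], check=True)
    with open("transcript.txt") as f:
        return f.read()

def clean(transcript: str) -> str:
    # Ollama exposes an OpenAI-compatible endpoint; any model pulled with `ollama pull` works here.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = client.chat.completions.create(
        model="llama3",  # assumed model name
        messages=[
            {"role": "system", "content": "Clean up this raw dictation: fix punctuation and obvious transcription errors without changing the meaning."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(clean(transcribe("memo.mp3")))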
@adamsmith
adamsmith / gist:2a22b08d3d4a11fb9fe06531aea4d67c
Created December 23, 2023 01:07
voice-memo transcript → organized markdown text, using LLMs
There are two prompts that chain together. The first prompt does most of the work, and the second prompt organizes the sections. I found that, because of how LLMs write, I couldn't get a single prompt to avoid jumping back and forth between topics.
Prompt 1, which takes as input a raw transcript and generates a structured-text version...
"""# Instructions
A transcript is provided below of a voice memo I recorded as a "note to self". please extract all the points made or thoughts described, and put them in bullet-point form. use nested bullet points to indicate structure, e.g. a top-level bullet for each topic area and sub-bullets underneath. use multi-level nesting as appropriate to organize the thinking logically. use markdown formatting with `*` instead of `-` for bullet points.
DO NOT OMIT ANY POINTS MADE. This is not a summarization task — your only goal is to structure the thoughts there so they are logically organized and easy to read. Be concise because the reader is busy, but again DO NOT omit any
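A minimal sketch of the chaining is below; PROMPT_1 stands for the instruction block quoted above, PROMPT_2 (the section-organizing pass) is paraphrased since it is not shown here, and the model name is an assumption.

# Minimal sketch of the two-prompt chain; PROMPT_2's wording and the model name are assumptions.
from openai import OpenAI

client = OpenAI()
PROMPT_1 = "..."  # the "# Instructions" block quoted above
PROMPT_2 = "Group the bullet points below under section headers, preserving every point."  # assumed wording

def organize_memo(raw_transcript: str) -> str:
    def ask(prompt: str, body: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",  # assumed model
            messages=[{"role": "user", "content": prompt + "\n\n" + body}],
        )
        return resp.choices[0].message.content
    bullets = ask(PROMPT_1, raw_transcript)   # pass 1: transcript -> nested bullet points
    return ask(PROMPT_2, bullets)             # pass 2: bullets -> organized sections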
@ChrisHayduk
ChrisHayduk / merge_qlora_with_quantized_model.py
Last active September 27, 2025 08:22
Merging QLoRA weights with quantized model
"""
The code below combines approaches published by both @eugene-yh and @jinyongyoo on Github.
Thanks for the contributions guys!
"""
import torch
import peft
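The preview cuts off after the imports; the full gist dequantizes the 4-bit base weights and folds the LoRA deltas into them by hand. A simpler sketch that reaches the same end state, by reloading the base model in fp16 and using peft's merge_and_unload, is below; the model and adapter paths are placeholders.

# Simpler sketch of the same end result: reload the base model at full precision,
# apply the LoRA adapter, and merge it in. Paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-path", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "qlora-adapter-path")
merged = model.merge_and_unload()  # folds B @ A * scaling into the base weights
merged.save_pretrained("merged-model-path")
AutoTokenizer.from_pretrained("base-model-path").save_pretrained("merged-model-path")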
@adrienbrault
adrienbrault / llama2-mac-gpu.sh
Last active April 8, 2025 13:49
Run Llama-2-13B-chat locally on your M1/M2 Mac with GPU inference. Uses 10GB RAM. UPDATE: see https://twitter.com/simonw/status/1691495807319674880?s=20
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it
make clean
LLAMA_METAL=1 make
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
@younesbelkada
younesbelkada / finetune_llama_v2.py
Last active July 1, 2025 23:14
Fine tune Llama v2 models on Guanaco Dataset
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
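The preview above stops inside the license header; as a pointer to what the script does, here is a compressed sketch of a QLoRA fine-tune on the Guanaco dataset with trl's SFTTrainer, assuming the trl/peft APIs of that era. The model id, dataset id, and hyperparameters are assumptions, and arguments like dataset_text_field have since moved into SFTConfig in newer trl releases.

# Compressed sketch of a QLoRA fine-tune on the Guanaco dataset (ids and hyperparameters are assumptions).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed; any Llama-2 checkpoint works
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM"),
    dataset_text_field="text",          # older trl API; newer releases take this via SFTConfig
    tokenizer=tokenizer,
    max_seq_length=512,
    args=TrainingArguments(output_dir="./results", per_device_train_batch_size=4, max_steps=500),
)
trainer.train()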

Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology, "instruction fine-tuning", learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument that not only supports the case of RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

@crucialfelix
crucialfelix / 2023-01-13
Last active April 10, 2025 12:08
Get Chrome history for a single day and create a markdown file summarizing browsing activity
# [[2023-01-13]] log
## URLs
- **www.amazon.de**
  - [Prime Video - Video on Demand - Online-Videothek: Filme und Serien online ansehen oder als Einzelabruf online leihen oder kaufen](https://www.amazon.de/Amazon-Video/b/?node=3010075031&ref=atv_surl_aiv&redirectToCMP=1) /Amazon-Video/b/
- **www.youtube.com**
  - [Parwal vs Kundru | कुंदरु या परवल | Pointed Gourd Vs Ivy Gourd | Everyday Life # 267 - YouTube](https://www.youtube.com/watch?v=6v4XD9T9-Rg&themeRefresh=1) /watch
  - [YouTube](https://www.youtube.com/) /
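For reference, a minimal sketch of how such a log can be generated is below: Chrome keeps browsing history in a SQLite database, so the script can select one day's rows from the urls table and group them by domain into markdown. The profile path is the macOS default and an assumption; timezone handling is omitted (Chrome stores times in UTC, as microseconds since 1601-01-01).

# Minimal sketch: query Chrome's History SQLite database for one day and emit a markdown log.
# The profile path is an assumption (macOS default); timestamps are microseconds since 1601-01-01 UTC.
import datetime
import os
import shutil
import sqlite3
from collections import defaultdict
from urllib.parse import urlparse

HISTORY = "~/Library/Application Support/Google/Chrome/Default/History"

def day_log(day: datetime.date) -> str:
    # Work on a copy: Chrome keeps the live database locked while it is running.
    db = shutil.copy(os.path.expanduser(HISTORY), "/tmp/History")
    epoch = datetime.datetime(1601, 1, 1)
    start = int((datetime.datetime.combine(day, datetime.time.min) - epoch).total_seconds() * 1_000_000)
    end = start + 86_400 * 1_000_000
    rows = sqlite3.connect(db).execute(
        "SELECT url, title FROM urls WHERE last_visit_time BETWEEN ? AND ? ORDER BY last_visit_time",
        (start, end),
    ).fetchall()
    by_domain = defaultdict(list)
    for url, title in rows:
        by_domain[urlparse(url).netloc].append((title, url))
    lines = [f"# [[{day}]] log", "## URLs"]
    for domain, pages in by_domain.items():
        lines.append(f"- **{domain}**")
        lines += [f"  - [{title}]({url}) {urlparse(url).path}" for title, url in pages]
    return "\n".join(lines)

print(day_log(datetime.date(2023, 1, 13)))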