download-hf-dataset-as-jsonl.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/CliffordAnderson/239103807211a04e64b227506c7f5869/download-hf-dataset-as-jsonl.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "notebook_description"
},
"source": [
| "# Data Pipeline: HuggingFace → OpenAI Fine-tuning Format\n", | |
| "\n", | |
| "## Purpose of the Notebook\n", | |
| "This notebook downloads a dataset from HuggingFace that has Wikipedia stubs about women in religion and converts it into the format that OpenAI wants for fine-tuning: a typical ETL pipeline.\n", | |
| "\n", | |
| "The end goal is training a model to generate those Wikipedia infoboxes automatically." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "cell_2_desc" | |
| }, | |
| "source": [ | |
| "### Get the dependencies\n", | |
| "We begin by nstalling the HuggingFace datasets library. The `-U` flag because we always want to avoid a dependency conflict." | |
| ] | |
| }, | |
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "449eqpRFwARJ"
},
"outputs": [],
"source": [
"%pip install -U datasets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cell_3_desc"
},
"source": [
"### Load the training data\n",
| "Here we're pulling down the training split from someone's dataset on the Hub. The HuggingFace datasets library makes this easy compared to manually downloading and parsing files." | |
| ] | |
| }, | |
{
"cell_type": "code",
"source": [
"from datasets import load_dataset\n",
"\n",
"training_data = load_dataset(\n",
"    \"andersoncliffb/women-religion-stubs-with-infoboxes\",\n",
"    split=\"train\"\n",
")"
],
"metadata": {
"id": "ofrcTvWWwKHg"
},
"execution_count": null,
"outputs": []
},
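{
"cell_type": "markdown",
"metadata": {
"id": "peek_training_data"
},
"source": [
"### Peek at the data (optional)\n",
"A quick optional check before going further: print the split summary and the first record. The rest of the notebook assumes each record has `stub` and `infobox` fields; if the schema ever differs, adjust the field names accordingly."
]
},
{
"cell_type": "code",
"source": [
"# Optional sanity check: confirm the fields the converter below relies on.\n",
"print(training_data)\n",
"first_record = training_data[0]\n",
"print(list(first_record.keys()))\n",
"print(first_record[\"stub\"][:200])"
],
"metadata": {
"id": "peek_training_data_code"
},
"execution_count": null,
"outputs": []
},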
{
"cell_type": "markdown",
"metadata": {
"id": "cell_4_desc"
},
"source": [
"### Dump training data to JSONL\n",
"Converting to JSONL: one JSON object per line - clean, simple, and easy to stream. We save to `/content/` because that is the working directory on Google Colab."
]
},
{
"cell_type": "code",
"source": [
"training_data.to_json(\"/content/women-religion-stubs-with-infoboxes-training.jsonl\", orient=\"records\", lines=True)\n"
],
"metadata": {
"id": "p-3LepmqwTEP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cell_5_desc"
},
"source": [
"### Load validation data\n",
"Same deal as the training data, but this time we grab the validation split so we can evaluate the fine-tuned model on held-out examples."
]
},
{
"cell_type": "code",
"source": [
"validation_data = load_dataset(\n",
"    \"andersoncliffb/women-religion-stubs-with-infoboxes\",\n",
"    split=\"validation\"\n",
")"
],
"metadata": {
"id": "r3XEwW7yyG39"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cell_6_desc"
},
"source": [
"### More JSONL dumping\n",
"Rinse and repeat for the validation set. Keeping things consistent."
]
},
{
"cell_type": "code",
"source": [
"validation_data.to_json(\"/content/women-religion-stubs-with-infoboxes-validation.jsonl\", orient=\"records\", lines=True)"
],
"metadata": {
"id": "f5rtydNvyQFV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cell_7_desc"
},
"source": [
"### The meat - OpenAI format converter\n",
"This is where the real work happens. OpenAI's fine-tuning API expects each training example as a chat with system/user/assistant roles. We're basically teaching the model: \"Hey, when someone gives you a Wikipedia stub, spit out an infobox.\" The function below wraps each record in that conversational structure: the system message carries the instruction, the user message carries the stub, and the assistant message carries the infobox we want the model to learn to produce."
]
},
{
"cell_type": "code",
"source": [
"import json\n",
"\n",
"def create_openai_format(data):\n",
"    # Wrap each record as a chat: system instruction, user stub, assistant infobox.\n",
"    base_prompt_zero_shot = \"Create a Wikipedia infobox from the Wikipedia stub.\"\n",
"    formatted_data = []\n",
"    for d in data:\n",
"        messages = []\n",
"        messages.append({\"role\": \"system\", \"content\": base_prompt_zero_shot})\n",
"        messages.append({\"role\": \"user\", \"content\": d[\"stub\"]})\n",
"        messages.append({\"role\": \"assistant\", \"content\": d[\"infobox\"]})\n",
"        formatted_data.append({\"messages\": messages})\n",
"    return formatted_data"
],
"metadata": {
"id": "AyJx4S0QS8T7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cell_8_desc"
},
"source": [
"### Apply the transformation\n",
"Just running our converter function on both datasets. Nothing fancy here."
]
},
{
"cell_type": "code",
"source": [
"training_data_openai = create_openai_format(training_data)\n",
"validation_data_openai = create_openai_format(validation_data)"
],
"metadata": {
"id": "yJG88IFdTfid"
},
"execution_count": null,
"outputs": []
},
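{
"cell_type": "markdown",
"metadata": {
"id": "inspect_openai_format"
},
"source": [
"### Inspect one converted example (optional)\n",
"Before writing anything to disk, it can help to look at a single converted example and confirm it has the shape OpenAI's chat fine-tuning format expects: a `messages` list with one system, one user, and one assistant entry. This is just a quick optional peek at the lists we built above."
]
},
{
"cell_type": "code",
"source": [
"import json\n",
"\n",
"# Optional: report split sizes and pretty-print the first converted example.\n",
"print(len(training_data_openai), \"training examples\")\n",
"print(len(validation_data_openai), \"validation examples\")\n",
"print(json.dumps(training_data_openai[0], indent=2)[:1000])"
],
"metadata": {
"id": "inspect_openai_format_code"
},
"execution_count": null,
"outputs": []
},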
{
"cell_type": "markdown",
"metadata": {
"id": "cell_9_desc"
},
"source": [
"### Write OpenAI training data\n",
"Writing out the OpenAI-formatted training data as JSONL. We go through pandas here because a plain Python list has no `to_json` method of its own."
]
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"pd.DataFrame(training_data_openai).to_json(\n",
"    \"/content/women-religion-stubs-with-infoboxes-training-openai.jsonl\",\n",
"    orient=\"records\",\n",
"    lines=True\n",
")"
],
"metadata": {
"id": "BNTxlfXWTujy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cell_10_desc"
},
"source": [
"### Write OpenAI validation data\n",
"Same thing for validation data. Consistency is key."
]
},
{
"cell_type": "code",
"source": [
"pd.DataFrame(validation_data_openai).to_json(\n",
"    \"/content/women-religion-stubs-with-infoboxes-validation-openai.jsonl\",\n",
"    orient=\"records\",\n",
"    lines=True\n",
")"
],
"metadata": {
"id": "Vp8QtFNfTgXD"
},
"execution_count": null,
"outputs": []
},
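{
"cell_type": "markdown",
"metadata": {
"id": "validate_jsonl_files"
},
"source": [
"### Validate the JSONL files (optional)\n",
"A minimal optional check before uploading anything to OpenAI: re-read the two files written above and confirm that every line parses as JSON and contains a `messages` list. The paths are the same ones used in the export cells; change them if you rename the outputs."
]
},
{
"cell_type": "code",
"source": [
"import json\n",
"\n",
"# Optional: make sure each line of the exported files is valid JSONL\n",
"# with a \"messages\" key, which is what the fine-tuning API expects.\n",
"paths = [\n",
"    \"/content/women-religion-stubs-with-infoboxes-training-openai.jsonl\",\n",
"    \"/content/women-religion-stubs-with-infoboxes-validation-openai.jsonl\",\n",
"]\n",
"for path in paths:\n",
"    with open(path) as f:\n",
"        rows = [json.loads(line) for line in f]\n",
"    assert all(isinstance(row.get(\"messages\"), list) for row in rows)\n",
"    print(path, \"-\", len(rows), \"lines OK\")"
],
"metadata": {
"id": "validate_jsonl_files_code"
},
"execution_count": null,
"outputs": []
},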
{
"cell_type": "markdown",
"metadata": {
"id": "cell_11_desc"
},
"source": [
"### Download the goods\n",
"Moving the training file off Colab and onto your local machine."
]
},
{
"cell_type": "code",
"source": [
"from google.colab import files\n",
"\n",
"files.download(\"/content/women-religion-stubs-with-infoboxes-training-openai.jsonl\")"
],
"metadata": {
"id": "pO6XWcomxuOF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cell_12_desc"
},
"source": [
"### Download validation too\n",
"And the validation file. Now you've got everything you need for fine-tuning."
]
},
{
"cell_type": "code",
"source": [
"files.download(\"/content/women-religion-stubs-with-infoboxes-validation-openai.jsonl\")"
],
"metadata": {
"id": "YzLMUA_tTxyh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "summary"
},
"source": [
"## TL;DR\n",
"This is a pretty standard data prep pipeline. Load the dataset from the Hugging Face Hub → export it as plain JSONL → reformat it into the chat structure OpenAI's fine-tuning API requires → download everything. The end goal is fine-tuning a model that generates those Wikipedia infoboxes automatically."
]
}
]
}