@krammnic
Created June 5, 2025 22:56
from transformers import AutoTokenizer

mname = "google/gemma-2-2b-it"  # or any checkpoint that ships a fast tokenizer
vocab_keep_items = 5000  # target vocabulary size for the retrained tokenizer

tokenizer = AutoTokenizer.from_pretrained(mname)
assert tokenizer.is_fast, "This only works for fast tokenizers."

# Keep a copy of the original tokenizer for comparison.
tokenizer.save_pretrained("big-tokenizer")

# train_new_from_iterator expects an iterator over batches (lists) of texts.
training_corpus = [
["""<bos><start_of_turn>user
hello there<end_of_turn>
<start_of_turn>model
hi<end_of_turn>
<start_of_turn>user
whatsup?<end_of_turn>
<start_of_turn>model
"""],
]
# Retrain the same tokenizer algorithm on the corpus with the reduced vocabulary.
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=vocab_keep_items)
new_tokenizer.save_pretrained("small-tokenizer")
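
A quick sanity check (a sketch, not part of the original gist; it assumes the two directories saved above): reload both tokenizers, compare their vocabulary sizes, and verify that the Gemma turn markers still tokenize cleanly.

# Illustrative sanity check (assumption: run after the snippet above has saved both tokenizers).
from transformers import AutoTokenizer

big = AutoTokenizer.from_pretrained("big-tokenizer")
small = AutoTokenizer.from_pretrained("small-tokenizer")

print(len(big), len(small))  # original vs. reduced (~5000) vocabulary size

sample = "<start_of_turn>user\nhello there<end_of_turn>"
# Special tokens should be carried over by train_new_from_iterator, so the turn
# markers should still come out as single tokens in both tokenizers.
print(big.tokenize(sample))
print(small.tokenize(sample))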