Tutorial - how to run OSS LLM AI models locally
By @stanek-michal, August 31, 2023
-1) Quick primer on HF models
Models can be grabbed from HuggingFace (HF). Most UIs have a text field where you paste in an HF identifier such as:
TheBloke/CodeLlama-7B-Instruct-GGML
The UI then downloads the model from HF automatically.
Here is an example link to a model:
https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGML
GGML, GPTQ and GGUF are different model quantization formats. Usually you want GGML. It can run on CPU, and by default the UIs will run on CPU only (slow). If you want to run on GPU, you need to move all of the model's 'layers' to the GPU: type in 99 as the number of layers and the entire model will be put on the GPU. You can split layers between CPU and GPU, but that is much slower.
GGUF is a newer format and might not be supported everywhere yet. It adds no real value over GGML, but the llama.cpp loader was quick to deprecate GGML, so you might have to use GGUF.
GPTQ is a special format intended for running on GPU, supported by the Auto-GPTQ library or GPTQ-for-LLaMa.
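To make the 'layers on GPU' setting concrete, here is a minimal sketch using llama-cpp-python, one of the loaders the UIs below wrap. The model filename is just an example, and it assumes llama-cpp-python was built with cuBLAS (see the pip command under Method 2 below):

from llama_cpp import Llama

llm = Llama(
    model_path="./codellama-7b-instruct.Q4_K_M.gguf",  # example path; use whatever file you downloaded
    n_gpu_layers=99,   # offload all layers to the GPU; 0 means CPU only (slow)
    n_ctx=4096,        # context window size
)
out = llm("Write a C function that reverses a string in place.", max_tokens=256)
print(out["choices"][0]["text"])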
7B, 13B, 34B and 70B are model sizes. 7B models are the simplest/dumbest but require the least resources. For coding and logic/reasoning I recommend 34B and up, which when quantized to 4-bit can fit on a 24GB VRAM GPU, the common amount of VRAM on single-GPU cloud instances (e.g. AWS). For reference, GPT-3.5 by OpenAI is 175B.
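As a rough sanity check on why a 34B model at 4-bit fits in 24GB (weights only; the KV cache and runtime overhead come on top), a quick back-of-the-envelope calculation:

params = 34e9           # 34B parameters
bits_per_weight = 4.5   # roughly a 4-bit quant such as q4_k_m, plus some per-block metadata
print(f"{params * bits_per_weight / 8 / 1e9:.1f} GB")  # about 19 GB of weights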
If you see different quant versions like q8, q4, q4_k_s, q4_k_m, q5_0, etc., I recommend q4_k_m as a rule-of-thumb best version. You could try q5_k_m if it fits on the GPU.
0) Primer on CodeLlama
https://ai.meta.com/blog/code-llama-large-language-model-coding/
https://github.com/facebookresearch/codellama
CodeLlama, recently released by Meta, seems to be the best local OSS coding model. It is worse at following instructions and chained prompting, but seems to be around GPT-3.5 in terms of coding ability. I recommend the 34B version, which fills up all 24GB of VRAM on the GPU.
There are three versions on HF:
- CodeLlama
* This one behaves like Copilot: there is no instruct mode, so don't ask it questions; use it only for code completion. Give it part of a function and it will complete the whole function. Or give it a function signature, write an intro code comment about what the function does (as if it were a real comment in a real repo), and let the model complete the function (see the prompt sketch after this list). Infilling code in between is also supported with [FILL] [/FILL] tags, but I haven't used it.
https://huggingface.co/TheBloke/CodeLlama-34B-GPTQ
https://github.com/facebookresearch/codellama/blob/main/example_completion.py
https://github.com/facebookresearch/codellama/blob/main/example_infilling.py
- CodeLlama Instruct
* This is an Instruct model, which means it was tuned to follow instructions, as you are familiar with from ChatGPT. It still might not be as good for chain prompting. There is a proper instruct format (a prefix/postfix around your prompt) which you must use or you will get bad results; you can't just type in anything like in ChatGPT (see the prompt sketch after this list). Example use with PyTorch and Auto-GPTQ is down below.
https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGML
- CodeLlama Python
* While the previous ones seem to be good at most/all languages (the best evals were for C++, for example), Meta also made a dedicated one for Python.
https://huggingface.co/TheBloke/CodeLlama-7B-Python-GGML
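To make the two prompting styles concrete, here are two quick sketches (the example instructions are made up for illustration). For the base completion model, you just give it code to continue, e.g. a comment plus a function signature that the model should finish:

# completion-style prompt for the base model: no instructions, just code to continue
prompt = '''// Returns the number of set bits in x.
int popcount(unsigned int x) {
'''

For the Instruct model, wrap your request in the [INST] ... [/INST] markers (the same format the Auto-GPTQ script at the end of this file uses):

instruction = "Write a Python function that checks whether a string is a palindrome."
prompt = f"[INST] {instruction} [/INST]"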
Method 1 - lmstudio (easiest?)
https://lmstudio.ai/
Seems to be an easy, packaged local solution. Unfortunately it is not OSS. I haven't tried it, as it requires an M1/M2 Mac or Windows. It might have trouble with old model formats.
Example models to try:
TheBloke/Samantha-7B-GGML
TheBloke/CodeLlama-7B-GGML
TheBloke/CodeLlama-7B-Python-GGML
TheBloke/CodeLlama-7B-Instruct-GGML
Method 2 - text-generation-webui otherwise known as Oobabooga
https://github.com/oobabooga/text-generation-webui
This UI loader is often recommended but not necessarily the best one. It is quite bloated, especially the one-click .exe install, which downloads half the internet. I much prefer the docker/manual pip PyTorch setup, which downloads only the dependencies I need, without bloat like Whisper or Stable Diffusion.
Oobabooga supports an optional OpenAI-compatible REST API, so it can be used as a drop-in replacement in scripts.
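For example, a script could hit the local endpoint roughly like this (a sketch only: the exact port and route depend on your text-generation-webui version and how you enabled the API extension; localhost:5000 is assumed here):

import requests

resp = requests.post(
    "http://localhost:5000/v1/completions",  # adjust host/port/route to your setup
    json={"prompt": "[INST] Explain what a mutex is. [/INST]", "max_tokens": 200},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])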
Note: to get GPU support in Oobabooga I had to do this before installing:
pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
Note that CUBLAS is the library you always want to use for CUDA GPUs.
Method 3 - KoboldCPP / KoboldAI
https://github.com/LostRuins/koboldcpp
One of the best ways to run GGML or GGUF models; it's in active development and they added support for GGUF immediately. The UI might be difficult to understand for pure instruct use, since its main purpose is storywriting.
Again, you need to select CUBLAS to use a CUDA GPU, and put 99 as the number of layers.
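KoboldCPP also exposes a KoboldAI-style REST API once a model is loaded, so you can drive it from scripts instead of the storywriting UI. A minimal sketch, assuming the default localhost:5001 address (check the console output for the actual one):

import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={"prompt": "[INST] Summarize what a B-tree is. [/INST]", "max_length": 200},
    timeout=120,
)
print(resp.json()["results"][0]["text"])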
Method 4 - Raw PyTorch code with Auto-GPTQ library
For GPTQ models I have been using raw PyTorch code without any UI to have greater control.
Links to script examples can be found at the end of this file. But first, let's set up an AWS VM and install dependencies.
AWS GPU VM setup and dependencies
Launch a g5.4xlarge EC2 instance in AWS. This gives you an A10G card with 24GB VRAM and an already installed PyTorch environment (read the intro message on the server after SSHing in to see how to activate it). It also has Conda installed.
I use Ubuntu as the distro.
You can always use the command 'nvidia-smi' to see current VRAM usage and other GPU status info.
Let's install dependencies for Auto-GPTQ to use GPTQ models from HF:
sudo apt update
sudo apt install git-lfs # for big files on HF
source activate pytorch
pip install transformers -U
pip install accelerate -U
pip install optimum
git clone https://github.com/PanQiWei/AutoGPTQ
pip install gekko
cd AutoGPTQ/
pip3 install .[triton]
Triton is an acceleration library for CUDA; you want this.
You might need to install these too:
pip install flash-attn==1.0.3.post0
pip install triton==2.0.0.dev20221202
pip install einops
pip install bitsandbytes
For context, bitsandbytes is a library that auto-quantizes model weights to 8-bit or 4-bit on the fly, which allows you to run models that are listed without a quantization suffix like GGML or GPTQ (e.g. plain 'hf' or 'fp16' checkpoints) on only 24GB VRAM.
Example models of this kind:
https://huggingface.co/NousResearch/CodeLlama-34b-hf
https://huggingface.co/TheBloke/CodeLlama-34B-fp16
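A minimal sketch of loading such an unquantized checkpoint in 4-bit via bitsandbytes (assuming the transformers, accelerate and bitsandbytes versions installed above support BitsAndBytesConfig):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TheBloke/CodeLlama-34B-fp16"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers on the GPU
)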
Alternative way to install triton:
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build time dependency
pip install -e .
After the dependencies are installed you can test Auto-GPTQ in PyTorch. Here is example code for CodeLlama Instruct, adapted from TheBloke's examples on HF:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import os

#model_name_or_path = "TheBloke/CodeLlama-34B-GPTQ" #tested
model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ" #tested
use_triton = True

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        #model_basename=model_basename,
        #revision="gptq-4bit-128g-actorder_True",
        use_safetensors=True,
        trust_remote_code=True,
        inject_fused_attention=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# read input file with the code
INPUT_CODE_PATH = "input_code.txt"
INPUT_CONTEXT_PATH = "input_context.txt"

input_code = ""
input_context = ""  # can be used for written context about the code like commit logs, PR descriptions, discussions, etc.

if os.path.exists(INPUT_CODE_PATH):
    with open(INPUT_CODE_PATH, 'r') as f:
        input_code = f.read()

if os.path.exists(INPUT_CONTEXT_PATH):
    with open(INPUT_CONTEXT_PATH, 'r') as f:
        input_context = f.read()

prompt_template = ""
PROMPT = f"""Tell me if there are any serious bugs or omissions in the below code.
"""

if input_context:
    prompt_template = f"""[INST] {PROMPT}Here is some context for the code:
{input_context}
And here is the code:
{input_code}
[/INST]"""
else:
    prompt_template = f"""[INST] {PROMPT}Here is the code:
{input_code}
[/INST]"""

print("\n\n*** Input prompt:\n")
print(prompt_template)

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
#output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
#output = model.generate(inputs=input_ids, do_sample=True, temperature=0.70, max_length=512, top_p=0.95, repetition_penalty=1.15)
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.70, max_length=4096)
print(tokenizer.decode(output[0]))
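Note that tokenizer.decode(output[0]) echoes the whole prompt plus the answer. If you only want the newly generated part, you can slice off the prompt tokens first, for example:

print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))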
Update (Oct 5, 2023) - correction to the above: you need to use a PyTorch Deep Learning AMI in AWS, not a plain Ubuntu one, for example:
Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20231003

Ensure at least 128GB of disk, ideally 256GB or more.

Update: a new, simplistic command-line LLM loader:
https://ollama.ai/

Update: Continue.dev VSCode plugin for LLMs (also a private/air-gapped option with Ollama):
https://continue.dev/docs/walkthroughs/codellama
