Tutorial - how to run OSS LLM AI models locally
-1) Quick primer on HF models

Models can be grabbed from HuggingFace (HF). Most UIs have a text field where you paste an HF identifier such as:
TheBloke/CodeLlama-7B-Instruct-GGML
and the UI then downloads the model from HF automatically.
Here is an example link to a model:
https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGML

GGML, GPTQ and GGUF are different model quantization formats. Usually you want GGML. It can run on CPU, and by default the UIs will run on CPU only (slow). If you want to run on GPU, you need to move all the 'layers' to the GPU: type in 99 as the number of layers and the entire model will be placed on the GPU. You can split layers between CPU and GPU, but then it is much slower. (See the loading sketch at the end of this primer.)

GGUF is a newer format. It might not be supported everywhere yet and adds little over GGML, but the llama.cpp loader was quick to deprecate GGML, so you may have to use GGUF.

GPTQ is a format intended for running on GPU, supported by the Auto-GPTQ library or GPTQ-for-LLaMa.

7B, 13B, 34B, 70B are model sizes. 7B models are the simplest/dumbest but require the least resources. For coding and logic/reasoning I recommend 34B and up, which when quantized to 4-bit can fit on a 24GB VRAM GPU, the common amount of VRAM on AWS GPU instances. For reference, GPT-3.5 by OpenAI is 175B.

If you see different quant versions like q8, q4, q4_k_s, q4_k_m, q5_0, etc., q4_k_m is a rule-of-thumb best version. You could try q5_k_m if it fits on the GPU.
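As a concrete illustration, here is a minimal sketch (not from the original tutorial) of downloading one quantized GGML file from HF and loading it with llama-cpp-python with all layers on the GPU. The exact quantized filename inside TheBloke's repo is an assumption, so check the repo's "Files" tab, and note that newer llama-cpp-python versions only load GGUF.

# Minimal sketch: download one quantized file from an HF repo and load it with
# llama-cpp-python, offloading all layers to the GPU.
# The filename below is an assumption -- check the repo's file list for the exact
# q4_K_M filename, and make sure your llama-cpp-python build supports the format.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-7B-Instruct-GGML",
    filename="codellama-7b-instruct.ggmlv3.q4_K_M.bin",  # assumed filename
)

llm = Llama(model_path=model_path, n_gpu_layers=99, n_ctx=4096)  # 99 = all layers on GPU
out = llm("[INST] Write a Python function that reverses a string. [/INST]", max_tokens=256)
print(out["choices"][0]["text"])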

0) Primer on CodeLlama

https://ai.meta.com/blog/code-llama-large-language-model-coding/
https://github.com/facebookresearch/codellama

CodeLlama, recently released by Meta, seems to be the best local OSS coding model, roughly comparable to GPT-3.5. It is worse at following instructions and chained prompting, but seems to be around GPT-3.5 level in terms of coding ability. I recommend the 34B version, which fills up all 24GB of VRAM on the GPU.

There are three versions on HF:
- CodeLlama
  * this one behaves like Copilot: there is no instruct tuning, so don't ask it questions; use it only for code completion (see the short prompt sketch after this list). Give it part of a function and it will complete the whole function. Or give it a function signature, write an intro code comment describing what the function does (as if it were a real comment in a real repo), and let the model complete the function. Infilling code in between is also supported (with [FILL] [/FILL] tags), but I haven't used it.
  https://huggingface.co/TheBloke/CodeLlama-34B-GPTQ
  https://github.com/facebookresearch/codellama/blob/main/example_completion.py
  https://github.com/facebookresearch/codellama/blob/main/example_infilling.py
- CodeLlama Instruct
  * this is an Instruct model, which means it was tuned to follow instructions the way you are familiar with from ChatGPT. It still might not be as good for chain prompting. There is a proper instruct format (a prefix/postfix around the prompt) which you must use or you will get bad results; you can't just type in anything like in ChatGPT. Example use with PyTorch and Auto-GPTQ is further down.
  https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGML
- CodeLlama Python
  * while the previous ones seem to be good at most/all languages (the best evals were for C++, for example), Meta also made a dedicated one for Python.
  https://huggingface.co/TheBloke/CodeLlama-7B-Python-GGML
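To make the completion-vs-instruct distinction concrete, here is a rough, illustrative sketch of the two prompt styles (the instruct wrapper matches the [INST] format used in the Auto-GPTQ script further down):

# Rough sketch of the two prompting styles (illustrative only).

# Base CodeLlama: completion-style. No question, just code to continue --
# a descriptive comment plus a function signature is usually enough.
completion_prompt = (
    "# Return the n-th Fibonacci number using an iterative loop.\n"
    "def fib(n: int) -> int:\n"
)

# CodeLlama Instruct: the request must be wrapped in the instruct format.
instruct_prompt = "[INST] Write an iterative Python function that returns the n-th Fibonacci number. [/INST]"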

Method 1 - lmstudio (easiest?)
https://lmstudio.ai/
Seems to be an easy, packaged local solution. Unfortunately not OSS. I haven't tried it as it requires M1/M2 or Windows. Might have trouble with old model formats.
Example models to try:
TheBloke/Samantha-7B-GGML
TheBloke/CodeLlama-7B-GGML
TheBloke/CodeLlama-7B-Python-GGML
TheBloke/CodeLlama-7B-Instruct-GGML

Method 2 - text-generation-webui, otherwise known as Oobabooga
https://github.com/oobabooga/text-generation-webui
This UI/loader is often recommended but not necessarily the best one. It is quite bloated, especially the one-click .exe install, which downloads half the internet. I much prefer the docker/manual pip + PyTorch route, downloading only the dependencies I need without bloat like Whisper or Stable Diffusion.
Oobabooga supports an optional OpenAI-compatible REST API, so it can be used as a drop-in replacement in scripts.
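A minimal sketch of hitting that API from a script; the port and path are assumptions based on the OpenAI extension's defaults at the time of writing, so check the webui console output for the actual address:

# Minimal sketch: call text-generation-webui's OpenAI-compatible endpoint.
# Port and path are assumptions -- check the webui console output.
import requests

API_BASE = "http://localhost:5001/v1"  # assumed default for the openai extension

resp = requests.post(
    f"{API_BASE}/completions",
    json={
        "prompt": "[INST] Write a Python function that reverses a string. [/INST]",
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["text"])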
Note: to get GPU support in Oobabooga I had to do this before installing:
pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
Note that cuBLAS is the library you always want to use for CUDA GPUs.

Method 3 - KoboldCPP / KoboldAI
https://github.com/LostRuins/koboldcpp
One of the best ways to run GGML or GGUF models; it is in active development and support for GGUF was added immediately. The UI might be confusing for pure instruct use, as its main focus is storywriting.
Again, you need to select cuBLAS to use a CUDA GPU, and set 99 as the number of layers.
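If you want to drive KoboldCPP from a script instead of the UI, a rough sketch of calling its local HTTP API is below; the port, endpoint path, and response shape are assumptions based on the KoboldAI API it emulates, so verify them against the project README:

# Rough sketch: query a running KoboldCPP instance over its local HTTP API.
# Port (5001), endpoint path, and response shape are assumptions based on the
# KoboldAI API that KoboldCPP emulates -- verify against the project README.
import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={
        "prompt": "[INST] Write a Python function that reverses a string. [/INST]",
        "max_length": 256,
        "temperature": 0.7,
    },
    timeout=600,
)
print(resp.json()["results"][0]["text"])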

Method 4 - Raw PyTorch code with the Auto-GPTQ library
For GPTQ models I have been using raw PyTorch code without any UI, to have greater control.
Links to script examples can be found at the end of this file. But first, let's set up an AWS VM and install the dependencies.

AWS GPU VM setup and dependencies
Launch a g5.4xlarge EC2 instance in AWS. This gives you an A10G card with 24GB VRAM and a pre-installed PyTorch environment (read the intro message on the server after SSHing in to see how to activate it). It also has Conda installed.
I use Ubuntu as the distro.
You can always use the command 'nvidia-smi' to see current VRAM usage and other GPU status info.
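As a quick sanity check that PyTorch actually sees the GPU (a small sketch, not part of the original setup steps):

# Quick sanity check that PyTorch sees the A10G and how much VRAM it has.
import torch

print(torch.cuda.is_available())                               # should print True
print(torch.cuda.get_device_name(0))                           # e.g. "NVIDIA A10G"
print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB VRAM")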
Let's install dependencies for Auto-GPTQ to use GPTQ models from HF:
sudo apt update
sudo apt install git-lfs # for big files on HF
source activate pytorch
pip install transformers -U
pip install accelerate -U
pip install optimum
git clone https://github.com/PanQiWei/AutoGPTQ
pip install gekko
cd AutoGPTQ/
pip3 install .[triton]
Triton is a compiler for custom GPU kernels that Auto-GPTQ can use for faster inference on CUDA; you want this.
You might need to install these too:
pip install flash-attn==1.0.3.post0
pip install triton==2.0.0.dev20221202
pip install einops
pip install bitsandbytes
For context, bitsandbytes quantizes full-precision/half-float models to 4-bit integers on the fly, which lets you run models published as plain 'hf'/fp16 checkpoints (no suffix like GGML or GPTQ) on only 24GB VRAM.
Example models of this kind:
https://huggingface.co/NousResearch/CodeLlama-34b-hf
https://huggingface.co/TheBloke/CodeLlama-34B-fp16
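A minimal sketch of loading one of those fp16/'hf' checkpoints with on-the-fly 4-bit quantization through transformers and bitsandbytes; the parameter choices here are illustrative, not from the original tutorial:

# Minimal sketch: load an fp16/'hf' checkpoint with on-the-fly 4-bit quantization
# via bitsandbytes, so the 34B model fits in 24GB VRAM. Settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "NousResearch/CodeLlama-34b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the GPU automatically
)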
Alternative way to install triton:
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build time dependency
pip install -e .
After the dependencies are installed you can test Auto-GPTQ in PyTorch. Here is example code for CodeLlama Instruct, adapted from TheBloke's examples on HF:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import os

#model_name_or_path = "TheBloke/CodeLlama-34B-GPTQ" #tested
model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ" #tested
use_triton = True

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        #model_basename=model_basename,
        #revision="gptq-4bit-128g-actorder_True",
        use_safetensors=True,
        trust_remote_code=True,
        inject_fused_attention=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# read input file with the code
INPUT_CODE_PATH = "input_code.txt"
INPUT_CONTEXT_PATH = "input_context.txt"
input_code = ""
input_context = ""  # can be used for written context about the code like commit logs, PR descriptions, discussions, etc.
if os.path.exists(INPUT_CODE_PATH):
    with open(INPUT_CODE_PATH, 'r') as f:
        input_code = f.read()
if os.path.exists(INPUT_CONTEXT_PATH):
    with open(INPUT_CONTEXT_PATH, 'r') as f:
        input_context = f.read()

# build the CodeLlama Instruct prompt: the request must be wrapped in [INST] ... [/INST]
prompt_template = ""
PROMPT = f"""Tell me if there are any serious bugs or omissions in the below code.
"""
if input_context:
    prompt_template = f"""[INST] {PROMPT}Here is some context for the code:
{input_context}
And here is the code:
{input_code}
[/INST]"""
else:
    prompt_template = f"""[INST] {PROMPT}Here is the code:
{input_code}
[/INST]"""

print("\n\n*** Input prompt:\n")
print(prompt_template)

print("\n\n*** Generate:")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
#output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
#output = model.generate(inputs=input_ids, do_sample=True, temperature=0.70, max_length=512, top_p=0.95, repetition_penalty=1.15)
# note: max_length counts the prompt tokens too, so long inputs leave less room for the answer
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.70, max_length=4096)
print(tokenizer.decode(output[0]))
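To drive the script, put the code you want reviewed into input_code.txt (and optional background into input_context.txt) next to it and run it. A small hypothetical helper to create those input files, just for illustration:

# Hypothetical helper: write the code to review into the files the script above reads.
with open("input_code.txt", "w") as f:
    f.write("def divide(a, b):\n    return a / b  # no zero check\n")
with open("input_context.txt", "w") as f:
    f.write("This helper is called with user-supplied divisors.\n")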
More links:
https://huggingface.co/blog/codellama
Correction to the above: you need to use a PyTorch AMI in AWS, not plain Ubuntu, for example:
Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20231003
Ensure at least 128GB of disk, ideally 256GB or more.
New, simple command-line LLM loader:
https://ollama.ai/
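Ollama also exposes a small local REST API. A rough sketch of calling it from Python, assuming the server is running and a model such as codellama has been pulled; the endpoint and field names are based on the documented API at the time of writing and may change:

# Rough sketch: call a local Ollama server. Assumes `ollama serve` is running and
# `ollama pull codellama` has been done; endpoint and field names are assumptions
# based on the documented API and may differ between versions.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "codellama",
        "prompt": "Write a C function that reverses a string in place.",
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])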
Continue.dev VSCode plugin for LLMs (also private/air-gapped option with Ollama):
https://continue.dev/docs/walkthroughs/codellama
other UIs/loaders I forgot to mention:
Also, if you are looking for a cloud GPU provider cheaper than AWS for private use, look at RunPod; it's about 40 cents an hour for a 4090 24GB GPU.