Tutorial - how to run OSS LLM AI models locally
By @stanek-michal, August 31, 2023
-1) Quick primer on HF models
Models can be grabbed from HuggingFace (HF). Most UIs have a text field where you paste in an HF identifier such as:
TheBloke/CodeLlama-7B-Instruct-GGML
The UI then downloads the model from HF automatically.
Here is an example link to a model:
https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGML
GGML, GPTQ and GGUF are different model quantization formats. Usually you want GGML. It can run on CPU, and by default the UIs will run on CPU only (slow). If you want to run on GPU, you need to move all of the model's 'layers' to the GPU: type in 99 as the number of layers and the entire model will be put on the GPU. You can split layers between CPU and GPU, but that is much slower.
GGUF is a newer format and might not be supported everywhere yet. It adds no real value over GGML, but the llama.cpp loader was quick to deprecate GGML, so you might have to use GGUF.
GPTQ is a special format intended for running on GPU, supported by the Auto-GPTQ library or GPTQ-for-LLaMa.
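To make the 'layers on GPU' setting concrete, here is a minimal sketch using llama-cpp-python, one of the loaders the UIs below wrap. The model filename is just an example, and it assumes llama-cpp-python was built with cuBLAS (see the pip command under Method 2 below):

from llama_cpp import Llama

llm = Llama(
    model_path="./codellama-7b-instruct.Q4_K_M.gguf",  # example path; use whatever file you downloaded
    n_gpu_layers=99,   # offload all layers to the GPU; 0 means CPU only (slow)
    n_ctx=4096,        # context window size
)
out = llm("Write a C function that reverses a string in place.", max_tokens=256)
print(out["choices"][0]["text"])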
7B, 13B, 34B and 70B are model sizes. 7B models are the simplest/dumbest but require the least resources. For coding and logic/reasoning I recommend 34B and up, which when quantized to 4-bit can fit on a 24GB VRAM GPU, the common amount of VRAM on single-GPU cloud instances (e.g. AWS). For reference, GPT-3.5 by OpenAI is 175B.
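As a rough sanity check on why a 34B model at 4-bit fits in 24GB (weights only; the KV cache and runtime overhead come on top), a quick back-of-the-envelope calculation:

params = 34e9           # 34B parameters
bits_per_weight = 4.5   # roughly a 4-bit quant such as q4_k_m, plus some per-block metadata
print(f"{params * bits_per_weight / 8 / 1e9:.1f} GB")  # about 19 GB of weights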
If you see different quant versions like q8, q4, q4_k_s, q4_k_m, q5_0, etc., I recommend q4_k_m as a rule-of-thumb best version. You could try q5_k_m if it fits on the GPU.
0) Primer on CodeLlama
https://ai.meta.com/blog/code-llama-large-language-model-coding/
https://github.com/facebookresearch/codellama
CodeLlama, recently released by Meta, seems to be the best local OSS coding model. It is worse at following instructions and chained prompting, but seems to be around GPT-3.5 in terms of coding ability. I recommend the 34B version, which fills up all 24GB of VRAM on the GPU.
There are three versions on HF:
- CodeLlama
* This one behaves like Copilot: there is no instruct mode, so don't ask it questions; use it only for code completion. Give it part of a function and it will complete the whole function. Or give it a function signature, write an intro code comment about what the function does (as if it were a real comment in a real repo), and let the model complete the function (see the prompt sketch after this list). Infilling code in between is also supported with [FILL] [/FILL] tags, but I haven't used it.
https://huggingface.co/TheBloke/CodeLlama-34B-GPTQ
https://github.com/facebookresearch/codellama/blob/main/example_completion.py
https://github.com/facebookresearch/codellama/blob/main/example_infilling.py
- CodeLlama Instruct
* This is an Instruct model, which means it was tuned to follow instructions, as you are familiar with from ChatGPT. It still might not be as good for chain prompting. There is a proper instruct format (a prefix/postfix around your prompt) which you must use or you will get bad results; you can't just type in anything like in ChatGPT (see the prompt sketch after this list). Example use with PyTorch and Auto-GPTQ is down below.
https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGML
- CodeLlama Python
* While the previous ones seem to be good at most/all languages (the best evals were for C++, for example), Meta also made a dedicated one for Python.
https://huggingface.co/TheBloke/CodeLlama-7B-Python-GGML
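To make the two prompting styles concrete, here are two quick sketches (the example instructions are made up for illustration). For the base completion model, you just give it code to continue, e.g. a comment plus a function signature that the model should finish:

# completion-style prompt for the base model: no instructions, just code to continue
prompt = '''// Returns the number of set bits in x.
int popcount(unsigned int x) {
'''

For the Instruct model, wrap your request in the [INST] ... [/INST] markers (the same format the Auto-GPTQ script at the end of this file uses):

instruction = "Write a Python function that checks whether a string is a palindrome."
prompt = f"[INST] {instruction} [/INST]"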
Method 1 - lmstudio (easiest?)
https://lmstudio.ai/
Seems to be an easy, packaged local solution. Unfortunately it is not OSS. I haven't tried it, as it requires an M1/M2 Mac or Windows. It might have trouble with old model formats.
Example models to try:
TheBloke/Samantha-7B-GGML
TheBloke/CodeLlama-7B-GGML
TheBloke/CodeLlama-7B-Python-GGML
TheBloke/CodeLlama-7B-Instruct-GGML
Method 2 - text-generation-webui otherwise known as Oobabooga
https://github.com/oobabooga/text-generation-webui
This UI loader is often recommended but not necessarily the best one. It is quite bloated, especially the one-click .exe install, which downloads half the internet. I much prefer the docker/manual pip PyTorch setup, which downloads only the dependencies I need, without bloat like Whisper or Stable Diffusion.
Oobabooga supports an optional OpenAI-compatible REST API, so it can be used as a drop-in replacement in scripts.
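For example, a script could hit the local endpoint roughly like this (a sketch only: the exact port and route depend on your text-generation-webui version and how you enabled the API extension; localhost:5000 is assumed here):

import requests

resp = requests.post(
    "http://localhost:5000/v1/completions",  # adjust host/port/route to your setup
    json={"prompt": "[INST] Explain what a mutex is. [/INST]", "max_tokens": 200},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])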
Note: to get GPU support in Oobabooga I had to do this before installing:
pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
Note that CUBLAS is the library you always want to use for CUDA GPUs.
Method 3 - KoboldCPP / KoboldAI
https://github.com/LostRuins/koboldcpp
One of the best ways to run GGML or GGUF models; it's in active development and they added support for GGUF immediately. The UI might be difficult to understand for pure instruct use, since its main purpose is storywriting.
Again, you need to select CUBLAS to use a CUDA GPU, and put 99 as the number of layers.
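KoboldCPP also exposes a KoboldAI-style REST API once a model is loaded, so you can drive it from scripts instead of the storywriting UI. A minimal sketch, assuming the default localhost:5001 address (check the console output for the actual one):

import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={"prompt": "[INST] Summarize what a B-tree is. [/INST]", "max_length": 200},
    timeout=120,
)
print(resp.json()["results"][0]["text"])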
Method 4 - Raw PyTorch code with Auto-GPTQ library
For GPTQ models I have been using raw PyTorch code without any UI to have greater control.
Links to script examples can be found at the end of this file. But first, let's set up an AWS VM and install dependencies.
AWS GPU VM setup and dependencies
Launch a g5.4xlarge EC2 instance in AWS. This gives you an A10G card with 24GB VRAM and an already installed PyTorch environment (read the intro message on the server after SSHing in to see how to activate it). It also has Conda installed.
I use Ubuntu as the distro.
You can always use the command 'nvidia-smi' to see current VRAM usage and other GPU status info.
Let's install dependencies for Auto-GPTQ to use GPTQ models from HF:
sudo apt update
sudo apt install git-lfs # for big files on HF
source activate pytorch
pip install transformers -U
pip install accelerate -U
pip install optimum
git clone https://github.com/PanQiWei/AutoGPTQ
pip install gekko
cd AutoGPTQ/
pip3 install .[triton]
Triton is an acceleration library for CUDA; you want this.
You might need to install these too:
pip install flash-attn==1.0.3.post0
pip install triton==2.0.0.dev20221202
pip install einops
pip install bitsandbytes
For context, bitsandbytes is a library that auto-quantizes model weights to 8-bit or 4-bit on the fly, which allows you to run models that are listed without a quantization suffix like GGML or GPTQ (e.g. plain 'hf' or 'fp16' checkpoints) on only 24GB VRAM.
Example models of this kind:
https://huggingface.co/NousResearch/CodeLlama-34b-hf
https://huggingface.co/TheBloke/CodeLlama-34B-fp16
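A minimal sketch of loading such an unquantized checkpoint in 4-bit via bitsandbytes (assuming the transformers, accelerate and bitsandbytes versions installed above support BitsAndBytesConfig):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TheBloke/CodeLlama-34B-fp16"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers on the GPU
)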
Alternative way to install triton:
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build time dependency
pip install -e .
After the dependencies are installed you can test Auto-GPTQ in PyTorch. Here is example code for CodeLlama Instruct, adapted from TheBloke's examples on HF:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import os

#model_name_or_path = "TheBloke/CodeLlama-34B-GPTQ" #tested
model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ" #tested
use_triton = True

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        #model_basename=model_basename,
        #revision="gptq-4bit-128g-actorder_True",
        use_safetensors=True,
        trust_remote_code=True,
        inject_fused_attention=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# read input file with the code
INPUT_CODE_PATH = "input_code.txt"
INPUT_CONTEXT_PATH = "input_context.txt"

input_code = ""
input_context = ""  # can be used for written context about the code like commit logs, PR descriptions, discussions, etc.

if os.path.exists(INPUT_CODE_PATH):
    with open(INPUT_CODE_PATH, 'r') as f:
        input_code = f.read()

if os.path.exists(INPUT_CONTEXT_PATH):
    with open(INPUT_CONTEXT_PATH, 'r') as f:
        input_context = f.read()

prompt_template = ""
PROMPT = f"""Tell me if there are any serious bugs or omissions in the below code.
"""

if input_context:
    prompt_template = f"""[INST] {PROMPT}Here is some context for the code:
{input_context}
And here is the code:
{input_code}
[/INST]"""
else:
    prompt_template = f"""[INST] {PROMPT}Here is the code:
{input_code}
[/INST]"""

print("\n\n*** Input prompt:\n")
print(prompt_template)

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
#output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
#output = model.generate(inputs=input_ids, do_sample=True, temperature=0.70, max_length=512, top_p=0.95, repetition_penalty=1.15)
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.70, max_length=4096)
print(tokenizer.decode(output[0]))
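Note that tokenizer.decode(output[0]) echoes the whole prompt plus the answer. If you only want the newly generated part, you can slice off the prompt tokens first, for example:

print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))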
Update (Oct 5, 2023) - correction to the above: you need to use a PyTorch Deep Learning AMI in AWS, not a plain Ubuntu one, for example:
Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20231003

Ensure at least 128GB of disk, ideally 256GB or more.

Update: a new, simplistic command-line LLM loader:
https://ollama.ai/

Update: Continue.dev VSCode plugin for LLMs (also a private/air-gapped option with Ollama):
https://continue.dev/docs/walkthroughs/codellama
