Running Qwen3 Next 80B A3B Q2_0 on RTX 5090 (32GB VRAM)
The Qwen3 Next 80B A3B Q2 quant fits in 32GB of VRAM with a 50K context. Here's how. Support is still bleeding edge, so you'll need to compile llama.cpp yourself from a fork that carries the Qwen3 Next branch.
Build:
git clone https://github.com/cturan/llama.cpp.git llama.cpp-qwen3-next
cd llama.cpp-qwen3-next
git checkout qwen3_next
time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)
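If the build succeeds, the binaries land in build/bin. As a quick sanity check, the standard llama.cpp --version flag (which prints the build commit) should work on the server binary as well:

build/bin/llama-server --version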
Download the Q2 or Q4 quants from:
https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF/
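One way to fetch a single quant file is with the Hugging Face CLI (a sketch, assuming you have huggingface_hub installed; the filename matches the one used in the run command below):

pip install -U huggingface_hub
huggingface-cli download lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF \
  Qwen__Qwen3-Next-80B-A3B-Instruct-Q2_0.gguf --local-dir .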
Running the model:
build/bin/llama-server -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q2_0.gguf --port 5005 --no-mmap -ngl 999 --ctx-size 50000
You can adjust -ngl (the number of layers offloaded to the GPU) depending on the VRAM available.
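For example, if a larger quant doesn't fully fit, offload only part of the layers (the Q4_0 filename here is an assumption, following the same naming pattern as the Q2_0 file):

build/bin/llama-server -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf --port 5005 --no-mmap -ngl 30 --ctx-size 50000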
Performance:
Q2_0 fully offloaded to CUDA: 5-600 t/s prompt processing, 5-60 t/s generation
Q4_0 with 30 layers offloaded: 35 t/s prompt processing, 10 t/s generation (slooooow)
Q4_0 on CPU only: 10 t/s prompt processing, 5 t/s generation
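Once the server is up, it exposes llama.cpp's OpenAI-compatible HTTP API, so you can sanity-check it with a minimal chat request against the port used above (the prompt is just a placeholder):

curl http://localhost:5005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}'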