Running Qwen3 Next 80B A3B Q2_0 on RTX 5090 (32GB VRAM)
The Qwen3 Next 80B A3B Q2 quant fits in 32GB of VRAM with a 50K context. Here's how. Support is still bleeding edge, so you'll need to compile llama.cpp yourself from a fork that carries the Qwen3 Next branch.
Build:
git clone https://github.com/cturan/llama.cpp.git llama.cpp-qwen3-next
cd llama.cpp-qwen3-next
git checkout qwen3_next
time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)
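If the build succeeds, the binaries land in build/bin. As a quick sanity check, the standard llama.cpp --version flag (which prints the build commit) should work on the server binary as well:

build/bin/llama-server --version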
Download the Q2 or Q4 quants from:
https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF/
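One way to fetch a single quant file is with the Hugging Face CLI (a sketch, assuming you have huggingface_hub installed; the filename matches the one used in the run command below):

pip install -U huggingface_hub
huggingface-cli download lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF \
  Qwen__Qwen3-Next-80B-A3B-Instruct-Q2_0.gguf --local-dir .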
Running the model:
build/bin/llama-server -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q2_0.gguf --port 5005 --no-mmap -ngl 999 --ctx-size 50000
You can adjust -ngl (the number of layers offloaded to the GPU) depending on the VRAM available.
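For example, if a larger quant doesn't fully fit, offload only part of the layers (the Q4_0 filename here is an assumption, following the same naming pattern as the Q2_0 file):

build/bin/llama-server -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf --port 5005 --no-mmap -ngl 30 --ctx-size 50000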
Performance:
Q2_0 fully offloaded to CUDA: 5-600 t/s prompt processing, 5-60 t/s generation
Q4_0 with 30 layers offloaded: 35 t/s prompt processing, 10 t/s generation (slooooow)
Q4_0 on CPU only: 10 t/s prompt processing, 5 t/s generation
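Once the server is up, it exposes llama.cpp's OpenAI-compatible HTTP API, so you can sanity-check it with a minimal chat request against the port used above (the prompt is just a placeholder):

curl http://localhost:5005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}'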