Running Qwen3 Next 80B A3B Q2_0 on RTX 5090 (32GB VRAM)
The Qwen3 Next 80B A3B Q2_0 quant fits in 32GB of VRAM with 50K context. Here's how. It's bleeding edge, so you'll need to compile llama.cpp yourself from a work-in-progress branch.
Build:
git clone https://github.com/cturan/llama.cpp.git llama.cpp-qwen3-next
cd llama.cpp-qwen3-next
git checkout qwen3_next
time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)
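If the build succeeds, the binaries land in build/bin. As a quick sanity check (a sketch; --version just prints the build info and exits):
build/bin/llama-server --version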
Download the Q2_0 or Q4_0 quants from:
https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF/
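One way to fetch a quant from the command line (a sketch; it assumes the Q2_0 file is published under the same name used in the llama-server command below, so check the actual filename listed in the repo first):
pip install -U "huggingface_hub[cli]"
huggingface-cli download lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF Qwen__Qwen3-Next-80B-A3B-Instruct-Q2_0.gguf --local-dir .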
Running the model:
build/bin/llama-server -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q2_0.gguf --port 5005 --no-mmap -ngl 999 --ctx-size 50000
You can lower -ngl depending on the VRAM available; see the partial-offload example below.
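For example, the Q4_0 quant with only 30 layers offloaded to the GPU (the Q4_0 filename here is assumed by analogy with the Q2_0 one; use whatever name the repo actually ships):
build/bin/llama-server -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf --port 5005 --no-mmap -ngl 30 --ctx-size 50000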
Q2_0 fully offloaded to CUDA: 500-600 t/s prompt processing, 50-60 t/s generation
Q4_0 with 30 layers offloaded: 35 t/s prompt processing, 10 t/s generation (slow)
Q4_0 on CPU only: 10 t/s prompt processing, 5 t/s generation
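Once the server is up, it exposes an OpenAI-compatible HTTP API, so a quick smoke test looks something like this (assumes port 5005 as in the command above):
curl http://localhost:5005/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'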