2025-05-10
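The model files referenced below sit in `~/.cache/llama.cpp/`, which matches the cache naming produced by llama.cpp's `-hf` Hugging Face download flag (`<user>_<repo>_<file>.gguf`). A sketch of how the unsloth quants were likely fetched, assuming the repos named in the cached paths (the `mmns_*` files presumably came from another HF account and are left out):

```sh
# Downloading via -hf caches each file under ~/.cache/llama.cpp/
# with exactly the names used in the benchmark command below.
./build/bin/llama-cli -hf unsloth/Qwen3-32B-GGUF:Q8_0 -p "hi" -n 1
./build/bin/llama-cli -hf unsloth/Qwen3-30B-A3B-GGUF:Q8_0 -p "hi" -n 1
./build/bin/llama-cli -hf unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q4_K_M -p "hi" -n 1
```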
ubuntu@t120h-k80:~/llama.cpp (master)$ ./build/bin/llama-bench -p 0 -n 128,256,512 \
> -m ~/.cache/llama.cpp/unsloth_Qwen3-32B-GGUF_Qwen3-32B-Q8_0.gguf \
> -m ~/.cache/llama.cpp/unsloth_Qwen3-30B-A3B-GGUF_Qwen3-30B-A3B-Q8_0.gguf \
> -m ~/.cache/llama.cpp/mmns_Qwen3-32B-F16.gguf \
> -m ~/.cache/llama.cpp/mmns_Qwen3-30B-A3B-F16.gguf \
> -m ~/.cache/llama.cpp/unsloth_DeepSeek-R1-Distill-Llama-70B-GGUF_DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: Tesla K80, compute capability 3.7, VMM: yes
Device 1: Tesla K80, compute capability 3.7, VMM: yes
Device 2: Tesla K80, compute capability 3.7, VMM: yes
Device 3: Tesla K80, compute capability 3.7, VMM: yes
Device 4: Tesla K80, compute capability 3.7, VMM: yes
Device 5: Tesla K80, compute capability 3.7, VMM: yes
Device 6: Tesla K80, compute capability 3.7, VMM: yes
Device 7: Tesla K80, compute capability 3.7, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 99 | tg128 | 2.06 ± 0.00 |
| qwen3 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 99 | tg256 | 2.02 ± 0.00 |
| qwen3 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 99 | tg512 | 1.95 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | tg128 | 10.19 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | tg256 | 9.89 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | tg512 | 9.27 ± 0.00 |
| qwen3 32B F16 | 61.03 GiB | 32.76 B | CUDA | 99 | tg128 | 1.28 ± 0.00 |
| qwen3 32B F16 | 61.03 GiB | 32.76 B | CUDA | 99 | tg256 | 1.28 ± 0.00 |
| qwen3 32B F16 | 61.03 GiB | 32.76 B | CUDA | 99 | tg512 | 1.28 ± 0.01 |
| qwen3moe 30B.A3B F16 | 56.89 GiB | 30.53 B | CUDA | 99 | tg128 | 6.53 ± 0.00 |
| qwen3moe 30B.A3B F16 | 56.89 GiB | 30.53 B | CUDA | 99 | tg256 | 6.40 ± 0.00 |
| qwen3moe 30B.A3B F16 | 56.89 GiB | 30.53 B | CUDA | 99 | tg512 | 6.13 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | tg128 | 1.15 ± 0.06 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | tg256 | 1.05 ± 0.02 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | tg512 | 1.07 ± 0.01 |
build: b486ba05 (5321)
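The Tesla K80 is compute capability 3.7 (Kepler), which CUDA 12 no longer targets, so a build like this needs a CUDA 11.x toolkit. A minimal configure sketch along those lines (an assumption, not the exact build used above):

```sh
# Configure llama.cpp with the CUDA backend, targeting only the K80's
# compute capability so nvcc doesn't compile for newer architectures.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=37
cmake --build build --config Release -j
```

Since these `tg` (token-generation) tests are memory-bandwidth-bound, throughput roughly tracks weight size: 2.06 t/s × 32.42 GiB for the dense Q8_0 model is about 67 GiB/s of effective weight streaming, and the 30B-A3B MoE model runs ~5× faster because only about 3B of its 30B parameters are active per token.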