@Artefact2
Last active January 24, 2026 21:16

GGUF quantizations overview

Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggml-org/llama.cpp#5962

In the meantime, use the largest quantization that fully fits in your GPU's memory. If you can comfortably fit Q4_K_S, try a model with more parameters.
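As a rough illustration of the rule of thumb above, here is a minimal sketch that estimates the file size of each quantization from its bits per weight and picks the largest one that fits a memory budget. The bits-per-weight values are copied from the KL-divergence table in this gist; the parameter count (about 7.24 billion for Mistral-7B), the `headroom_gib` allowance for context and buffers, and the function names are assumptions made for the example, not something llama.cpp provides.

```python
# Bits per weight, copied from the KL-divergence table in this gist.
BITS_PER_WEIGHT = {
    "IQ1_S": 1.78, "IQ2_XXS": 2.20, "IQ2_XS": 2.43, "IQ2_S": 2.55,
    "IQ2_M": 2.76, "Q2_K_S": 2.79, "Q2_K": 3.00, "IQ3_XXS": 3.21,
    "IQ3_XS": 3.32, "Q3_K_S": 3.50, "IQ3_S": 3.52, "IQ3_M": 3.63,
    "Q3_K_M": 3.89, "Q3_K_L": 4.22, "IQ4_XS": 4.32, "IQ4_NL": 4.56,
    "Q4_K_S": 4.57, "Q4_K_M": 4.83, "Q5_K_S": 5.52, "Q5_K_M": 5.67,
    "Q6_K": 6.57,
}

GIB = 1024 ** 3


def estimated_size_gib(n_params, bpw):
    """Approximate model size in GiB: parameters * bits per weight / 8."""
    return n_params * bpw / 8 / GIB


def largest_fitting_quant(n_params, vram_gib, headroom_gib=1.5):
    """Return (name, bpw, size_gib) of the largest quant that fits, or None.

    headroom_gib is a hypothetical allowance for the KV cache and
    compute buffers; tune it for your context size and backend.
    """
    budget = vram_gib - headroom_gib
    best = None
    for name, bpw in BITS_PER_WEIGHT.items():
        size = estimated_size_gib(n_params, bpw)
        if size <= budget and (best is None or bpw > best[1]):
            best = (name, bpw, size)
    return best


if __name__ == "__main__":
    # Assumed parameter count for Mistral-7B; adjust for your model.
    print(largest_fitting_quant(7.24e9, vram_gib=8.0))
```

The size estimate ignores per-file metadata and treats the bits-per-weight figure as uniform across models, so read the result as a starting point and check the actual file size before downloading.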

llama.cpp feature matrix

See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

KL-divergence statistics for Mistral-7B

  • Last updated 2024-02-27 (add IQ4_XS).
  • imatrix from wiki.train, 200*512 tokens.
  • KL-divergence measured on wiki.test.

| Quantization | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|--------------|-----------------|----------------------|-------------------|-------------------|----------------------|
| IQ1_S        | 1.78            | 0.5495               | 5.5174            | 0.3840            | 0.9235               |
| IQ2_XXS      | 2.20            | 0.1751               | 2.4983            | 0.2313            | 0.2988               |
| IQ2_XS       | 2.43            | 0.1146               | 1.7693            | 0.1943            | 0.2046               |
| IQ2_S        | 2.55            | 0.0949               | 1.6284            | 0.1806            | 0.1722               |
| IQ2_M        | 2.76            | 0.0702               | 1.0935            | 0.1557            | 0.1223               |
| Q2_K_S       | 2.79            | 0.0829               | 1.5111            | 0.1735            | 0.1600               |
| Q2_K         | 3.00            | 0.0588               | 1.0337            | 0.1492            | 0.1103               |
| IQ3_XXS      | 3.21            | 0.0330               | 0.5492            | 0.1137            | 0.0589               |
| IQ3_XS       | 3.32            | 0.0296               | 0.4550            | 0.1071            | 0.0458               |
| Q3_K_S       | 3.50            | 0.0304               | 0.4481            | 0.1068            | 0.0511               |
| IQ3_S        | 3.52            | 0.0205               | 0.3018            | 0.0895            | 0.0306               |
| IQ3_M        | 3.63            | 0.0186               | 0.2740            | 0.0859            | 0.0268               |
| Q3_K_M       | 3.89            | 0.0171               | 0.2546            | 0.0839            | 0.0258               |
| Q3_K_L       | 4.22            | 0.0152               | 0.2202            | 0.0797            | 0.0205               |
| IQ4_XS       | 4.32            | 0.0088               | 0.1082            | 0.0606            | 0.0079               |
| IQ4_NL       | 4.56            | 0.0085               | 0.1077            | 0.0605            | 0.0074               |
| Q4_K_S       | 4.57            | 0.0083               | 0.1012            | 0.0600            | 0.0081               |
| Q4_K_M       | 4.83            | 0.0075               | 0.0885            | 0.0576            | 0.0060               |
| Q5_K_S       | 5.52            | 0.0045               | 0.0393            | 0.0454            | 0.0005               |
| Q5_K_M       | 5.67            | 0.0043               | 0.0368            | 0.0444            | 0.0005               |
| Q6_K         | 6.57            | 0.0032               | 0.0222            | 0.0394            | −0.0008              |
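For readers unfamiliar with the column headings, the following is a minimal sketch of how statistics of this kind can be computed from per-token logits of a base model and a quantized model. It is not the code used to produce the numbers above; the function names, the toy logits, and the simple nearest-rank 99th percentile are assumptions made for illustration.

```python
import math
import statistics


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def quant_stats(base_logits, quant_logits, targets):
    """Compare a quantized model against its base, token by token.

    base_logits, quant_logits: one list of logits per token position.
    targets: index of the true next token at each position.
    """
    kls, top_diff, nll_base, nll_quant = [], 0, 0.0, 0.0
    for b, q, t in zip(base_logits, quant_logits, targets):
        lp, lq = log_softmax(b), log_softmax(q)
        # KL(base || quant) = sum_i p_i * (log p_i - log q_i)
        kls.append(sum(math.exp(a) * (a - c) for a, c in zip(lp, lq)))
        top_diff += argmax(lp) != argmax(lq)
        nll_base -= lp[t]
        nll_quant -= lq[t]
    n = len(kls)
    ordered = sorted(kls)
    return {
        "kl_median": statistics.median(kls),
        # Nearest-rank 99th percentile (an assumption; other
        # percentile definitions give slightly different values).
        "kl_q99": ordered[min(n - 1, math.ceil(0.99 * n) - 1)],
        "top_tokens_differ": top_diff / n,
        # ln(PPL(Q)/PPL(base)) is the difference in mean
        # negative log-likelihood between the two models.
        "ln_ppl_ratio": (nll_quant - nll_base) / n,
    }


if __name__ == "__main__":
    base = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
    quant = [[1.0, 2.0, 0.0], [0.0, 3.0, 1.0]]
    print(quant_stats(base, quant, targets=[0, 1]))
```

Lower is better in every column; a value of 0 means the quantized model is indistinguishable from the base model on that measure.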

ROCm benchmarks for Mistral-7B

  • Last updated 2024-03-15 (bench #6083).

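| Quantization | GiB   | pp512 -ngl 99 | tg128 -ngl 99 | pp512 -ngl 0 | tg128 -ngl 0 | pp512 -ngl 0 #6083 |
|--------------|-------|---------------|---------------|--------------|--------------|--------------------|
| IQ1_S        | 1.50  | 709.29        | 74.85         | 324.35       | 15.66        | 585.61             |
| IQ2_XS       | 2.05  | 704.52        | 58.44         | 316.10       | 15.11        | 557.68             |
| IQ3_XS       | 2.79  | 682.72        | 45.79         | 300.61       | 10.49        | 527.83             |
| IQ4_XS       | 3.64  | 712.96        | 64.17         | 292.36       | 11.06        | 495.92             |
| Q4_0         | 3.83  | 870.44        | 63.42         | 310.94       | 10.44        | 554.56             |
| Q5_K         | 4.78  | 691.40        | 46.52         | 273.83       | 8.54         | 453.58             |
| Q6_K         | 5.53  | 661.98        | 47.57         | 261.16       | 7.34         | 415.22             |
| Q8_0         | 7.17  | 881.95        | 39.74         | 270.70       | 5.74         | 440.44             |
| f16          | 13.49 |               |               | 211.12       | 3.06         | 303.60             |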
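To make the last column easier to compare against the baseline, here is a small sketch that computes the ratio between the two "pp512 -ngl 0" columns for each quantization. It assumes the "#6083" column is the same benchmark rerun with that change applied, which is a reading of the column heading rather than something stated in the table; the f16 row is left out because its GPU columns are missing in the source.

```python
# (pp512 -ngl 0, pp512 -ngl 0 #6083), copied from the table above.
PP512_NGL0 = {
    "IQ1_S": (324.35, 585.61),
    "IQ2_XS": (316.10, 557.68),
    "IQ3_XS": (300.61, 527.83),
    "IQ4_XS": (292.36, 495.92),
    "Q4_0": (310.94, 554.56),
    "Q5_K": (273.83, 453.58),
    "Q6_K": (261.16, 415.22),
    "Q8_0": (270.70, 440.44),
}


def speedups(table):
    """Ratio of the #6083 figure to the baseline figure per quant."""
    return {name: after / before for name, (before, after) in table.items()}


if __name__ == "__main__":
    for name, ratio in speedups(PP512_NGL0).items():
        print(f"{name:8s} {ratio:.2f}x")
```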
@unixmonk

@vahidx4r4x

I wrote FROM "D:\This PC\Desktop\gpt-oss-20B-jail-broke.i1-Q6_K.gguf" in a Modelfile and created the model in Ollama, but it still refuses to answer things like requests for personal financial tips.

In my experience, Ollama has a lot of problems with abliterated models. Make sure you raise the context window limit to at least 8k tokens or so, especially for a reasoning model like gpt-oss. Also check that the template and parameters are what the model expects; a lot of the time the template is incorrect and the model has no idea what to do.

Honestly, I'd just try vLLM or LM Studio first; it's less of a headache than the above.
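For the context window suggestion, a minimal sketch of generating a Modelfile that sets a larger context might look like the following. `FROM` and `PARAMETER num_ctx` are standard Modelfile instructions; the helper name and the `./model.gguf` path are placeholders made up for this example, and the sketch deliberately leaves the chat template alone.

```python
def build_modelfile(gguf_path: str, num_ctx: int = 8192) -> str:
    """Return Modelfile text that loads a GGUF file with a given context size.

    gguf_path is a placeholder; point it at your own file.
    """
    lines = [
        f"FROM {gguf_path}",
        # Raise the context window; 8192 follows the "at least 8k" advice.
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical path used only for illustration.
    with open("Modelfile", "w", encoding="utf-8") as f:
        f.write(build_modelfile("./model.gguf"))
```

After `ollama create <name> -f Modelfile`, running `ollama show --modelfile <name>` prints the Modelfile Ollama actually uses, which is a quick way to check whether the template matches what the model expects.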

@randomqhacker

I would love to see where MXFP4 falls on your graph! There seems to be some debate over whether it's any good as an after-the-fact quantization, as opposed to being used with QAT.
