Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggml-org/llama.cpp#5962

In the meantime, use the largest quant that fully fits in your GPU's memory. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
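The rule of thumb above can be sketched as a small selection routine. This is only an illustration, not part of llama.cpp: the file sizes are the GiB values from the benchmark table in this section (measured for one specific model), `pick_quant` and `headroom_gib` are hypothetical names, and the headroom reserved for the KV cache and scratch buffers is an assumption you should tune for your own context size.

```python
# File sizes in GiB, copied from the benchmark table in this section.
# They apply to the one model benchmarked there; other models differ.
QUANT_SIZES_GIB = {
    "IQ1_S": 1.50,
    "IQ2_XS": 2.05,
    "IQ3_XS": 2.79,
    "IQ4_XS": 3.64,
    "Q4_0": 3.83,
    "Q5_K": 4.78,
    "Q6_K": 5.53,
    "Q8_0": 7.17,
    "f16": 13.49,
}


def pick_quant(vram_gib, headroom_gib=1.0, sizes=QUANT_SIZES_GIB):
    """Return the largest quant whose file fits in VRAM, or None.

    headroom_gib is an assumed allowance for the KV cache and
    scratch buffers; it is not a value taken from llama.cpp.
    """
    budget = vram_gib - headroom_gib
    fitting = {name: size for name, size in sizes.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for vram in (4, 6, 8, 16):
        print(f"{vram} GiB VRAM -> {pick_quant(vram)}")
```

If the function returns a quant well above Q4_K_S for your card, that is the situation where a model with more parameters is worth trying.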

llama.cpp feature matrix

Quant	Bits per weight	KL-divergence median	KL-divergence q99	Top tokens differ	ln(PPL(Q)/PPL(base))
IQ1_S	1.78	0.5495	5.5174	0.3840	0.9235
IQ2_XXS	2.20	0.1751	2.4983	0.2313	0.2988
IQ2_XS	2.43	0.1146	1.7693	0.1943	0.2046
IQ2_S	2.55	0.0949	1.6284	0.1806	0.1722
IQ2_M	2.76	0.0702	1.0935	0.1557	0.1223
Q2_K_S	2.79	0.0829	1.5111	0.1735	0.1600
Q2_K	3.00	0.0588	1.0337	0.1492	0.1103
IQ3_XXS	3.21	0.0330	0.5492	0.1137	0.0589
IQ3_XS	3.32	0.0296	0.4550	0.1071	0.0458
Q3_K_S	3.50	0.0304	0.4481	0.1068	0.0511
IQ3_S	3.52	0.0205	0.3018	0.0895	0.0306
IQ3_M	3.63	0.0186	0.2740	0.0859	0.0268
Q3_K_M	3.89	0.0171	0.2546	0.0839	0.0258
Q3_K_L	4.22	0.0152	0.2202	0.0797	0.0205
IQ4_XS	4.32	0.0088	0.1082	0.0606	0.0079
IQ4_NL	4.56	0.0085	0.1077	0.0605	0.0074
Q4_K_S	4.57	0.0083	0.1012	0.0600	0.0081
Q4_K_M	4.83	0.0075	0.0885	0.0576	0.0060
Q5_K_S	5.52	0.0045	0.0393	0.0454	0.0005
Q5_K_M	5.67	0.0043	0.0368	0.0444	0.0005
Q6_K	6.57	0.0032	0.0222	0.0394	−0.0008
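The last column of the table above is the natural log of a perplexity ratio, so exponentiating it recovers the ratio itself. The snippet below is a small worked conversion for a few rows; the helper name `ppl_increase_percent` is made up for this illustration.

```python
import math

# ln(PPL(Q)/PPL(base)) values copied from the table above.
LN_PPL_RATIO = {
    "IQ1_S": 0.9235,
    "Q2_K": 0.1103,
    "Q4_K_M": 0.0060,
    "Q6_K": -0.0008,
}


def ppl_increase_percent(ln_ratio):
    """Convert ln(PPL(Q)/PPL(base)) to a percent change in perplexity."""
    return (math.exp(ln_ratio) - 1.0) * 100.0


if __name__ == "__main__":
    for quant, ln_ratio in LN_PPL_RATIO.items():
        change = ppl_increase_percent(ln_ratio)
        print(f"{quant}: {change:+.2f}% perplexity vs. base")
```

For small entries the percent change is close to 100 times the table value, so Q4_K_M sits roughly 0.6% above the base perplexity, while IQ1_S is roughly 2.5 times the base perplexity.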

Quant	Size (GiB)	pp512 -ngl 99 (t/s)	tg128 -ngl 99 (t/s)	pp512 -ngl 0 (t/s)	tg128 -ngl 0 (t/s)	pp512 -ngl 0 #6083 (t/s)
IQ1_S	1.50	709.29	74.85	324.35	15.66	585.61
IQ2_XS	2.05	704.52	58.44	316.10	15.11	557.68
IQ3_XS	2.79	682.72	45.79	300.61	10.49	527.83
IQ4_XS	3.64	712.96	64.17	292.36	11.06	495.92
Q4_0	3.83	870.44	63.42	310.94	10.44	554.56
Q5_K	4.78	691.40	46.52	273.83	8.54	453.58
Q6_K	5.53	661.98	47.57	261.16	7.34	415.22
Q8_0	7.17	881.95	39.74	270.70	5.74	440.44
f16	13.49			211.12	3.06	303.60
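The size column above can be related to the bits-per-weight column of the earlier table with simple arithmetic, since file size scales roughly with bits per weight. The sketch below treats the f16 file as exactly 16 bits per weight, which is an approximation (GGUF files also hold metadata, and some tensors may be stored at other precisions). `estimate_size_gib` and `implied_parameter_count` are hypothetical helpers, not llama.cpp functions.

```python
F16_SIZE_GIB = 13.49  # f16 row of the benchmark table above
F16_BITS_PER_WEIGHT = 16.0  # assumption: the f16 file is ~16 bits per weight


def estimate_size_gib(bits_per_weight, f16_size_gib=F16_SIZE_GIB):
    """Estimate file size by scaling the f16 size by the bpw ratio."""
    return f16_size_gib * bits_per_weight / F16_BITS_PER_WEIGHT


def implied_parameter_count(f16_size_gib=F16_SIZE_GIB):
    """Parameter count implied by the f16 size at 2 bytes per weight."""
    return f16_size_gib * 2**30 / 2


if __name__ == "__main__":
    # quant: (bits per weight from the first table, GiB from the second)
    rows = {
        "IQ1_S": (1.78, 1.50),
        "IQ4_XS": (4.32, 3.64),
        "Q6_K": (6.57, 5.53),
    }
    for quant, (bpw, measured) in rows.items():
        est = estimate_size_gib(bpw)
        print(f"{quant}: estimated {est:.2f} GiB, table says {measured:.2f} GiB")
    print(f"implied parameters: {implied_parameter_count() / 1e9:.2f} billion")
```

For these rows the estimates land within a few hundredths of a GiB of the measured sizes, so the same scaling gives a rough size for quants that appear only in the first table.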