zlib-ng: VPCLMULQDQ AVX2 vs PCLMULQDQ CRC32 Benchmark

Machine

CPU: 11th Gen Intel Core i7-1185G7 @ 3.00GHz (Tiger Lake)
Cores: 4 cores / 8 threads
L1d/L1i: 48 KiB / 32 KiB (x4)
L2: 1280 KiB (x4)
L3: 12288 KiB
OS: Windows 11 Pro
Compiler: MSVC (Visual Studio 18 2026)
Build: Release, static

Results

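Median CPU time (ns) over 5 repetitions. Improvement is the relative reduction in time versus pclmulqdq; positive values mean vpclmulqdq_avx2 is faster.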
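As a cross-check, the Improvement figures can be recomputed from the two timing columns. The sketch below assumes the column is the time reduction relative to pclmulqdq, rounded to a whole percent; that definition is inferred from the table values rather than stated by the benchmark, and `improvement_pct` is a hypothetical helper name.

```python
def improvement_pct(baseline_ns: float, candidate_ns: float) -> int:
    """Relative time reduction versus the baseline, rounded to a whole percent.

    Assumed definition (inferred from the tables, not stated by the benchmark).
    """
    return round((baseline_ns - candidate_ns) / baseline_ns * 100)


# Sample rows from the CRC32 (unaligned) table:
# (size in bytes, pclmulqdq ns, vpclmulqdq_avx2 ns)
rows = [(1, 10.5, 7.8), (512, 122.1, 97.3), (4194304, 313895.1, 244140.6)]

for size, base, cand in rows:
    print(f"{size}\t{improvement_pct(base, cand):+d}%")
```

For these three rows the sketch prints +26%, +20% and +22%, matching the table.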

CRC32 (unaligned)

Size (bytes)	pclmulqdq (ns)	vpclmulqdq_avx2 (ns)	Improvement
1	10.5	7.8	+26%
8	31.1	31.4	-1%
16	54.5	57.5	-6%
32	78.2	62.8	+20%
64	78.5	69.8	+11%
512	122.1	97.3	+20%
4096	350.3	279.0	+20%
32768	2335.5	1743.9	+25%
262144	18684.6	15694.8	+16%
4194304	313895.1	244140.6	+22%

CRC32 (aligned)

Size (bytes)	pclmulqdq (ns)	vpclmulqdq_avx2 (ns)	Improvement
8	28.0	33.1	-18%
16	17.4	17.4	0%
32	20.9	21.4	-2%
64	26.2	27.9	-6%
512	68.1	52.5	+23%
4096	313.9	244.1	+22%
32768	1946.3	1751.6	+10%
262144	17127.6	13950.9	+19%
4194304	348772.3	232630.3	+33%

CRC32+Copy (unaligned)

Size (bytes)	pclmulqdq (ns)	vpclmulqdq_avx2 (ns)	Improvement
32	77.9	62.3	+20%
512	117.3	109.4	+7%
8192	680.1	558.0	+18%
32768	2441.4	2441.4	0%
65536	5231.6	4185.3	+20%

CRC32+Copy (aligned)

Size (bytes)	pclmulqdq (ns)	vpclmulqdq_avx2 (ns)	Improvement
32	23.4	25.7	-10%
512	55.8	52.3	+6%
8192	488.3	488.3	0%
32768	2441.4	1918.2	+21%
65536	4534.0	3503.3	+23%

Summary

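VPCLMULQDQ AVX2 (256-bit carry-less multiply) is faster than PCLMULQDQ (128-bit) in every plain CRC32 run at buffer sizes >= 512 bytes, with gains of 10-33%. CRC32+Copy results are less uniform: gains reach 18-23% at some sizes but fall to 0-7% at others (512 bytes, 8192 bytes aligned, 32768 bytes unaligned). The benefit comes from processing 256 bits per fold iteration instead of 128 bits. For small buffers (< 64 bytes), both paths share the same tail-processing code, so no systematic gain is expected; the measured differences there range from -18% to +26% with no consistent direction.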
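
For readers unfamiliar with the operation, carry-less multiplication is binary long multiplication in which the partial products are combined with XOR instead of addition. The Python sketch below is a software illustration only, not zlib-ng's implementation; `clmul` and `clmul_lanes` are hypothetical names. PCLMULQDQ computes one such 64x64-bit product per instruction, while VPCLMULQDQ on 256-bit registers computes one per 128-bit lane, which is two per instruction.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product: shift-and-XOR instead of shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # XOR in the partial product; no carries propagate
        a <<= 1
        b >>= 1
    return result


def clmul_lanes(a_lanes, b_lanes):
    """Toy model of a vector carry-less multiply: one independent product per lane."""
    return [clmul(a, b) for a, b in zip(a_lanes, b_lanes)]


print(clmul(0b11, 0b11))            # 5 (0b101), whereas 3 * 3 == 9
print(clmul_lanes([3, 7], [3, 7]))  # [5, 21]
```

The two-lane call stands in for the wider instruction: the same work per lane, with twice as many lanes handled at once.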

nmoinvaz/zlib-ng-vpclmulqdq-avx2-benchmarks.md
