zlib-ng: VPCLMULQDQ AVX2 vs PCLMULQDQ CRC32 Benchmark
Machine
CPU: 11th Gen Intel Core i7-1185G7 @ 3.00GHz (Tiger Lake)
Cores: 4 cores / 8 threads
L1d/L1i: 48 KiB / 32 KiB (x4)
L2: 1280 KiB (x4)
L3: 12288 KiB
OS: Windows 11 Pro
Compiler: MSVC (Visual Studio 18 2026)
Build: Release, static
Results
Median CPU time (ns) over 5 repetitions.
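The Improvement column is not defined explicitly. The reported values are consistent with the relative reduction in median time, (pclmulqdq - vpclmulqdq_avx2) / pclmulqdq, rounded to a whole percent. A minimal sketch of that calculation follows; the per-repetition timings are hypothetical placeholders, and the function name `improvement_pct` is an assumption for illustration only.

```python
from statistics import median


def improvement_pct(baseline_ns: float, candidate_ns: float) -> float:
    """Relative time saved by the candidate, in percent (positive = faster)."""
    return (baseline_ns - candidate_ns) / baseline_ns * 100.0


# Hypothetical timings for 5 repetitions of one benchmark (not measured data).
baseline_reps = [10.9, 10.4, 10.5, 10.6, 10.3]
candidate_reps = [7.7, 7.8, 8.1, 7.9, 7.6]

baseline_median = median(baseline_reps)
candidate_median = median(candidate_reps)

print(f"{improvement_pct(baseline_median, candidate_median):+.0f}%")
```

Under this definition a negative value means the vpclmulqdq_avx2 path was slower than pclmulqdq for that size.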
CRC32 (unaligned)
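| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---:|---:|---:|---:|
| 1 | 10.5 | 7.8 | +26% |
| 8 | 31.1 | 31.4 | -1% |
| 16 | 54.5 | 57.5 | -6% |
| 32 | 78.2 | 62.8 | +20% |
| 64 | 78.5 | 69.8 | +12% |
| 512 | 122.1 | 97.3 | +20% |
| 4096 | 350.3 | 279.0 | +20% |
| 32768 | 2335.5 | 1743.9 | +25% |
| 262144 | 18684.6 | 15694.8 | +16% |
| 4194304 | 313895.1 | 244140.6 | +22% |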
CRC32 (aligned)
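| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---:|---:|---:|---:|
| 8 | 28.0 | 33.1 | -18% |
| 16 | 17.4 | 17.4 | 0% |
| 32 | 20.9 | 21.4 | -2% |
| 64 | 26.2 | 27.9 | -6% |
| 512 | 68.1 | 52.5 | +23% |
| 4096 | 313.9 | 244.1 | +22% |
| 32768 | 1946.3 | 1751.6 | +10% |
| 262144 | 17127.6 | 13950.9 | +19% |
| 4194304 | 348772.3 | 232630.3 | +33% |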
CRC32+Copy (unaligned)
| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---:|---:|---:|---:|
| 32 | 77.9 | 62.3 | +20% |
| 512 | 117.3 | 109.4 | +7% |
| 8192 | 680.1 | 558.0 | +18% |
| 32768 | 2441.4 | 2441.4 | 0% |
| 65536 | 5231.6 | 4185.3 | +20% |
CRC32+Copy (aligned)
| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---:|---:|---:|---:|
| 32 | 23.4 | 25.7 | -10% |
| 512 | 55.8 | 52.3 | +6% |
| 8192 | 488.3 | 488.3 | 0% |
| 32768 | 2441.4 | 1918.2 | +21% |
| 65536 | 4534.0 | 3503.3 | +23% |
Summary
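VPCLMULQDQ AVX2 (256-bit carry-less multiply) is as fast as or faster than PCLMULQDQ (128-bit) at every measured buffer size >= 512 bytes. Most of those gains fall between 16% and 33%, but a few points show little or none: 0% at 8192 bytes (CRC32+Copy, aligned) and at 32768 bytes (CRC32+Copy, unaligned), and 6-10% at 512 bytes (CRC32+Copy) and at 32768 bytes (CRC32, aligned). The benefit comes from processing 256 bits per fold iteration instead of 128 bits. For small buffers (< 64 bytes), both paths share the same tail-processing code, so no systematic difference is expected there; the measurements at sizes up to 64 bytes are mixed, ranging from -18% to +26%.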
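To illustrate what "256 bits per fold iteration" means, the sketch below models the two instructions with plain Python integers. It is not zlib-ng's code: the fold constant `K`, the sample values, and the helper names (`clmul64`, `fold_128`, `fold_256`) are placeholders chosen for illustration, and a real CRC kernel uses fold constants derived from the CRC polynomial and the fold distance. The only facts relied on are that PCLMULQDQ carry-less multiplies one 64-bit half of each 128-bit operand (selected by immediate bits 0 and 4), and that the 256-bit form of VPCLMULQDQ does the same independently in each 128-bit lane.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def clmul64(a: int, b: int) -> int:
    """Carry-less product of two 64-bit integers (XOR replaces addition)."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result


def pclmulqdq(x: int, y: int, imm: int) -> int:
    """Model of 128-bit PCLMULQDQ: imm bit 0 picks the half of x, bit 4 the half of y."""
    a = (x >> 64) & MASK64 if imm & 0x01 else x & MASK64
    b = (y >> 64) & MASK64 if imm & 0x10 else y & MASK64
    return clmul64(a, b)


def vpclmulqdq_256(x: int, y: int, imm: int) -> int:
    """Model of 256-bit VPCLMULQDQ: one PCLMULQDQ per 128-bit lane."""
    lo = pclmulqdq(x & MASK128, y & MASK128, imm)
    hi = pclmulqdq((x >> 128) & MASK128, (y >> 128) & MASK128, imm)
    return (hi << 128) | lo


def fold_128(acc: int, k: int, data: int) -> int:
    """One generic fold step on a 128-bit accumulator."""
    return pclmulqdq(acc, k, 0x00) ^ pclmulqdq(acc, k, 0x11) ^ data


def fold_256(acc: int, k: int, data: int) -> int:
    """The same fold step applied to two 128-bit lanes at once."""
    return vpclmulqdq_256(acc, k, 0x00) ^ vpclmulqdq_256(acc, k, 0x11) ^ data


# Arbitrary placeholder values, NOT real CRC-32 fold constants or data.
K = 0x0123456789ABCDEF_FEDCBA9876543210
acc_lo, acc_hi = 0x1111222233334444_5555666677778888, 0x9999AAAABBBBCCCC_DDDDEEEEFFFF0000
dat_lo, dat_hi = 0x0F0E0D0C0B0A0908_0706050403020100, 0x1F1E1D1C1B1A1918_1716151413121110

# Two 128-bit fold steps, one per lane.
narrow = (fold_128(acc_hi, K, dat_hi) << 128) | fold_128(acc_lo, K, dat_lo)

# One 256-bit fold step covering both lanes.
wide = fold_256((acc_hi << 128) | acc_lo, (K << 128) | K, (dat_hi << 128) | dat_lo)

assert wide == narrow
```

In this model the 256-bit step produces the same bits as two 128-bit steps, so the wider instruction halves the number of multiply instructions issued per byte of input. How much of that turns into wall-clock gain depends on the CPU's execution resources and on memory bandwidth.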