@yeahdongcn
Created March 13, 2026 07:58
Why Your LLM Benchmark Numbers Keep Changing on Apple Silicon

Investigating performance variance in MLX inference on a MacBook Pro M1

TL;DR

We observed up to 30% variance in decode throughput across benchmark runs of the same code on the same hardware — enough to turn a "no regression" into an apparent 14% slowdown, or inflate a 48% improvement to 80%. Using apple-smi to monitor GPU temperature, power draw, and memory pressure, we identified three root causes: thermal throttling, memory pressure from unified memory, and DVFS (Dynamic Voltage and Frequency Scaling). This post documents the methodology and provides guidelines for producing reliable benchmarks on Apple Silicon.

The Problem

While developing a hybrid KV cache for SGLang's MLX backend on Apple Silicon, we encountered a frustrating situation: the same benchmark, run minutes apart, produced wildly different results.

Run | BS=1 Decode (tok/s) | BS=4 Decode (tok/s)
Early run (favorable) | 42.3 | 83.8
Later run (unfavorable) | 28.7 | 63.5
Documented baseline | 40.2 | 46.6

A 42 → 29 tok/s swing on BS=1 is a 31% drop with zero code changes. This makes it nearly impossible to evaluate whether an optimization actually works.

Test Setup

  • Hardware: Apple M1 (MacBookPro17,1), 8 GPU cores, 16GB unified memory
  • OS: macOS 26.3.1 (Tahoe)
  • Model: Qwen3-0.6B (BF16)
  • Benchmark: sglang.bench_one_batch with input_len=60, output_len=10
  • Monitoring: apple-smi v0.1.4 — captures GPU/CPU temperature, power draw, GPU frequency, and memory usage

The Experiment

We ran 6 sequential benchmark iterations with apple-smi snapshots before and after each run:

Run 1: BS=1 (cold start)
Run 2: BS=4 (immediately after)
Run 3: BS=1 (after 15s cooldown)
Run 4: BS=4 (after 15s cooldown)
Run 5: BS=1 (back-to-back, no cooldown)
Run 6: BS=4 (back-to-back, no cooldown)
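The six-run protocol can be automated so the snapshots are never skipped. A minimal sketch, assuming apple-smi is on the PATH; the bench_one_batch flag names are assumptions pieced together from the setup section and may need adjusting (e.g. a --model-path argument is omitted here):

```python
import subprocess
import time

# The six-run protocol, expressed as data.
# `cooldown_s` is the idle time *before* the run starts.
SCHEDULE = [
    {"label": "BS=1 cold",   "batch_size": 1, "cooldown_s": 0},
    {"label": "BS=4 cold",   "batch_size": 4, "cooldown_s": 0},
    {"label": "BS=1 cooled", "batch_size": 1, "cooldown_s": 15},
    {"label": "BS=4 cooled", "batch_size": 4, "cooldown_s": 15},
    {"label": "BS=1 hot",    "batch_size": 1, "cooldown_s": 0},
    {"label": "BS=4 hot",    "batch_size": 4, "cooldown_s": 0},
]

def plan_commands(schedule):
    """Expand the schedule into the command lines that would run,
    with an apple-smi snapshot before and after each benchmark."""
    cmds = []
    for run in schedule:
        if run["cooldown_s"]:
            cmds.append(f"sleep {run['cooldown_s']}")
        cmds.append("apple-smi")  # pre-run snapshot
        cmds.append(
            "python -m sglang.bench_one_batch "
            f"--batch-size {run['batch_size']} --input-len 60 --output-len 10"
        )
        cmds.append("apple-smi")  # post-run snapshot
    return cmds

def run_experiment(schedule):
    """Execute the planned commands sequentially (never in parallel)."""
    for cmd in plan_commands(schedule):
        if cmd.startswith("sleep"):
            time.sleep(int(cmd.split()[1]))
        else:
            subprocess.run(cmd.split(), check=False)
```

Driving the runs from one script also guarantees they execute back-to-back in the same session, which matters for the comparison guidelines below.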

Raw Results

Run | Label | Decode (tok/s) | GPU Temp Pre→Post | CPU Temp Pre→Post | Power Pre→Post
1 | BS=1 cold | 31.6 | 30°C → 60°C | 73°C → 74°C | 17W → 18W
2 | BS=4 cold | 67.8 | 59°C → 62°C | 73°C → 76°C | 16W → 24W
3 | BS=1 cooled | 32.7 | 30°C → 64°C | 73°C → 74°C | 13W → 19W
4 | BS=4 cooled | 69.7 | 30°C → 60°C | 70°C → 76°C | 13W → 23W
5 | BS=1 hot | 30.5 | 60°C → 62°C | 74°C → 76°C | 17W → 22W
6 | BS=4 hot | 69.1 | 62°C → 65°C | 75°C → 77°C | 18W → 21W

Root Cause Analysis

1. Thermal Throttling

The M1's GPU temperature swings from 30°C to 65°C during a single benchmark run. Apple Silicon uses aggressive thermal management — when the die temperature rises, the SoC reduces clock speeds and power delivery to stay within its thermal envelope.

Key observation: the GPU starts "cold" at 30°C after idle periods, heats to 60°C+ during computation, and the warmup phase itself changes the thermal state for the benchmark phase that follows. This means:

  • A "cold" first run benefits from peak clocks during warmup but may throttle during the benchmark
  • A "hot" follow-up run starts already throttled

The 13-inch M1 MacBook Pro (MacBookPro17,1) has only a single small fan and a modest heatsink, so heat dissipates slowly between runs, making it especially susceptible to carry-over heat from one run to the next.

2. Memory Pressure and Swap

RAM usage: 11,340 / 16,384 MiB (69% used)
Swap usage: 7,842 MiB

With only 16GB of unified memory shared between CPU, GPU, and system, and ~8GB of swap active, the system is under significant memory pressure. On Apple Silicon, GPU memory is system memory — there's no dedicated VRAM. This means:

  • The OS may page out GPU buffers to swap and page them back in, adding latency spikes
  • Background processes competing for memory cause unpredictable eviction of model weights or KV cache buffers
  • Swap I/O competes with the SSD controller for bus bandwidth
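Swap pressure is easy to check before a run. `sysctl vm.swapusage` is a standard macOS interface; the sample output in the docstring approximates what macOS prints, and the parser falls back to 0 if the format differs:

```python
import re
import subprocess

SWAP_LIMIT_MIB = 2048  # above ~2 GB of swap, results are suspect

def parse_swap_used(sysctl_output):
    """Extract swap-used MiB from `sysctl vm.swapusage` output, e.g.
    'vm.swapusage: total = 8192.00M  used = 7842.00M  free = 350.00M  (encrypted)'
    Returns 0.0 if the expected field is not found."""
    m = re.search(r"used = ([\d.]+)M", sysctl_output)
    return float(m.group(1)) if m else 0.0

def swap_ok():
    """True if current swap usage is below the benchmarking threshold."""
    out = subprocess.run(["sysctl", "vm.swapusage"],
                         capture_output=True, text=True).stdout
    return parse_swap_used(out) < SWAP_LIMIT_MIB
```

With ~8 GB of swap active, as in the snapshot above, `swap_ok()` would return False and the session's numbers should be discarded.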

3. DVFS (Dynamic Voltage and Frequency Scaling)

The GPU frequency was consistently reported at ~715 MHz across all runs, which is near the M1's base GPU clock. However, Apple's DVFS operates at a finer granularity than what apple-smi can capture at 1-second intervals — the GPU may boost to higher frequencies for microsecond bursts during computation, with the boost headroom depending on:

  • Current thermal state (more headroom when cold)
  • Power budget (the M1 has a 13-15W TDP)
  • Whether the CPU is also active (CPU and GPU share the power budget)

4. Background System Activity

macOS runs numerous background services that spike CPU and I/O:

  • Spotlight indexing (mds_stores) can saturate SSD bandwidth
  • Time Machine snapshots
  • cloudd and iCloud sync
  • WindowServer compositing — even rendering the terminal output competes for GPU cycles

The CPU temperature baseline of 70-76°C even before our benchmark starts indicates significant background activity.

What the Data Tells Us

BS=1 variance is real but manageable

Condition | Decode (tok/s) | Deviation from mean
Cold start | 31.6 | 0.0% (equals mean)
After cooldown | 32.7 | +3.5%
Back-to-back (hot) | 30.5 | -3.5%

Range: 30.5 – 32.7 tok/s (~7% spread). Within a single session, differences smaller than this spread should be treated as noise. However, comparing against the documented baseline of 40.2 tok/s (measured on a different day) shows a gap of up to 24%, likely due to different background load and memory pressure conditions.

BS=4 is more stable

Condition | Decode (tok/s) | Deviation from mean
Cold start | 67.8 | -1.6%
After cooldown | 69.7 | +1.2%
Back-to-back (hot) | 69.1 | +0.3%

Range: 67.8 – 69.7 tok/s (~3% spread). The BS=4 batched path is more stable because it's GPU-compute-bound (the GPU stays busy), reducing the impact of scheduling jitter.
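The spreads quoted above can be reproduced in a few lines, using range-over-mean as the spread metric:

```python
from statistics import mean, median

def spread_pct(samples):
    """Range of samples as a percentage of their mean."""
    return (max(samples) - min(samples)) / mean(samples) * 100

bs1 = [31.6, 32.7, 30.5]  # cold, cooled, hot
bs4 = [67.8, 69.7, 69.1]

print(f"BS=1: median {median(bs1):.1f} tok/s, spread {spread_pct(bs1):.1f}%")
print(f"BS=4: median {median(bs4):.1f} tok/s, spread {spread_pct(bs4):.1f}%")
```

This yields ~7.0% for BS=1 and ~2.8% for BS=4, matching the ranges reported here.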

The first decode step is always slow

Every BS=4 run shows Decode 0 at 35-38 tok/s while Decode 1+ runs at 66-76 tok/s. This ~2× gap on the first step is caused by:

  1. Mode transition overhead: the hybrid cache switches from native mlx-lm caches to contiguous buffers, patching attention modules on the first batched decode
  2. MLX lazy evaluation: the first decode triggers JIT compilation of the new computation graph
  3. Cache warmup: the contiguous KV buffers are freshly allocated and may not be in the GPU's cache hierarchy
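Because the first step carries these one-time costs, a measurement harness should time steps individually and exclude the first before taking a median. A generic sketch, not tied to MLX or SGLang:

```python
import time
from statistics import median

def timed_decode_steps(step_fn, n_steps, discard=1):
    """Time n_steps calls of step_fn; return the median duration of the
    steps after the first `discard` (which pay compilation/allocation
    costs), plus the discarded warm-up durations for inspection."""
    durations = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - t0)
    steady = durations[discard:]
    return median(steady), durations[:discard]
```

Reporting the warm-up durations separately (rather than silently dropping them) makes it obvious when the first-step penalty itself regresses.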

Guidelines for Reliable Benchmarking on Apple Silicon

Based on these findings, here are concrete recommendations:

Before benchmarking

  1. Close unnecessary applications — especially browsers, IDEs, and anything GPU-accelerated
  2. Wait for background indexing to finish — check with sudo fs_usage -f filesys mds_stores
  3. Monitor memory pressure — if swap exceeds ~2GB, reboot or close apps first
  4. Let the system idle for 2+ minutes before starting — check with apple-smi that GPU temp is at baseline (~30°C)
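Step 4 can be automated by polling apple-smi until the GPU is back at baseline. The temperature field name and format parsed below are assumptions about apple-smi's output, so adapt the regex to the actual v0.1.4 format:

```python
import re
import subprocess
import time

BASELINE_GPU_C = 32  # treat anything near idle (~30 °C) as cool enough

def gpu_temp_c(smi_output):
    """Parse a GPU temperature from apple-smi output. The field layout
    (e.g. 'GPU Temp: 30°C') is a guess, not a documented format."""
    m = re.search(r"GPU.*?(\d+(?:\.\d+)?)\s*°?C", smi_output)
    return float(m.group(1)) if m else None

def wait_for_cool_gpu(timeout_s=180, poll_s=10):
    """Poll apple-smi until the GPU reaches baseline or timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(["apple-smi"], capture_output=True,
                             text=True).stdout
        t = gpu_temp_c(out)
        if t is not None and t <= BASELINE_GPU_C:
            return True
        time.sleep(poll_s)
    return False
```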

During benchmarking

  1. Always use a warmup phase — the benchmark tool already does this, but ensure warmup is at least 32 tokens
  2. Run benchmarks sequentially, never in parallel — even two Python processes compete for the unified memory bus
  3. Add cooldown periods — at least 15 seconds between runs to allow thermal recovery
  4. Run each configuration 3+ times — report median, not mean (mean is skewed by outliers)
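Points 3 and 4 combine naturally into a small wrapper that repeats a benchmark command with cooldowns and reports the median. The default regex for extracting throughput is a placeholder, since bench_one_batch's exact output format isn't shown here:

```python
import re
import subprocess
import time
from statistics import median

def bench_n(cmd, n=3, cooldown_s=15, pattern=r"Decode.*?([\d.]+) tok/s"):
    """Run `cmd` n times with a cooldown between runs; return the median
    decode throughput parsed from stdout, or None if nothing matched.
    `pattern` must be adjusted to the benchmark tool's real output."""
    results = []
    for i in range(n):
        if i:
            time.sleep(cooldown_s)  # thermal recovery between runs
        out = subprocess.run(cmd, shell=True, capture_output=True,
                             text=True).stdout
        m = re.search(pattern, out)
        if m:
            results.append(float(m.group(1)))
    return median(results) if results else None
```

The median is deliberately chosen over the mean: a single throttled or swap-stalled run shifts a mean of three samples noticeably, but leaves the median untouched.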

When comparing

  1. Use A/B testing in the same session: git stash → run baseline → git stash pop → run modified, back-to-back
  2. Report the system state — include apple-smi output with results so reviewers can assess conditions
  3. Only trust relative comparisons — absolute numbers are meaningless across sessions; compare against a baseline run in the same session
  4. Use median decode throughput, not total throughput — the "Total" metric includes prefill and amortizes differently across batch sizes

Example: proper A/B comparison

# ✅ Good: sequential A/B with system monitoring
apple-smi                          # Check system state
git stash                          # Switch to baseline
bench_one_batch --batch-size 4     # Baseline run
sleep 15                           # Cooldown
git stash pop                      # Switch to modified
bench_one_batch --batch-size 4     # Modified run
apple-smi                          # Verify system state hasn't degraded

# ❌ Bad: parallel runs
bench_one_batch --batch-size 1 &   # Competing for GPU
bench_one_batch --batch-size 4 &   # Results are meaningless

Conclusion

On Apple Silicon, you cannot trust a single benchmark number. The unified memory architecture, constrained cooling, and aggressive DVFS create a system where performance varies by 7-30% depending on conditions invisible to the application.

The fix is not to eliminate variance (you can't), but to measure it, report it, and design your comparison methodology to account for it. Use apple-smi to capture system state, run back-to-back A/B comparisons in the same thermal window, and report medians with system context.

Our hybrid KV cache optimization for SGLang's MLX backend shows a consistent ~48% improvement at BS=4 (69 tok/s vs 47 tok/s baseline) when measured with proper methodology — not the 80% we initially reported from a cherry-picked favorable run.


Data collected on March 13, 2026 using apple-smi v0.1.4 and SGLang's bench_one_batch tool. Model: Qwen3-0.6B on Apple M1 MacBook Pro (16GB).
