Investigating performance variance in MLX inference on a MacBook Pro M1
We observed up to 30% variance in decode throughput across benchmark runs of the same code on the same hardware — enough to turn a "no regression" into an apparent 14% slowdown, or inflate a 48% improvement to 80%. Using apple-smi to monitor GPU temperature, power draw, and memory pressure, we identified three root causes: thermal throttling, memory pressure from unified memory, and DVFS (Dynamic Voltage and Frequency Scaling). This post documents the methodology and provides guidelines for producing reliable benchmarks on Apple Silicon.
While developing a hybrid KV cache for SGLang's MLX backend on Apple Silicon, we encountered a frustrating situation: the same benchmark, run minutes apart, produced wildly different results.
| Run | BS=1 Decode (tok/s) | BS=4 Decode (tok/s) |
|---|---|---|
| Early run (favorable) | 42.3 | 83.8 |
| Later run (unfavorable) | 28.7 | 63.5 |
| Documented baseline | 40.2 | 46.6 |
A 42 → 29 tok/s swing on BS=1 is a 31% drop with zero code changes. This makes it nearly impossible to evaluate whether an optimization actually works.
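To make the stakes concrete: with the BS=4 numbers from this post, the very same optimization reads as either +80% or +48% depending on which run you compare. A tiny illustrative calculation (the `improvement` helper is ours, not part of any benchmark tool):

```python
# Numbers are the BS=4 figures from the tables in this post.
baseline = 46.6    # documented baseline (tok/s, measured on another day)
favorable = 83.8   # early "favorable" run of the modified code
typical = 69.0     # median of repeated runs in a controlled session

def improvement(new, old):
    """Percent improvement of `new` over `old`."""
    return (new - old) / old * 100

# Cherry-picking the favorable run inflates the result:
print(f"favorable vs baseline: +{improvement(favorable, baseline):.0f}%")  # ~ +80%
# The controlled median tells the honest story:
print(f"typical   vs baseline: +{improvement(typical, baseline):.0f}%")    # ~ +48%
```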
- Hardware: Apple M1 (MacBookPro17,1), 8 GPU cores, 16GB unified memory
- OS: macOS 26.3.1 (Tahoe)
- Model: Qwen3-0.6B (BF16)
- Benchmark: `sglang.bench_one_batch` with `input_len=60`, `output_len=10`
- Monitoring: `apple-smi` v0.1.4 — captures GPU/CPU temperature, power draw, GPU frequency, and memory usage
We ran 6 sequential benchmark iterations with apple-smi snapshots before and after each run:
Run 1: BS=1 (cold start)
Run 2: BS=4 (immediately after)
Run 3: BS=1 (after 15s cooldown)
Run 4: BS=4 (after 15s cooldown)
Run 5: BS=1 (back-to-back, no cooldown)
Run 6: BS=4 (back-to-back, no cooldown)
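The protocol above can be sketched as a small harness. This is a hypothetical sketch: the `apple-smi` invocation is an assumption about its CLI, and `bench` stands in for whatever callable runs `bench_one_batch` and returns decode tok/s.

```python
import subprocess
import time

# (label, batch_size, cooldown_before_s) -- mirrors runs 1-6 above.
RUN_PLAN = [
    ("BS=1 cold",   1,  0),
    ("BS=4 cold",   4,  0),
    ("BS=1 cooled", 1, 15),
    ("BS=4 cooled", 4, 15),
    ("BS=1 hot",    1,  0),
    ("BS=4 hot",    4,  0),
]

def snapshot():
    """Capture system state. Assumes a bare `apple-smi` call prints a summary."""
    return subprocess.run(["apple-smi"], capture_output=True, text=True).stdout

def run_protocol(bench, snap=snapshot, sleep=time.sleep):
    """Run the plan sequentially, bracketing each run with snapshots."""
    results = []
    for label, bs, cooldown in RUN_PLAN:
        if cooldown:
            sleep(cooldown)          # thermal recovery between runs
        pre = snap()
        tok_s = bench(bs)            # decode throughput for this run
        post = snap()
        results.append((label, bs, tok_s, pre, post))
    return results
```

Injecting `snap` and `sleep` keeps the harness testable without touching real hardware.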
| Run | Label | Decode (tok/s) | GPU Temp Pre→Post | CPU Temp Pre→Post | Power Pre→Post |
|---|---|---|---|---|---|
| 1 | BS=1 cold | 31.6 | 30°C → 60°C | 73°C → 74°C | 17W → 18W |
| 2 | BS=4 cold | 67.8 | 59°C → 62°C | 73°C → 76°C | 16W → 24W |
| 3 | BS=1 cooled | 32.7 | 30°C → 64°C | 73°C → 74°C | 13W → 19W |
| 4 | BS=4 cooled | 69.7 | 30°C → 60°C | 70°C → 76°C | 13W → 23W |
| 5 | BS=1 hot | 30.5 | 60°C → 62°C | 74°C → 76°C | 17W → 22W |
| 6 | BS=4 hot | 69.1 | 62°C → 65°C | 75°C → 77°C | 18W → 21W |
The M1's GPU temperature swings from 30°C to 65°C during a single benchmark run. Apple Silicon uses aggressive thermal management — when the die temperature rises, the SoC reduces clock speeds and power delivery to stay within its thermal envelope.
Key observation: the GPU starts "cold" at 30°C after idle periods, heats to 60°C+ during computation, and the warmup phase itself changes the thermal state for the benchmark phase that follows. This means:
- A "cold" first run benefits from peak clocks during warmup but may throttle during the benchmark
- A "hot" follow-up run starts already throttled
The 13-inch M1 MacBook Pro (MacBookPro17,1) relies on a compact thermal design with minimal active cooling, so little heat is dissipated between runs, making it especially susceptible to this effect.
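Because the die cools slowly when idle, gating on measured temperature is more reliable than a fixed sleep. A minimal sketch, assuming you can supply a callable that reads the GPU temperature (e.g. a parser over `apple-smi` output, which this sketch deliberately leaves abstract):

```python
import time

def wait_for_cooldown(read_gpu_temp_c, target_c=35.0, timeout_s=180.0, poll_s=5.0):
    """Block until the GPU cools to `target_c` or the timeout expires.

    `read_gpu_temp_c` is any callable returning the current GPU temperature
    in degrees Celsius. Returns True if the target was reached, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_gpu_temp_c() <= target_c:
            return True
        time.sleep(poll_s)   # avoid hammering the sensor
    return False
```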
During the runs, memory telemetry showed:

```
RAM usage:  11,340 / 16,384 MiB (69% used)
Swap usage: 7,842 MiB
```
With only 16GB of unified memory shared between CPU, GPU, and system, and ~8GB of swap active, the system is under significant memory pressure. On Apple Silicon, GPU memory is system memory — there's no dedicated VRAM. This means:
- The OS may page out GPU buffers to swap and page them back in, adding latency spikes
- Background processes competing for memory cause unpredictable eviction of model weights or KV cache buffers
- Swap I/O competes with the SSD controller for bus bandwidth
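A cheap pre-flight check is to refuse to benchmark while swap is heavily used. A sketch that parses the output of macOS's `sysctl vm.swapusage` (the output shape in the docstring is an assumption based on typical macOS output; the ~2 GiB limit follows the guideline later in this post):

```python
import re

def parse_swap_used_mib(sysctl_line):
    """Extract swap-in-use (MiB) from `sysctl vm.swapusage` output.

    Assumed shape (typical macOS output):
      vm.swapusage: total = 9216.00M  used = 7842.00M  free = 1374.00M
    """
    m = re.search(r"used\s*=\s*([\d.]+)M", sysctl_line)
    if m is None:
        raise ValueError("unrecognized vm.swapusage output")
    return float(m.group(1))

def swap_is_acceptable(sysctl_line, limit_mib=2048.0):
    """True if active swap is under the ~2 GiB 'safe to benchmark' threshold."""
    return parse_swap_used_mib(sysctl_line) <= limit_mib
```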
The GPU frequency was consistently reported at ~715 MHz across all runs, which is near the M1's base GPU clock. However, Apple's DVFS operates at a finer granularity than what apple-smi can capture at 1-second intervals — the GPU may boost to higher frequencies for microsecond bursts during computation, with the boost headroom depending on:
- Current thermal state (more headroom when cold)
- Power budget (the M1 has a 13-15W TDP)
- Whether the CPU is also active (CPU and GPU share the power budget)
macOS runs numerous background services that spike CPU and I/O:
- Spotlight indexing (`mds_stores`) can saturate SSD bandwidth
- Time Machine snapshots
- `cloudd` and iCloud sync
- `WindowServer` compositing — even rendering the terminal output competes for GPU cycles
The CPU temperature baseline of 70-76°C even before our benchmark starts indicates significant background activity.
| Condition | Decode (tok/s) | Variance from mean |
|---|---|---|
| Cold start | 31.6 | ±0% (reference) |
| After cooldown | 32.7 | +3.5% |
| Back-to-back (hot) | 30.5 | -3.5% |
Range: 30.5 – 32.7 tok/s (~7% spread). Any single-session difference smaller than this should be treated as noise. However, comparing against the documented baseline of 40.2 tok/s (measured on a different day) shows a 24% gap — likely due to different background load and memory pressure conditions.
| Condition | Decode (tok/s) | Variance from mean |
|---|---|---|
| Cold start | 67.8 | -1.6% |
| After cooldown | 69.7 | +1.2% |
| Back-to-back (hot) | 69.1 | +0.3% |
Range: 67.8 – 69.7 tok/s (~3% spread). The BS=4 batched path is more stable because it's GPU-compute-bound (the GPU stays busy), reducing the impact of scheduling jitter.
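The spreads quoted above are simple range-over-mean figures. For reference, computed from the tables above (the `spread_pct` helper is ours, purely illustrative):

```python
def spread_pct(samples):
    """Max-minus-min range as a percent of the mean (the 'spread' quoted above)."""
    m = sum(samples) / len(samples)
    return (max(samples) - min(samples)) / m * 100

bs1 = [31.6, 32.7, 30.5]   # BS=1: cold / cooled / hot runs
bs4 = [67.8, 69.7, 69.1]   # BS=4: cold / cooled / hot runs

print(f"BS=1 spread: {spread_pct(bs1):.1f}%")   # ~7%
print(f"BS=4 spread: {spread_pct(bs4):.1f}%")   # ~3%
```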
Every BS=4 run shows Decode 0 at 35-38 tok/s while Decode 1+ runs at 66-76 tok/s. This ~2× gap on the first step is caused by:
- Mode transition overhead: the hybrid cache switches from native mlx-lm caches to contiguous buffers, patching attention modules on the first batched decode
- MLX lazy evaluation: the first decode triggers JIT compilation of the new computation graph
- Cache warmup: the contiguous KV buffers are freshly allocated and may not be in the GPU's cache hierarchy
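A practical consequence: compute steady-state decode throughput from the per-step numbers with the first step(s) excluded. A sketch, with an illustrative helper name and per-step figures in the range reported above:

```python
from statistics import median

def steady_state_decode_tput(per_step_tok_s, warmup_steps=1):
    """Median decode throughput, discarding the first step(s).

    Decode 0 pays one-time costs (cache-mode switch, graph compilation),
    so including it drags the number down and adds noise.
    """
    steady = per_step_tok_s[warmup_steps:]
    if not steady:
        raise ValueError("need at least one post-warmup step")
    return median(steady)

# Illustrative per-step numbers: slow first step, then steady decoding.
steps = [36.0, 68.0, 70.0, 69.5, 71.0]
```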
Based on these findings, here are concrete recommendations:
- Close unnecessary applications — especially browsers, IDEs, and anything GPU-accelerated
- Wait for background indexing to finish — check with `sudo fs_usage -f filesys mds_stores`
- Monitor memory pressure — if swap exceeds ~2GB, reboot or close apps first
- Let the system idle for 2+ minutes before starting — check with `apple-smi` that GPU temp is at baseline (~30°C)
- Always use a warmup phase — the benchmark tool already does this, but ensure warmup is at least 32 tokens
- Run benchmarks sequentially, never in parallel — even two Python processes compete for the unified memory bus
- Add cooldown periods — at least 15 seconds between runs to allow thermal recovery
- Run each configuration 3+ times — report median, not mean (mean is skewed by outliers)
- Use A/B testing in the same session — `git stash` → run baseline → `git stash pop` → run modified, back-to-back
- Report the system state — include `apple-smi` output with results so reviewers can assess conditions
- Only trust relative comparisons — absolute numbers are meaningless across sessions; compare against a baseline run in the same session
- Use median decode throughput, not total throughput — the "Total" metric includes prefill and amortizes differently across batch sizes
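The "report median, not mean" guideline above is easy to see with a toy example: one throttled outlier drags the mean well below every typical run, while the median barely moves (numbers are illustrative, in the BS=1 range from this post):

```python
from statistics import mean, median

# Three normal runs plus one that hit heavy thermal throttling.
runs = [31.6, 32.7, 30.5, 22.0]

print(f"mean   = {mean(runs):.1f} tok/s")    # dragged down by the outlier
print(f"median = {median(runs):.2f} tok/s")  # barely moved
```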
```shell
# ✅ Good: sequential A/B with system monitoring
apple-smi                        # Check system state
git stash                        # Switch to baseline
bench_one_batch --batch-size 4   # Baseline run
sleep 15                         # Cooldown
git stash pop                    # Switch to modified
bench_one_batch --batch-size 4   # Modified run
apple-smi                        # Verify system state hasn't degraded
```

```shell
# ❌ Bad: parallel runs
bench_one_batch --batch-size 1 &   # Competing for GPU
bench_one_batch --batch-size 4 &   # Results are meaningless
```

On Apple Silicon, you cannot trust a single benchmark number. The unified memory architecture, constrained thermal design, and aggressive DVFS create a system where performance varies by 7-30% depending on conditions invisible to the application.
The fix is not to eliminate variance (you can't), but to measure it, report it, and design your comparison methodology to account for it. Use apple-smi to capture system state, run back-to-back A/B comparisons in the same thermal window, and report medians with system context.
Our hybrid KV cache optimization for SGLang's MLX backend shows a consistent ~48% improvement at BS=4 (69 tok/s vs 47 tok/s baseline) when measured with proper methodology — not the 80% we initially reported from a cherry-picked favorable run.
Data collected on March 13, 2026 using apple-smi v0.1.4 and SGLang's bench_one_batch tool. Model: Qwen3-0.6B on Apple M1 MacBook Pro (16GB).