Investigating performance variance in MLX inference on a MacBook Pro M1
We observed up to 30% variance in decode throughput across benchmark runs of the same code on the same hardware, enough to turn a "no regression" into an apparent 14% slowdown, or inflate a 48% improvement to 80%. Using apple-smi to monitor GPU temperature, power draw, and memory pressure, we identified three root causes: thermal throttling, memory pressure from unified memory, and DVFS (Dynamic Voltage and Frequency Scaling). This post documents the methodology and provides guidelines for producing reliable benchmarks on Apple Silicon.
While developing a hybrid KV cache for SGLang's MLX backend on Apple Silicon, we encountered a frustrating situation: the same benchmark, run minutes apart, produced wildly different results.
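One way to put a number on that kind of run-to-run spread is to repeat the benchmark several times and summarize the results before comparing anything. The sketch below is a minimal illustration, not the tooling used for this post: `summarize_runs` and `apparent_change_pct` are hypothetical helper names, and the throughput values are made-up placeholders rather than measurements.

```python
import statistics


def summarize_runs(tokens_per_sec):
    """Summarize repeated decode-throughput measurements of one configuration."""
    if len(tokens_per_sec) < 2:
        raise ValueError("need at least two runs to estimate spread")
    median = statistics.median(tokens_per_sec)
    lowest, highest = min(tokens_per_sec), max(tokens_per_sec)
    return {
        "median": median,
        "min": lowest,
        "max": highest,
        # Full range of the runs, expressed as a percentage of the median.
        "spread_pct": (highest - lowest) * 100.0 / median,
        # Coefficient of variation: sample standard deviation relative to the mean.
        "cv_pct": statistics.stdev(tokens_per_sec) * 100.0
        / statistics.mean(tokens_per_sec),
    }


def apparent_change_pct(baseline, candidate):
    """Percentage change of candidate relative to baseline (negative = slower)."""
    return (candidate - baseline) * 100.0 / baseline


# Placeholder numbers for illustration only (tokens/s), NOT real measurements.
runs = [100.0, 90.0, 80.0, 110.0, 100.0]
summary = summarize_runs(runs)
print(summary)

# Comparing two single runs of the *same* code can look like a regression
# purely because of where each run happened to land within the spread.
print(apparent_change_pct(baseline=100.0, candidate=86.0))
```

If the spread of repeated runs is comparable to the difference between two configurations, a single-run comparison says little about which one is faster.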