R0CKSTAR yeahdongcn
yeahdongcn / blog3.md
Created March 13, 2026 07:58
Why Your LLM Benchmark Numbers Keep Changing on Apple Silicon

Investigating performance variance in MLX inference on a MacBook Pro M1

TL;DR

We observed up to 30% variance in decode throughput across benchmark runs of the same code on the same hardware: enough to turn a "no regression" into an apparent 14% slowdown, or to inflate a 48% improvement into an 80% one. Using apple-smi to monitor GPU temperature, power draw, and memory pressure, we identified three root causes: thermal throttling, unified-memory pressure, and DVFS (Dynamic Voltage and Frequency Scaling). This post documents the methodology and offers guidelines for producing reliable benchmarks on Apple Silicon.

The Problem

While developing a hybrid KV cache for SGLang's MLX backend on Apple Silicon, we encountered a frustrating situation: the same benchmark, run minutes apart, produced wildly different results.
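To make the variance concrete, here is a minimal sketch of the measurement we care about (the numbers below are illustrative, not the post's measured data): run the same decode benchmark several times and report the max-min spread relative to the median throughput.

```python
import statistics

# Hypothetical decode-throughput samples (tokens/sec) from repeated runs
# of the same benchmark on the same machine; values are illustrative.
samples = [69.8, 72.1, 88.0, 91.2, 95.3]

def spread_pct(runs):
    """Max-min spread as a percentage of the median throughput."""
    return (max(runs) - min(runs)) / statistics.median(runs) * 100

print(f"median: {statistics.median(samples):.1f} tok/s, "
      f"spread: {spread_pct(samples):.1f}%")
```

A spread near 30%, as in this sketch, is why a single-run comparison can flip a conclusion; reporting the median of several runs is the minimum defense.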

yeahdongcn / blog2.md
Last active March 12, 2026 06:23
Accelerating SGLang Inference on macOS: 5× Faster with Native MLX

SGLang already runs on macOS via PyTorch's MPS (Metal Performance Shaders) backend: you can launch a server, send requests, and get responses. But performance on Apple Silicon has been underwhelming. In this post, we describe how we integrated a native MLX execution path into SGLang that delivers up to 5.3× higher throughput while using significantly less memory.

The Problem: PyTorch MPS Overhead

When SGLang runs on macOS with PyTorch MPS, every operation (matrix multiplications, attention, normalization) goes through PyTorch's MPS backend, which translates PyTorch ops into Metal Performance Shaders. This translation layer adds substantial overhead:

  1. Op dispatch overhead: Each PyTorch operation is individually dispatched to MPS, missing optimization opportunities that come from fusing operations together.
  2. Memory duplication: PyTorch loads model weights into MPS memory and allocates a large KV cache, leaving less room for actual inference workloads.
  3. No fused kernels: Op
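The per-op dispatch cost in point 1 can be illustrated outside of PyTorch/MPS with a plain NumPy sketch (hypothetical, not SGLang code): issuing many small operations pays a fixed per-call cost that a single batched operation avoids.

```python
import time
import numpy as np

# Illustrative only: per-row dispatch vs. one batched op. Each separate
# call carries fixed dispatch overhead, one reason a fused/native path
# can beat an op-by-op translation layer.
x = np.ones((1000, 64), dtype=np.float32)

# Many small dispatches: 1000 separate calls, one per row.
t0 = time.perf_counter()
per_row = np.stack([row * 2.0 + 1.0 for row in x])
t_per_row = time.perf_counter() - t0

# One batched dispatch over the whole tensor.
t0 = time.perf_counter()
batched = x * 2.0 + 1.0
t_batched = time.perf_counter() - t0

assert np.allclose(per_row, batched)  # same result, very different cost
print(f"per-row: {t_per_row * 1e3:.2f} ms, batched: {t_batched * 1e3:.2f} ms")
```

The two paths compute identical results; only the number of dispatches differs, which is the effect a native MLX path with fused kernels targets.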
yeahdongcn / blog1.md
Last active March 12, 2026 06:23
🍎 Running SGLang Natively on macOS: LLMs and Diffusion Models on Apple Silicon

Why macOS Matters for Local AI

Apple Silicon machines have quietly become some of the most interesting systems for local AI workloads:

  • Powerful GPUs
  • Large unified memory (up to 192GB)
  • High memory bandwidth
  • Metal compute acceleration

Yet most modern inference frameworks still prioritize Linux + CUDA GPUs.

yeahdongcn / Dockerfile.vulkan
Created September 19, 2025 00:56
Vulkan SDK
ARG UBUNTU_VERSION=22.04
ARG DOCKERHUB_REGISTRY=docker.io/library
# ---------- Stage 1: Build Vulkan SDK ----------
FROM --platform=linux/arm64 ${DOCKERHUB_REGISTRY}/ubuntu:${UBUNTU_VERSION} AS builder
ENV DEBIAN_FRONTEND=noninteractive
# Install build tools and dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential cmake ninja-build git python3-pip curl pkg-config \
yeahdongcn / 📊 Weekly development breakdown
Last active May 16, 2025 01:55
📊 Weekly development breakdown
Docker     6 hrs 16 mins  █████████▏░░░░░░░░░░░  43.8%
Makefile   3 hrs 39 mins  █████▎░░░░░░░░░░░░░░░  25.5%
Go         55 mins        █▎░░░░░░░░░░░░░░░░░░░   6.5%
HTML       45 mins        █░░░░░░░░░░░░░░░░░░░░   5.3%
Python     37 mins        ▉░░░░░░░░░░░░░░░░░░░░   4.4%
yeahdongcn / harbor.sh
Last active February 24, 2022 01:00 (forked from kacole2/harbor.sh)
Quick Start Harbor Installation Script on Ubuntu 18.04
#!/bin/bash
# Harbor on Ubuntu 18.04
# Ask whether the install should use the IP address or the fully qualified
# domain name (FQDN) of the Harbor server.
PS3='Would you like to install Harbor based on IP or FQDN? '
select option in IP FQDN
do
  case $option in
    IP)
yeahdongcn / gist:21c10ead7284ca59eac45d445d78e34b
Created July 15, 2019 09:53
Check whether dock is visible
- (BOOL)isDockVisible
{
    pid_t pid = 0;
    // Find the Dock's process identifier among the running applications.
    for (NSRunningApplication *runningApp in
             [[NSWorkspace sharedWorkspace] runningApplications]) {
        if ([[runningApp bundleIdentifier] isEqualToString:@"com.apple.dock"]) {
            pid = [runningApp processIdentifier];
            break;
        }
    }
yeahdongcn / gist:aab7c4259fc74c138a4bd1c267fccfb9
Created July 15, 2019 09:01
Dock hidden/shown window properties
Printing description of entry:
{
    kCGWindowAlpha = 1;
    kCGWindowBounds = {
        Height = 1050;
        Width = 1680;
        X = 0;
        Y = 0;
    };
    kCGWindowIsOnscreen = 1;
yeahdongcn / electron-download cache location
Created October 22, 2018 01:36
If you encounter a checksum failure, clear the cache and try again.
Cache location
The location of the cache depends on the operating system, the defaults are:
Linux: $XDG_CACHE_HOME or ~/.cache/electron/
macOS: ~/Library/Caches/electron/
Windows: $LOCALAPPDATA/electron/Cache or ~/AppData/Local/electron/Cache/
You can set the ELECTRON_CACHE environment variable to set cache location explicitly.
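The lookup order described above can be sketched in Python (an illustration of the documented defaults, not electron-download's actual source; the function name is mine):

```python
import os
import sys
from pathlib import Path

def electron_cache_dir():
    """Resolve the electron download cache directory: ELECTRON_CACHE
    wins if set, otherwise fall back to the per-OS default."""
    explicit = os.environ.get("ELECTRON_CACHE")
    if explicit:
        return Path(explicit)
    home = Path.home()
    if sys.platform == "darwin":
        return home / "Library" / "Caches" / "electron"
    if sys.platform.startswith("win"):
        local = os.environ.get("LOCALAPPDATA")
        base = Path(local) if local else home / "AppData" / "Local"
        return base / "electron" / "Cache"
    # Linux and friends: honor XDG_CACHE_HOME if present.
    xdg = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else home / ".cache"
    return base / "electron"

print(electron_cache_dir())
```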
$ git archive --format zip --output "./output.zip" master -0