R0CKSTAR yeahdongcn
yeahdongcn / blog3.md
Created March 13, 2026 07:58
Why Your LLM Benchmark Numbers Keep Changing on Apple Silicon

Investigating performance variance in MLX inference on a MacBook Pro M1

TL;DR

We observed up to 30% variance in decode throughput across benchmark runs of the same code on the same hardware: enough to turn a "no regression" into an apparent 14% slowdown, or to inflate a 48% improvement into an 80% one. Using apple-smi to monitor GPU temperature, power draw, and memory pressure, we identified three root causes: thermal throttling, unified-memory pressure, and DVFS (Dynamic Voltage and Frequency Scaling). This post documents the methodology and offers guidelines for producing reliable benchmarks on Apple Silicon.

The Problem

While developing a hybrid KV cache for SGLang's MLX backend on Apple Silicon, we encountered a frustrating situation: the same benchmark, run minutes apart, produced wildly different results.
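To make the variance concrete, here is a minimal sketch of the measurement we care about (the numbers below are illustrative, not the post's measured data): run the same decode benchmark several times and report the max-min spread relative to the median throughput.

```python
import statistics

# Hypothetical decode-throughput samples (tokens/sec) from repeated runs
# of the same benchmark on the same machine; values are illustrative.
samples = [69.8, 72.1, 88.0, 91.2, 95.3]

def spread_pct(runs):
    """Max-min spread as a percentage of the median throughput."""
    return (max(runs) - min(runs)) / statistics.median(runs) * 100

print(f"median: {statistics.median(samples):.1f} tok/s, "
      f"spread: {spread_pct(samples):.1f}%")
```

A spread near 30%, as in this sketch, is why a single-run comparison can flip a conclusion; reporting the median of several runs is the minimum defense.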

yeahdongcn / blog2.md
Last active March 12, 2026 06:23
Accelerating SGLang Inference on macOS: 5× Faster with Native MLX

SGLang already runs on macOS via PyTorch's MPS (Metal Performance Shaders) backend: you can launch a server, send requests, and get responses. But performance on Apple Silicon has been underwhelming. In this post, we describe how we integrated a native MLX execution path into SGLang that delivers up to 5.3× higher throughput while using significantly less memory.

The Problem: PyTorch MPS Overhead

When SGLang runs on macOS with PyTorch MPS, every operation (matrix multiplications, attention, normalization) goes through PyTorch's MPS backend, which translates PyTorch ops into Metal Performance Shaders. This translation layer adds substantial overhead:

  1. Op dispatch overhead: Each PyTorch operation is individually dispatched to MPS, missing optimization opportunities that come from fusing operations together.
  2. Memory duplication: PyTorch loads model weights into MPS memory and allocates a large KV cache, leaving less room for actual inference workloads.
  3. No fused kernels: Op
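The per-op dispatch cost in point 1 can be illustrated outside of PyTorch/MPS with a plain NumPy sketch (hypothetical, not SGLang code): issuing many small operations pays a fixed per-call cost that a single batched operation avoids.

```python
import time
import numpy as np

# Illustrative only: per-row dispatch vs. one batched op. Each separate
# call carries fixed dispatch overhead, one reason a fused/native path
# can beat an op-by-op translation layer.
x = np.ones((1000, 64), dtype=np.float32)

# Many small dispatches: 1000 separate calls, one per row.
t0 = time.perf_counter()
per_row = np.stack([row * 2.0 + 1.0 for row in x])
t_per_row = time.perf_counter() - t0

# One batched dispatch over the whole tensor.
t0 = time.perf_counter()
batched = x * 2.0 + 1.0
t_batched = time.perf_counter() - t0

assert np.allclose(per_row, batched)  # same result, very different cost
print(f"per-row: {t_per_row * 1e3:.2f} ms, batched: {t_batched * 1e3:.2f} ms")
```

The two paths compute identical results; only the number of dispatches differs, which is the effect a native MLX path with fused kernels targets.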
yeahdongcn / blog1.md
Last active March 12, 2026 06:23
🍎 Running SGLang Natively on macOS: LLMs and Diffusion Models on Apple Silicon

Why macOS Matters for Local AI

Apple Silicon machines have quietly become some of the most interesting systems for local AI workloads:

  • Powerful GPUs
  • Large unified memory (up to 192GB)
  • High memory bandwidth
  • Metal compute acceleration

Yet most modern inference frameworks still prioritize Linux + CUDA GPUs.

yeahdongcn / Dockerfile.vulkan
Created September 19, 2025 00:56
Vulkan SDK
ARG UBUNTU_VERSION=22.04
ARG DOCKERHUB_REGISTRY=docker.io/library
# ---------- Stage 1: Build Vulkan SDK ----------
FROM --platform=linux/arm64 ${DOCKERHUB_REGISTRY}/ubuntu:${UBUNTU_VERSION} AS builder
ENV DEBIAN_FRONTEND=noninteractive
# Install build tools and dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential cmake ninja-build git python3-pip curl pkg-config \
yeahdongcn / 📊 Weekly development breakdown
Last active May 16, 2025 01:55
📊 Weekly development breakdown
Docker     6 hrs 16 mins  █████████▏░░░░░░░░░░░  43.8%
Makefile   3 hrs 39 mins  █████▎░░░░░░░░░░░░░░░  25.5%
Go         55 mins        █▎░░░░░░░░░░░░░░░░░░░   6.5%
HTML       45 mins        █░░░░░░░░░░░░░░░░░░░░   5.3%
Python     37 mins        ▉░░░░░░░░░░░░░░░░░░░░   4.4%
yeahdongcn / harbor.sh
Last active February 24, 2022 01:00 (forked from kacole2/harbor.sh)
Quick Start Harbor Installation Script on Ubuntu 18.04
#!/bin/bash
# Harbor on Ubuntu 18.04
# Ask whether the install should use the IP address or the fully qualified
# domain name (FQDN) of the Harbor server.
PS3='Would you like to install Harbor based on IP or FQDN? '
select option in IP FQDN
do
  case $option in
    IP)
yeahdongcn / gist:21c10ead7284ca59eac45d445d78e34b
Created July 15, 2019 09:53
Check whether dock is visible
- (BOOL)isDockVisible
{
    pid_t pid = 0;
    // Find the Dock's process identifier among the running applications.
    for (NSRunningApplication *runningApp in
             [[NSWorkspace sharedWorkspace] runningApplications]) {
        if ([[runningApp bundleIdentifier] isEqualToString:@"com.apple.dock"]) {
            pid = [runningApp processIdentifier];
            break;
        }
    }
yeahdongcn / gist:aab7c4259fc74c138a4bd1c267fccfb9
Created July 15, 2019 09:01
Dock hidden/shown window properties
Printing description of entry:
{
    kCGWindowAlpha = 1;
    kCGWindowBounds = {
        Height = 1050;
        Width = 1680;
        X = 0;
        Y = 0;
    };
    kCGWindowIsOnscreen = 1;
yeahdongcn / electron-download cache location
Created October 22, 2018 01:36
If you encounter a checksum failure, clear the cache and try again.
Cache location
The location of the cache depends on the operating system, the defaults are:
Linux: $XDG_CACHE_HOME or ~/.cache/electron/
macOS: ~/Library/Caches/electron/
Windows: $LOCALAPPDATA/electron/Cache or ~/AppData/Local/electron/Cache/
You can set the ELECTRON_CACHE environment variable to set cache location explicitly.
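The lookup order described above can be sketched in Python (an illustration of the documented defaults, not electron-download's actual source; the function name is mine):

```python
import os
import sys
from pathlib import Path

def electron_cache_dir():
    """Resolve the electron download cache directory: ELECTRON_CACHE
    wins if set, otherwise fall back to the per-OS default."""
    explicit = os.environ.get("ELECTRON_CACHE")
    if explicit:
        return Path(explicit)
    home = Path.home()
    if sys.platform == "darwin":
        return home / "Library" / "Caches" / "electron"
    if sys.platform.startswith("win"):
        local = os.environ.get("LOCALAPPDATA")
        base = Path(local) if local else home / "AppData" / "Local"
        return base / "electron" / "Cache"
    # Linux and friends: honor XDG_CACHE_HOME if present.
    xdg = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else home / ".cache"
    return base / "electron"

print(electron_cache_dir())
```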
$ git archive --format zip --output "./output.zip" master -0