GLM-4.6 on Mac Studio: Complete Guide


The biggest, baddest GLM-4.6 running locally on Metal GPU

This guide covers everything from zero to serving GLM-4.6 via OpenAI-compatible API on your Mac Studio, plus uploading your merged model to Hugging Face.


Part 1: Local Setup (Nothing → Serving)

Prerequisites

Hardware:

  • Mac Studio (Apple Silicon)
  • 192GB Unified Memory
  • ~500GB free disk space

OS: macOS (any recent version)

What you'll get: GLM-4.6 running on Metal GPU, accessible via OpenAI-compatible API for use in OpenCode and other IDEs.


Step 1: Build llama.cpp with Metal

# Clone and build with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake . -B build -DGGML_METAL=ON
cmake --build build --config Release -j

Time: 5-10 minutes
What this does: Compiles llama.cpp to use your Mac's GPU via Metal framework
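Optional sanity check before moving on: confirm the binaries used in the rest of this guide were actually built. The `--version` flag is supported by recent llama.cpp builds but may be missing on older revisions, so treat this as a rough check; Metal usage is ultimately confirmed when a model loads.

```bash
# Confirm the binaries used later in this guide exist
ls build/bin/ | grep -E 'llama-(cli|server|gguf-split)'

# Print build info (skip if your revision predates --version)
./build/bin/llama-cli --version
```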


Step 2: Download and Merge Model

Choose your quantization:

  • Q4_K_M (140GB) - Safe choice, excellent quality, comfortable fit
  • Q5_K_M (180GB) - Higher quality, tighter fit, slightly risky

Recommended: Q4_K_M for reliability.

brew install aria2
pip3 install 'huggingface_hub<1.0' hf_transfer --break-system-packages

# Download Q4_K_M with the aria2c helper scripts (run-aria2c.sh and monitor-aria2c.sh are included at the end of this guide)
nohup ~/bin/run-aria2c.sh &
disown %1

# Monitor download
watch -n 5 ~/bin/monitor-aria2c.sh

# Merge sharded files into single GGUF
./build/bin/llama-gguf-split --merge glm46/Q4_K_M/GLM-4.6-Q4_K_M-00001-of-*.gguf glm46-q4.gguf

Time: 10-15 hours for download (varies with internet speed), 15-20 minutes for merge
What this does: Downloads the sharded files (five for Q4_K_M) from Unsloth's repo and combines them into a single GGUF that llama.cpp can use now and Ollama can use once GLM-4.6 is supported
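If you'd rather not use the aria2c helper scripts, the same shards can be pulled with huggingface-cli (this is the command shown in the build-process notes in Part 3), and interrupted downloads resume when you rerun it:

```bash
# Alternative download path: same Unsloth shards via huggingface-cli
huggingface-cli download unsloth/GLM-4.6-GGUF \
  --include "Q4_K_M/*" \
  --local-dir ./glm46
```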


Step 3: Test with CLI (Interactive Mode)

cd ~/llama.cpp

# Run interactive chat
./build/bin/llama-cli \
  --model glm46-q4.gguf \
  --jinja \
  --threads -1 \
  --n-gpu-layers 999 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --ctx-size 32768 \
  -ot ".ffn_.*_exps.=CPU" \
  --color \
  -i

What the flags mean:

  • --jinja - Required for GLM-4.6's chat template (without this, output is garbage)
  • --n-gpu-layers 999 - Push as much as possible to Metal GPU
  • -ot ".ffn_.*_exps.=CPU" - Keep MoE expert layers on CPU (prevents memory overflow)
  • --ctx-size 32768 - Context window (32K tokens)
  • -i - Interactive mode

Test it: Type a message, press Enter, verify you get good responses using Metal GPU.

Exit: Press Ctrl+C at the prompt (note: /bye is an Ollama convention; llama-cli does not use it)
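For a quick non-interactive smoke test (handy for confirming the chat template works before starting the server), a single-prompt run looks roughly like this; -p and -n are standard llama-cli flags, but check your build's --help if it is older:

```bash
# One-shot generation: exits after ~128 tokens instead of entering chat mode
./build/bin/llama-cli \
  --model glm46-q4.gguf \
  --jinja \
  --n-gpu-layers 999 \
  -ot ".ffn_.*_exps.=CPU" \
  --ctx-size 8192 \
  -p "Write a haiku about unified memory." \
  -n 128
```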


Step 4: Serve as OpenAI-Compatible API

cd ~/llama.cpp

# Start server
./build/bin/llama-server \
  --model glm46-q4.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  -ot ".ffn_.*_exps.=CPU" \
  --threads -1 \
  --jinja \
  --alias "glm-4.6" \
  --chat-template auto

API endpoint: http://localhost:8080/v1
Model name: glm-4.6

Keep this terminal open; the server runs in the foreground. Use Command+Tab to switch to other apps.
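Before wiring up an IDE, it's worth hitting the endpoint directly. A minimal request against llama-server's OpenAI-compatible chat completions route looks like this (the model field should match the --alias you set):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64
  }'
```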


Step 5: Configure OpenCode

In OpenCode settings:

  1. Open OpenCode settings/preferences
  2. Find "AI Provider" or "Model Configuration"
  3. Add new provider:
    • Provider: OpenAI
    • Base URL: http://localhost:8080/v1
    • API Key: not-needed (or leave blank)
    • Model: glm-4.6

Works with: OpenCode, Cursor, Continue, Cline, or any IDE that supports custom OpenAI endpoints.
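Many of these tools are built on the OpenAI SDK, which reads its base URL and key from environment variables. If your client supports that, a shell profile entry like the following is a quick way to point everything at the local server (variable names assume the standard OpenAI SDK conventions; some tools ignore them and need explicit settings):

```bash
# Point OpenAI-SDK-based clients at the local llama-server
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"   # llama-server does not check the key
```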


Performance Expectations

Speed: 5-10 tokens/second on Mac Studio with 192GB
Memory usage (Q4_K_M):

  • Model: ~140GB
  • Metal GPU: ~30-40GB (attention + shared layers)
  • CPU/RAM: ~100-110GB (MoE expert layers)
  • KV cache: ~20-30GB
  • Total: ~160-170GB working set

Memory usage (Q5_K_M):

  • Model: ~180GB
  • Metal GPU: ~40-60GB
  • CPU/RAM: ~120-140GB
  • KV cache: ~20-30GB
  • Total: ~200-220GB working set (exceeds 192GB of physical RAM; expect swap pressure)

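To see whether your machine actually lands in these ranges (and to catch swap pressure with Q5_K_M), the stock macOS tools are enough; a rough check from another terminal while the model is loaded:

```bash
# Overall physical memory in use (rough view of the working set)
top -l 1 | grep PhysMem

# Swap usage: a growing "used" value between checks means you're over budget
sysctl vm.swapusage
```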
Key Concepts Explained

GGUF vs Safetensors:

  • GGUF = Format llama.cpp uses (optimized for inference)
  • Safetensors = Format Hugging Face uses (optimized for training/storage)

Sharded vs Unsharded:

  • Sharded = Model split into multiple files (Ollama doesn't support these yet)
  • Unsharded = Single file (works everywhere)
  • Unsloth ships sharded files; we merged them into a single unsharded one

Quantization (Q4_K_M, Q5_K_M, Q8_0):

  • Q8_0 = Highest quality, ~250GB
  • Q5_K_M = Excellent quality, ~180GB
  • Q4_K_M = Very good quality, ~140GB
  • Q2_K = Low quality, ~70GB
  • Higher Q number = better quality, bigger file

Metal vs CPU:

  • Metal = Apple's GPU framework (like CUDA for NVIDIA)
  • Unified Memory = On Mac, GPU and CPU share the same RAM pool
  • MoE offloading = Keep some layers on CPU to avoid GPU memory limits

llama.cpp vs Ollama:

  • llama.cpp = Lower-level inference engine with full control
  • Ollama = User-friendly wrapper around llama.cpp (easier but less flexible)
  • Ollama doesn't support sharded files yet; llama.cpp does everything

Part 2: Upload to Hugging Face

Prep Your Model Card

Create README.md with the content from Part 3 below.

Upload Files

# Login to Hugging Face (you'll need an account and API token)
huggingface-cli login

# Create repo on HF website first:
# Go to https://huggingface.co/new
# Repository name: GLM-4.6-Q4_K_M-Unsharded-Metal
# License: Same as original GLM-4.6
# Click "Create model"

# Upload the merged GGUF file
huggingface-cli upload YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
  ~/llama.cpp/glm46-q4.gguf \
  glm46-q4.gguf

# Upload README
huggingface-cli upload YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
  README.md \
  README.md

# Create .gitattributes for large file handling
echo "*.gguf filter=lfs diff=lfs merge=lfs -text" > .gitattributes

huggingface-cli upload YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
  .gitattributes \
  .gitattributes

Time: 1-3 hours depending on upload speed (140GB file)
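One hedged tip, assuming you installed hf_transfer back in Step 2: enabling it for the upload as well can shorten that window considerably, since huggingface_hub only uses the accelerated Rust backend when the environment variable is set.

```bash
# Use the hf_transfer backend for the 140GB upload
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli upload YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
  ~/llama.cpp/glm46-q4.gguf \
  glm46-q4.gguf
```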


Part 3: Hugging Face Metadata & Content

Model Card (README.md)

# GLM-4.6-Q4_K_M-Unsharded-Metal

**Single-file GGUF for Mac Studio and Apple Silicon**

Unsharded Q4_K_M quantization of GLM-4.6 optimized for Apple Silicon Macs with Metal GPU acceleration.

## Quick Start

```bash
# Download
huggingface-cli download YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal

# Run with llama.cpp
./llama-server \
  --model glm46-q4.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 999 \
  -ot ".ffn_.*_exps.=CPU" \
  --jinja \
  --alias "glm-4.6"

Access at http://localhost:8080/v1 (OpenAI-compatible API)

What This Is

The original GLM-4.6 Q4_K_M from Unsloth comes as 5-6 separate files. Ollama doesn't support sharded files yet. This is those files merged into one, ready to use with llama.cpp and Metal GPU.

Why Q4_K_M?

  • Sweet spot for quality vs size
  • Fits comfortably in 192GB Mac Studio
  • Excellent performance (minimal quality loss vs Q8_0)
  • Fast inference with Metal acceleration

Technical Details

Model: GLM-4.6 (355B parameters, MoE architecture)
Quantization: Q4_K_M
File size: ~140GB
Context window: 200K tokens
Hardware tested: Mac Studio, 192GB unified memory

Memory usage:

  • Model: 140GB
  • Metal GPU: ~30-40GB (attention layers)
  • CPU/RAM: ~100-110GB (MoE experts)
  • Total working set: ~170GB

Performance: 5-10 tokens/sec on Mac Studio M2 Ultra

Why Unsharded?

Sharded (original Unsloth):

  • 5-6 separate files
  • Ollama can't use it
  • Annoying to manage

Unsharded (this repo):

  • Single file
  • Works with Ollama (once they add GLM-4.6 support)
  • Works with llama.cpp now
  • Easier to download and use

Usage with llama.cpp

Interactive Chat

./llama-cli \
  --model glm46-q4.gguf \
  --jinja \
  --n-gpu-layers 999 \
  -ot ".ffn_.*_exps.=CPU" \
  --ctx-size 32768 \
  -i

OpenAI-Compatible Server

./llama-server \
  --model glm46-q4.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 999 \
  -ot ".ffn_.*_exps.=CPU" \
  --jinja \
  --alias "glm-4.6" \
  --chat-template auto

Use with any OpenAI-compatible client:

  • Endpoint: http://localhost:8080/v1
  • Model: glm-4.6
  • API Key: Not required

Works in: OpenCode, Cursor, Continue, Cline, or any IDE supporting custom OpenAI endpoints.
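A quick way to confirm the server is up and the alias is registered (assuming llama-server's standard OpenAI-compatible routes):

```bash
# Should list a model with id "glm-4.6"
curl http://localhost:8080/v1/models
```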

Key Flags Explained

  • --jinja - REQUIRED for GLM-4.6 chat template
  • --n-gpu-layers 999 - Load everything possible to Metal GPU
  • -ot ".ffn_.*_exps.=CPU" - Keep MoE expert layers on CPU (prevents OOM)
  • --ctx-size 32768 - Context window (use up to 200K if you have RAM)

Without --jinja the output will be gibberish. This flag tells llama.cpp to use GLM-4.6's special chat format.

Hardware Requirements

Minimum:

  • Apple Silicon Mac (M1/M2/M3/M4)
  • 160GB+ RAM
  • 150GB free disk space

Recommended:

  • Mac Studio with 192GB unified memory
  • 200GB free disk space
  • macOS Sonoma or later

Will NOT work on:

  • Intel Macs (no Metal support)
  • Macs with <160GB RAM
  • NVIDIA/AMD GPUs (use the CUDA or ROCm/Vulkan builds instead)

Comparison: Q4_K_M vs Other Quants

| Quant  | Size  | Quality   | Speed     | Fits in 192GB? |
|--------|-------|-----------|-----------|----------------|
| Q8_0   | 250GB | Highest   | Slower    | Tight fit      |
| Q5_K_M | 180GB | Excellent | Fast      | Yes            |
| Q4_K_M | 140GB | Very good | Faster    | Yes (safe)     |
| Q3_K_M | 110GB | Good      | Fastest   | Yes            |
| Q2_K   | 70GB  | Poor      | Very fast | Yes            |

Q4_K_M is the sweet spot for most users. Q5_K_M if you want maximum quality and can spare the extra 40GB.

Known Issues

  1. Tool calling format differs from original - Uses Qwen3-style JSON instead of XML (Ollama limitation)
  2. Ollama native support pending - Use llama.cpp directly for now
  3. First token slow with full context - Expected with 200K context window

Build Process (How This Was Made)

# 1. Build llama.cpp with Metal
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake . -B build -DGGML_METAL=ON
cmake --build build --config Release -j

# 2. Download sharded files
pip3 install huggingface-hub
huggingface-cli download unsloth/GLM-4.6-GGUF --include "Q4_K_M/*" --local-dir ./glm46

# 3. Merge into single file
./build/bin/llama-gguf-split --merge \
  glm46/Q4_K_M/GLM-4.6-Q4_K_M-00001-of-*.gguf \
  glm46-q4.gguf

Total time: ~15 hours (mostly download)

Credits

  • Original Model: Zhipu AI (GLM-4.6)
  • Quantization: Unsloth (sharded Q4_K_M GGUF)
  • Merge: BrianMac (this unsharded version)
  • Runtime: llama.cpp team
  • Inspiration: Frustration with Ollama's lack of sharded GGUF support

License

Same as original GLM-4.6 model from Zhipu AI.

Related Models

Support

Having issues? Check these:

  1. Did you use --jinja flag? (99% of problems)
  2. Is your Mac Studio running low on RAM? (check Activity Monitor)
  3. Are you on Apple Silicon? (Intel Macs won't work)
  4. Try reducing --ctx-size if running out of memory

Benchmarks

Tested on Mac Studio M2 Ultra, 192GB unified memory:

  • Tokens/sec: 6-8 t/s average
  • First token latency: 2-3 seconds (empty context)
  • First token latency: 10-15 seconds (32K context)
  • Memory usage: 165-175GB total
  • GPU utilization: 40-60% (attention layers)
  • CPU utilization: 20-40% (MoE layers)

For comparison:

  • DeepSeek-V3 Q4_K_M: 4-6 t/s (similar architecture)
  • Qwen2.5-72B Q5_K_M: 10-12 t/s (smaller model)
  • Llama-3.1-70B Q8_0: 12-15 t/s (non-MoE)

MoE models are inherently slower due to routing overhead, but GLM-4.6 offers significantly better reasoning and coding than similarly-sized dense models.
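If you want to reproduce numbers like these on your own machine, llama.cpp ships a llama-bench tool. A minimal sketch follows; note that MoE CPU-offload support in llama-bench varies by revision, so check --help on your build before adding an expert-offload option.

```bash
# Measure prompt-processing (pp) and token-generation (tg) throughput
./build/bin/llama-bench \
  -m glm46-q4.gguf \
  -ngl 999 \
  -p 512 \
  -n 128
```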


---

### Hugging Face Web UI Fields

**Model name:** `GLM-4.6-Q4_K_M-Unsharded-Metal`

**Short description:**  
`Unsharded Q4_K_M quantization of GLM-4.6 for Mac Studio and Apple Silicon. Single-file GGUF optimized for Metal GPU. 140GB, runs at 6-8 tokens/sec on 192GB Mac Studio.`

**Tags:**

glm glm-4.6 gguf quantized q4_k_m unsharded metal apple-silicon mac-studio llama-cpp conversational code reasoning moe


**Language:**

en (English)


**License:**

other (Same as original GLM-4.6 from Zhipu AI)


**Model type:**

Text Generation


**Library:**

GGUF


**Base model:**

zai-org/GLM-4.6


**Datasets used:**

(Leave blank - inference only)


**Additional metadata (in model card YAML frontmatter):**

```yaml
---
language:
  - en
license: other
tags:
  - glm
  - glm-4.6
  - gguf
  - quantized
  - q4_k_m
  - unsharded
  - metal
  - apple-silicon
  - mac-studio
  - llama-cpp
  - conversational
  - code
  - reasoning
  - moe
library_name: gguf
base_model: zai-org/GLM-4.6
inference: true
pipeline_tag: text-generation
model_type: glm
quantization: q4_k_m
---
```

Troubleshooting

Common Issues

"Model outputs gibberish"

  • Add --jinja flag (99% of the time this is the issue)

"Out of memory error"

  • Reduce --ctx-size to 16384 or 8192 (see the example after this list)
  • Try Q3_K_M instead of Q4_K_M
  • Close other apps to free RAM

"Not using Metal GPU"

  • Rebuild llama.cpp with -DGGML_METAL=ON
  • Check you're running on Apple Silicon (not Intel)
  • Verify in Activity Monitor → GPU tab

"Too slow / hanging"

  • Add -ot ".ffn_.*_exps.=CPU" flag
  • Reduce --n-gpu-layers to 50-60
  • First token with large context is always slow (expected)

"Download interrupted"

  • Resume with same huggingface-cli download command
  • It will continue from where it left off
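For the out-of-memory case above, the lowest-friction fix is usually shrinking the KV cache; a reduced-context variant of the Step 4 server command looks like this:

```bash
# Same server invocation, but with a 16K context to cut KV-cache memory
./build/bin/llama-server \
  --model glm46-q4.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 16384 \
  --n-gpu-layers 999 \
  -ot ".ffn_.*_exps.=CPU" \
  --jinja \
  --alias "glm-4.6"
```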

Why This Matters

Before this:

  • GLM-4.6 comes sharded (6 files)
  • Ollama doesn't support sharded files
  • Manual merge required for every user
  • Confusing for newcomers

After this:

  • Single file, ready to use
  • Download and run immediately
  • Works with all llama.cpp-based tools
  • Will work with Ollama once they add GLM-4.6 support

TL;DR: Making GLM-4.6 accessible to Mac Studio users without the hassle.


That's it! You now have the biggest, baddest GLM-4.6 running locally on your Mac Studio, serving OpenAI-compatible API, and (optionally) uploaded to Hugging Face for others to use.

monitor-aria2c.sh (the download monitor used with watch in Step 2):

#!/bin/sh
set -e
echo "## Processes\n"
ps -ef | grep aria2c | rg -v 'grep aria2c' | rg -v '(monitor|run)-aria2c.sh' | rg -v 'rg -v '
echo "\n\n## Latest files\n"
# ls -lhtr glm46/Q4_K_M/ | tail
fd -l -t f '' glm46
echo "\n\n## Latest log messages\n"
tail nohup.out | tac
run-aria2c.sh (the downloader launched with nohup in Step 2):

#!/bin/bash
BASE_URL="https://huggingface.co/unsloth/GLM-4.6-GGUF/resolve/main/Q4_K_M"
OUTPUT_DIR="glm46/Q4_K_M"

# Fetch the five Q4_K_M shards with 16 parallel connections each,
# retrying indefinitely so a flaky connection resumes instead of failing.
for i in {1..5}; do
  FILE=$(printf "GLM-4.6-Q4_K_M-%05d-of-00005.gguf" $i)
  echo "Downloading $FILE..."
  aria2c -x 16 -s 16 -k 1M --max-tries=0 --retry-wait=3 \
    "${BASE_URL}/${FILE}" \
    -d "$OUTPUT_DIR" \
    -o "$FILE"
  if [ $? -ne 0 ]; then
    echo "Failed to download $FILE"
    exit 1
  fi
done
echo "All files downloaded successfully!"