The biggest, baddest GLM-4.6 running locally on Metal GPU
This guide covers everything from zero to serving GLM-4.6 via OpenAI-compatible API on your Mac Studio, plus uploading your merged model to Hugging Face.
Hardware:
- Mac Studio (Apple Silicon)
- 192GB Unified Memory
- ~500GB free disk space
OS: macOS (any recent version)
What you'll get: GLM-4.6 running on Metal GPU, accessible via OpenAI-compatible API for use in OpenCode and other IDEs.
```bash
# Clone and build with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake . -B build -DGGML_METAL=ON
cmake --build build --config Release -j
```

Time: 5-10 minutes
What this does: Compiles llama.cpp to use your Mac's GPU via Metal framework
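A quick sanity check before moving on (a minimal sketch, assuming the build finished cleanly; `--version` just prints build info and exits):

```bash
# Confirm the Metal-enabled binaries were produced
ls build/bin/llama-cli build/bin/llama-server

# Print version/build info to confirm the binary runs
./build/bin/llama-cli --version
```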
Choose your quantization:
- Q4_K_M (140GB) - Safe choice, excellent quality, comfortable fit
- Q5_K_M (180GB) - Higher quality, tighter fit, slightly risky
Recommended: Q4_K_M for reliability.
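Before committing to a download this size, confirm the machine has the headroom the chosen quant needs. These are stock macOS commands, nothing specific to this setup:

```bash
# Free disk space on your home volume (you want roughly 500GB free)
df -h ~

# Installed unified memory, converted from bytes to GB (192 on the target Mac Studio)
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB RAM"
```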
```bash
brew install aria2
pip3 install 'huggingface_hub<1.0' hf_transfer --break-system-packages
# Download Q4_K_M
nohup ~/bin/run-aria2c.sh &
disown %1
# Monitor download
watch -n 5 ~/bin/monitor-aria2c.sh
# Merge sharded files into single GGUF
./build/bin/llama-gguf-split --merge glm46/Q4_K_M/GLM-4.6-Q4_K_M-00001-of-*.gguf glm46-q4.gguf
```

Time: 10-15 hours for download (varies with internet speed), 15-20 minutes for merge
What this does: Downloads 5-6 sharded files from Unsloth's repo and combines them into one file Ollama and llama.cpp can use
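If you don't have the aria2 wrapper scripts set up, the same shards can be pulled directly with `huggingface-cli` (this is the command the quick reference near the end uses; `hf_transfer`, installed above, speeds it up, and the download resumes if interrupted):

```bash
# Accelerated download of the Q4_K_M shards from Unsloth's repo
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/GLM-4.6-GGUF \
  --include "Q4_K_M/*" \
  --local-dir ./glm46
```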
```bash
cd ~/llama.cpp
# Run interactive chat
./build/bin/llama-cli \
--model glm46-q4.gguf \
--jinja \
--threads -1 \
--n-gpu-layers 999 \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--ctx-size 32768 \
-ot ".ffn_.*_exps.=CPU" \
--color \
-i
```

What the flags mean:
- `--jinja` - Required for GLM-4.6's chat template (without this, output is garbage)
- `--n-gpu-layers 999` - Push as much as possible to the Metal GPU
- `-ot ".ffn_.*_exps.=CPU"` - Keep MoE expert layers on CPU (prevents memory overflow)
- `--ctx-size 32768` - Context window (32K tokens)
- `-i` - Interactive mode
Test it: Type a message, press Enter, verify you get good responses using Metal GPU.
Exit: Type /bye or press Ctrl+C
```bash
cd ~/llama.cpp
# Start server
./build/bin/llama-server \
--model glm46-q4.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 32768 \
--n-gpu-layers 999 \
-ot ".ffn_.*_exps.=CPU" \
--threads -1 \
--jinja \
--alias "glm-4.6" \
--chat-template auto
```

API endpoint: http://localhost:8080/v1
Model name: glm-4.6
Keep this terminal open. The server runs in the foreground. Use Command+Tab to switch to other apps.
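From a second terminal, a quick way to confirm the server is answering: llama-server exposes a `/health` endpoint alongside its OpenAI-compatible routes, and the request below is a plain chat-completions call (adjust the prompt as you like):

```bash
# Returns an "ok" status once the model has finished loading
curl http://localhost:8080/health

# Minimal OpenAI-style chat completion against the local server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```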
In OpenCode settings:
- Open OpenCode settings/preferences
- Find "AI Provider" or "Model Configuration"
- Add new provider:
- Provider: OpenAI
- Base URL: `http://localhost:8080/v1`
- API Key: `not-needed` (or leave blank)
- Model: `glm-4.6`
Works with: OpenCode, Cursor, Continue, Cline, or any IDE that supports custom OpenAI endpoints.
Speed: 5-10 tokens/second on Mac Studio with 192GB
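To measure throughput on your own machine rather than taking these numbers on faith, llama.cpp ships `llama-bench`. This is a minimal sketch: `-m` and `-ngl` are standard flags, while the `-ot` override is assumed to be available in recent llama-bench builds; if yours rejects the flag, update llama.cpp or skip this step.

```bash
# Reports prompt-processing (pp) and token-generation (tg) speed for this GGUF
./build/bin/llama-bench \
  -m glm46-q4.gguf \
  -ngl 999 \
  -ot ".ffn_.*_exps.=CPU"   # keeps MoE experts on CPU (assumes a recent build that accepts -ot)
```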
Memory usage (Q4_K_M):
- Model: ~140GB
- Metal GPU: ~30-40GB (attention + shared layers)
- CPU/RAM: ~100-110GB (MoE expert layers)
- KV cache: ~20-30GB
- Total: ~160-170GB working set
Memory usage (Q5_K_M):
- Model: ~180GB
- Metal GPU: ~40-60GB
- CPU/RAM: ~120-140GB
- KV cache: ~20-30GB
- Total: ~200-220GB working set (tight!)
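To watch the actual split on your own machine while the model is loaded, the stock macOS tools are enough (nothing here is specific to llama.cpp); Activity Monitor's Memory and GPU tabs show the same picture graphically:

```bash
# Installed unified memory, in bytes
sysctl -n hw.memsize

# Live page-level statistics (wired pages include Metal GPU buffers)
vm_stat

# One-shot report ending with the system-wide memory free percentage
memory_pressure
```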
GGUF vs Safetensors:
- GGUF = Format llama.cpp uses (optimized for inference)
- Safetensors = Format Hugging Face uses (optimized for training/storage)
Sharded vs Unsharded:
- Sharded = Model split into multiple files (Ollama doesn't support yet)
- Unsharded = Single file (works everywhere)
- Unsloth ships sharded, we merged it to unsharded
Quantization (Q4_K_M, Q5_K_M, Q8_0):
- Q8_0 = Highest quality, ~250GB
- Q5_K_M = Excellent quality, ~180GB
- Q4_K_M = Very good quality, ~140GB
- Q2_K = Low quality, ~70GB
- Higher Q number = better quality, bigger file
Metal vs CPU:
- Metal = Apple's GPU framework (like CUDA for NVIDIA)
- Unified Memory = On Mac, GPU and CPU share the same RAM pool
- MoE offloading = Keep some layers on CPU to avoid GPU memory limits
llama.cpp vs Ollama:
- llama.cpp = Lower-level inference engine with full control
- Ollama = User-friendly wrapper around llama.cpp (easier but less flexible)
- Ollama doesn't support sharded files yet; llama.cpp handles everything
Create README.md with the content from Part 3 below.
```bash
# Login to Hugging Face (you'll need an account and API token)
huggingface-cli login
# Create repo on HF website first:
# Go to https://huggingface.co/new
# Repository name: GLM-4.6-Q4_K_M-Unsharded-Metal
# License: Same as original GLM-4.6
# Click "Create model"
# Upload the merged GGUF file
huggingface-cli upload YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
~/llama.cpp/glm46-q4.gguf \
glm46-q4.gguf
# Upload README
huggingface-cli upload YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
README.md \
README.md
# Create .gitattributes for large file handling
echo "*.gguf filter=lfs diff=lfs merge=lfs -text" > .gitattributes
huggingface-cli upload YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
.gitattributes \
.gitattributes
```

Time: 1-3 hours depending on upload speed (140GB file)
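Uploading a ~140GB file goes noticeably faster with the same `hf_transfer` acceleration used for the download (assuming it's still installed from the download step; the env var just switches the transfer backend, the upload command itself is unchanged):

```bash
# Accelerated upload of the merged GGUF
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload \
  YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal \
  ~/llama.cpp/glm46-q4.gguf \
  glm46-q4.gguf
```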
# GLM-4.6-Q4_K_M-Unsharded-Metal
**Single-file GGUF for Mac Studio and Apple Silicon**
Unsharded Q4_K_M quantization of GLM-4.6 optimized for Apple Silicon Macs with Metal GPU acceleration.
## Quick Start
```bash
# Download
huggingface-cli download YOUR_USERNAME/GLM-4.6-Q4_K_M-Unsharded-Metal
# Run with llama.cpp
./llama-server \
--model glm46-q4.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 999 \
-ot ".ffn_.*_exps.=CPU" \
--jinja \
--alias "glm-4.6"
```

Access at http://localhost:8080/v1 (OpenAI-compatible API)
The original GLM-4.6 Q4_K_M from Unsloth comes as 5-6 separate files. Ollama doesn't support sharded files yet. This is those files merged into one, ready to use with llama.cpp and Metal GPU.
Why Q4_K_M?
- Sweet spot for quality vs size
- Fits comfortably in 192GB Mac Studio
- Excellent performance (minimal quality loss vs Q8_0)
- Fast inference with Metal acceleration
Model: GLM-4.6 (355B parameters, MoE architecture)
Quantization: Q4_K_M
File size: ~140GB
Context window: 200K tokens
Hardware tested: Mac Studio, 192GB unified memory
Memory usage:
- Model: 140GB
- Metal GPU: ~30-40GB (attention layers)
- CPU/RAM: ~100-110GB (MoE experts)
- Total working set: ~170GB
Performance: 5-10 tokens/sec on Mac Studio M2 Ultra
Sharded (original Unsloth):
- 5-6 separate files
- Ollama can't use it
- Annoying to manage
Unsharded (this repo):
- Single file
- Works with Ollama (once they add GLM-4.6 support)
- Works with llama.cpp now
- Easier to download and use
```bash
./llama-cli \
--model glm46-q4.gguf \
--jinja \
--n-gpu-layers 999 \
-ot ".ffn_.*_exps.=CPU" \
--ctx-size 32768 \
-i
```

```bash
./llama-server \
--model glm46-q4.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 999 \
-ot ".ffn_.*_exps.=CPU" \
--jinja \
--alias "glm-4.6" \
--chat-template auto
```

Use with any OpenAI-compatible client:
- Endpoint: `http://localhost:8080/v1`
- Model: `glm-4.6`
- API Key: Not required
Works in: OpenCode, Cursor, Continue, Cline, or any IDE supporting custom OpenAI endpoints.
- `--jinja` - REQUIRED for GLM-4.6's chat template
- `--n-gpu-layers 999` - Load everything possible to the Metal GPU
- `-ot ".ffn_.*_exps.=CPU"` - Keep MoE expert layers on CPU (prevents OOM)
- `--ctx-size 32768` - Context window (use up to 200K if you have RAM)
Without --jinja the output will be gibberish. This flag tells llama.cpp to use GLM-4.6's special chat format.
Minimum:
- Apple Silicon Mac (M1/M2/M3/M4)
- 160GB+ RAM
- 150GB free disk space
Recommended:
- Mac Studio with 192GB unified memory
- 200GB free disk space
- macOS Sonoma or later
Will NOT work on:
- Intel Macs (this setup targets Apple Silicon's Metal backend and unified memory)
- Macs with <160GB RAM
- NVIDIA/AMD GPUs (use the CUDA/ROCm/Vulkan builds of llama.cpp instead)
| Quant | Size | Quality | Speed | Fits in 192GB? |
|---|---|---|---|---|
| Q8_0 | 250GB | Highest | Slower | No (exceeds 192GB) |
| Q5_K_M | 180GB | Excellent | Fast | Yes |
| Q4_K_M | 140GB | Very Good | Faster | Yes (safe) |
| Q3_K_M | 110GB | Good | Fastest | Yes |
| Q2_K | 70GB | Poor | Very fast | Yes |
Q4_K_M is the sweet spot for most users. Q5_K_M if you want maximum quality and can spare the extra 40GB.
- Tool calling format differs from original - Uses Qwen3-style JSON instead of XML (Ollama limitation)
- Ollama native support pending - Use llama.cpp directly for now
- First token slow with full context - Expected with 200K context window
```bash
# 1. Build llama.cpp with Metal
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake . -B build -DGGML_METAL=ON
cmake --build build --config Release -j
# 2. Download sharded files
pip3 install huggingface-hub
huggingface-cli download unsloth/GLM-4.6-GGUF --include "Q4_K_M/*" --local-dir ./glm46
# 3. Merge into single file
./build/bin/llama-gguf-split --merge \
glm46/Q4_K_M/GLM-4.6-Q4_K_M-00001-of-*.gguf \
glm46-q4.gguf
```

Total time: ~15 hours (mostly download)
- Original Model: Zhipu AI (GLM-4.6)
- Quantization: Unsloth (sharded Q4_K_M GGUF)
- Merge: BrianMac (this unsharded version)
- Runtime: llama.cpp team
- Inspiration: Frustration with Ollama's lack of sharded GGUF support
Same as original GLM-4.6 model from Zhipu AI.
- unsloth/GLM-4.6-GGUF - Original sharded versions (all quants)
- MichelRosselli/GLM-4.6 - Ollama package (Q4_K_M)
- zai-org/GLM-4.6 - Original full-precision model
Having issues? Check these:
- Did you use the `--jinja` flag? (99% of problems)
- Is your Mac Studio running low on RAM? (check Activity Monitor)
- Are you on Apple Silicon? (Intel Macs won't work)
- Try reducing `--ctx-size` if running out of memory
Tested on Mac Studio M2 Ultra, 192GB unified memory:
- Tokens/sec: 6-8 t/s average
- First token latency: 2-3 seconds (empty context)
- First token latency: 10-15 seconds (32K context)
- Memory usage: 165-175GB total
- GPU utilization: 40-60% (attention layers)
- CPU utilization: 20-40% (MoE layers)
For comparison:
- DeepSeek-V3 Q4_K_M: 4-6 t/s (similar architecture)
- Qwen2.5-72B Q5_K_M: 10-12 t/s (smaller model)
- Llama-3.1-70B Q8_0: 12-15 t/s (non-MoE)
MoE models are inherently slower due to routing overhead, but GLM-4.6 offers significantly better reasoning and coding than similarly-sized dense models.
---
### Hugging Face Web UI Fields
**Model name:** `GLM-4.6-Q4_K_M-Unsharded-Metal`
**Short description:**
`Unsharded Q4_K_M quantization of GLM-4.6 for Mac Studio and Apple Silicon. Single-file GGUF optimized for Metal GPU. 140GB, runs at 6-8 tokens/sec on 192GB Mac Studio.`
**Tags:**
glm glm-4.6 gguf quantized q4_k_m unsharded metal apple-silicon mac-studio llama-cpp conversational code reasoning moe
**Language:**
en (English)
**License:**
other (Same as original GLM-4.6 from Zhipu AI)
**Model type:**
Text Generation
**Library:**
GGUF
**Base model:**
zai-org/GLM-4.6
**Datasets used:**
(Leave blank - inference only)
**Additional metadata (in model card YAML frontmatter):**
```yaml
---
language:
- en
license: other
tags:
- glm
- glm-4.6
- gguf
- quantized
- q4_k_m
- unsharded
- metal
- apple-silicon
- mac-studio
- llama-cpp
- conversational
- code
- reasoning
- moe
library_name: gguf
base_model: zai-org/GLM-4.6
inference: true
pipeline_tag: text-generation
model_type: glm
quantization: q4_k_m
---
```
"Model outputs gibberish"
- Add the `--jinja` flag (99% of the time this is the issue)
"Out of memory error"
- Reduce `--ctx-size` to 16384 or 8192
- Try Q3_K_M instead of Q4_K_M
- Close other apps to free RAM
"Not using Metal GPU"
- Rebuild llama.cpp with `-DGGML_METAL=ON`
- Check you're running on Apple Silicon (not Intel)
- Verify in Activity Monitor → GPU tab
"Too slow / hanging"
- Add the `-ot ".ffn_.*_exps.=CPU"` flag
- Reduce `--n-gpu-layers` to 50-60
- First token with large context is always slow (expected)
"Download interrupted"
- Resume with the same `huggingface-cli download` command
- It will continue from where it left off
Before this:
- GLM-4.6 comes sharded (6 files)
- Ollama doesn't support sharded files
- Manual merge required for every user
- Confusing for newcomers
After this:
- Single file, ready to use
- Download and run immediately
- Works with all llama.cpp-based tools
- Will work with Ollama once they add GLM-4.6 support
TL;DR: Making GLM-4.6 accessible to Mac Studio users without the hassle.
That's it! You now have the biggest, baddest GLM-4.6 running locally on your Mac Studio, serving OpenAI-compatible API, and (optionally) uploaded to Hugging Face for others to use.