Local AI coding assistant: Claude Code paired with a local model on an NVIDIA DGX Spark.
This setup lets Claude Code run without the Anthropic cloud API by redirecting its requests to a local vLLM inference server hosting Qwen3.5.
Claude Code (Mac)
│
│ ANTHROPIC_BASE_URL
▼
vLLM Server (DGX Spark :8000)
│
▼
Qwen3.5-35B-A3B
│
▼
NVIDIA GB10 GPU
Claude Code normally communicates with Anthropic servers, but vLLM can expose a compatible API so a local model can be used instead.
DGX Spark
- NVIDIA GB10 GPU
- 128GB unified memory
- CUDA 13
- Docker
- NVIDIA Container Toolkit
Qwen/Qwen3.5-35B-A3B
Features
- Mixture-of-Experts architecture
- 262k context window
- Tool calling support
- Optimized for reasoning and coding workloads
Run the model with tool calling enabled.
docker rm -f qwen35
docker run -d \
--name qwen35 \
--gpus all \
--ipc host \
--shm-size 64gb \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:cu130-nightly \
Qwen/Qwen3.5-35B-A3B \
--served-model-name qwen3.5-35b \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--gpu-memory-utilization 0.9 \
--max-model-len 262144 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
Important flags
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
These enable the automatic tool calling and reasoning-trace parsing that Claude Code's agent workflow requires.
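As a quick sanity check that tool calling is wired up, you can POST an OpenAI-style chat request carrying a tool definition to the server. The sketch below uses a hypothetical `get_weather` tool; it builds and validates the payload locally first, and the commented-out curl sends it once the container is up.

```shell
# Build an OpenAI-style chat request with one (hypothetical) tool
# definition, then validate the JSON locally before sending it.
cat > /tmp/toolcall.json <<'EOF'
{
  "model": "qwen3.5-35b",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF
python3 -m json.tool < /tmp/toolcall.json > /dev/null && echo "payload ok"

# Once the server is running, send it; with --enable-auto-tool-choice
# the response should contain a tool_calls entry rather than plain text:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/toolcall.json
```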
Check that the model is available.
curl http://localhost:8000/v1/models
Expected output (abridged; vLLM returns an OpenAI-style model list)
{
  "object": "list",
  "data": [
    {
      "id": "qwen3.5-35b",
      "object": "model",
      ...
    }
  ]
}
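If you only want the served model id, the response can be parsed with a one-liner. The sketch below inlines a sample of the (assumed OpenAI-style) response shape so it runs without the server; pipe the live curl output through the same Python snippet instead.

```shell
# Parse the served model id out of a /v1/models response.
# Sample JSON inlined here; for the live server, pipe curl instead:
#   curl -s http://localhost:8000/v1/models | python3 -c '...'
echo '{"object": "list", "data": [{"id": "qwen3.5-35b", "object": "model"}]}' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["data"][0]["id"])'
# prints: qwen3.5-35b
```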
Check GPU memory usage.
nvidia-smi
Example (abridged process list)
VLLM::EngineCore ~99GB VRAM
This confirms the model is loaded and running on the GPU.
Claude Code normally calls the Anthropic API.
We redirect it to our local vLLM server.
export ANTHROPIC_BASE_URL=http://192.168.8.146:8000
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_DEFAULT_OPUS_MODEL=qwen3.5-35b
export ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3.5-35b
export ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3.5-35b
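Note that these exports persist for the whole shell session. If you only want the overrides for a single run, environment variables can instead be scoped to one invocation. In the sketch below, an `sh -c ... echo` stands in for the real `claude` call so the behavior is visible:

```shell
# Scope the overrides to a single command instead of exporting them;
# the child process sees the variables, the parent shell stays clean.
ANTHROPIC_BASE_URL=http://192.168.8.146:8000 \
ANTHROPIC_AUTH_TOKEN=dummy \
ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3.5-35b \
sh -c 'echo "base=$ANTHROPIC_BASE_URL model=$ANTHROPIC_DEFAULT_SONNET_MODEL"'
# prints: base=http://192.168.8.146:8000 model=qwen3.5-35b

# For the real thing, replace the sh -c line with:
#   claude
```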
Start Claude Code
claude
Claude Code will now use Qwen3.5 running locally on the DGX Spark.
Add shortcuts to .zshrc.
claude() {
  if [[ "$1" == "--spark" ]]; then
    shift
    export ANTHROPIC_BASE_URL="http://192.168.8.146:8000"
    export ANTHROPIC_AUTH_TOKEN="dummy"
    unset ANTHROPIC_API_KEY
    export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen3.5-35b"
    export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3.5-35b"
    export ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen3.5-35b"
    command claude "$@"
    return
  fi
  if [[ "$1" == "--cloud" ]]; then
    shift
    unset ANTHROPIC_BASE_URL
    unset ANTHROPIC_AUTH_TOKEN
    unset ANTHROPIC_API_KEY
    unset ANTHROPIC_DEFAULT_OPUS_MODEL
    unset ANTHROPIC_DEFAULT_SONNET_MODEL
    unset ANTHROPIC_DEFAULT_HAIKU_MODEL
    command claude "$@"
    return
  fi
  command claude "$@"
}
Reload shell
source ~/.zshrc
Usage
claude --spark
claude --cloud
Example
write a fastapi server with jwt authentication
Claude Code will:
- read project files
- generate code
- edit files
- run commands
But the reasoning comes from Qwen3.5 locally.
Many developers add a router layer:
Claude Code
│
▼
LiteLLM Router
│
┌───┴──────────┐
▼ ▼
Qwen3.5 DeepSeekCoder
Reasoning Coding
Benefits
- better coding performance
- automatic model selection
- fallback models
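A minimal LiteLLM proxy config for this topology might look like the sketch below. Assumptions: both models are served via OpenAI-compatible endpoints, and the DeepSeek-Coder server and its port 8001 are hypothetical; only the Qwen entry matches the setup above.

```yaml
# litellm-config.yaml (a sketch, not a verified working config)
model_list:
  - model_name: qwen3.5-35b          # reasoning / default
    litellm_params:
      model: openai/qwen3.5-35b
      api_base: http://192.168.8.146:8000/v1
      api_key: dummy
  - model_name: deepseek-coder       # coding (hypothetical second server)
    litellm_params:
      model: openai/deepseek-coder
      api_base: http://192.168.8.146:8001/v1
      api_key: dummy
```

You would then start the proxy with `litellm --config litellm-config.yaml` and point ANTHROPIC_BASE_URL at the proxy instead of at vLLM directly.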
You now have a fully local AI coding stack:
Claude Code CLI
│
▼
vLLM inference server
│
▼
Qwen3.5-35B
│
▼
NVIDIA DGX Spark
Advantages
- no API cost
- private code execution
- GPU acceleration
- large context reasoning model