Local AI coding assistant: Claude Code paired with a local model on an NVIDIA DGX Spark.
This setup lets Claude Code run without the Anthropic cloud API by redirecting its requests to a local vLLM inference server hosting Qwen3.5.
Claude Code (Mac)
│
│ ANTHROPIC_BASE_URL
▼
vLLM Server (DGX Spark :8000)
│
▼
Qwen3.5-35B-A3B
│
▼
NVIDIA GB10 GPU
Claude Code normally communicates with Anthropic servers, but vLLM can expose a compatible API so a local model can be used instead.
DGX Spark
- NVIDIA GB10 GPU
- 128GB unified memory
- CUDA 13
- Docker
- NVIDIA Container Toolkit
Qwen/Qwen3.5-35B-A3B
Features
- Mixture-of-Experts architecture
- 262k context window
- Tool calling support
- Optimized for reasoning and coding workloads
Run the model with tool calling enabled.
docker rm -f qwen35
docker run -d \
--name qwen35 \
--gpus all \
--ipc host \
--shm-size 64gb \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:cu130-nightly \
Qwen/Qwen3.5-35B-A3B \
--served-model-name qwen3.5-35b \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--gpu-memory-utilization 0.9 \
--max-model-len 262144 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
Important flags
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
These enable the automatic tool calling and reasoning-trace parsing that Claude Code's agent workflow requires.
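As a quick sanity check that tool calling is wired up, you can POST an OpenAI-style chat request carrying a tool definition to the server. The sketch below uses a hypothetical `get_weather` tool; it builds and validates the payload locally first, and the commented-out curl sends it once the container is up.

```shell
# Build an OpenAI-style chat request with one (hypothetical) tool
# definition, then validate the JSON locally before sending it.
cat > /tmp/toolcall.json <<'EOF'
{
  "model": "qwen3.5-35b",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF
python3 -m json.tool < /tmp/toolcall.json > /dev/null && echo "payload ok"

# Once the server is running, send it; with --enable-auto-tool-choice
# the response should contain a tool_calls entry rather than plain text:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/toolcall.json
```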
Check that the model is available.
curl http://localhost:8000/v1/models
Expected output (abridged; vLLM returns an OpenAI-style model list)
{
  "object": "list",
  "data": [
    {
      "id": "qwen3.5-35b",
      "object": "model",
      ...
    }
  ]
}
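If you only want the served model id, the response can be parsed with a one-liner. The sketch below inlines a sample of the (assumed OpenAI-style) response shape so it runs without the server; pipe the live curl output through the same Python snippet instead.

```shell
# Parse the served model id out of a /v1/models response.
# Sample JSON inlined here; for the live server, pipe curl instead:
#   curl -s http://localhost:8000/v1/models | python3 -c '...'
echo '{"object": "list", "data": [{"id": "qwen3.5-35b", "object": "model"}]}' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["data"][0]["id"])'
# prints: qwen3.5-35b
```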
Check GPU memory usage.
nvidia-smi
Example (abridged process list)
VLLM::EngineCore ~99GB VRAM
This confirms the model is loaded and running on the GPU.
Claude Code normally calls the Anthropic API.
We redirect it to our local vLLM server.
export ANTHROPIC_BASE_URL=http://192.168.8.146:8000
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_DEFAULT_OPUS_MODEL=qwen3.5-35b
export ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3.5-35b
export ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3.5-35b
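Note that these exports persist for the whole shell session. If you only want the overrides for a single run, environment variables can instead be scoped to one invocation. In the sketch below, an `sh -c ... echo` stands in for the real `claude` call so the behavior is visible:

```shell
# Scope the overrides to a single command instead of exporting them;
# the child process sees the variables, the parent shell stays clean.
ANTHROPIC_BASE_URL=http://192.168.8.146:8000 \
ANTHROPIC_AUTH_TOKEN=dummy \
ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3.5-35b \
sh -c 'echo "base=$ANTHROPIC_BASE_URL model=$ANTHROPIC_DEFAULT_SONNET_MODEL"'
# prints: base=http://192.168.8.146:8000 model=qwen3.5-35b

# For the real thing, replace the sh -c line with:
#   claude
```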
Start Claude Code
claude
Claude Code will now use Qwen3.5 running locally on the DGX Spark.
Add shortcuts to .zshrc.
claude() {
  if [[ "$1" == "--spark" ]]; then
    shift
    export ANTHROPIC_BASE_URL="http://192.168.8.146:8000"
    export ANTHROPIC_AUTH_TOKEN="dummy"
    unset ANTHROPIC_API_KEY
    export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen3.5-35b"
    export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3.5-35b"
    export ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen3.5-35b"
    command claude "$@"
    return
  fi
  if [[ "$1" == "--cloud" ]]; then
    shift
    unset ANTHROPIC_BASE_URL
    unset ANTHROPIC_AUTH_TOKEN
    unset ANTHROPIC_API_KEY
    unset ANTHROPIC_DEFAULT_OPUS_MODEL
    unset ANTHROPIC_DEFAULT_SONNET_MODEL
    unset ANTHROPIC_DEFAULT_HAIKU_MODEL
    command claude "$@"
    return
  fi
  command claude "$@"
}
Reload shell
source ~/.zshrc
Usage
claude --spark
claude --cloud
Example
write a fastapi server with jwt authentication
Claude Code will:
- read project files
- generate code
- edit files
- run commands
But the reasoning comes from Qwen3.5 locally.
Many developers add a router layer:
Claude Code
│
▼
LiteLLM Router
│
┌───┴──────────┐
▼ ▼
Qwen3.5 DeepSeekCoder
Reasoning Coding
Benefits
- better coding performance
- automatic model selection
- fallback models
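A minimal LiteLLM proxy config for this topology might look like the sketch below. Assumptions: both models are served via OpenAI-compatible endpoints, and the DeepSeek-Coder server and its port 8001 are hypothetical; only the Qwen entry matches the setup above.

```yaml
# litellm-config.yaml (a sketch, not a verified working config)
model_list:
  - model_name: qwen3.5-35b          # reasoning / default
    litellm_params:
      model: openai/qwen3.5-35b
      api_base: http://192.168.8.146:8000/v1
      api_key: dummy
  - model_name: deepseek-coder       # coding (hypothetical second server)
    litellm_params:
      model: openai/deepseek-coder
      api_base: http://192.168.8.146:8001/v1
      api_key: dummy
```

You would then start the proxy with `litellm --config litellm-config.yaml` and point ANTHROPIC_BASE_URL at the proxy instead of at vLLM directly.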
You now have a fully local AI coding stack:
Claude Code CLI
│
▼
vLLM inference server
│
▼
Qwen3.5-35B
│
▼
NVIDIA DGX Spark
Advantages
- no API cost
- private code execution
- GPU acceleration
- large context reasoning model