@YanSte
Last active March 15, 2026 00:24
DGX Spark + Qwen3.5 + Claude Code Setup

A local AI coding assistant: Claude Code driven by a locally hosted model on an NVIDIA DGX Spark.

This setup lets Claude Code run without the Anthropic cloud API by redirecting its requests to a local vLLM inference server running Qwen3.5.


Architecture

Claude Code (Mac)
        │
        │  ANTHROPIC_BASE_URL
        ▼
vLLM Server (DGX Spark :8000)
        │
        ▼
Qwen3.5-35B-A3B
        │
        ▼
NVIDIA GB10 GPU

Claude Code normally communicates with Anthropic's servers, but vLLM can expose a compatible API endpoint, so a local model can stand in instead.
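Under the hood, Claude Code sends Anthropic Messages API requests to whatever ANTHROPIC_BASE_URL points at. A minimal sketch of such a request body, built locally without contacting any server (field names follow the Anthropic Messages API; the model name matches the --served-model-name used below):

```python
import json

# Anthropic Messages API request body, as Claude Code would send it to
# ANTHROPIC_BASE_URL. Built locally here; nothing goes over the network.
payload = {
    "model": "qwen3.5-35b",  # must match --served-model-name on the vLLM side
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Summarize this repository's build steps."}
    ],
}

body = json.dumps(payload)
print(json.loads(body)["model"])  # → qwen3.5-35b
```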


Hardware

DGX Spark

  • NVIDIA GB10 GPU
  • 128GB unified memory
  • CUDA 13
  • Docker
  • NVIDIA Container Toolkit

Model

Qwen/Qwen3.5-35B-A3B

Features

  • Mixture-of-Experts architecture
  • 262k context window
  • Tool calling support
  • Optimized for reasoning and coding workloads

1. Start the vLLM Server

Run the model with tool calling enabled.

docker rm -f qwen35

docker run -d \
  --name qwen35 \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
  --served-model-name qwen3.5-35b \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 262144 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3

Important flags

--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3

These flags enable the agentic tool use that Claude Code requires.
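With these flags, the server returns structured tool calls instead of free-form text. A sketch of parsing one, using a hand-written sample in the OpenAI chat-completions response shape (the read_file tool and its arguments are made up for illustration):

```python
import json

# Hand-written sample response; the tool name and arguments are illustrative.
sample = json.loads("""
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "read_file",
          "arguments": "{\\"path\\": \\"src/main.py\\"}"
        }
      }]
    }
  }]
}
""")

call = sample["choices"][0]["message"]["tool_calls"][0]["function"]
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
print(call["name"], args["path"])     # → read_file src/main.py
```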


2. Verify the Server

Check that the model is available.

curl http://localhost:8000/v1/models

Expected output (truncated)

{
  "object": "list",
  "data": [
    { "id": "qwen3.5-35b", "object": "model", ... }
  ]
}
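A quick way to pull the served model ids out of that response, e.g. for a health-check script (the full response follows the OpenAI-compatible /v1/models list shape; the JSON is hard-coded here so the snippet runs without the server):

```python
import json

# Sample /v1/models response in the OpenAI-compatible list shape; with the
# server up, fetch the real thing from http://localhost:8000/v1/models.
resp = json.loads(
    '{"object": "list", "data": [{"id": "qwen3.5-35b", "object": "model"}]}'
)

model_ids = [m["id"] for m in resp["data"]]
print(model_ids)  # → ['qwen3.5-35b']
```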

3. Monitor GPU Usage

nvidia-smi

Example

VLLM::EngineCore   ~99GB VRAM

This confirms the model is loaded and running on the GPU.
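That ~99GB figure is roughly what a back-of-the-envelope estimate predicts (a rough sketch only; actual usage also depends on activation buffers and vLLM overhead):

```python
# Rough VRAM estimate for a 35B-parameter model served in bfloat16.
params = 35e9
bytes_per_param = 2                      # bfloat16 = 2 bytes per weight
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)                        # → 70.0 (GB for weights alone)

# The remaining ~29 GB observed above is consistent with vLLM preallocating
# KV-cache space: --gpu-memory-utilization 0.9 of 128 GB unified memory.
budget_gb = 0.9 * 128
print(budget_gb)                         # → 115.2 (GB memory budget)
```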


4. Configure Claude Code (Mac)

Claude Code normally calls the Anthropic API.

We redirect it to our local vLLM server.

export ANTHROPIC_BASE_URL=http://192.168.8.146:8000
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_AUTH_TOKEN=dummy

export ANTHROPIC_DEFAULT_OPUS_MODEL=qwen3.5-35b
export ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3.5-35b
export ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3.5-35b

Start Claude Code

claude

Claude Code will now use Qwen3.5 running locally on the DGX Spark.


5. Switching Between Cloud Claude and Local DGX

Add these shortcuts to ~/.zshrc.

Cloud Claude and DGX

claude() {
  if [[ "$1" == "--spark" ]]; then
    shift
    export ANTHROPIC_BASE_URL="http://192.168.8.146:8000"
    export ANTHROPIC_AUTH_TOKEN="dummy"
    unset ANTHROPIC_API_KEY
    export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen3.5-35b"
    export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3.5-35b"
    export ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen3.5-35b"
    command claude "$@"
    return
  fi

  if [[ "$1" == "--cloud" ]]; then
    shift
    unset ANTHROPIC_BASE_URL
    unset ANTHROPIC_AUTH_TOKEN
    unset ANTHROPIC_API_KEY
    unset ANTHROPIC_DEFAULT_OPUS_MODEL
    unset ANTHROPIC_DEFAULT_SONNET_MODEL
    unset ANTHROPIC_DEFAULT_HAIKU_MODEL
    command claude "$@"
    return
  fi

  command claude "$@"
}

Reload shell

source ~/.zshrc

Usage

claude --spark
claude --cloud

6. Test Prompt

Example

write a fastapi server with jwt authentication

Claude Code will:

  • read project files
  • generate code
  • edit files
  • run commands

The reasoning, however, comes from Qwen3.5 running locally.


7. Optional Advanced Setup

Many developers add a router layer:

Claude Code
     │
     ▼
LiteLLM Router
     │
 ┌───┴──────────┐
 ▼              ▼
Qwen3.5     DeepSeekCoder
Reasoning    Coding

Benefits

  • better coding performance
  • automatic model selection
  • fallback models
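A minimal LiteLLM proxy config sketch for this layout, assuming both backends expose OpenAI-compatible endpoints (the second server's port 8001 and the deepseek-coder model name are placeholders; adjust to your deployment):

```
model_list:
  - model_name: reasoning
    litellm_params:
      model: openai/qwen3.5-35b
      api_base: http://192.168.8.146:8000/v1
      api_key: dummy
  - model_name: coding
    litellm_params:
      model: openai/deepseek-coder
      api_base: http://192.168.8.146:8001/v1
      api_key: dummy
```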

Result

You now have a fully local AI coding stack:

Claude Code CLI
        │
        ▼
vLLM inference server
        │
        ▼
Qwen3.5-35B
        │
        ▼
NVIDIA DGX Spark

Advantages

  • no API cost
  • private code execution
  • GPU acceleration
  • large context reasoning model