llama.cpp on multiple nodes

After compiling with RPC support enabled (the GGML_RPC CMake option):
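A typical build with RPC support looks something like this (a sketch; exact flags depend on your backend):

cmake -B build -DGGML_RPC=ON
cmake --build build --config Release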

Run rpc-server on the remote nodes:

rpc-server --port 5001 --host 169.254.51.65

In this case I have only one remote node, but the --rpc flag accepts a comma-separated list of <ip address>:<port> entries if you have more than one.
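For example, two remote nodes would be passed as (the second address here is hypothetical):

--rpc 169.254.51.65:5001,169.254.51.66:5001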

On the head node run: llama-server --rpc 169.254.51.65:5001 --list-devices

Which, in my case, with a MacBook Air connected to an AMD Strix Halo machine over Thunderbolt (which is why I am explicitly specifying the IP address of the Thunderbolt interface), outputs:

Available devices:
  Metal: Apple M4 (21845 MiB, 21844 MiB free)
  BLAS: Accelerate (0 MiB, 0 MiB free)
  RPC0: 169.254.51.65:5001 (87722 MiB, 87247 MiB free)

Note that the CPU device is also included; if you compiled with BLAS support, that device is listed as BLAS.

To run llama.cpp on multiple nodes, specifying how the model is split with the --tensor-split parameter, run:

llama-server -m ./unsloth_gpt-oss-120b-GGUF_gpt-oss-120b-F16.gguf --port 8080 \
  -ngl 99 --rpc 169.254.51.65:5001 \
  -fa on --no-mmap --tensor-split 10,90 \
  --device Metal,RPC0   # <--- !!!

You need to explicitly specify the devices to run on.

IMPORTANT: the --device parameter must be specified AFTER the --rpc one.

--tensor-split 10,90 means that 10% of the model's layers will be stored on the first device (Metal, local) and 90% on the second (RPC0).
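The same pattern extends to more devices. As a sketch (the second RPC endpoint and the exact ratios here are hypothetical), with two remote nodes you would list three devices and three split values, matched in order:

--device Metal,RPC0,RPC1 --tensor-split 10,45,45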

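Once the server is up, you can sanity-check it from the head node with a request to llama-server's OpenAI-compatible endpoint (the prompt here is just an example):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'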