After compiling with RPC enabled (the GGML_RPC CMake option):
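For example, a build along these lines should work (a sketch; any additional flags depend on your platform and backend):
# configure with the RPC backend enabled, then build
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release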
Run rpc-server on the remote nodes:
rpc-server --port 5001 --host 169.254.51.65
In this case I have only one remote node, but the --rpc parameter shown below accepts a comma-separated list of <ip address>:<port> entries if you have more than one.
On the head node run:
llama-server --rpc 169.254.51.65:5001 --list-devices
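With two remote nodes, each running its own rpc-server, the list would look like this (the second address here is made up for illustration):
llama-server --rpc 169.254.51.65:5001,169.254.51.66:5001 --list-devices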
In my case, with a MacBook Air connected to an AMD Strix Halo machine over Thunderbolt (which is why I am specifying the IP address of the Thunderbolt interface explicitly), this outputs:
Available devices:
Metal: Apple M4 (21845 MiB, 21844 MiB free)
BLAS: Accelerate (0 MiB, 0 MiB free)
RPC0: 169.254.51.65:5001 (87722 MiB, 87247 MiB free)
Note that the CPU device is included as well, and if you compiled with BLAS support
that device is listed as BLAS.
To run llama.cpp across multiple nodes, splitting the model with the --tensor-split parameter, run:
llama-server -m ./unsloth_gpt-oss-120b-GGUF_gpt-oss-120b-F16.gguf --port 8080 \
-ngl 99 --rpc 169.254.51.65:5001 \
--device Metal,RPC0 \
-fa on --no-mmap --tensor-split 10,90
Note the --device line: you need to explicitly specify the devices to run on.
IMPORTANT: the --device parameter must be specified AFTER the --rpc one, otherwise the RPC devices will not be available yet when the device list is parsed.
--tensor-split 10,90 means that 10% of the model's layers will be stored on the first
device (Metal, local) and 90% on the other (RPC0). The values are relative proportions
rather than strict percentages; they just happen to sum to 100 here, so
--tensor-split 1,9 would split the model the same way.
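Once the server is up, a quick request is an easy way to verify that the split setup actually answers. A minimal check against the OpenAI-compatible endpoint that llama-server exposes (prompt and token count are arbitrary):
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'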