On every machine in the cluster install openmpi and mlx-lm:
conda install conda-forge::openmpi
pip install -U mlx-lm
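It's worth confirming that every host ends up on the same versions, since mismatched mlx or mlx-lm versions across machines can cause hard-to-debug distributed failures. A quick check to run on each host:
pip list | grep -i mlx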
Next, download the pipeline parallel run script. Download it to the same path on every machine:
curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
Make a hosts.json file on the machine from which you plan to launch the generation. For two machines it should look like this:
[
{"ssh": "hostname1"},
{"ssh": "hostname2"}
]
Also make sure you can ssh hostname from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
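A quick way to verify this from the launch machine is to loop over the hosts with non-interactive SSH (substitute your actual hostnames; BatchMode makes ssh fail immediately rather than prompt for a password):
for h in hostname1 hostname2; do ssh -o BatchMode=yes $h hostname; done
Each line of output should be the remote machine's hostname. Repeat from the other machines to cover every direction.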
Set the wired limit on the machines to use more memory. For example on a 192GB M2 Ultra set this:
sudo sysctl iogpu.wired_limit_mb=180000
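The value is in megabytes, so this wires up to 180GB of the 192GB and leaves the remainder for the OS. You can read back the current limit at any time with:
sysctl iogpu.wired_limit_mb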
Run the generation with a command like the following:
mlx.launch \
--hostfile path/to/hosts.json \
--backend mpi \
path/to/pipeline_generate.py \
--prompt "What number is larger 6.9 or 6.11?" \
--max-tokens 128 \
--model mlx-community/DeepSeek-R1-4bit
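Because the model download is large, it can help to first smoke-test the launcher with a trivial script. This is just a sketch, not part of the official instructions: test_dist.py is a hypothetical name, and like pipeline_generate.py it must exist at the same path on every machine (e.g. copy it out with scp). It uses mx.distributed.init() from MLX to print one line per rank:

cat > test_dist.py <<'EOF'
# Hypothetical smoke test: each rank reports its position in the MPI world.
import mlx.core as mx
group = mx.distributed.init()  # joins the group set up by mlx.launch
print(f"Rank {group.rank()} of {group.size()}")
EOF
mlx.launch --hostfile path/to/hosts.json --backend mpi test_dist.py

If each host prints a distinct rank, MPI and SSH are wired up correctly.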
For DeepSeek R1 quantized to 3-bit you need roughly 350GB of RAM in aggregate across the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450GB of RAM in aggregate, e.g. three 192 GB M2 Ultras.
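As a rough sanity check on those figures (assuming DeepSeek R1's 671B total parameters), the weights alone take about params * bits / 8 bytes; the aggregate numbers above presumably add headroom for activations, the KV cache, and the OS:
echo '671 * 3 / 8' | bc -l   # ~251.6 GB of weights at 3-bit
echo '671 * 4 / 8' | bc -l   # ~335.5 GB of weights at 4-bit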


Thank you for the interesting and helpful writeup! I'm excited to run this on a large number of hosts if possible.
I've got 31x M1 hosts provisioned with 16 GB RAM, configured with iogpu.wired_limit_mb=12000, running macOS 15.2, openmpi 5.0.6, mlx 0.22.1, and mlx-lm 0.21.4, and I've validated that all hosts can reach one another via SSH key.

I've tested both DeepSeek R1 3-bit and 2-bit but end up with the same MLX error and stack trace from multiple hosts after the shards are downloaded. I was just wondering if you had any suggestions for troubleshooting here?