CUDA-aware MPI multi-GPU test
using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# select device (specifically relevant if >1 GPU per node)
# using node-local communicator to retrieve node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)
gpu_id = CUDA.device!(rank_l)
# alternatively, rely on the default device if the scheduler already exposes one GPU per rank (e.g. SLURM `--gpus-per-task=1`)
# gpu_id = CUDA.device!(0)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
CUDA.synchronize()
rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank_l: $recv_mesg")
rank==0 && println("done.")
luraess (Author) commented on Dec 9, 2025

Reporting issues

Please report any questions or issues you encounter related to GPU-aware MPI either on the Julia at Scale Discourse category or as an issue on MPI.jl.
