To diagnose a node with a bad GPU (e.g. `ip-10-1-69-242`) on ParallelCluster, do the following:
- Run the NVIDIA GPU reset command, where `0` is the device index shown by `nvidia-smi` for the GPU you want to reset:

```bash
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
```

- If that doesn't succeed (a quick way to verify is sketched after this list), generate a bug report:

```bash
srun -w ip-10-1-69-242 nvidia-bug-report.sh
```

- Grab the instance ID:

```bash
srun -w ip-10-1-69-242 cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " "
```

- Grab the output of `nvidia-bug-report.sh`, then replace the instance by terminating it, where `<instance-id>` is the instance ID from above (see the combined snippet after this list):

```bash
aws ec2 terminate-instances \
    --instance-ids <instance-id>
```

- ParallelCluster will re-launch the instance and you'll see a new instance come up in the EC2 console.
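
The steps above don't spell out how to tell whether the reset worked. One way, assuming the node is still reachable through Slurm, is to query the device again and check that it enumerates cleanly rather than returning an error:

```bash
# Query GPU 0 on the affected node after the reset; a healthy device returns
# a full property listing (device index and node name as in the steps above).
srun -w ip-10-1-69-242 nvidia-smi -i 0 -q
```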
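
For convenience, the instance-ID lookup and the termination can be chained into one snippet. This is a minimal sketch rather than part of the original runbook; it assumes the AWS CLI is configured with permission to terminate instances in this cluster, and `BAD_NODE` is a placeholder for the affected node name:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Node with the bad GPU (example node name from the steps above).
BAD_NODE="ip-10-1-69-242"

# On EC2, the board asset tag exposes the instance ID of the host.
INSTANCE_ID=$(srun -w "${BAD_NODE}" cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")

echo "Terminating ${INSTANCE_ID} (${BAD_NODE}) so ParallelCluster replaces it"
aws ec2 terminate-instances --instance-ids "${INSTANCE_ID}"
```

Afterwards, `sinfo --nodes=ip-10-1-69-242` is one way to watch Slurm's view of the node until the replacement instance registers, in addition to checking the EC2 console as noted above.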