To diagnose a node with a bad GPU (e.g. `ip-10-1-69-242`) on ParallelCluster, do the following:
- Run the NVIDIA GPU reset command, where `0` is the device index shown by `nvidia-smi` for the GPU you want to reset:

```bash
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
```

- If that doesn't succeed (a quick way to verify is sketched after this list), generate a bug report:

```bash
srun -w ip-10-1-69-242 nvidia-bug-report.sh
```

- Grab the instance ID:

```bash
srun -w ip-10-1-69-242 cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " "
```

- Grab the output of `nvidia-bug-report.sh`, then replace the instance by terminating it, where `<instance-id>` is the instance ID from above (see the combined snippet after this list):

```bash
aws ec2 terminate-instances \
    --instance-ids <instance-id>
```

- ParallelCluster will re-launch the instance and you'll see a new instance come up in the EC2 console.
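
The steps above don't spell out how to tell whether the reset worked. One way, assuming the node is still reachable through Slurm, is to query the device again and check that it enumerates cleanly rather than returning an error:

```bash
# Query GPU 0 on the affected node after the reset; a healthy device returns
# a full property listing (device index and node name as in the steps above).
srun -w ip-10-1-69-242 nvidia-smi -i 0 -q
```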
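
For convenience, the instance-ID lookup and the termination can be chained into one snippet. This is a minimal sketch rather than part of the original runbook; it assumes the AWS CLI is configured with permission to terminate instances in this cluster, and `BAD_NODE` is a placeholder for the affected node name:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Node with the bad GPU (example node name from the steps above).
BAD_NODE="ip-10-1-69-242"

# On EC2, the board asset tag exposes the instance ID of the host.
INSTANCE_ID=$(srun -w "${BAD_NODE}" cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")

echo "Terminating ${INSTANCE_ID} (${BAD_NODE}) so ParallelCluster replaces it"
aws ec2 terminate-instances --instance-ids "${INSTANCE_ID}"
```

Afterwards, `sinfo --nodes=ip-10-1-69-242` is one way to watch Slurm's view of the node until the replacement instance registers, in addition to checking the EC2 console as noted above.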