Prereq: Slurm is installed and `slurmdbd` is configured/running. Controller and nodes share the same `slurm.conf` (including `AccountingStorageType=accounting_storage/slurmdbd`) and Munge works across nodes.
If Slurm isn’t installed yet, follow: https://mtreviso.github.io/blog/slurm.html
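A quick way to sanity-check these prerequisites, as a sketch; `<compute-node>` is a placeholder for one of your nodes:

```bash
# Accounting settings the controller is actually using
scontrol show config | grep -Ei 'ClusterName|AccountingStorage'

# Is slurmdbd active on the database host?
systemctl is-active slurmdbd

# Does a Munge credential created here verify on a remote node?
# (<compute-node> is a placeholder)
munge -n | ssh <compute-node> unmunge
```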
Show clusters:
sudo sacctmgr show clusters

Create if missing:
sudo sacctmgr add cluster sardine-cluster

Note: Changes require a healthy `slurmdbd` and a matching `ClusterName` in `/etc/slurm/slurm.conf`.
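To confirm the cluster has registered with `slurmdbd`, a listing such as the following should show a non-empty `ControlHost`/`RPC` once `slurmctld` has connected (format field names assumed from current `sacctmgr` versions):

```bash
sudo sacctmgr show clusters format=Cluster,ControlHost,ControlPort,RPC
```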
List accounts:
sudo sacctmgr show account

Create if missing:
sudo sacctmgr add account sardine Description="SARDINE" Organization=sardine
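To verify the account exists and, later, which users sit under it, something like the following should work; the `WithAssoc` option and format fields are assumed from current `sacctmgr` versions:

```bash
sudo sacctmgr show account sardine WithAssoc format=Account,Description,Organization,User
```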
QoS controls job limits and priority. Higher numeric priority runs sooner (e.g., priority=100 > priority=10).

List QoS:
sudo sacctmgr show qos

Delete (example):
sudo sacctmgr delete qos gpu-debug

Add (example policy set; tune to taste):
sudo sacctmgr add qos cpu set priority=10 MaxJobsPerUser=4 MaxTRESPerUser=cpu=32,mem=128G,gres/gpu=0
sudo sacctmgr add qos gpu-debug set priority=20 MaxJobsPerUser=1 MaxTRESPerUser=gres/gpu=8 MaxWallDurationPerJob=01:00:00
sudo sacctmgr add qos gpu-short set priority=10 MaxJobsPerUser=4 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=04:00:00
sudo sacctmgr add qos gpu-medium set priority=5 MaxJobsPerUser=1 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=2-00:00:00
sudo sacctmgr add qos gpu-long set priority=2 MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=2 MaxWallDurationPerJob=7-00:00:00
sudo sacctmgr add qos gpu-h100 set priority=10 MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=2-00:00:00
sudo sacctmgr add qos gpu-h200 set priority=10 MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=4-00:00:00
sudo sacctmgr add qos gpu-hero set priority=100 MaxJobsPerUser=3 MaxTRESPerUser=gres/gpu=3

- `priority`: higher means earlier dispatch (subject to other factors)
- `MaxJobsPerUser`: limit on concurrent jobs per user in that QoS
- `MaxTRESPerUser`: cap on total resources (e.g., `gres/gpu=4`)
- `MaxWallDurationPerJob`: per-job wallclock limit
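To review the resulting policy table at a glance, a format like the one below is handy (field names such as `MaxJobsPU` and `MaxTRESPU` are assumed from current `sacctmgr` versions; adjust if yours differs):

```bash
sudo sacctmgr show qos format=Name,Priority,MaxJobsPU,MaxTRESPU,MaxWall
```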
Modify QoS:
sudo sacctmgr modify qos gpu-debug set priority=20

Unset a value:
sudo sacctmgr modify qos gpu-debug set priority=-1
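The same `-1` convention clears TRES limits, applied per TRES. A sketch, assuming you only want to drop the GPU cap from `gpu-debug`:

```bash
# Clear just the per-user GPU limit; other TRES limits stay in place
sudo sacctmgr modify qos gpu-debug set MaxTRESPerUser=gres/gpu=-1
```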
List users:
sudo sacctmgr show user -s

Add a user with allowed QoS:
sudo sacctmgr create user --immediate name=mtreviso account=sardine QOS=gpu-debug,gpu-short,gpu-medium,gpu-long
`--immediate` skips the interactive confirmation. Omit it if you prefer to review changes.
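To check what the user ended up with (account and allowed QoS), the associations view works well; `mtreviso` is just the example user from above:

```bash
sudo sacctmgr show assoc where user=mtreviso format=Cluster,Account,User,QOS
```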
Modify:
sudo sacctmgr -i modify user where name=mtreviso set QOS=gpu-debug,gpu-short,gpu-medium,gpu-long

Delete:
sudo sacctmgr delete user mtreviso

See reasons for drained nodes:
sudo sinfo -R

Draining due to memory mismatch:
- Ensure node hardware lines in `/etc/slurm/slurm.conf` match `sudo slurmd -C` and real memory from `free -m`.
- Update `slurm.conf` on all nodes, then:
sudo scontrol update NodeName=<nodename> State=RESUME
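If the node keeps draining, compare what the node detects against what the controller has configured; a sketch, with `<nodename>` as a placeholder as above:

```bash
# On the node: hardware as detected by slurmd (RealMemory, CPUs, ...)
sudo slurmd -C

# On the controller: configured values plus the drain Reason
scontrol show node <nodename> | grep -E 'RealMemory|CPUTot|State|Reason'
```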
Service order on controller:
sudo systemctl restart slurmdbd
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
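A quick post-restart health check, assuming all three daemons run on this host (e.g., a single-machine or combined controller setup):

```bash
# Assumes slurmdbd, slurmctld, and slurmd all run here
systemctl --no-pager status slurmdbd slurmctld slurmd
scontrol ping                 # controller answering?
sudo sacctmgr show clusters   # slurmdbd reachable, cluster registered?
sinfo                         # nodes and partitions up?
```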
Check logs:
- /var/log/slurm/slurmdbd.log
- /var/log/slurm/slurmctld.log
- /var/log/slurm/slurmd.log
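When chasing a specific failure, following the relevant log while reproducing it is usually enough; actual paths depend on the `LogFile`, `SlurmctldLogFile`, and `SlurmdLogFile` settings (the defaults above are assumed here):

```bash
sudo tail -f /var/log/slurm/slurmctld.log
```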
- Priority multifactor: https://slurm.schedmd.com/priority_multifactor.html
- squeue reason codes: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
- Resource limits: https://slurm.schedmd.com/resource_limits.html
- Handy scripts repo: https://github.com/cdt-data-science/cluster-scripts