In addition to Harvard's fantastic list, we list some other convenient SLURM commands.
Useful for letting other enqueued jobs run without having to kill/re-run already running jobs. To delay for 7 days:
scontrol update JobID=<JOB ID> StartTime=now+7days
The NODELIST(REASON) field reported by squeue for the delayed jobs will become (BeginTime).
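To check that the delay took effect, a small wrapper around squeue can help. This is only a sketch: list_delayed_jobs is a hypothetical helper name of ours, not a SLURM command, and the column widths in the format string are arbitrary choices.

```shell
# list_delayed_jobs [USER]
# Hypothetical helper (not part of SLURM): prints job ID, state, expected
# start time and pending reason for a user's pending jobs.
# Defaults to the current user when no argument is given.
list_delayed_jobs() {
    # %i = job ID, %T = job state, %S = start time, %r = reason
    squeue --user="${1:-$USER}" --states=PENDING --format="%.18i %.9T %.20S %r"
}

# Usage:
# list_delayed_jobs
```

Delayed jobs should show BeginTime in the last column, along with the start time you set.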
When suspend and hold don't seem to do anything!
You may want to stop running jobs and requeue them further down the queue (i.e. avoid immediately re-running them). This is useful for freeing up nodes to let other jobs run without having to resubmit your running jobs.
To requeue a job and delay it for one day:
(export jobid=<JOB ID>; scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day)
If you have many jobs (with unique job IDs), you'll want to type out the list of jobs to requeue and delay in a for loop:
for jobid in <SPACE SEPARATED LIST OF JOB IDS>; do scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day; done
If many of your jobs share a common prefix which you don't want to retype, export it!
(export prefix=<COMMON JOB ID PREFIX>; for suffix in <SPACE SEPARATED LIST OF JOB ID SUFFIXES>; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+1day; done)
For example, given the following list of job IDs...
1234567_10
1234567_11
1234567_12
1234567_13
1234567_14
if you want to requeue + delay jobs 1234567_11 and 1234567_12 for 2 days, you'd call
(export prefix=1234567_1; for suffix in 1 2; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+2days; done)
Note that SLURM will often not list the re-queued jobs in squeue, but rest assured, they're still enqueued!
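If you do this often, the one-liners above can be wrapped in a small shell function. This is a minimal sketch, not a SLURM feature: requeue_and_delay and the SCONTROL override variable are hypothetical names of ours. Setting SCONTROL=echo prints the commands instead of running them, which is handy for a dry run.

```shell
# requeue_and_delay DELAY JOBID...
# Hypothetical helper (not part of SLURM): requeues each job, then pushes
# its earliest start time back by DELAY (e.g. 1day, 2days, 12hours).
# Set SCONTROL=echo to print the commands instead of running them.
requeue_and_delay() {
    local delay=$1
    shift
    local scontrol=${SCONTROL:-scontrol}
    local jobid
    for jobid in "$@"; do
        # Skip the delay if the requeue itself failed.
        "$scontrol" requeue "$jobid" || continue
        "$scontrol" update JobID="$jobid" StartTime="now+${delay}"
    done
}

# Usage (in bash, brace expansion spells out the shared prefix once):
# requeue_and_delay 2days 1234567_1{1,2}
```

In bash, the brace expansion 1234567_1{1,2} expands to 1234567_11 1234567_12, so there is no need to export a prefix.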
Take care to ensure your jobs have everything they need (e.g. files) when they're eventually re-run.
Keep in mind re-queued jobs may behave differently when re-run. Think carefully e.g. about your random seeding!
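One way to keep seeding reproducible across a requeue is to derive the seed from the job's identity rather than from the clock. The sketch below assumes a requeued job keeps its job ID (which is our understanding of scontrol requeue) and that your arrays have fewer than 1000 tasks. derive_seed is a hypothetical helper name, and train.py is only a placeholder for your own program.

```shell
# derive_seed
# Hypothetical helper: prints a seed built only from the job's identity, so
# the same job gets the same seed on every (re-)run.
# Assumes fewer than 1000 array tasks; falls back to 0 outside of SLURM.
derive_seed() {
    local job=${SLURM_ARRAY_JOB_ID:-${SLURM_JOB_ID:-0}}
    local task=${SLURM_ARRAY_TASK_ID:-0}
    # Keep the result within a signed 32-bit range.
    echo $(( (job * 1000 + task) % 2147483647 ))
}

# Usage inside a batch script:
# seed=$(derive_seed)
# echo "restart ${SLURM_RESTART_COUNT:-0}, seed $seed"
# python train.py --seed "$seed"   # train.py is a placeholder
```

SLURM_RESTART_COUNT is set by SLURM once a job has been requeued, so logging it makes it easy to tell a first run from a re-run.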