Slurm: HPC batch scheduler

Jobs on all nodes are managed by the Slurm scheduler. For a general introduction to using Slurm, watch this video tutorial.

Here's a quick start on Slurm and a useful cheatsheet of the most common Slurm commands. Example submission scripts are available on Liger in the directory /softs/liger/slurm/examples/

First steps

To submit a job with a submission script:

 $ sbatch script.slurm
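As an illustration of what script.slurm might contain, here is a minimal sketch. The job name, resource values and output file name are assumptions chosen for the example, not Liger defaults; the scripts in the examples directory remain the reference. Run by hand, the `#SBATCH` lines are ordinary comments, so the script can be tried outside the scheduler.

```shell
#!/bin/bash
# Minimal job script (illustrative values; adjust to your needs).
# Lines starting with "#SBATCH" are read by sbatch as submission options.
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=test_%j.out

# In the output file name above, %j is replaced by the job ID.
# Slurm sets SLURM_JOB_ID inside a job; the fallback covers a manual run.
message="Job ${SLURM_JOB_ID:-unknown} running on $(hostname)"
echo "$message"
```

Once submitted with sbatch, the echoed line should end up in the file named by the --output option.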

To monitor your jobs that are pending or running:

 $ squeue -u $USER

or

 $ sacct -X

The squeue command displays information in the following form:

JOBID  PARTITION  NAME  USER  ST   TIME  NODES  NODELIST(REASON)   
  235  part_name  test   abc   R  00:02      1  r6i3n1

where:

JOBID: Job identifier
PARTITION: Partition used
NAME: Job name
USER: User name of job owner
ST: Status of job execution (R=running, PD=pending, CG=completing)
TIME: Elapsed time
NODES: Number of nodes used
NODELIST(REASON): List of nodes used or, for a pending job, the reason it is waiting
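The column layout described above lends itself to simple post-processing. The sketch below counts jobs per state by reading the fifth column (ST). The helper name `count_states` is hypothetical, and the sample text is copied from the listing above so the function can be tried without cluster access.

```shell
#!/bin/bash
# Count jobs per state (column 5, "ST") in squeue-style output read from stdin.
# The first line is skipped because it holds the column headers.
count_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample text copied from the listing above.
sample='JOBID  PARTITION  NAME  USER  ST   TIME  NODES  NODELIST(REASON)
  235  part_name  test   abc   R  00:02      1  r6i3n1'

summary=$(printf '%s\n' "$sample" | count_states)
echo "$summary"

# On the cluster, the same function can read live data:
#   squeue -u "$USER" | count_states
```

For the sample this prints `R 1`. The approach assumes job names contain no spaces, since a space would shift the columns.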

Note: You can add the --start option to squeue to display an estimated start time for your pending jobs (“START_TIME” column). When Slurm has no reliable estimate for a job, the value is shown as not available (“N/A”). Because the list of pending jobs changes constantly, the start time given by Slurm is only an estimate and may change with the machine load.

To obtain complete information about a job (allocated resources and execution status):

 $ scontrol show job $JOBID

To cancel a job:

 $ scancel $JOBID
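The commands above expect $JOBID to hold a job identifier. One way to obtain it is to capture it at submission time, as in the sketch below. The helper name `job_id_from_sbatch` is hypothetical, and the code assumes sbatch prints its usual "Submitted batch job <id>" confirmation line; it is exercised on a sample line so it can be tried without cluster access.

```shell
#!/bin/bash
# Extract the job ID from sbatch's confirmation line read from stdin.
# Assumes the usual "Submitted batch job <id>" wording.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Tried here on a sample line, so no cluster access is needed.
JOBID=$(echo "Submitted batch job 235" | job_id_from_sbatch)
echo "Captured job ID: $JOBID"

# On the cluster:
#   JOBID=$(sbatch script.slurm | job_id_from_sbatch)
#   scontrol show job "$JOBID"
#   scancel "$JOBID"
```

sbatch also offers a --parsable option that prints the job ID without the surrounding text, which may make the awk step unnecessary.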

Comments

A complete reference table of Slurm commands is available here.

If a problem occurs on the machine, the default Slurm configuration automatically restarts running jobs from scratch. To avoid this behavior, use the --no-requeue option when submitting, that is, submit your job with

 $ sbatch --no-requeue script.slurm

or add the line

 #SBATCH --no-requeue

in your submission script.
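
To show where such a line sits, here is a minimal sketch of a submission script header. The job name and the echoed message are assumptions made for the example. sbatch only reads `#SBATCH` lines that appear before the first executable command, so the option is placed at the top of the script.

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --no-requeue

# The directives above must come before any command: sbatch stops reading
# "#SBATCH" lines once it reaches the first executable line.
status="job started without automatic requeue"
echo "$status"
```

Run by hand, the `#SBATCH` lines are plain comments; they only take effect when the script is submitted with sbatch.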