Slurm options for GPU resources#
When you submit a job with Slurm on Liger, you must specify:
- A partition, which defines the type of compute nodes you wish to reserve.
You may wish to set additional options in order to customise your jobs. Some of the options are:
- A QoS (Quality of Service), set with --qos, which calibrates your resource needs (number of nodes, execution time, ...). If not specified, it takes the default QOS defined for your account.
- A Liger account, or project, can be specified with --account to access resources that are restricted to users of a particular project, or simply to "bill" the computation time to a specific account.
- A reservation, only needed to access resources that are reserved for a particular group.
- A number of cores, number of nodes, etc. Note that the number of cores you request determines the memory you are allocated, according to the following formula: allocated memory = (TOTAL MEM / TOTAL CORES) * (number of cores requested)
- See the Slurm website for all the available options.
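Put together, a minimal batch script using these options might look like the following sketch. The account and QOS names are placeholders, not actual Liger values:

```shell
#!/bin/bash
#SBATCH --partition=gpus        # partition defining the node type
#SBATCH --qos=myqos             # placeholder QOS name; see the QOS policy
#SBATCH --account=myproject     # placeholder project account to bill
#SBATCH --nodes=1               # number of nodes
#SBATCH --ntasks=4              # number of cores
#SBATCH --time=01:00:00         # walltime, within the QOS limit

srun ./my_gpu_program           # placeholder executable
```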
There is one partition on Liger for GPU resources, called gpus, defined on the GPU nodes as follows:
```
PartitionName=gpus AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL
   Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO
   GraceTime=0 Hidden=NO MaxNodes=4 MaxTime=4-04:00:00 MinNodes=0 LLN=NO
   MaxCPUsPerNode=UNLIMITED Nodes=turing[01-03],viz[01-04]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
   OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=208 TotalNodes=7
   SelectTypeParameters=NONE JobDefaults=(null)
   DefMemPerCPU=8192 MaxMemPerNode=368640
```
That means here we have:
- 8192 MB of RAM per core (DefMemPerCPU)
- 12 cores per GPU
- a total of 368 GB of RAM per node (MaxMemPerNode = 368640 MB)
Note: DefMemPerCPU and MaxMemPerNode correspond to the nodes with the largest memory capacity in Liger. Other GPU nodes have less memory and will therefore throw an error if you try to reserve more memory than they have.
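As a worked example of the memory formula above, assuming the partition default of 8192 MB per core, requesting 4 cores yields:

```python
# allocated memory = (TOTAL MEM / TOTAL CORES) * cores requested,
# where TOTAL MEM / TOTAL CORES is DefMemPerCPU = 8192 MB on this partition
def_mem_per_cpu_mb = 8192   # DefMemPerCPU from the partition definition
cores_requested = 4         # e.g. #SBATCH --ntasks=4

allocated_mb = def_mem_per_cpu_mb * cores_requested
print(allocated_mb)         # 32768 MB, i.e. 32 GB
```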
Options cheat sheet#
Below is information on the Slurm options for GPU jobs. Remember to prefix each option with #SBATCH in your batch script.
- For all jobs on GPUs, you must specify the GPU partition: --partition=gpus
- For jobs on turing01 (reserved) you must specify at least:
--reservation=turing01 --account=(1) --qos=(2) --nodelist=turing01
- (1) one among:
- (2) one among the available QOS values: see the QOS policy below.
Even when using other nodes, it is advisable to specify all the options above with the desired settings, in order to make your job configuration explicit.
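A sketch of a batch script for turing01 combining the options above. The account and QOS names are placeholders, since the actual values depend on your project (see (1) and (2) above):

```shell
#!/bin/bash
#SBATCH --reservation=turing01   # access the reserved node
#SBATCH --nodelist=turing01      # pin the job to turing01
#SBATCH --account=myproject      # placeholder: use one of the accounts from (1)
#SBATCH --qos=myqos              # placeholder: use one of the QOS values from (2)
#SBATCH --partition=gpus
#SBATCH --gres=gpu:1             # one GPU card

srun ./my_gpu_program            # placeholder executable
```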
To request GPU nodes:
- 1 node with 1 core and 1 GPU card
- 1 node with 2 cores and 2 GPU cards
- 1 node with 3 cores and 3 GPU cards, specifically Tesla V100 cards. Note that it is always best to request at least as many CPU cores as GPUs.
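The three requests above could be expressed as follows. This is a sketch using standard Slurm --gres syntax; the "v100" GRES type name is an assumption, so check the actual type names configured on Liger:

```shell
# 1 node with 1 core and 1 GPU card
srun --partition=gpus --nodes=1 --ntasks=1 --gres=gpu:1 ./my_gpu_program

# 1 node with 2 cores and 2 GPU cards
srun --partition=gpus --nodes=1 --ntasks=2 --gres=gpu:2 ./my_gpu_program

# 1 node with 3 cores and 3 Tesla V100 cards (typed GRES; "v100" is assumed)
srun --partition=gpus --nodes=1 --ntasks=3 --gres=gpu:v100:3 ./my_gpu_program
```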
The available GPU node configurations are shown here.
When you request GPUs, the system will set two environment variables (typically CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL) - we strongly recommend you do not change or unset these variables:
To your application, it will look like you have GPUs 0, 1, ... (up to as many GPUs as you requested). So if, for example, there are two jobs from different users, the first requesting 1 GPU card and the second 3 GPU cards, and they happen to land on the same node gpu-08, each job will still see its own GPUs numbered from 0, even though they map to different physical devices.
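A sketch of what each of the two jobs above would observe on gpu-08, assuming CUDA_VISIBLE_DEVICES is the variable set (the indices are illustrative):

```shell
# Inside job 1 (requested 1 GPU card):
echo $CUDA_VISIBLE_DEVICES    # the single GPU appears as device 0

# Inside job 2 (requested 3 GPU cards) on the same node:
echo $CUDA_VISIBLE_DEVICES    # its GPUs appear as devices 0,1,2
```

Because each job's numbering starts at 0, application code can always address the first allocated GPU as device 0 regardless of which physical cards were assigned.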