Slurm GPU partition and QOS
When you submit a job with Slurm on Liger, you must specify:
- A partition which defines the type of compute nodes you wish to reserve.
- A QoS (Quality of Service), which calibrates your resource needs (number of nodes, execution time, ...).
If no QoS is specified, the default QoS defined for your account is used.
There is one partition on Liger for GPU resources (including turing01). It is called gpus, and is defined on the GPU nodes as follows:
```
PartitionName=gpus
   AllowGroups=ALL AllowAccounts=gpu-coquake,gpu-milcom,gpu-others,gpu-ici AllowQos=ALL
   AllocNodes=ALL Default=NO DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=4-12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=viz[01-04],turing01
   Priority=1 RootOnly=NO ReqResv=NO Shared=YES:4 PreemptMode=OFF
   State=UP TotalCPUs=120 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerCPU=8192 MaxMemPerNode=368640
```
This means:
- 8192 MB of RAM per core
- 12 cores per GPU
- a total of 368 GB of RAM per node
Note: DefMemPerCPU and MaxMemPerNode correspond to the memory of the largest-capacity GPU nodes in Liger. Other GPU nodes have less memory, so requesting more memory than a node actually has will produce an error.
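As a sketch, a job script that stays within these limits might look like the following (the application name is a placeholder; adjust the core and memory counts to your needs):

```shell
#!/bin/bash
#SBATCH --partition=gpus
#SBATCH --nodes=1
#SBATCH --ntasks=12          # the 12 cores paired with one GPU
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=8192M  # matches DefMemPerCPU; 12 x 8192 MB stays
                             # under MaxMemPerNode=368640 MB
#SBATCH --time=01:00:00

srun ./my_gpu_app            # placeholder for your application
```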
To request GPU nodes, for example:
- 1 node with 1 core and 1 GPU card
- 1 node with 2 cores and 2 GPU cards
- 1 node with 3 cores and 3 GPU cards of type Tesla V100
Note that it is always best to request at least as many CPU cores as GPUs.
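The three examples above could be requested with commands like these (a sketch using standard Slurm options; the GPU type name `v100` in the last line is an assumption, check the actual gres type with `scontrol show node turing01`):

```shell
# 1 node, 1 core, 1 GPU card
srun --partition=gpus --nodes=1 --ntasks=1 --gres=gpu:1 --pty bash

# 1 node, 2 cores, 2 GPU cards
srun --partition=gpus --nodes=1 --ntasks=2 --gres=gpu:2 --pty bash

# 1 node, 3 cores, 3 Tesla V100 cards (type name "v100" is an assumption)
srun --partition=gpus --nodes=1 --ntasks=3 --gres=gpu:v100:3 --pty bash
```

The same `--partition`, `--ntasks`, and `--gres` options can be used as `#SBATCH` directives in a batch script.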
The available GPU node configurations are shown here.
When you request GPUs, the system sets two environment variables; we strongly recommend that you do not change or unset them.
To your application, it will look like you have GPUs 0, 1, ... (up to as many GPUs as you requested). For example, suppose two jobs from different users happen to land on the same node gpu-08: the first requests 1 GPU card and the second requests 3 GPU cards.
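The renumbering in that scenario can be sketched as follows, assuming Slurm exports `CUDA_VISIBLE_DEVICES` per job (the standard behaviour for `gres/gpu`; the physical device numbers shown are illustrative):

```shell
# On gpu-08, the scheduler might assign the devices like this:
#
#   Job 1 (1 GPU):  CUDA_VISIBLE_DEVICES=0
#                   -> the application sees a single device, numbered 0
#   Job 2 (3 GPUs): CUDA_VISIBLE_DEVICES=1,2,3
#                   -> the application still numbers its devices 0, 1, 2
#
# Inside either job you can inspect the assignment with:
echo "$CUDA_VISIBLE_DEVICES"
nvidia-smi -L   # lists only the physical GPUs visible to this job
```

Because CUDA renumbers visible devices from 0, code that addresses "GPU 0" works unchanged regardless of which physical card was allocated.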