Running AI jobs on Liger GPUs#

This document explains how to run GPU jobs. It refers to some of the content in the Quick start guide; consult "Quick start" for a more practical overview and to run a concrete AI application.

Singularity operations#

Singularity can perform a wide range of operations, including launching and stopping containers and moving files to and from containers. The complete list of operations can be displayed with the command:

singularity help

The most relevant operations are shell, exec and run, which are used to run containers on the system.

singularity shell <CONTAINER>: start a shell session inside the CONTAINER.
singularity exec <CONTAINER> <COMMAND>: execute COMMAND inside the CONTAINER.
singularity run <CONTAINER>: run the container's default command (configured in the build).

Furthermore, some more options can be added to these commands in order to enable important features:

--nv flag: gives the container access to the NVIDIA drivers and libraries of the host, so that it can use the NVIDIA GPUs.
-B src:dest flag: binds (mounts) files such as programs and data from the src path on the host server to the dest path inside the container. The dest path does not need to exist in the container beforehand; it will be created.
This is needed because the container has its own filesystem, so it will not see folders of the host system unless they are explicitly bound with the -B option.
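
The two options can be combined in a single call. As a minimal sketch, the helper below only prints the resulting command instead of running it, so it can be tried on any machine; the function name build_singularity_cmd, the paths and the image name are hypothetical placeholders.

```shell
# Hypothetical helper: print a "singularity exec" call with GPU support (--nv)
# and one bind mount (-B). The command is only printed, not executed.
build_singularity_cmd() {
    src="$1"        # path on the host
    dest="$2"       # path inside the container
    container="$3"  # container image file
    shift 3         # the remaining arguments are the command to run inside
    printf '%s\n' "singularity exec --nv -B ${src}:${dest} ${container} $*"
}

# Example with placeholder paths and image name (adapt to your own setup)
build_singularity_cmd /path/to/dataset /myWorkspace AI-container.simg ls /myWorkspace
```

Once the printed command looks right, it can be pasted into a session on a GPU node.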

These options are likely to be needed every time you want to submit a GPU job. For a more detailed overview on Singularity features, check out the documentation.

Examples#

Run a shell session inside a container and make your dataset folder visible (bind) inside the container:

singularity shell -B /path/to/dataset:/myWorkspace AI-container.simg

Start a deep learning training inside a TensorFlow container including NVIDIA libraries to run on GPUs:

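singularity exec --nv -B /path/to/code:/myWorkspace tensorflow.simg python3 /myWorkspace/train.py

Here /path/to/code is the host folder that contains train.py. Because it is bound to /myWorkspace, the script is addressed by its path inside the container. Note that the dest path given to -B should be an absolute path.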

Slurm + Singularity submissions#

On Liger, you have to use Slurm to reserve GPU resources. As for all other nodes, there are two ways of running a program on GPU nodes: interactively and through batch submission.

Interactive Jobs#

You might need an interactive job for short tasks such as visualising data, pre-processing data or debugging your program, and in general whenever you need immediate feedback. For instance, in the Quick start guide, an interactive session was used to visualise the predictions of the computer vision classifier that was previously trained (see reference for detailed example).

SSH into Liger (-X to enable visualisation):

ssh -X <username>@liger

Then submit a job to a GPU node in order to reserve and access the node. A good way to do this is to start a shell through the srun Slurm command:

srun --pty -p gpus --gres=gpu:1 --account=<project-id> bash

The above command will reserve the node for 1 hour. Use the option -t to set a longer time period. The options -p gpus and --account=<project-id> specify the Slurm partition and the account enabled for GPU resources, respectively; both are required to access the GPU nodes. Add -w <node> if you require a specific node, or --qos=<qos> if you require a QOS other than the default one.
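
To illustrate the optional flags, the following sketch composes an srun call from a few shell variables and prints it for review. The values (a two-hour limit, node turing01) are examples only, and <project-id> remains a placeholder to be replaced with your own project ID.

```shell
# Sketch: compose the srun call from variables and print it for review.
# All values below are illustrative; replace them with your own.
PARTITION="gpus"
ACCOUNT="<project-id>"     # placeholder, as in the command above
TIME_LIMIT="02:00:00"      # -t: two hours instead of 1 hour
NODE="turing01"            # -w: leave empty to let Slurm choose the node

SRUN_CMD="srun --pty -p ${PARTITION} --gres=gpu:1 --account=${ACCOUNT} -t ${TIME_LIMIT}"
if [ -n "${NODE}" ]; then
    SRUN_CMD="${SRUN_CMD} -w ${NODE}"
fi
SRUN_CMD="${SRUN_CMD} bash"

printf '%s\n' "${SRUN_CMD}"
```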

Once the job has been allocated, you can SSH into the GPU node:

ssh -X turing01

Then load the latest version of Singularity:

module load singularity

Now you are ready to run interactive GPU jobs through Singularity. For example, to run an interactive Python session in the TensorFlow container provided on Liger:

singularity exec --nv /softs/singularity/containers/ai/ngc-tf2.3-fat_latest.sif ipython3 --pylab
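
Before starting a longer session, it can help to confirm that the GPU is visible from inside the container. This is a minimal sketch, assuming that nvidia-smi is available on the GPU node and is exposed to the container by the --nv option; the function name check_gpu is hypothetical.

```shell
# Container used in the example above
CONTAINER=/softs/singularity/containers/ai/ngc-tf2.3-fat_latest.sif

# Hypothetical helper: run nvidia-smi inside the given container.
# Returns non-zero if Singularity is not loaded or the call fails.
check_gpu() {
    if ! command -v singularity >/dev/null 2>&1; then
        echo "singularity not found: run 'module load singularity' first"
        return 1
    fi
    singularity exec --nv "$1" nvidia-smi
}

check_gpu "${CONTAINER}" || echo "GPU check failed"
```

If the check succeeds, nvidia-smi lists the GPUs reserved for the job.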

A list of all available containers in Liger can be found here: Using and building containers.

Batch submission#

When running time-consuming computations, such as deep learning trainings with long convergence periods or dataset generation, you might want to submit a job and let it run in the system until it finishes. This was exactly the case for the MNIST training in the Quick start page.

The sbatch command is used for this type of submission. It takes a batch file containing the instructions needed to use GPU resources (see below and reference). Below is an example batch submission file. Let's assume that we want to execute a Python AI training program called train.py; this would be your job_script.sh file:

#!/bin/bash
#SBATCH --job-name=single_gpu        # name of job
#SBATCH --account=<project_id>       # replace <project_id> by your project ID
#SBATCH --ntasks=1                   # total number of processes (= number of GPUs here)
#SBATCH -p gpus                      # name of the GPU partition
#SBATCH -w turing02                  # name of the target GPU server
#SBATCH --gres=gpu:1                 # number of GPUs (1/4 of GPUs)
#SBATCH --cpus-per-task=12           # number of cores per task (1/4 of the 4-GPUs node)
# /!\ Caution, "multithread" in Slurm vocabulary refers to hyperthreading.
#SBATCH --hint=nomultithread         # hyperthreading is deactivated
#SBATCH --time=00:10:00              # maximum execution time requested (HH:MM:SS)
#SBATCH --output=gpu_single%j.out    # name of output file
#SBATCH --error=gpu_single%j.out     # name of error file (here, appended with the output file)

# cleans out the modules loaded in interactive and inherited by default 
module purge

# load Singularity
module load singularity   

# echo of launched commands
set -x

# code execution. As in the previous examples, use singularity to run your job in a
# container of your choice. Let's use the TF container once again.
#   --nv          run the program with the NVIDIA GPUs
#   -B ./:/app    bind the current folder to /app inside the container
#   the .sif file is the container image
#   python3 /app/train.py is the command to run (Python interpreter + AI training file)
singularity exec --nv -B ./:/app \
            /softs/singularity/containers/ai/ngc-tf2.3-fat_latest.sif \
            python3 /app/train.py

Then copy the files to your home directory on Liger (assuming you are in the same folder as your files):

scp train.py <LIGER-ID>@liger:
scp job_script.sh <LIGER-ID>@liger:

SSH into Liger (remember the -X to enable visualisation):

ssh -X <username>@liger

And submit your job through sbatch:

sbatch job_script.sh

If the resources are available, your job will start on a GPU node and run until it finishes.
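
On submission, sbatch prints a confirmation line of the form "Submitted batch job <id>". The sketch below extracts the ID from such a line so that it can be passed to squeue or scancel. The helper name job_id_from_sbatch is hypothetical and the sample ID is made up; the output file name follows the --output pattern of the script above.

```shell
# Hypothetical helper: extract the numeric job ID from sbatch's confirmation line.
job_id_from_sbatch() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# On Liger this would be: SUBMIT_LINE="$(sbatch job_script.sh)"
SUBMIT_LINE="Submitted batch job 123456"   # made-up sample line
JOB_ID="$(job_id_from_sbatch "${SUBMIT_LINE}")"

echo "Check the queue with:  squeue -j ${JOB_ID}"
echo "Follow the output in:  gpu_single${JOB_ID}.out"
echo "Cancel the job with:   scancel ${JOB_ID}"
```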

Refer to the exec.sl file in the repository for another example of bash script.

Troubleshooting#

Refer to the Troubleshooting page.