AI Overview#

Liger has a dedicated partition (a group of servers) for artificial intelligence workloads. These servers are equipped with powerful GPUs, the most popular accelerators for big-data workloads, which can be used to speed up state-of-the-art AI jobs.

All AI software resources are contained in the following GitLab repository: liger-ai-tools

Ensure you are able to connect to the ICI HPC clusters. As with all tests and compilation, you MUST work on a compute node.

GPU resources#

The following list of GPU nodes is dedicated to Artificial intelligence jobs:

Name	Model	CPUs	RAM	GPUs	GPU RAM
turing01	DELL C4140	2x Xeon Gold 6252, 24 cores @ 2.10 GHz each, 48 cores in total	384 GB	4x NVIDIA Tesla V100, Tensor Cores, NVLink high-bandwidth interconnect	32 GB
turing[02-03]		2x AMD EPYC 7313, 16 cores @ 3 GHz each, 32 cores in total	384 GB	2x 2 GPUs NVIDIA Tesla A100, Tensor Cores, NVLink high-bandwidth interconnect	40 GB
viz[01-04]	bullx R421-E4	2x Intel Xeon E5-2680v3, 12 cores @ 2.5 GHz each, 24 cores in total	256 GB	4x 2 GPUs NVIDIA K80	12 GB

* Expressions in square brackets indicate a range, e.g. node[01-10] = 10 servers: node01, node02 ... node10

* Data on each row refer to a single server
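
As an illustration of the range notation, the short sketch below expands viz[01-04] into individual host names using only standard shell tools. The variable name `VIZ_NODES` is chosen here for illustration and is not part of the Liger configuration.

```shell
#!/bin/sh
# Expand the range notation viz[01-04] into one host name per line.
# seq -f applies a printf-style format to each number in the range;
# %02g pads the number to two digits with a leading zero.
VIZ_NODES=$(seq -f "viz%02g" 1 4)
echo "$VIZ_NODES"

# On a machine where Slurm is installed, the scheduler can do the same expansion:
#   scontrol show hostnames 'viz[01-04]'
```
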

Policies#

turing01 - Project Accounting#

The GPU server turing01 (DELL C4140), tailored for AI research, was co-funded by the ECN/Research Department and two on-site partner labs (GeM and LS2N), and is hosted by the ECN/ICI lab. The machine is equipped with four powerful V100 GPUs and is integrated in LIGER, an HPC system at ECN/ICI.

Identify the Liger project ID that grants you access to turing01:

project ID	Description
`gpu-milcom`	users on project MILCOM/LS2N
`gpu-coquake`	users on project COQUAKE/GeM
`gpu-others`	ECN Lab Users, request on demand

Specify your project ID with the following option in your SBATCH script before submission:

--account=<project ID>

Check the status of your job and manage your submission
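
A minimal sketch of a job script using the `--account` option is shown below. Only the account option and the project IDs come from this page; the job name, the explicit node request, the `gpu` generic resource name and the time limit are assumptions that should be checked against the Liger Slurm configuration.

```shell
#!/bin/bash
# Hypothetical job name, used only to identify the job in the queue.
#SBATCH --job-name=ai-test
# Project ID from the table above; replace with the one granted to you.
#SBATCH --account=gpu-others
# Assumption: turing01 is requested explicitly by node name.
#SBATCH --nodelist=turing01
# Assumption: GPUs are exposed as a generic resource named "gpu".
#SBATCH --gres=gpu:1
# Assumption: a 10-minute limit is enough for a first test.
#SBATCH --time=00:10:00

# Slurm sets these variables inside a job; placeholders are used elsewhere.
MSG="Job ${SLURM_JOB_ID:-<none>} charged to account ${SLURM_JOB_ACCOUNT:-<unset>}"
echo "$MSG"
```

Saved as, for example, `turing_job.sh`, the script would be submitted with `sbatch turing_job.sh`; `squeue -u $USER` then lists your pending and running jobs.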

All other resources#

No specific restrictions (Liger standard Slurm restrictions apply).

Job submission#

AI jobs are submitted via the Slurm batch scheduler, like all other jobs on Liger. See the other guides in this documentation to learn how to use Slurm.
Example scripts and resources can be found in the repository: liger-ai-tools

Applications#

The AI GPU resources are configured to host containerised applications. It is therefore strongly recommended not to run programs directly on the servers, since the host environment (installed programs, server configuration, etc.) is unlikely to be compatible with most applications. Instead, all jobs should be submitted (via Slurm) through Singularity, the container engine installed on Liger.
Non-containerised applications (e.g. software installed directly on the GPU servers) will NOT be supported.

We provide some pre-built containers with common DL environments that can be used out of the box and are optimised to run on NVIDIA GPUs.
To ensure your application has all the required dependencies, you can use the containers available on the system, pull external containers, or build your own. More information is available in Using and building containers.
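
As a rough sketch of what running inside a container looks like, the snippet below wraps `singularity exec` with the `--nv` flag, which makes the host NVIDIA driver and GPUs visible inside the container. The image path, the script name and the helper function `run_in_container` are hypothetical and only serve the illustration; the actual images provided on Liger are described in the container guide.

```shell
#!/bin/bash
# Hypothetical paths: replace with a real image and your own script.
IMAGE="$HOME/containers/pytorch.sif"
SCRIPT="train.py"

# Helper (illustrative name): run any command inside the container.
# --nv exposes the host NVIDIA driver and GPUs to the container.
run_in_container() {
    singularity exec --nv "$IMAGE" "$@"
}

if command -v singularity >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
    run_in_container python "$SCRIPT"
else
    # Dry run when Singularity or the image is not available on this machine.
    echo "Would run: singularity exec --nv $IMAGE python $SCRIPT"
fi
```

In a Slurm job, the same command would go in the body of the batch script, after the `#SBATCH` directives.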

Useful resources#

This documentation does not provide tutorials on Deep Learning (DL). For that, we encourage you to take a look at:

- MIT introduction to deep learning
- Nvidia resources on deep learning, and developer site
- the courses taught at Master Datascience Paris Saclay and available on https://github.com/m2dsupsdlclass/lectures-labs