AI Overview#

Liger has a dedicated partition (a group of servers) for artificial intelligence workloads. These servers are equipped with powerful GPUs, the most popular accelerators for Big Data, which can be used to speed up state-of-the-art AI computational jobs.

All AI software resources are contained in the following GitLab repository: liger-ai-tools

Make sure you can connect to the ICI HPC clusters. As with all tests and compilation, you MUST work on a compute node.

GPU resources#

The following list of GPU nodes is dedicated to Artificial intelligence jobs:

| Name | Model | CPUs | RAM | GPUs | GPU RAM |
|---|---|---|---|---|---|
| turing01 | DELL C4140 | 2x Xeon Gold 6252 (24 cores @ 2.10 GHz; 48 cores in total) | 384GB | 4x GPU Nvidia Tesla V100 (Tensor cores; NVLink hyper bandwidth) | 32GB |
| turing[02-03] | | 2x AMD EPYC 7313 (16 cores @ 3 GHz; 32 cores in total) | 384GB | 2x 2 GPU Nvidia Tesla A100 (Tensor cores; NVLink hyper bandwidth) | 40GB |
| viz[01-04] | bullx R421-E4 | 2x Intel Xeon E5-2680v3 (12 cores @ 2.5 GHz; 24 cores in total) | 256GB | 4x 2 GPUs NVIDIA K80 | 12GB |

* Expressions in square brackets indicate a range, e.g. node[01-10] = 10 servers: node01, node02 ... node10
* Data on each row refers to a single server
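
The bracket notation can be expanded mechanically. The following is a minimal sketch in plain bash; the function name `expand_nodes` is hypothetical, and it assumes a single `[start-end]` range with zero-padded decimal bounds. On a system where Slurm is installed, `scontrol show hostnames 'viz[01-04]'` performs the same expansion.

```shell
#!/usr/bin/env bash
# Expand a bracketed range such as viz[01-04] into individual host names.
# Assumption: exactly one [start-end] range with zero-padded decimal bounds.
expand_nodes() {
    local spec="$1"
    local prefix="${spec%%\[*}"     # text before '['
    local range="${spec#*\[}"       # text after '['
    range="${range%\]*}"            # drop the trailing ']'
    local first="${range%-*}"       # lower bound, e.g. 01
    local last="${range#*-}"        # upper bound, e.g. 04
    local width="${#first}"         # preserve the zero padding
    local i
    # 10# forces base 10 so that bounds like 08 are not read as octal
    for ((i = 10#$first; i <= 10#$last; i++)); do
        printf '%s%0*d\n' "$prefix" "$width" "$i"
    done
}

# Print one host name per line for the viz nodes
expand_nodes 'viz[01-04]'
```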

Policies#

turing01 - Project Accounting#

The GPU server turing01 (DELL C4140), tailored for AI research, was co-funded by the ECN/Research Department and two on-site partner labs (GeM and LS2N), and is hosted by the ECN/ICI lab. The machine is equipped with powerful GPUs (4x V100) and is integrated in LIGER, an HPC system at ECN/ICI.

  1. Identify the Liger project ID that grants you access to turing01:

     | Project ID | Description |
     |---|---|
     | gpu-milcom | users on project MILCOM/LS2N |
     | gpu-coquake | users on project COQUAKE/GeM |
     | gpu-others | ECN Lab users, request on demand |

  2. Specify your project ID with the following option in your SBATCH script before submission:

    --account=<project ID>

  3. Check the status of your jobs and manage your submissions.
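
The steps above can be sketched as a small bash helper that prints a batch script with the right `--account` line. This is only an illustration: the function name `make_turing_job`, the job name, the `--nodelist` and `--gres` values and the time limit are assumptions, not settings confirmed by this documentation. Check the examples in liger-ai-tools for the values actually used on Liger.

```shell
#!/usr/bin/env bash
# Print a minimal Slurm batch script for turing01 to stdout.
# $1: project ID from the table above (gpu-milcom, gpu-coquake or gpu-others)
make_turing_job() {
    local account="$1"
    case "$account" in
        gpu-milcom|gpu-coquake|gpu-others) ;;
        *) echo "unknown project ID: $account" >&2; return 1 ;;
    esac
    cat <<EOF
#!/bin/bash
# Hypothetical job name
#SBATCH --job-name=ai-test
# Project ID used for accounting on turing01
#SBATCH --account=$account
# Assumption: the node is requested by name
#SBATCH --nodelist=turing01
# Assumption: GPUs are exposed as a "gpu" generic resource
#SBATCH --gres=gpu:1
# Illustrative time limit
#SBATCH --time=00:10:00

# List the GPUs visible to the job
nvidia-smi
EOF
}

# Show the script that would be submitted for a MILCOM/LS2N user
make_turing_job gpu-milcom
```

Redirecting the output to a file (for instance `make_turing_job gpu-milcom > turing_job.sbatch`) gives a script that can be submitted with `sbatch turing_job.sbatch`. Once submitted, `squeue -u "$USER"` lists your pending and running jobs, and `scancel <job ID>` cancels one.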


All other resources#

No specific restrictions (Liger standard Slurm restrictions apply).

Job submission#

AI jobs are submitted through the Slurm batch scheduler, like all other jobs on Liger. See the other guides in this documentation to learn how to use Slurm.
Example scripts and resources can be found in the repository: liger-ai-tools

Applications#

The Artificial Intelligence GPU resources are configured to host containerised applications. Running programs directly on the servers is therefore strongly discouraged, since the host environment (installed programs, server configuration, etc.) is unlikely to be compatible with most applications. Instead, all jobs should be submitted (via Slurm) through Singularity, the container engine installed on Liger.
Non-containerised applications (i.e. software installed directly on the GPU servers) will NOT be supported.

We provide some pre-built containers with common DL environments that can be used out of the box and are optimised to run on NVIDIA GPUs.
To ensure your application has all the required dependencies, you can use the containers available on the system, pull external containers, or build your own. More information is available in Using and building containers.
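
As a rough illustration of a containerised job, the sketch below writes a sample batch script to a file and checks its shell syntax without running it. The file name `container_job.sbatch`, the image path, the program `train.py` and the `#SBATCH` values are all placeholders, not resources known to exist on Liger. The `--nv` flag is the Singularity option that makes the host NVIDIA driver and GPUs available inside the container.

```shell
#!/usr/bin/env bash
# Write a sample job script that runs a program inside a Singularity container.
# The quoted 'EOF' keeps $IMAGE literal so it is expanded when the job runs.
cat > container_job.sbatch <<'EOF'
#!/bin/bash
# Hypothetical job name, GPU request and time limit
#SBATCH --job-name=container-test
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00

# Placeholder image and program; replace with your own.
IMAGE=/path/to/image.sif

# --nv exposes the host NVIDIA driver and GPUs to the container.
singularity exec --nv "$IMAGE" python train.py
EOF

# Check the generated script for shell syntax errors without executing it.
bash -n container_job.sbatch && echo "container_job.sbatch: syntax OK"
```

The generated file would then be submitted with `sbatch container_job.sbatch`, adding the `--account` option where the turing01 policy applies. External images can be fetched beforehand with `singularity pull`.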

Useful resources#

This documentation does not provide tutorials on Deep Learning (DL). For that, we encourage you to take a look at:

* MIT introduction to deep learning
* Nvidia resources on deep learning, and developer site
* the courses taught at Master Datascience Paris Saclay, available at https://github.com/m2dsupsdlclass/lectures-labs