AI Overview#
Liger has a specific partition (group of servers) dedicated to Artificial Intelligence workloads. These servers are equipped with powerful GPUs (the most popular accelerators for AI and Big Data) that can be exploited to speed up state-of-the-art AI computational jobs.
All AI software resources are contained in the following GitLab repository: liger-ai-tools
Ensure you are able to connect to the ICI HPC clusters. As usual, for all tests and compilation you MUST work on a computing node.
GPU resources#
The following list of GPU nodes is dedicated to Artificial intelligence jobs:
Name | Model | CPUs | RAM | GPUs | GPU RAM |
---|---|---|---|---|---|
turing01 | DELL C4140 | 2x Xeon Gold 6252, 24 cores @ 2.10 GHz (48 cores in total) | 384 GB | 4x NVIDIA Tesla V100 (Tensor cores, NVLink high-bandwidth interconnect) | 32 GB |
turing[02-03] | - | 2x AMD EPYC 7313, 16 cores @ 3 GHz (32 cores in total) | 384 GB | 2x 2 NVIDIA Tesla A100 (Tensor cores, NVLink high-bandwidth interconnect) | 40 GB |
viz[01-04] | bullx R421-E4 | 2x Intel Xeon E5-2680v3, 12 cores @ 2.5 GHz (24 cores in total) | 256 GB | 4x 2 NVIDIA K80 | 12 GB |
* Expressions in square brackets indicate a range, e.g. node[01-10] = 10 servers: node01, node02, ..., node10
* Data in each row refers to a single server
Policies#
turing01 - Project Accounting#
The GPU server (DELL C4140) called turing01, tailored for AI research, was co-funded by the ECN/Research Department and two partner labs on site (GeM and LS2N) hosted by the ECN/ICI lab. The machine is equipped with powerful GPUs (4x V100) integrated in LIGER, an HPC system at ECN/ICI.
- Identify the Liger project ID you have been granted to use Turing:

Project ID | Description |
---|---|
gpu-milcom | Users on project MILCOM/LS2N |
gpu-coquake | Users on project COQUAKE/GeM |
gpu-others | ECN lab users, on request |

- Specify your project ID with the following option in your SBATCH script before submission:

`--account=<project ID>`
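As a sketch, the account can be set either as a directive inside the batch script or on the command line at submission time (the project ID `gpu-coquake` below is just one of the values from the table above, and `myjob.slurm` is a hypothetical script name):

```shell
# Inside the batch script:
#SBATCH --account=gpu-coquake    # example: a COQUAKE/GeM user

# Or at submission time (command-line options override script directives):
sbatch --account=gpu-coquake myjob.slurm
```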
All other resources#
No specific restrictions (Liger standard Slurm restrictions apply).
Job submission#
AI jobs are submitted via the Slurm batch scheduler, like all other jobs on Liger. Check out the other guides in these docs to learn how to use Slurm.
Example scripts and resources can be found in the repository: liger-ai-tools
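As a minimal sketch, a batch script for a GPU job might look like the following; the partition name and resource values are assumptions to adapt to your project (check the actual partition names with `sinfo`):

```shell
#!/bin/bash
#SBATCH --job-name=ai-example     # illustrative job name
#SBATCH --account=<project ID>    # your Liger project ID (required on turing01)
#SBATCH --partition=gpu           # ASSUMPTION: verify the real partition name with `sinfo`
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --cpus-per-task=8         # CPU cores for the task
#SBATCH --time=02:00:00           # walltime limit (HH:MM:SS)

# Show the GPUs allocated to this job
nvidia-smi
```

Submit the script with `sbatch` and monitor it with `squeue -u $USER`.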
Applications#
Artificial Intelligence GPU resources are configured to host containerised applications. As a consequence, it is highly recommended to avoid running programs directly on the server, since the host environment (installed programs, server configuration, etc.) is unlikely to be compatible with most applications. Instead, all jobs should be submitted (via Slurm) through Singularity, the container engine installed on Liger.
Non-containerised applications will NOT be supported (i.e. installing software directly on the GPU servers).
We provide some pre-built containers with common DL environments that can be used out of the box and are optimised to run on NVIDIA GPUs.
To ensure your application has all the required dependencies, you can use the containers available on the system, pull external containers, or build your own. More information in Using and building containers.
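As a minimal sketch of a containerised GPU job (the image name `pytorch.sif`, the registry tag, and the script `train.py` are all hypothetical examples, not files provided by Liger):

```shell
#!/bin/bash
#SBATCH --account=<project ID>    # your Liger project ID
#SBATCH --gres=gpu:1              # request one GPU

# Pull a pre-built image from a registry (done once, ideally before submitting):
# singularity pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.01-py3

# --nv exposes the host NVIDIA driver and GPUs inside the container
singularity exec --nv pytorch.sif python train.py
```

The `--nv` flag is what makes the host GPUs visible inside the container; without it, CUDA applications in the image will not find any device.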
Useful resources#
This documentation does not provide tutorials on Deep Learning (DL). For that, we encourage you to take a look at:

- MIT introduction to deep learning
- NVIDIA resources on deep learning, and the NVIDIA developer site
- the courses taught at Master Datascience Paris Saclay, available at https://github.com/m2dsupsdlclass/lectures-labs