Quick start#

Reference repository: liger-ai-tools


Prerequisites#

  • Have a user account on Liger
  • General knowledge of Linux shell commands is recommended
  • Familiarity with Slurm is recommended

MNIST: handwritten digit classification#

The repository liger-ai-tools contains sample Python code that implements a classifier for the MNIST dataset: a classic computer-vision task that consists of recognising handwritten digits. The MNIST dataset is relatively simple and is therefore often used as a benchmark to test deep learning models and environments.
The programs implementing the MNIST task use TensorFlow and Keras for the neural network operations and PyLab (NumPy, Matplotlib) for data manipulation and visualisation. The relevant files can be found in the dl-examples directory:

  • mnist.npz: the MNIST dataset, containing 70,000 28×28-pixel images of handwritten digits.
  • mnist_train.py: data loading and processing, plus model creation and training.
  • mnist_predict.py: uses the generated model to predict the digits on unseen data.
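To get a feel for the .npz format before running the training script, the snippet below builds and reads a tiny stand-in archive. The key names (x_train, y_train, x_test, y_test) follow the layout commonly used for the Keras-format mnist.npz file and are an assumption here, not taken from the repository:

```python
import numpy as np

# Tiny stand-in archive mimicking the key layout of a Keras-style mnist.npz
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(6, 28, 28), dtype=np.uint8)
y_train = rng.integers(0, 10, size=6, dtype=np.uint8)
np.savez("mini_mnist.npz",
         x_train=x_train, y_train=y_train,
         x_test=x_train[:2], y_test=y_train[:2])

# Loading mirrors what a training script typically does, including the
# usual rescaling of pixel values from [0, 255] to [0.0, 1.0]
with np.load("mini_mnist.npz") as data:
    print(sorted(data.files))    # ['x_test', 'x_train', 'y_test', 'y_train']
    images = data["x_train"].astype("float32") / 255.0

print(images.shape)              # (6, 28, 28)
```

The real mnist.npz can be inspected the same way with np.load once the repository is cloned.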

These programs and the related processes can be used as a reference for implementing your own DL algorithms in the Liger environment.

The TensorFlow + PyLab environment is provided by a pre-built container already present on Liger at /softs/singularity/containers/ai/ngc-tf2.3-fat.sif.

Training the classifier#

SSH into Liger with visualisation enabled:

localhost:~$ ssh -X <LIGER-UID>@liger

Clone the liger-ai-tools repository using your credentials:

login02:~$ git clone https://<GIT-USERNAME>@gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools.git

Move into the repository:

login02:~$ cd liger-ai-tools

Run the training via the submission script specifying your account:

login02:~$ sbatch --account=<project-id> --qos=qos_gpu exec.sl

If turing01 has an available GPU, the job will be submitted to that node. TensorFlow binds to one of the GPUs, which performs the model training on the MNIST dataset. At the end of training, the model is saved in the dl-examples folder for later use.
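The actual submission script lives in the repository as exec.sl; a script for this kind of setup might look roughly like the following sketch. The directive values and the training-script path are assumptions for illustration, not the real contents of exec.sl:

```bash
#!/bin/bash
#SBATCH --job-name=mnist-train   # job name shown by squeue (assumed)
#SBATCH --partition=gpus         # GPU partition used elsewhere in this guide
#SBATCH --gres=gpu:1             # request a single GPU
#SBATCH --output=sjob.txt        # log file followed with tail -f in this guide

# Run the training script inside the pre-built TensorFlow container
module load singularity
singularity exec --nv -B ./:/app \
    /softs/singularity/containers/ai/ngc-tf2.3-fat.sif \
    python3 /app/dl-examples/mnist_train.py
```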
Follow the script execution with the following command:

login02:~$ tail -f sjob.txt

Alternatively, you can run the NVIDIA monitoring tool shortly after the job submission to see the GPU utilisation:

login02:~$ srun -p gpus -w turing01 nvidia-smi -l
Tue Dec  1 16:26:24 2020       
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   38C    P0    56W / 300W |    354MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   34C    P0    52W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
|   2  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   35C    P0    55W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
|   3  Tesla V100-SXM2...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   38C    P0    56W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A     53079      C   python3                           351MiB |

Making predictions#

Once the classifier is trained, the mnist_predict.py program can use the saved model to make a guess on a new handwritten digit (a sample from the test set that was not used for training).
This time the program is executed through an interactive session, as opposed to the sbatch job submission used for the training. The following steps walk through the execution of commands in an IPython shell session inside the TensorFlow container available on Liger.
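At its core, the prediction step amounts to feeding one test image through the model and taking the argmax of the resulting class probabilities. A minimal NumPy illustration of that final step (the probability values are made up; the real ones would come from the trained model):

```python
import numpy as np

# Hypothetical output of the model for a single test image: a
# probability vector over the ten digit classes
probabilities = np.array([0.01, 0.02, 0.05, 0.02, 0.01,
                          0.70, 0.08, 0.05, 0.04, 0.02])

# The predicted digit is the class with the highest probability
predicted_digit = int(np.argmax(probabilities))
print(predicted_digit)  # 5
```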

Reserve a GPU server (turing01 for this example):

login02:~$ srun --pty -p gpus -w turing01 --gres=gpu:1 --account=<project-id> bash

SSH into the GPU server with visualisation enabled:

localhost:~$ ssh -X turing01

Move into the repository:

turing01:~$ cd liger-ai-tools

Start the plot-enabled TensorFlow container with Singularity:

turing01:~$ module load singularity
turing01:~$ singularity run --nv -B ./:/app /softs/singularity/containers/ai/ngc-tf2.3-fat.sif

This command pattern can be used to start a shell session in any kind of container. Keep in mind that, inside a container, all the tools that were built into it at creation time (programs, data, etc.) are available even if they are not present in the host environment. In this case the container was pre-built by us, but any container can be further customised by the user to suit their needs.
Furthermore, files and folders on the host can be made visible inside the container at runtime with the -B (bind) Singularity option. Use singularity help for more information.
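As an illustration of runtime binds, the command below mounts a host directory at a chosen path inside the container (the host path here is hypothetical):

```shell
# Make /scratch/$USER/data on the host visible as /data inside the container
singularity shell -B /scratch/$USER/data:/data \
    /softs/singularity/containers/ai/ngc-tf2.3-fat.sif
```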

Move into the examples folder:

Singularity> cd dl-examples

Start an IPython session:

Singularity> ipython3
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 

From this environment you can run Python code interactively. Run the following to make a prediction using the previously trained model:

In [1]: run mnist_predict.py

After a few seconds, you should see the original picture and the algorithm's corresponding prediction (shown in the plot title). You can re-run the program with different sample numbers to check whether it manages to guess different digits.

Make sure to log out (Ctrl-D) from the container and from turing01, and clean up the node allocation with scancel.


If you run into problems, refer to the Troubleshooting page.