Quick start#
Reference repository: liger-ai-tools
Pre-requisites#
- Have a user account on Liger
- General knowledge of Linux shell commands is recommended
- Familiarity with Slurm is recommended
MNIST: handwritten digit classification#
The repository liger-ai-tools contains sample Python code that implements a classifier for the MNIST dataset: a classic AI/computer vision task that consists of recognising handwritten digits. The MNIST dataset is relatively simple and is therefore often used as a benchmark to test deep learning models and environments.
The programs implementing the MNIST task use TensorFlow and Keras for neural network operations and PyLab (NumPy, Matplotlib) for data manipulation and visualisation. The relevant files can be found in the dl-examples directory:
- mnist.npz: MNIST dataset containing 70,000 28x28-pixel images of handwritten digits.
- mnist_train.py: data loading and processing, plus model creation and training.
- mnist_predict.py: uses the generated model to predict digits on unseen data.
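As a rough illustration of the kind of preprocessing mnist_train.py performs before training (the exact code lives in the repository; the function and variable names below are illustrative, not the actual script), MNIST pixel values are typically normalised to [0, 1] and the labels one-hot encoded:

```python
import numpy as np

def preprocess(images, labels, num_classes=10):
    """Normalise pixel values to [0, 1] and one-hot encode labels."""
    x = images.astype(np.float32) / 255.0               # 28x28 uint8 -> float in [0, 1]
    y = np.eye(num_classes, dtype=np.float32)[labels]   # label k -> unit vector with 1 at index k
    return x, y

# Tiny synthetic stand-in for the contents of mnist.npz:
images = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)
labels = np.array([3, 1, 4, 1])

x, y = preprocess(images, labels)
print(x.shape, y.shape)  # (4, 28, 28) (4, 10)
```

The one-hot labels match the 10-way softmax output layer that classifiers for this task conventionally use.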
These programs and the related processes can be used as a reference to implement your own DL algorithms in the Liger environment.
The TensorFlow + PyLab environment is provided by a pre-built container already present on Liger at /softs/singularity/containers/ai/ngc-tf2.3-fat_latest.sif.
Training the classifier#
SSH into Liger with visualisation enabled:
localhost:~$ ssh -X <LIGER-UID>@liger
Clone the liger-ai-tools repository using your credentials:
login02:~$ git clone https://<GIT-USERNAME>@gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools.git
Password:
Move into the repository:
login02:~$ cd liger-ai-tools
Run the training via the submission script*:
login02:~$ sbatch <options> exec.sl
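The actual submission script exec.sl is provided in the repository; purely as an illustration of its likely shape (the directive values below are assumptions, not the real file), a Slurm script for this workflow could look like:

```shell
#!/bin/bash
#SBATCH --job-name=mnist-train   # job name shown by squeue
#SBATCH --partition=gpus         # GPU partition (assumed name)
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --output=sjob.txt        # log file followed later with tail -f

module load singularity
# Run the training script inside the pre-built TensorFlow container,
# bind-mounting the repository into /app:
singularity exec --nv -B ./:/app \
    /softs/singularity/containers/ai/ngc-tf2.3-fat_latest.sif \
    python3 /app/dl-examples/mnist_train.py
```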
If any GPU is available, the job will be submitted to one of the nodes. TensorFlow binds to one of the GPUs, which performs the model training on the MNIST dataset. At the end of the training, the model is saved in the dl-examples folder for later use.
Follow the script execution with the following command:
login02:~$ tail -f sjob.txt
Alternatively, shortly after the job submission, you can run the NVIDIA monitoring tool to see the GPU utilisation:
login02:~$ Mysqueue
235211 gpus drovelli R 0:06 1 1 normal 904999 turing02 interactive
login02:~$ srun -p gpus -w turing02 nvidia-smi -l
#output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:21:00.0 Off | 0 |
| N/A 32C P0 49W / 250W | 39007MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-PCIE-40GB On | 00000000:81:00.0 Off | 0 |
| N/A 30C P0 35W / 250W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 58589 C python3 39005MiB |
+-----------------------------------------------------------------------------+
* Different nodes might require different submission options (e.g. turing01).
Making predictions#
Once the classifier is trained, the mnist_predict.py program can use the final output model to make a guess on a new handwritten digit (a sample from the test set that was not used for training).
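At its core, such a prediction step reduces to taking the argmax of the model's per-class scores. A minimal NumPy sketch (the names below are illustrative, not the actual mnist_predict.py code):

```python
import numpy as np

def predict_digit(scores):
    """Return the predicted digit: the index of the highest of 10 class scores."""
    return int(np.argmax(scores))

# Example: softmax-like output where class 7 has the highest score.
scores = np.array([0.01, 0.02, 0.01, 0.05, 0.03, 0.02, 0.01, 0.80, 0.03, 0.02])
print(predict_digit(scores))  # 7
```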
This time the program will be executed through an interactive session, as opposed to the sbatch job submission that was used for the training. The following steps walk through the execution of commands in an IPython shell session inside the TensorFlow container available on Liger.
Reserve a GPU server (turing02 for this example):
login02:~$ srun --pty -p gpus -w turing02 --gres=gpu:1 bash
SSH into the GPU server with visualisation enabled:
localhost:~$ ssh -X turing02
Move into the repository:
turing02:~$ cd liger-ai-tools
Start the TensorFlow plot-enabled container with Singularity:
turing02:~$ module load singularity
turing02:~$ singularity run --nv -B ./:/app /softs/singularity/containers/ai/ngc-tf2.3-fat_latest.sif
This command can be used to start a shell session in any kind of container. It is important to keep in mind that, when inside a container, all the tools that were built into the container at its creation (programs, data, etc.) are available even if they are not present in the host environment. In this case the container was pre-built by us, but any container can be further customised by the user to suit their needs.
Furthermore, host files and folders can be bind-mounted into the container at runtime with the -B Singularity option. Use singularity help for more information.
Move into the examples folder:
Singularity> cd dl-examples
Start an IPython session:
Singularity> ipython3
#output
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
From this environment you can run Python code interactively. Run the following to make a prediction using the previously created model:
In [1]: run mnist_predict.py
After a few seconds, you should see the original picture with the algorithm's prediction shown in the title. You can re-run the program with different sample numbers to check whether the program manages to guess different digits.
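The display step amounts to plotting the sample image with the prediction in the title. A minimal Matplotlib sketch of that idea (illustrative names and a synthetic image, not the actual mnist_predict.py code; the Agg backend is used here so the sketch also works headless):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display; drop this when running with ssh -X
import matplotlib.pyplot as plt
import numpy as np

image = np.zeros((28, 28))  # stand-in for a 28x28 MNIST test sample
prediction = 7              # hypothetical model output

fig, ax = plt.subplots()
ax.imshow(image, cmap="gray")
ax.set_title(f"Predicted digit: {prediction}")  # the prediction appears in the title
fig.savefig("prediction.png")
```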
Make sure to log out (Ctrl-D) from the container and the node, and to clean up the node allocation with scancel.
Troubleshooting#
Refer to the Troubleshooting page.