Troubleshooting

Reference for common and known issues. You can give us feedback in three ways:

- Minor issue or request for help: use the Slack #help channel
- Moderate issue, or an annoying one that can wait: use the Slack #troubleshooting channel
- Major issue or blocking situation: open a ticket by emailing svp-hpc@ec-nantes.fr

OPEN AND KNOWN ISSUES

1 - Unexpected dependency error inside a pre-built or custom container

There is currently a problem caused by conflicts between the local environment and the container environment. If you see errors such as "package not found" or "library not found", the container may be picking up libraries from your local environment, or from Liger, rather than using the ones inside the container. Issue #25

Make sure your environment is clean before running containers:

- no modules are loaded
- no conflicting environment variables are set, e.g. an LD_LIBRARY_PATH entry pointing to a Python library
- no local packages are bound to the container (through singularity -B)
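The first two checks above can be sketched as a small shell helper. This is only an illustration, not an official tool: the function name `check_container_env` is hypothetical, `LOADEDMODULES` is the variable that Environment Modules and Lmod maintain for loaded modules, and `PYTHONPATH` is included as an assumption about what else might conflict on your account.

```shell
# Hypothetical helper: report settings in the current shell that could
# leak into a container. Returns the number of problems found.
check_container_env() {
    problems=0
    # Environment Modules and Lmod list loaded modules in LOADEDMODULES.
    if [ -n "${LOADEDMODULES:-}" ]; then
        echo "modules loaded: ${LOADEDMODULES} (run 'module purge')"
        problems=$((problems + 1))
    fi
    # Variables that can redirect library or package lookups to the host.
    # PYTHONPATH is an assumption; extend the list to suit your setup.
    for var in LD_LIBRARY_PATH PYTHONPATH; do
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            echo "$var is set: $value (run 'unset $var')"
            problems=$((problems + 1))
        fi
    done
    if [ "$problems" -eq 0 ]; then
        echo "environment looks clean"
    fi
    return "$problems"
}

# Possible usage (image.sif and script.py are placeholder names):
#   check_container_env && singularity exec --cleanenv image.sif python script.py
```

The `--cleanenv` option of Singularity additionally clears the host environment before the container starts, which may help if a variable cannot easily be unset. Bind mounts passed with `-B` still have to be reviewed by hand.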

2 - Quick start program not running

The Quick Start program should work without modification. If you run into a problem, refer to issue 1 above.

3 - Major BUG when submitting more than 2 jobs on a GPU node

There is currently a problem when submitting more than two jobs on a GPU node. You might see the following error when using sbatch or srun:

slurmstepd: write(/dev/cpuset/slurm1732538/slurm1732538.0_1/cpuset.cpus): Invalid argument
slurmstepd: read(/dev/cpuset/slurm1732538/slurm1732538.0_1/tasks): Invalid argument
slurmstepd: Failed task affinity setup
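To check whether a failed job hit this particular error, you can search its output files for the last line of the message. The sketch below is only an illustration: the function names are hypothetical, and it assumes the job output was written to files such as the default `slurm-<job_id>.out` produced by sbatch.

```shell
# Hypothetical helper: succeed if the given Slurm output file contains
# the task affinity failure shown above.
has_affinity_failure() {
    grep -q "slurmstepd: Failed task affinity setup" "$1"
}

# Print one line for each given output file that contains the failure.
report_affinity_failures() {
    for log in "$@"; do
        if has_affinity_failure "$log"; then
            echo "$log: task affinity failure (known issue 3)"
        fi
    done
}

# Possible usage, from the directory holding the job output files:
#   report_affinity_failures slurm-*.out
```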

If this happens, you can use salloc instead, SSH into turing, and run the job interactively from inside the node. Make sure to manually select resources that are actually available, as others might be using the cluster.

The underlying problem has been found and will be addressed as soon as possible; this is meant as a temporary workaround. Thanks to @Mickael.Tardy for the suggestion.

NOTE: After using this workaround, be sure to cancel your allocation properly once you have logged out of the SSH session, by using:

scancel --signal=KILL <job_id>
exit

~~Wait a minute before checking whether the job is cancelled. You should see your job_id in the result. sacct -X | tail -1~~

Now solved by restarting Slurm. Please notify us if you see this problem again and we will fix it immediately. It should be permanently solved by the next OS and Slurm upgrade.