Queueing and Running Jobs
The job scheduler used on Lucia is Slurm Workload Manager version 23.02. A quick start guide explaining the basics is available on Slurm's official website, along with the full documentation. If you are a former PBS user, you may also want to check out the Rosetta Stone of job schedulers, a conversion table of commands, variables, etc. between various job schedulers.
Partitions
The available nodes are grouped into partitions (also sometimes called queues), usually according to the type of resources they provide and their intended usage. Each partition has its own limits and preferred type of usage; see the table below.
About resource usage
As you can see in the table below, the GPU nodes have only 32 CPU cores and 240GB of memory available for 4 GPUs. To maximize the use of the GPUs on Lucia, please do not use more than 8 CPU cores and 60GB of memory per GPU (see the example after the table).
As a general rule, it is also recommended not to exceed the optimal amount of memory per CPU so as not to waste computing resources.
| Partition | Job type | Num nodes | CPUs/node | GPUs/node | Available Mem/node | Optimal Mem/CPU | Shared |
|---|---|---|---|---|---|---|---|
| batch | MPI/SMP | 260 | 128 | - | 240GB | 1920MB | NO (ExclusiveUser) |
| medium | MPI/SMP | 30 | 128 | - | 492GB | 3936MB | NO (ExclusiveUser) |
| shared | Serial/SMP | 10 | 128 | - | 492GB | 3936MB | YES |
| large | SMP | 7 | 64 | - | 2000GB | 32000MB | YES |
| xlarge | SMP | 1 | 64 | - | 4000GB | 64000MB | YES |
| gpu | GPU | 50 | 32 | 4 x A100 40GB | 240GB | 7680MB | YES |
| ia | GPU | 2 | 64 | 8 x A100 80GB | 2000GB | 32000MB | YES |
| visu | Visualization | 4 | 32 | 4 x T4 16GB | 492GB | 15744MB | YES |
| debug | Debugging (CPU) | 10 | 128 | - | 240GB | 1920MB | YES |
| debug-gpu | Debugging (GPU) | 2 | 32 | 4 x A100 40GB | 240GB | 7680MB | YES |
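For example, a job using a single GPU on the gpu partition should not request more than the following (a minimal sketch showing only the resource-related directives):
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=1
#SBATCH --mem=60G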
QoS
We also use QoS (Quality of Service) on top of partitions to set additional parameters or constraints; see the table below for the default (in bold) and available QoS of each partition. The actual limits can also be displayed with the following command:
sacctmgr show qos format=Name,Priority,MaxTRESPU%16,MaxJobsPU,MaxSubmitPU,MaxTRESPA,MaxJobsPA,MaxSubmitPA,MinTRES,MaxTRES%32,MaxWall,Flags
| Partition | QoS | Max walltime | Job resource limits | Account resource limits | User resource limits |
|---|---|---|---|---|---|
| batch & medium | **normal** | 48h | Max 128 nodes | - | Max 2000 queued jobs |
| batch & medium | long | 168h | Max 4 nodes | Max 2048 CPU | Max 512 CPU, max 4 nodes, max 2000 queued jobs |
| shared | **shared** | 168h | Max 1 node | - | Max 500 queued jobs |
| large | **large** | 168h | Min 490GB, max 4 nodes | - | Max 4 nodes, max 16 running jobs, max 200 queued jobs |
| xlarge | **xlarge** | 168h | Min 1000GB, max 1 node | - | Max 1 node, max 4 running jobs, max 200 queued jobs |
| gpu | **gpu** | 48h | Min 1 GPU, max 16 nodes | - | - |
| ia | **ia** | 48h | Min 1 GPU | - | - |
| visu | **visu** | 4h | Min 1 GPU, max 1 GPU, max 8 CPU, max 123GB | - | Max 1 job |
| debug | **debug** | 2h | Max 4 nodes | - | Max 4 nodes, max 4 running jobs, max 20 queued jobs |
| debug-gpu | **debug-gpu** | 2h | Max 2 nodes | - | Max 1 running job, max 10 queued jobs |
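The default QoS of a partition is applied automatically. A non-default QoS, such as long on the batch partition, is requested in a job script with the --qos directive, for example:
#SBATCH --partition=batch
#SBATCH --qos=long
#SBATCH --time=96:00:00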
Fairshare
Fairshare allows projects and users to get a fair portion of the system based on their past resource usage. Shares on Lucia are established using the Fair Tree algorithm and are distributed equally between projects of the same category. The shares of the categories and subcategories are as follows:
- Category 1 (85%): non-economic activities, divided into 2 subcategories:
- Category 1a (70%): Universities and colleges
- Category 1b (15%): Accredited research centers
- Category 2 (15%): economic activities, divided into 3 subcategories:
- Category 2a (5%): Universities and colleges
- Category 2b (5%): Accredited research centers
- Category 2c (5%): Companies and industry
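The shares and the recent usage taken into account by the scheduler can be inspected with Slurm's sshare command, for example (replace my_project_name with one of your project accounts):
sshare -l -A my_project_name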
Submitting and controlling jobs
- sbatch: to submit batch scripts
- srun: to initiate parallel job steps within a job, and also to start an interactive job
- salloc: to request an interactive allocation, and then use srun to execute parallel tasks on the allocated resources
- scancel: to cancel a job
- squeue: to view queued jobs
- scontrol: to view various information about Slurm, e.g. job information with scontrol show job <jobid>
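Typical invocations look as follows (job_script.sh and the job ID 123456 are placeholders):
sbatch job_script.sh        # submit a batch script; Slurm returns the job ID
squeue --me                 # list your own pending and running jobs
scontrol show job 123456    # display detailed information about a job
scancel 123456              # cancel the job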
Job examples
Single-threaded
Serial job with 1200GB of memory per core, running for 4 days and 12 hours, on the large partition:
#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=large
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1200G
#SBATCH --time=4-12:00:00
#SBATCH --account=my_project_name
echo "----------------- Environment ------------------"
module purge
module load foss/2022a
module list
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
./runner.serial
echo -n "This run completed on: "
date
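Assuming the script above is saved as serial_job.sh (the filename is arbitrary), it is submitted with:
sbatch serial_job.sh
The %j_%x pattern of the --output directive then names the output file after the job ID and the job name, e.g. 123456_serial_job.out.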
Multi-threaded
SMP/OpenMP job with 64 threads and a total of 60GB memory, running for 12 hours on the batch partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=openmp_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --mem=60G
#SBATCH --cpus-per-task=64
#SBATCH --time=12:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load foss/2022a
module list
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
./runner.omp
echo -n "This run completed on: "
date
Parallel
Pure MPI
MPI job with 1024 tasks (8 full nodes) and 1920MB of memory per core, running for 24 hours on the batch partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=mpi_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1024
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=24:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load PrgEnv-cray
module list
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner.mpi
echo -n "This run completed on: "
date
Hybrid MPI/OpenMP
Hybrid job with multiple OpenMP threads per MPI process: 256 MPI tasks with 8 threads each (2048 cores in total, i.e. 16 full nodes), running for 12 hours on the batch partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=hybrid_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=256
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=12:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load PrgEnv-cray
module list
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner.hybrid
echo -n "This run completed on: "
date
GPU
GPU job using a full GPU node (4 GPUs with one task per GPU and 240GB of memory), running for 10 hours on the gpu partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=gpu_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=240G
#SBATCH --gpus=4
#SBATCH --time=10:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load CUDA/11.7.0
module list
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "number of gpus : $SLURM_GPUS_ON_NODE"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner.cuda
echo -n "This run completed on: "
date
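To verify that the allocated GPUs are visible from within the job, a line such as srun nvidia-smi can be added before the actual run (assuming the standard NVIDIA tools are installed on the GPU nodes, which is normally the case).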
Interactive
To work interactively on compute nodes, first request an allocation with salloc, for example 2 batch nodes with 256 tasks for 2 hours:
salloc -p batch -A my_project_name -N 2 -n 256 --mem=240G -t 2:00:00
# and once the resources are allocated use srun the same way as in submission scripts:
srun ./runner.mpi
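For short interactive tests it is also possible to skip salloc and ask srun directly for a shell on a compute node, for example on the debug partition (a minimal sketch; adjust partition, account and resources to your needs):
srun -p debug -A my_project_name -n 1 -c 4 -t 1:00:00 --pty bash -i
# exiting the shell releases the allocation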
Job Arrays
Running many similar jobs with small variations (e.g. different input files or conditions):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=array_job
#SBATCH --output=%A-%a_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1:00:00
#SBATCH --array=0-19
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load foss/2022a
module list
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner $SLURM_ARRAY_TASK_ID
echo -n "This run completed on: "
date
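If each array task is resource-hungry, the number of tasks allowed to run simultaneously can be limited with the % separator of the --array directive, e.g. #SBATCH --array=0-19%4 to keep at most 4 tasks running at a time.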
Packed
Running (many) independent processes inside a single job, as in the sketch below.
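No ready-made example is provided here yet; the script below is a minimal sketch of one common pattern, in which each independent process is launched as its own job step with srun --exact in the background and wait blocks until all steps have finished (./runner is a hypothetical executable taking an index as argument):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=packed_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=2:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
module purge
module load foss/2022a
module list
# ------------------------------------------------------------------------------
# Launching the independent processes
# ------------------------------------------------------------------------------
# One single-core job step per process, each started in the background;
# --exact restricts every step to exactly the resources it requests so that
# the 128 steps run side by side within the allocation.
for i in $(seq 0 127); do
    srun --exact -n 1 -c 1 --mem-per-cpu=1920M ./runner $i &
done
# Wait for all background job steps to finish before the job ends.
wait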
Heterogeneous
Requesting heterogeneous resources within the same job (e.g. 1 CPU with 100GB of memory + 64 CPUs with 2GB each), as in the sketch below.
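There is no complete example here yet either; below is a minimal sketch based on Slurm heterogeneous jobs, where the components of the request are separated by an #SBATCH hetjob directive and job steps are placed on a given component with srun --het-group (./big_mem_task and ./runner.mpi are hypothetical executables):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives: component 0 (1 CPU with 100GB), component 1 (64 CPUs with 2GB each)
# ------------------------------------------------------------------------------
#SBATCH --job-name=het_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --mem=100G
#SBATCH --time=6:00:00
#SBATCH --account=my_project_name
#SBATCH hetjob
#SBATCH --partition=batch
#SBATCH --ntasks=64
#SBATCH --mem-per-cpu=2G
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
module purge
module load PrgEnv-cray
module list
# ------------------------------------------------------------------------------
# Running one job step per component, side by side
# ------------------------------------------------------------------------------
srun --het-group=0 ./big_mem_task &
srun --het-group=1 ./runner.mpi &
wait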
Co-simulations
Running different programs together in the same job, as in the sketch below.
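Again, no complete example is provided yet; one possible approach (a minimal sketch) is srun's --multi-prog mode, which starts a different executable for each range of MPI ranks according to a plain-text configuration file (cosim.conf, ./solver_a and ./solver_b are hypothetical names):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=cosim_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=128
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=12:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
module purge
module load PrgEnv-cray
module list
# ------------------------------------------------------------------------------
# Running the coupled programs: rank 0 runs solver_a, ranks 1-127 run solver_b
# ------------------------------------------------------------------------------
cat > cosim.conf << 'EOF'
0      ./solver_a
1-127  ./solver_b
EOF
srun --multi-prog cosim.conf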