Queueing and Running Jobs
The job scheduler used on Lucia is Slurm Workload Manager version 23.02. A quick start guide explaining the basics is available on Slurm's official website, along with the full documentation. If you are a former PBS user, you may also want to check out the Rosetta Stone of job schedulers, a conversion table of commands, variables, etc. between various job schedulers.
Partitions
The available nodes are grouped into partitions (also sometimes called queues), usually according to the type of resources they provide and their intended usage. Each partition has its own limits and preferred type of usage; see the table below.
About resource usage
As you can see in the table below, the GPU nodes have only 32 CPU cores and 240GB of memory available for 4 GPUs. To maximize the use of the GPUs on Lucia, please do not use more than 8 CPU cores and 60GB of memory per GPU (see the example after the table).
As a general rule, it is also recommended not to exceed the optimal amount of memory per CPU so as not to waste computing resources.
| Partition | Job type | Num nodes | CPUs/node | GPUs/node | Available Mem/node | Optimal Mem/CPU | Shared |
|---|---|---|---|---|---|---|---|
| batch | MPI/SMP | 260 | 128 | - | 240GB | 1920MB | NO (ExclusiveUser) |
| medium | MPI/SMP | 30 | 128 | - | 492GB | 3936MB | NO (ExclusiveUser) |
| shared | Serial/SMP | 10 | 128 | - | 492GB | 3936MB | YES |
| large | SMP | 7 | 64 | - | 2000GB | 32000MB | YES |
| xlarge | SMP | 1 | 64 | - | 4000GB | 64000MB | YES |
| gpu | GPU | 50 | 32 | 4 x A100 40GB | 240GB | 7680MB | YES |
| ia | GPU | 2 | 64 | 8 x A100 80GB | 2000GB | 32000MB | YES |
| visu | Visualization | 4 | 32 | 4 x T4 16GB | 492GB | 15744MB | YES |
| debug | Debugging (CPU) | 10 | 128 | - | 240GB | 1920MB | YES |
| debug-gpu | Debugging (GPU) | 2 | 32 | 4 x A100 40GB | 240GB | 7680MB | YES |
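For example, a job using a single GPU on the gpu partition should not request more than the following (a minimal sketch showing only the resource-related directives):
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=1
#SBATCH --mem=60G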
QoS
We also use QoS (Quality of Service) on top of partitions to set additional parameters or constraints; see the table below for the default (in bold) and available QoS of each partition. The actual limits can also be displayed with the following command:
sacctmgr show qos format=Name,Priority,MaxTRESPU%16,MaxJobsPU,MaxSubmitPU,MaxTRESPA,MaxJobsPA,MaxSubmitPA,MinTRES,MaxTRES%32,MaxWall,Flags
| Partition | QoS | Max walltime | Job resource limits | Account resource limits | User resource limits |
|---|---|---|---|---|---|
| batch & medium | **normal** | 48h | Max 128 nodes | - | Max 2000 queued jobs |
| batch & medium | long | 168h | Max 4 nodes | Max 2048 CPU | Max 512 CPU, max 4 nodes, max 2000 queued jobs |
| shared | **shared** | 168h | Max 1 node | - | Max 500 queued jobs |
| large | **large** | 168h | Min 490GB, max 4 nodes | - | Max 4 nodes, max 16 running jobs, max 200 queued jobs |
| xlarge | **xlarge** | 168h | Min 1000GB, max 1 node | - | Max 1 node, max 4 running jobs, max 200 queued jobs |
| gpu | **gpu** | 48h | Min 1 GPU, max 16 nodes | - | - |
| ia | **ia** | 48h | Min 1 GPU | - | - |
| visu | **visu** | 4h | Min 1 GPU, max 1 GPU, max 8 CPU, max 123GB | - | Max 1 job |
| debug | **debug** | 2h | Max 4 nodes | - | Max 4 nodes, max 4 running jobs, max 20 queued jobs |
| debug-gpu | **debug-gpu** | 2h | Max 2 nodes | - | Max 1 running job, max 10 queued jobs |
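The default QoS of a partition is applied automatically. A non-default QoS, such as long on the batch partition, is requested in a job script with the --qos directive, for example:
#SBATCH --partition=batch
#SBATCH --qos=long
#SBATCH --time=96:00:00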
Fairshare
Fairshare allows projects and users to get a fair portion of the system based on their past resource usage. Shares on Lucia are established using the Fair Tree algorithm and are distributed equally between projects of the same category. The shares of the categories and subcategories are as follows:
- Category 1 (85%): non-economic activities, divided into 2 subcategories:
- Category 1a (70%): Universities and colleges
- Category 1b (15%): Accredited research centers
- Category 2 (15%): economic activities, divided into 3 subcategories:
- Category 2a (5%): Universities and colleges
- Category 2b (5%): Accredited research centers
- Category 2c (5%): Companies and industry
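The shares and the recent usage taken into account by the scheduler can be inspected with Slurm's sshare command, for example (replace my_project_name with one of your project accounts):
sshare -l -A my_project_name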
Submitting and controlling jobs
- sbatch: to submit batch scripts
- srun: to initiate parallel job steps within a job, and also to start an interactive job
- salloc: to request an interactive allocation, and then use srun to execute parallel tasks on the allocated resources
- scancel: to cancel a job
- squeue: to view queued jobs
- scontrol: to view various information about Slurm, e.g. job information with scontrol show job <jobid>
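Typical invocations look as follows (job_script.sh and the job ID 123456 are placeholders):
sbatch job_script.sh        # submit a batch script; Slurm returns the job ID
squeue --me                 # list your own pending and running jobs
scontrol show job 123456    # display detailed information about a job
scancel 123456              # cancel the job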
Job examples
Single-threaded
Serial job with 1200GB of memory per core, running for 4 days and 12 hours, on the large partition:
#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=large
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1200G
#SBATCH --time=4-12:00:00
#SBATCH --account=my_project_name
echo "----------------- Environment ------------------"
module purge
module load foss/2022a
module list
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
./runner.serial
echo -n "This run completed on: "
date
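Assuming the script above is saved as serial_job.sh (the filename is arbitrary), it is submitted with:
sbatch serial_job.sh
The %j_%x pattern of the --output directive then names the output file after the job ID and the job name, e.g. 123456_serial_job.out.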
Multi-threaded
SMP/OpenMP job with 64 threads and a total of 60GB memory, running for 12 hours on the batch partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=openmp_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --mem=60G
#SBATCH --cpus-per-task=64
#SBATCH --time=12:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load foss/2022a
module list
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
./runner.omp
echo -n "This run completed on: "
date
Parallel
Pure MPI
MPI job with 1024 tasks (8 full nodes) and 1920MB of memory per core, running for 24 hours on the batch partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=mpi_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1024
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=24:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load PrgEnv-cray
module list
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner.mpi
echo -n "This run completed on: "
date
Hybrid MPI/OpenMP
Hybrid job with multiple OpenMP threads per MPI process: 256 MPI tasks with 8 threads each (2048 cores in total, i.e. 16 full nodes), running for 12 hours on the batch partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=hybrid_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=256
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=12:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load PrgEnv-cray
module list
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner.hybrid
echo -n "This run completed on: "
date
GPU
GPU job using a full GPU node (4 GPUs with one task per GPU and 240GB of memory), running for 10 hours on the gpu partition:
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=gpu_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=240G
#SBATCH --gpus=4
#SBATCH --time=10:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load CUDA/11.7.0
module list
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "number of gpus : $SLURM_GPUS_ON_NODE"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner.cuda
echo -n "This run completed on: "
date
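To verify that the allocated GPUs are visible from within the job, a line such as srun nvidia-smi can be added before the actual run (assuming the standard NVIDIA tools are installed on the GPU nodes, which is normally the case).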
Interactive
To work interactively on compute nodes, first request an allocation with salloc, for example 2 batch nodes with 256 tasks for 2 hours:
salloc -p batch -A my_project_name -N 2 -n 256 --mem=240G -t 2:00:00
# and once the resources are allocated use srun the same way as in submission scripts:
srun ./runner.mpi
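For short interactive tests it is also possible to skip salloc and ask srun directly for a shell on a compute node, for example on the debug partition (a minimal sketch; adjust partition, account and resources to your needs):
srun -p debug -A my_project_name -n 1 -c 4 -t 1:00:00 --pty bash -i
# exiting the shell releases the allocation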
Job Arrays
Running many similar jobs with small variations (e.g. different input files or conditions):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=array_job
#SBATCH --output=%A-%a_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1:00:00
#SBATCH --array=0-19
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
echo "----------------- Environment ------------------"
module purge
module load foss/2022a
module list
# ------------------------------------------------------------------------------
# Printing some information
# ------------------------------------------------------------------------------
echo "------------------- Job info -------------------"
echo "job_id : $SLURM_JOB_ID"
echo "jobname : $SLURM_JOB_NAME"
echo "queue : $SLURM_JOB_PARTITION"
echo "qos : $SLURM_JOB_QOS"
echo "account : $SLURM_JOB_ACCOUNT"
echo "submit dir : $SLURM_SUBMIT_DIR"
echo "number of mpi tasks: $SLURM_NTASKS tasks"
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"
echo "Executable : $EXEC"
echo "------------------- Node list ------------------"
echo $SLURM_JOB_NODELIST
echo "---------------- Checking limits ---------------"
ulimit -a
# ------------------------------------------------------------------------------
# And finally running the code
# ------------------------------------------------------------------------------
echo "--------------- Running the code ---------------"
echo -n "This run started on: "
date
srun ./runner $SLURM_ARRAY_TASK_ID
echo -n "This run completed on: "
date
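If each array task is resource-hungry, the number of tasks allowed to run simultaneously can be limited with the % separator of the --array directive, e.g. #SBATCH --array=0-19%4 to keep at most 4 tasks running at a time.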
Packed
Running (many) independent processes inside a single job, as in the sketch below.
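No ready-made example is provided here yet; the script below is a minimal sketch of one common pattern, in which each independent process is launched as its own job step with srun --exact in the background and wait blocks until all steps have finished (./runner is a hypothetical executable taking an index as argument):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=packed_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=2:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
module purge
module load foss/2022a
module list
# ------------------------------------------------------------------------------
# Launching the independent processes
# ------------------------------------------------------------------------------
# One single-core job step per process, each started in the background;
# --exact restricts every step to exactly the resources it requests so that
# the 128 steps run side by side within the allocation.
for i in $(seq 0 127); do
    srun --exact -n 1 -c 1 --mem-per-cpu=1920M ./runner $i &
done
# Wait for all background job steps to finish before the job ends.
wait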
Heterogeneous
Requesting heterogeneous resources within the same job (e.g. 1 CPU with 100GB of memory + 64 CPUs with 2GB each), as in the sketch below.
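There is no complete example here yet either; below is a minimal sketch based on Slurm heterogeneous jobs, where the components of the request are separated by an #SBATCH hetjob directive and job steps are placed on a given component with srun --het-group (./big_mem_task and ./runner.mpi are hypothetical executables):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives: component 0 (1 CPU with 100GB), component 1 (64 CPUs with 2GB each)
# ------------------------------------------------------------------------------
#SBATCH --job-name=het_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --mem=100G
#SBATCH --time=6:00:00
#SBATCH --account=my_project_name
#SBATCH hetjob
#SBATCH --partition=batch
#SBATCH --ntasks=64
#SBATCH --mem-per-cpu=2G
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
module purge
module load PrgEnv-cray
module list
# ------------------------------------------------------------------------------
# Running one job step per component, side by side
# ------------------------------------------------------------------------------
srun --het-group=0 ./big_mem_task &
srun --het-group=1 ./runner.mpi &
wait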
Co-simulations
Running different programs together in the same job, as in the sketch below.
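Again, no complete example is provided yet; one possible approach (a minimal sketch) is srun's --multi-prog mode, which starts a different executable for each range of MPI ranks according to a plain-text configuration file (cosim.conf, ./solver_a and ./solver_b are hypothetical names):
#!/bin/bash
# ------------------------------------------------------------------------------
# Slurm directives
# ------------------------------------------------------------------------------
#SBATCH --job-name=cosim_job
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch
#SBATCH --ntasks=128
#SBATCH --mem-per-cpu=1920M
#SBATCH --time=12:00:00
#SBATCH --account=my_project_name
# ------------------------------------------------------------------------------
# Setting up the environment
# ------------------------------------------------------------------------------
module purge
module load PrgEnv-cray
module list
# ------------------------------------------------------------------------------
# Running the coupled programs: rank 0 runs solver_a, ranks 1-127 run solver_b
# ------------------------------------------------------------------------------
cat > cosim.conf << 'EOF'
0      ./solver_a
1-127  ./solver_b
EOF
srun --multi-prog cosim.conf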