Partitions
The available nodes are grouped into partitions (also sometimes called queues), usually according to the type of resource made available and the intended usage. Each partition has its own limits and preferred type of usage; see the table below.
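On Slurm-based systems such as Lucia, the partition configuration can also be inspected directly from the command line; the commands below are standard Slurm, not specific to Lucia:

```bash
# Summarize all partitions: node counts, states and time limits
sinfo --summarize

# Show the full configuration of a single partition (e.g. batch)
scontrol show partition batch
```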
About resource usage
As you can see in the table below, the GPU nodes have only 32 CPU cores and 240GB of memory available for 4 GPUs. To maximize the use of the GPUs on Lucia, please do not use more than 8 CPU cores and 60GB of memory per GPU (see the example job script after the table).
As a general rule, it is also recommended to avoid exceeding the optimal amount of memory per CPU whenever possible, so as not to waste computing resources.
| Partition | Job type | Num nodes | CPUs/node | GPUs/node | Available Mem/node | Optimal Mem/CPU | Shared |
|---|---|---|---|---|---|---|---|
| batch | MPI/SMP | 260 | 128 | - | 240GB | 1920MB | NO (ExclusiveUser) |
| medium | MPI/SMP | 30 | 128 | - | 492GB | 3936MB | NO (ExclusiveUser) |
| shared | Serial/SMP | 10 | 128 | - | 492GB | 3936MB | YES |
| large | SMP | 7 | 64 | - | 2000GB | 32000MB | YES |
| xlarge | SMP | 1 | 64 | - | 4000GB | 64000MB | YES |
| gpu | GPU | 50 | 32 | 4 x A100 40GB | 240GB | 7680MB | YES |
| ia | GPU | 2 | 64 | 8 x A100 80GB | 2000GB | 32000MB | YES |
| visu | Visualization | 4 | 32 | 4 x T4 16GB | 492GB | 15744MB | YES |
| debug | Debugging (CPU) | 10 | 128 | - | 240GB | 1920MB | YES |
| debug-gpu | Debugging (GPU) | 2 | 32 | 4 x A100 40GB | 240GB | 7680MB | YES |
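For example, here is a minimal sketch of a Slurm submission script for the gpu partition that respects this ratio for a single GPU; the account and executable names are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1         # 1 of the 4 A100 40GB GPUs on the node
#SBATCH --cpus-per-task=8    # at most 8 CPU cores per GPU
#SBATCH --mem=60G            # at most 60GB of memory per GPU
#SBATCH --time=01:00:00
#SBATCH --account=my_project # placeholder: replace with your project account

srun ./my_gpu_program        # placeholder executable
```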
QoS
We also use QoS (Quality of Service) on top of partitions to set additional parameters or constraints; see the table below for the default (in bold) and other available QoS for each partition. The actual limits can also be displayed with the following command:
```bash
sacctmgr show qos format=Name,Priority,MaxTRESPU%16,MaxJobsPU,MaxSubmitPU,MaxTRESPA,MaxJobsPA,MaxSubmitPA,MinTRES,MaxTRES%32,MaxWall,Flags
```
| Partition | QoS | Max walltime | Job resource limits | Account resource limits | User resource limits |
|---|---|---|---|---|---|
| batch & medium | **normal** | 48h | Max 128 nodes | - | Max 2000 queued jobs |
| | long | 168h | Max 4 nodes | Max 2048 CPU | Max 512 CPU, max 4 nodes, max 2000 queued jobs |
| shared | **shared** | 168h | Max 1 node | - | Max 500 queued jobs |
| large | **large** | 168h | Min 490GB, max 4 nodes | - | Max 4 nodes, max 16 running jobs, max 200 queued jobs |
| xlarge | **xlarge** | 168h | Min 1000GB, max 1 node | - | Max 1 node, max 4 running jobs, max 200 queued jobs |
| gpu | **gpu** | 48h | Min 1 GPU, max 16 nodes | - | - |
| ia | **ia** | 48h | Min 1 GPU | - | - |
| visu | **visu** | 4h | Min 1 GPU, max 1 GPU, max 8 CPU, max 123GB | - | Max 1 job |
| debug | **debug** | 2h | Max 4 nodes | - | Max 4 nodes, max 4 running jobs, max 20 queued jobs |
| debug-gpu | **debug-gpu** | 2h | Max 2 nodes | - | Max 1 running job, max 10 queued jobs |
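A non-default QoS is selected at submission time with the --qos option; for example, a minimal sketch requesting the long QoS on the batch partition (the executable name is a placeholder):

```bash
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --qos=long           # non-default QoS: up to 168h, max 4 nodes
#SBATCH --nodes=2
#SBATCH --time=96:00:00      # must stay within the 168h walltime limit

srun ./my_mpi_program        # placeholder executable
```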
Fairshare
Fairshare allows projects and users to get a fair portion of the system based on their past resource usage. Shares on Lucia are established using the Fair Tree algorithm, and shares are distributed equally between projects of the same category. The category and subcategory shares are as follows:
- Category 1 (85%): non-economic activities, divided into 2 subcategories:
    - Category 1a (70%): Universities and colleges
    - Category 1b (15%): Accredited research centers
- Category 2 (15%): economic activities, divided into 3 subcategories:
    - Category 2a (5%): Universities and colleges
    - Category 2b (5%): Accredited research centers
    - Category 2c (5%): Companies and industry
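Your project's share and effective usage can be inspected with the standard Slurm sshare command; a couple of examples:

```bash
# Show fairshare information for your own associations
sshare

# Long format for the whole association tree, including effective usage
sshare -l -a
```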