System Overview

The Lucia Supercomputer is a high-performance computing system designed for diverse computational workloads. Its architecture comprises the following key components:

  1. Compute Partition: A variety of nodes, including CPU, GPU, and specialized nodes, tailored for tasks such as memory-intensive computations, AI processing, and visualization.
  2. Storage Partition: A robust IBM Spectrum Scale parallel filesystem with approximately 3 PiB of storage, supported by an offsite backup system.
  3. Service and Management Partitions: Essential for system operations and management, these partitions are not directly accessible to end users.

All partitions are interconnected through an HDR InfiniBand network and a 10 Gb/s Ethernet network, ensuring high-speed communication.


Compute Partition

The compute infrastructure consists of 364 nodes, categorized into CPU nodes, GPU nodes, and specialized nodes to accommodate a wide range of computational needs.

Info

All compute nodes are available for computation through the Slurm batch scheduler.
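As an illustration, the following minimal Python sketch builds and submits a small batch job with `sbatch`. The partition name, account, and resource values are hypothetical placeholders, not Lucia-specific settings.

```python
"""Minimal sketch: submit a Slurm batch job from Python.

Assumes `sbatch` is on the PATH; the partition name and
resource values below are hypothetical placeholders."""
import subprocess
import tempfile

job_script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --partition=cpu          # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

srun hostname
"""

# Write the script to a temporary file and hand it to sbatch.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(job_script)
    script_path = f.name

result = subprocess.run(["sbatch", script_path],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())     # e.g. "Submitted batch job 123456"
```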

Compute Node Distribution

The compute nodes are distributed across the following types:

  • CPU : 300 nodes
    • Standard : 270 nodes
    • Medium : 30 nodes
  • GPU : 50 nodes
  • Specialized : 14 nodes
    • Large Memory : 7 nodes
    • XLarge Memory : 1 node
    • AI : 2 nodes
    • Visualization : 4 nodes

Theoretical Performance

The theoretical peak performance (Rpeak) of each compute node type is summarized in the bar chart below.

[Bar chart: Rpeak per node type]
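For reference, Rpeak for a CPU node type is the product of the node count, cores per node, clock frequency, and floating-point operations per cycle. The sketch below computes it with purely hypothetical hardware figures; Lucia's actual core counts and clock rates are not stated in this overview.

```python
def rpeak_tflops(nodes: int, cores_per_node: int,
                 clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TFLOP/s: nodes * cores * clock * FLOPs/cycle."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle / 1000.0

# Hypothetical example: 270 standard CPU nodes with 64 cores at 2.5 GHz and
# 32 double-precision FLOPs per cycle (e.g. two AVX-512 FMA units per core).
# These hardware figures are illustrative, not Lucia's actual specification.
print(f"{rpeak_tflops(270, 64, 2.5, 32):.1f} TFLOP/s")
```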


Storage Partition

The storage system, based on IBM Spectrum Scale (GPFS), offers a unified, tiered solution with 3 PiB of capacity. It includes:

  1. Flash Tier (200 TB): NVMe SSDs providing high-speed I/O; acts as a burst buffer.
  2. Standard Tier (2.87 PB): High-capacity storage with NL-SAS disks.

Logical partitioning is managed directly by IBM Spectrum Scale through "filesets", and data is migrated seamlessly between the two physical storage tiers.

Filesets

The storage is organized into four distinct spaces, each designed for specific use cases. The following table outlines the key attributes of each fileset:

| Fileset        | Capacity | Burst Buffer | Usage                            | Backup | Notes                     |
|----------------|----------|--------------|----------------------------------|--------|---------------------------|
| /gpfs/home     | 200 TB   | No           | User home directories            | Yes    | Quotas applied per user.  |
| /gpfs/projects | 1.5 PB   | No           | Shared project spaces            | Yes    | Quotas applied per group. |
| /gpfs/softs    | 50 TB    | No           | Pre-installed software libraries | Yes    | Centrally managed.        |
| /gpfs/scratch  | 1 PB     | Yes          | Temporary high-speed storage     | No     | Not backed up.            |
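As a quick way to see how full each fileset is, the sketch below reports the disk usage of the four mount points using Python's standard library. It only assumes the paths above are mounted on the node where it runs; per-user and per-group quota values are reported by Spectrum Scale's own tooling, not by this script.

```python
"""Minimal sketch: report total size and usage of each Lucia fileset.

Assumes the /gpfs/* paths listed above are mounted locally; quota
limits are enforced and reported by IBM Spectrum Scale itself."""
import shutil

FILESETS = ["/gpfs/home", "/gpfs/projects", "/gpfs/softs", "/gpfs/scratch"]

for path in FILESETS:
    usage = shutil.disk_usage(path)            # total/used/free in bytes
    used_pct = 100 * usage.used / usage.total
    print(f"{path:<16} {usage.total / 1e12:8.1f} TB total  {used_pct:5.1f}% used")
```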

Performance Benchmarks

The scratch fileset, which benefits from the burst buffer, delivers the highest performance and is therefore ideal for I/O-intensive workloads; the other filesets provide balanced performance for general-purpose HPC tasks.

| Fileset  | Read Speed (GB/s) | Write Speed (GB/s) | IOPS (4k Reads) |
|----------|-------------------|--------------------|-----------------|
| /scratch | 270               | 200                | 4–5M            |
| Others   | 18                | 18                 | 450k            |
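A rough way to see the difference in practice is to time a large sequential write on /gpfs/scratch versus another fileset. The sketch below is a naive single-stream test whose results will be far below the aggregate figures above (those are measured with many parallel clients); the target paths are placeholders to adapt to your own directories.

```python
"""Naive single-stream write test; target paths are placeholders."""
import os
import time

def write_throughput(path: str, size_mb: int = 1024) -> float:
    """Write `size_mb` MB of random data to `path` and return GB/s."""
    block = os.urandom(1 << 20)               # 1 MB of random data
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                  # make sure data hits storage
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / 1024 / elapsed

# Hypothetical user paths; adjust to your own scratch/project directories.
for target in ["/gpfs/scratch/myuser/testfile", "/gpfs/projects/myproject/testfile"]:
    print(f"{target}: {write_throughput(target):.2f} GB/s")
```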

Backup System

Data in home, projects, and softs is backed up using IBM Spectrum Protect, while scratch is excluded. The backup infrastructure includes:

  • IBM TS4500 tape library with 200 tapes (20 TB each), providing 4 PB uncompressed capacity.

Warning

The /gpfs/scratch fileset is not backed up, as it is designed for temporary storage. Users must copy any critical data to a backed-up fileset (e.g. /gpfs/projects) before their jobs complete, as in the sketch below.
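One common pattern is to run I/O-heavy work in a scratch directory and stage the results out to a backed-up fileset at the end of the job. The paths below are hypothetical placeholders.

```python
"""Minimal stage-out sketch: copy job results from scratch to a
backed-up fileset before the job ends. All paths are hypothetical."""
import shutil
from pathlib import Path

scratch_dir = Path("/gpfs/scratch/myuser/job_12345/results")        # hypothetical
archive_dir = Path("/gpfs/projects/myproject/results/job_12345")    # hypothetical

archive_dir.parent.mkdir(parents=True, exist_ok=True)
# copytree preserves the directory layout; dirs_exist_ok allows re-runs.
shutil.copytree(scratch_dir, archive_dir, dirs_exist_ok=True)
print(f"Copied {scratch_dir} -> {archive_dir}")
```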


Interconnect

Lucia's communication network consists of two main parts:

  • Ethernet network

    The 10 Gb/s Ethernet network is mainly used for administrative communication and tasks, as well as for SSH access to the cluster and for user data transfers in and out of the cluster. The Ethernet network is divided into multiple subnets/VLANs dedicated to tasks such as node deployment, user access, or server/device management.

  • InfiniBand network

    Lucia features a high-speed, low-latency HDR InfiniBand network in a non-blocking fat-tree topology. The InfiniBand network is primarily used by the compute nodes to communicate and transfer data during jobs, as well as by the high-performance IBM Spectrum Scale storage system.
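Inter-node traffic on the InfiniBand fabric is typically generated by MPI. The hedged sketch below uses mpi4py (assuming an MPI-enabled Python environment is available on the cluster) to perform a simple allreduce across ranks, the kind of collective communication that benefits from a low-latency fabric.

```python
"""Minimal MPI sketch (assumes mpi4py and an MPI launcher such as srun):
each rank contributes its rank number and all ranks receive the sum via
a collective allreduce over the interconnect."""
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Collective communication: this is the traffic that travels over InfiniBand
# when the ranks are spread across several compute nodes.
total = comm.allreduce(rank, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, sum of ranks = {total}")
```

Launched, for example, with `srun --nodes=2 --ntasks-per-node=4 python allreduce_demo.py` (resource values hypothetical).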


Software Environment

  • Operating system: Red Hat Enterprise Linux 8
  • Job scheduler: Slurm 23.02
  • Web portal: Open OnDemand
  • Programming environment: Cray PE 22.09
  • Main software installation framework: EasyBuild 4.9.0
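As a quick sanity check after logging in, the sketch below queries the versions of two of these user-facing tools from Python; it assumes the `sinfo` (Slurm) and `eb` (EasyBuild) commands are on the PATH of the login shell.

```python
"""Minimal sketch: print the versions of some user-facing tools.
Assumes `sinfo` (Slurm) and `eb` (EasyBuild) are on the PATH."""
import shutil
import subprocess

for tool in ["sinfo", "eb"]:
    if shutil.which(tool) is None:
        print(f"{tool}: not found on PATH")
        continue
    out = subprocess.run([tool, "--version"], capture_output=True, text=True)
    print(out.stdout.strip() or out.stderr.strip())
```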