MPI
Lucia features AMD CPUs and NVIDIA A100 GPUs and offers several MPI implementations (OpenMPI, Intel MPI or Cray MPICH). Among these, we recommend OpenMPI as the default for most users because it provides the best overall compatibility and performance across both the CPU and GPU partitions. OpenMPI has robust support for CUDA-aware communication and integrates well with UCX and NCCL for GPU-accelerated workloads. Additionally, OpenMPI is widely portable and offers better interoperability with HPC libraries and containers (Apptainer). While Intel MPI and Cray MPICH have strengths on specific hardware (Intel CPUs or Cray networks), OpenMPI offers a more consistent experience on this mixed-architecture system.
Intel MPI
Although Intel MPI is a widely used and well-optimized MPI implementation (particularly for systems built with Intel hardware), it is not well suited to AMD-based HPC clusters such as Lucia. Intel MPI is tightly integrated with the Intel software ecosystem and is specifically optimized for Intel CPU architectures, particularly with respect to memory affinity, cache management, low-level threading and the handling of NUMA domains. These optimizations often assume Intel-specific hardware behavior and can lead to suboptimal or unstable behavior when run on AMD systems.
On Lucia, Intel MPI has shown limited stability and compatibility. Specifically:
- deadlocks during MPI_Init, especially in multi-node executions or when using hybrid MPI+OpenMP workloads,
- crashes and hangs during I/O phases, particularly in codes that perform large-scale parallel I/O or rely on non-blocking collectives,
- inconsistent performance and communication bottlenecks in GPU-accelerated codes when using Intel MPI in conjunction with CUDA-aware features.
Intel MPI's support for CUDA-aware communication is more limited and less stable than that of OpenMPI or Cray MPICH, which is a critical limitation for users running GPU-accelerated workloads.
For all these reasons (ranging from hardware mismatch and runtime stability to limited GPU support), Intel MPI is not recommended as the preferred MPI implementation on this system. Users are encouraged to use OpenMPI or Cray MPICH (with the UCX backend) for improved compatibility, stability and performance across both CPU and GPU partitions. Nevertheless, some advanced users may still wish to experiment with it:
- Loading Intel MPI

Various Intel MPI implementations are available through the EasyBuild software stacks.

Module category | Module name |
---|---|
EasyBuild/2022a | impi/2021.6.0-intel-compilers-2022.1.0 |
EasyBuild/2023a | impi/2021.9.0-intel-compilers-2023.1.0 |
EasyBuild/2024a | impi/2021.13.0-intel-compilers-2024.2.0 |
- Basic MPI program launch

Intel MPI may be run with either mpirun or mpiexec.hydra, but on Slurm-based systems, prefer using srun with the correct environment setup.
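A minimal sketch (the module name comes from the table above; the PMI library path and the application name are placeholders to adapt to your environment):

module load EasyBuild/2023a impi/2021.9.0-intel-compilers-2023.1.0   # stack and module as listed above
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so   # assumed path to Slurm's PMI2 library
srun --mpi=pmi2 -N 2 --ntasks-per-node=64 ./my_mpi_app   # ./my_mpi_app is a placeholder binary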
- Known issues and workarounds

Deadlocks during MPI_Init can be observed with higher core counts or complex NUMA configurations. Mitigations include disabling process pinning and reverting to TCP communication, which may reduce deadlocks at the cost of performance.
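A sketch of such a fallback (variable values should be checked against the documentation of the Intel MPI version in use):

export I_MPI_PIN=off     # disable process pinning
export FI_PROVIDER=tcp   # fall back to the TCP libfabric provider instead of the default fabric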
Crashes or hangs during I/O may also be observed, as the Intel MPI collective I/O implementation is fragile on non-Intel systems. You can try relaxing the MPI-IO hints as sketched below, or disable collective I/O entirely if supported by your code.
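One possible approach, assuming Intel MPI's ROMIO-based MPI-IO layer honours the standard ROMIO hints mechanism (the file name and location are placeholders):

cat > romio_hints <<EOF
romio_cb_write disable
romio_cb_read disable
EOF
export ROMIO_HINTS=$PWD/romio_hints   # point the MPI-IO layer at the hints file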
- Debugging Intel MPI

Set a higher debug level to track startup and fabric selection issues.
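For example (the debug levels shown are illustrative; higher values produce more output):

export I_MPI_DEBUG=5          # print rank pinning and fabric information at startup
export I_MPI_HYDRA_DEBUG=1    # verbose output from the Hydra process manager
export FI_LOG_LEVEL=debug     # libfabric provider selection details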
Use these logs to diagnose startup issues, environment mismatches or hangs in MPI_Init.
Warning
Intel MPI's ofi fabric layer is not optimized for the Mellanox HDR200 interconnect or the UCX stack used on Lucia. We do not support Intel MPI on this system and will be unable to provide debugging assistance for issues related to its use.
Cray MPICH
Cray MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard. In order to use Cray MPICH, it is recommended to use the HPE Cray compiler wrappers cc, CC and ftn. The wrappers will find the necessary MPI headers and libraries as well as the scientific libraries provided by LibSci. See the section Cray Programming Environment.
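A minimal sketch of the typical workflow (source and binary names are placeholders):

cc  -O2 -o my_app my_app.c       # C code: MPI headers/libraries and LibSci are found by the wrapper
ftn -O2 -o my_app my_app.f90     # or the Fortran equivalent
srun -N 2 -n 256 ./my_app        # launch through Slurm's srun (see the note on process starters below)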
Cray MPICH can use two different low-level protocols to transfer data across the network. The default is the Open Fabrics Interface (OFI); the UCX protocol from Mellanox is the alternative. Which one performs better is application-dependent, but experience on Lucia has shown that UCX is often more stable and faster for programs that send a lot of data collectively between many processes, e.g. all-to-all communication patterns such as those occurring in parallel FFTs. As a consequence, it has been decided to make UCX the default implementation on Lucia.
Note
Switching from the UCX to the OFI implementation is easy and does not require recompiling your code. Simply load the modules:
user@frontal01:~ # module load Cray/24.07
user@frontal01:~ # module load PrgEnv-cray/8.4.0
user@frontal01:~ # module list
Currently Loaded Modules:
1) Cray/24.07 (S) 3) craype-x86-milan 5) craype-network-ucx 7) cray-libsci/24.07.0
2) cce/18.0.0 4) craype/2.7.32 6) cray-mpich-ucx/8.1.30 8) PrgEnv-cray/8.4.0
Where:
S: Module is Sticky, requires --force to unload or purge
user@frontal01:~ # module load craype-network-ofi cray-mpich/8.1.30
Lmod is automatically replacing "craype-network-ucx" with "craype-network-ofi".
Lmod is automatically replacing "cray-mpich-ucx/8.1.30" with "cray-mpich/8.1.30".
user@frontal01:~ # module list
Currently Loaded Modules:
1) Cray/24.07 (S) 3) craype-x86-milan 5) cray-libsci/24.07.0 7) craype-network-ofi 9) libfabric/1.13.1
2) cce/18.0.0 4) craype/2.7.32 6) PrgEnv-cray/8.4.0 8) cray-mpich/8.1.30
Where:
S: Module is Sticky, requires --force to unload or purge
Cray MPICH offers improved algorithms for many collectives, an asynchronous progress engine to improve overlap of communications and computations, customizable collective buffering when using MPI-IO and optimized remote memory access (MPI one-sided communication) which also supports passive remote memory access.
MPI 3.1 is almost completely supported by Cray MPICH, with two exceptions: dynamic process management is not supported, and with CCE the MPI_LONG_DOUBLE and MPI_C_LONG_DOUBLE_COMPLEX datatypes are also not supported.
The Cray MPICH library does not support the mpirun or mpiexec commands. This is allowed by the standard, which only requires a process starter and merely suggests mpirun or mpiexec, depending on the version of the standard. Instead, the Slurm srun command is used as the process starter.
Environment variables
The following environment variables can be useful:
Variable | Purpose |
---|---|
MPICH_ENV_DISPLAY | Prints MPI environment info at runtime |
MPICH_MAX_THREAD_SAFETY | Controls the maximum thread safety level (for example, multiple) |
MPICH_VERSION_DISPLAY | Prints Cray MPICH version |
MPICH_RANK_REORDER_METHOD | Controls process mapping |
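For example, a job script might export some of these before the application is launched (a sketch; the binary name is a placeholder):

export MPICH_ENV_DISPLAY=1                # print the MPICH environment at startup
export MPICH_VERSION_DISPLAY=1            # print the Cray MPICH version
export MPICH_MAX_THREAD_SAFETY=multiple   # allow MPI_THREAD_MULTIPLE
srun -n 128 ./my_app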
Note
Cray MPICH supports thread safety levels up to MPI_THREAD_MULTIPLE.
user@frontal02:~ # vi thread_levels.c
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int provided;
MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE,&provided);
printf("Supports level %d of %d %d %d %d\n", provided,
MPI_THREAD_SINGLE,
MPI_THREAD_FUNNELED,
MPI_THREAD_SERIALIZED,
MPI_THREAD_MULTIPLE);
MPI_Finalize();
return 0;
}
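The example can be built with the Cray wrappers and run inside an allocation (a sketch; output omitted):

user@frontal02:~ # cc -o thread_levels thread_levels.c
user@frontal02:~ # srun -n 1 ./thread_levels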
Performance tuning hints
- UCX transport selection

Cray MPICH auto-selects between rc (reliable connection) and ud (unreliable datagram) based on job size:
- Small jobs (≤ MPICH_UCX_RC_MAX_RANKS, default = 8) use rc,self,sm
- Larger jobs use ud,self,sm for better scalability

The transport can be forced manually to control this behaviour.
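A sketch of how this can be done (MPICH_UCX_RC_MAX_RANKS is the limit described above; UCX_TLS is the generic UCX override, and the values are illustrative):

export MPICH_UCX_RC_MAX_RANKS=64   # keep the rc transport up to 64 ranks
# or override the UCX transport list directly:
export UCX_TLS=ud,self,sm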
- Memory registration modes

UCX defaults to rcache, which caches memory registrations, but at scale this can exhaust resources. If registration failures are observed, the registration cache can be disabled to avoid resource limits.
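A sketch, using the same variable that appears in the OpenMPI/UCX tuning section below:

export UCX_IB_REG_METHODS=direct   # bypass the registration cache (rcache)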
- Queue Pair (QP) depth and UCX buffers

For dense workloads, the UCX queue depth can be tuned by controlling the DCI count and zero-copy thresholds for improved at-scale performance.
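A possible starting point (the values are illustrative, not tuned recommendations):

export UCX_DC_MLX5_NUM_DCI=16   # number of DC initiator queue pairs per process
export UCX_ZCOPY_THRESH=16384   # zero-copy threshold in bytes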
- Collective algorithms

If contention on collective calls is observed, specific HPE Cray optimized collectives can be overridden to steer performance.
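A sketch of this kind of override; the variable name and accepted values should be verified against man intro_mpi on Lucia before use:

export MPICH_COLL_OPT_OFF=mpi_allreduce   # assumed variable: disables the HPE-optimized MPI_Allreduce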
📚 Documentation resources
- Cray MPICH manpages: man intro_mpi
OpenMPI
Per the Open MPI project website: "The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners." OpenMPI is one of the most commonly used MPI implementations in High Performance Computing (HPC) and conforms to the MPI standard.
Different versions of OpenMPI are available on Lucia through the EasyBuild software stacks:
Module category | Module name |
---|---|
EasyBuild/2022a | OpenMPI/4.0.5-GCC-11.3.0 |
OpenMPI/4.1.4-GCC-11.3.0 | |
EasyBuild/2023a | OpenMPI/4.1.5-GCC-12.3.0 |
EasyBuild/2024a | OpenMPI/5.0.3-GCC-13.3.0 |
CUDA support has been built into Lucia's OpenMPI versions. These versions may of course be used on non-CUDA-capable nodes without any errors or performance issues, but users who want GPU support should purposefully select one of these versions if their job will run on the GPU nodes, in order to leverage the installed CUDA cards. As a reminder, make sure to specify --gres=gpu:# (where # is an integer between 1 and 4) when requesting GPU-capable nodes. These "CUDA-aware" versions of OpenMPI can be loaded using one of the following modules:
Module category | Module name |
---|---|
EasyBuild/2022a | OpenMPI/4.1.4-NVHPC-22.7-CUDA-11.7.0 |
EasyBuild/2023a | OpenMPI/4.1.5-NVHPC-23.7-CUDA-12.2.0 |
EasyBuild/2024a | OpenMPI/5.0.3-NVHPC-25.1-CUDA-12.6.0 |
Note
OpenMPI is integrated with Slurm, and jobs can be submitted via srun or by calling mpirun directly.
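For example (a sketch; the binary name is a placeholder):

srun -N 2 --ntasks-per-node=64 ./my_app
# or, equivalently:
mpirun -np 128 ./my_app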
Modular Component Architecture (mca)
The Modular Component Architecture (mca) is a mechanism that may be used to fine-tune runtime parameters when using mpirun. Users sometimes tweak runtime parameters by specifying mca attributes, including the selection of specific network communication protocols. All OpenMPI versions have been compiled with the Unified Communication X (UCX) communication library, which is therefore used by default to select the optimal network communication model. UCX is a high-performance communication framework that provides low-level communication support for MPI, SHMEM and other HPC middleware.
UCX currently supports:
- OpenFabrics Verbs (including InfiniBand and RoCE)
- TCP
- Shared memory
- NVIDIA CUDA drivers (applicable on GPU nodes)
While some users may choose to manually set mca transports, most will probably achieve optimal performance by allowing OpenMPI to utilize UCX at runtime.
For explicit control, the following UCX options can be used:
- --mca pml ucx: use UCX as the point-to-point messaging layer
- --mca osc ucx: use UCX for one-sided communication (RMA)
- -x UCX_NET_DEVICES=mlx5_0:1: bind UCX to a specific InfiniBand device/port
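Combined in a single invocation, this could look as follows (a sketch; the rank count, device and binary name are placeholders):

mpirun -np 128 --mca pml ucx --mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 ./my_app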
Note
If using old builds, legacy mca parameters may be needed instead of the UCX options above.
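The exact parameters depend on the build; one possibility (an assumption, for builds that still ship the legacy openib and vader components) is:

mpirun -np 64 --mca btl openib,self,vader ./my_app   # placeholder binary; adjust components to what the build provides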
Performance tuning with UCX
On InfiniBand clusters, OpenMPI uses UCX to access hardware features like RDMA, tag matching and atomic operations. When properly configured, UCX can significantly reduce communication latency and improve bandwidth. UCX tuning is done primarily via environment variables and may help in:
- minimizing latency on small messages,
- maximizing throughput on large messages (collectives, all-to-alls),
- optimizing NUMA and IB port affinity,
- avoiding fallback to TCP/IP or suboptimal paths.
Some indications are given below:
- Select the network interface

In order to avoid automatic (and sometimes suboptimal) interface selection, UCX can be bound to specific InfiniBand NIC(s) and port(s).
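For example (the device and port shown are those discussed below; available devices can be listed with ucx_info -d):

export UCX_NET_DEVICES=mlx5_0:1
# or per job through mpirun:
mpirun -x UCX_NET_DEVICES=mlx5_0:1 -np 128 ./my_app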
where:
- mlx5_0 is the IB device name
- :1 is the port number.

Note that choosing the wrong port or device can introduce additional hops or use a slower InfiniBand link.
- Transport layers control

UCX uses multiple transport layers (TLs). It is possible to limit which ones are used, to avoid overhead or incompatible protocols.
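For example (an illustrative selection; the meaning of each transport is given below):

export UCX_TLS=rc,self,sm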
where:
- rc: reliable connection over InfiniBand (good for low-latency, low-scale)
- ud: unreliable datagram (used at scale, auto-scaling)
- self: for local-loopback messages
- sm: shared memory (intra-node communication)
- Memory type and cache settings

It is possible to disable caching if memory registration is a bottleneck.
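For example (both variables are explained below):

export UCX_MEMTYPE_CACHE=n
export UCX_IB_REG_METHODS=direct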
where:
- UCX_MEMTYPE_CACHE=n disables caching of memory types (mainly for GPU workloads; often safe to disable on CPU)
- UCX_IB_REG_METHODS=direct avoids using the registration cache; helpful if memory registration limits are hit.
- Message protocol thresholds

It is sometimes important to control how UCX handles small vs. large messages, depending on the workload.
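For example (the threshold value matches the one discussed below and is purely illustrative):

export UCX_ZCOPY_THRESH=16384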
In this case, messages larger than 16384 bytes use zero-copy RDMA, which improves bandwidth. This value can be tuned depending on the workload: a lower threshold allows more aggressive RDMA, while a higher threshold provides more buffered transfers.

- Information
Command or environment variable | Purpose |
---|---|
export UCX_LOG_LEVEL=info | shows transport and devices actually in use |
ucx_info -c | shows UCX configuration |
ucx_info -d | shows available devices and interfaces |
📚 External resources