MPI

Lucia features AMD CPUs and NVIDIA A100 GPUs and offers several MPI implementations (OpenMPI, Intel MPI or Cray MPICH). Among these, we recommend OpenMPI as the default for most users because it provides the best overall compatibility and performance across both the CPU and GPU partitions. OpenMPI has robust support for CUDA-aware communication and integrates well with UCX and NCCL for GPU-accelerated workloads. Additionally, OpenMPI is widely portable and offers better interoperability with HPC libraries and containers (Apptainer). While Intel MPI and Cray MPICH have strengths on specific hardware (Intel CPUs or Cray networks), OpenMPI offers a more consistent experience on this mixed-architecture system.

Intel MPI

Although Intel MPI is a widely used and well-optimized MPI implementation (particularly for systems built with Intel hardware), it is not well suited to AMD-based HPC clusters such as Lucia. Intel MPI is tightly integrated with the Intel software ecosystem and is specifically optimized for Intel CPU architectures, particularly with respect to memory affinity, cache management, low-level threading and the handling of NUMA domains. These optimizations often assume Intel-specific hardware behavior and can lead to suboptimal or unstable behavior when run on AMD systems.

On Lucia, Intel MPI has shown limited stability and compatibility. Specifically:

  • deadlocks during MPI_Init, especially in multi-node executions or when using hybrid MPI+OpenMP workloads,

  • crashes and hangs during I/O phases, particularly in codes that perform large-scale parallel I/O or rely on non-blocking collectives,

  • inconsistent performance and communication bottlenecks in GPU-accelerated codes when using Intel MPI in conjunction with CUDA-aware features.

Intel MPI support for CUDA-aware communication is more limited and less stable compared to OpenMPI or Cray MPICH, which is a critical limitation for users running GPU-accelerated workloads.

For all these reasons (ranging from hardware mismatch and runtime stability to limited GPU support), Intel MPI is not recommended as the preferred MPI implementation on this system. Users are encouraged to use OpenMPI or Cray MPICH (with the UCX backend) for improved compatibility, stability and performance across both CPU and GPU partitions. Nevertheless, some advanced users may still wish to experiment with it:

  1. Loading Intel MPI

    Various Intel MPI versions are available through the EasyBuild software stacks.

    Module category Module name
    EasyBuild/2022a impi/2021.6.0-intel-compilers-2022.1.0
    EasyBuild/2023a impi/2021.9.0-intel-compilers-2023.1.0
    EasyBuild/2024a impi/2021.13.0-intel-compilers-2024.2.0
  2. Basic MPI program launch

    Intel MPI may be run with either mpirun or mpiexec.hydra, but on Slurm-based systems, prefer using srun with the correct environment setup:

    export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
    export I_MPI_DEBUG=5
    srun -n 4 ./my_program
    
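    A minimal Slurm batch script along these lines is sketched below (the EasyBuild/2023a module load is an assumption about how the stack is exposed, and the partition and account names are placeholders to adapt to your project):

        #!/bin/bash
        #SBATCH --job-name=impi_test
        #SBATCH --nodes=2
        #SBATCH --ntasks-per-node=4
        #SBATCH --time=00:10:00
        #SBATCH --partition=debug          # placeholder, adapt to your allocation
        #SBATCH --account=p_userproject    # placeholder, adapt to your project

        # assumption: the EasyBuild stack is exposed as a module of the same name
        module load EasyBuild/2023a
        module load impi/2021.9.0-intel-compilers-2023.1.0

        # same environment setup as above
        export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
        export I_MPI_DEBUG=5

        srun ./my_program
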
  3. Known issues and workarounds

    Deadlocks during MPI_Init can be observed with higher core counts or complex NUMA configurations. Mitigations include:

    export I_MPI_PIN=0
    export I_MPI_FABRICS=shm:tcp
    

    Disabling pinning and reverting to TCP communication may reduce deadlocks at the cost of performance.

    Crashes or hangs during I/O may be observed, as Intel MPI's collective I/O implementation is fragile on non-Intel systems. You can try:

    export I_MPI_COLLECTIVE_DEFAULT=auto
    export I_MPI_EXTRA_FILESYSTEM=off
    

    or disable collective I/O entirely if supported by your code.

  4. Debugging Intel MPI

    Set a higher debug level to track startup and fabric selection issues:

    export I_MPI_DEBUG=10
    export I_MPI_HYDRA_DEBUG=1
    

    Use these logs to diagnose startup issues, environment mismatches or hangs in MPI_Init.

Warning

Intel MPI’s ofi fabric layer is not optimized for the Mellanox HDR200 interconnect or the UCX stack used on Lucia. We do not support Intel MPI on this system and will be unable to provide debugging assistance for issues related to its use.

Cray MPICH

Cray MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard. In order to use Cray MPICH, it is recommended to use the HPE Cray compiler wrappers cc, CC and ftn. The wrappers will find the necessary MPI headers and libraries as well as the scientific libraries provided by LibSci. See section Cray Programming Environment.
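
As a quick illustration (a sketch; my_mpi_app.c is a placeholder for your own source file), building an MPI program with the wrappers looks like:

    user@frontal01:~ # module load Cray/24.07 PrgEnv-cray/8.4.0
    user@frontal01:~ # cc -O2 -o my_mpi_app my_mpi_app.c    # MPI headers/libraries and LibSci are added by the wrapper; ftn and CC work the same way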

Cray MPICH can use two different low-level protocols to transfer data across the network. The standard default is the Open Fabrics Interface (OFI); the UCX protocol from Mellanox is the alternative. Which performs better is application-dependent, but experience on Lucia has shown that UCX is often more stable and faster for programs that send a lot of data collectively between many processes, e.g. all-to-all communication patterns such as those occurring in parallel FFTs. As a consequence, it has been decided to make UCX the default implementation on Lucia.

Note

Switching from the UCX to the OFI implementation is easy and does not require recompiling your code. Simply load the modules:

    user@frontal01:~ # module load Cray/24.07 
    user@frontal01:~ # module load PrgEnv-cray/8.4.0
    user@frontal01:~ # module list

    Currently Loaded Modules:
    1) Cray/24.07 (S)   3) craype-x86-milan   5) craype-network-ucx      7) cray-libsci/24.07.0
    2) cce/18.0.0       4) craype/2.7.32      6) cray-mpich-ucx/8.1.30   8) PrgEnv-cray/8.4.0

    Where:
    S:  Module is Sticky, requires --force to unload or purge

    user@frontal01:~ # module load craype-network-ofi cray-mpich/8.1.30

    Lmod is automatically replacing "craype-network-ucx" with "craype-network-ofi".

    Lmod is automatically replacing "cray-mpich-ucx/8.1.30" with "cray-mpich/8.1.30".

    user@frontal01:~ # module list

    Currently Loaded Modules:
    1) Cray/24.07 (S)   3) craype-x86-milan   5) cray-libsci/24.07.0   7) craype-network-ofi   9) libfabric/1.13.1
    2) cce/18.0.0       4) craype/2.7.32      6) PrgEnv-cray/8.4.0     8) cray-mpich/8.1.30

    Where:
    S:  Module is Sticky, requires --force to unload or purge

Cray MPICH offers improved algorithms for many collectives, an asynchronous progress engine to improve overlap of communications and computations, customizable collective buffering when using MPI-IO and optimized remote memory access (MPI one-sided communication) which also supports passive remote memory access.

Cray MPICH supports MPI 3.1 almost completely, with two exceptions: dynamic process management is not supported, and MPI_LONG_DOUBLE and MPI_C_LONG_DOUBLE_COMPLEX are not supported with CCE.

The Cray MPICH library does not provide the mpirun or mpiexec commands. This is allowed by the MPI standard, which only requires a process starter and merely suggests mpirun or mpiexec depending on the version of the standard. The Slurm srun command is used as the process starter instead.
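
As an illustration, a minimal batch script might look as follows (a sketch; node and task counts, partition and account are placeholders to adapt):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=64       # illustrative values, adapt to the node type
    #SBATCH --time=00:30:00
    #SBATCH --partition=debug          # placeholder, adapt to your allocation
    #SBATCH --account=p_userproject    # placeholder, adapt to your project

    module load Cray/24.07 PrgEnv-cray/8.4.0

    # srun is the process starter; mpirun/mpiexec are not provided by Cray MPICH
    srun ./my_mpi_app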

Environment variables

Some useful environment variables can be used:

Variable Purpose
MPICH_ENV_DISPLAY Prints MPI environment info at runtime
MPICH_MAX_THREAD_SAFETY Controls the maximum thread safety level (for example, multiple)
MPICH_VERSION_DISPLAY Prints Cray MPICH version
MPICH_RANK_REORDER_METHOD Controls process mapping
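
For example, the following settings (a sketch; my_mpi_app is a placeholder for your executable) print the MPI environment and library version at startup and request full thread safety:

    export MPICH_ENV_DISPLAY=1
    export MPICH_VERSION_DISPLAY=1
    export MPICH_MAX_THREAD_SAFETY=multiple
    srun -n 4 ./my_mpi_app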

Note

Cray MPICH supports up to MPI_THREAD_MULTIPLE thread safety level.

   user@frontal02:~ # vi thread_levels.c
   #include "mpi.h"
   #include <stdio.h>

   int main( int argc, char *argv[] )
   {
      int provided;

      MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE,&provided);

      printf("Supports level %d of %d %d %d %d\n", provided,
             MPI_THREAD_SINGLE,
             MPI_THREAD_FUNNELED,
             MPI_THREAD_SERIALIZED,
             MPI_THREAD_MULTIPLE);

      MPI_Finalize();
      return 0;
   }
    user@frontal02:~ # cc -o thread_levels thread_levels.c
    user@frontal02:~ # srun -n 1 --time=00:01:00 --partition=debug --account=p_userproject ./thread_levels
    Supports level 3 of 0 1 2 3

Performance tuning hints

  1. UCX transport selection
    Cray MPICH auto-selects between rc (reliable connection) and ud (unreliable datagram) based on job size:

    • Small jobs (≤ MPICH_UCX_RC_MAX_RANKS, default = 8) use rc,self,sm
    • Larger jobs use ud,self,sm for better scalability

    The transport can be forced manually to control this behaviour:

         export MPICH_UCX_RC_MAX_RANKS=16
         export UCX_TLS="ud,self,sm"
    
  2. Memory registration modes
    UCX defaults to rcache, which caches memory registrations, but at scale this can exhaust resources. If registration failures are observed, the following environment variable can help:

        export UCX_IB_REG_METHODS=direct
    
    to disable caching and avoid hitting resource limits.

  3. Queue Pair (QP) depth and UCX buffers
    For dense workloads, the UCX queue depth can be tuned by controlling the DCI count and the zero-copy threshold to improve performance at scale:

        export UCX_DC_MLX5_NUM_DCI=16
        export UCX_ZCOPY_THRESH=16384
    

  4. Collective algorithms
    If contention in collective calls is observed, specific HPE Cray optimized collectives can be disabled to steer performance:

        export MPICH_COLL_OPT_OFF=mpi_allgather,mpi_alltoall
    

OpenMPI

Per the OpenMPI consortium website: "The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners." OpenMPI is one of the most commonly used MPI implementations in High Performance Computing (HPC) and implements the MPI standard.

Different versions of OpenMPI are available on Lucia through the EasyBuild software stacks:

Module category Module name
EasyBuild/2022a OpenMPI/4.0.5-GCC-11.3.0
OpenMPI/4.1.4-GCC-11.3.0
EasyBuild/2023a OpenMPI/4.1.5-GCC-12.3.0
EasyBuild/2024a OpenMPI/5.0.3-GCC-13.3.0

CUDA support has been built into some of Lucia's OpenMPI versions. These versions may of course be used on non-CUDA-capable nodes without any errors or performance issues, but users who need GPU support should deliberately select one of these versions if their job will run on the GPU nodes, in order to leverage the installed NVIDIA GPUs. As a reminder, make sure to specify --gres=gpu:# (where # is an integer between 1 and 4) when requesting GPU-capable nodes; a minimal job script is sketched after the table below. These "CUDA-aware" versions of OpenMPI can be loaded using one of the following modules:

Module category Module name
EasyBuild/2022a OpenMPI/4.1.4-NVHPC-22.7-CUDA-11.7.0
EasyBuild/2023a OpenMPI/4.1.5-NVHPC-23.7-CUDA-12.2.0
EasyBuild/2024a OpenMPI/5.0.3-NVHPC-25.1-CUDA-12.6.0
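
As referenced above, a minimal GPU job script might look as follows (a sketch; the gpu partition name and the EasyBuild module load are assumptions, and node/GPU counts are only illustrative):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2
    #SBATCH --gres=gpu:2               # between 1 and 4 GPUs per node
    #SBATCH --time=00:30:00
    #SBATCH --partition=gpu            # placeholder, adapt to your allocation
    #SBATCH --account=p_userproject    # placeholder, adapt to your project

    # assumption: the EasyBuild stack is exposed as a module of the same name
    module load EasyBuild/2023a
    module load OpenMPI/4.1.5-NVHPC-23.7-CUDA-12.2.0

    srun ./my_cuda_aware_app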

Note

OpenMPI is integrated with Slurm and jobs can be submitted via srun or by calling mpirun directly.

Modular Component Architecture (mca)

Modular Component Architecture (mca) is a mechanism that may be used to fine-tune runtime parameters when using mpirun. Users sometimes tweak runtime parameters by specifying mca attributes, including the selection of specific network communication protocols. All OpenMPI versions have been compiled with the Unified Communication X (UCX) communication library, which is therefore used by default to select the optimal network communication model. UCX is a high-performance communication framework that provides low-level communication support for MPI, SHMEM and other HPC middleware.

UCX currently supports:

  • OpenFabrics Verbs (including InfiniBand and RoCE)

  • TCP

  • Shared memory

  • NVIDIA CUDA drivers (applicable on GPU nodes)

While some users may choose to manually set mca transports, most will probably achieve optimal performance by allowing OpenMPI to utilize UCX at runtime. For explicit control, the following UCX options can be used:

    mpirun --mca pml ucx --mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 ./my_mpi_app

  • --mca pml ucx: use UCX as the point-to-point messaging layer

  • --mca osc ucx: use UCX for one-sided communication (RMA).

  • -x UCX_NET_DEVICES=mlx5_0:1: bind UCX to a specific InfiniBand device/port.

Note

If using old builds, the following mca parameters may be used:

    mpirun --mca btl openib,self,vader ./my_mpi_app
but this is not recommended; migrating to UCX is strongly advised, as it offers better scalability, hardware offload and integration with other frameworks (SHMEM, NCCL).

Performance tuning with UCX

On InfiniBand clusters, OpenMPI uses UCX to access hardware features like RDMA, tag matching and atomic operations. When properly configured, UCX can significantly reduce communication latency and improve bandwidth. UCX tuning is done primarily via environment variables and may help in:

  • minimizing latency on small messages,
  • maximizing throughput on large messages (collectives, all-to-alls),
  • optimizing NUMA and IB port affinity,
  • avoiding fallback to TCP/IP or suboptimal paths.

Some indications are given below:

  1. Select the network interface

    In order to avoid automatic (and sometimes suboptimal) interface selection, UCX can be bound to specific InfiniBand NIC(s) and port(s):

        export UCX_NET_DEVICES=mlx5_0:1
    
    where:

    • mlx5_0 is the IB device name

    • :1 is the port number.

    Note that choosing the wrong port or device can introduce additional hops or use a slower InfiniBand link.

  2. Transport layers control

    UCX uses multiple transport layers (TLs). It is possible to limit which ones are used to avoid overhead or incompatible protocols.

        export UCX_TLS=rc,self,sm
        # or
        export UCX_TLS=ud,self,sm
    
    where:

    • rc: reliable connection over InfiniBand (good for low-latency, low-scale)

    • ud: unreliable datagram (used at scale, auto-scaling)

    • self: for local-loopback messages

    • sm: shared memory (intra-node communication)

  3. Memory type and cache settings

    It is possible to disable caching if memory registration is a bottleneck:

       export UCX_MEMTYPE_CACHE=n
       export UCX_IB_REG_METHODS=direct
    
    where:

    • UCX_MEMTYPE_CACHE=n disables caching of memory types (mainly for GPU workloads — often safe to disable on CPU)

    • UCX_IB_REG_METHODS=direct avoids using registration cache; helpful if memory registration limits are hit.

  4. Message protocol thresholds

    It is sometimes important to control how UCX handles small vs. large messages depending on the workload.

        export UCX_ZCOPY_THRESH=16384
    
    In this case, messages larger than 16384 bytes use zero-copy RDMA, which improves bandwidth. This value can be tuned depending on the workload: a lower threshold allows more aggressive RDMA, while a higher threshold favours buffered transfers.

  5. Information

Command or environment variable Purpose
export UCX_LOG_LEVEL=info shows transport and devices actually in use
ucx_info -c shows UCX configuration
ucx_info -d shows available devices and interfaces
