
Compilers and debuggers

Compilers

An overview of all compilers available on Lucia is presented below. Various versions of C, C++ and Fortran compilers can be used to compile and run scientific codes. Each compiler is available through its own module.

GNU compilers

The base GNU Compiler Collection provided by the operating system is GCC 8.5.0: gcc, g++ and gfortran are available by default in every user environment. As this compiler is somewhat outdated, newer versions are provided through modules.

| Module category | Module name          | Provides...  |
|-----------------|----------------------|--------------|
| Independent     | compilers/gcc/10.2.0 | GCC v.10.2.0 |
| EasyBuild/2022a | GCC/11.3.0           | GCC v.11.3.0 |
| EasyBuild/2023a | GCC/12.3.0           | GCC v.12.3.0 |
| EasyBuild/2024a | GCC/13.3.0           | GCC v.13.3.0 |

Note

The EasyBuild GCC modules listed above provide not only the GCC compilers but also the associated compile-time and runtime libraries (such as libstdc++, libquadmath, etc.)

Warning

Note that older versions of GCC are neither installed nor supported.

Some hints for compiling old legacy codes with newer GCC versions (see the example after this list):

- Fortran: try `-fallow-argument-mismatch` first, followed by the more extensive flag `-std=legacy` to reduce strictness
- C/C++: look for flags to reduce strictness - such as `-fpermissive`
- C/C++: `-Wpedantic` can warn about lines that break code standards
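
For example, a legacy Fortran source file (hypothetical name `legacy_code.f`) could be compiled with a recent GCC module as follows; this is only a sketch and the flags usually need to be adjusted per code:

    module load EasyBuild/2023a
    module load GCC/12.3.0

    # Legacy Fortran: relax argument-mismatch checks first, fall back to -std=legacy if needed
    gfortran -fallow-argument-mismatch -O2 -o legacy_code legacy_code.f

    # Legacy C/C++: reduce strictness and report standard violations as warnings
    g++ -fpermissive -Wpedantic -O2 -o legacy_tool legacy_tool.cpp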

📚 See the full documentation of the GCC compilers. Additionally, compiler documentation is provided through man pages (e.g., man g++) and through the --help flag of each compiler (for example, gfortran --help).

AMD compilers

The AMD compiler suite is named AOCC (AMD Optimizing C/C++ Compiler). It is based on LLVM and includes many optimizations for AMD processors. It provides Flang as the Fortran front-end compiler.

The AOCC compiler drivers are named as follows:

  • C compiler: clang
  • C++ compiler: clang++
  • Fortran compiler: flang

Different versions of the AOCC compilers are provided via the EasyBuild modules:

| Module category | Module name               | Provides...                                             |
|-----------------|---------------------------|---------------------------------------------------------|
| EasyBuild/2022a | AOCC/3.2.0-GCCcore-11.3.0 | AOCC v.3.2.0 + GCC 11.3.0 compile and runtime libraries |
| EasyBuild/2022a | AOCC/4.0.0-GCCcore-11.3.0 | AOCC v.4.0.0 + GCC 11.3.0 compile and runtime libraries |
| EasyBuild/2023a | AOCC/4.0.0-GCCcore-12.3.0 | AOCC v.4.0.0 + GCC 12.3.0 compile and runtime libraries |
| EasyBuild/2024a | AOCC/4.2.0-GCCcore-13.3.0 | AOCC v.4.2.0 + GCC 13.3.0 compile and runtime libraries |

The table below shows a quick comparison of these different versions:

| Feature                          | AOCC 3.2.0  | AOCC 4.0.0  | AOCC 4.2.0       |
|----------------------------------|-------------|-------------|------------------|
| Base LLVM version                | LLVM 13     | LLVM 14     | LLVM 15          |
| Primary Zen architecture support | Zen2, Zen3  | Zen2, Zen3  | Zen2, Zen3, Zen4 |
| Auto-vectorization               | Basic       | Enhanced    | Advanced         |
| Recommended for                  | Legacy code | General use | Best performance |

Tip

It is advised to use the -march=znver3 option on Lucia, as it optimizes the code for the AMD EPYC Zen 3 architecture.
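
For example, a code could be compiled with AOCC as follows (a minimal sketch; the module and file names are illustrative):

    module load EasyBuild/2023a
    module load AOCC/4.0.0-GCCcore-12.3.0

    # Optimize for the Zen3 architecture
    clang   -O3 -march=znver3 -o my_c_code   my_c_code.c
    clang++ -O3 -march=znver3 -o my_cpp_code my_cpp_code.cpp
    flang   -O3 -march=znver3 -o my_f_code   my_f_code.f90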

📚 Useful documentation

Intel compilers

Intel compilers and tools are not supported on Lucia (although several versions are available). There have been (and still are) limitations when using Intel software on non-Intel platforms (especially AMD processors). Moreover, AMD EPYC processors are not officially supported by Intel at the software level. See Intel® HPC Toolkit System Requirements for details.

On Lucia, we also observed that several codes did not work properly with the Intel compiler / IntelMPI combination, exhibiting in particular unexpected message-passing stalls or I/O problems. Intel's support only provided the following kind of answer: "Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include those that are part of the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others which can be found here – Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, Intel® oneAPI IoT Toolkit System Requirements"

Although not supported, the Intel compilers are nevertheless made available to avoid duplicate local installations in users' space. The table below lists the different Intel compiler versions available on Lucia:

| Module category | Module name              |
|-----------------|--------------------------|
| EasyBuild/2022a | intel-compilers/2022.1.0 |
| EasyBuild/2023a | intel-compilers/2023.1.0 |
| EasyBuild/2024a | intel-compilers/2024.2.0 |

Note

Traditional Intel tools (such as the Intel MPI library or the Intel MKL library) are also available in the same module categories.

Warning

We are aware of some "tricks" available on the Internet (even in the official documentation of some HPC systems) to bypass the CPU type check in order to obtain the best performance of some Intel tools on non-Intel processors. We strongly discourage users from using these tricks, and no support will be provided in this case.

Cray compilers

Cray compilers are provided through the Cray Programming Environment. As their usage is highly specific, a special section of the documentation is reserved to the CPE. Please check Cray Programming Environment section for detailed information.

Clang/LLVM

LLVM is a collection of compiler and toolchain tools and libraries which includes its own Clang compiler (clang, clang++). The LLVM Core libraries, along with the compilers, are locally built and compiled against the GCC compiler suite. The LLVM Core libraries provide a modern source- and target-independent optimizer, along with code generation support for many popular CPUs. These libraries are built around a well-specified code representation known as the LLVM intermediate representation ("LLVM IR"). The LLVM Core libraries are well documented, and it is particularly easy to invent your own language (or port an existing compiler) to use LLVM as an optimizer and code generator.

It is important to use the appropriate compiler driver for the programming language in use: clang is intended for compiling C code, while clang++ should be used for C++ projects. Unlike the GNU toolchain, Clang does not automatically link the C++ standard libraries when using the C compiler to compile C++ code, so using the correct driver avoids linking issues and runtime errors. Clang provides fine-grained control over target CPU tuning (option -march=), allowing users to specify the exact architecture features of the processor. This not only improves execution speed but also ensures better utilization of available CPU instructions.

Clang's diagnostic engine is one of its strengths, offering clear and informative error messages. For debugging, Clang supports a suite of runtime sanitizers that help detect memory leaks, undefined behavior and concurrency issues. Options like -Wall -Wextra -Werror -g -fsanitize=address can help identify potential bugs or non-portable code constructs. Clang also includes a static analysis tool, scan-build, that can be used to perform a deeper examination of the source code without executing it. This is especially helpful for identifying subtle bugs, memory mismanagement or edge cases that might not be caught during typical testing.
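
The commands below sketch this workflow (my_code.cpp is a hypothetical source file; the module shown is one of those listed further down):

    module load EasyBuild/2024a
    module load Clang/18.1.8-GCCcore-13.3.0

    # Strict diagnostics, debug info and AddressSanitizer instrumentation
    clang++ -Wall -Wextra -Werror -g -fsanitize=address -o my_code my_code.cpp

    # Static analysis of the same source without executing it
    scan-build clang++ -c my_code.cpp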

For performance code optimization, Link Time Optimization (LTO) (option -flto) is supported and can provide significant improvements by allowing the compiler to analyze the entire program during the linking phase.
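
A minimal LTO sketch, assuming a hypothetical two-file project:

    # Compile each translation unit with LTO enabled, then link with LTO
    # so that cross-file optimizations can be applied at link time
    clang++ -O3 -flto -c main.cpp
    clang++ -O3 -flto -c utils.cpp
    clang++ -O3 -flto -o my_code main.o utils.o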

LLVM/Clang is available through the EasyBuild software stacks:

| Module category | Module name                 |
|-----------------|-----------------------------|
| EasyBuild/2023a | LLVM/16.0.6-GCCcore-12.3.0  |
| EasyBuild/2024a | LLVM/18.1.8-GCCcore-13.3.0  |
| EasyBuild/2024a | Clang/18.1.8-GCCcore-13.3.0 |

📚 For documentation of the LLVM compilers, see LLVM, Clang, and Flang websites.

Note

When mixing Clang with components built using other compilers (such as GCC), compatibility between C++ standard libraries should be considered. Inconsistent usage can lead to ABI (Application Binary Interface) incompatibilities.

NVCC

The NVIDIA CUDA Compiler (nvcc) is used to compile applications written in CUDA C and C++ for execution on NVIDIA GPUs. nvcc acts as a wrapper around both the host and device compilation processes, generating binaries that include CPU code compiled by a host compiler (typically gcc, g++ or clang++) and GPU code compiled into PTX or SASS for NVIDIA GPUs.

When working with nvcc, source code should be structured to clearly separate host and device components, making use of the __host__, __device__ and __global__ function qualifiers to control where code executes. nvcc handles this distinction automatically, but writing code with a clear separation of concerns improves portability and reduces confusion during debugging. The target GPU architecture should be explicitly specified using the -arch flag (e.g. -arch=sm_80 for NVIDIA A100 GPUs) to ensure that the compiled device code is optimized for the available hardware. This avoids potential mismatches and enables the use of hardware-specific features.
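
For example (a minimal sketch; my_kernel.cu is a hypothetical CUDA source file):

    module load EasyBuild/2023a
    module load CUDA/12.2.0

    # Build optimized device code for A100 GPUs (compute capability 8.0)
    nvcc -O3 -arch=sm_80 -o my_gpu_code my_kernel.cu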

Code correctness can be improved by enabling all relevant compiler warnings. Compiler diagnostics can catch common issues such as type mismatches, unused variables or illegal memory access patterns. Code should be compiled with debug symbols (-g for host code, -G for device code) and reduced optimization levels (-O0) to allow source-level debugging with tools such as cuda-gdb.
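
A hedged sketch of a debug build suitable for cuda-gdb:

    # Host (-g) and device (-G) debug information, no optimization
    nvcc -g -G -O0 -arch=sm_80 -o my_gpu_code_dbg my_kernel.cu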

nvcc provides flags for controlling optimization, inlining, loop unrolling and other low-level behaviors. However, tuning should also take into account occupancy, shared memory usage and register pressure. Tools such as Nsight Compute and the CUDA Occupancy Calculator are essential for understanding and improving the performance of GPU kernels. Users are encouraged to profile their applications thoroughly to identify performance bottlenecks.

Note

The CUDA compiler relies on an external host compiler to build CPU code. On RHEL 8.10, nvcc is compatible with specific versions of gcc, g++, and clang++. Users must ensure that the selected host compiler version is officially supported by the CUDA Toolkit in use. Mismatches can cause compilation errors or undefined behavior.

CUDA/nvcc are available in the EasyBuild software stacks:

| Module category | Module name |
|-----------------|-------------|
| EasyBuild/2022a | CUDA/11.7.0 |
| EasyBuild/2023a | CUDA/12.2.0 |
| EasyBuild/2024a | CUDA/12.6.0 |

📚 For detailed documentation, refer to:

Debuggers

Each of the compilers presented above provides its own debugger:

| Compiler    | Debugger      | Information                                                   |
|-------------|---------------|---------------------------------------------------------------|
| GCC/AOCC    | GDB           | Compile with -g, prefer -O0 + -fno-inline for full debug info |
| Intel       | GDB (patched) | Intel-enhanced GDB. Compile with -g                           |
| Clang       | LLDB or GDB   | LLDB native - GDB common - same compile flags as GCC          |
| CUDA (nvcc) | cuda-gdb      | Requires -g -G - debug both host and GPU portions             |

GCC / AMD AOCC

The primary debugger for GCC and AMD's AOCC compiler suite is GDB (GNU Debugger). It supports source-level debugging for languages including C, C++, Fortran and more. Users should compile with -g and preferably disable optimizations (-O0, -fno-inline) to ensure proper symbolic debugging. GDB allows stepping through code, setting breakpoints, and examining variables and memory. GDB supports advanced features such as reversible debugging (stepping backward), scripting via Python and graphical/text-based front-ends. For instructions on compiling code for debugging, setting watchpoints and remote debugging, refer to the GDB User Manual.
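
A minimal sketch of a typical session (my_code and my_variable are hypothetical names):

    # Build with debug information and without optimization
    gcc -g -O0 -fno-inline -o my_code my_code.c

    # Start the debugger, set a breakpoint, run and inspect state
    gdb ./my_code
    (gdb) break main
    (gdb) run
    (gdb) next
    (gdb) print my_variable
    (gdb) backtrace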

Intel Compiler

Intel historically provided its own graphical debugger (Intel Debugger, IDB) but it has been deprecated. On Linux systems, Intel compilers now support an extended version of GDB, with enhancements tuned for Fortran and parallel programming (MPI, OpenMP). Users should compile with -g and use gdb as with GCC, benefiting from Intel’s patches for better support of their compiler-generated code.

Clang / LLVM

Clang integrates seamlessly with LLVM's native debugger LLDB, offering source-level debugging with capabilities similar to GDB. However, most users on classic HPC clusters prefer using GDB, which provides consistent debugging across GCC, AOCC and Clang environments. Both GDB and LLDB support stepping, breakpoints, variable inspection and advanced features like remote debugging. The LLVM documentation encourages the use of -g and reduced optimization for effective debugging. While LLDB offers a smoother experience for C++ templates and modern language features, GDB's maturity and ecosystem support make it a preferred choice.
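
For reference, a minimal LLDB session looks like this (assuming a hypothetical my_code executable built with -g -O0):

    lldb ./my_code
    (lldb) breakpoint set --name main
    (lldb) run
    (lldb) next
    (lldb) frame variable
    (lldb) bt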

NVIDIA CUDA

For debugging CUDA applications, cuda-gdb extends GDB to handle GPU kernels and host-device transitions. This allows:

  • debugging of CPU and GPU code

  • breakpoints before and within kernels

  • inspection of GPU registers and memory

  • debugging in both source-level CUDA C/C++ and GPU assembly/PTX

cuda-gdb supports single- and multi-GPU debugging. Multiple GPU threads and kernels can be managed with commands like info cuda kernels and cuda kernel <n>. It also supports remote debugging over SSH or TCP using cuda-gdbserver.

A key requirement is to compile CUDA applications with debug flags (-g -G) to enable effective GPU debugging.
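
A minimal sketch of a cuda-gdb session (my_gpu_code_dbg and my_kernel are hypothetical names; the binary is assumed to be built with -g -G):

    cuda-gdb ./my_gpu_code_dbg
    (cuda-gdb) break my_kernel
    (cuda-gdb) run
    (cuda-gdb) info cuda kernels
    (cuda-gdb) continue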

Linaro DDT

Linaro DDT is a high-performance, scalable graphical debugger designed for parallel and distributed applications at scale. It supports debugging of applications written in C, C++ and Fortran, but also Python, and is fully compatible with MPI, OpenMP, CUDA and hybrid models. On Lucia, DDT is available through environment modules and is configured to run seamlessly with the installed MPI flavors and Slurm. To access the latest version of the debugger:

    user@frontal02:~ # module load devel/ArmForge
    user@frontal02:~ # module list

    Currently Loaded Modules:
      1) devel/ArmForge/23.0.1


    user@frontal02:~ #

Note

At the time this documentation was written, DDT is available on the cluster for debugging MPI and CPU-based applications. GPU debugging is not supported on the GPU partition because our site does not currently hold GPU debugging licenses for DDT. You can still use DDT to debug the host (CPU) side of CUDA-enabled applications, but for full device-level (GPU kernel) debugging, please use cuda-gdb instead.

Interactive usage

While connected to a login node with X-forwarding enabled, DDT debugging of my_code can be initiated as follows:

    user@frontal02:~ # module load devel/ArmForge
    user@frontal02:~ # ddt ./my_code

This command will initiate the graphical DDT interface and prompt the user to configure runtime options such as the number of processes and execution arguments.

For debugging parallel applications using MPI, DDT can be launched by prepending the application launch command (srun or mpirun) with ddt. For example, to run a parallel job with four MPI ranks using SLURM, the following command may be used:

    user@frontal02:~ # module load devel/ArmForge
    user@frontal02:~ # ddt srun -n 4 ./my_code

Batch usage

To launch DDT in a non-interactive session, the same kind of command is typically used. For example, to debug an application on two compute nodes with 8 MPI ranks per node, the following submission script can be used:

    #!/bin/bash
    #SBATCH --job-name=ddt_debug
    #SBATCH --output=%j_%x.out
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    #SBATCH --time=01:00:00
    #SBATCH --partition=debug
    #SBATCH --account=my_project_name

    module load devel/ArmForge

    ddt srun ./my_code

Command Line Interface (without GUI) usage

DDT can be run in non-GUI mode via command-line options that allow it to:

  • automatically run a program under debugger control

  • collect data such as memory errors, crashes, and stack traces

  • generate reports in XML or plain text

  • exit without launching the GUI.

This is done using the --offline, --output, --batch, or --connect flags, depending on the workflow.

| Option    | Description                                                    |
|-----------|----------------------------------------------------------------|
| --offline | runs DDT without opening the GUI                               |
| --output  | specifies the output file to save the diagnostic report        |
| --batch   | runs a scripted command file                                   |
| --connect | connects to a remote headless session (from ARM Forge client)  |

Examples

  • General syntax

        ddt --offline --output=report.html srun -n 4 ./my_code
    

  • Memory debugging

        ddt --offline --mem-debug --output=memory_report.html srun -n 8 ./my_code
    

  • Capturing a crash

        ddt --offline --output=crash_report.xml srun -n 4 ./my_code
    

  • Analyzing core file(s)

        ddt --core=core.12345 ./my_code
        # or, for MPI applications producing a core file per MPI rank:
        ddt ./my_mpi_code core.rank.*
    

Note

To enable stack overflow detection in memory debugging mode, use the --check-stack option, especially when working with large arrays or recursive functions.

Useful resources

📚 For the most up-to-date guidance, consult the Linaro Forge documentation, which contains a dedicated DDT section covering installation, workflow, script usage...

Official Linaro Forge Documentation (DDT)

Valgrind

Valgrind is a suite of simulation-based debugging and profiling tools for programs running on Linux clusters, aimed at detecting memory leaks and errors in parallel applications. The most popular of these tools is Memcheck, which can detect many memory-related errors and memory leaks.
Other supported tools include:
- Cachegrind - a profiler using the number of instructions executed,
- Callgrind - similar to Cachegrind but records the call history among functions,
- Helgrind - a pthreads error detector,
- DRD - also a pthreads error detector,
- Massif - a heap profiler,
- DHAT - a dynamic heap usage analysis tool.

Different versions of Valgrind are available in the EasyBuild software stacks:

| Module category | Module name                 |
|-----------------|-----------------------------|
| EasyBuild/2022a | Valgrind/3.19.0-gompi-2022a |
| EasyBuild/2023a | Valgrind/3.21.0-gompi-2023a |
| EasyBuild/2024a | Valgrind/3.24.0-gompi-2024a |

Using Valgrind

Use Valgrind as follows to inquire about potential memory problems:

    module load EasyBuild/202...
    module load Valgrind
    valgrind --tool=<tool-name> <valgrind-options> code <code-args>

Note

To use Valgrind, compile your application with the debug flag -g so that Memcheck's error messages include exact line numbers. Using -O0 is also a good idea but, if your code then becomes far too slow, -O1 is an acceptable alternative, although line numbers in error messages can be inaccurate. Use of -O2 and above is not recommended as Memcheck occasionally reports uninitialized-value errors which don't really exist.

For MPI codes, simply add valgrind in front of your command:

    module load EasyBuild/202...
    module load Valgrind
    srun -n ${nprocs} valgrind --tool=<tool-name> <valgrind-options> code <code-args>

Hint

It is possible to redirect Valgrind's output to a separate file for each MPI task using the --log-file=... option. In a Slurm submission script, this can be achieved as follows:

    module load EasyBuild/202...
    module load Valgrind
    srun -n ${nprocs} valgrind --tool=<tool-name> --log-file=vlg_%q{SLURM_JOB_ID}.%q{SLURM_PROCID}.out code <code-args>

Memcheck

Memcheck is the most famous (and also the default) tool of the Valgrind suite. It verifies the memory accesses of the code and can detect the use of uninitialized memory, out-of-bounds memory accesses, memory leaks, double frees, etc. It is advised to use the --leak-check=yes option together with Memcheck in order to get a detailed analysis of any memory leak.
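
For example (a minimal sketch; my_code is a hypothetical executable compiled with -g):

    module load EasyBuild/202...
    module load Valgrind
    valgrind --tool=memcheck --leak-check=yes ./my_code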

DRD / helgrind

DRD is Valgrind's tool to detect race conditions. It does not detect deadlocks but needs less memory than helgrind. helgrind detects both race conditions and deadlocks. The following example illustrates a data race on global data.

    #include <stdlib.h>
    #include <stdio.h>
    #include <omp.h>

    int global = 4711;

    int main (int argc, char * argv[]) {

        // The following is an ERROR: this should be a (thread-)local variable.
        #pragma omp parallel
        global = omp_get_thread_num();

        printf("global:%d\n", global);
        return EXIT_SUCCESS;
    }
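
The test can then be compiled and the number of OpenMP threads set as follows (a minimal sketch assuming GCC; the executable is named omp_error):

    gcc -fopenmp -g -O0 -o omp_error omp_error.c
    export OMP_NUM_THREADS=4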

Applying the helgrind tool to the resulting omp_error executable then detects the race condition:

   valgrind --tool=helgrind ./omp_error
   ==2383761== Helgrind, a thread error detector
   ==2383761== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
   ==2383761== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
   ==2383761== Command: ./omp_error
   ...
   ==2383761== ----------------------------------------------------------------
   ==2383761== 
   ==2383761== Possible data race during read of size 4 at 0x522C0D4 by thread #2
   ==2383761== Locks held: none
   ==2383761==    at 0x407CCAB: do_spin (wait.h:57)
   ==2383761==    by 0x407CCAB: do_wait (wait.h:66)
   ==2383761==    by 0x407CCAB: gomp_barrier_wait_end (bar.c:48)
   ==2383761==    by 0x407A3E7: gomp_simple_barrier_wait (simple-bar.h:60)
   ==2383761==    by 0x407A3E7: gomp_thread_start (team.c:133)
   ==2383761==    by 0x4048866: mythread_wrapper (hg_intercepts.c:406)
   ==2383761==    by 0x4A3A1C9: start_thread (in /usr/lib64/libpthread-2.28.so)
   ==2383761==    by 0x4C8B8D2: clone (in /usr/lib64/libc-2.28.so)
   ==2383761== 
   ==2383761== This conflicts with a previous write of size 4 by thread #1
   ==2383761== Locks held: none
   ==2383761==    at 0x407CD14: gomp_barrier_wait_end (bar.c:41)
   ==2383761==    by 0x407CD14: gomp_barrier_wait_end (bar.c:35)
   ==2383761==    by 0x407AE5B: gomp_simple_barrier_wait (simple-bar.h:60)
   ==2383761==    by 0x407AE5B: gomp_team_start (team.c:872)
   ==2383761==    by 0x40716E0: GOMP_parallel (parallel.c:176)
   ==2383761==    by 0x109085: main (omp_error.c:7)
   ==2383761==  Address 0x522c0d4 is 68 bytes inside a block of size 192 allocated
   ==2383761==    at 0x403E824: malloc (vg_replace_malloc.c:431)
   ==2383761==    by 0x406A6CC: gomp_malloc (alloc.c:38)
   ==2383761==    by 0x407A60B: gomp_get_thread_pool (pool.h:42)
   ==2383761==    by 0x407A60B: get_last_team (team.c:156)
   ==2383761==    by 0x407A60B: gomp_new_team (team.c:175)
   ==2383761==    by 0x40716C9: GOMP_parallel (parallel.c:176)
   ==2383761==    by 0x109085: main (omp_error.c:7)
   ==2383761==  Block was allocated by thread #1

Hint

If the following error occurs when trying to use Valgrind

   valgrind: mmap(0x400000, 4096) failed in UME with error 22 (Invalid argument).
   valgrind: this can be caused by executables with very large text, data or bss segments.

it is advised to recompile the application with the -fPIE -pie options. The error occurs because the non-PIE executable loads at a fixed address (0x400000), which conflicts with memory regions Valgrind needs for its internal operation. Compiling with -fPIE -pie makes the executable position-independent, allowing it to be loaded at a non-conflicting address, which avoids the error.