Compilers and debuggers
Compilers
An overview of all available compilers on Lucia is presented below. Various versions of compilers for the C, C++ and Fortran languages can be used to compile and run scientific codes. Each compiler is available through its own module.
GNU compilers
The basic GNU compiler collection provided by the operating system is GCC 8.5.0. gcc, g++ and gfortran are available by default in every user environment. As this compiler is somewhat outdated, newer versions are provided through modules.
Module category | Module name | Provides... |
---|---|---|
Independent | compilers/gcc/10.2.0 | GCC v.10.2.0 |
EasyBuild/2022a | GCC/11.3.0 | GCC v.11.3.0 |
EasyBuild/2023a | GCC/12.3.0 | GCC v.12.3.0 |
EasyBuild/2024a | GCC/13.3.0 | GCC v.13.3.0 |
Note
The EasyBuild GCC modules presented above provide not only the GCC compiler but also the associated compile-time and runtime libraries (such as libstdc++, libquadmath, etc.).
Warning
Note that older versions of GCC are neither installed nor supported.
Some hints for compiling old legacy codes with newer GCC versions (an illustrative compile line is given after this list):
- Fortran: try `-fallow-argument-mismatch` first, followed by the more extensive flag `-std=legacy` to reduce strictness
- C/C++: look for flags to reduce strictness - such as `-fpermissive`
- C/C++: `-Wpedantic` can warn about lines that break code standards
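For example, a possible gfortran invocation for an old fixed-form Fortran source (file names are placeholders, not an actual code on Lucia):

```bash
# Illustrative: relax strict checks when building legacy Fortran code
gfortran -std=legacy -fallow-argument-mismatch -O2 -o my_legacy_code my_legacy_code.f
```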
📚 See the full documentation of the GCC compilers. Additionally, compiler documentation is provided through man pages (e.g., `man g++`) and through the `--help` flag of each compiler (for example, `gfortran --help`).
AMD compilers
The AMD compiler suite is named AOCC (AMD Optimizing C/C++ Compiler). It is based on LLVM and includes many optimizations for AMD processors. It supports Flang as the Fortran front-end compiler.
The AOCC compilers are named as follows:
- C compiler: `clang`
- C++ compiler: `clang++`
- Fortran compiler: `flang`
Different versions of the AOCC compilers are provided via the EasyBuild modules:
Module category | Module name | Provides... |
---|---|---|
EasyBuild/2022a | AOCC/3.2.0-GCCcore-11.3.0 | AOCC v.3.2.0 + GCC 11.3.0 compile and runtime libraries |
EasyBuild/2022a | AOCC/4.0.0-GCCcore-11.3.0 | AOCC v.4.0.0 + GCC 11.3.0 compile and runtime libraries |
EasyBuild/2023a | AOCC/4.0.0-GCCcore-12.3.0 | AOCC v.4.0.0 + GCC 12.3.0 compile and runtime libraries |
EasyBuild/2024a | AOCC/4.2.0-GCCcore-13.3.0 | AOCC v.4.2.0 + GCC 13.3.0 compile and runtime libraries |
The table below shows a quick comparison of these different versions:
Feature | AOCC 3.2.0 | AOCC 4.0.0 | AOCC 4.2.0 |
---|---|---|---|
Base LLVM version | LLVM 13 | LLVM 14 | LLVM 15 |
Primary Zen architecture support | Zen2, Zen3 | Zen2, Zen3 | Zen2, Zen3, Zen4 |
Auto-vectorization | Basic | Enhanced | Advanced |
Recommended for | Legacy code | General use | Best performance |
Tip
It is advised to use the option `-march=znver3` on Lucia, as it optimizes the code for the AMD EPYC Zen 3 architecture.
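A minimal illustration with the AOCC C driver (source and output names are placeholders); the same flag is also accepted by `clang++`:

```bash
# Illustrative: optimize for the Zen 3 cores of Lucia's compute nodes
clang -O3 -march=znver3 -o my_code my_code.c
```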
📚 Useful documentation
Intel compilers
Intel compilers and tools are not supported on Lucia (although different versions are available). There have been (and still are) limitations when using Intel software on non-Intel platforms (especially AMD processors). Moreover, AMD EPYC processors are not officially supported at the software level by Intel. See Intel® HPC Toolkit System Requirements for details.
On Lucia, we also observed that several codes did not work properly with the Intel compiler / IntelMPI combination, exhibiting untimely message passing blocking or I/O problems in particular. Intel's support only provided the following kind of answer: "Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include those that are part of the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others which can be found here – Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, Intel® oneAPI IoT Toolkit System Requirements"
Although not supported, the Intel compilers are nevertheless made available to avoid duplicate local installations in users' space. The table below lists the different Intel compilers available on Lucia:
Module category | Module name |
---|---|
EasyBuild/2022a | intel-compilers/2022.1.0 |
EasyBuild/2023a | intel-compilers/2023.1.0 |
EasyBuild/2024a | intel-compilers/2024.2.0 |
Note
Traditional Intel tools (like Intel MPI library or Intel MKL library) are also available in the same module categories.
Warning
We are aware of some "tricks" available on the Internet (even in official documentation of some HPC systems) to bypass the CPU-type check in order to obtain the best performance of some Intel tools on non-genuine Intel processors. We strongly discourage users from applying these tricks, and no support will be provided in this case.
Cray compilers
Cray compilers are provided through the Cray Programming Environment. As their usage is highly specific, a dedicated section of the documentation is devoted to the CPE. Please check the Cray Programming Environment section for detailed information.
Clang/LLVM
LLVM is a collection of compiler and toolchain tools and libraries which includes its own Clang compiler (clang, clang++). The LLVM Core libraries, along with the compilers, are built locally and compiled against the GCC compiler suite. The LLVM Core libraries provide a modern source- and target-independent optimizer, along with code generation support for many popular CPUs. These libraries are built around a well-specified code representation known as the LLVM intermediate representation ("LLVM IR"). The LLVM Core libraries are well documented, and it is particularly easy to invent your own language (or port an existing compiler) to use LLVM as an optimizer and code generator.
It is important to use the appropriate compiler driver for the programming language in use: `clang` is intended for compiling C code, while `clang++` should be used for C++ projects. Unlike the GNU toolchain, Clang does not automatically link the C++ standard libraries when the C driver is used to compile C++ code, so using the correct driver avoids linking issues and runtime errors. Clang provides fine-grained control over target CPU tuning (option `-march=`), allowing users to specify the exact architecture features of the processor. This not only improves execution speed but also ensures better utilization of the available CPU instructions.
Clang's diagnostic engine is one of its strengths, offering clear and informative error messages. For debugging, Clang supports a suite of runtime sanitizers that help detect memory leaks, undefined behavior and concurrency issues. Options like `-Wall -Wextra -Werror -g -fsanitize=address` can help identify potential bugs or non-portable code constructs. Clang also includes a static analysis tool, `scan-build`, that can be used to perform a deeper examination of the source code without executing it. This is especially helpful for identifying subtle bugs, memory mismanagement or edge cases that might not be caught during typical testing.
For performance optimization, Link Time Optimization (LTO, option `-flto`) is supported and can provide significant improvements by allowing the compiler to analyze the entire program during the linking phase.
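As a brief illustration of how these flags can be combined (source and output file names are placeholders):

```bash
# Illustrative debug build with warnings and AddressSanitizer enabled
clang -Wall -Wextra -g -O1 -fsanitize=address -o my_code my_code.c

# Static analysis without executing the program
scan-build clang -Wall -c my_code.c

# Link Time Optimization: pass -flto at both compile and link steps
clang -O3 -flto -c my_code.c
clang -O3 -flto -o my_code my_code.o
```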
LLVM/Clang are available in the EasyBuild software stacks:
Module category | Module name |
---|---|
EasyBuild/2023a | LLVM/16.0.6-GCCcore-12.3.0 |
EasyBuild/2024a | LLVM/18.1.8-GCCcore-13.3.0 |
EasyBuild/2024a | Clang/18.1.8-GCCcore-13.3.0 |
📚 For documentation of the LLVM compilers, see LLVM, Clang, and Flang websites.
Note
When mixing Clang with components built using other compilers (such as GCC), compatibility between C++ standard libraries should be considered. Inconsistent usage can lead to ABI (Application Binary Interface) incompatibilities.
NVCC
The NVIDIA CUDA Compiler (`nvcc`) is used to compile applications written in CUDA C and C++ for execution on NVIDIA GPUs. `nvcc` acts as a wrapper around both the host and device compilation processes, generating binaries that include CPU code compiled by a host compiler (typically `gcc`, `g++` or `clang++`) and GPU code compiled into PTX or SASS for NVIDIA GPUs.
When working with `nvcc`, source code should be structured to clearly separate host and device components, making use of the `__host__`, `__device__` and `__global__` function qualifiers to control where code executes. `nvcc` handles this distinction automatically, but writing code with a clear separation of concerns improves portability and reduces confusion during debugging. The target GPU architecture should be explicitly specified using the `-arch` flag (e.g. `-arch=sm_80` for NVIDIA A100 GPUs) to ensure that the compiled device code is optimized for the available hardware. This avoids potential mismatches and enables the use of hardware-specific features.
Code correctness can be improved by enabling all relevant compiler warnings. Compiler diagnostics can catch common issues such as type mismatches, unused variables or illegal memory access patterns. Code should be compiled with debug symbols (`-g` for host code, `-G` for device code) and a reduced optimization level (`-O0`) to allow source-level debugging with tools such as cuda-gdb.
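For illustration (the source file name is a placeholder; adjust `-arch` to the GPUs actually used):

```bash
# Illustrative optimized build targeting A100 GPUs (sm_80)
nvcc -O3 -arch=sm_80 -o my_code my_code.cu

# Illustrative debug build: host (-g) and device (-G) debug symbols, no optimization
nvcc -g -G -O0 -arch=sm_80 -o my_code_dbg my_code.cu
```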
`nvcc` provides flags for controlling optimization, inlining, loop unrolling and other low-level behaviors. However, tuning should also take into account occupancy, shared memory usage and register pressure. Tools such as Nsight Compute and the CUDA Occupancy Calculator are essential for understanding and improving the performance of GPU kernels. Users are encouraged to profile their applications thoroughly to identify performance bottlenecks.
Note
The CUDA compiler relies on an external host compiler to build CPU code. On RHEL 8.10, nvcc is compatible with specific versions of gcc, g++, and clang++. Users must ensure that the selected host compiler version is officially supported by the CUDA Toolkit in use. Mismatches can cause compilation errors or undefined behavior.
CUDA/nvcc are available in the EasyBuild software stacks:
Module category | Module name |
---|---|
EasyBuild/2022a | CUDA/11.7.0 |
EasyBuild/2023a | CUDA/12.2.0 |
EasyBuild/2024a | CUDA/12.6.0 |
📚 For detailed documentation, refer to:
Debuggers
Each of the compilers presented above provides its own debugger:
Compiler | Debugger | Information |
---|---|---|
GCC/AOCC | GDB | Compile with -g; prefer -O0 + -fno-inline for full debug info |
Intel | GDB (patched) | Intel-enhanced GDB. Compile with -g |
Clang | LLDB or GDB | LLDB native - GDB common - same compile flags as GCC |
CUDA (nvcc) | cuda-gdb | Requires -g -G - debug both host and GPU portions |
GCC / AMD AOCC
The primary debugger for GCC and AMD's AOCC compiler suite is GDB (the GNU Debugger). It supports source-level debugging for languages including C, C++, Fortran and more. Users should compile with `-g` and preferably disable optimizations (`-O0`, `-fno-inline`) to ensure proper symbolic debugging. GDB allows stepping through code, setting breakpoints, and examining variables and memory. It supports advanced features such as reversible debugging (stepping backward), scripting via Python and graphical/text-based front-ends.
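A minimal example session (file and variable names are placeholders):

```bash
# Build with full debug information, then run under GDB
gcc -g -O0 -fno-inline -o my_code my_code.c
gdb ./my_code
# Typical commands at the (gdb) prompt:
#   break main       # set a breakpoint
#   run              # start the program
#   next / step      # step through the code
#   print my_var     # inspect a variable
#   backtrace        # show the call stack
```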
For instructions on compiling code for debugging, setting watchpoints and remote debugging, refer to the GDB User Manual.
Intel Compiler
Intel historically provided its own graphical debugger (Intel Debugger, IDB), but it has been deprecated. On Linux systems, Intel compilers now support an extended version of GDB, with enhancements tuned for Fortran and parallel programming (MPI, OpenMP). Users should compile with `-g` and use `gdb` as with GCC, benefiting from Intel's patches for better support of their compiler-generated code.
Clang / LLVM
Clang integrates seamlessly with LLVM's native debugger, LLDB, offering source-level debugging with capabilities similar to GDB's. However, most users on classic HPC clusters prefer GDB, which provides consistent debugging across GCC, AOCC and Clang environments. Both GDB and LLDB support stepping, breakpoints, variable inspection and advanced features like remote debugging. The LLVM documentation encourages the use of `-g` and reduced optimization for effective debugging. While LLDB offers a smoother experience for C++ templates and modern language features, GDB's maturity and ecosystem support make it a preferred choice.
NVIDIA CUDA
For debugging CUDA applications, `cuda-gdb` extends GDB to handle GPU kernels and host-device transitions. This allows:
- debugging of CPU and GPU code
- breakpoints before and within kernels
- inspection of GPU registers and memory
- debugging in both source-level CUDA C/C++ and GPU assembly/PTX
`cuda-gdb` supports single- and multi-GPU debugging. Multiple GPU threads and kernels can be managed with commands like `info cuda kernels` and `cuda kernel <n>`. It also supports remote debugging over SSH or TCP using cuda-gdbserver.
A key requirement is to compile CUDA applications with the debug flags (`-g -G`) to enable effective GPU debugging.
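A minimal example (executable and kernel names are placeholders):

```bash
# Debug build, then run under cuda-gdb
nvcc -g -G -O0 -arch=sm_80 -o my_code my_code.cu
cuda-gdb ./my_code
# Typical commands at the (cuda-gdb) prompt:
#   break my_kernel      # break on a (hypothetical) kernel name
#   run
#   info cuda kernels    # list active kernels
#   cuda kernel 0        # switch focus to kernel 0
```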
Linaro DDT
Linaro DDT is a high-performance, scalable graphical debugger designed for parallel and distributed applications at scale. It supports debugging of applications written in C, C++ and Fortran, as well as Python, and is fully compatible with MPI, OpenMP, CUDA and hybrid models. On Lucia, DDT is available through environment modules and is configured to run seamlessly with the installed MPI flavors and Slurm. To access the latest version of the debugger:
user@frontal02:~ # module load devel/ArmForge
user@frontal02:~ # module list
Currently Loaded Modules:
1) devel/ArmForge/23.0.1
user@frontal02:~ #
Note
At the time this documentation was written, DDT is available on the cluster for debugging MPI and CPU-based applications. GPU debugging is not supported on the GPU partition because our site does not currently hold GPU debugging licenses for DDT. You can still use DDT to debug the host (CPU) side of CUDA-enabled applications, but for full device-level (GPU kernel) debugging, please use `cuda-gdb` instead.
Interactive usage
While connected to a login node with X-forwarding enabled, DDT debugging of `my_code` can be initiated with a command of the following form:
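```bash
# Illustrative: launch the DDT graphical interface on the executable
ddt ./my_code
```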
This command will initiate the graphical DDT interface and prompt the user to configure runtime options such as the number of processes and execution arguments.
For debugging parallel applications using MPI, DDT can be launched by prepending the application launch command (`srun` or `mpirun`) with `ddt`. For example, to run a parallel job with four MPI ranks using Slurm, a command such as the following may be used:
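```bash
# Illustrative: four MPI ranks under Slurm, run under DDT control
ddt srun -n 4 ./my_code
```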
Batch usage
To launch DDT in a non-interactive session, the same kind of command is typically used. For example, to debug an application on two compute nodes with 8 MPI ranks per node, the following submission script can be used:
#!/bin/bash
#SBATCH --job-name=ddt_debug
#SBATCH --output=%j_%x.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=debug
#SBATCH --account=my_project_name
module load devel/ArmForge
ddt srun ./my_code
Command Line Interface (without GUI) usage
DDT can be run in non-GUI mode via command-line options that allow it to:
- automatically run a program under debugger control
- collect data such as memory errors, crashes, and stack traces
- generate reports in XML or plain text
- exit without launching the GUI.
This is done using the `--offline`, `--output`, `--batch`, or `--connect` flags, depending on the workflow.
Option | Description |
---|---|
--offline | runs DDT without opening the GUI |
--output | specifies the output file to save the diagnostic report |
--batch | runs a scripted command file |
--connect | connects to a remote headless session (from ARM Forge client) |
Examples
- General syntax (a sketch of the offline invocation is given after this list)
- Memory debugging
- Capturing a crash
- Analyzing core file(s)
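As a minimal sketch of the general syntax, the offline flags from the table above can be combined with the usual launch command (executable name and rank count are illustrative). The specific options for memory debugging, crash capture and core-file analysis are detailed in the Linaro Forge documentation:

```bash
# Illustrative offline (non-GUI) run producing a report file
ddt --offline --output=ddt_report.txt srun -n 4 ./my_code
```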
Note
To enable stack overflow detection in memory debugging mode, use the option `--check-stack`, especially when working with large arrays or recursive functions.
Useful resources
📚 For the most up-to-date guidance, consult the Linaro Forge documentation, which contains a dedicated DDT section covering installation, workflow, script usage...
Official Linaro Forge Documentation (DDT)
Valgrind
Valgrind is a suite of simulation-based debugging and profiling tools for programs running on Linux clusters, aimed at detecting memory leaks and errors in parallel applications. The most popular of these tools is Memcheck, which can detect many memory-related errors and memory leaks.
Other supported tools include:
- Cachegrind - a profiler using the number of instructions executed,
- Callgrind - similar to Cachegrind but records the call history among functions,
- Helgrind - a pthreads error detector,
- DRD - also a pthreads error detector,
- Massif - a heap profiler,
- DHAT - a dynamic heap usage analysis tool.
Different versions of Valgrind are available in the EasyBuild software stacks:
Module category | Module name |
---|---|
EasyBuild/2022a | Valgrind/3.19.0-gompi-2022a |
EasyBuild/2023a | Valgrind/3.21.0-gompi-2023a |
EasyBuild/2024a | Valgrind/3.24.0-gompi-2024a |
Using Valgrind
Use Valgrind as follows to investigate potential memory problems:
module load EasyBuild/202...
module load Valgrind
valgrind --tool=<tool-name> <valgrind-options> code <code-args>
Note
To use Valgrind, compile your application with the debug flag `-g` so that Memcheck's error messages include exact line numbers. Using `-O0` is also a good idea but, if your code becomes far too slow, `-O1` is an acceptable alternative, although line numbers in error messages can then be inaccurate. Use of `-O2` and above is not recommended, as Memcheck occasionally reports uninitialized-value errors which don't really exist.
For MPI codes, simply add `valgrind` in front of your command:
module load EasyBuild/202...
module load Valgrind
srun -n ${nprocs} valgrind --tool=<tool-name> <valgrind-options> code <code-args>
Hint
It is possible to redirect Valgrind's output to a separate file for each MPI task using the `--log-file=...` option. In a Slurm submission script, this can be achieved with a command of the following form (here with Memcheck; Valgrind expands `%q{SLURM_PROCID}` to the value of the SLURM_PROCID environment variable, i.e. the MPI rank):
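```bash
# One Memcheck log file per MPI rank (executable name is illustrative)
srun -n ${nprocs} valgrind --tool=memcheck --log-file=valgrind_rank_%q{SLURM_PROCID}.log ./my_code
```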
Memcheck
Memcheck is the most famous (and also the default) tool of the Valgrind suite. It verifies the memory accesses of the code and can detect use of uninitialized memory, out-of-bounds memory access, memory leaks, double frees, etc.
It is advised to use the `--leak-check=yes` option together with Memcheck in order to get a detailed analysis of memory leaks.
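For example (executable name is illustrative):

```bash
valgrind --tool=memcheck --leak-check=yes ./my_code
```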
DRD / helgrind
DRD is Valgrind's tool for detecting race conditions. It does not detect deadlocks, but it needs less memory than Helgrind; Helgrind detects both race conditions and deadlocks. The following example illustrates a data race on global data.
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int global = 4711;
int main (int argc, char * argv[]) {
// The following is an ERROR: This should be a (thread-)local variable.
#pragma omp parallel
global = omp_get_thread_num();
printf("global:%d\n", global);
return EXIT_SUCCESS;
}
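For reference, one possible way to build this test and set the number of threads (compiler choice and thread count are illustrative):

```bash
gcc -fopenmp -g -o omp_error omp_error.c
export OMP_NUM_THREADS=4
```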
Once this test is compiled (with the executable named omp_error) and the requested number of OpenMP threads is set, applying the Helgrind tool detects the race condition:
valgrind --tool=helgrind ./omp_error
==2383761== Helgrind, a thread error detector
==2383761== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==2383761== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==2383761== Command: ./omp_error
...
==2383761== ----------------------------------------------------------------
==2383761==
==2383761== Possible data race during read of size 4 at 0x522C0D4 by thread #2
==2383761== Locks held: none
==2383761== at 0x407CCAB: do_spin (wait.h:57)
==2383761== by 0x407CCAB: do_wait (wait.h:66)
==2383761== by 0x407CCAB: gomp_barrier_wait_end (bar.c:48)
==2383761== by 0x407A3E7: gomp_simple_barrier_wait (simple-bar.h:60)
==2383761== by 0x407A3E7: gomp_thread_start (team.c:133)
==2383761== by 0x4048866: mythread_wrapper (hg_intercepts.c:406)
==2383761== by 0x4A3A1C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==2383761== by 0x4C8B8D2: clone (in /usr/lib64/libc-2.28.so)
==2383761==
==2383761== This conflicts with a previous write of size 4 by thread #1
==2383761== Locks held: none
==2383761== at 0x407CD14: gomp_barrier_wait_end (bar.c:41)
==2383761== by 0x407CD14: gomp_barrier_wait_end (bar.c:35)
==2383761== by 0x407AE5B: gomp_simple_barrier_wait (simple-bar.h:60)
==2383761== by 0x407AE5B: gomp_team_start (team.c:872)
==2383761== by 0x40716E0: GOMP_parallel (parallel.c:176)
==2383761== by 0x109085: main (omp_error.c:7)
==2383761== Address 0x522c0d4 is 68 bytes inside a block of size 192 allocated
==2383761== at 0x403E824: malloc (vg_replace_malloc.c:431)
==2383761== by 0x406A6CC: gomp_malloc (alloc.c:38)
==2383761== by 0x407A60B: gomp_get_thread_pool (pool.h:42)
==2383761== by 0x407A60B: get_last_team (team.c:156)
==2383761== by 0x407A60B: gomp_new_team (team.c:175)
==2383761== by 0x40716C9: GOMP_parallel (parallel.c:176)
==2383761== by 0x109085: main (omp_error.c:7)
==2383761== Block was allocated by thread #1
Hint
If the following error occurs when trying to use Valgrind:
valgrind: mmap(0x400000, 4096) failed in UME with error 22 (Invalid argument).
valgrind: this can be caused by executables with very large text, data or bss segments.
a workaround is to recompile the application with the `-fPIE -pie` options. The error occurs because a non-PIE executable loads at a fixed address (0x400000), which conflicts with memory regions Valgrind needs for its internal operation. Compiling with `-fPIE -pie` makes the executable position-independent, allowing it to be loaded at a non-conflicting address, which avoids the error.