Modern C++ for High-Performance Computing: Concepts, Tools, and Optimization Strategies

Vedran Miletić and Henri Menke, HPC Application Support division, MPCDF

Max Planck Computing & Data Facility, Meet MPCDF, 12. June 2025


Meet MPCDF

From the announcement:

The series Meet MPCDF offers the opportunity for the users to informally interact with MPCDF staff, in order to discuss relevant kinds of technical topics.

Optionally, questions or requests for specific topics to be covered in more depth can be raised in advance via email to training@mpcdf.mpg.de.

Users seeking a basic introduction to MPCDF services are referred to our semi-annual online workshop Introduction to MPCDF services (April and October).


Core language optimizations (1/2)

Try to use a C++ standard as close as possible to the latest one. There are subtle performance improvements with every new version.

  • Guaranteed copy elision since C++17
  • More types are marked TriviallyCopyable
  • More functions and types are marked constexpr
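
As a minimal sketch of two of these points (the names Pinned, make_pinned, and sum_of_squares are made up for illustration), the following compiles only from C++17 onward: guaranteed copy elision lets a function return a type that can be neither copied nor moved, and a constexpr function can be checked at compile time.

```cpp
// A type that can be neither copied nor moved.
struct Pinned {
    explicit Pinned(int v) : value(v) {}
    Pinned(const Pinned&) = delete;
    Pinned(Pinned&&) = delete;
    int value;
};

// Valid since C++17: the returned prvalue initializes the caller's
// object directly, so no copy or move constructor is needed.
Pinned make_pinned(int v) { return Pinned(v); }

// A constexpr function can be evaluated by the compiler.
constexpr int sum_of_squares(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i * i;
    return total;
}

// Checked at compile time, so it costs nothing at runtime.
static_assert(sum_of_squares(3) == 14);
```

With a pre-C++17 standard selected, make_pinned is rejected because the deleted move constructor would formally be required.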

Core language optimizations (2/2)

New library features offer better performance and/or more safety. Some of them are available as third-party libraries for use with older C++ standards.
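
One possible illustration, chosen here as an assumption rather than taken from the slides, is std::string_view from C++17: it lets a function inspect text without copying or allocating. The helper name count_words is hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Counts space-separated words without copying any characters:
// std::string_view only refers to the caller's buffer.
std::size_t count_words(std::string_view text) {
    std::size_t count = 0;
    std::size_t pos = 0;
    // Jump to the start of the next word, if there is one.
    while ((pos = text.find_first_not_of(' ', pos)) != std::string_view::npos) {
        ++count;
        // Jump to the first space after that word.
        pos = text.find_first_of(' ', pos);
        if (pos == std::string_view::npos) break;
    }
    return count;
}
```

The same function accepts string literals, std::string objects, and substrings of larger buffers without creating temporaries.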


Memory management strategies

Avoid owning raw pointers; it is too easy to leak memory that way. Use RAII wrappers such as std::unique_ptr and std::shared_ptr instead.

PyObject *scipy = PyImport_ImportModule("scipy.sparse");
// do something with scipy
Py_XDECREF(scipy); // don't forget this!

With std::unique_ptr and a custom deleter, the reference is released automatically when the scope exits:

std::unique_ptr<PyObject, decltype(&Py_DecRef)> scipy{
    PyImport_ImportModule("scipy.sparse"),
    &Py_DecRef
};
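
The same pattern works for any C-style acquire/release API. The sketch below uses a made-up API (Handle, handle_acquire, handle_release are hypothetical stand-ins, not a real library) and a stateless deleter type instead of a function pointer, which on mainstream standard library implementations keeps the smart pointer the same size as a raw pointer.

```cpp
#include <memory>

// Hypothetical C-style resource API, standing in for a real library.
struct Handle { int id; };
int live_handles = 0;  // counts handles that have not been released yet
Handle* handle_acquire(int id) { ++live_handles; return new Handle{id}; }
void handle_release(Handle* h) { --live_handles; delete h; }

// Stateless deleter: calls the release function of the API.
struct HandleDeleter {
    void operator()(Handle* h) const noexcept { handle_release(h); }
};
using HandlePtr = std::unique_ptr<Handle, HandleDeleter>;

// Holds on common implementations because the deleter is an empty class.
static_assert(sizeof(HandlePtr) == sizeof(Handle*));

int use_handle(int id) {
    HandlePtr h{handle_acquire(id)};
    return h->id;  // handle_release runs automatically when h goes out of scope
}
```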

Formatting and I/O

C++20 brought std::format(), and C++23 brought std::print()/std::println(), which build on std::format(). Both are recommended over the std::printf() family. They ensure consistent behavior across different platforms, avoiding some quirks (e.g. rounding differences due to hardware architecture, compiler, or locale settings).

std::print("{2} {1}{0}!\n", 23, "C++", "Hello"); // prints "Hello C++23!"
std::string mytype = "Vector of integers";
std::vector<int> mydata = {1, 2, 3, 4, 5};
std::print("{}: {}\n", mytype, mydata); // prints "Vector of integers: [1, 2, 3, 4, 5]"

There is no need to write explicit format specifiers: argument types are deduced (and checked) at compile time, including C++ types from the standard library.

For older versions (C++11/14/17), the {fmt} library offers similar functionality.


Parallelization and acceleration approaches (1/2)

Since C++17, many algorithms accept an execution policy to run in parallel.

std::vector<double> v = { ... };
// Double every element in place; the fourth argument is the output iterator.
std::transform(std::execution::par, v.begin(), v.end(), v.begin(), [](double d) {
    return 2.0 * d;
});

Execution policies are not as flexible as OpenMP and probably not a good choice for parallelizing a complex code base (cf. a GCC bug), but they are an easy way to quickly speed up a small section of code.

Upcoming in C++26: std::simd, an abstraction of SIMD types for manual vectorization. Vectorizable types include standard integers, standard floats, and std::complex. Unfortunately, it is not yet implemented in most available compilers.


Parallelization and acceleration approaches (2/2)

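The reason why std::simd is necessary is aliasing: unless the compiler can prove that two pointers never refer to overlapping memory, it often cannot vectorize a loop automatically.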

In the std::valarray container the data pointer is marked __restrict.

On top of that std::valarray defines many mathematical operations and slicing.

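std::valarray<float> pos(3), velocity(3);
// ...
pos += dt * velocity;
std::valarray<float> matrix(n * n);
// ...
// Slicing a non-const valarray yields a std::slice_array, which has no sum();
// copy the slice into a valarray first.
auto trace = std::valarray<float>(matrix[std::slice(0, n, n + 1)]).sum();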
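
The two fragments can be wrapped into small functions to make them compilable. This is only a sketch: the function names euler_step and trace, and the assumption of a row-major n x n layout, are illustrative and not from the slides.

```cpp
#include <cstddef>
#include <valarray>

// One explicit Euler step: element-wise pos + dt * velocity, no manual loop.
std::valarray<float> euler_step(const std::valarray<float>& pos,
                                const std::valarray<float>& velocity,
                                float dt) {
    return pos + dt * velocity;
}

// Trace of an n x n matrix stored row-major in a flat valarray.
// The diagonal starts at index 0 and repeats every n + 1 elements.
float trace(const std::valarray<float>& matrix, std::size_t n) {
    // On a const valarray, slicing returns a value that supports sum().
    return matrix[std::slice(0, n, n + 1)].sum();
}
```

Note that dt has to be a float here; mixing a double scalar with std::valarray<float> does not compile because the operators are templates on a single element type.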

Profiling and benchmarking (1/2)

Google Perftools can be used for performance and memory (heap) profiling.

  1. Compile with optimization and debug flags: g++ -O3 -g bench.cpp -o bench
  2. Run with the profiler preloaded: LD_PRELOAD=path_to_libprofiler.so CPUPROFILE=bench.prof ./bench
  3. Analyze the profiling data, e.g. pprof --text bench bench.prof

    Total: 1190 samples                                                                    
         620  52.1%  52.1%      620  52.1% sum_of_squares
         570  47.9% 100.0%      570  47.9% sum_of_cubes
           0   0.0% 100.0%     1190 100.0% __libc_start_main
           0   0.0% 100.0%     1190 100.0% _start
           0   0.0% 100.0%     1190 100.0% main
           0   0.0% 100.0%      981  82.4% run_calculation (inline)
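
The slides do not show bench.cpp itself. A hypothetical sketch of functions that could sit behind the names in this profile (signatures and bodies are assumptions; a real benchmark would call run_calculation from main with a large n) might look like this:

```cpp
#include <cstdint>

// Sum of i^2 for i = 1..n, computed with a plain loop.
std::uint64_t sum_of_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) total += i * i;
    return total;
}

// Sum of i^3 for i = 1..n, computed with a plain loop.
std::uint64_t sum_of_cubes(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) total += i * i * i;
    return total;
}

// Small wrapper that the optimizer is likely to inline into its caller,
// which is why a profile may mark such a frame as "(inline)".
inline std::uint64_t run_calculation(std::uint64_t n) {
    return sum_of_squares(n) + sum_of_cubes(n);
}
```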
    

Commercial offerings: Intel VTune and AMD uProf


Profiling and benchmarking (2/2)

Google Benchmark is a framework to micro-benchmark small fragments of a larger program, similar to unit tests. This is useful for finding performance differences between various environments (hardware and software) and can also be used for tracking performance regressions over time via regular execution in CI pipelines.

#include <benchmark/benchmark.h>
static void BM_SomeFunction(benchmark::State& state) {
  // Perform setup here
  for (auto _ : state) {
    SomeFunction(); // hot code to get timed
  }
}
BENCHMARK(BM_SomeFunction);
BENCHMARK_MAIN();

Debugging: Undefined behavior (1/2)

Compiled with clang++ -O. Note that set_f() is never called.

#include <cstdlib>                     

static void (*f)();

void evil() {
    system("rm -rf /home");
}

void set_f() {
    f = &evil;
}

int main() {
    f();
}

Debugging: Undefined behavior (2/2)

evil():                               
        lea     rdi, [rip + .L.str]
        jmp     system@PLT

set_f():
        ret

main:
        push    rax
        lea     rdi, [rip + .L.str]
        call    system@PLT
        xor     eax, eax
        pop     rcx
        ret

.L.str:
        .asciz  "rm -rf /home"
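
In this listing, main calls system directly. Calling a null function pointer is undefined behavior, and &evil is the only value ever stored in the static f, so the optimizer may assume that f points to evil. A defensive sketch that makes the unset case explicit (Callback and call_if_set are hypothetical names, not from the slides):

```cpp
// Plain function pointer type, matching the shape of f in the example.
using Callback = void (*)();

// Calls the callback only if one has been installed.
// Returns false instead of invoking a null pointer.
bool call_if_set(Callback callback) {
    if (callback == nullptr) {
        return false;
    }
    callback();
    return true;
}
```

With an explicit null check, the call is well defined on every path, so the compiler has no undefined behavior to exploit.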

Sanitizers

A code sanitizer is a compiler feature that instruments the resulting program with extra checks at runtime.

The typical slowdown is 1.2x to 2x (vs. up to 20x for Valgrind).

Combining sanitizers is possible for some combinations but not recommended.


Example program

A simple example program that will almost always result in a segmentation fault.

int main() {
    *(char *)0 = 0;
}

To build with AddressSanitizer, -fsanitize=address must be specified both when compiling and when linking.

# compile test.c into object file test.o
gcc -fsanitize=address -c test.c
# link object file to binary executable
gcc -fsanitize=address test.o -o test

AddressSanitizer

AddressSanitizer aborts your program upon encountering any kind of memory error, e.g. buffer overflows, use after free, or invalid pointer dereference. The trace is usually very informative and can be enhanced by compiling with debugging information.

AddressSanitizer:DEADLYSIGNAL
=================================================================
==87586==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000004004c3 bp 0x7ffe066cc8d0 sp 0x7ffe066cc8d0 T0)
==87586==The signal is caused by a WRITE memory access.
==87586==Hint: address points to the zero page.
    #0 0x0000004004c3 in main (/tmp/test+0x4004c3) (BuildId: f53b20123d64112cf4015cd7005d985a2abfb52b)
    #1 0x7f4c270115f4 in __libc_start_call_main (/lib64/libc.so.6+0x35f4) (BuildId: 2b3c02fe7e4d3811767175b6f323692a10a4e116)
    #2 0x7f4c270116a7 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x36a7) (BuildId: 2b3c02fe7e4d3811767175b6f323692a10a4e116)
    #3 0x0000004003c4 in _start (/tmp/test+0x4003c4) (BuildId: f53b20123d64112cf4015cd7005d985a2abfb52b)

==87586==Register values:
rax = 0x0000000000000000  rbx = 0x0000000000000000  rcx = 0x0000000000000000  rdx = 0x0000000000000000  
rdi = 0x0000000000000000  rsi = 0x00007ffe066cc900  rbp = 0x00007ffe066cc8d0  rsp = 0x00007ffe066cc8d0  
 r8 = 0x00007f4c271f6680   r9 = 0x00007f4c271f8000  r10 = 0x0000000000000000  r11 = 0x00007f4c272f49d0  
r12 = 0x00007ffe066cc9f8  r13 = 0x0000000000000001  r14 = 0x00007f4c27950000  r15 = 0x0000000000402dd0  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/tmp/test+0x4004c3) (BuildId: f53b20123d64112cf4015cd7005d985a2abfb52b) in main
==87586==ABORTING
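
Beyond null dereferences, a classic case that AddressSanitizer catches is an off-by-one write into a heap buffer. The sketch below (sum_squares_buffered is a hypothetical name) shows the correct loop and marks in a comment where the typical mistake would be.

```cpp
#include <cstddef>
#include <memory>

// Sums the first n squares using a heap-allocated scratch buffer.
long sum_squares_buffered(std::size_t n) {
    auto buffer = std::make_unique<long[]>(n);
    // Writing `i <= n` here would store one element past the end of the
    // buffer; AddressSanitizer reports that as a heap-buffer-overflow.
    for (std::size_t i = 0; i < n; ++i) {
        buffer[i] = static_cast<long>((i + 1) * (i + 1));
    }
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += buffer[i];
    }
    return total;
}
```

Without a sanitizer, the off-by-one variant often appears to work, which is what makes this class of bug hard to find by testing alone.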

Best way to opt-in to sanitizers

Add the sanitizer of choice to the compiler flags

export CFLAGS="-fsanitize=<name> ..."
export CXXFLAGS="-fsanitize=<name> ..."
export FFLAGS="-fsanitize=<name> ..."
export FCFLAGS="-fsanitize=<name> ..."

Don't forget to also add it to your linker flags

export LDFLAGS="-fsanitize=<name> ..."

Run your build system, e.g. cmake, ./configure, etc.

Configure sanitizers at runtime via environment variables such as ASAN_OPTIONS.


Frame pointers (1/2)

When compiling with optimizations, compilers will omit the frame pointer setup for functions where it is not necessary.

Unfortunately, this hampers debuggability: without further debugging information, a debugger can no longer walk the call stack to work out where the program currently is.


Frame pointers (2/2)

Without -fno-omit-frame-pointer

sum(double*, long):
        ; ...

With -fno-omit-frame-pointer

sum(double*, long):
        push    rbp
        mov     rbp, rsp
        ; ...
        pop     rbp
        ret

Debugging information

Adding debugging information to an application has zero runtime overhead!

Debuginfo flags -g<n> and optimization flags -O<n> are orthogonal!

The additional .debug sections of the binary will only be paged in when the program is actively being debugged.

However, there is a considerable size overhead for the binary (template-heavy C++ code can see an increase of up to 100x).

Debugging optimized builds is challenging, but better than nothing in case of a crash.


C++ standard library debug mode

GNU libstdc++ offers a debug mode (LLVM libc++ provides a comparable hardening mode) that adds checks for standard iterators, containers, and algorithms.

  • Algorithm preconditions are validated on the input parameters. Safe iterators keep track of the container whose elements they reference.

  • Lightweight debug mode

    libstdc++: -D_GLIBCXX_ASSERTIONS=1
    libc++:    -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_EXTENSIVE

  • Full debug mode (ABI breaking in libstdc++)

    libstdc++: -D_GLIBCXX_DEBUG=1 -D_GLIBCXX_DEBUG_PEDANTIC=1
    libc++:    -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_DEBUG
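
An example of an algorithm precondition: std::binary_search requires its input range to be sorted (more precisely, partitioned with respect to the searched value). The sketch below (contains is a hypothetical helper) satisfies the precondition by sorting a copy first. Calling std::binary_search on unsorted data is the kind of violation that the full libstdc++ debug mode should report at runtime instead of silently returning a wrong answer.

```cpp
#include <algorithm>
#include <vector>

// Returns true if needle occurs in values.
// values is taken by value so that sorting does not modify the caller's data.
bool contains(std::vector<int> values, int needle) {
    // Establish the precondition of std::binary_search.
    std::sort(values.begin(), values.end());
    return std::binary_search(values.begin(), values.end(), needle);
}
```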
    

Static analysis

Detecting problems at compile time, beyond what compiler warnings cover.

clang-tidy is a clang-based C++ “linter” tool. Its purpose is to provide an extensible framework for diagnosing and fixing typical programming errors, like style violations, interface misuse, or bugs that can be deduced via static analysis.

find_program(CLANG_TIDY_EXE "clang-tidy" REQUIRED)

set_target_properties(my_C_tgt PROPERTIES C_CLANG_TIDY "${CLANG_TIDY_EXE};-checks=bugprone-*")

set_target_properties(my_CXX_tgt PROPERTIES CXX_CLANG_TIDY "${CLANG_TIDY_EXE};-checks=bugprone-*")
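
As an illustration of what the bugprone-* checks look for, the sketch below (store_and_measure is a hypothetical function) shows the correct ordering around std::move, with a comment marking the variant that the bugprone-use-after-move check is designed to flag.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Appends name to out and returns the length of the stored string.
std::size_t store_and_measure(std::vector<std::string>& out, std::string name) {
    // Read everything needed from name before moving from it.
    const std::size_t length = name.size();
    out.push_back(std::move(name));
    // Calling name.size() at this point instead would be a use after move:
    // the string is in a valid but unspecified state.
    return length;
}
```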

Summary

  • Performance Optimization: Use latest C++ standards for subtle improvements.
  • Memory Management: Leverage smart pointers where possible.
  • Profiling & Benchmarking: Use Google Perftools and Google Benchmark for performance analysis.
  • Sanitizers & Debugging: Ensure robust software with tools like AddressSanitizer and clang-tidy.

Authors: Vedran Miletić and Henri Menke