Modern C++ for High-Performance Computing: Concepts, Tools, and Optimization Strategies
Vedran Miletic and Henri Menke, HPC Application Support division, MPCDF
Max Planck Computing & Data Facility, Meet MPCDF, 12. June 2025
Meet MPCDF
From the announcement:
The series Meet MPCDF offers the opportunity for the users to informally interact with MPCDF staff, in order to discuss relevant kinds of technical topics.
Optionally, questions or requests for specific topics to be covered in more depth can be raised in advance via email to training@mpcdf.mpg.de.
Users seeking a basic introduction to MPCDF services are referred to our semi-annual online workshop Introduction to MPCDF services (April and October).
Core language optimizations (1/2)
Try to use a C++ standard as close as possible to the latest one. There are subtle performance improvements with every new version.
- Guaranteed copy elision since C++17
- More types are marked TriviallyCopyable
- More functions and types are marked constexpr
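As a small illustration of the first and last bullet, here is a minimal sketch. The names Grid, make_grid, and factorial are hypothetical and not taken from the talk. Returning a type that can be neither copied nor moved compiles only with C++17 or later, and the constexpr function is evaluated by the compiler.

```cpp
#include <array>

// Hypothetical type that can be neither copied nor moved.
struct Grid {
    std::array<double, 4> cells{};
    explicit Grid(double fill) { cells.fill(fill); }
    Grid(const Grid&) = delete;
    Grid(Grid&&) = delete;
};

// Returning a prvalue of a non-movable type relies on guaranteed
// copy elision; before C++17 this function would not compile.
Grid make_grid(double fill) { return Grid{fill}; }

// constexpr lets the compiler evaluate the call during compilation.
constexpr int factorial(int n) { return n <= 1 ? 1 : n * factorial(n - 1); }
static_assert(factorial(5) == 120, "evaluated at compile time");
```

A caller can write `Grid g = make_grid(1.5);` and the object is constructed directly in `g`, without any copy or move.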
Core language optimizations (2/2)
New library features offer better performance and/or more safety. Some of them are available as third-party libraries for use with older C++ standards.
- std::print in lieu of << or printf since C++23 (available in the {fmt} library)
- Filesystem library since C++17 (available in the boost.filesystem library)
- std::string_view as a safer wrapper for const char * since C++17 (available in the boost.utility library)
- Counter example: Ranges library. Looks nice on paper but has poor computational complexity.
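To make the std::string_view bullet concrete, here is a minimal sketch with a hypothetical helper, count_tokens, which is not from the talk. It accepts string literals, std::string, or a pointer/length pair and scans the text without copying or allocating.

```cpp
#include <cstddef>
#include <string_view>

// Count whitespace-separated tokens. The view is passed by value
// (it is just a pointer and a length) and shrunk as we scan.
std::size_t count_tokens(std::string_view text) {
    constexpr std::string_view whitespace = " \t\n";
    std::size_t count = 0;
    while (true) {
        const std::size_t start = text.find_first_not_of(whitespace);
        if (start == std::string_view::npos) break;  // only whitespace left
        text.remove_prefix(start);                   // skip to the token
        ++count;
        const std::size_t end = text.find_first_of(whitespace);
        if (end == std::string_view::npos) break;    // token runs to the end
        text.remove_prefix(end);                     // skip past the token
    }
    return count;
}
```

Unlike a bare const char *, the view carries its length, so no strlen call or null terminator is needed.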
Memory management strategies
Avoid owning raw pointers; it is too easy to leak memory that way. Use RAII wrappers such as std::unique_ptr and std::shared_ptr instead.
PyObject *scipy = PyImport_ImportModule("scipy.sparse");
// do something with scipy
Py_XDECREF(scipy); // don't forget this!
Automatically clean up on exiting the scope
std::unique_ptr<PyObject, decltype(&Py_DecRef)> scipy{
PyImport_ImportModule("scipy.sparse"),
&Py_DecRef
};
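The same pattern can be tried without the Python headers. The sketch below assumes a hypothetical C-style API (Handle, handle_create, handle_destroy) as a stand-in. It uses a stateless functor as the deleter, which, unlike a function pointer, typically adds no size to the std::unique_ptr.

```cpp
#include <cstdlib>
#include <memory>

// Hypothetical C-style API standing in for a library like the Python C API.
struct Handle { int value; };
static int live_handles = 0;  // tracks outstanding handles, for demonstration

Handle* handle_create(int value) {
    auto* h = static_cast<Handle*>(std::malloc(sizeof(Handle)));
    if (h) { h->value = value; ++live_handles; }
    return h;
}
void handle_destroy(Handle* h) { --live_handles; std::free(h); }

// Stateless deleter; std::unique_ptr never calls it for a null pointer.
struct HandleDeleter {
    void operator()(Handle* h) const noexcept { handle_destroy(h); }
};
using HandlePtr = std::unique_ptr<Handle, HandleDeleter>;

int doubled_value(int value) {
    HandlePtr h{handle_create(value)};
    if (!h) return 0;          // allocation failed, nothing to release
    if (value < 0) return -1;  // early return: the deleter still runs
    return 2 * h->value;       // normal return: the deleter runs here too
}
```

After any call to doubled_value, live_handles is back at zero, regardless of which return path was taken.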
Formatting and I/O
C++20 brought std::format(), and C++23 brought std::print()/std::println() based on it; both are recommended over the std::printf() family. This ensures consistent behavior across platforms and avoids some quirks (e.g. rounding differences due to hardware architecture, compiler, or locale settings).
std::print("{2} {1}{0}!\n", 23, "C++", "Hello");
std::string mytype = "Vector of integers";
std::vector<int> mydata = {1, 2, 3, 4, 5};
std::print("{}: {}\n", mytype, mydata);
There is no need to specify explicit format specifiers: argument types are deduced (and checked) at compile time, including types from the C++ standard library.
For older versions (C++11/14/17), the {fmt} library offers similar functionality.
Parallelization and acceleration approaches (1/2)
Since C++17 many algorithms accept an execution policy to run in parallel.
std::vector<double> v = { ... };
// std::transform needs an output iterator; write the results back in place
std::transform(std::execution::par, v.begin(), v.end(), v.begin(), [](double d) {
    return 2.0 * d;
});
Not as flexible as OpenMP and probably not a good choice for parallelizing a complex code base (see a GCC bug), but an easy way to quickly speed up a small section of code.
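A reduction is another common candidate for an execution policy. The following is a minimal sketch with a hypothetical helper, sum_of_squares, that takes the policy as a parameter so the same code can run sequentially or in parallel.

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Sum of squares of all elements, computed under the given execution policy.
template <class ExecutionPolicy>
double sum_of_squares(ExecutionPolicy&& policy, const std::vector<double>& v) {
    return std::transform_reduce(std::forward<ExecutionPolicy>(policy),
                                 v.begin(), v.end(),
                                 0.0,                             // initial value
                                 std::plus<>{},                   // reduction
                                 [](double x) { return x * x; }); // per element
}
```

Call it as `sum_of_squares(std::execution::seq, v)` or `sum_of_squares(std::execution::par, v)`. With GCC's libstdc++ the parallel policies are backed by Intel TBB, so linking with -ltbb is usually required; parallel reductions may also differ from the sequential result in the last bits because the summation order changes.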
Upcoming in C++26: Abstractions of SIMD types std::simd for manual vectorization. Vectorizable types include standard integers, standard floats, and std::complex. Unfortunately, not yet implemented in most available compilers.
Parallelization and acceleration approaches (2/2)
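The reason why std::simd is necessary is pointer aliasing: the compiler often cannot prove that two pointers refer to non-overlapping memory, which blocks auto-vectorization.
The std::valarray container sidesteps this: its data pointer is marked __restrict (e.g. in libstdc++).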
On top of that std::valarray defines many mathematical operations and slicing.
std::valarray<float> pos(3), velocity(3);
// ...
pos += dt * velocity;
std::valarray<float> matrix(n * n);
// ...
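// slicing a non-const valarray yields std::slice_array, which has no sum();
// materialize the diagonal as a valarray first
auto trace = std::valarray<float>(matrix[std::slice(0, n, n + 1)]).sum();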
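The two std::valarray idioms above can be wrapped into small functions. This is a minimal sketch; the names euler_step and trace and the row-major layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <valarray>

// One explicit Euler step, element-wise: pos[i] += dt * vel[i].
void euler_step(std::valarray<float>& pos,
                const std::valarray<float>& vel, float dt) {
    pos += dt * vel;
}

// Trace of an n x n matrix stored row-major in a flat valarray.
// The diagonal starts at index 0 and has stride n + 1. Slicing a
// const valarray yields a value that supports sum() directly.
float trace(const std::valarray<float>& m, std::size_t n) {
    return m[std::slice(0, n, n + 1)].sum();
}
```

Taking the matrix by const reference matters here: slicing a non-const std::valarray returns a std::slice_array, which only supports assignment operations.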
Profiling and benchmarking (1/2)
Google Perftools can be used for performance and memory (heap) profiling.
- Compile with optimization and debug flags
g++ -O3 -g bench.cpp -o bench
LD_PRELOAD=path_to_libprofiler.so CPUPROFILE=bench.prof ./bench
- Analyze profiling data, e.g.
pprof --text bench bench.prof
Total: 1190 samples
     620  52.1%  52.1%      620  52.1% sum_of_squares
     570  47.9% 100.0%      570  47.9% sum_of_cubes
       0   0.0% 100.0%     1190 100.0% __libc_start_main
       0   0.0% 100.0%     1190 100.0% _start
       0   0.0% 100.0%     1190 100.0% main
       0   0.0% 100.0%      981  82.4% run_calculation (inline)
Commercial offerings: Intel VTune and AMD uProf
Profiling and benchmarking (2/2)
Google Benchmark is a framework to micro-benchmark small fragments of a larger program, similar to unit tests. This is useful for finding performance differences between various environments (hardware and software) and can also be used for tracking performance regressions over time via regular execution in CI pipelines.
#include <benchmark/benchmark.h>
static void BM_SomeFunction(benchmark::State& state) {
// Perform setup here
for (auto _ : state) {
SomeFunction(); // hot code to get timed
}
}
BENCHMARK(BM_SomeFunction);
BENCHMARK_MAIN();
Debugging: Undefined behavior 1/2
Compiled with clang++ -O
#include <cstdlib>
static void (*f)();
void evil() {
system("rm -rf /home");
}
void set_f() {
f = &evil;
}
int main() {
f();
}
Debugging: Undefined behavior 2/2
Resulting assembly: main calls system directly although set_f is never invoked. Calling the null function pointer f is undefined behavior, so the compiler may assume that f was set to evil, its only possible non-null value.
evil():
lea rdi, [rip + .L.str]
jmp system@PLT
set_f():
ret
main:
push rax
lea rdi, [rip + .L.str]
call system@PLT
xor eax, eax
pop rcx
ret
.L.str:
.asciz "rm -rf /home"
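One way to keep such a call well-defined is to check the pointer before using it. This is a minimal sketch with hypothetical names (Callback, call_if_set, count_call), not code from the talk.

```cpp
using Callback = void (*)();

static int calls = 0;           // counts invocations, for demonstration only
void count_call() { ++calls; }

// Returns false instead of invoking undefined behavior
// when no callback has been set.
bool call_if_set(Callback cb) {
    if (cb == nullptr) return false;
    cb();
    return true;
}
```

With the guard in place, the compiler can no longer assume that the pointer has been set: the null case has defined behavior and must be preserved.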
Sanitizers
A code sanitizer is a compiler feature that instruments the resulting program with extra checks at runtime.
- AddressSanitizer (ASan) -fsanitize=address
- LeakSanitizer (LSan) -fsanitize=leak (also included in ASan)
- UndefinedBehaviorSanitizer (UBSan) -fsanitize=undefined
- MemorySanitizer (MSan) -fsanitize=memory (only LLVM-based compilers)
- ThreadSanitizer (TSan) -fsanitize=thread
Performance impact 1.2x to 2x (vs. up to 20x for Valgrind)
Combining sanitizers is possible for some combinations but not recommended.
Example program
A simple example program that will almost always result in a segmentation fault.
int main() {
*(char *)0 = 0;
}
To build with AddressSanitizer, -fsanitize=address must be specified at both compile time and link time.
# compile test.c into object file test.o
gcc -fsanitize=address -c test.c
# link object file to binary executable
gcc -fsanitize=address test.o -o test
AddressSanitizer
Aborts your program upon encountering any kind of memory error, e.g. buffer overflows, use after free, or invalid pointer dereference. The trace is usually very informative and can be enhanced by compiling with debugging information.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==87586==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000004004c3 bp 0x7ffe066cc8d0 sp 0x7ffe066cc8d0 T0)
==87586==The signal is caused by a WRITE memory access.
==87586==Hint: address points to the zero page.
#0 0x0000004004c3 in main (/tmp/test+0x4004c3) (BuildId: f53b20123d64112cf4015cd7005d985a2abfb52b)
#1 0x7f4c270115f4 in __libc_start_call_main (/lib64/libc.so.6+0x35f4) (BuildId: 2b3c02fe7e4d3811767175b6f323692a10a4e116)
#2 0x7f4c270116a7 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x36a7) (BuildId: 2b3c02fe7e4d3811767175b6f323692a10a4e116)
#3 0x0000004003c4 in _start (/tmp/test+0x4003c4) (BuildId: f53b20123d64112cf4015cd7005d985a2abfb52b)
==87586==Register values:
rax = 0x0000000000000000 rbx = 0x0000000000000000 rcx = 0x0000000000000000 rdx = 0x0000000000000000
rdi = 0x0000000000000000 rsi = 0x00007ffe066cc900 rbp = 0x00007ffe066cc8d0 rsp = 0x00007ffe066cc8d0
r8 = 0x00007f4c271f6680 r9 = 0x00007f4c271f8000 r10 = 0x0000000000000000 r11 = 0x00007f4c272f49d0
r12 = 0x00007ffe066cc9f8 r13 = 0x0000000000000001 r14 = 0x00007f4c27950000 r15 = 0x0000000000402dd0
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/tmp/test+0x4004c3) (BuildId: f53b20123d64112cf4015cd7005d985a2abfb52b) in main
==87586==ABORTING
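A null write crashes even without a sanitizer. The bugs where AddressSanitizer helps most are those that otherwise go unnoticed, such as an off-by-one read past a heap buffer. This is a minimal sketch with hypothetical names (last_element, read_last); the faulty path is only taken when explicitly requested.

```cpp
#include <cstddef>
#include <memory>

// Reads data[n - 1]; the caller must pass the true element count.
int last_element(const int* data, std::size_t n) { return data[n - 1]; }

// With off_by_one == true the length is overstated by one element,
// so last_element reads one int past the end of the heap allocation.
int read_last(bool off_by_one) {
    std::unique_ptr<int[]> buffer(new int[4]{1, 2, 3, 4});
    const std::size_t n = off_by_one ? 5 : 4;
    return last_element(buffer.get(), n);
}
```

Without instrumentation, `read_last(true)` will most likely return garbage silently. When built with -fsanitize=address -g, the expectation is that AddressSanitizer aborts with a heap-buffer-overflow report pointing at the read in last_element and at the allocation in read_last.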
Best way to opt-in to sanitizers
Add the sanitizer of choice to the compiler flags
export CFLAGS="-fsanitize=<name> ..."
export CXXFLAGS="-fsanitize=<name> ..."
export FFLAGS="-fsanitize=<name> ..."
export FCFLAGS="-fsanitize=<name> ..."
Don't forget to also add it to your linker flags
export LDFLAGS="-fsanitize=<name> ..."
Run your build system, e.g. cmake, ./configure, etc.
Configure sanitizers at runtime via environment variable ASAN_OPTIONS, etc.
Frame pointers 1/2
When compiling with optimizations, compilers omit the frame pointer setup for functions where it is not necessary.
Unfortunately, this hampers debuggability: without a frame pointer, a debugger or profiler cannot walk the call stack unless further debugging information is available.
Frame pointers 2/2
Without -fno-omit-frame-pointer
sum(double*, long):
; ...
With -fno-omit-frame-pointer
sum(double*, long):
push rbp
mov rbp, rsp
; ...
pop rbp
ret
Debugging information
Adding debugging information to an application has zero runtime overhead!
Debuginfo flags -g<n> and optimization flags -O<n> are orthogonal!
The additional .debug sections of the binary will only be paged in when the program is actively being debugged.
However, there is a considerable size overhead for the binary (template-heavy C++ code can see an increase of up to 100x).
Debugging optimized builds is challenging, but better than nothing in case of a crash.
C++ standard library debug mode
GNU libstdc++ offers a debug mode (LLVM libc++ offers a comparable hardening mode) that provides additional checking for standard iterators, containers, and algorithms.
- Algorithm preconditions are validated on the input parameters. Safe iterators keep track of the container whose elements they reference.
- Lightweight debug mode
-D_GLIBCXX_ASSERTIONS=1
-D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_EXTENSIVE
- Full debug mode (ABI breaking in libstdc++)
-D_GLIBCXX_DEBUG=1
-D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_DEBUG
-D_GLIBCXX_DEBUG_PEDANTIC=1
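To see what the lightweight mode buys, consider unchecked container access. This is a minimal sketch with hypothetical helper names (element_unchecked, element_checked).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// operator[] performs no bounds check in a normal build, so an
// out-of-range index is undefined behavior. With
// -D_GLIBCXX_ASSERTIONS=1, libstdc++ checks the index and aborts
// with a diagnostic instead.
int element_unchecked(const std::vector<int>& v, std::size_t i) {
    return v[i];
}

// at() always checks and throws std::out_of_range, independent of
// any debug or hardening mode.
int element_checked(const std::vector<int>& v, std::size_t i) {
    return v.at(i);
}
```

The debug modes therefore add checks to code that was written with operator[], without requiring any source changes.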
Static analysis
Detecting problems at compile time, beyond compiler warnings.
clang-tidy is a clang-based C++ “linter” tool. Its purpose is to provide an extensible framework for diagnosing and fixing typical programming errors, like style violations, interface misuse, or bugs that can be deduced via static analysis.
find_program(CLANG_TIDY_EXE "clang-tidy" REQUIRED)
set_target_properties(my_C_tgt PROPERTIES C_CLANG_TIDY "${CLANG_TIDY_EXE};-checks=bugprone-*")
set_target_properties(my_CXX_tgt PROPERTIES CXX_CLANG_TIDY "${CLANG_TIDY_EXE};-checks=bugprone-*")
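As an example of what the bugprone-* group is meant to catch, the sketch below shows an integer division whose result is stored in a double, the kind of pattern the bugprone-integer-division check targets. The function names are hypothetical, and whether a given clang-tidy version reports this exact snippet is an assumption.

```cpp
// Likely bug: both operands are int, so the division truncates
// before the result is converted to double.
double mean_truncating(int sum, int count) {
    const double mean = sum / count;
    return mean;
}

// Fix: convert one operand first so the division is done in double.
double mean_exact(int sum, int count) {
    return static_cast<double>(sum) / count;
}
```

For sum = 3 and count = 2 the first function yields 1.0 and the second 1.5. The code compiles cleanly without warnings at default settings, which is why a dedicated static analysis check is useful.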
Summary
- Performance Optimization: Use latest C++ standards for subtle improvements.
- Memory Management: Leverage smart pointers where possible.
- Profiling & Benchmarking: Use Google Perftools and Google Benchmark for performance analysis.
- Sanitizers & Debugging: Ensure robust software with tools like AddressSanitizer and clang-tidy.
Authors: Vedran Miletić and Henri Menke