
LLVM 🐉 in HPC 🧮 – language frontends, GPU 🎨 backends, and vendor compilers

Vedran Miletić

HPC application expert, Max Planck Computing and Data Facility (MPCDF)

MPG MPCDF logos

LLVM Meetup in Munich, October 28th, 2025


Short background

  • former postdoc (Heidelberg Institute for Theoretical Studies)
  • former junior professor (University of Rijeka, Croatia)
  • currently working mainly with software running on AMD Instinct MI300A APUs at Max Planck, as well as on other NVIDIA- and Intel-powered machines

MPCDF supercomputer Viper, Garching

Image source: Viper-GPU User Guide



Languages for HPC applications


  • C, C++, Fortran... Python, Julia, R...
  • OpenCL: portability over ease of use, features, and performance
  • OpenMP: ease of use over performance
  • CUDA/HIP: full hardware capability (contrasted with OpenMP in the sketch after this list)
  • SYCL: (hopefully) portable full hardware capability
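
To make the trade-off concrete, here is a minimal sketch of the same SAXPY written twice, as a HIP kernel with explicit launch geometry and as an OpenMP target region where the compiler picks the mapping; kernel names and launch parameters are illustrative, not tuned.

#include <hip/hip_runtime.h>

// CUDA/HIP: the programmer controls the thread geometry and memory explicitly.
__global__ void saxpy_hip(int n, float a, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
  if (i < n)
    y[i] = a * x[i] + y[i];
}

// OpenMP offload: one directive; the compiler chooses the launch geometry
// and the map clauses manage the host-device transfers.
void saxpy_omp(int n, float a, const float *x, float *y) {
#pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

The HIP version is launched as saxpy_hip<<<(n + 255) / 256, 256>>>(n, a, x, y) after explicit device allocations and copies; the OpenMP version hides both behind the directive.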

Image source: Wikimedia Commons


Why don't GPUs just use C/C++?


Image source: YouTube


Backends

  • NVPTX
  • AMDGPU
  • (SPIR-V)
  • (DirectX)
  • common features: GPU-specific intrinsics, address space management, kernel metadata (all three surface in the sketch below)
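
A hedged HIP C++ sketch of how all three features appear at the source level (the kernel and sizes are illustrative):

#include <hip/hip_runtime.h>

// __launch_bounds__ becomes kernel metadata that the backend uses for
// register budgeting and occupancy decisions.
__global__ void __launch_bounds__(256) tile_sum(const float *in, float *out) {
  // __shared__ lives in a distinct address space: LDS on AMDGPU, shared
  // memory on NVPTX (addrspace(3) in LLVM IR on both).
  __shared__ float tile[256];
  tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
  __syncthreads(); // lowered to a target-specific barrier intrinsic
  if (threadIdx.x == 0) {
    float sum = 0.0f;
    for (int i = 0; i < 256; ++i)
      sum += tile[i];
    out[blockIdx.x] = sum;
  }
}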

GPU

Image source: Wikimedia Commons


NVPTX backend

From the Wikipedia Parallel Thread Execution page:

Parallel Thread Execution (PTX or NVPTX) is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's Compute Unified Device Architecture (CUDA) programming environment. The LLVM-based Nvidia CUDA Compiler (NVCC) translates code written in OpenCL C and CUDA C/C++ into PTX instructions (an IL), and the graphics driver contains a compiler which translates PTX instructions into executable binary code, which can run on the processing cores of Nvidia graphics processing units (GPUs).
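
As a concrete example, compiling the trivial kernel below with clang++ --cuda-device-only --cuda-gpu-arch=sm_80 -S (flags illustrative of a typical invocation) stops at the PTX stage instead of producing a fat binary; the driver later compiles that PTX for the installed GPU.

// A trivial CUDA kernel; device-only compilation with -S emits readable PTX.
__global__ void scale(int n, float a, float *x) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i] *= a;
}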


AMDGPU backend

  • supports (R600), GCN, RDNA, and CDNA generations
  • assumptions for GFX942/Instinct MI300A (queried at run time in the sketch after this list):
    • Each agent has multiple shader arrays (SA).
    • Each SA has multiple compute units (CU).
    • Each CU has multiple SIMDs that execute wavefronts.
    • The wavefronts for a single work-group are executed in the same CU but may be executed by different SIMDs.
    • Each CU has a single LDS memory shared by the wavefronts of the work-groups executing on it.
    • (...)
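
A minimal HIP sketch of that run-time query; the property names are how the HIP runtime reports these sizes (multiProcessorCount maps to CUs, warpSize to the wavefront width):

#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
  hipDeviceProp_t prop;
  if (hipGetDeviceProperties(&prop, /*deviceId=*/0) != hipSuccess)
    return 1;
  // multiProcessorCount ~ compute units, warpSize ~ wavefront width
  // (64 on CDNA), sharedMemPerBlock ~ usable LDS per work-group.
  std::printf("CUs: %d, wavefront: %d, LDS/work-group: %zu B\n",
              prop.multiProcessorCount, prop.warpSize,
              prop.sharedMemPerBlock);
  return 0;
}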

AMDGPU backend: features

From llvm/lib/Target/AMDGPU/AMDGPU.td:

// Unless +-flat-for-global is specified, turn on FlatForGlobal for
// all OS-es on VI and newer hardware to avoid assertion failures due
// to missing ADDR64 variants of MUBUF instructions.
// FIXME: moveToVALU should be able to handle converting addr64 MUBUF
// instructions.

def FeatureFlatForGlobal : SubtargetFeature<"flat-for-global",
  "FlatForGlobal",
  "true",
  "Force to generate flat instruction for global"
>;

AMDGPU backend: generations

def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
  "gfx9",
  [FeatureFP64,
   FeatureWavefrontSize64, FeatureFlatAddressSpace,
   FeatureGCN3Encoding, FeatureCIInsts, Feature16BitInsts,
   FeatureSMemRealTime, FeatureScalarStores, FeatureInv2PiInlineImm,
   FeatureApertureRegs, FeatureGFX9Insts, FeatureVOP3P, FeatureVGPRIndexMode,
   FeatureFastFMAF32, FeatureDPP, FeatureIntClamp,
   FeatureSDWA, FeatureSDWAOmod, FeatureSDWAScalar, FeatureSDWASdst,
   FeatureFlatInstOffsets, FeatureFlatGlobalInsts, FeatureFlatScratchInsts,
   FeatureAddNoCarryInsts, FeatureGFX8Insts, FeatureGFX7GFX8GFX9Insts,
   FeatureScalarFlatScratchInsts, FeatureScalarAtomics, FeatureR128A16,
   FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK,
   FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
   FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
   FeatureDefaultComponentZero, FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
  ]
>;

AMDGPU backend: lowering 1/2

From llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp:

AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
                                           const AMDGPUSubtarget &STI)
    : TargetLowering(TM), Subtarget(&STI) {
  // ...
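  // Custom here means SelectionDAG hands these nodes to this target's
  // LowerOperation (next slide) instead of expanding them generically.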
  setOperationAction(
      {ISD::FLOG, ISD::FLOG10, ISD::FEXP, ISD::FEXP2, ISD::FEXP10}, MVT::f32,
      Custom);
  // ...
  setOperationAction({ISD::FLOG10, ISD::FLOG, ISD::FEXP, ISD::FEXP10}, MVT::f16,
                     Custom);
  // ...
}

AMDGPU backend: lowering 2/2

SDValue AMDGPUTargetLowering::LowerOperation(SDValue Op,
                                             SelectionDAG &DAG) const {
  switch (Op.getOpcode()) {
  default:
    // ...
  case ISD::FLOG10:
    return LowerFLOGCommon(Op, DAG);
  // ...
  }
}

AMDGPU backend: logarithm 1/4

SDValue AMDGPUTargetLowering::LowerFLOGCommon(SDValue Op,
                                              SelectionDAG &DAG) const {
  // ...
  const auto &Options = getTargetMachine().Options;
  if (VT == MVT::f16 || Flags.hasApproximateFuncs()) {
    if (VT == MVT::f16 && !Subtarget->has16BitInsts()) {
      // Log and multiply in f32 is good enough for f16.
      X = DAG.getNode(ISD::FP_EXTEND, DL, MVT::f32, X, Flags);
    }
    SDValue Lowered = LowerFLOGUnsafe(X, DL, DAG, IsLog10, Flags);
    if (VT == MVT::f16 && !Subtarget->has16BitInsts()) {
      return DAG.getNode(ISD::FP_ROUND, DL, VT, Lowered,
                         DAG.getTargetConstant(0, DL, MVT::i32), Flags);
    }
    return Lowered;
  }
  // ...

AMDGPU backend: logarithm 2/4

SDValue AMDGPUTargetLowering::LowerFLOGUnsafe(SDValue Src, const SDLoc &SL,
                                              SelectionDAG &DAG, bool IsLog10,
                                              SDNodeFlags Flags) const {
  EVT VT = Src.getValueType();
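  // f32 maps to the AMDGPU hardware log2 node (AMDGPUISD::LOG, which becomes
  // v_log_f32); other types fall back to the generic ISD::FLOG2.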
  unsigned LogOp =
      VT == MVT::f32 ? (unsigned)AMDGPUISD::LOG : (unsigned)ISD::FLOG2;

  double Log2BaseInverted =
      IsLog10 ? numbers::ln2 / numbers::ln10 : numbers::ln2;
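
Log2BaseInverted is just the change-of-base constant: the hardware provides a base-2 logarithm, and any other base is a single multiply away.

\log_{10} x = \frac{\ln 2}{\ln 10} \, \log_2 x,
\qquad
\ln x = \ln 2 \cdot \log_2 x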

AMDGPU backend: logarithm 3/4

  if (VT == MVT::f32) {
    auto [ScaledInput, IsScaled] = getScaledLogInput(DAG, SL, Src, Flags);
    if (ScaledInput) {
      SDValue LogSrc = DAG.getNode(AMDGPUISD::LOG, SL, VT, ScaledInput, Flags);
      SDValue ScaledResultOffset =
          DAG.getConstantFP(-32.0 * Log2BaseInverted, SL, VT);

      SDValue Zero = DAG.getConstantFP(0.0f, SL, VT);

      SDValue ResultOffset = DAG.getNode(ISD::SELECT, SL, VT, IsScaled,
                                         ScaledResultOffset, Zero, Flags);

      SDValue Log2Inv = DAG.getConstantFP(Log2BaseInverted, SL, VT);

      if (Subtarget->hasFastFMAF32())
        return DAG.getNode(ISD::FMA, SL, VT, LogSrc, Log2Inv, ResultOffset,
                           Flags);
      SDValue Mul = DAG.getNode(ISD::FMUL, SL, VT, LogSrc, Log2Inv, Flags);
      return DAG.getNode(ISD::FADD, SL, VT, Mul, ResultOffset);
    }
  }
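
The scaled path compensates for inputs pre-multiplied by 2^32 by getScaledLogInput (presumably to lift denormals into the range the hardware instruction handles accurately; that rationale is an assumption, the offset itself is from the code):

\log_b\!\left(2^{32} x\right) = \log_b x + 32 \log_b 2
\quad\Longrightarrow\quad
\log_b x = \log_b\!\left(2^{32} x\right) - 32 \cdot \frac{\ln 2}{\ln b}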

AMDGPU backend: logarithm 4/4

  SDValue Log2Operand = DAG.getNode(LogOp, SL, VT, Src, Flags);
  SDValue Log2BaseInvertedOperand = DAG.getConstantFP(Log2BaseInverted, SL, VT);

  return DAG.getNode(ISD::FMUL, SL, VT, Log2Operand, Log2BaseInvertedOperand,
                     Flags);
}
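
Putting the pieces together, a plain C++ sketch of what the lowered f32 log10 computes; the 0x1p-126f threshold is an assumption matching the 0x800000 (smallest normal) compare in the test below, and std::log2 stands in for the hardware v_log_f32:

#include <cmath>

float log10_sketch(float x) {
  const float log2_base_inverted = 0.30103f;   // ln(2)/ln(10), rounded here
  bool scaled = x < 0x1p-126f;                 // denormal input?
  float s = scaled ? x * 0x1p32f : x;          // ldexp(x, 32) on the GPU
  float r = std::log2(s) * log2_base_inverted; // v_log_f32, then FMUL or FMA
  return scaled ? r - 32.0f * log2_base_inverted : r;
}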

AMDGPU backend: test 1/2

From llvm/test/CodeGen/AMDGPU/llvm.log10.ll:

GFX900-SDAG-LABEL: s_log10_f32:
GFX900-SDAG:       ; %bb.0:
GFX900-SDAG-NEXT:    s_load_dword s6, s[4:5], 0x2c
GFX900-SDAG-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
GFX900-SDAG-NEXT:    v_mov_b32_e32 v0, 0x800000
GFX900-SDAG-NEXT:    v_mov_b32_e32 v1, 0x411a209b
GFX900-SDAG-NEXT:    v_mov_b32_e32 v2, 0
GFX900-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
GFX900-SDAG-NEXT:    v_cmp_lt_f32_e32 vcc, s6, v0
GFX900-SDAG-NEXT:    s_and_b64 s[2:3], vcc, exec
GFX900-SDAG-NEXT:    s_cselect_b32 s2, 32, 0
GFX900-SDAG-NEXT:    v_cndmask_b32_e32 v0, 0, v1, vcc
GFX900-SDAG-NEXT:    v_mov_b32_e32 v1, s2
GFX900-SDAG-NEXT:    v_ldexp_f32 v1, s6, v1
; ...

AMDGPU backend: test 2/2

GFX900-SDAG-NEXT:    v_log_f32_e32 v1, v1
GFX900-SDAG-NEXT:    s_mov_b32 s2, 0x3e9a209a
GFX900-SDAG-NEXT:    s_mov_b32 s3, 0x3284fbcf
GFX900-SDAG-NEXT:    v_mul_f32_e32 v3, 0x3e9a209a, v1
GFX900-SDAG-NEXT:    v_fma_f32 v4, v1, s2, -v3
GFX900-SDAG-NEXT:    v_fma_f32 v4, v1, s3, v4
GFX900-SDAG-NEXT:    s_mov_b32 s2, 0x7f800000
GFX900-SDAG-NEXT:    v_add_f32_e32 v3, v3, v4
GFX900-SDAG-NEXT:    v_cmp_lt_f32_e64 vcc, |v1|, s2
GFX900-SDAG-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc
GFX900-SDAG-NEXT:    v_sub_f32_e32 v0, v1, v0
GFX900-SDAG-NEXT:    global_store_dword v2, v0, s[0:1]
GFX900-SDAG-NEXT:    s_endpgm
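
A hedged decoding of the magic constants, tying the test back to the lowering above:

\texttt{0x3e9a209a} \approx \log_{10} 2, \quad
\texttt{0x3284fbcf} \approx \text{its low-order correction (the two FMAs form an extended-precision multiply)},
\texttt{0x411a209b} \approx 32 \log_{10} 2, \quad
\texttt{0x800000} = 2^{-126} \text{ (smallest normal)}, \quad
\texttt{0x7f800000} = +\infty \text{ (finiteness check)}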

Vendor compilers: AMD

AMD logo

Image source: Wikimedia Commons


Vendor compilers: Intel

Intel logo

  • a.k.a. IntelLLVM, Intel's fork of LLVM
  • oneAPI DPC++ compiler: C, C++, SYCL, OpenMP offload (FOSS)
  • custom Fortran compiler (not FOSS)
  • oneAPI Math Kernel Library (oneMKL) (not FOSS)
    • has a FOSS implementation named oneMath

Image source: Wikimedia Commons


Vendor compilers: NVIDIA

NVIDIA logo

  • NVIDIA HPC compilers, successor to the PGI compilers, based on LLVM
  • C, C++, Fortran
    • replacing GCC's role in the CUDA stack
  • CUDA, OpenACC, OpenMP
  • not FOSS

Image source: Wikimedia Commons


Why should application developers care about the compiler?

  • (lack of) standards and features support
  • (lack of) performance optimizations
  • (lack of) warnings about bad code
  • (lack of) descriptive error messages
  • from this perspective:
    • the addition of Clang/LLVM as a production-ready C/C++ compiler to the FOSS ecosystem a decade ago did great things for code quality, and
    • provided vendors with a common, reliable platform on top of which to build custom hardware- and software-dependent optimizations

Thank you for your attention

Author: Vedran Miletić