Prevođenje OpenCL C koda u asemblerski kod arhitektura AMD GCN i RDNA

Arhitekture i procesori

Korištenjem već ranije spomenutog LLVM-ovog alata llc možemo saznati koje su podržane ciljne arhitekture.

llc --version

LLVM (http://llvm.org/):
  LLVM version 11.0.0

  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: core2

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_32 - AArch64 (little endian ILP32)
    aarch64_be - AArch64 (big endian)
    amdgcn     - AMD GCN GPUs
    arm        - ARM
    arm64      - ARM64 (little endian)
    arm64_32   - ARM64 (little endian ILP32)
    armeb      - ARM (big endian)
    avr        - Atmel AVR Microcontroller
    bpf        - BPF (host endian)
    bpfeb      - BPF (big endian)
    bpfel      - BPF (little endian)
    hexagon    - Hexagon
    lanai      - Lanai
    mips       - MIPS (32-bit big endian)
    mips64     - MIPS (64-bit big endian)
    mips64el   - MIPS (64-bit little endian)
    mipsel     - MIPS (32-bit little endian)
    msp430     - MSP430 [experimental]
    nvptx      - NVIDIA PTX 32-bit
    nvptx64    - NVIDIA PTX 64-bit
    ppc32      - PowerPC 32
    ppc64      - PowerPC 64
    ppc64le    - PowerPC 64 LE
    r600       - AMD GPUs HD2XXX-HD6XXX
    riscv32    - 32-bit RISC-V
    riscv64    - 64-bit RISC-V
    sparc      - Sparc
    sparcel    - Sparc LE
    sparcv9    - Sparc V9
    systemz    - SystemZ
    thumb      - Thumb
    thumbeb    - Thumb (big endian)
    wasm32     - WebAssembly 32-bit
    wasm64     - WebAssembly 64-bit
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64
    xcore      - XCore

Uočimo da u popisu registriranih ciljnih arhitektura u koje LLVM može prevesti kod postoji amdgcn, odnosno AMD-ovi grafički procesori temeljeni na arhitekturi Graphics Core Next (GCN). Kako su instrukcije AMD-ove arhitekture Radeon DNA (RDNA) vrlo slične instrukcijama arhitekture GCN, isti pozadinski dio program prevoditelja koristi se i za RDNA, iako naziv sugerira drugačije. Vrijedi spomenuti da je slična situacija kod starijih grafičkih procesora pa r600 osim Radeona R600 (marketinški serije HD 2000 i HD 3000) podržava i R700 (serija HD 4000), Evergreen (serija HD 5000) i Northern Islands (serija HD 6000).

Generacije arhitekture GCN su redom:

GCN1 (GFX6): Southern Islands (serije HD 7000 i HD 8000; PlayStation 4, Xbox One)
GCN2 (GFX7): Sea Islands (serije R5/R7/R9 200 i R5/R7/R9 300)
GCN3 (GFX8): Volcanic Islands (R9 285, R9 380, R9 Fury)
GCN4 (GFX8): Arctic Islands (serije RX 400, RX 500, RX 600; PlayStation 4 Pro, Xbox One X)
GCN5 (GFX9): Vega (serija RX Vega i VII)

Generacije arhitekture RDNA su redom:

RDNA1 (GFX10): Navi (serija RX 5000)
RDNA2 (GFX11): Big Navi (serija RX 6000; PlayStation 5, Xbox Series X i S)

Popis podržanih procesora i značajki možemo također dobiti korištenjem naredbe llc.

llc -march=amdgcn -mattr=help

Available CPUs for this target:

  bonaire     - Select the bonaire processor.
  carrizo     - Select the carrizo processor.
  fiji        - Select the fiji processor.
  generic     - Select the generic processor.
  generic-hsa - Select the generic-hsa processor.
  gfx1010     - Select the gfx1010 processor.
  gfx1011     - Select the gfx1011 processor.
  gfx1012     - Select the gfx1012 processor.
  gfx1030     - Select the gfx1030 processor.
  gfx600      - Select the gfx600 processor.
  gfx601      - Select the gfx601 processor.
  gfx700      - Select the gfx700 processor.
  gfx701      - Select the gfx701 processor.
  gfx702      - Select the gfx702 processor.
  gfx703      - Select the gfx703 processor.
  gfx704      - Select the gfx704 processor.
  gfx801      - Select the gfx801 processor.
  gfx802      - Select the gfx802 processor.
  gfx803      - Select the gfx803 processor.
  gfx810      - Select the gfx810 processor.
  gfx900      - Select the gfx900 processor.
  gfx902      - Select the gfx902 processor.
  gfx904      - Select the gfx904 processor.
  gfx906      - Select the gfx906 processor.
  gfx908      - Select the gfx908 processor.
  gfx909      - Select the gfx909 processor.
  hainan      - Select the hainan processor.
  hawaii      - Select the hawaii processor.
  iceland     - Select the iceland processor.
  kabini      - Select the kabini processor.
  kaveri      - Select the kaveri processor.
  mullins     - Select the mullins processor.
  oland       - Select the oland processor.
  pitcairn    - Select the pitcairn processor.
  polaris10   - Select the polaris10 processor.
  polaris11   - Select the polaris11 processor.
  stoney      - Select the stoney processor.
  tahiti      - Select the tahiti processor.
  tonga       - Select the tonga processor.
  verde       - Select the verde processor.

Available features for this target:

  16-bit-insts                - Has i16/f16 instructions.
  DumpCode                    - Dump MachineInstrs in the CodeEmitter.
  a16                         - Support gfx10-style A16 for 16-bit coordinates/gradients/lod/clamp/mip image operands.
  add-no-carry-insts          - Have VALU add/sub instructions without carry out.
  aperture-regs               - Has Memory Aperture Base and Size Registers.
  atomic-fadd-insts           - Has buffer_atomic_add_f32, buffer_atomic_pk_add_f16, global_atomic_add_f32, global_atomic_pk_add_f16 instructions.
  auto-waitcnt-before-barrier - Hardware automatically inserts waitcnt before barrier.
  ci-insts                    - Additional instructions for CI+.
  code-object-v3              - Generate code object version 3.
  cumode                      - Enable CU wavefront execution mode.
  dl-insts                    - Has v_fmac_f32 and v_xnor_b32 instructions.
  dot1-insts                  - Has v_dot4_i32_i8 and v_dot8_i32_i4 instructions.
  dot2-insts                  - Has v_dot2_f32_f16, v_dot2_i32_i16, v_dot2_u32_u16, v_dot4_u32_u8, v_dot8_u32_u4 instructions.
  dot3-insts                  - Has v_dot8c_i32_i4 instruction.
  dot4-insts                  - Has v_dot2c_i32_i16 instruction.
  dot5-insts                  - Has v_dot2c_f32_f16 instruction.
  dot6-insts                  - Has v_dot4c_i32_i8 instruction.
  dpp                         - Support DPP (Data Parallel Primitives) extension.
  dpp8                        - Support DPP8 (Data Parallel Primitives) extension.
  ds-src2-insts               - Has ds_*_src2 instructions.
  dumpcode                    - Dump MachineInstrs in the CodeEmitter.
  enable-ds128                - Use ds_{read|write}_b128.
  enable-prt-strict-null      - Enable zeroing of result registers for sparse texture fetches.
  fast-denormal-f32           - Enabling denormals does not cause f32 instructions to run at f64 rates.
  fast-fmaf                   - Assuming f32 fma is at least as fast as mul + add.
  flat-address-space          - Support flat address space.
  flat-for-global             - Force to generate flat instruction for global.
  flat-global-insts           - Have global_* flat memory instructions.
  flat-inst-offsets           - Flat instructions have immediate offset addressing mode.
  flat-scratch-insts          - Have scratch_* flat memory instructions.
  flat-segment-offset-bug     - GFX10 bug, inst_offset ignored in flat segment.
  fma-mix-insts               - Has v_fma_mix_f32, v_fma_mixlo_f16, v_fma_mixhi_f16 instructions.
  fmaf                        - Enable single precision FMA (not as fast as mul+add, but fused).
  fp64                        - Enable double precision operations.
  g16                         - Support G16 for 16-bit gradient image operands.
  gcn3-encoding               - Encoding format for VI.
  get-wave-id-inst            - Has s_get_waveid_in_workgroup instruction.
  gfx10                       - GFX10 GPU generation.
  gfx10-3-insts               - Additional instructions for GFX10.3.
  gfx10-insts                 - Additional instructions for GFX10+.
  gfx10_b-encoding            - Encoding format GFX10_B.
  gfx7-gfx8-gfx9-insts        - Instructions shared in GFX7, GFX8, GFX9.
  gfx8-insts                  - Additional instructions for GFX8+.
  gfx9                        - GFX9 GPU generation.
  gfx9-insts                  - Additional instructions for GFX9+.
  half-rate-64-ops            - Most fp64 instructions are half rate instead of quarter.
  inst-fwd-prefetch-bug       - S_INST_PREFETCH instruction causes shader to hang.
  int-clamp-insts             - Support clamp for integer destination.
  inv-2pi-inline-imm          - Has 1 / (2 * pi) as inline immediate.
  lds-branch-vmem-war-hazard  - Switching between LDS and VMEM-tex not waiting VM_VSRC=0.
  lds-misaligned-bug          - Some GFX10 bug with misaligned multi-dword LDS access in WGP mode.
  ldsbankcount16              - The number of LDS banks per compute unit..
  ldsbankcount32              - The number of LDS banks per compute unit..
  load-store-opt              - Enable SI load/store optimizer pass.
  localmemorysize0            - The size of local memory in bytes.
  localmemorysize32768        - The size of local memory in bytes.
  localmemorysize65536        - The size of local memory in bytes.
  mad-mac-f32-insts           - Has v_mad_f32/v_mac_f32/v_madak_f32/v_madmk_f32 instructions.
  mad-mix-insts               - Has v_mad_mix_f32, v_mad_mixlo_f16, v_mad_mixhi_f16 instructions.
  mai-insts                   - Has mAI instructions.
  max-private-element-size-16 - Maximum private access size may be 16.
  max-private-element-size-4  - Maximum private access size may be 4.
  max-private-element-size-8  - Maximum private access size may be 8.
  mfma-inline-literal-bug     - MFMA cannot use inline literal as SrcC.
  mimg-r128                   - Support 128-bit texture resources.
  movrel                      - Has v_movrel*_b32 instructions.
  no-data-dep-hazard          - Does not need SW waitstates.
  no-sdst-cmpx                - V_CMPX does not write VCC/SGPR in addition to EXEC.
  no-sram-ecc-support         - Hardware does not support SRAM ECC.
  no-xnack-support            - Hardware does not support XNACK.
  nsa-encoding                - Support NSA encoding for image instructions.
  nsa-to-vmem-bug             - MIMG-NSA followed by VMEM fail if EXEC_LO or EXEC_HI equals zero.
  offset-3f-bug               - Branch offset of 3f hardware bug.
  pk-fmac-f16-inst            - Has v_pk_fmac_f16 instruction.
  promote-alloca              - Enable promote alloca pass.
  r128-a16                    - Support gfx9-style A16 for 16-bit coordinates/gradients/lod/clamp/mip image operands, where a16 is aliased with r128.
  register-banking            - Has register banking.
  s-memrealtime               - Has s_memrealtime instruction.
  s-memtime-inst              - Has s_memtime instruction.
  scalar-atomics              - Has atomic scalar memory instructions.
  scalar-flat-scratch-insts   - Have s_scratch_* flat memory instructions.
  scalar-stores               - Has store scalar memory instructions.
  sdwa                        - Support SDWA (Sub-DWORD Addressing) extension.
  sdwa-mav                    - Support v_mac_f32/f16 with SDWA (Sub-DWORD Addressing) extension.
  sdwa-omod                   - Support OMod with SDWA (Sub-DWORD Addressing) extension.
  sdwa-out-mods-vopc          - Support clamp for VOPC with SDWA (Sub-DWORD Addressing) extension.
  sdwa-scalar                 - Support scalar register with SDWA (Sub-DWORD Addressing) extension.
  sdwa-sdst                   - Support scalar dst for VOPC with SDWA (Sub-DWORD Addressing) extension.
  sea-islands                 - SEA_ISLANDS GPU generation.
  sgpr-init-bug               - VI SGPR initialization bug requiring a fixed SGPR allocation size.
  si-scheduler                - Enable SI Machine Scheduler.
  smem-to-vector-write-hazard - s_load_dword followed by v_cmp page faults.
  southern-islands            - SOUTHERN_ISLANDS GPU generation.
  sram-ecc                    - Enable SRAM ECC.
  trap-handler                - Trap handler support.
  trig-reduced-range          - Requires use of fract on arguments to trig instructions.
  unaligned-buffer-access     - Support unaligned global loads and stores.
  unaligned-scratch-access    - Support unaligned scratch loads and stores.
  unpacked-d16-vmem           - Has unpacked d16 vmem instructions.
  unsafe-ds-offset-folding    - Force using DS instruction immediate offsets on SI.
  vcmpx-exec-war-hazard       - V_CMPX WAR hazard on EXEC (V_CMPX issue ONLY).
  vcmpx-permlane-hazard       - TODO: describe me.
  vgpr-index-mode             - Has VGPR mode register indexing.
  vmem-to-scalar-write-hazard - VMEM instruction followed by scalar writing to EXEC mask, M0 or SGPR leads to incorrect execution..
  volcanic-islands            - VOLCANIC_ISLANDS GPU generation.
  vop3-literal                - Can use one literal in VOP3.
  vop3p                       - Has VOP3P packed instructions.
  vscnt                       - Has separate store vscnt counter.
  wavefrontsize16             - The number of threads per wavefront.
  wavefrontsize32             - The number of threads per wavefront.
  wavefrontsize64             - The number of threads per wavefront.
  xnack                       - Enable XNACK support.

Use +feature to enable a feature, or -feature to disable it.
For example, llc -mcpu=mycpu -mattr=+feature1,-feature2

Prevođenje koda za različite procesore

Prevedimo kod:

clang -target amdgcn -S kernel.cl

Zadani procesor je gfx600, odnosno tahiti (Radeon serija HD 7900). Dobiveni kod je oblika:

        .text
        .section        .AMDGPU.config
        .long   47176
        .long   11469125
        .long   47180
        .long   133
        .long   47200
        .long   4194304
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  multiply_by_two
        .p2align        8
        .type   multiply_by_two,@function
multiply_by_two:
        s_mov_b32 s32, 0
        s_mov_b32 s33, 0
        s_mov_b32 s40, SCRATCH_RSRC_DWORD0
        s_mov_b32 s41, SCRATCH_RSRC_DWORD1
        s_mov_b32 s42, -1
        s_mov_b32 s43, 0xe8f000
        s_add_u32 s40, s40, s3
        s_addc_u32 s41, s41, 0
        s_getpc_b64 s[2:3]
        s_add_u32 s2, s2, get_global_id@gotpcrel32@lo+4
        s_addc_u32 s3, s3, get_global_id@gotpcrel32@hi+4
        s_load_dwordx2 s[4:5], s[2:3], 0x0
        s_load_dwordx4 s[36:39], s[0:1], 0x9
        v_mov_b32_e32 v0, 0
        s_mov_b64 s[0:1], s[40:41]
        s_mov_b64 s[2:3], s[42:43]
        s_waitcnt lgkmcnt(0)
        s_swappc_b64 s[30:31], s[4:5]
        v_ashrrev_i32_e32 v1, 31, v0
        s_mov_b32 s3, 0xf000
        v_lshl_b64 v[0:1], v[0:1], 2
        s_mov_b32 s2, 0
        s_mov_b64 s[0:1], s[36:37]
        buffer_load_dword v2, v[0:1], s[0:3], 0 addr64
        s_waitcnt vmcnt(0)
        v_add_f32_e32 v2, v2, v2
        s_mov_b64 s[0:1], s[38:39]
        buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
        s_endpgm
.Lfunc_end0:
        .size   multiply_by_two, .Lfunc_end0-multiply_by_two

        .ident  "Debian clang version 11.0.1-2"
        .section        ".note.GNU-stack"
        .amd_amdgpu_isa "amdgcn----gfx600

Promotrimo funkciju multiply_by_two. Vidimo niz skalarnih (prefix s_) i vektorskih (prefiks v_) instrukcija. Funkcija množenja s 2 ovdje postaje instrukcija zbrajanja broja sa samim sobom, v_add_f32_e32 koja djeluje na vektorskom registru opće namjene v2. Ta transformacija množenja s dva u zbrajanje je posljedica optimizacije koja zna da je zbrajanje manje zahtjevna operacija od množenja. Uočimo još dvije instrukcije:

instrukciju buffer_load_dword iznad, koja učitava podatke iz globalne memorije grafičkog procesora u vektorski registar v2,
instrukciju buffer_store_dword ispod, koja sprema podatke iz vektorskog registra v2 u globalnu memoriju.

Usporedimo dobiveni kod za grafički procesor gfx803, odnosno fiji (Radeon R9 Fury).

clang -target amdgcn -mcpu=gfx803 -S kernel.cl

        .text
        .section        .AMDGPU.config
        .long   47176
        .long   11469189
        .long   47180
        .long   133
        .long   47200
        .long   4194304
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  multiply_by_two
        .p2align        8
        .type   multiply_by_two,@function
multiply_by_two:
        s_mov_b32 s40, SCRATCH_RSRC_DWORD0
        s_mov_b32 s41, SCRATCH_RSRC_DWORD1
        s_mov_b32 s42, -1
        s_mov_b32 s43, 0xe80000
        s_add_u32 s40, s40, s3
        s_addc_u32 s41, s41, 0
        s_load_dwordx4 s[36:39], s[0:1], 0x24
        s_getpc_b64 s[0:1]
        s_add_u32 s0, s0, get_global_id@gotpcrel32@lo+4
        s_addc_u32 s1, s1, get_global_id@gotpcrel32@hi+4
        s_load_dwordx2 s[4:5], s[0:1], 0x0
        s_mov_b64 s[0:1], s[40:41]
        s_mov_b64 s[2:3], s[42:43]
        v_mov_b32_e32 v0, 0
        s_mov_b32 s32, 0
        s_mov_b32 s33, 0
        s_waitcnt lgkmcnt(0)
        s_swappc_b64 s[30:31], s[4:5]
        v_ashrrev_i32_e32 v1, 31, v0
        v_lshlrev_b64 v[0:1], 2, v[0:1]
        v_mov_b32_e32 v3, s37
        v_add_u32_e32 v2, vcc, s36, v0
        v_addc_u32_e32 v3, vcc, v3, v1, vcc
        flat_load_dword v2, v[2:3]
        v_mov_b32_e32 v4, s39
        v_add_u32_e32 v0, vcc, s38, v0
        v_addc_u32_e32 v1, vcc, v4, v1, vcc
        s_waitcnt vmcnt(0) lgkmcnt(0)
        v_add_f32_e32 v2, v2, v2
        flat_store_dword v[0:1], v2
        s_endpgm
.Lfunc_end0:
        .size   multiply_by_two, .Lfunc_end0-multiply_by_two

        .ident  "Debian clang version 11.0.1-2"
        .section        ".note.GNU-stack"
        .amd_amdgpu_isa "amdgcn----gfx803"

Uočimo kako je instrukcija v_add_f32_e32 ostala ista, ali kako su instrukcije flat_load_dword i flat_store_dword za rad s generičkim adresnim prostorom zamijenile instrukcije s prefiksom buffer_; naime, u trećoj generaciji arhitekture GCN uklonjene su specifične instrukcije za učitavanje iz globalne memorije i umjesto njih koriste se instrukcije za rad s generičkim adresnim prostorom.

U nastavku ćemo koristiti grafički procesor gfx900 (Radeon RX Vega). Vega uvodi nove instrukcije za rad s globalnom memorijom global_load_dword i global_store_dword:

clang -target amdgcn -mcpu=gfx900 -S kernel.cl

        .text
        .section        .AMDGPU.config
        .long   47176
        .long   11469189
        .long   47180
        .long   133
        .long   47200
        .long   4194304
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  multiply_by_two
        .p2align        8
        .type   multiply_by_two,@function
multiply_by_two:
        s_mov_b32 s40, SCRATCH_RSRC_DWORD0
        s_mov_b32 s41, SCRATCH_RSRC_DWORD1
        s_mov_b32 s42, -1
        s_mov_b32 s43, 0xe00000
        s_add_u32 s40, s40, s3
        s_addc_u32 s41, s41, 0
        s_load_dwordx4 s[36:39], s[0:1], 0x24
        s_getpc_b64 s[0:1]
        s_add_u32 s0, s0, get_global_id@gotpcrel32@lo+4
        s_addc_u32 s1, s1, get_global_id@gotpcrel32@hi+4
        s_load_dwordx2 s[4:5], s[0:1], 0x0
        s_mov_b64 s[0:1], s[40:41]
        s_mov_b64 s[2:3], s[42:43]
        v_mov_b32_e32 v0, 0
        s_mov_b32 s32, 0
        s_mov_b32 s33, 0
        s_waitcnt lgkmcnt(0)
        s_swappc_b64 s[30:31], s[4:5]
        v_ashrrev_i32_e32 v1, 31, v0
        v_lshlrev_b64 v[0:1], 2, v[0:1]
        v_mov_b32_e32 v3, s37
        v_add_co_u32_e32 v2, vcc, s36, v0
        v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
        global_load_dword v2, v[2:3], off
        v_mov_b32_e32 v4, s39
        v_add_co_u32_e32 v0, vcc, s38, v0
        v_addc_co_u32_e32 v1, vcc, v4, v1, vcc
        s_waitcnt vmcnt(0)
        v_add_f32_e32 v2, v2, v2
        global_store_dword v[0:1], v2, off
        s_endpgm
.Lfunc_end0:
        .size   multiply_by_two, .Lfunc_end0-multiply_by_two

        .ident  "Debian clang version 11.0.1-2"
        .section        ".note.GNU-stack"
        .amd_amdgpu_isa "amdgcn----gfx900"

Iako nam neće trebati toliko detaljno poznavanje instrukcija, čitava specifikacija pete generacije arhitekture GCN dostupna je kao dio prezentacija i priručnika AMD GPUOpena.

Nivoi optimizacije

Clang podržava 4 nivoa optimizacije: -O0, -O1, -O2 i -O3.

Iskoristimo nivo optimizacije -O0:

clang -target amdgcn -mcpu=gfx900 -O0 -S kernel.cl

Rezultirajući kod je oblika:

        .text
        .section        .AMDGPU.config
        .long   47176
        .long   11469125
        .long   47180
        .long   133
        .long   47200
        .long   4202496
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  multiply_by_two
        .p2align        8
        .type   multiply_by_two,@function
multiply_by_two:
        s_mov_b32 s32, 0x800
        s_mov_b32 s33, 0
        s_mov_b32 s36, SCRATCH_RSRC_DWORD0
        s_mov_b32 s37, SCRATCH_RSRC_DWORD1
        s_mov_b32 s38, -1
        s_mov_b32 s39, 0xe00000
        s_add_u32 s36, s36, s3
        s_addc_u32 s37, s37, 0
        s_load_dwordx2 s[2:3], s[0:1], 0x24
        s_load_dwordx2 s[0:1], s[0:1], 0x2c
        s_waitcnt lgkmcnt(0)
        s_mov_b32 s4, s3
        v_mov_b32_e32 v0, s4
        buffer_store_dword v0, off, s[36:39], s33 offset:12
        v_mov_b32_e32 v0, s2
        buffer_store_dword v0, off, s[36:39], s33 offset:8
        s_mov_b32 s2, s1
        v_mov_b32_e32 v0, s2
        buffer_store_dword v0, off, s[36:39], s33 offset:20
        v_mov_b32_e32 v0, s0
        buffer_store_dword v0, off, s[36:39], s33 offset:16
        s_getpc_b64 s[6:7]
        s_add_u32 s6, s6, get_global_id@gotpcrel32@lo+4
        s_addc_u32 s7, s7, get_global_id@gotpcrel32@hi+4
        s_load_dwordx2 s[6:7], s[6:7], 0x0
        s_mov_b64 s[8:9], s[36:37]
        s_mov_b64 s[10:11], s[38:39]
        v_mov_b32_e32 v0, 0
        s_mov_b64 s[0:1], s[8:9]
        s_mov_b64 s[2:3], s[10:11]
        s_waitcnt lgkmcnt(0)
        s_swappc_b64 s[30:31], s[6:7]
        buffer_store_dword v0, off, s[36:39], s33 offset:24
        buffer_load_dword v0, off, s[36:39], s33 offset:8
        buffer_load_dword v1, off, s[36:39], s33 offset:12
        s_waitcnt vmcnt(1)
        v_mov_b32_e32 v2, v0
        s_waitcnt vmcnt(0)
        v_mov_b32_e32 v3, v1
        buffer_load_dword v0, off, s[36:39], s33 offset:24
        s_waitcnt vmcnt(0)
        v_ashrrev_i32_e64 v1, 31, v0
        v_mov_b32_e32 v4, v0
        v_mov_b32_e32 v5, v1
        s_mov_b32 s4, 2
        v_lshlrev_b64 v[4:5], s4, v[4:5]
        v_mov_b32_e32 v0, v2
        v_mov_b32_e32 v1, v4
        v_mov_b32_e32 v2, v5
        v_add_co_u32_e64 v0, s[6:7], v0, v1
        v_addc_co_u32_e64 v1, s[6:7], v3, v2, s[6:7]
        v_mov_b32_e32 v6, v0
        v_mov_b32_e32 v7, v1
        global_load_dword v0, v[6:7], off
        s_waitcnt vmcnt(0)
        v_add_f32_e64 v0, v0, v0
        buffer_load_dword v1, off, s[36:39], s33 offset:16
        buffer_load_dword v2, off, s[36:39], s33 offset:20
        s_waitcnt vmcnt(1)
        v_mov_b32_e32 v6, v1
        s_waitcnt vmcnt(0)
        v_mov_b32_e32 v7, v2
        v_mov_b32_e32 v1, v6
        v_mov_b32_e32 v2, v4
        v_add_co_u32_e64 v1, s[6:7], v1, v2
        v_addc_co_u32_e64 v2, s[6:7], v7, v5, s[6:7]
        v_mov_b32_e32 v8, v1
        v_mov_b32_e32 v9, v2
        global_store_dword v[8:9], v0, off
        s_endpgm
.Lfunc_end0:
        .size   multiply_by_two, .Lfunc_end0-multiply_by_two

        .ident  "Debian clang version 11.0.1-2"
        .section        ".note.GNU-stack"
        .amd_amdgpu_isa "amdgcn----gfx900"

Optimizacija koju imamo iznad slična je -O1 i višim nivoima:

clang -target amdgcn -mcpu=gfx900 -O1 -S kernel.cl

Rezultirajući kod je oblika:

        .text
        .section        .AMDGPU.config
        .long   47176
        .long   11469189
        .long   47180
        .long   133
        .long   47200
        .long   4194304
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  multiply_by_two
        .p2align        8
        .type   multiply_by_two,@function
multiply_by_two:
        s_mov_b32 s32, 0
        s_mov_b32 s40, SCRATCH_RSRC_DWORD0
        s_mov_b32 s41, SCRATCH_RSRC_DWORD1
        s_mov_b32 s42, -1
        s_mov_b32 s43, 0xe00000
        s_add_u32 s40, s40, s3
        s_addc_u32 s41, s41, 0
        s_load_dwordx4 s[36:39], s[0:1], 0x24
        s_getpc_b64 s[0:1]
        s_add_u32 s0, s0, get_global_id@gotpcrel32@lo+4
        s_addc_u32 s1, s1, get_global_id@gotpcrel32@hi+4
        s_load_dwordx2 s[4:5], s[0:1], 0x0
        s_mov_b64 s[0:1], s[40:41]
        s_mov_b64 s[2:3], s[42:43]
        v_mov_b32_e32 v0, 0
        s_waitcnt lgkmcnt(0)
        s_swappc_b64 s[30:31], s[4:5]
        v_ashrrev_i32_e32 v1, 31, v0
        v_lshlrev_b64 v[0:1], 2, v[0:1]
        v_mov_b32_e32 v3, s37
        v_add_co_u32_e32 v2, vcc, s36, v0
        v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
        v_mov_b32_e32 v4, s39
        global_load_dword v2, v[2:3], off
        s_waitcnt vmcnt(0)
        v_add_f32_e32 v2, v2, v2
        v_add_co_u32_e32 v0, vcc, s38, v0
        v_addc_co_u32_e32 v1, vcc, v4, v1, vcc
        global_store_dword v[0:1], v2, off
        s_endpgm
.Lfunc_end0:
        .size   multiply_by_two, .Lfunc_end0-multiply_by_two

        .ident  "Debian clang version 11.0.1-2"
        .section        ".note.GNU-stack"
        .amd_amdgpu_isa "amdgcn----gfx900"

Uočimo kako je optimizirani kod bitno kraći i kako, primjerice, eliminira pet instrukcija buffer_store_dword i isto toliko instrukcija buffer_load_dword.

Vektorske instrukcije i tipovi podataka

Uočimo kako su instrukcije i registri koje smo dosad sreli vektorski, odnosno izvode ih sve niti koje izvode zrno. Ista je situacija kad kod ne koristi polja, primjerice:

__kernel void a_b_c()
{
    int a = 3;
    int b = 4;
    int c = a + b;
}

Prevođenjem ovog koda s uključenim optimizacijama nećemo dobiti ništa konkretno jer će program prevoditelj uočiti da se varijabla c nigdje ne koristi i eliminirati je, a zatim eliminirati varijable a i b. Iskoristimo parametar -O0:

clang -target amdgcn -mcpu=gfx900 -O0 -S kernel.cl

Dobivamo kod:

        .text
        .section        .AMDGPU.config
        .long   47176
        .long   11469056
        .long   47180
        .long   129
        .long   47200
        .long   4096
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  a_b_c
        .p2align        8
        .type   a_b_c,@function
a_b_c:
        s_mov_b32 s33, 0
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s7, 0xe00000
        s_add_u32 s4, s4, s1
        s_addc_u32 s5, s5, 0
        v_mov_b32_e32 v0, 3
        buffer_store_dword v0, off, s[4:7], s33 offset:4
        v_mov_b32_e32 v0, 4
        buffer_store_dword v0, off, s[4:7], s33 offset:8
        buffer_load_dword v0, off, s[4:7], s33 offset:4
        buffer_load_dword v1, off, s[4:7], s33 offset:8
        s_waitcnt vmcnt(0)
        v_add_u32_e64 v0, v0, v1
        buffer_store_dword v0, off, s[4:7], s33 offset:12
        s_endpgm
.Lfunc_end0:
        .size   a_b_c, .Lfunc_end0-a_b_c

        .ident  "Debian clang version 11.0.1-2"
        .section        ".note.GNU-stack"
        .amd_amdgpu_isa "amdgcn----gfx900"

Uočimo postavljanje vrijednosti 3 i 4 u vektorski registar v0 instrukcijom v_mov_b32_e32, a zatim uočimo zbrajanje instrukcijom v_add_u32_e64 koja nije ista kao instrukcija korištena ranije. Naime, ovdje se radi o cjelobrojnim tipovima, a ranije smo koristili brojeve s pomičnim zarezom pa su i instrukcije za rad s njima odvojene.

Author: Vedran Miletić