The LLVM tools

In addition to the use of LLVM via the Clang compiler and the Mesa 3d graphics library, it is also possible to use LLVM via several command-line tools bundled with it. Among these We have already used llc and will also be using llvm-cxxfilt:

llvm-mca, the LLVM machine code analyzer
lli, the LLVM just-in-time compiler or interpreter
llvm-diff, the structural diff tool for files containing LLVM intermediate representations
llvm-stress, the random generator of files containing LLVM intermediate representation
opt, the LLVM optimizer

We will be working with files containing the LLVM intermediate representation (IR). Many LLVM IR examples are available in the VirtualMachine/ir-examples repository on GitHub that we obtained earlier. We will use both the C source code and the LLVM IR source code from that repository.

In the following examples we will place the ir-examples directory in the builddir directory of the llvm-project. Once placed in the builddir directory:

cd builddir

we can achieve this using cURL:

curl -OL https://github.com/Virtual-Machine/ir-examples/archive/refs/heads/master.zip
unzip master.zip
mv ir-examples-master ir-examples

or Git:

git clone https://github.com/Virtual-Machine/ir-examples.git

llvm-mca

llvm-mca predicts the performance of machine code based on the scheduling models of processors in LLVM.

It takes assembly as the input. For example, to predict the performance of test1.c when executed on AMD Bobcat Jaguar microarchitecture CPUs, we should use llvm-mca with the -mcpu parameter set to btver2 (for what it's worth, btver1 is AMD, while bdver1, bdver2, bdver3, and bdver4 are AMD Bulldozer, Piledriver, Steamroller, and Excavator (respectively)):

./bin/clang ir-examples/c/test1.c -O2 -target x86_64-unknown-unknown -S -o - | ./bin/llvm-mca -mcpu=btver2

warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
Iterations:        100
Instructions:      1000
Total Cycles:      803
Total uOps:        1000

Dispatch Width:    2
uOps Per Cycle:    1.25
IPC:               1.25
Block RThroughput: 5.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     1.00           *            pushq %rbp
 1      1     0.50                        movq  %rsp, %rbp
 1      1     0.50                        leal  5(%rdi), %eax
 1      3     1.00    *                   popq  %rbp
 1      4     1.00                  U     retq
 1      1     1.00           *            pushq %rbp
 1      1     0.50                        movq  %rsp, %rbp
 1      0     0.50                        xorl  %eax, %eax
 1      3     1.00    *                   popq  %rbp
 1      4     1.00                  U     retq


Resources:
[0]   - JALU0
[1]   - JALU1
[2]   - JDiv
[3]   - JFPA
[4]   - JFPM
[5]   - JFPU0
[6]   - JFPU1
[7]   - JLAGU
[8]   - JMul
[9]   - JSAGU
[10]  - JSTC
[11]  - JVALU0
[12]  - JVALU1
[13]  - JVIMUL


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
2.50   2.50    -      -      -      -      -     4.00    -     2.00    -      -      -      -

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 -      -      -      -      -      -      -      -      -     1.00    -      -      -      -     pushq %rbp
0.49   0.51    -      -      -      -      -      -      -      -      -      -      -      -     movq  %rsp, %rbp
0.50   0.50    -      -      -      -      -      -      -      -      -      -      -      -     leal  5(%rdi), %eax
 -      -      -      -      -      -      -     1.00    -      -      -      -      -      -     popq  %rbp
0.50   0.50    -      -      -      -      -     1.00    -      -      -      -      -      -     retq
 -      -      -      -      -      -      -      -      -     1.00    -      -      -      -     pushq %rbp
0.52   0.48    -      -      -      -      -      -      -      -      -      -      -      -     movq  %rsp, %rbp
 -      -      -      -      -      -      -      -      -      -      -      -      -      -     xorl  %eax, %eax
 -      -      -      -      -      -      -     1.00    -      -      -      -      -      -     popq  %rbp
0.49   0.51    -      -      -      -      -     1.00    -      -      -      -      -      -     retq

We can observe the prediction of 1.25 instructions per cycle on average during the program exection (IPC) (higher is better) and the approximately 50% usage of arithmetic logic units (ALUs) in the Resource pressure section (lower is better).

Assignment

Compare the instructions per cycle and resurce pressure when control_flow.c is executed on AMD Jaguar and AMD Piledriver CPUs.

lli

lli is the interpreter that enables direct execution of the programs written in LLVM IR.

For example, running switch.ll from ir-examples/ll directory:

./bin/lli ir-examples/ll/switch.ll

will print:

1.2 + 1.4 = 2.6

Assignment

Modify the switch.c file so that it sums three numbers instead of two. Compile to LLVM IR and compare the resulting file with the switch.ll from the repository. Check if the resulting file can be executed with lli.

llvm-diff

llvm-diff is the structural comparison tool for LLVM IR files. Usage:

./bin/llvm-diff example1.ll example2.ll

Assignment

Use diff and then llvm-diff to compare the two LLVM IR files from the previous assignment. Observe what llvm-diff takes into account that diff does not.

llvm-stress

llvm-stress is the random LLVM IR generator.

Generated output can be saved into a file using -o parameter:

./bin/llvm-stress -o example-stress.ll

The file example-stress.ll contains:

; ModuleID = '/tmp/autogen.bc'
source_filename = "/tmp/autogen.bc"

define void @autogen_SD0(i8* %0, i32* %1, i64* %2, i32 %3, i64 %4, i8 %5) {
BB:
  %A4 = alloca <8 x double>, align 64
  %A3 = alloca <2 x i16>, align 4
  %A2 = alloca <4 x i8>, align 4
  %A1 = alloca <16 x i1>, align 16
  %A = alloca <16 x i1>, align 16
  %L = load <16 x i1>, <16 x i1>* %A, align 16
  store i8 %5, i8* %0, align 1
  %E = extractelement <2 x i8> zeroinitializer, i32 0
  %Shuff = shufflevector <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, <4 x i32> <i32 undef, i32 7, i32 1, i32 undef>
  %I = insertelement <1 x i8> zeroinitializer, i8 21, i32 0
  %B = fmul double 0x18DF23FE11DD527C, 0xAE97BFB633957A34
  %Se = sext i8 77 to i16
  %Sl = select <8 x i1> zeroinitializer, <8 x i1> zeroinitializer, <8 x i1> zeroinitializer
  %Cmp = icmp ugt i16 18761, %Se
  br label %CF

CF:                                               ; preds = %CF, %CF85, %CF86, %BB
  %L5 = load i8, i8* %0, align 1
  store i8 21, i8* %0, align 1
  %E6 = extractelement <2 x i16> zeroinitializer, i32 0
  %Shuff7 = shufflevector <2 x i8> zeroinitializer, <2 x i8> zeroinitializer, <2 x i32> <i32 undef, i32 3>
  %I8 = insertelement <4 x i32> zeroinitializer, i32 32529, i32 3
  %B9 = sub i64 %4, 251161
  %Tr = trunc <4 x i32> %I8 to <4 x i16>
  %Sl10 = select i1 %Cmp, <2 x i16>* %A3, <2 x i16>* %A3
  %Cmp11 = icmp ne <1 x i8> zeroinitializer, zeroinitializer
  %L12 = load <2 x i16>, <2 x i16>* %Sl10, align 4
  store <2 x i16> %L12, <2 x i16>* %Sl10, align 4
  %E13 = extractelement <2 x i64> zeroinitializer, i32 1
  %Shuff14 = shufflevector <4 x i32> zeroinitializer, <4 x i32> %Shuff, <4 x i32> <i32 2, i32 undef, i32 6, i32 0>
  %I15 = insertelement <4 x i32> zeroinitializer, i32 %3, i32 0
  %B16 = and i8 %E, 21
  %FC = uitofp <4 x i32> %I15 to <4 x double>
  %Sl17 = select i1 true, i16 -1979, i16 16293
  %Cmp18 = fcmp ueq double 0xAE97BFB633957A34, 0x99ABD3CAB0A9A048
  br i1 %Cmp18, label %CF, label %CF83

CF83:                                             ; preds = %CF83, %CF92, %CF
  %L19 = load <2 x i16>, <2 x i16>* %Sl10, align 4
  store <2 x i16> zeroinitializer, <2 x i16>* %Sl10, align 4
  %E20 = extractelement <8 x i1> %Sl, i32 1
  br i1 %E20, label %CF83, label %CF92

CF92:                                             ; preds = %CF83
  %Shuff21 = shufflevector <8 x i32> zeroinitializer, <8 x i32> zeroinitializer, <8 x i32> <i32 undef, i32 14, i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
  %I22 = insertelement <16 x i1> zeroinitializer, i1 true, i32 14
  %B23 = mul <1 x i8> %I, zeroinitializer
  %FC24 = fptoui <4 x double> %FC to <4 x i16>
  %Sl25 = select i1 true, i8 %E, i8 %B16
  %Cmp26 = icmp ule i32 0, 132849
  br i1 %Cmp26, label %CF83, label %CF85

CF85:                                             ; preds = %CF92
  %L27 = load i8, i8* %0, align 1
  store <2 x i16> %L12, <2 x i16>* %Sl10, align 4
  %E28 = extractelement <2 x i64> zeroinitializer, i32 1
  %Shuff29 = shufflevector <4 x i64> zeroinitializer, <4 x i64> zeroinitializer, <4 x i32> <i32 6, i32 undef, i32 2, i32 4>
  %I30 = insertelement <8 x i32> %Shuff21, i32 132849, i32 0
  %B31 = and i8 77, 21
  %Tr32 = trunc <4 x i32> %I8 to <4 x i16>
  %Sl33 = select i1 true, i8 %B31, i8 %5
  %Cmp34 = icmp ule <2 x i16> %L19, %L12
  %L35 = load <2 x i16>, <2 x i16>* %Sl10, align 4
  store i8 %L5, i8* %0, align 1
  %E36 = extractelement <2 x i1> %Cmp34, i32 0
  br i1 %E36, label %CF, label %CF82

CF82:                                             ; preds = %CF82, %CF89, %CF91, %CF85
  %Shuff37 = shufflevector <16 x i1> %I22, <16 x i1> zeroinitializer, <16 x i32> <i32 7, i32 undef, i32 undef, i32 undef, i32 15, i32 17, i32 undef, i32 21, i32 23, i32 25, i32 27, i32 undef, i32 31, i32 1, i32 3, i32 5>
  %I38 = insertelement <4 x double> %FC, double 0xAE97BFB633957A34, i32 1
  %FC39 = sitofp i8 %L5 to double
  %Sl40 = select i1 true, <16 x i1> %I22, <16 x i1> %I22
  %Cmp41 = icmp uge i8 %Sl25, 21
  br i1 %Cmp41, label %CF82, label %CF89

CF89:                                             ; preds = %CF82
  %L42 = load i8, i8* %0, align 1
  store <2 x i16> %L12, <2 x i16>* %Sl10, align 4
  %E43 = extractelement <16 x i1> %Shuff37, i32 11
  br i1 %E43, label %CF82, label %CF88

CF88:                                             ; preds = %CF88, %CF89
  %Shuff44 = shufflevector <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, <4 x i32> <i32 6, i32 0, i32 2, i32 undef>
  %I45 = insertelement <8 x i1> %Sl, i1 true, i32 0
  %B46 = srem <1 x i8> %I, %B23
  %Tr47 = fptrunc double %FC39 to float
  %Sl48 = select <8 x i1> %I45, <8 x i1> %I45, <8 x i1> zeroinitializer
  %Cmp49 = icmp eq i64 251161, %E28
  br i1 %Cmp49, label %CF88, label %CF91

CF91:                                             ; preds = %CF88
  %L50 = load <2 x i16>, <2 x i16>* %Sl10, align 4
  store i8 21, i8* %0, align 1
  %E51 = extractelement <16 x i1> zeroinitializer, i32 10
  br i1 %E51, label %CF82, label %CF86

CF86:                                             ; preds = %CF91
  %Shuff52 = shufflevector <8 x i1> %Sl48, <8 x i1> %I45, <8 x i32> <i32 13, i32 15, i32 1, i32 3, i32 5, i32 undef, i32 9, i32 11>
  %I53 = insertelement <4 x i32> zeroinitializer, i32 0, i32 3
  %FC54 = uitofp <2 x i16> zeroinitializer to <2 x float>
  %Sl55 = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x i16> %Tr32, <4 x i16> %Tr32
  %Cmp56 = icmp ult <16 x i1> %L, %Shuff37
  %L57 = load <2 x i16>, <2 x i16>* %Sl10, align 4
  store <2 x i16> %L12, <2 x i16>* %Sl10, align 4
  %E58 = extractelement <4 x i16> %Tr32, i32 0
  %Shuff59 = shufflevector <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, <4 x i32> <i32 3, i32 5, i32 7, i32 1>
  %I60 = insertelement <16 x i1> zeroinitializer, i1 true, i32 5
  %B61 = add <2 x i8> %Shuff7, zeroinitializer
  %Tr62 = trunc i64 %B9 to i1
  br i1 %Tr62, label %CF, label %CF81

CF81:                                             ; preds = %CF81, %CF90, %CF87, %CF86
  %Sl63 = select i1 true, <2 x i16> zeroinitializer, <2 x i16> %L50
  %Cmp64 = icmp sgt <4 x i32> %I8, %Shuff
  %L65 = load <16 x i1>, <16 x i1>* %A, align 16
  store i8 21, i8* %0, align 1
  %E66 = extractelement <2 x i16> %Sl63, i32 1
  %Shuff67 = shufflevector <2 x i64> zeroinitializer, <2 x i64> zeroinitializer, <2 x i32> <i32 2, i32 undef>
  %I68 = insertelement <2 x i64> zeroinitializer, i64 %4, i32 0
  %B69 = urem i64 251161, 251161
  %FC70 = fptosi double %B to i8
  %Sl71 = select i1 %E36, i1 true, i1 %Cmp49
  br i1 %Sl71, label %CF81, label %CF90

CF90:                                             ; preds = %CF81
  %Cmp72 = icmp slt i16 18437, %E58
  br i1 %Cmp72, label %CF81, label %CF87

CF87:                                             ; preds = %CF90
  %L73 = load i8, i8* %0, align 1
  store i64 %B9, i64* %2, align 4
  %E74 = extractelement <16 x i1> zeroinitializer, i32 2
  br i1 %E74, label %CF81, label %CF84

CF84:                                             ; preds = %CF87
  %Shuff75 = shufflevector <4 x i32> %I15, <4 x i32> %I53, <4 x i32> <i32 5, i32 7, i32 1, i32 3>
  %I76 = insertelement <8 x i1> zeroinitializer, i1 %Tr62, i32 7
  %B77 = fsub double 0x99ABD3CAB0A9A048, 0xAE97BFB633957A34
  %FC78 = uitofp i32 0 to double
  %Sl79 = select i1 true, i8 %L42, i8 %L73
  %Cmp80 = icmp ult <8 x i1> %Shuff52, %Sl
  store i8 %FC70, i8* %0, align 1
  store <2 x i16> %L12, <2 x i16>* %Sl10, align 4
  store <2 x i16> %L12, <2 x i16>* %Sl10, align 4
  store i8 %E, i8* %0, align 1
  store i8 %L42, i8* %0, align 1
  ret void
}

If we desire a smaller example file, we can specify the number of desired instructions using the -size parameter:

./bin/llvm-stress -size 8 -o example-stress-size8.ll

The file example-stress-size8.ll contains:

; ModuleID = '/tmp/autogen.bc'
source_filename = "/tmp/autogen.bc"

define void @autogen_SD0(i8* %0, i32* %1, i64* %2, i32 %3, i64 %4, i8 %5) {
BB:
  %A4 = alloca <8 x double>, align 64
  %A3 = alloca <2 x i16>, align 4
  %A2 = alloca <4 x i8>, align 4
  %A1 = alloca <16 x i1>, align 16
  %A = alloca <16 x i1>, align 16
  store i8 21, i8* %0, align 1
  store <8 x double> <double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF>, <8 x double>* %A4, align 64
  store i8 77, i8* %0, align 1
  store i8 77, i8* %0, align 1
  ret void
}

It is also possible to specify the -seed parameter with the value used for seeding the random generator.

Assignment

Check if the results of the subsequent runs of llvm-stress are the same (you can use llvm-diff to compare the files). Check if the result changes when different seed values are specified. (Hint: Rename functions if necessary for comparison.)

opt

opt is used for performing optimization (i.e. analysis and transform) passes.

Availables passes can be printed using the -print-passes parameter:

./bin/opt -print-passes

They are:

Module passes:
  always-inline
  attributor
  annotation2metadata
  openmp-opt
  called-value-propagation
  canonicalize-aliases
  cg-profile
  constmerge
  cross-dso-cfi
  deadargelim
  elim-avail-extern
  extract-blocks
  forceattrs
  function-import
  function-specialization
  globaldce
  globalopt
  globalsplit
  hotcoldsplit
  hwasan
  khwasan
  inferattrs
  inliner-wrapper
  inliner-wrapper-no-mandatory-first
  insert-gcov-profiling
  instrorderfile
  instrprof
  internalize
  invalidate<all>
  ipsccp
  iroutliner
  print-ir-similarity
  loop-extract
  lowertypetests
  metarenamer
  mergefunc
  name-anon-globals
  no-op-module
  objc-arc-apelim
  partial-inliner
  pgo-icall-prom
  pgo-instr-gen
  pgo-instr-use
  print-profile-summary
  print-callgraph
  print
  print-lcg
  print-lcg-dot
  print-must-be-executed-contexts
  print-stack-safety
  print<module-debuginfo>
  rel-lookup-table-converter
  rewrite-statepoints-for-gc
  rewrite-symbols
  rpo-function-attrs
  sample-profile
  scc-oz-module-inliner
  loop-extract-single
  strip
  strip-dead-debug-info
  pseudo-probe
  strip-dead-prototypes
  strip-debug-declare
  strip-nondebug
  strip-nonlinetable-debuginfo
  synthetic-counts-propagation
  verify
  wholeprogramdevirt
  dfsan
  asan-module
  msan-module
  tsan-module
  kasan-module
  sancov-module
  memprof-module
  poison-checking
  pseudo-probe-update
Module analyses:
  callgraph
  lcg
  module-summary
  no-op-module
  profile-summary
  stack-safety
  verify
  pass-instrumentation
  asan-globals-md
  inline-advisor
  ir-similarity
  globals-aa
Module alias analyses:
  globals-aa
CGSCC passes:
  argpromotion
  invalidate<all>
  function-attrs
  attributor-cgscc
  inline
  openmp-opt-cgscc
  coro-split
  no-op-cgscc
CGSCC analyses:
  no-op-cgscc
  fam-proxy
  pass-instrumentation
Function passes:
  aa-eval
  adce
  add-discriminators
  aggressive-instcombine
  assume-builder
  assume-simplify
  alignment-from-assumptions
  annotation-remarks
  bdce
  bounds-checking
  break-crit-edges
  callsite-splitting
  consthoist
  constraint-elimination
  chr
  coro-early
  coro-elide
  coro-cleanup
  correlated-propagation
  dce
  dfa-jump-threading
  div-rem-pairs
  dse
  dot-cfg
  dot-cfg-only
  early-cse
  early-cse-memssa
  ee-instrument
  fix-irreducible
  make-guards-explicit
  post-inline-ee-instrument
  gvn-hoist
  gvn-sink
  helloworld
  infer-address-spaces
  instcombine
  instcount
  instsimplify
  invalidate<all>
  irce
  float2int
  no-op-function
  libcalls-shrinkwrap
  lint
  inject-tli-mappings
  instnamer
  loweratomic
  lower-expect
  lower-guard-intrinsic
  lower-constant-intrinsics
  lower-matrix-intrinsics
  lower-matrix-intrinsics-minimal
  lower-widenable-condition
  guard-widening
  load-store-vectorizer
  loop-simplify
  loop-sink
  lowerinvoke
  lowerswitch
  mem2reg
  memcpyopt
  mergeicmps
  mergereturn
  nary-reassociate
  newgvn
  jump-threading
  partially-inline-libcalls
  lcssa
  loop-data-prefetch
  loop-load-elim
  loop-fusion
  loop-distribute
  loop-versioning
  objc-arc
  objc-arc-contract
  objc-arc-expand
  pgo-memop-opt
  print
  print<assumptions>
  print<block-freq>
  print<branch-prob>
  print<da>
  print<divergence>
  print<domtree>
  print<postdomtree>
  print<delinearization>
  print<demanded-bits>
  print<domfrontier>
  print<func-properties>
  print<inline-cost>
  print<inliner-size-estimator>
  print<loops>
  print<memoryssa>
  print<phi-values>
  print<regions>
  print<scalar-evolution>
  print<stack-safety-local>
  print-alias-sets
  print-predicateinfo
  print-mustexecute
  print-memderefs
  reassociate
  redundant-dbg-inst-elim
  reg2mem
  scalarize-masked-mem-intrin
  scalarizer
  separate-const-offset-from-gep
  sccp
  sink
  slp-vectorizer
  slsr
  speculative-execution
  sroa
  strip-gc-relocates
  structurizecfg
  tailcallelim
  unify-loop-exits
  vector-combine
  verify
  verify<domtree>
  verify<loops>
  verify<memoryssa>
  verify<regions>
  verify<safepoint-ir>
  verify<scalar-evolution>
  view-cfg
  view-cfg-only
  transform-warning
  asan
  kasan
  msan
  kmsan
  tsan
  memprof
Function passes with params:
  loop-unroll<O0;O1;O2;O3;full-unroll-max=N;no-partial;partial;no-peeling;peeling;no-profile-peeling;profile-peeling;no-runtime;runtime;no-upperbound;upperbound>
  msan<recover;kernel;track-origins=N>
  simplifycfg<no-forward-switch-cond;forward-switch-cond;no-switch-to-lookup;switch-to-lookup;no-keep-loops;keep-loops;no-hoist-common-insts;hoist-common-insts;no-sink-common-insts;sink-common-insts;bonus-inst-threshold=N>
  loop-vectorize<no-interleave-forced-only;interleave-forced-only;no-vectorize-forced-only;vectorize-forced-only>
  mldst-motion<no-split-footer-bb;split-footer-bb>
  gvn<no-pre;pre;no-load-pre;load-pre;no-split-backedge-load-pre;split-backedge-load-pre;no-memdep;memdep>
  print<stack-lifetime><may;must>
Function analyses:
  aa
  assumptions
  block-freq
  branch-prob
  domtree
  postdomtree
  demanded-bits
  domfrontier
  func-properties
  loops
  lazy-value-info
  da
  inliner-size-estimator
  memdep
  memoryssa
  phi-values
  regions
  no-op-function
  opt-remark-emit
  scalar-evolution
  stack-safety-local
  targetlibinfo
  targetir
  verify
  pass-instrumentation
  divergence
  basic-aa
  cfl-anders-aa
  cfl-steens-aa
  objc-arc-aa
  scev-aa
  scoped-noalias-aa
  tbaa
Function alias analyses:
  basic-aa
  cfl-anders-aa
  cfl-steens-aa
  objc-arc-aa
  scev-aa
  scoped-noalias-aa
  tbaa
Loop passes:
  canon-freeze
  dot-ddg
  invalidate<all>
  licm
  lnicm
  loop-flatten
  loop-idiom
  loop-instsimplify
  loop-interchange
  loop-rotate
  no-op-loop
  print
  loop-deletion
  loop-simplifycfg
  loop-reduce
  indvars
  loop-unroll-and-jam
  loop-unroll-full
  print-access-info
  print<ddg>
  print<iv-users>
  print<loopnest>
  print<loop-cache-cost>
  loop-predication
  guard-widening
  loop-bound-split
  loop-reroll
  loop-versioning-licm
Loop passes with params:
  simple-loop-unswitch<nontrivial;no-nontrivial;trivial;no-trivial>
Loop analyses:
  no-op-loop
  access-info
  ddg
  iv-users
  pass-instrumentation

Running opt without any parameters and an LLVM IR file in the argument:

./bin/opt example-stress-size8.ll

will (avoid to) output bitcode:

WARNING: You're attempting to print out a bitcode file.
This is inadvisable as it may cause display problems. If
you REALLY want to taste LLVM bitcode first-hand, you
can force output with the `-f' option.

To get LLVM IR as the output instead of bitcode, -S parameter should be used, and adding -o will save the output to a file:

./bin/opt example-stress-size8.ll -S -o example-stress-size8-opt.ll

The example-stress-size8-opt.ll file contains the same contents as example-stress-size8.ll since no optimization passes were specified:

; ModuleID = 'example-stress.ll'
source_filename = "/tmp/autogen.bc"

define void @autogen_SD0(i8* %0, i32* %1, i64* %2, i32 %3, i64 %4, i8 %5) {
BB:
  %A4 = alloca <8 x double>, align 64
  %A3 = alloca <2 x i16>, align 4
  %A2 = alloca <4 x i8>, align 4
  %A1 = alloca <16 x i1>, align 16
  %A = alloca <16 x i1>, align 16
  store i8 21, i8* %0, align 1
  store <8 x double> <double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF>, <8 x double>* %A4, align 64
  store i8 77, i8* %0, align 1
  store i8 77, i8* %0, align 1
  ret void
}

An interesting pass to try is Aggressive Dead Code Elimination (ADCE) (documentation). It is enabled by adding -passes parameter with the value adce:

./bin/opt example-stress-size8.ll -S -passes adce -o example-stress-size8-adce.ll

The contents file example-stress-size8-adce.ll are visibly different from example-stress-size8.ll:

; ModuleID = 'example-stress.ll'
source_filename = "/tmp/autogen.bc"

define void @autogen_SD0(i8* %0, i32* %1, i64* %2, i32 %3, i64 %4, i8 %5) {
BB:
  %A4 = alloca <8 x double>, align 64
  store i8 21, i8* %0, align 1
  store <8 x double> <double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF, double 0.000000e+00, double 0xFFFFFFFFFFFFFFFF>, <8 x double>* %A4, align 64
  store i8 77, i8* %0, align 1
  store i8 77, i8* %0, align 1
  ret void
}

Using llvm-diff on the example files:

./bin/llvm-diff example-stress.ll example-stress-opt.ll

we can see that 4 allocations were removed:

in function autogen_SD0:
  in block %BB:
    in instruction store to %A4 / store to %A4:
      operands %A4 and %A4 differ
    <   %A4 = alloca <8 x double>, align 64
    <   %A3 = alloca <2 x i16>, align 4
    <   %A2 = alloca <4 x i8>, align 4
    <   %A1 = alloca <16 x i1>, align 16

Adding the parameter -time-passes will cause opt to print the pass execution timing report after the optimization is finished:

./bin/opt example-stress-size8.ll -S -passes adce -time-passes -o example-stress-size8-adce.ll

The report looks like:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0002 seconds (0.0002 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0001 ( 42.8%)   0.0000 ( 39.1%)   0.0001 ( 42.3%)   0.0001 ( 42.7%)  PrintModulePass
   0.0000 ( 28.3%)   0.0000 ( 26.1%)   0.0001 ( 28.0%)   0.0001 ( 28.3%)  ADCEPass
   0.0000 ( 12.7%)   0.0000 ( 13.0%)   0.0000 ( 12.7%)   0.0000 ( 13.0%)  VerifierPass
   0.0000 (  7.8%)   0.0000 ( 13.0%)   0.0000 (  8.5%)   0.0000 (  8.0%)  VerifierAnalysis
   0.0000 (  8.4%)   0.0000 (  8.7%)   0.0000 (  8.5%)   0.0000 (  8.0%)  PostDominatorTreeAnalysis
   0.0002 (100.0%)   0.0000 (100.0%)   0.0002 (100.0%)   0.0002 (100.0%)  Total

===-------------------------------------------------------------------------===
                                LLVM IR Parsing
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0003 seconds (0.0003 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0003 (100.0%)   0.0000 (100.0%)   0.0003 (100.0%)   0.0003 (100.0%)  Parse IR
   0.0003 (100.0%)   0.0000 (100.0%)   0.0003 (100.0%)   0.0003 (100.0%)  Total

Assignment

Check the behaviour of this optimization pass on other files created by llvm-stress. I partocular, check whether a similar amount of code is deleted and try to explain why or why not by inspecting the files.

Assignment

Many optimization passes are not applicable to many situations. Pick any file created by llvm-stress and find two optimization passes that modify the code and two that do not (i.e. the code is invariant in relation to these optimizations).

Optimizations are generally noncommutative, that is, the order of optimization passes is important. For example, consider the following optimization of theexample-stress.ll file in which loop unrolling is performed first, and code sinking second:

./bin/opt example-stress.ll -S -passes loop-unroll,sink -o example-stress-loop-unroll-sink.ll

and the optimization where code sinking is performed first, and loop unrolling second:

./bin/opt example-stress.ll -S -passes sink,loop-unroll -o example-stress-sink-loop-unroll.ll

The results of these optimizations can be compared with llvm-diff:

./bin/llvm-diff example-stress-loop-unroll-sink.ll example-stress-sink-loop-unroll.ll

and observed to be different:

in function autogen_SD0:
  in block %CF:
    >   %I15 = insertelement <4 x i32> zeroinitializer, i32 %3, i32 0
    >   %FC = uitofp <4 x i32> %I15 to <4 x double>
  in block %CF83.preheader:
    >   %FC24 = fptoui <4 x double> %FC to <4 x i16>
    <   %I15 = insertelement <4 x i32> zeroinitializer, i32 %3, i32 0
    <   %B16 = and i8 %E, 21
    <   %FC = uitofp <4 x i32> %I15 to <4 x double>
    <   %Shuff21 = shufflevector <8 x i32> zeroinitializer, <8 x i32> zeroinitializer, <8 x i32> <i32 undef, i32 14, i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
    <   %I22 = insertelement <16 x i1> zeroinitializer, i1 true, i32 14
    <   %B23 = mul <1 x i8> %I, zeroinitializer
    <   %FC24 = fptoui <4 x double> %FC to <4 x i16>
    <   %Sl25 = select i1 true, i8 %E, i8 %B16
  in block %CF85:
    >   %B16 = and i8 %E, 21
    >   %Shuff21 = shufflevector <8 x i32> zeroinitializer, <8 x i32> zeroinitializer, <8 x i32> <i32 undef, i32 14, i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
    >   %I22 = insertelement <16 x i1> zeroinitializer, i1 true, i32 14
    >   %B23 = mul <1 x i8> %I, zeroinitializer
    >   %Sl25 = select i1 true, i8 %E, i8 %B16
        %L27 = load i8, i8* %0, align 1
        store <2 x i16> %L12, <2 x i16>* %Sl10, align 4
    >   %E28 = extractelement <2 x i64> zeroinitializer, i32 1
        %Shuff29 = shufflevector <4 x i64> zeroinitializer, <4 x i64> zeroinitializer, <4 x i32> <i32 6, i32 undef, i32 2, i32 4>
    >   %I30 = insertelement <8 x i32> %Shuff21, i32 132849, i32 0
    <   %I30 = insertelement <8 x i32> %Shuff21, i32 132849, i32 0
  in block %CF82.preheader:
    <   %E28 = extractelement <2 x i64> zeroinitializer, i32 1
  in block %CF82:
    >   %I38 = insertelement <4 x double> %FC, double 0xAE97BFB633957A34, i32 1
    >   %Sl40 = select i1 true, <16 x i1> %I22, <16 x i1> %I22
    >   %Cmp41 = icmp uge i8 %Sl25, 21
    >   br i1 %Cmp41, label %CF82.backedge, label %CF89
    <   %I38 = insertelement <4 x double> %FC, double 0xAE97BFB633957A34, i32 1
    <   %Sl40 = select i1 true, <16 x i1> %I22, <16 x i1> %I22
    <   %Cmp41 = icmp uge i8 %Sl25, 21
    <   br i1 %Cmp41, label %CF82.backedge, label %CF89

Assignment

Check whether two optimizations that were changing the code chosen in the last assignment can be performed on that code in any order.

Author: Vedran Miletić