1.3.7. Optimization Options

To enable optimization passes in the tiarmclang compiler, select an optimization level from among the -O[0|1|2|3|fast|g|s|z] options. In general, these options represent various levels of optimization: some are designed to favor smaller compiler-generated code size over performance, while others favor performance at the cost of increased compiler-generated code size.

For a more precise list of optimizations performed for each level, please see Details of Optimizations Performed at Each Level.

1.3.7.1. Optimization Level Options

Note

Optimization Option Recommendations

  • The -Oz option is recommended if small compiler-generated code size is a priority.

  • The -O3 option is recommended for optimizing performance, but it is likely to increase compiler-generated code size.

-O0

Performs no optimization.

-O1, -O

Enables restricted optimizations, providing a good trade-off between code size and debuggability. This option is recommended for maximum debuggability.

-O2

Enables most optimizations, but some optimizations that require significant additional compile time are disabled.

-O3

Enables all optimizations available at -O2 plus others that require additional compile time to perform. This option is recommended for optimizing performance, but it is likely to increase compiler-generated code size.

-Ofast

Enables all optimizations available at -O3 plus additional aggressive optimizations that have the potential for additional performance gains, but are not guaranteed to be in strict compliance with language standards.

-Og

Enables restricted optimizations while preserving debuggability. All optimizations available at -O1 are performed with the addition of some optimizations from -O2.

-Os

Enables all optimizations available at -O2 plus additional optimizations that are designed to reduce code size while mitigating negative impacts on performance.

-Oz

Enables all optimizations available at -O2 plus additional optimizations that further reduce code size at the risk of sacrificing performance. Even so, -Oz retains the performance gains from many of the -O2 level optimizations that are performed. This optimization setting is recommended if small code size is a priority.

1.3.7.3. More Specialized Optimization Options

1.3.7.3.1. Floating-Point Arithmetic

-ffast-math, -fno-fast-math

Enable or disable ‘fast-math’ mode during compilation. By default, the ‘fast-math’ mode is disabled. Enabling ‘fast-math’ mode allows the compiler to perform aggressive, not necessarily value-safe, assumptions about floating-point math, such as:

  • Assume floating-point math is consistent with regular algebraic rules for real numbers (e.g. addition and multiplication are associative, x/y == x * 1/y, and (a + b) * c == a * c + b * c).

  • Operands to floating-point operations are never NaNs or Inf values.

  • +0 and -0 are interchangeable.

Enabling the -ffast-math option also causes the following options to be set:

  • -ffp-contract=fast

  • -fno-honor-nans

  • -ffp-model=fast

  • -fno-rounding-math

  • -fno-signed-zeros

Use of the ‘fast-math’ mode also instructs the compiler to predefine the __FAST_MATH__ macro symbol.
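As an illustration of why these assumptions are not value-safe, consider the following sketch (the function names are illustrative, not part of the toolchain):

```c
/* Illustrative only: under default IEEE 754 semantics, floating-point
 * addition is not associative, so these two algebraically identical
 * expressions produce different results. With -ffast-math the compiler
 * may reassociate them and treat them as equal. */
float absorb_then_cancel(void)
{
    volatile float big = 1.0e8f;
    return (big + 1.0f) - big;   /* 1.0f is absorbed into big: 0.0f */
}

float cancel_then_add(void)
{
    volatile float big = 1.0e8f;
    return (big - big) + 1.0f;   /* cancellation happens first: 1.0f */
}
```

Compiled without -ffast-math, absorb_then_cancel() returns 0.0f while cancel_then_add() returns 1.0f; with reassociation enabled, the compiler may treat both expressions as 1.0f.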

-ffp-model=<precise|strict|fast>

-ffp-model is an umbrella option that establishes the model of floating-point semantics under which the compiler operates. Each argument to the -ffp-model option implies settings for the other, single-purpose floating-point options, including -ffast-math, -ffp-contract, and -frounding-math (described below).

The available arguments to the -ffp-model option are:

  • precise - With the exception of floating-point contraction optimizations, all optimizations that are not value-safe on floating-point data are disabled (-ffp-contract=on and -fno-fast-math). The tiarmclang compiler assumes this floating-point model by default.

  • strict - Disables floating-point contraction optimizations (-ffp-contract=off), honors dynamically-set floating-point rounding modes (-frounding-math), and disables all ‘fast-math’ floating-point optimizations (-fno-fast-math). Also sets -ffp-exception-behavior=strict.

  • fast - Enables all ‘fast-math’ floating-point optimizations (-ffast-math) and enables floating-point contraction optimizations across C/C++ statements (-ffp-contract=fast).

-ffp-contract=<fast|on|off|fast-honor-pragmas>

Instructs the compiler whether, and to what degree, it is allowed to form fused floating-point operations, such as floating-point multiply and add (FMA) instructions. This optimization is also known as floating-point contraction. Fused floating-point operations are permitted to produce more precise results than would otherwise be computed if the operations were performed separately.

The available arguments to the -ffp-contract option are:

  • fast - Allows fusing of floating-point operations across C/C++ statements, and ignores any FP_CONTRACT or clang fp contract pragmas that would otherwise affect the compiler’s ability to apply floating-point contraction optimizations.

  • on - Allows floating-point contraction within a given C/C++ statement. The floating-point contraction behavior can be affected by the use of FP_CONTRACT or clang fp contract pragmas.

  • off - Disables all floating-point contraction optimizations.

  • fast-honor-pragmas - Same as the fast argument, except that the user can alter the behavior via the use of the FP_CONTRACT and/or clang fp contract pragmas.
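The pragma interaction can be sketched as follows; this is an illustrative fragment (the function names are not part of any toolchain API), assuming a compiler that honors the standard FP_CONTRACT pragma:

```c
/* Sketch: under a pragma-aware contraction setting, the standard
 * FP_CONTRACT pragma controls whether a*b + c may be fused into a
 * single multiply-add within a statement. */
#pragma STDC FP_CONTRACT OFF
double mul_add_separate(double a, double b, double c)
{
    /* a*b is rounded before the addition; no fused operation here */
    return a * b + c;
}

#pragma STDC FP_CONTRACT ON
double mul_add_fusable(double a, double b, double c)
{
    /* the compiler may emit a fused multiply-add (single rounding) */
    return a * b + c;
}
```

Both functions compute the same algebraic expression; they differ only in whether the compiler is permitted to contract the multiply and add into one instruction.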

-ffp-exception-behavior=<ignore|strict|maytrap>

This option determines how the compiler behaves with respect to potential floating-point hardware exceptions.

  • ignore - This is the default setting. The compiler assumes that exception status flags are not read and that floating-point exceptions are masked.

  • maytrap - This setting prevents the compiler from performing floating-point transformations that may raise exceptions where the original code would not have raised them. The compiler may still perform constant folding.

  • strict - This setting causes the compiler to strictly preserve the floating-point exception semantics of the original code when performing any floating-point transformations. Setting -ffp-model=strict causes -ffp-exception-behavior to be set to strict.

The maytrap and strict settings also prevent the following optimizations:

  • Floating point instruction speculation

  • Hoisting floating point instructions into code where they may execute despite guard conditions that would otherwise prevent their execution

Floating-Point Speculation Example

Consider a floating-point divide operation that is guarded by an if statement in a C function:

__attribute__((noinline)) void cp_fpu_trace(float32 *f_Info)
{
    float32 f_array[10] = {0};
    float32 float1 = *f_Info;

    if (float1 > CP_MIN)
    {
      f_array[1u] = (1.f / *f_Info);
    }
    else
    {
      f_array[1u] = CP_MAX;
    }

    *f_Info = f_array[1u];
}

In this code the if statement is intended to prevent the floating-point divide operation from performing a divide-by-zero that will trigger a hardware floating-point exception. However, if optimization is enabled to perform floating-point speculation, the compiler may generate code to perform the divide instruction prior to the compare instruction that carries out the intent of the if statement:

...
cp_fpu_trace:
    vldr        s0, [r0]
    vmov.f32    s2, #1.000000e+00
    vldr        s4, .LCPI1_0         ; <- load of CP_MIN
    vldr        s6, .LCPI1_1         ; <- load of CP_MAX
    vdiv.f32    s2, s2, s0           ; <- divide
    vcmp.f32    s0, s4               ; <- compare
    vmrs        APSR_nzcv, fpscr
    vmovgt.f32  s6, s2               ; <- conditional move:
                                     ;    if (float1 > CP_MIN)
                                     ;      s6 = result of divide
    vstr        s6, [r0]
    bx          lr
...

In this case, if the value loaded into s0 is zero, then the divide instruction will trigger a floating-point hardware exception before the compare instruction has a chance to prevent that from happening. As mentioned above, including the -ffp-exception-behavior=maytrap or -ffp-exception-behavior=strict option on the compiler command-line will prevent the compiler from performing floating-point speculation optimizations and generating code that will execute the divide instruction prior to the compare and conditional move instructions.

-fhonor-nans, -fno-honor-nans

Instructs the compiler to check for and properly handle floating-point NaN values. The -fno-honor-nans option can improve generated code if the compiler can assume that it does not need to check for and enforce the proper handling of floating-point NaN values.

-frounding-math, -fno-rounding-math

By default, the compiler assumes that the -fno-rounding-math option is in effect. This instructs the compiler to assume round-to-nearest for all floating-point operations.

The C standard runtime library provides functions such as fesetround and fesetenv that allow you to dynamically alter the floating-point rounding mode. If the -frounding-math option is specified, the compiler honors any dynamically-set floating-point rounding mode. This can be used to prevent optimizations that may affect the result of a floating-point operation if the current rounding mode has changed or is different from the default (round-to-nearest). For example, floating-point constant folding may be inhibited if the result is not exactly representable.

-fsigned-zeros, -fno-signed-zeros

Assumes the presence of signed floating-point zero values. Use of the -fno-signed-zeros option can improve code if the compiler can assume that it doesn’t need to account for the presence of signed floating-point zero values.
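The distinction matters because +0.0 and -0.0 compare equal yet remain observable through operations such as division; the helper below is an illustrative sketch:

```c
/* Illustrative only: -fno-signed-zeros lets the compiler ignore the
 * sign of zero (for example, folding x + 0.0 to x, which is not
 * value-safe when x is -0.0). Under IEEE semantics the sign is
 * observable: 1.0/+0.0 yields +inf, while 1.0/-0.0 yields -inf. */
double reciprocal(double x)
{
    return 1.0 / x;
}
```

Note that 0.0 == -0.0 evaluates to true, so an equality test cannot distinguish the two values; only their behavior under operations like the division above reveals the sign.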

1.3.7.3.2. Inlining and Outlining

-finline-functions, -fno-inline-functions

Inline suitable functions. The -fno-inline-functions option disables this optimization.

-finline-hint-functions

Inline functions that are explicitly or implicitly marked as inline.
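For example, both of the following functions carry an inline hint that -finline-hint-functions would consider (the function names are illustrative):

```c
/* Sketch: functions marked with the inline keyword are explicit
 * inlining candidates; always_inline functions are inlined even at
 * lower optimization levels. */
static inline int square(int x)      /* explicit inline hint */
{
    return x * x;
}

__attribute__((always_inline))
static inline int cube(int x)        /* forced inlining candidate */
{
    return x * x * x;
}
```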

-mllvm -arm-memset-max-stores=<n>

When optimization is turned on during a compilation, the tiarmclang compiler inlines calls to the memset and memclr runtime support routines if the size of the data is below a certain threshold. For example, in the following source file:


#include <string.h>

struct {
  int t1;
  int t2;
  int t3;
  int t4;
  short t5;
  long t6;
} my_struct_inline;

void func()
{
  memset(&my_struct_inline, 0, sizeof(my_struct_inline));
}

When compiled with -O[1|2|3|fast], the call to memset is inlined if the clearing of the my_struct_inline data object can be done with <= 8 store instructions:

%> tiarmclang -mcpu=cortex-m0 -O3 -S struct_inline.c
%> cat struct_inline.s
...
func:
          ldr     r0, .LCPI0_0
          movs    r1, #0
          str     r1, [r0]
          str     r1, [r0, #4]
          str     r1, [r0, #8]
          str     r1, [r0, #12]
          str     r1, [r0, #16]
          str     r1, [r0, #20]
          bx      lr
          .p2align        2
.LCPI0_0:
          .long   my_struct_inline
...

However, when compiled with -O[s|z], where the compiler is attempting to generate smaller code, the call to memset is inlined only if clearing the my_struct_inline data object can be done with <= 4 store instructions. When compiled in combination with the -mcpu=cortex-m0 option, the call to memset is not inlined, but is instead implemented with a call to __aeabi_memclr4:

%> tiarmclang -mcpu=cortex-m0 -Oz -S struct_inline.c
%> cat struct_inline.s
...
func:
        push    {r7, lr}
        ldr     r0, .LCPI0_0
        movs    r1, #24
        bl      __aeabi_memclr4
        pop     {r7, pc}
        .p2align        2
.LCPI0_0:
        .long   my_struct_inline

The -mllvm -arm-memset-max-stores=<n> option allows you to control the criteria used by the compiler to decide whether to inline a call to the memset or memclr function. If the above example is re-compiled with -mcpu=cortex-m0 -Oz -mllvm -arm-memset-max-stores=6, then the call to memset will get inlined, since the clearing of my_struct_inline can be accomplished on Cortex-M0 with 6 store instructions:

%> tiarmclang -mcpu=cortex-m0 -Oz -mllvm -arm-memset-max-stores=6 -S struct_inline.c
%> cat struct_inline.s
...
func:
          ldr     r0, .LCPI0_0
          movs    r1, #0
          str     r1, [r0]
          str     r1, [r0, #4]
          str     r1, [r0, #8]
          str     r1, [r0, #12]
          str     r1, [r0, #16]
          str     r1, [r0, #20]
          bx      lr
         .p2align        2
.LCPI0_0:
         .long   my_struct_inline
...

The optimal value for the argument <n> to use with the -mllvm -arm-memset-max-stores=<n> option varies with each particular use case; the option lets you tune the inlining threshold to trade code size against the overhead of a runtime library call.

Note

Use Caution When Defining Symbols Inside an asm() Statement

Inlining a function that contains an asm() statement that contains a symbol definition when compiling with the tiarmclang compiler can cause a “symbol multiply defined” error.

Please see Inlining Functions that Contain asm() Statements for more details.

-moutline

Function outlining (aka “machine outlining”) is an optimization that saves code size by identifying recurring sequences of machine code and replacing each instance of the sequence with a call to a new function that performs the identified sequence of operations.

Function outlining is enabled when the -Oz option is specified on the tiarmclang command line. There are 3 settings for the function outlining optimization when using the -Oz option:

The -moutline option is the default setting; it performs machine outlining within functions. This is less aggressive than -moutline-inter-function, but it is guaranteed to be applied only when doing so will reduce the net code size.

-moutline-inter-function

The -moutline-inter-function option can be specified in combination with the -Oz option to enable inter-function outlining. While this is the more aggressive of the function outlining settings, it does not always guarantee an overall code size reduction. For example, if outlining occurs across multiple functions in a given compilation unit but only one of those functions is included in the linked application, the application will include the outlined code as well as the additional instructions required to call it. However, inter-function outlining is likely to be beneficial when all functions defined in a given compilation unit are included in the linked application.

-mno-outline

The -mno-outline option can be used to disable function outlining for a given compilation unit when using the -Oz option.

1.3.7.3.3. Loop Unrolling

-funroll-loops, -fno-unroll-loops

Enable optimizer to unroll loops. The -fno-unroll-loops option disables this optimization.
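Unrolling can also be requested for an individual loop with a clang loop pragma, independent of the global -funroll-loops setting; the following is an illustrative sketch (unknown pragmas are ignored by other compilers):

```c
/* Sketch: the pragma below requests full unrolling of this one loop.
 * With a constant trip count, the compiler can replace the loop with
 * a straight-line sequence of eight additions. */
#define LEN 8

int sum8(const int *a)
{
    int s = 0;
#pragma clang loop unroll(full)
    for (int i = 0; i < LEN; ++i) {   /* constant trip count */
        s += a[i];
    }
    return s;
}
```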

1.3.7.4. Details of Optimizations Performed at Each Level

The following lists give examples of the optimizations performed at each optimization level. (These optimizations are not listed in the order they are performed.)

-O0

None

-O1
  • Control Flow Simplification

  • Merge contiguous icmps into a memcmp

  • memcpy/memset/memcmp inlining

  • Constant Hoisting

  • Partially inline calls to library functions

  • Inline for always_inline functions

  • Global Variable Merging

  • Merge disjoint stack slots

  • Loop Strength Reduction

  • Loop Invariant Code Motion

  • Common Subexpression Elimination

  • Dead Argument Elimination

  • Machine code sinking

  • Peephole optimization

  • Tail Predication

  • Tail Duplication

  • Load/store optimization

  • Simple Register Coalescing

  • Copy Propagation

  • Conditional Constant Propagation

  • Called Value Propagation

  • Control Flow optimization

  • If-conversion

  • Thumb2 instruction size reduction

  • Dead Code Elimination

  • Loop Vectorization

  • Printf function specialization

  • Small memcpy/memset function specialization

  • Conditionally eliminate dead library calls

  • Loop Rotation

  • Loop Unrolling

-O2
  • Performs all (-O1) optimizations, plus:

  • Function Integration/Inlining

  • Instruction speculation

  • Value Propagation

  • Jump Threading (non-DFA)

  • Tail Call Elimination

  • Merged Load/Store Motion

  • Global Value Numbering

  • Memory Dependence Analysis

  • Dead Store Elimination

  • Superword-Level Parallelism (SLP) vectorization

  • Combine redundant instructions

  • Dead Global Elimination

  • Global Duplicate Constant Merging

  • Fast memcpy/memset function specialization

  • Align loop target boundaries to 16 bytes (Cortex-R4/R5)

-O3
  • Performs all (-O2) optimizations tuned for speed, plus:

  • Replace functions with supported intrinsics

  • Additional alias analysis and loop optimization

  • Aggressive Function Inlining

  • Call-site splitting

  • Promote ‘by reference’ arguments to scalars

  • Combine pattern based expressions

-Ofast
  • Performs all (-O3) optimizations, plus:

  • Allow optimizations to treat the sign of a zero argument or result as insignificant

  • Assumes no Inf values

  • Assumes no NaN values

  • Enable optimizations that make unsafe assumptions about IEEE math

  • Allow reassociation transformations for floating-point instructions

  • Allow optimizations to use the reciprocal of an argument rather than perform division

  • Allow more aggressive, lossy floating point math operations that enhance speed

-Og
  • Performs all (-O2) optimizations, but disables the following:

  • No Loop Vectorization

  • No function inlining except for always_inline functions

  • No instruction speculation

  • No Jump Threading

  • No Value Propagation

  • No Tail Call Elimination

  • No Merged Load/Store Motion

  • No Global Value Numbering

  • No Memory Dependence Analysis

  • No Superword-Level Parallelism (SLP) Vectorization

  • No Dead Global Elimination

  • No Global Duplicate Constant Merging

-Os
  • Performs all (-O2) optimizations tuned for code size, plus:

  • Previously enabled optimizations tuned for code size

  • Small memcpy/memset function specialization

  • Minimal memcpy/memset/memcmp inlining

  • Don’t conditionally eliminate dead library calls

  • Don’t align loop target boundaries to 16 bytes (Cortex-R4/R5)

-Oz
  • Performs all (-Os) optimizations, plus:

  • Machine Outlining

  • No Loop Vectorization

  • Less aggressive optimizations that impact code size

  • Don’t align loop target boundaries to 16 bytes (Cortex-R4/R5)