4.2. Optimization levels

The compiler can perform many optimizations to improve the execution speed and reduce the size of C and C++ programs. Table 4.2 lists the optimization levels available, the scope of each level and some examples of optimizations performed at each level.

Table 4.2 Optimization levels

--opt_level=off, -Ooff

  Scope: None

  Optimizations performed: None. This is the default setting for the C28x compiler.

--opt_level=0, -O0

  Scope: Statement

--opt_level=1, -O1

  Scope: Block

  Optimizations performed:

  • Performs all --opt_level=0 (-O0) optimizations, plus:

  • Performs local constant propagation and folding, copy propagation

  • Eliminates local common subexpressions

--opt_level=2, -O2

  Scope: Function

  Optimizations performed:

  • Performs all --opt_level=1 (-O1) optimizations, plus:

  • Loop optimizations, loop unrolling

  • Eliminates global common subexpressions

  • Eliminates global unused assignments

  • Generates auto incremented addresses

--opt_level=3, -O3

  Scope: File (i.e. across functions in a file)

  Optimizations performed:

  • Performs all --opt_level=2 (-O2) optimizations, plus:

  • Inlining of small functions

  • Removes functions not called in the file

--opt_level=4, -O4

  Scope: Program

  Optimizations performed: Link time optimizations. Refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.6, Link-Time Optimization (--opt_level=4 Option).

Note

To generate efficient code, it is highly recommended to set the optimization level at -O2 or higher.

For descriptions of these optimizations, refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.16, What Kind of Optimization Is Being Performed?

4.2.1. Examples

4.2.1.1. Expression simplification

Listing 4.4 Example to illustrate expression simplification
int32_t test(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

The source code in Listing 4.4 contains three 32-bit multiplies, each of which requires an IMPYL instruction. At -O2, the compiler simplifies the expressions so that only one IMPYL instruction is generated, compared with three when optimization is off.

Optimization level    Number of IMPYL in generated assembly

-Ooff                 3

-O0, -O1              2

-O2                   1
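The counts in the table can be related back to C source. The following is a minimal sketch, not the compiler's actual output: the function names are hypothetical, and the single-multiply form is only one plausible way to reach one IMPYL, since the exact transformation applied at -O2 is not shown here. The rewrites assume that no intermediate result overflows.

```c
#include <stdint.h>

/* Listing 4.4 as written: three multiplies appear in the source. */
int32_t test_as_written(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

/* Factoring a out of the sum leaves one multiply on each path (two in total). */
int32_t test_two_multiplies(int32_t a, int32_t b, int32_t c, int32_t d)
{
    return (d > 0) ? ((b + c) * a) : (a * b);
}

/* Assumed form: select the second factor first, then multiply once. */
int32_t test_one_multiply(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t factor = (d > 0) ? (b + c) : b;

    return a * factor;
}
```

All three functions return the same value for inputs that stay within the int32_t range, which is what allows a compiler to choose the cheaper form.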

4.2.1.2. Constant propagation and folding

Listing 4.5 Example to illustrate constant propagation and folding
int32_t constant(int32_t c, int32_t d)
{
    int32_t a = 42;
    int32_t b = 10;
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

This optimization propagates the values of constants into expressions and precomputes the results of constant expressions.

At -O2 and higher, the compiler replaces the expression with:

(d > 0L) ? (tmp = (c+10L)*42L) : (tmp = 420L);

That is, the compiler propagates the values of a and b into the expressions on lines 8 and 10 of Listing 4.5 (the if and else branches) and computes a * b at compile time, replacing it with the constant 420.
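To check the folded expression by hand, it can be written as an ordinary C function and compared against the original. This is a minimal sketch; the names constant_as_written and constant_folded are hypothetical and only illustrate the equivalence for inputs that do not overflow.

```c
#include <stdint.h>

/* Listing 4.5 as written: a and b are compile-time constants. */
int32_t constant_as_written(int32_t c, int32_t d)
{
    int32_t a = 42;
    int32_t b = 10;
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

/* Hand-folded equivalent: 42 * 10 + 42 * c becomes (c + 10) * 42,
 * and the else path becomes the constant 420. */
int32_t constant_folded(int32_t c, int32_t d)
{
    return (d > 0) ? ((c + 10) * 42) : 420;
}
```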

4.2.1.3. Unused assignment removal

Listing 4.6 Example to illustrate unused assignment removal
int32_t unused_asg(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp = 42;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

In Listing 4.6, the initial assignment to tmp on line 3 (tmp = 42) is not required because tmp is always reassigned on both the if and else paths, on lines 6 and 8, respectively. At -O0 and higher, the compiler removes the assignment.

This improves performance because expressions not required for correctness are removed, resulting in fewer cycles.

4.2.1.4. Auto incremented addressing

Listing 4.7 Example to illustrate auto incremented addressing
int32_t addressing(int32_t* array, int16_t N)
{
    int32_t sum = 0;
    int32_t i   = 0;

    _nassert (N > 0);
    for (i = 0; i < N; i++)
        sum += array[i];

    return sum;
}

At -O2 and higher, the compiler generates the efficient auto incremented addressing mode for the loop in Listing 4.7, resulting in fewer instructions to execute the loop: 12 instructions at -O1 vs. 8 instructions at -O2.

Table 4.3 Assembly generated for loop at various optimization levels

-O1

||$C$L7||:
        ;*** g2:
        ;***   sum += array[i];
        ;***   if ( (++i) < (long)N ) goto g2;
                        MOVL      ACC,XAR5
                        LSL       ACC,1
                        ADDL      ACC,XAR4
                        MOVL      XAR6,ACC
                        ADDB      XAR5,#1
                        MOVL      ACC,P
                        ADDL      ACC,*+XAR6[0]
                        MOVL      P,ACC
                        MOV       AL,AR7
                        MOV       ACC,AL
                        CMPL      ACC,XAR5
                        B         ||$C$L7||,GT

-O2 generates efficient *XARn++ addressing

||$C$L7||:
        ;*** g2:
        ;***   sum += *U$7++;
        ;***   if ( (--L$1) != (-1L) ) goto g2;
                        MOVL      ACC,XAR6
                        SUBB      XAR5,#1
                        ADDL      ACC,*XAR4++
                        MOVL      XAR6,ACC
                        MOVB      ACC,#0
                        SUBB      ACC,#1
                        CMPL      ACC,XAR5
                        B         ||$C$L7||,NEQ
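The interlist comments in the -O2 output (sum += *U$7++ with a down-counting L$1) correspond to a pointer-walking loop. The sketch below writes that form out in portable C next to the indexed original. The function names are hypothetical, the TI-specific _nassert is omitted so the code builds on any host compiler, and the second function assumes n > 0, as the _nassert in Listing 4.7 states.

```c
#include <stdint.h>

/* Indexed form, as in Listing 4.7: the element address is recomputed
 * from array and i on every iteration. */
int32_t sum_indexed(const int32_t *array, int16_t n)
{
    int32_t sum = 0;
    int32_t i;

    for (i = 0; i < n; i++)
        sum += array[i];

    return sum;
}

/* Pointer-walking form mirroring the -O2 interlist comments:
 * the pointer is post-incremented and a counter runs down to -1.
 * Assumes n > 0. */
int32_t sum_postinc(const int32_t *array, int16_t n)
{
    const int32_t *p = array;            /* plays the role of U$7 */
    int32_t remaining = (int32_t)n - 1;  /* plays the role of L$1 */
    int32_t sum = 0;

    do {
        sum += *p++;
    } while (--remaining != -1);

    return sum;
}
```

Writing the loop this way by hand is normally unnecessary at -O2 and higher; the sketch only shows what the transformed loop computes.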

4.2.1.5. Dead code elimination

Listing 4.8 Example to illustrate dead code elimination
int32_t dce(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp1 = a * b * c * d;
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

In Listing 4.8, the expression assigned to tmp1 is dead because tmp1 is never used in the function. Dead code elimination is the compiler technique that removes such unused computations. At -Ooff, the generated assembly contains 6 IMPYL instructions, one for each multiply in the source. At -O0, the compiler reduces the number of IMPYL instructions to 2 by combining dead code elimination with expression simplification, producing code equivalent to:

(d > 0L) ? (tmp = (b+c)*a) : (tmp = a*b);

4.2.2. Code size vs. speed tradeoffs

For details on code size vs. speed tradeoffs, refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.2, Controlling Code Size Versus Speed.

4.2.3. Optimization levels and debug

At higher levels of optimization, it gets progressively harder to debug (e.g. single-step) the application. This is because at higher optimization levels, the compiler makes transformations to the application to reduce its execution time, memory footprint, power consumption, or a combination of these. These transformations significantly change the layout of the code and make it difficult, or impossible, for the debugger to identify the source code that corresponds to a set of assembly instructions.

The best approach is to perform initial development and debug with optimization disabled and then enable optimizations. Refer to Enable debugging for details.

4.2.4. Optimizer interlist

Optimization makes normal source interlisting impractical, because the compiler extensively rearranges the program.

The --src_interlist option interlists compiler comments with assembly source statements. When this option is used with optimization enabled, the interlist feature does not run as a separate pass. Instead, the compiler inserts comments into the code, indicating how the compiler has rearranged and optimized the code. These comments appear in the assembly language file as comments starting with ;**.

Table 4.4 Output of the --src_interlist option

C source

float fmac(float *farray, int N)
{
    int i;
    float sum = 0.0f;

    #pragma MUST_ITERATE(4, , 4)
    #pragma UNROLL(2)
    for (i = 1; i < N; i++)
        sum += farray[i] * farray[i-1];

    return sum;
}

Interlist output in the assembly file

||fmac||:
;***  	-----------------------    U$13 = farray;
;***  	-----------------------    L$1 = (N>>1)-1;
;*** 31	-----------------------    sum = 0.0F;
;***  	-----------------------    #pragma MUST_ITERATE(2, 16382, 2)
;***  	-----------------------    #pragma UNROLL(1L)
;***  	-----------------------    // LOOP BELOW UNROLLED BY FACTOR(2)
;***  	-----------------------    #pragma LOOP_FLAGS(4103u)
;***	-----------------------g2:
;*** 36	-----------------------    C$1 = U$13[1];
;*** 36	-----------------------    sum += *U$13++*C$1;
;*** 36	-----------------------    sum += U$13[1]*C$1;
;*** 35	-----------------------    ++U$13;
;*** 35	-----------------------    if ( (--L$1) != (-1) ) goto g2;
;*** 38	-----------------------    return sum;

From the listing in Table 4.4, it is clear that the loop has been unrolled 2x by the optimizer. The original pragmas from the source have also been updated to account for the unrolling. For details on loop unrolling, refer to Loop unrolling.
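The interlist comments can be read back as C to see what the unrolled loop computes. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the TI pragmas are left out so the code builds on a host compiler, and the unrolled version relies on the trip count N - 1 being even and at least 2, which the MUST_ITERATE(4, , 4) pragma in the original source guarantees.

```c
/* Original loop from Table 4.4, without the TI-specific pragmas. */
float fmac_reference(const float *farray, int N)
{
    float sum = 0.0f;
    int i;

    for (i = 1; i < N; i++)
        sum += farray[i] * farray[i - 1];

    return sum;
}

/* Hand-written equivalent of the interlist output: two products per
 * iteration, with the shared middle element loaded once (C$1). */
float fmac_unrolled(const float *farray, int N)
{
    const float *p = farray;   /* plays the role of U$13 */
    int pairs = (N - 1) / 2;   /* number of unrolled iterations */
    float sum = 0.0f;

    while (pairs-- > 0) {
        float mid = p[1];      /* element shared by both products */

        sum += p[0] * mid;
        sum += p[2] * mid;
        p += 2;
    }

    return sum;
}
```

Each unrolled iteration reuses the middle element for both products, so the loop reads it from memory once rather than twice.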

Warning

The --c_src_interlist option can have a negative effect on performance and code size because it can prevent some optimizations from crossing C/C++ statement boundaries. The --src_interlist option is therefore recommended when optimizations are enabled. In CCS, the --src_interlist option is available in the “Source interlist” dropdown under Build -> C2000 Compiler -> Advanced Options -> Assembler Options.

For details on the interlist option, refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.10, Using the Interlist Feature With Optimization.