4.2. Optimization levels
The compiler can perform many optimizations to improve the execution speed and reduce the size of C and C++ programs. Table 4.2 lists the optimization levels available, the scope of each level and some examples of optimizations performed at each level.
Optimization level | Scope | Optimizations performed |
---|---|---|
-Ooff | None | None. This is the default setting for the C28x compiler. |
-O0 | Statement | |
-O1 | Block | |
-O2 | Function | |
-O3 | File (i.e. across functions in a file) | |
-O4 | Program | Link time optimizations. Refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.6, Link-Time Optimization (--opt_level=4 Option). |
Note
To generate efficient code, it is highly recommended to set the optimization level to -O2 or higher.
For descriptions of these optimizations, refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.16, What Kind of Optimization Is Being Performed?
4.2.1. Examples
4.2.1.1. Expression simplification
int32_t test(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp;
    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);
    return tmp;
}
The source code in Listing 4.4 contains three 32-bit multiplies, each of which requires an IMPYL instruction. At -O2, the compiler simplifies the expressions and generates 1 IMPYL instruction, versus 3 without optimization.
Optimization level | Number of IMPYL in generated assembly |
---|---|
-Ooff | 3 |
-O0, -O1 | 2 |
-O2 | 1 |
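One way to picture the -O2 result is a source-level rewrite that factors out the common operand a, so that a single multiply serves both paths. The sketch below is a hand-written equivalent, not compiler output. The names test_three_multiplies and test_single_multiply are hypothetical, and the equivalence assumes that no intermediate result overflows.

```c
#include <stdint.h>

/* Same body as the listing above: three multiplies in the source. */
int32_t test_three_multiplies(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

/* Hypothetical hand-factored form: (a * b) + (a * c) == a * (b + c),
 * so the second operand is selected first and one multiply remains. */
int32_t test_single_multiply(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t factor = b;

    if (d > 0)
        factor += c;

    return a * factor;
}
```

Both functions return the same value for any inputs that stay within the int32_t range. The code the compiler actually generates may be shaped differently.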
4.2.1.2. Constant propagation and folding
int32_t constant(int32_t c, int32_t d)
{
    int32_t a = 42;
    int32_t b = 10;
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}
This optimization propagates the values of constants into expressions and precomputes the results of constant expressions.
At -O2 and higher, the compiler replaces the expression with:
(d > 0L) ? (tmp = (c+10L)*42L) : (tmp = 420L);
That is, it propagates the values of a and b into the expressions on lines 8 and 10, and it precomputes a * b on line 10, replacing that expression with the constant 420.
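The effect can be checked at source level by writing the optimizer's expression out by hand and comparing it with the original function. This is a sketch only: constant_folded is a hypothetical name, and the equivalence assumes that (c + 10) * 42 does not overflow.

```c
#include <stdint.h>

/* Source form: a and b are known constants inside the function. */
int32_t constant(int32_t c, int32_t d)
{
    int32_t a = 42;
    int32_t b = 10;
    int32_t tmp;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}

/* Hypothetical hand-written equivalent of the optimizer's expression:
 * a and b are substituted, and a * b is precomputed as 420. */
int32_t constant_folded(int32_t c, int32_t d)
{
    return (d > 0) ? (c + 10) * 42 : 420;
}
```

For example, with c = 5 both functions return 630 when d is positive and 420 otherwise.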
4.2.1.3. Unused assignment removal
int32_t unused_asg(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp = 42;

    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);

    return tmp;
}
In Listing 4.6, the assignment to tmp on line 3 is not required, because both the if path (line 6) and the else path (line 8) assign tmp again before it is read. At -O0 and higher, the compiler removes the assignment.
This improves performance: work that is not required for correctness is removed, so the function executes in fewer cycles.
4.2.1.4. Auto incremented addressing
int32_t addressing(int32_t* array, int16_t N)
{
    int32_t sum = 0;
    int32_t i = 0;
    _nassert (N > 0);
    for (i = 0; i < N; i++)
        sum += array[i];
    return sum;
}
At -O2 and higher, the compiler uses the efficient auto incremented addressing mode for the loop in Listing 4.7, so the loop executes in fewer instructions: 12 instructions at -O1 vs. 8 instructions at -O2.
Loop generated at -O1 (12 instructions):
||$C$L7||:
;*** g2:
;*** sum += array[i];
;*** if ( (++i) < (long)N ) goto g2;
        MOVL ACC,XAR5
        LSL ACC,1
        ADDL ACC,XAR4
        MOVL XAR6,ACC
        ADDB XAR5,#1
        MOVL ACC,P
        ADDL ACC,*+XAR6[0]
        MOVL P,ACC
        MOV AL,AR7
        MOV ACC,AL
        CMPL ACC,XAR5
        B ||$C$L7||,GT
Loop generated at -O2 (8 instructions):
||$C$L7||:
;*** g2:
;*** sum += *U$7++;
;*** if ( (--L$1) != (-1L) ) goto g2;
        MOVL ACC,XAR6
        SUBB XAR5,#1
        ADDL ACC,*XAR4++
        MOVL XAR6,ACC
        MOVB ACC,#0
        SUBB ACC,#1
        CMPL ACC,XAR5
        B ||$C$L7||,NEQ
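The optimizer comments in the -O2 output (sum += *U$7++ and the down-counting L$1) suggest the following source-level picture of the transformation. It is a hand-written sketch for illustration, not what the compiler requires you to write. The name addressing_ptr and the local variable names are hypothetical, the TI-specific _nassert intrinsic is left out so that the code builds with any C compiler, and N > 0 is assumed as in the original.

```c
#include <stdint.h>

/* Indexed form, as in the source listing (without _nassert). */
int32_t addressing(int32_t *array, int16_t N)
{
    int32_t sum = 0;
    int32_t i;

    for (i = 0; i < N; i++)
        sum += array[i];

    return sum;
}

/* Hypothetical pointer-walking form, assuming N > 0.
 * It mirrors the -O2 optimizer comments: a post-incremented pointer
 * and a trip counter that counts down to -1. */
int32_t addressing_ptr(int32_t *array, int16_t N)
{
    int32_t sum = 0;
    int32_t *p = array;                  /* plays the role of U$7 */
    int32_t remaining = (int32_t)N - 1;  /* plays the role of L$1 */

    do {
        sum += *p++;                     /* load and advance in one step */
    } while (--remaining != -1);

    return sum;
}
```

The post-increment on p maps naturally onto the *XAR4++ operand in the -O2 loop, which removes the per-iteration address computation seen at -O1.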
4.2.1.5. Dead code elimination
int32_t dce(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t tmp1 = a * b * c * d;
    int32_t tmp;
    if (d > 0)
        tmp = (a * b) + (a * c);
    else
        tmp = (a * b);
    return tmp;
}
In Listing 4.8, the expression computed and assigned to tmp1 is dead because tmp1 is not used anywhere in the function. Dead code elimination is a compiler technique that removes such unused computations. At -Ooff, the generated assembly contains 6 IMPYL instructions, one for each multiply in the source. At -O0, a combination of dead code elimination and expression simplification reduces the number of IMPYL instructions generated to 2. The optimized code corresponds to:
(d > 0L) ? (tmp = (b+c)*a) : (tmp = a*b);
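Written out as a complete function, the optimizer's expression looks like the sketch below. The name dce_reduced is hypothetical, and the match with the original dce() assumes that no intermediate result overflows.

```c
#include <stdint.h>

/* Hypothetical hand-written equivalent of dce() after optimization:
 * the dead tmp1 computation is gone and a is factored out,
 * leaving one multiply on each path (two in total). */
int32_t dce_reduced(int32_t a, int32_t b, int32_t c, int32_t d)
{
    return (d > 0) ? (b + c) * a : a * b;
}
```

The three multiplies that fed tmp1 disappear entirely, and the remaining three collapse to two.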
4.2.2. Code size vs. speed tradeoffs
For details on code size vs. speed tradeoffs, refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.2, Controlling Code Size Versus Speed.
4.2.3. Optimization levels and debug
At higher levels of optimization, it gets progressively harder to debug (e.g. single-step) the application. This is because at higher optimization levels, the compiler makes transformations to the application to reduce its execution time, memory footprint, power consumption, or a combination of these. These transformations significantly change the layout of the code and make it difficult, or impossible, for the debugger to identify the source code that corresponds to a set of assembly instructions.
The best approach is to perform initial development and debug with optimization disabled and then enable optimizations. Refer to Enable debugging for details.
4.2.4. Optimizer interlist
Optimization makes normal source interlisting impractical, because the compiler extensively rearranges the program.
The --src_interlist option interlists compiler comments with assembly source statements. When this option is used with optimization enabled, the interlist feature does not run as a separate pass. Instead, the compiler inserts comments into the code that indicate how it has rearranged and optimized the code. These comments appear in the assembly language file as comments starting with ;**.
C source:
float fmac(float *farray, int N)
{
    int i;
    float sum = 0.0f;
    #pragma MUST_ITERATE(4, , 4)
    #pragma UNROLL(2)
    for (i = 1; i < N; i++)
        sum += farray[i] * farray[i-1];
    return sum;
}
Interlist output in the assembly file:
||fmac||:
;*** ----------------------- U$13 = farray;
;*** ----------------------- L$1 = (N>>1)-1;
;*** 31 ----------------------- sum = 0.0F;
;*** ----------------------- #pragma MUST_ITERATE(2, 16382, 2)
;*** ----------------------- #pragma UNROLL(1L)
;*** ----------------------- // LOOP BELOW UNROLLED BY FACTOR(2)
;*** ----------------------- #pragma LOOP_FLAGS(4103u)
;*** -----------------------g2:
;*** 36 ----------------------- C$1 = U$13[1];
;*** 36 ----------------------- sum += *U$13++*C$1;
;*** 36 ----------------------- sum += U$13[1]*C$1;
;*** 35 ----------------------- ++U$13;
;*** 35 ----------------------- if ( (--L$1) != (-1) ) goto g2;
;*** 38 ----------------------- return sum;
From the listing in Table 4.4, it is clear that the optimizer has unrolled the loop by a factor of 2. The original pragmas from the source have also been updated to account for the unrolling: for example, MUST_ITERATE(4, , 4) in the source appears as MUST_ITERATE(2, 16382, 2) in the interlist output. For details on loop unrolling, refer to Loop unrolling.
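To make the interlist comments easier to read, they can be transcribed back into C by hand. The sketch below does that. It is an illustration, not compiler output: fmac_unrolled and its local names are hypothetical, the TI pragmas are omitted so that the code builds with any C compiler, and the caller is assumed to honor the source's MUST_ITERATE(4, , 4) promise (the trip count N - 1 is at least 4 and a multiple of 4, e.g. N = 5 or N = 9).

```c
/* Rolled form, as in the C source (pragmas omitted). */
float fmac(float *farray, int N)
{
    int i;
    float sum = 0.0f;

    for (i = 1; i < N; i++)
        sum += farray[i] * farray[i - 1];

    return sum;
}

/* Hypothetical transcription of the interlist comments.
 * Assumes N - 1 is a multiple of 4 and at least 4. */
float fmac_unrolled(float *farray, int N)
{
    float *u = farray;          /* U$13 */
    int trips = (N >> 1) - 1;   /* L$1: unrolled iterations minus one */
    float sum = 0.0f;
    float c1;                   /* C$1: element shared by both products */

    do {
        c1 = u[1];
        sum += *u++ * c1;       /* farray[k] * farray[k+1], then advance */
        sum += u[1] * c1;       /* farray[k+2] * farray[k+1] */
        ++u;
    } while (--trips != -1);

    return sum;
}
```

Each pass through the do loop performs two iterations of the original loop and loads the shared element only once, which is the reuse that the C$1 temporary in the interlist output expresses.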
Warning
The --c_src_interlist option can have a negative effect on performance and code size because it can prevent some optimizations from crossing C/C++ statement boundaries. For this reason, the --src_interlist option is recommended when optimizations are enabled. In CCS, the --src_interlist option is available in the “Source interlist” dropdown under Build -> C2000 Compiler -> Advanced Options -> Assembler Options.
For details on the interlist option, refer to TMS320C28x Optimizing C/C++ Compiler User’s Guide, Section 3.10, Using the Interlist Feature With Optimization.