#Introduction
This article describes how to enable the compiler to generate code that takes
advantage of the C2000 architecture’s powerful performance features.
#Other Resources
* [Accelerators: Enhancing the Capabilities of the C2000 MCU Family](https://www.ti.com/lit/pdf/spry288)
* [Floating Point Optimization](https://processors.wiki.ti.com/index.php/Floating_Point_Optimization#--fp_mode)
* [TMS320C28x Optimizing C/C++ Compiler User's Guide (spru514)](https://www.ti.com/lit/pdf/spru514)
* [TMS320C28x Assembly Language Tools User's Guide (spru513)](https://www.ti.com/lit/pdf/spru513)
* [TMS320C28x Extended Instruction Sets Technical Reference Manual](https://www.ti.com/lit/pdf/spruhs1)
#Auto-increment/decrement Addressing Modes
The C2000 hardware has several addressing modes, including auto-increment/decrement
address registers (*XARn++/--). These operands eliminate the need for explicit
address calculation when striding through data arrays in loops. At optimization
level 2 and above, the optimizer will transform array accesses to enable this
addressing mode when possible. This provides a powerful performance boost not
only from eliminating non-essential address arithmetic in loops, but also, as
we will see in the topics to follow, by enabling the use of other powerful
architectural features.
##Indexed array accesses in loops
Source code written in a natural C language style, compiled at optimization level 2 or higher:
```c
int i;
for (i=0; i < upper_bound; i++)
array[i] = …
```
will generate:
* load base address of array into XARn
* access *XARn++ for each iteration
##Signed Array Index Variables
The array index variable (i in the example above) should be a signed integer, not an unsigned integer, whenever possible.
To the programmer, using an unsigned integer might seem logical: after all, the
programmer assumes the array index variable will never have a negative value.
However, the semantics of unsigned integers mean that unless the optimizer can
prove the lower and upper bounds of the value, it must allow for the possibility
that the value wraps around. To translate array[i] to *XARn++, the compiler must
prove that the value of i, and hence the address, is strictly increasing. In the
absence of a known upper bound, the compiler can’t rule out wrap-around of the
index variable and thus can’t perform the transformation.
The C language semantics of signed integers mean that wrap-around is undefined
behavior. Therefore, the compiler is not bound to support wrap-around for signed
integers and can do the transformation without needing to prove an upper bound.
If the programmer specifically wishes to use an unsigned integer for a greater range
of values, the optimizer can still perform the transformation if an upper bound is
statically knowable, such as through the use of a macro, a constant
(if the definition is visible in the scope of optimization), or **#pragma MUST_ITERATE**.
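As a minimal host-runnable sketch (plain C, no TI pragmas; the function names are ours), the two indexing styles look like this:

```c
/* Preferred: signed index. Signed overflow is undefined behavior, so the
   compiler may assume 'i' never wraps and can rewrite array[i] as an
   auto-incremented pointer access (*XARn++ on C28x). */
static int sum_signed(const int *array, int upper_bound)
{
    int i;
    int sum = 0;
    for (i = 0; i < upper_bound; i++)
        sum += array[i];
    return sum;
}

/* Unsigned index: without a provable upper bound, the compiler must allow
   for wrap-around, which can block the auto-increment transformation. */
static int sum_unsigned(const int *array, unsigned int upper_bound)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < upper_bound; i++)
        sum += array[i];
    return sum;
}
```

Both compute the same result; the difference is only in what the optimizer is allowed to assume about the index.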
##Address strictly increasing or decreasing by 1
Additionally, the array accesses must be determined to be strictly increasing
(or decreasing) in increments of 1. The optimizer can transform array accesses
to this addressing mode when it can detect that the addresses computed per
iteration only vary with respect to the incremented loop index value. For example:
```c
int i, j;
float sum;
for (i = 0; i < N; i++)
for (j=0; j < M; j++)
sum += array[j] * array[j - i];
```
##Efficient Pointer Modification
The C28x CPU has some powerful architecture features for pointer modification.
When writing C code, it is best to modify pointers in a way which maps to efficient C28x
instructions, and to avoid pointer modification which does not map to a C28x instruction.
For example:
* C28x CPU has single instructions for: \*p++, \*--p
* C28x CPU does ***not*** have single instructions for: \*p--, \*++p
It is recommended to use the modifiers which the CPU supports directly. However, sometimes the
compiler can create efficient code even if the recommended syntax above is not used.
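For example, a block copy written with the post-increment modifier, which maps to a single C28x instruction (a host-runnable sketch; the function name is ours):

```c
#include <stddef.h>

/* Post-increment (*p++) maps directly to a single C28x instruction, so this
   copy loop can compile down to tight auto-incremented accesses. */
static void copy_post_inc(int *dst, const int *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```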
#Single Repeatable Instructions
The C2000 architecture has numerous instructions that can be issued as single
repeatable instructions (RPT || ). This construct eliminates all branching
overhead and only adds a total of 1 or 4 cycles depending on whether the repeat
count is used as an immediate value or register, respectively. When possible,
the C2000 compiler will transform loops containing supported operations into
single repeatable instructions. Supported operations include various types of
multiply-accumulate instructions, memory initialization with 0, 32-bit memory add,
unsigned subtraction used in integer divide, and block copies. (See [Appendix 1](#appendices) for source code examples.)
In order to transform an operation in a loop into a single repeatable instruction,
the operation must be the only operation in the loop. This means the operation’s
operands cannot be updated by separate instructions inside the loop body. Thus,
generating auto-incremented/decremented address operands is essential for generating
single repeatable instructions whenever the operation reads different data on each
iteration, which is the typical case.
#Memory Operands
Many instructions on the C2000 ALU take memory operands, meaning they can operate
directly on data in memory without having to load to and store back from registers.
##Data allocation for instructions with two memory operands
For those instructions taking 2 memory operands (see [Appendix 3](#instructions-with-two-memory-operands)),
the second memory operand (*XAR7) uses the program memory bus. The C2000 RAM
blocks are single-access (SRAM) and only support one access to the same memory
block in a single pipeline cycle. To avoid a pipeline stall, the data arrays
should be allocated to different physical RAM blocks. The physical RAM blocks
can be found in the memory map of the device in its data manual.
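The placement itself can be expressed with the **DATA_SECTION** pragma plus matching linker command file entries. This is a sketch only: the section names and memory ranges below are illustrative placeholders, not taken from any particular device's memory map.

```c
// Sketch: steer two arrays into different physical RAM blocks so that a
// dual-memory-operand instruction (e.g. RPT || MAC) never accesses the same
// block twice in one cycle. All names below are illustrative placeholders.
#pragma DATA_SECTION(coeffs, "blockA_data")
int coeffs[64];
#pragma DATA_SECTION(samples, "blockB_data")
int samples[64];

/* Corresponding linker command file fragment (illustrative; use the block
   names from your device's data manual):
   SECTIONS
   {
       blockA_data : > RAMLS4
       blockB_data : > RAMLS5
   }
*/
```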
#MAC-Style Instructions
Another high-performance feature of the C2000 architecture is the multiply-accumulate
instructions (see [Appendix 2](#list-of-mac-style-instructions) for the list of MAC-style instructions.)
These instructions combine multiply and add operations, with an optional shift
of the accumulated value, into a single instruction. They also have forms taking
direct memory operands and are available as single repeatable instructions.
Thus a multiply-accumulate over a data array can be performed as a
single repeatable instruction using auto-incremented memory operands.
(For instructions operating on two memory operands, see the note in the data allocation section above.)
A complication in generating these instructions is that the hardware instruction does
not correlate to the natural C-language construct. In C, a typical
multiply-accumulate operation performs a multiply and then adds the product
to an existing accumulation: a += b \* c. However, the C2000 MAC-style instructions
(with the exception of DMAC) operate by adding a previously-computed product in
the same cycle as performing the subsequent multiply: a += p; p’ = b \* c.
Therefore, the compiler must recognize the source-language construct and translate
the code to the proper form to generate the hardware instructions.
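The equivalence of the two forms can be demonstrated in portable C (an illustrative sketch; the compiler performs this rewrite internally when generating MAC-style instructions):

```c
/* The natural C form: multiply, then accumulate. */
static long mac_natural(const int *b, const int *c, int n)
{
    long a = 0;
    int i;
    for (i = 0; i < n; i++)
        a += (long)b[i] * c[i];
    return a;
}

/* The hardware-shaped form (all MAC-style instructions except DMAC): each
   iteration adds the product from the previous iteration while computing
   the next one, and the last product is folded in after the loop. */
static long mac_delayed(const int *b, const int *c, int n)
{
    long a = 0;
    long p = 0;
    int i;
    for (i = 0; i < n; i++) {
        a += p;                 /* add previously computed product */
        p = (long)b[i] * c[i];  /* compute next product in the same step */
    }
    return a + p;               /* fold in the final product */
}
```

For integer data the two forms always produce identical results; the rewrite only changes when each product is added.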
##Note on generating 16 x 16->32 MAC
The C language semantics for 16 \* 16 -> 32-bit multiplication
(such as the MAC instruction) require casting the multiply operands to 32-bits.
See the TI application report [spra683](https://www.ti.com/lit/an/spra683/spra683.pdf) for details.
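The effect of the cast can be shown with fixed-width types (a host-runnable sketch; int16_t models the C28x's 16-bit int, and the function names are ours):

```c
#include <stdint.h>

/* Widening the operands before the multiply requests a full 16 x 16 -> 32
   product, which maps to the MAC instruction. */
static int32_t mul_widening(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Multiplying first and widening afterwards only widens a result that has
   already been truncated to 16 bits. On the C28x, where int is 16 bits,
   a plain a * b behaves like this. */
static int32_t mul_truncating(int16_t a, int16_t b)
{
    return (int32_t)(int16_t)(a * b);   /* product truncated to 16 bits */
}
```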
##MACF32
The MACF32 is the only single-repeatable instruction on the FPU. Additionally,
it has a form taking two memory operands, and is the only FPU arithmetic instruction
that takes memory operands. However, this instruction performs two separate
multiply-accumulates and adds the results back together at the end, essentially
reassociating the adds. Since floating point addition is not naturally associative,
the compiler only generates this instruction when the **--fp_reassoc** flag is set
to “on” (which is the default setting). There can be a large difference in
precision between using the RPT || MACF32 and performing a serial
multiply-accumulate loop; if this variance is not acceptable, the **--fp_reassoc**
flag should be set to “off”.
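The reassociation can be modeled in portable C (our sketch, not compiler output): the reassociated form keeps two partial sums and combines them at the end, changing the order of the floating point additions relative to the serial loop.

```c
/* Serial multiply-accumulate: one running sum, strictly in order. */
static float macf_serial(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Reassociated form modeling RPT || MACF32: two partial sums over even and
   odd elements, combined at the end. Because float addition is not
   associative, results can differ slightly from the serial loop. */
static float macf_reassoc(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (n & 1)                       /* handle an odd trailing element */
        s0 += a[n - 1] * b[n - 1];
    return s0 + s1;
}
```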
***Figure 1*** shows how auto-incremented addressing mode, single repeatable instructions,
and MACF32 work together to provide a 71% performance improvement in a small
computational kernel. This benchmark was an example of compiler performance
improvements implemented in the C2000 CGT v.6.2 compiler. The performance was
measured as cycles on a cycle-accurate simulator.
![](./images/c28x-perf-figure1.jpg)
[[g Figure 1
]]
#DMAC w/RPT
The DMAC instruction performs multiply-accumulates on 2 adjacent signed integers
at the same time. It is a SIMD (single instruction, multiple data) operation which
performs dual 16 x 16 MAC operations in one instruction. It essentially operates
on pairs of adjacent ints in memory and accumulates into two longs. The data addresses must be
32-bit aligned. There are 3 levels of compiler support ranging from almost fully
automatic to intrinsics. The DMAC is a single repeatable operation with memory
operands: when there is nothing else in the loop, the compiler can generate a
RPT || DMAC. It is the most powerful computational instruction on the C2000 core.
##Almost fully automatic
The most fully automatic level of support requires that:
* the source arrays are known to be 32-bit aligned and the arrays are accessed via array indices
* the loop trip count is known to be **even**, either from a known trip count or use of **#pragma MUST_ITERATE**
```c
#pragma DATA_ALIGN(src1, 2)     // Align int arrays to 32 bits (2 16-bit words)
#pragma DATA_ALIGN(src2, 2)
int src1[N], src2[N];
{...}
int i;
long res = 0;
#pragma MUST_ITERATE(,,2) // Can specify loop trip count multiple
for (i = 0; i < N; i++) // Loop trip count N must be even
res += (long)src1[i] * src2[i]; // Arrays must be accessed via array indices
```
##Assertions for data address alignment
The **_nassert** intrinsic can be used to assert that the data addresses are aligned
to 32-bits. It is up to the programmer to ensure that only properly aligned data
addresses are used by the operation. The source data must still be accessed as
indexed arrays and the loop trip count must be known to be **even**, either from a
known trip count or use of **#pragma MUST_ITERATE**.
```c
int *src1, *src2; // src1 and src2 are pointers to int arrays of at least size N.
// User must ensure that both are 32-bit aligned addresses.
{...}
int i;
long res = 0;
_nassert((long)src1 % 2 == 0);
_nassert((long)src2 % 2 == 0);
// Can use #pragma MUST_ITERATE(,,2)
for (i = 0; i < N; i++) // Loop trip count N must be even
res += (long)src1[i] * src2[i]; // src1 and src2 must be accessed via array indices
```
##DMAC Intrinsic
The DMAC instruction can also be generated from a source-level intrinsic:
```c
void __dmac(long src1, long src2, long &accum1, long &accum2, int shift);
```
See the following two examples for using the intrinsic. Note that the user is
responsible for ensuring correct data alignment and loop count, and the two
partial accumulations must be added into a final result after the loop.
###Example 1
```c
int src1[N], src2[N]; // src1 and src2 are int arrays of at least size N
// User must ensure that both start on 32-bit
// aligned boundaries.
{...}
int i;
long res = 0;
long temp = 0;
_nassert((N % 2) == 0);
for (i = 0; i < (N / 2); i++)
__dmac(((long *)src1)[i], ((long *)src2)[i], res, temp, 0);
res += temp;
```
###Example 2
```c
int *src1, *src2; // src1 and src2 are pointers to int arrays of at
// least size N. User must ensure that both are
// 32-bit aligned addresses.
{...}
int i;
long res = 0;
long temp = 0;
long *ls1 = (long *)src1;
long *ls2 = (long *)src2;
_nassert((N % 2) == 0);
for (i = 0; i < (N / 2); i++)
__dmac(*ls1++, *ls2++, res, temp, 0);
res += temp;
```
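For reference, the computation that RPT || DMAC performs can be modeled in portable C. This is our illustrative model, not the intrinsic itself; int16_t and int32_t stand in for the C28x's 16-bit int and 32-bit long.

```c
#include <stdint.h>

/* Portable model of the dual 16 x 16 MAC: each step consumes an adjacent
   pair of elements from each array, accumulating even-index products into
   one accumulator and odd-index products into the other, then combines the
   two partial accumulations at the end (as in the examples above).
   Assumes an even element count n. */
static int32_t dual_mac_model(const int16_t *s1, const int16_t *s2, int n)
{
    int32_t res = 0, temp = 0;
    int i;
    for (i = 0; i < n; i += 2) {
        res  += (int32_t)s1[i] * s2[i];
        temp += (int32_t)s1[i + 1] * s2[i + 1];
    }
    return res + temp;   /* combine the two partial accumulations */
}
```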
#RPTB (FPU Only)
On devices with a floating point unit (FPU), the repeat block (RPTB) instruction
can eliminate all branching overhead for a loop. This is useful for any loop that
meets the requirements, not just floating point computations. It can drastically
improve performance for smaller loops in which the branching overhead contributes
an even larger percentage of loop runtime. The RPTB instruction adds only 1 or 4
cycles overhead (depending on whether an immediate value or a register is used
for the repeat count), regardless of the number of loop iterations.
The two requirements for generating a RPTB are that the loop must have no internal
control flow (no nested loops or conditional statements), and that it must fall
between a minimum and maximum instruction word count (9 and 127). Additionally,
code must be compiled with **--float_support=fpu32** and may require an optimization
setting of 2 or higher.
For small loops that don’t meet the minimum size threshold, the compiler will insert
NOPs if it is still profitable to do so at **--opt_for_speed** levels of 1 and 2,
and at levels of 3 and above the compiler will attempt to perform loop unrolling.
The **#pragma UNROLL** is also available to allow developers to specify that specific
loops should be unrolled. Loop unrolling is discussed further in the [Unrolling](#unrolling) section.
***Figure 2*** shows a small computational kernel for which loop unrolling enables
RPTB generation, improving performance by 52%.
![](./images/c28x-perf-figure2.jpg)
[[g Figure 2
]]
#Unrolling
Loop unrolling is a compiler transformation in which the code in the loop body
is replicated some number of times and the loop iteration count is decreased
accordingly. The primary advantage of loop unrolling is reducing loop branch
overhead, which is now amortized across the number of loop code replications.
See ***Figure 3*** for an example.
![](./images/c28x-perf-figure3.jpg)
[[g Figure 3
]]
As discussed in the previous section, on C2000 devices with FPU, the branch
overhead can be eliminated entirely in the case of enabling RPTB generation
(see Figure 2.) Loop unrolling can also enable other optimizations by providing
increased code context. For example, unrolling may enable the compiler to fill
delay slots or form parallel instructions by exposing independent computations
from multiple loop iterations in the same loop body.
However, loop unrolling can also negatively impact performance. Loop unrolling
increases code size. Additionally, it may increase register pressure, resulting
in costly register spilling inside the loop. On devices with FPU, for a loop
already meeting the RPTB requirements, loop unrolling would not be likely to
improve performance and could result in surpassing the maximum size threshold
for a RPTB. To complicate matters, loop unrolling is performed early on in the
Optimizer, while machine code is not generated until later. Therefore, at the
time of unrolling, the compiler doesn’t know the loop size and cannot directly
determine whether a RPTB can be generated.
Due to the increase in code size, the compiler turns on loop unrolling at
**--opt_for_speed** levels of 3 or greater, when performance has been prioritized
over code size. Because the compiler doesn’t know which loops are most important
for overall application performance, users may instead choose to use the **#pragma UNROLL**
to specify specific loops that should be unrolled. The syntax is
**#pragma UNROLL(n)**, where n is the number of copies of the original loop body
inside the transformed loop. See the compiler guide for more information.
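A sketch of what **#pragma UNROLL(2)** asks the compiler to do, written out by hand in plain C (the function names are ours; the unrolled version assumes an even trip count for simplicity):

```c
/* Rolled loop: one branch per element. */
static int sum_rolled(const int *x, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unrolled by 2: two copies of the body per iteration, halving the branch
   count. The two statements are independent, which can also help the
   compiler fill delay slots or form parallel instructions. */
static int sum_unrolled2(const int *x, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i += 2) {
        sum += x[i];
        sum += x[i + 1];
    }
    return sum;
}
```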
#Restrict Keyword
The restrict keyword is a source-level construct that the application developer
can use to convey significant information to the compiler, enabling transformations
that might otherwise constitute a safety risk.
```c
Usage: float *restrict ptr;
```
In the above usage example, the restrict keyword is used to tell the compiler
that for its lifetime, the pointer ***ptr*** points to data not referred to by any
other name. Therefore, in the following declaration:
```c
float *restrict ptr;
float f;
float *ptr2;
```
***ptr*** can be known NOT to point to ***f*** or to the same data to which ***ptr2*** points;
however, ***ptr2*** might point to ***f***.
Knowing that memory is not aliased enables the compiler to perform certain
optimizations such as instruction reordering and eliminating unnecessary
memory accesses. In the following computational kernel, the use of the restrict
keyword enables the C2000 compiler to generate much better code, resulting in a
performance improvement of 58%.
![](./images/restrict1.jpg)
[[g Restrict Example Slide 1
]]
![](./images/restrict2.jpg)
[[g Restrict Example Slide 2
]]
![](./images/restrict3.jpg)
[[g Restrict Example Slide 3
]]
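A minimal restrict example (our own sketch, not the kernel from the figures above):

```c
/* With restrict, the compiler may assume dst and src never overlap, so it
   can keep loaded values in registers, reorder the memory accesses, and
   drop redundant reloads that aliasing would otherwise force. */
static void scale_add(float *restrict dst, const float *restrict src,
                      float k, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

The caller is responsible for honoring the promise: passing overlapping pointers to a restrict-qualified function is undefined behavior.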
#Inlining
Inlining functions with the **inline** keyword may increase performance by eliminating
function call and return overhead. However, if the inlined function is called from
many places in a program, code size may increase significantly, since the function
body is replicated at every call site.
#Saturation in C
In order to perform efficient saturation in C on the C28x, the ternary operator
should be used.
Do not perform saturation this way:
```c
if( sum > max )
sum = max;
if( sum < min )
sum = min;
```
Rather, perform saturation in this way:
```c
sum = (sum > max) ? max : sum;
sum = (sum < min) ? min : sum;
```
The first method compiles to:
```asm
CMPL ACC,@_max
BF $C$L1,LEQ
MOVL ACC,@_max
$C$L1: CMPL ACC,@_min
BF $C$L2,GEQ
MOVL ACC,@_min
$C$L2:
```
The second method compiles to the much more efficient:
```asm
MINL ACC,@_max
MAXL ACC,@_min
```
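Both forms compute the same value, as the following host-runnable sketch demonstrates (function names are ours); only the generated C28x code differs.

```c
/* Branching clamp: compiles to compares and branches on C28x. */
static long clamp_if(long sum, long min, long max)
{
    if (sum > max)
        sum = max;
    if (sum < min)
        sum = min;
    return sum;
}

/* Ternary clamp: compiles to the branch-free MINL/MAXL pair on C28x. */
static long clamp_ternary(long sum, long min, long max)
{
    sum = (sum > max) ? max : sum;
    sum = (sum < min) ? min : sum;
    return sum;
}
```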
#TMU
The Trigonometric Math Unit (TMU) provides hardware support for floating point
division, sqrt, sin, cos, atan, and atan2. The hardware support delivers superior
performance; however, as the results may vary from the standard library
implementations of these operations due to algorithmic differences, the library
calls are not replaced by default.
When TMU support is enabled and the **--fp_mode=relaxed** option is selected, the
compiler will automatically replace library calls to these operations with
hardware instructions. If the floating point mode is strict (the default
setting, **--fp_mode=strict**), the compiler will issue performance advice if it
encounters any opportunities for replacing library calls with TMU instructions.
Advice will be issued once per operation type per file.
Alternatively, intrinsics are available for the TMU operations listed above.
See the intrinsics table in the compiler guide.
See [Accelerators: Enhancing the Capabilities of the C2000 MCU Family](https://www.ti.com/lit/an/spry288a/spry288a.pdf)
for more information and performance comparisons.
#Miscellaneous
* Group global variables into structures wherever possible to potentially enable
the compiler to address the variables more efficiently.
* Pass parameters via pointers to structures or arrays in function calls. Use this method
if more than two variables are passed to a function.
* Break up structures or arrays and keep them small. Keep structures to less than 64
words if possible for best efficiency.
* Put the most often used variables at the beginning of structures. This may allow the compiler
to use more efficient pointer indexing modes for the first 8 words.
* Organize variables in structures in the order they are used. This may enable more
efficient increment or decrement pointer modes.
* Avoid declaring too many local variables. Keep the local frame less than 64 words
or else declare the least used local variables as ***static*** if possible.
* Declare local variables and parameters in the same order in which they are used.
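As an illustration of the first two guidelines (a hypothetical sketch; the struct and field names are ours):

```c
/* Grouping related globals in one struct lets the compiler load the
   struct's base address once and reach each member with a small offset.
   Members are ordered by frequency of use, most frequent first. */
struct ctrl_state {
    int mode;       /* most frequently accessed: first */
    int count;
    long accum;     /* least frequently accessed: last */
};

static struct ctrl_state state;

/* Passing the struct by pointer keeps the call cheap regardless of how
   many logical parameters it carries. */
static void step(struct ctrl_state *s)
{
    s->count++;
    s->accum += s->mode;
}
```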
#Appendices
##Source code examples of single repeatable instructions
```c
/*****************************************************/
/* Examples of C2000 RPT SINGLE INSTRUCTIONS */
/* REQUIRES --opt_level=2 [or greater] */
/*****************************************************/
float fresult;
long result;
int N, M;
int array[100];
float farray[100];
// Generates RPT || MAC (requires --unified_memory)
void foo()
{
int i, j;
long sum = 0;
for (i = 0; i < N; i++)
for (j=0; j < M; j++)
sum += (long)array[j] * (long)array[j - i];
result = sum;
}
// Generates RPT || MACF32 (requires --float_support=fpu32 --unified_memory)
void foo2()
{
int i, j;
float sum = 0;
for (i = 0; i < N; i++)
for (j=0; j < M; j++)
sum += farray[j] * farray[j-i];
fresult = sum;
}
// Generates RPT || MOV #0
void foo3()
{
int i;
for (i = 0; i < N; i++)
array[i] = 0;
}
// Generates RPT || ADDL
void foo4(long *x)
{
int i;
long sum = 0;
for (i = 0; i < N; i++)
sum += x[i];
result = sum;
}
// Generates RPT || SUBCUL
void foo5(unsigned long n, int b)
{
n /= b;
result = n;
}
// Generates RPT || PREAD (requires --unified_memory)
void foo6(int *x)
{
int i = 0;
for (i = 0; i < N; i ++)
array[i] = x[i];
}
```
##List of MAC-style instructions
* MAC
* MPYA
* MPYS
* SQRA
* SQRS
* IMACL
* IMPYAL
* IMPYSL
* QMACL
* QMPYAL
* QMPYSL
* DMAC
* MACF32 (FPU only)
##Instructions with two memory operands
The following instructions use the program memory bus for a second memory access via *XAR7:
* MAC
* IMACL
* QMACL
* DMAC
* MACF32 (FPU only)
* PREAD
* PWRITE (not currently generated by the compiler)