6.3. Identifying Software Pipelining Failures and Performance Issues

The subsections that follow explain situations that may prevent loops from being optimized.

6.3.1. Issues that Prevent a Loop from Being Software Pipelined

The following situations may prevent a loop from being eligible for software pipelining. These can be detected by examining the assembly output and the Software Pipeline Information in the comment block.

Loop contains function calls: Although a software pipelined loop can contain intrinsics, it cannot contain function calls. This includes code that will result in a call to un-inlinable run-time support routines, such as floating-point division. You may attempt to inline small, user-defined functions. See the section Function Calls and Inlining for more information.
Loop contains control code: In some cases, the compiler cannot remove all of the control flow from if-then-else statements or "?:" statements. You may attempt to optimize such situations by using if statements only around code that updates memory and around variables whose values are calculated inside the loop and used only outside the loop.
Conditionally incremented loop control variable is not software pipelined. If a loop contains a loop control variable that is conditionally incremented, the compiler cannot software pipeline the loop. For example:

for (i = 0; i < x; i++)
{
    . . .
    if (b > a)
        i += 2;
}

Too many instructions. Oversized loops typically cannot be scheduled because of the large number of registers they need. Some large loops also take an excessive amount of time to compile. A potential solution is to break the loop into multiple smaller loops.
Uninitialized iteration counter. The loop counter may not have been set to an initial value.
Cannot identify iteration counter. The loop control is too complex. Try to simplify the loop.
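
As a minimal sketch of the advice on function calls and control code above, the loop below calls only a small helper that the compiler can inline, computes its value unconditionally, and uses an if statement only around the memory update and around a counter that is read only after the loop. The names scale and scale_above_limit are hypothetical and chosen for illustration. Whether the compiler actually software pipelines this loop depends on the target and options, so check the Software Pipeline Information comment block.

```c
#include <stdint.h>

/* Hypothetical helper: small enough to be inlined, so the loop body
 * below does not have to contain an actual function call. */
static inline int32_t scale(int32_t x, int32_t gain)
{
    return x * gain;
}

/* Scales every input above 'limit' and stores it to 'out'.
 * Returns the number of elements stored. */
int scale_above_limit(const int32_t *restrict in, int32_t *restrict out,
                      int32_t gain, int32_t limit, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        /* The computation is unconditional. */
        int32_t v = scale(in[i], gain);

        /* The if guards only the memory update and a variable that is
         * calculated inside the loop and used only outside it. */
        if (in[i] > limit) {
            out[i] = v;
            count++;
        }
    }
    return count;
}
```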

6.3.2. Software Pipeline Failure Messages

Possible software pipeline failure messages provided by the compiler include the following:

Address increment too large. During software pipelining, the compiler allows reordering of all loads and stores occurring from the same array or pointer. This maximizes flexibility in scheduling. Once a schedule is found, the compiler returns and adds the appropriate offsets and increments/decrements to each load and store. Sometimes, the loads and/or stores end up being offset too far from each other after reordering (the limit for standard load pointers is +/- 32). If this happens, try to restructure the loop so that the pointers are closer together or to rewrite the pointers to use precomputed register offsets.
Cannot allocate machine registers. See the section Cannot Allocate Machine Registers.
Cycle Count Too High. Not Profitable. In rare cases, the initiation interval (ii) of a software pipelined loop is higher than the per-iteration cycle count of the non-pipelined loop. In this case it is more efficient to execute the non-software-pipelined loop. A possible solution is to split the loop into multiple loops or reduce the complexity of the loop.
Did not find schedule. Sometimes the compiler simply cannot find a valid software pipeline schedule at a particular initiation interval. A possible solution is to split the loop into multiple loops or reduce the complexity of the loop.
Iterations in parallel > max. iteration count. Not all loops can be profitably pipelined. Based on the available information for the largest possible iteration count, the compiler estimates that it will always be more profitable to execute a non-software-pipelined version than to execute the pipelined version, given the schedule found at the current initiation interval. A possible solution may be to unroll the loop completely.
Iterations in parallel > min. iteration count. Based on the available information on the minimum iteration count, it is not always safe to execute the pipelined version of the loop. Normally, a redundant loop would be generated. However, in this case, redundant loop generation has been suppressed via the --opt_for_speed=3 or lower option. A possible solution is to add the MUST_ITERATE pragma to give the compiler more information on the minimum iteration count of the loop.
Register is live too long. Sometimes the compiler finds a valid software pipeline schedule, but one or more of the values is live too long. The lifetime of a register is the number of cycles between the cycle in which a value is written into the register and the last cycle in which that value is read by another instruction. By definition, a variable can never be live longer than the ii of the loop, because the next iteration of the loop overwrites that value before it is read. After this message, the compiler provides a detailed description of which values are live too long:
```
ii = 11 Register is live too long
|72| -> |74|
|73| -> |75|
```
The numbers 72, 73, 74, and 75 in this example correspond to line numbers and can be mapped back to the offending instructions. The compiler aggressively attempts to both prevent and fix live-too-long conditions. Techniques you can use to resolve them have a low probability of success and are therefore not discussed in this document. In addition, the compiler can usually find a successful software pipeline schedule at a higher initiation interval (ii).

6.3.3. Software Pipelining Performance Issues

You can find the following issues by examining the assembly source and the Software Pipeline Information comment block. Potential solutions are given for each condition.

6.3.3.1. Large Outer Loop Overhead in Nested Loop

If the inner loop count of a nested loop is relatively small, the time to execute the outer loop can become a large percentage of the total execution time. Where this seems to degrade the overall performance of the loop nest, two approaches can be tried. First, if there are not too many instructions in the outer loop, you can hint to the compiler that it should coalesce the loop nest: try the COALESCE_LOOP pragma and check the relative performance of the entire loop nest. Second, if the COALESCE_LOOP pragma does not help and the number of iterations of the inner loop is small and does not vary, fully unrolling the inner loop by hand may improve performance, because the compiler may then be able to software pipeline the outer loop.

See the C7000 Optimizing C/C++ Compiler User's Guide (SPRUIG8) for information about pragmas.
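
The hand-unrolling approach can be sketched as follows, assuming a hypothetical 4-tap filter whose inner trip count (TAPS) is small and fixed. The function names and the filter itself are illustrative only. Both versions compute the same result; in the second, the former outer loop is the only loop left, which may make it a candidate for software pipelining.

```c
#include <stdint.h>

#define TAPS 4  /* hypothetical small, fixed inner trip count */

/* Original form: a short inner loop nested inside the outer loop.
 * x must hold at least n + TAPS - 1 elements. */
void fir_nested(const int16_t *restrict x, const int16_t *restrict h,
                int32_t *restrict y, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t acc = 0;
        for (int k = 0; k < TAPS; k++) {
            acc += (int32_t)x[i + k] * h[k];
        }
        y[i] = acc;
    }
}

/* Inner loop fully unrolled by hand: only one loop remains. */
void fir_unrolled(const int16_t *restrict x, const int16_t *restrict h,
                  int32_t *restrict y, int n)
{
    for (int i = 0; i < n; i++) {
        y[i] = (int32_t)x[i]     * h[0]
             + (int32_t)x[i + 1] * h[1]
             + (int32_t)x[i + 2] * h[2]
             + (int32_t)x[i + 3] * h[3];
    }
}
```

Compare the Software Pipeline Information for both versions before committing to the unrolled form, since the benefit depends on the loop.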

6.3.3.2. Loop Carried Dependency Bound is Larger than the Partitioned Resource Bound

If you see a loop carried dependency bound that is higher than the partitioned resource bound, you likely have one of two problems. First, the compiler may think there is a memory dependence from a store to a subsequent load. See the section Use of the restrict Keyword in this document; the "Memory Dependencies" section of the TMS320C6000 Programmer's Guide (SPRU198) also has more information. Second, a value computed in one iteration of the loop may be used in the next iteration. In this case, the only option is to try to eliminate the flow of information from one iteration to the next, thereby making the iterations more independent of each other.
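
For the first problem, a minimal sketch of how the restrict keyword might be applied is shown below. The function add_offset is hypothetical. Without restrict, the compiler has to assume that a store to dst[i] could feed a later load from src, which creates a store-to-load dependence carried across iterations. The qualifier is a promise from the programmer that the two pointers never refer to overlapping memory; if that promise is broken, the behavior is undefined.

```c
#include <stdint.h>

/* 'restrict' tells the compiler that src and dst do not overlap, so a
 * store to dst[i] cannot affect any later load from src[]. */
void add_offset(const int32_t *restrict src, int32_t *restrict dst,
                int32_t offset, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] + offset;
    }
}
```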

6.3.3.3. Two Loops are Generated, One Not Software Pipelined / Duplicate Loop

If you see the message "Duplicate Loop Generated" in the Software Pipeline Information comment block, or you notice a second version of the loop that is not software pipelined, it may mean that executing the software pipelined version is illegal when the iteration count of the loop is too low. To generate only the software pipelined version, the compiler must prove that the minimum iteration count is high enough to always safely execute the pipelined version. If the minimum number of iterations is known, using the MUST_ITERATE pragma to give the compiler this information may help eliminate the duplicate loop.
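
A minimal sketch of the pragma in use follows, assuming a hypothetical dot_product function whose callers always pass a trip count that is at least 8 and a multiple of 8. The pragma takes a minimum, an optional maximum, and a multiple; here the maximum is left empty. The preprocessor guard is only there so that the same file also builds with non-TI compilers, which do not recognize the pragma.

```c
#include <stdint.h>

/* Assumption for this sketch: n >= 8 and n is a multiple of 8.
 * Violating what MUST_ITERATE promises gives undefined results. */
int32_t dot_product(const int16_t *restrict a, const int16_t *restrict b,
                    int n)
{
    int32_t sum = 0;

#if defined(__TI_COMPILER_VERSION__)
#pragma MUST_ITERATE(8, , 8)
#endif
    for (int i = 0; i < n; i++) {
        sum += (int32_t)a[i] * b[i];
    }
    return sum;
}
```

After rebuilding, check the Software Pipeline Information comment block to confirm whether the duplicate loop is gone.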

6.3.3.4. Cannot Allocate Machine Registers

If the software pipeline feedback for the inner loop of interest says "Cannot allocate machine registers", the compiler is having trouble mapping all of the necessary variables, values, and intermediate results in the inner loop to the registers available in the CPU. This message likely indicates that the loop has too many values that need to be available at the same time.

In some cases, the compiler can successfully software pipeline a loop at a higher initiation interval (ii).

There are unfortunately only a few techniques to alleviate register allocation issues.

Use the SE or SA. If the Streaming Engine or Streaming Address Generator is not already being used, using one or both may help alleviate some register pressure. See the sections Automatic Use of the Streaming Engine and Streaming Address Generator, Streaming Engine, and Streaming Address Generator for more information.
Make the loop less complex. Reducing the number of values that must be available at the same time eases register allocation.
Split the loop. If possible, you can try to split the loop into two or more loops.
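
Splitting a loop can be sketched as follows. The two functions are hypothetical and deliberately tiny; a real candidate would have a much larger body. The fused loop keeps the values for both outputs live in the same iteration, while each of the split loops has fewer values to keep in registers at once. Splitting is only valid when the two halves do not depend on each other, and it costs an extra pass over the inputs.

```c
#include <stdint.h>

/* One loop computes both outputs. */
void sum_and_product_fused(const int32_t *restrict a,
                           const int32_t *restrict b,
                           int32_t *restrict sum,
                           int32_t *restrict prod, int n)
{
    for (int i = 0; i < n; i++) {
        sum[i]  = a[i] + b[i];
        prod[i] = a[i] * b[i];
    }
}

/* The same work split into two simpler loops. */
void sum_and_product_split(const int32_t *restrict a,
                           const int32_t *restrict b,
                           int32_t *restrict sum,
                           int32_t *restrict prod, int n)
{
    for (int i = 0; i < n; i++) {
        sum[i] = a[i] + b[i];
    }
    for (int i = 0; i < n; i++) {
        prod[i] = a[i] * b[i];
    }
}
```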

6.3.3.5. There are Memory Bank Conflicts

If the compiler generates two memory accesses in one cycle and those accesses fall within the same memory block in the cache hierarchy, a memory bank stall can occur. To avoid this degradation, skew the starting addresses of the two objects so that accesses to them start in different memory blocks. One way to accomplish this is with the DATA_MEM_BANK pragma or the memalign memory allocation function. The DATA_MEM_BANK pragma only works for global variables. Techniques for other objects can involve using the DATA_ALIGN pragma (or the memalign function) and differential padding (empty space) at the beginning of arrays. See the C7000 Optimizing C/C++ Compiler User's Guide (SPRUIG8) for more information about the DATA_MEM_BANK pragma and DATA_ALIGN pragma.
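
The differential padding idea can be sketched in portable C as shown below. This is an illustration only: ALIGN_BYTES and SKEW_BYTES are assumed values and do not describe the actual memory block geometry of any device, and the helper names are hypothetical. With the TI compiler, the aligned base would normally come from the DATA_ALIGN pragma or memalign; here the alignment is done by hand so the example does not depend on either.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 128u  /* assumed alignment granularity (illustrative) */
#define SKEW_BYTES   64u  /* assumed padding before the second array */
#define N           256   /* elements per array */

/* Backing storage with room for both arrays, alignment slack, and skew. */
static int32_t pool[2 * N + (2 * ALIGN_BYTES + SKEW_BYTES) / sizeof(int32_t)];

/* Rounds p up to the next multiple of 'align' (a power of two). */
static unsigned char *align_up(unsigned char *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t rounded = (addr + (align - 1)) & ~(uintptr_t)(align - 1);
    return p + (rounded - addr);
}

/* First array: starts on an ALIGN_BYTES boundary. */
int32_t *array_a(void)
{
    return (int32_t *)align_up((unsigned char *)pool, ALIGN_BYTES);
}

/* Second array: starts SKEW_BYTES past an ALIGN_BYTES boundary, so its
 * accesses are offset relative to those of the first array. */
int32_t *array_b(void)
{
    unsigned char *after_a = (unsigned char *)(array_a() + N);
    return (int32_t *)(align_up(after_a, ALIGN_BYTES) + SKEW_BYTES);
}
```

The right skew depends on the device's memory organization, so take the value from the device documentation instead of the placeholder used here.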