4.6. Use of the restrict Keyword¶
Careful use of the restrict keyword is a common and important technique when programming the C7000 DSP core. Using the restrict keyword can dramatically improve the performance of an inner loop and is often required to achieve acceptable performance results of an inner loop. However, the restrict keyword imposes a significant responsibility for the user, because misuse of this technique can lead to incorrect results.
This section discusses the restrict keyword and how the use of the restrict keyword is often necessary for optimal performance.
4.6.1. The Restrict Keyword¶
Many common digital signal processing loops contain one or more
load operations, some computation, and then a store operation.
Typically, the loads read from an array and the stores
store values to a different array. If the compiler cannot prove that the
arrays are separate (or at least the accesses do not overlap), the compiler
must be conservative and assume that the value written by
a store in iteration i
may be read by a load in
iteration i+1
or i+2
, etc. The compiler cannot safely issue
the load until the store has completed, which prevents the from
aggressively overlapping the loop iterations, which can lead to poorly
performing inner loops. Therefore, it is important to tell the compiler
when the memory accesses cannot refer to the same location; that is, that the
object/arrays pointed to by those pointers do not overlap.
We can do this with the use of the restrict
keyword. This
keyword tells the compiler that throughout the lifetime of the
variable (array name or pointer name used to access the array),
accesses to that object or array will only be made through that
array name or pointer name.
Note
This description of the restrict keyword is conservatively simplified. Following this conservatively safe description can never cause incorrect execution, and it is easier to learn and use. The precise definition of the restrict keyword in the C standard is a bit more flexible, but it is subtle and difficult to understand, and situations where the additional utility is appropriate are rarely seen in actual practice. If you wish to learn more about the restrict keyword, see the C standard or Demystifying The Restrict Keyword.
In the load-compute-store case described above, use of the restrict keyword allows the compiler to conclude that the store to memory will not write to the same place where the next iterations' loads will read from. Thus, memory operations in successive iterations can be overlapped, and this often allows the generated code to run faster.
The performance of a software-pipelined loop is dominated by how tightly the compiler is able to overlap successive iterations of the loop. In general, the smaller the initiation interval (ii), the tighter and faster the loop will be, because we're keeping more functional units busy simultaneously, maximizing work performed per cycle.
The C function below shows the use of the restrict keyword. After this C function is compiled, the resulting Software Pipeline Information comment block in the generated assembly code will show that when the restrict keyword is used, the initiation interval is two cycles. Without the restrict keyword, the initiation interval is 12 cycles. (The reader is encouraged to review the section Loop-Carried Dependencies for an explanation why a metric called the loop-carried dependency bound is reduced in this case and thus why the initiation interval is reduced.)
Note that the use of the restrict keyword is legal if a, b, and out do not overlap. If they do overlap, the use of the restrict keyword may be incorrect and can lead to incorrect code being generated by the compiler. Therefore, it is critically important that the programmer only use the restrict keyword when it is safe to do so.
void weighted_sum(int * restrict a, int * restrict b, int * restrict out, int weight_a, int weight_b, int n)
{
#pragma UNROLL(1)
#pragma MUST_ITERATE(1024, ,32)
for (int i = 0; i < n; i++)
{
out[i] = a[i] * weight_a + b[i] * weight_b;
}
}
The Texas Instruments C7000 C/C++ Compiler allows the restrict keyword to be used in both C and C++ modes, despite the restrict keyword not being part of the C++ standards.
Note
The use of the restrict keyword is a promise from the programmer. If you use the restrict keyword where the promise isn't true, the compiler will often produce code with undefined behavior--meaning that the code generated by the compiler will produce an incorrect result.
For more information on the restrict keyword, see "The Restrict Keyword" section in the C7000 C/C++ Optimizing Compiler User's Guide.
4.6.2. Compiler-Emitted Restrict Advice¶
In certain situations, the C7000 C/C++ compiler may detect a loop-carried dependency that is negatively affecting performance. The compiler will emit a performance remark when it detects that an inner loop that is a candidate for software pipelining has a loop-carried dependency that is the result of unknown aliasing between an input pointer (a load) and an output pointer (a store). In these situations, the compiler will emit a performance remark to the console noting the issue, where it occurs, and where the addition of the retrict keyword may help. It is up to the user to determine if the addition of the restrict keyword is legal in that particular situation.
4.6.3. Run-Time Alias Disambiguation¶
Two pointers are aliases of each other if they refer to the same object.
When two pointers refer to the same object in a loop, a "loop-carried" dependence may result, and this can negatively affect the performance of the loop.
Under certain limited circumstances, the compiler may generate two loops: one that assumes two pointers are aliases of each other and one that assumes the two pointers are not aliases of each other. It generates a run-time check to determine if the two pointers are aliases that then branches to the appropriate version of the loop. This optimization is called run-time alias disambiguation. The advantage is that the loop that assumes that an alias does not exist usually can be software pipelined at a much smaller initiation interval, leading to improved performance of the loop.
The compiler cannot always perform run-time alias disambiguation due to considerations that are too technical to describe here. In addition, certain further optimizations such as nested loop coalescing are inhibited when the compiler produces two different loops with a run-time alias check, so it is best to use the restrict keyword whenever legally possible.
During run-time alias disambiguation, the compiler may make a copy of the loop if the transformation is deemed to be profitable and the -mf level is 4 or above (i.e., default/-mf4 or -mf5 is used). This transformation also adds a small amount of control logic. Adding a copy of the loop and the associated control logic will increase code size.
Users that observe undesired code size growth may opt to use
--opt_for_speed=3
/-mf3
or
lower on those files that are affected. Run-time alias disambiguation is
not performed by the compiler at --opt_for_speed=3
or lower. Also, users
may want to investigate the possibility of using the restrict keyword on the
pointers that are used in the affected loops, in order to prevent the compiler
from needing to perform run-time alias disambiguation.
Refer to the C7000 Optimization Guide for more information on the use of the
--opt_for_speed
/-mf
option to control the code-size and performance
tradeoff on C7000.
For further discussion and details regarding identifying and eliminating loop-carried dependencies, consult the following references:
4.6.4. Loop-Carried Dependencies¶
The performance of a software-pipelined loop is dominated by how tightly the compiler is able to overlap successive iterations of the loop. In general, the smaller the initiation interval (ii), the tighter and faster the loop will be, because we're keeping more functional units busy simultaneously, maximizing work performed per cycle. Ideally, the ii is no larger than the partitioned resource bound.
When the compiler is not able to optimally overlap successive iterations of the loop, performance suffers. In almost all cases, the culprit is a loop-carried dependency. A loop-carried dependency can prevent the execution of iteration i+1 from overlapping with iteration i as well as it otherwise could. When a loop-carried dependency gets too big, it can force the compiler to use a larger ii, hurting the performance of the loop.
When a loop-carried dependency contains too many cycles (relative to the partitioned resource bound), it negatively affects the performance of a loop.
A loop-carried dependency arises because there is a cycle in the ordering constraints (dependencies) for some set of instructions in a loop. In other words, there is a flow of data from one iteration to a later iteration, meaning that a future iteration is dependent on data that is generated from a previous iteration.
Out of all these cycles in the loop (loop-carried dependences), the largest
loop-carried dependency cycle is called the loop-carried dependency bound.
The compiler will state what the loop-carried dependency bound is for
any software pipelined loop. This value can be found in the
SOFTWARE PIPELINE INFORMATION
comment block that appears next to the
software pipelined loop in the assembly code.
To reduce or eliminate a problematic loop-carried dependency, one must identify the cycle and then find a way to shorten or break it. Let's walk through an example.
The following example shows a loop with a problematic loop-carried dependency.
void weighted_sum(int * a, int * b, int * out, int weight_a, int weight_b, int n)
{
#pragma UNROLL(1)
#pragma MUST_ITERATE(1024, ,32)
for (int i = 0; i < n; i++)
{
out[i] = a[i] * weight_a + b[i] * weight_b;
}
}
The compiler-generated assembly code for this example (shown below) shows that the Loop Carried Dependency Bound in the Software Pipeline Information section of the assembly code is 12 cycles.
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : weighted_vector_sum_v3.cpp
;* Loop source line : 10
;* Loop opening brace source line : 11
;* Loop closing brace source line : 13
;* Known Minimum Iteration Count : 1024
;* Known Max Iteration Count Factor : 32
;* Loop Carried Dependency Bound(^) : 12
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound : 2 (pre-sched)
;*
;* Searching for software pipeline schedule at ...
;* ii = 12 Schedule found with 2 iterations in parallel
. . .
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* ||$C$C51||:
;* 0 TICK ; [A_U]
;* 1 LDW .D2 *D1++(4),BM0 ; [A_D2] |12| ^
;* || LDW .D1 *D2++(4),BM1 ; [A_D1] |12| ^
;* 2 NOP 0x5 ; [A_B]
;* 7 MPYWW .M2 BM2,BM0,BL0 ; [B_M2] |12| ^
;* || MPYWW .N2 BM3,BM1,BL1 ; [B_N2] |12| ^
;* 8 NOP 0x3 ; [A_B]
;* 11 ADDW .L2 BL1,BL0,B0 ; [B_L2] |12| ^
;* 12 STW .D1X B0,*D0++(4) ; [A_D1] |12| ^
;* || BNL .B1 ||$C$C51|| ; [A_B] |10|
;* 13 ; BRANCHCC OCCURS {||$C$C51||} ; [] |10|
;*----------------------------------------------------------------------------*
The final software pipelined initiation interval of a software pipelined loop can be no smaller than either the Loop Carried Dependency Bound or the Partitioned Resource Bound. In other words, when the loop-carried dependency bound is greater than the partitioned resource bound, the software pipelined loop could likely run faster if the loop-carried dependency bound is eliminated or reduced. Therefore, in this example, because the partitioned resource bound is 2 and the loop-carried dependency bound is 12, this code has an issue that should be investigated.
To identify the problem, we need to look at the instructions
involved in the loop-carried dependency. These instructions are
marked with the caret "^
" symbol in the comment block in
the compiler-generated assembly file. Notice that the two load
instructions and the store instruction are marked with a caret.
Because the load instructions appear first with a caret and the
store instruction appears last with a caret, it's likely that
the compiler thinks that the load instructions (LDW) depend
on values stored (via the STW instruction) from a previous iteration.
This is likely because the
compiler cannot prove the stores are writing to an area of
memory that is independent of the location from which the load
instructions are loading values. In absence of information
about the locations of the pointers, arrays and address access
patterns, the compiler must assume that successive iterations
may load from the location of the previous iteration's stores.
As previously discussed, careful use of the restrict keyword may help.
Loop-carried dependencies can be grouped into two categories, memory and data. The previous example was a memory loop-carried dependence. If a value travels from one iteration to a subsequent iteration in a register, it is called a data loop-carried dependency. In the case of a data loop-carried dependency, the restrict keyword will not help. Eliminating data loop-carried dependences can be difficult, and involve finding ways to eliminate the data dependence from one iteration to another.
See Hand-Tuning Loops and Control Code on the TMS320C6000 (SPRA666) for more about loop-carried dependencies and how to identify them.