4.3. Automatic Use of the Streaming Engine and Streaming Address Generator
The compiler can automatically use the special C7000 hardware features called
the Streaming Engine (SE) and the Streaming Address Generator (SA) if the
--auto_stream=no_saving option is used on C7100 and C7120 devices or the
--auto_stream=saving option is used on C7504, C7524, and later devices.
More information on the Streaming Engine and the Streaming Address Generator can be found in the sections Streaming Engine and Streaming Address Generator, and in the C7000 C/C++ Optimizing Compiler User's Guide (SPRUIG8).
If the weighted_vector_sum_v2.cpp example in section
Software Pipelining: Performance and Code-Size Tradeoff is compiled with the
--auto_stream=no_saving option, the following software pipeline information
block is generated. (The generated assembly in this example is for C7100.)
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : weighted_vector_sum_v2.cpp
;* Loop source line : 7
;* Loop opening brace source line : 7
;* Loop closing brace source line : 9
;* Loop Unroll Multiple : 32x
;* Known Minimum Iteration Count : 32
;* Known Max Iteration Count Factor : 1
;* Loop Carried Dependency Bound(^) : 2
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound : 2 (pre-sched)
;*
;* Searching for software pipeline schedule at ...
;* ii = 2 Schedule found with 4 iterations in parallel
. . .
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* ||$C$C36||:
;* 0 TICK ; [A_U]
;* 1 VMPYWW .N2 VBM1,SE0++,VBL0 ; [B_N2] |8| ^
;* || VMPYWW .M2 VBM0,SE1++,VBL1 ; [B_M2] |8| ^
;* 2 VMPYWW .N2 VBM1,SE0++,VBL0 ; [B_N2] |8| ^
;* || VMPYWW .M2 VBM0,SE1++,VBL1 ; [B_M2] |8| ^
;* 3 NOP 0x2 ; [A_B]
;* 5 VADDW .L2 VBL1,VBL0,VB0 ; [B_L2] |8|
;* 6 VST16W .D2 VB0,*D0(0) ; [A_D2] |8|
;* || VADDW .L2 VBL1,VBL0,VB0 ; [B_L2] |8|
;* 7 VST16W .D2 VB0,*D0(64) ; [A_D2] |8| [C0]
;* || ADDD .D1 D0,0x80,D0 ; [A_D1] |7| [C1]
;* || BNL .B1 ||$C$C36|| ; [A_B] |7|
;* 8 ; BRANCHCC OCCURS {||$C$C36||} ; [] |7|
In this case, the compiler uses SE0 and SE1 to replace the
loads that previously set a lower ii bound of 4. With these
loads instead being performed with SEs, an ii of 2 is achieved.
To use the SEs in the above example, the compiler must configure and open
them. The configuration and open actions are shown in comments added by the
--src_interlist option before the loop:
;*** ----------------------- S$1 = __internal_SE_TEMPLATE_1_i_1_i_d_i_d_i_d_i_d_i_d_2_4;
;*** ----------------------- S$1.ICNT0 = C$5 = (unsigned)(n+15&0xfffffff0);
;*** ----------------------- __se_open_V0_U32_O(*__se_mem((packed void *)a), 0, S$1);
;*** ----------------------- S$3 = __internal_SE_TEMPLATE_1_i_1_i_d_i_d_i_d_i_d_i_d_2_4;
;*** ----------------------- S$3.ICNT0 = C$5;
;*** ----------------------- __se_open_V0_U32_O(*__se_mem((packed void *)b), 1, S$3);
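The ICNT0 expression in the interlisted comments, (unsigned)(n+15&0xfffffff0), rounds the trip count n up to the next multiple of 16. The following host-side sketch reproduces that arithmetic in plain C++ so it can be checked by hand. The helper name round_up_to_16 is hypothetical and is not part of any TI API. Reading the factor 16 as the number of 32-bit words handled by each VMPYWW or VST16W in the listing is an inference, not something the compiler output states.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the interlisted expression
// (unsigned)(n+15&0xfffffff0). In C and C++, '+' binds tighter than '&',
// so the expression is evaluated as ((n + 15) & 0xfffffff0).
inline std::uint32_t round_up_to_16(std::int32_t n)
{
    return static_cast<std::uint32_t>((n + 15) & 0xfffffff0);
}
```

For example, a trip count of 1024 is left unchanged, while 1025 is rounded up to 1040.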
By default, the compiler uses the SE or SA only if using them appears to be profitable and legal.
For profitability, a key consideration is that using the SEs or SAs comes with a processing overhead, and the compiler cannot always tell whether the benefit outweighs that overhead. In the example, the MUST_ITERATE pragma indicates that the minimum iteration count is 1024, which convinces the compiler that using the SEs or SAs is likely profitable, so the compiler performs the transformation. If the compiler is not using the SE or SA and you want it to, indicating the number of iterations with the MUST_ITERATE or PROB_ITERATE pragma can help.
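A minimal sketch of how such an iteration-count hint might be attached to a loop of this kind. The function name, parameters, and loop body are assumptions written for illustration and are not the guide's actual source file. MUST_ITERATE(1024, , 32) promises at least 1024 iterations and a trip count that is a multiple of 32; the multiple is an assumed value. The __C7000__ guard is assumed to be the compiler's predefined target macro and only hides the pragma from host compilers. The __restrict spelling is likewise an assumption chosen for host portability.

```cpp
// Sketch only: names and loop body are illustrative assumptions.
void weighted_sum(const int* __restrict a, const int* __restrict b,
                  int* __restrict out, int weight_a, int weight_b, int n)
{
#if defined(__C7000__)  // assumed predefined macro for C7000 targets
    // Caller guarantees n >= 1024 and n is a multiple of 32 (assumed values).
    #pragma MUST_ITERATE(1024, , 32)
#endif
    for (int i = 0; i < n; i++) {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
```

The pragma is a promise from the programmer, so calling this function with fewer than 1024 elements on the target would violate it.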
For legality, most reasons for not using the SE or the SA relate to whether an addressing pattern can always be mapped to an SE or SA. These reasons include, but are not limited to:
Iteration counter (ICNT) values that exceed the range of an unsigned 32-bit type. For example, this occurs in for (i = 0; i < icnt; i++) when i and icnt are 64-bit types.
DIM values that exceed the range of a signed 32-bit type. For example, this occurs in data_in[i*dim] when dim is a 64-bit type.
Additions or multiplies in addressing that exceed the range of a signed 32-bit type. For example, this occurs in data_in[i*dim] when i or dim is a 64-bit type.
Addressing exceeding the range of INT_MIN to INT_MAX elements. For example, in int16_ptr[i] when int16_ptr is an int16 * and i is an int, the maximum range is INT_MIN*16 elements to INT_MAX*16 elements.
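To make the first three cases concrete, the sketch below contrasts two versions of the same strided loop. The function names are hypothetical. The first keeps the counter, trip count, and stride in 32-bit int. The second uses 64-bit types for i, icnt, and dim, which is the pattern the cases above identify as one that can prevent automatic SE or SA use. Both compute the same result for in-range values. Whether the compiler actually streams either loop still depends on its other profitability and legality checks.

```cpp
#include <cstdint>

// 32-bit counter, trip count, and stride: the addressing arithmetic
// i*dim is carried out in a signed 32-bit type.
void gather_stride_i32(const int* __restrict data_in, int* __restrict data_out,
                       int icnt, int dim)
{
    for (int i = 0; i < icnt; i++) {
        data_out[i] = data_in[i * dim];
    }
}

// 64-bit counter, trip count, and stride: icnt may exceed an unsigned
// 32-bit range and i*dim may exceed a signed 32-bit range, so the
// addressing pattern cannot always be mapped to an SE or SA.
void gather_stride_i64(const int* __restrict data_in, int* __restrict data_out,
                       std::int64_t icnt, std::int64_t dim)
{
    for (std::int64_t i = 0; i < icnt; i++) {
        data_out[i] = data_in[i * dim];
    }
}
```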
Each of these is an edge case that is unlikely to occur in practice. To allow
the compiler to ignore them, use the --assume_addresses_ok_for_stream option.
If using the SE or SA is not profitable in practice, you can override the
--auto_stream and/or --assume_addresses_ok_for_stream options for a single
function using the FUNCTION_OPTIONS pragma.
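As a sketch of such a per-function override, the C++ form of the pragma is assumed here to apply the quoted options to the next function definition. The function short_copy is hypothetical, and the value --auto_stream=off is an assumption about the option's disabling setting; check SPRUIG8 for the pragma syntax and option values supported by your compiler version. The __C7000__ guard is assumed to be the compiler's predefined target macro and only hides the pragma from host compilers.

```cpp
#if defined(__C7000__)  // assumed predefined macro for C7000 targets
#pragma FUNCTION_OPTIONS("--auto_stream=off")  // assumed option value
#endif
// Hypothetical function for which streaming is assumed not to pay off,
// for example because it is only ever called with small n.
void short_copy(const int* __restrict src, int* __restrict dst, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

Other functions in the same file would continue to be compiled with the command-line --auto_stream setting.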
If the code explicitly uses the SE or SA in a function, the compiler does not choose to use either the SE or the SA for optimization. In this case, the compiler assumes that the code handles all aspects of optimization with the SE and SA within that function.
For further information on automatic use of the SE and SA and the associated compiler options, see the C7000 C/C++ Optimizing Compiler User's Guide (SPRUIG8).