4.3. Automatic Use of the Streaming Engine and Streaming Address Generator

The compiler can automatically use the special C7000 hardware features called the Streaming Engine (SE) and the Streaming Address Generator (SA) if the --auto_stream=no_saving option is used on C7100 and C7120 devices or the --auto_stream=saving option is used on C7504, C7524, and later devices.

More information on the Streaming Address Generator and the Streaming Engine can be found in sections Streaming Engine and Streaming Address Generator and in the C7000 C/C++ Compiler User's Guide (SPRUIG8).

If the weighted_vector_sum_v2.cpp example in section Software Pipelining: Performance and Code-Size Tradeoff is compiled with the --auto_stream=no_saving option, the compiler generates the following software pipeline information block. (The generated assembly in this example is for C7100.)

;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v2.cpp
;*      Loop source line                 : 7
;*      Loop opening brace source line   : 7
;*      Loop closing brace source line   : 9
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Iteration Count    : 32
;*      Known Max Iteration Count Factor : 1
;*      Loop Carried Dependency Bound(^) : 2
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*        ii = 2  Schedule found with 4 iterations in parallel
. . .
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C36||:
;*   0              TICK    ; [A_U]
;*   1              VMPYWW  .N2     VBM1,SE0++,VBL0   ; [B_N2] |8|  ^
;*     ||           VMPYWW  .M2     VBM0,SE1++,VBL1   ; [B_M2] |8|  ^
;*   2              VMPYWW  .N2     VBM1,SE0++,VBL0   ; [B_N2] |8|  ^
;*     ||           VMPYWW  .M2     VBM0,SE1++,VBL1   ; [B_M2] |8|  ^
;*   3              NOP    0x2      ; [A_B]
;*   5              VADDW   .L2     VBL1,VBL0,VB0     ; [B_L2] |8|
;*   6              VST16W  .D2     VB0,*D0(0)        ; [A_D2] |8|
;*     ||           VADDW   .L2     VBL1,VBL0,VB0     ; [B_L2] |8|
;*   7              VST16W  .D2     VB0,*D0(64)       ; [A_D2] |8| [C0]
;*     ||           ADDD    .D1     D0,0x80,D0        ; [A_D1] |7| [C1]
;*     ||           BNL     .B1     ||$C$C36||        ; [A_B] |7|
;*   8              ; BRANCHCC OCCURS {||$C$C36||}    ; [] |7|

In this case, the compiler uses SE0 and SE1 to replace the loads that previously set a lower ii bound of 4. With these loads instead being performed with SEs, an ii of 2 is achieved. To use the SEs in the above example, the compiler must configure and open them. The configuration and open actions are shown in comments added by the --src_interlist option before the loop:

;***    -----------------------    S$1 = __internal_SE_TEMPLATE_1_i_1_i_d_i_d_i_d_i_d_i_d_2_4;
;***    -----------------------    S$1.ICNT0 = C$5 = (unsigned)(n+15&0xfffffff0);
;***    -----------------------    __se_open_V0_U32_O(*__se_mem((packed void *)a), 0, S$1);
;***    -----------------------    S$3 = __internal_SE_TEMPLATE_1_i_1_i_d_i_d_i_d_i_d_i_d_2_4;
;***    -----------------------    S$3.ICNT0 = C$5;
;***    -----------------------    __se_open_V0_U32_O(*__se_mem((packed void *)b), 1, S$3);
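
The ICNT0 expression in the comments above rounds the trip count n up to the next multiple of 16. Because + binds more tightly than & in C and C++, n+15&0xfffffff0 parses as (n+15)&0xfffffff0. The following host-side sketch reproduces only that arithmetic so the rounding is easy to check; the helper name round_up_icnt0 is hypothetical and is not part of the compiler or of any TI header.

```cpp
#include <cstdint>

// Hypothetical helper (not a TI API): mirrors the interlisted expression
//   (unsigned)(n+15&0xfffffff0)
// which rounds n up to the next multiple of 16.
constexpr std::uint32_t round_up_icnt0(std::int32_t n)
{
    // '+' has higher precedence than '&', so this is (n + 15) & 0xfffffff0.
    return static_cast<std::uint32_t>(n + 15) & 0xfffffff0u;
}

// Compile-time checks of the rounding behavior.
static_assert(round_up_icnt0(1024) == 1024u, "1024 is already a multiple of 16");
static_assert(round_up_icnt0(1025) == 1040u, "1025 rounds up to 1040");
```

Both streams in the listing share the same rounded value (C$5), since a and b are read with the same trip count.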

By default, the compiler uses the SE or SA only if using them appears to be profitable and legal.

For profitability, a key consideration is that using the SEs or SAs comes with a processing overhead (they must be configured and opened before the loop), and the compiler cannot always tell whether the benefit outweighs that overhead. In the example, the MUST_ITERATE pragma indicates that the minimum iteration count is 1024, which convinces the compiler that using the SEs or SAs is likely profitable, so the compiler performs the transformation. If the compiler is not using the SE or SA and you want it to, indicating the number of iterations with the MUST_ITERATE or PROB_ITERATE pragma can help.
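
As an illustration of supplying trip-count information, here is a minimal sketch of a weighted vector sum loop annotated with MUST_ITERATE. The function signature is an assumption and may differ from the source in the earlier section. The multiple of 32 in the pragma is also an assumption; only the minimum of 1024 is stated in the text. The __C7000__ guard is assumed to be the compiler's predefined macro and is there only so that host compilers skip the pragma.

```cpp
// Sketch only: signature and the "multiple of 32" argument are assumptions.
void weighted_vector_sum(const int *__restrict a, const int *__restrict b,
                         int *__restrict out, int weight_a, int weight_b, int n)
{
    // MUST_ITERATE(min, max, multiple): at least 1024 iterations, no stated
    // maximum, trip count divisible by 32. Guarded so host builds ignore it.
#if defined(__C7000__)
    #pragma MUST_ITERATE(1024, , 32)
#endif
    for (int i = 0; i < n; i++) {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
```

The pragma is a promise to the compiler, not a runtime check: calling this function with n below 1024 (or not a multiple of 32) would violate the stated contract.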

For legality, most reasons for not using the SE or the SA relate to whether an addressing pattern can always be mapped to an SE or SA. These reasons include, but are not limited to:

Iteration counter (ICNT) values that exceed the range of an unsigned 32-bit type. For example, this occurs in for (i = 0; i < icnt; i++) when i and icnt are 64-bit types.
DIM values that exceed the range of a signed 32-bit type. For example, this occurs in data_in[i*dim] when dim is a 64-bit type.
Additions or multiplies in addressing that exceed the range of a signed 32-bit type. For example, this occurs in data_in[i*dim] when i or dim is a 64-bit type.
Addressing exceeding the range of INT_MIN to INT_MAX elements. For example, in int16_ptr[i] when int16_ptr is an int16 * and i is an int, the maximum range is INT_MIN*16 elements to INT_MAX*16 elements.

Each of these is an edge case that is unlikely to occur in practice. To allow the compiler to ignore them, use the --assume_addresses_ok_for_stream option.
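
To make the ICNT and DIM cases concrete, the sketch below shows two strided-read loops that differ only in the type of the counter and stride. The function names are hypothetical. The comments describe which range concerns from the list apply; they do not claim what the compiler will actually do with either loop.

```cpp
#include <cstdint>

// 32-bit counter and stride: icnt and dim fit the ICNT and DIM ranges
// described in the list of legality concerns.
int sum_strided_32(const int *__restrict data_in, int icnt, int dim)
{
    int sum = 0;
    for (int i = 0; i < icnt; i++) {
        sum += data_in[i * dim];
    }
    return sum;
}

// 64-bit counter and stride: icnt could exceed an unsigned 32-bit range and
// i * dim could exceed a signed 32-bit range, so the addressing pattern is
// not provably mappable without further assumptions.
int sum_strided_64(const int *__restrict data_in, std::int64_t icnt, std::int64_t dim)
{
    int sum = 0;
    for (std::int64_t i = 0; i < icnt; i++) {
        sum += data_in[i * dim];
    }
    return sum;
}
```

When the actual values are known to be small, narrowing the types as in the first function is one way to avoid the concern at the source level; the --assume_addresses_ok_for_stream option is the alternative when the types cannot be changed.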

If using the SE or SA is not profitable in practice, you can override the --auto_stream and/or --assume_addresses_ok_for_stream options for a single function using the FUNCTION_OPTIONS pragma.
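
A per-function override might look like the following sketch. It assumes the C++ form of the pragma, which takes only the option string and applies to the next function, and it assumes that off is an accepted value for --auto_stream; check both against the compiler user's guide. The function scale_small is hypothetical, and the __C7000__ guard is assumed to be the compiler's predefined macro so that host compilers skip the pragma.

```cpp
// Assumed C++ pragma form and assumed "off" value; verify in SPRUIG8.
#if defined(__C7000__)
#pragma FUNCTION_OPTIONS("--auto_stream=off")
#endif
// Hypothetical helper with a short loop, where the SE/SA setup overhead
// is unlikely to pay for itself.
void scale_small(const int *__restrict in, int *__restrict out, int scale, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * scale;
    }
}
```

The rest of the file continues to be compiled with whatever --auto_stream setting was given on the command line.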

If the code explicitly uses the SE or SA in a function, the compiler does not choose to use either the SE or the SA for optimization. In this case, the compiler assumes that the code handles all aspects of optimization with the SE and SA within that function.

For further information on automatic use of the SE and SA and the associated compiler options, see the C7000 C/C++ Compiler User's Guide (SPRUIG8).