5.3. Vectorization and Vector Predication
The C7000 instruction set has many powerful single-instruction, multiple-data (SIMD) instructions that can perform multiple operations in a single instruction. To take advantage of this, the compiler tries to vectorize the source code when possible and profitable. Vectorization usually involves using vector (SIMD) instructions to perform an operation on several loop iterations of data at a time.
The following example removes the UNROLL pragma and the MUST_ITERATE pragma from the example in the previous section. The UNROLL(1) pragma prevented certain loop-transformation optimizations in the C7000 compiler.
// weighted_vector_sum_v2.cpp
// Compile with "cl7x -mv7100 --opt_level=3 --debug_software_pipeline
// --src_interlist --symdebug:none weighted_vector_sum_v2.cpp"
void weighted_sum(int *restrict a, int *restrict b, int *restrict out,
                  int weight_a, int weight_b, int n)
{
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
The following shows the resulting internal compiler code, which has been vectorized. Vectorization by the compiler can be inferred from the "+= 16" address increments and from the "32x16" in the names of optimizer temporary variables (indicating that the temporary variable holds 16 32-bit elements).
;*** -----------------------g3:
;*** 6 ----------------------- if ( !((d$1 == 1)&U$33) ) goto g5;
;*** 6 ----------------------- VP$25 = VP$24;
;*** -----------------------g5:
;*** 7 ----------------------- VP$20 = VP$25;
;*** 7 ----------------------- __vstore_pred_p_P64_S32(VP$20, &(*(packed int (*)<[16]>)U$47),
*(packed int (*)<[16]>)U$38*VRC$s32x16$001+*(packed int (*)<[16]>)U$42*VRC$s32x16$002);
;*** 6 ----------------------- U$38 += 16;
;*** 6 ----------------------- U$42 += 16;
;*** 6 ----------------------- U$47 += 16;
;*** 6 ----------------------- --d$1;
;*** 6 ----------------------- if ( L$1 = L$1-1 ) goto g3;
The software pipeline information block from the resulting assembly file is as follows:
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : weighted_vector_sum_v2.cpp
;* Loop source line : 6
;* Loop opening brace source line : 6
;* Loop closing brace source line : 8
;* Loop Unroll Multiple : 16x
;* Known Minimum Iteration Count : 1
;* Known Max Iteration Count Factor : 1
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound : 2 (pre-sched)
;*
;* Searching for software pipeline schedule at ...
;* ii = 2 Schedule found with 7 iterations in parallel
...
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* ||$C$C41||:
;* 0 TICK ; [A_U]
;* 1 VLD16W .D1 *D0++(64),VBM0 ; [A_D1] |7| [SI]
;* 2 VLD16W .D1 *D1++(64),VBM0 ; [A_D1] |7| [SI]
;* 3 NOP 0x4 ; [A_B]
;* 7 VMPYWW .N2 VBM2,VBM0,VBL0 ; [B_N2] |7|
;* 8 VMPYWW .N2 VBM1,VBM0,VBL1 ; [B_N2] |7|
;* 9 CMPEQW .L1 AL0,0x1,D3 ; [A_L1] |6| ^
;* 10 ANDW .D2 D2,D3,AL1; [A_D2] |6|
;* || ADDW .L1 AL0,0xffffffff,AL0 ; [A_L1] |6| ^
;* 11 CMPEQW .S1 AL1,0,A0 ; [A_S1] |6|
;* 12 [!A0] MV .P2 P1,P0 ; [B_P] |6| CASE-1
;* || VADDW .L2 VBL1,VBL0,VB0 ; [B_L2] |7|
;* 13 VSTP16W .D2 P0,VB0,*A1(0) ; [A_D2] |7|
;* || ADDD .M1 A1,0x40,A1 ; [A_M1] |6| [C1]
;* || BNL .B1 ||$C$C41|| ; [A_B] |6|
;* 14 ; BRANCHCC OCCURS {||$C$C41||} ; [] |6|
Comparing this output with the output in the previous section shows these effects of vectorization:
The "optimizer" code reflects several high-level optimization steps, including vectorization. (This "optimizer" code appears in the assembly file when using the -os compiler option.) The address increments are by 16 and there are optimizer temporary variables with the partial name of 32x16, indicating 16 32-bit elements.
The "SOFTWARE PIPELINE INFORMATION" comment block in the assembly file shows that the loop has been unrolled by 16x. This may or may not indicate vectorization has occurred, but is often associated with vectorization.
The software pipelined loop now uses the VMPYWW and VADDW instructions. The 'V' in the instruction mnemonics often (but not always) indicates that the compiler has vectorized a code sequence (using vector/SIMD instructions).
Larger address increments in load and store instructions can be another clue that vectorization has occurred.
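As a rough aid for reading these listings, the element increment in the optimizer code and the byte increment in the assembly are related by the element size. The sketch below is illustrative only: the helper name `byte_stride` is hypothetical and is not part of any TI tool, and the element size of 4 bytes assumes the 32-bit `int` elements used in this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a TI API): bytes advanced per loop iteration when
// each iteration processes `elements_per_iteration` elements of
// `element_size` bytes each.
inline std::size_t byte_stride(std::size_t elements_per_iteration,
                               std::size_t element_size)
{
    assert(element_size > 0);
    return elements_per_iteration * element_size;
}
```

With 16 elements of 4 bytes each, `byte_stride(16, 4)` gives 64, which matches the `++(64)` post-increment on the VLD16W loads and the `0x40` added to the store address in the listing above.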
In this loop, the compiler does not know how many times the loop will execute. Therefore, if the iteration count is not a multiple of the number of elements in the chosen vector width, the last vectorized iteration must not store an entire vector to memory. For example, if the original (unvectorized) loop executes 40 iterations and the compiler vectorizes the loop by 16, the last vectorized iteration computes 16 elements, but only 8 of them should be stored to memory.
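The arithmetic behind the 40-iteration example can be sketched as follows. This is a minimal illustration, not compiler output; the struct and function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical names, for illustration only.
struct VectorTripInfo {
    int vector_iterations;  // iterations of the vectorized loop
    int last_active_lanes;  // elements that must be stored on the last one
};

// Splits a scalar trip count n (n >= 1) into vector iterations of `lanes`
// elements, rounding up so a partial final vector still gets an iteration.
inline VectorTripInfo vector_trip_info(int n, int lanes)
{
    assert(n >= 1 && lanes >= 1);
    VectorTripInfo info;
    info.vector_iterations = (n + lanes - 1) / lanes;  // ceil(n / lanes)
    int remainder = n % lanes;
    info.last_active_lanes = (remainder == 0) ? lanes : remainder;
    return info;
}
```

For `n = 40` and `lanes = 16`, this yields 3 vector iterations with 8 active lanes on the last one, which is the case described above.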
The C7000 ISA has certain vector predication features, where a vector predicate affects which lanes of a vector operation should be performed. In this case, a BITXPND instruction generates a vector predicate that is used in a vector-predicate-aware store instruction. This vector store instruction (VSTP16W) uses the vector predicate to prevent storing to memory those elements on the last iteration that were computed only as a result of the vectorization process and would not have been computed or stored in the original loop. The compiler attempts to perform vector predication automatically during the vectorization process. Vector predication helps avoid the need for generating peeled loop iterations, which can inhibit loop nest optimizations.
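The effect of a predicated store can be modeled in portable scalar C++. The sketch below is only a behavioral model under stated assumptions: the function name and `kLanes` constant are hypothetical, the per-lane `predicate` stands in for the vector predicate, and the model also guards the loads so that it stays within bounds in standard C++ (the generated code above uses unpredicated VLD16W loads).

```cpp
#include <cassert>

const int kLanes = 16;  // 16 32-bit elements per vector, as in the listing

// Scalar model of the vectorized loop (hypothetical, not generated code):
// each outer iteration covers kLanes elements, and a per-lane predicate
// decides which results are written to memory.
inline void weighted_sum_predicated_model(const int *a, const int *b, int *out,
                                          int weight_a, int weight_b, int n)
{
    assert(a && b && out);
    for (int base = 0; base < n; base += kLanes) {
        for (int lane = 0; lane < kLanes; ++lane) {
            int i = base + lane;
            bool predicate = (i < n);  // false only for lanes past the end
            if (predicate) {
                out[i] = a[i] * weight_a + b[i] * weight_b;
            }
        }
    }
}
```

With `n = 40`, the third outer iteration writes only elements 32 through 39; elements 40 through 47 of `out` are left untouched, just as in the original unvectorized loop.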
Note: Vector predicated stores may lead to page faults if the Corepac Memory Management Unit (CMMU) is enabled and the store overlaps an illegal memory page. Any memory range that will be within 63 bytes of an illegal memory page at run-time should be reduced in length in the linker command file. For more information, see the C7000 C/C++ Compiler User's Guide (SPRUIG8).
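To get a feel for where the 63-byte figure in this note can come from, the following sketch computes how far the footprint of the last full-width vector store extends past the end of a buffer. It is an illustration under assumptions, not a statement of how the CMMU behaves: the helper name is hypothetical, and it assumes stores are issued at `vector_bytes` strides from the start of the buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes by which the last full-width vector store
// footprint extends past the end of a buffer of `buffer_bytes` bytes,
// assuming stores start at the buffer base and advance by `vector_bytes`.
inline std::size_t store_overhang_bytes(std::size_t buffer_bytes,
                                        std::size_t vector_bytes)
{
    assert(vector_bytes > 0);
    std::size_t remainder = buffer_bytes % vector_bytes;
    return (remainder == 0) ? 0 : vector_bytes - remainder;
}
```

For the 40-element `int` buffer discussed earlier (160 bytes) and 64-byte vectors, the overhang is 32 bytes. The worst case for a 64-byte vector is 63 bytes, reached when only one byte of the last vector belongs to the buffer.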
You can avoid vector predication if you give the compiler information about the number of loop iterations using the MUST_ITERATE pragma. For example, if the loop in the previous example is known to execute only in multiples of 32 and the minimum iteration count is 1024, then the following example improves the generated assembly code:
// weighted_vector_sum_v3.cpp
// Compile with "cl7x -mv7100 --opt_level=3 --debug_software_pipeline
// --src_interlist --symdebug:none weighted_vector_sum_v3.cpp"
void weighted_sum(int *restrict a, int *restrict b, int *restrict out,
                  int weight_a, int weight_b, int n)
{
    #pragma MUST_ITERATE(1024, ,32)
    for (int i = 0; i < n; i++) {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
When compiled, this modified example generates the following software pipeline information block:
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : weighted_vector_sum_v3.cpp
;* Loop source line : 7
;* Loop opening brace source line : 7
;* Loop closing brace source line : 9
;* Loop Unroll Multiple : 32x
;* Known Minimum Iteration Count : 32
;* Known Max Iteration Count Factor : 1
;* Loop Carried Dependency Bound(^) : 0
;* Unpartitioned Resource Bound : 4
;* Partitioned Resource Bound : 4 (pre-sched)
;*
;* Searching for software pipeline schedule at ...
;* ii = 4 Schedule found with 5 iterations in parallel
...
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* ||$C$C36||:
;* 0 TICK ; [A_U]
;* 1 VLD16W .D1 *D1++(128),VBM0 ; [A_D1] |8| [SI][C1]
;* 2 VLD16W .D1 *D1(-64),VBM0 ; [A_D1] |8| [C1]
;* 3 VLD16W .D1 *D2++(128),VBM0 ; [A_D1] |8| [SI][C1]
;* 4 VLD16W .D1 *D2(-64),VBM0 ; [A_D1] |8| [C1]
;* 5 NOP 0x2 ; [A_B]
;* 7 VMPYWW .N2 VBM2,VBM0,VBL1 ; [B_N2] |8|
;* 8 VMPYWW .N2 VBM2,VBM0,VBL0 ; [B_N2] |8|
;* 9 VMPYWW .N2 VBM1,VBM0,VBL2 ; [B_N2] |8|
;* 10 VMPYWW .N2 VBM1,VBM0,VBL1 ; [B_N2] |8|
;* 11 NOP 0x2 ; [A_B]
;* 13 VADDW .L2 VBL2,VBL1,VB0 ; [B_L2] |8|
;* 14 VST16W .D2 VB0,*D0(0) ; [A_D2] |8|
;* || VADDW .L2 VBL1,VBL0,VB0 ; [B_L2] |8|
;* 15 VST16W .D2 VB0,*D0(64) ; [A_D2] |8| [C0]
;* 16 ADDD .D2 D0,0x80,D0 ; [A_D2] |7| [C0]
;* || BNL .B1 ||$C$C36|| ; [A_B] |7|
;* 17 ; BRANCHCC OCCURS {||$C$C36||} ; [] |7|
Because the added MUST_ITERATE pragma guarantees that the iteration count is a multiple of 32, the compiler knows that vector predication is never needed and does not perform it. As a result, the CMPEQW, ANDW, VSTP16W, and other instructions associated with vector predication no longer appear, and the stores are now plain VST16W instructions.
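The compiler relies on the MUST_ITERATE information when it generates code, so callers need to honor it. A caller-side check along the following lines may be useful in debug builds. This is a sketch under assumptions: `satisfies_must_iterate` is a hypothetical helper written in portable C++, not a TI-provided function.

```cpp
#include <cassert>

// Hypothetical guard (not a TI API): returns true when the trip count n
// satisfies a MUST_ITERATE(min_count, , multiple) contract, that is,
// n is at least min_count and is an exact multiple of `multiple`.
inline bool satisfies_must_iterate(int n, int min_count, int multiple)
{
    assert(min_count >= 0 && multiple >= 1);
    return n >= min_count && n % multiple == 0;
}
```

For example, a caller could place `assert(satisfies_must_iterate(n, 1024, 32));` immediately before calling `weighted_sum` from weighted_vector_sum_v3.cpp. A value such as `n = 1040` would fail the check because it is not a multiple of 32.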
5.3.1. Vectorization and Code Size
Vectorization tends to increase performance at the expense of increasing code size. Vectorization tends to be more aggressive at the higher --opt_for_speed/-mf levels, and the compiler does not perform vectorization at --opt_for_speed=1/-mf1 or --opt_for_speed=0.