6.1. Streaming Engine
A Streaming Engine is a feature of the C7000 CPU core that aids in high-speed loading of data from memory (L2 or higher) to the functional units in the CPU. Use of the Streaming Engines can significantly improve the throughput and performance of the memory hierarchy. Normally, the C7x CPU can only load one vector-length data item and one 64-bit length data item per clock cycle. The Streaming Engines provide more bandwidth from L2 memory to the CPU than using load instructions alone and they prefetch data from memory to a location near the CPU so the data can be accessed faster. Using a Streaming Engine may also reduce the number of L1 data cache capacity misses as the L1 cache is bypassed for data accessed through the Streaming Engines.
The best way to use the Streaming Engines is with data that has been placed into L2 memory. Using the Streaming Engines on data that is located in L3 or external memory will work, but performance will very likely be significantly lower than with data located in L2. It is recommended to use some kind of software framework to have a DMA engine bring data into the C7x L2 memory in the background and then use the Streaming Engine to access it.
Two conditions indicate that a Streaming Engine may be needed:
More than one vector load is needed per clock cycle.
Data cache misses dominate at run time.
When either of these occurs, consider using one or both of the Streaming Engines if the access pattern to the objects in memory is known in advance. The Streaming Engine feature supports up to a six-dimensional address access pattern.
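To make the idea of a multi-dimensional access pattern concrete, the following is a minimal host-side sketch that enumerates the element offsets a loop nest of up to six levels would visit. It is a model for illustration only and does not use the Streaming Engine API: the names `AccessPattern`, `icnt`, `dim`, and `generate_offsets` are hypothetical, and the sketch assumes each dimension is described by an iteration count and an element stride.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a nested access pattern; NOT the TI API.
constexpr std::size_t kMaxDims = 6;

struct AccessPattern {
    // icnt[d]: iteration count of dimension d (dimension 0 is innermost).
    std::array<std::size_t, kMaxDims> icnt{1, 1, 1, 1, 1, 1};
    // dim[d]: element stride applied each time dimension d advances.
    std::array<std::ptrdiff_t, kMaxDims> dim{1, 0, 0, 0, 0, 0};
};

// Returns the element offsets in the order a six-level loop nest visits them.
std::vector<std::ptrdiff_t> generate_offsets(const AccessPattern& p) {
    for (std::size_t d = 0; d < kMaxDims; ++d) {
        assert(p.icnt[d] >= 1);  // every dimension iterates at least once
    }

    std::vector<std::ptrdiff_t> offsets;
    std::array<std::size_t, kMaxDims> idx{};  // current index per dimension

    for (;;) {
        // Offset is the sum of index * stride over all dimensions.
        std::ptrdiff_t off = 0;
        for (std::size_t d = 0; d < kMaxDims; ++d) {
            off += static_cast<std::ptrdiff_t>(idx[d]) * p.dim[d];
        }
        offsets.push_back(off);

        // Advance like an odometer: bump the innermost index and carry
        // into outer dimensions when a dimension wraps.
        std::size_t d = 0;
        while (d < kMaxDims) {
            if (++idx[d] < p.icnt[d]) break;
            idx[d] = 0;
            ++d;
        }
        if (d == kMaxDims) break;  // outermost dimension wrapped: done
    }
    return offsets;
}
```

For example, setting `icnt = {4, 3, 1, 1, 1, 1}` and `dim = {1, 8, 0, 0, 0, 0}` describes reading 4 consecutive elements from each of 3 rows that are 8 elements apart. Because the whole sequence is fixed by a handful of counts and strides, it can be known in advance, which is the property the paragraph above asks for.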
Using one or both of the Streaming Engines may also limit the number of instructions required to calculate an address used for a load instruction. This may allow the compiler to perform loop transformation optimizations called loop coalescing and loop collapsing, which may lead to a larger portion of the loop nest getting software pipelined, which can lead to improved performance.
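As a rough illustration of what loop collapsing means, the sketch below writes the transformation out by hand on ordinary host C++. It is not C7000-specific and does not show what the compiler actually emits; the function names are hypothetical. The nested version computes an address from two indices in the loop body, while the collapsed version has a single loop with no per-iteration address arithmetic, which is the shape that becomes possible once address generation is moved out of the loop body.

```cpp
#include <cstddef>
#include <vector>

// Nested form: the loop body derives an address from (r, c) on every
// iteration, and only the inner loop is a candidate for pipelining.
long sum_nested(const std::vector<int>& a, std::size_t rows, std::size_t cols) {
    long sum = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sum += a[r * cols + c];
        }
    }
    return sum;
}

// Collapsed form: one loop over rows * cols iterations. The body no longer
// needs the row and column indices, so the whole nest is a single loop.
long sum_collapsed(const std::vector<int>& a, std::size_t rows, std::size_t cols) {
    long sum = 0;
    const std::size_t n = rows * cols;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}
```

Both functions return the same result for the same input; only the loop structure differs.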
Manual use of the Streaming Engines has the greatest effect when used in conjunction with loops that are vectorized by hand.
To get a baseline understanding of the Streaming Engine and its default API, and to see example code, the reader is encouraged to first read Section 4.15, "Streaming Engine and Streaming Address Generator", in the C7000 Optimizing C/C++ Compiler User's Guide (SPRUIG8) and to peruse the c7x_strm.h file in the include directory of the compiler installation directory. The user can also reference the C71x DSP CPU, Instruction Set, and Matrix Multiply Accelerator Technical Reference Manual (SPRUIP0).
The C7000 CPU variants that are available at the time of this writing have two Streaming Engines, named SE0 and SE1.
A Streaming Engine is controlled by a structure instance that contains several fields. The __SE_TEMPLATE structure contains fields that control various behaviors of the addressing pattern. The user can obtain a structure with default values by calling a compiler tool-supplied function like __gen_SE_TEMPLATE_v1(). The user then modifies fields in the __SE_TEMPLATE data structure instance to configure the Streaming Engine so that the fetching pattern and data transformations are appropriate for the use case.
To obtain a value from the Streaming Engine (in this case, SE0), and advance the address to the next access location, the user can use the C++ function c7x::strm_eng<0, T>::get_adv(), where T is the type of the vector.
A code example that uses the Streaming Engine can be found in the Examples chapter, in the section Using the Streaming Engine.
As of v4.0.0 of the C7000 compiler, the compiler may automatically use the Streaming Engine, depending on the situation. See section Automatic Use of the Streaming Engine and Streaming Address Generator for more information on what compiler options may be needed to enable automatic use of the Streaming Engines. Also see the C7000 Optimizing C/C++ Compiler User's Guide, Section 4.15, for more information about the Streaming Engine.