7.3. Using the Streaming Engine
The Streaming Engine was introduced in the section Streaming Engine.
A Streaming Engine is configured through a structure instance. The fields of
the __SE_TEMPLATE_v1 structure control the addressing pattern and the data
transformations applied to the fetched data. The user obtains a structure with
default values by calling a compiler-supplied function such as
__gen_SE_TEMPLATE_v1(), then modifies fields in that structure instance so
that the fetching pattern and data transformations are appropriate for the
use case.
To obtain a value from the Streaming Engine (in this case, SE0) and advance
the address to the next access location, the user can use the macros
SE0() or SE0ADV(). However, because we're using C++ and scalable vector
types, we must instead use the c7x::strm_eng<0, T>::get_adv() intrinsic,
where T is the type of the vector. SE0 and SE0ADV do not work with scalable
vector types at the time of this writing.
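The "get and advance" behavior can be pictured with a small host-side model in portable C++. This is only a sketch for intuition and does not use the TI API: ToyStream and ToyVec are hypothetical names, and the lane count of 4 is an assumed value chosen to keep the example small.

```cpp
#include <array>
#include <cstddef>

// Assumed lane count for illustration; not the real C7000 vector width.
constexpr std::size_t kLanes = 4;
using ToyVec = std::array<int, kLanes>;

// Hypothetical host-side stand-in for a stream: it holds a base pointer and
// a position, and hands out one vector's worth of elements per call.
class ToyStream {
public:
    explicit ToyStream(const int *base) : base_(base), pos_(0) {}

    // Return the vector at the current position, then move the position
    // forward by one vector, mirroring "get and advance".
    ToyVec get_adv()
    {
        ToyVec v{};
        for (std::size_t i = 0; i < kLanes; i++)
            v[i] = base_[pos_ + i];
        pos_ += kLanes;
        return v;
    }

private:
    const int *base_;
    std::size_t pos_;
};
```

Calling get_adv() twice on a ToyStream built over an 8-element array returns elements 0 through 3 and then elements 4 through 7. The caller never computes an address, which is the property the Streaming Engine provides in hardware.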
Let us modify our memory copy example from the previous section to utilize
the Streaming Engine to load the values from the in pointer.
We're going to use the __gen_SE_TEMPLATE_v1() function to get a
__SE_TEMPLATE_v1 structure with default values, which we'll then modify so the
Streaming Engine operates in the desired manner. We'll use the modified
structure to open Streaming Engine SE0 using the macro __SE0_OPEN.
Then, we'll use the C++ intrinsic c7x::strm_eng<0, c7x::int_vec>::get_adv()
to load a value from SE0 and advance SE0 to the next vector.
Finally, we close SE0 after the loop with a __SE0_CLOSE() intrinsic.
#include <c7x_scalable.h>

void memcpy_scalable_v4 (const c7x::int_vec *restrict in,
                         c7x::int_vec *restrict out,
                         int len)
{
    // Find the number of vector loads/stores needed to copy the
    // buffer.
    int cnt = len / c7x::element_count_of<c7x::int_vec>::value;
    cnt += (len % c7x::element_count_of<c7x::int_vec>::value > 0);

    // Generate a Streaming Address Generator setup template and
    // Streaming Engine setup template with default values.
    __SA_TEMPLATE_v1 out_tmplt = __gen_SA_TEMPLATE_v1();
    __SE_TEMPLATE_v1 in_tmplt = __gen_SE_TEMPLATE_v1();

    // Obtain the __SA_VECLEN enumeration value that indicates to the
    // streaming address generator the number of elements in a vector.
    // Use this value to set the VECLEN member of the SA setup record.
    out_tmplt.VECLEN = c7x::sa_veclen<c7x::int_vec>::value;

    // Modify the SA setup record to indicate to the SA how many total
    // elements we want to generate (in the first and only dimension).
    // Note that this does not need to be a multiple of the number of
    // elements in a vector.
    out_tmplt.ICNT0 = len;

    // Modify the SE setup record for the appropriate length and vector
    // type.
    in_tmplt.VECLEN = c7x::se_veclen<c7x::int_vec>::value;
    in_tmplt.ELETYPE = c7x::se_eletype<c7x::int_vec>::value;
    in_tmplt.ICNT0 = len;

    // Tell the streaming engine the pattern is 1-dimensional.
    in_tmplt.DIMFMT = __SE_DIMFMT_1D;

    // Open streaming engine 0 (SE0) with base pointer "in".
    __SE0_OPEN(in, in_tmplt);

    // Open streaming address generator 0 (SA0).
    __SA0_OPEN(out_tmplt);

    // Perform the copy, including any remainder.
    int i;
    for (i = 0; i < cnt; i++)
    {
        // Load an int vector's worth of data from the array "in".
        c7x::int_vec data = c7x::strm_eng<0, c7x::int_vec>::get_adv();

        // Obtain a vector predicate from streaming address generator 0
        // (SA0).
        __vpred pred = c7x::strm_agen<0, c7x::int_vec>::get_vpred();

        // Obtain an address for the location we will store to next
        // by obtaining the offset of SA0 and adding it to the
        // address "out" by using the strm_agen get_adv() operator.
        // get_adv() also advances SA0 to the next offset.
        c7x::int_vec *addr = c7x::strm_agen<0, c7x::int_vec>::get_adv(out);

        // Store the data into the location in out, possibly predicated
        // based on the addressing pattern in SA0.
        __vstore_pred(pred, addr, data);
    }

    __SE0_CLOSE();
    __SA0_CLOSE();
}
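The iteration count and the effect of the predicated final store can be checked on a host machine with ordinary C++. The following is a minimal sketch rather than TI code: the lane count of 16 is an assumed stand-in for c7x::element_count_of<c7x::int_vec>::value, and vector_iterations and active_lanes are hypothetical helper names introduced only for illustration.

```cpp
// Assumed lane count for illustration only; on a real device this would come
// from c7x::element_count_of<c7x::int_vec>::value.
constexpr int kLanes = 16;

// Number of vector loads/stores needed to cover len elements, rounded up.
// This is the same arithmetic used to compute "cnt" in the example.
int vector_iterations(int len)
{
    int cnt = len / kLanes;
    cnt += (len % kLanes > 0);
    return cnt;
}

// Number of lanes that hold in-range data on iteration i. This models the
// effect of the predicate: only these lanes are written by the store.
int active_lanes(int len, int i)
{
    int remaining = len - i * kLanes;
    if (remaining <= 0)
        return 0;
    return remaining < kLanes ? remaining : kLanes;
}
```

With len = 20 and the assumed 16 lanes, vector_iterations(20) returns 2 and active_lanes(20, 1) returns 4, so the second store writes only four elements. This is why ICNT0 does not need to be a multiple of the vector length.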
Now, let's modify the example above to (1) template the element type,
(2) use the scalable vector programming model, and (3) make a number of
other readability changes, notably adding a using namespace c7x directive.
#include <c7x_scalable.h>

using namespace c7x;

/*
 * memcpy_scalable_strm<typename S>(const S *in, S *out, int len)
 *
 * S   - A basic data type such as short or float.
 * in  - The input buffer.
 * out - The output buffer.
 * len - The number of elements to copy.
 */
template<typename S>
void memcpy_scalable_strm(const S *restrict in, S *restrict out, int len)
{
    /*
     * Create scalable vector type V, where the elements are of type S.
     */
    using V = typename make_full_vector<S>::type;

    /*
     * Find the maximum number of vector loads/stores needed to copy the
     * buffer, including any remainder.
     */
    int cnt = len / element_count_of<V>::value;
    cnt += (len % element_count_of<V>::value > 0);

    /*
     * Initialize the SE for a linear read in and the SA for a linear write
     * out.
     */
    __SE_TEMPLATE_v1 in_tmplt = __gen_SE_TEMPLATE_v1();
    __SA_TEMPLATE_v1 out_tmplt = __gen_SA_TEMPLATE_v1();

    in_tmplt.VECLEN = se_veclen<V>::value;
    in_tmplt.ELETYPE = se_eletype<V>::value;
    in_tmplt.DIMFMT = __SE_DIMFMT_1D;
    in_tmplt.ICNT0 = len;

    out_tmplt.VECLEN = sa_veclen<V>::value;
    out_tmplt.DIMFMT = __SA_DIMFMT_1D;
    out_tmplt.ICNT0 = len;

    __SE0_OPEN(in, in_tmplt);
    __SA0_OPEN(out_tmplt);

    /*
     * Perform the copy. If there is remainder, the last store will be
     * predicated.
     */
    int i;
    for (i = 0; i < cnt; i++)
    {
        V tmp = strm_eng<0, V>::get_adv();
        __vpred pred = strm_agen<0, V>::get_vpred();
        V *addr = strm_agen<0, V>::get_adv(out);
        __vstore_pred(pred, addr, tmp);
    }

    __SE0_CLOSE();
    __SA0_CLOSE();
}
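To see what the templated loop does to memory without a C7000 toolchain, the behavior can be modeled in portable C++. This is a sketch under stated assumptions, not the device implementation: memcpy_model is a hypothetical name, the vector width of 64 bytes is an assumed value, and the inner loop with a per-lane bool stands in for the hardware predicate.

```cpp
#include <cstddef>

// Assumed vector width in bytes, for illustration only. The real width is a
// property of the target device.
constexpr std::size_t kVectorBytes = 64;

// Host-side model of memcpy_scalable_strm: copy len elements in
// vector-sized steps, writing only the in-range lanes of each step.
template <typename S>
void memcpy_model(const S *in, S *out, int len)
{
    // Lanes per vector depend on the element type, as with the scalable
    // vector type V in the example.
    constexpr int lanes = static_cast<int>(kVectorBytes / sizeof(S));

    // Same rounding-up iteration count as "cnt" in the example.
    int cnt = len / lanes;
    cnt += (len % lanes > 0);

    for (int i = 0; i < cnt; i++)
    {
        const int base = i * lanes;
        for (int lane = 0; lane < lanes; lane++)
        {
            // Models the predicate: lanes at or beyond len are not stored.
            const bool pred = (base + lane) < len;
            if (pred)
                out[base + lane] = in[base + lane];
        }
    }
}
```

Copying 40 elements of type short with this model takes two steps of 32 lanes each. The second step writes only 8 elements, so anything located after the 40th element of the destination is left untouched.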