7.3. Using the Streaming Engine

The Streaming Engine was previously discussed in the section Streaming Engine.

A Streaming Engine is controlled by a structure instance whose fields govern the behavior of the addressing pattern. The user obtains a structure with default values, such as __SE_TEMPLATE_v1, by calling a compiler-supplied function like __gen_SE_TEMPLATE_v1(). The user then modifies fields in that structure instance to configure the Streaming Engine so that the fetching pattern and data transformations are appropriate for the use case.

To obtain a value from the Streaming Engine (in this case, SE0) and advance the address to the next access location, the user can use the macros SE0() or SE0ADV(). However, because this example uses C++ and scalable vector types, we must instead use the c7x::strm_eng<0, T>::get_adv() intrinsic, where T is the type of the vector; SE0 and SE0ADV do not work with scalable vector types at the time of this writing.
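
To make the "read, then advance" behavior concrete, here is a minimal host-side sketch in plain C++. It is not a C7000 API: ModelStream1D and its members are hypothetical names used only for illustration, and the zero-filling of lanes beyond the element count is an assumption of this model rather than a statement about the hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side stand-in for a 1-D stream. This is NOT a C7000 API.
template <typename T, std::size_t LANES>
class ModelStream1D
{
public:
    // base:  first element of the input buffer
    // icnt0: total number of elements in the single dimension
    ModelStream1D(const T *base, std::size_t icnt0)
        : base_(base), icnt0_(icnt0), pos_(0)
    {
        assert(base != nullptr || icnt0 == 0);
    }

    // Return the vector at the current position without advancing.
    std::array<T, LANES> get() const
    {
        // ASSUMPTION: lanes past icnt0 read as zero in this model.
        std::array<T, LANES> v{};
        for (std::size_t lane = 0; lane < LANES && pos_ + lane < icnt0_; lane++)
            v[lane] = base_[pos_ + lane];
        return v;
    }

    // Return the vector at the current position, then advance by one vector.
    std::array<T, LANES> get_adv()
    {
        std::array<T, LANES> v = get();
        pos_ += LANES;
        return v;
    }

private:
    const T    *base_;
    std::size_t icnt0_;
    std::size_t pos_;
};
```

In this model, calling get() twice returns the same vector, while each call to get_adv() moves the position forward by LANES elements, which is the property the copy loop in this section relies on.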

Let us modify the memory copy example from the previous section so that the Streaming Engine loads the values from the in pointer. We call __gen_SE_TEMPLATE_v1() to get a __SE_TEMPLATE_v1 structure with default values, modify it so that the Streaming Engine operates in the desired manner, and then use it to open Streaming Engine SE0 with the __SE0_OPEN macro.

Then, we'll use the C++ intrinsic c7x::strm_eng<0, c7x::int_vec>::get_adv() to load a vector from SE0 and advance SE0 to the next vector.

Finally, we close SE0 after the loop with __SE0_CLOSE().

#include <c7x_scalable.h>

void memcpy_scalable_v4 (const c7x::int_vec *restrict in,
                         c7x::int_vec *restrict out,
                         int len)
{
    // Find the number of vector loads/stores needed to copy the
    // buffer.
    int cnt = len / c7x::element_count_of<c7x::int_vec>::value;
    cnt += (len % c7x::element_count_of<c7x::int_vec>::value > 0);

    // Generate a Streaming Address Generator setup template and
    // Streaming Engine setup template with default values.
    __SA_TEMPLATE_v1 out_tmplt = __gen_SA_TEMPLATE_v1();
    __SE_TEMPLATE_v1 in_tmplt  = __gen_SE_TEMPLATE_v1();

    // Obtain the __SA_VECLEN enumeration value that indicates to the
    // streaming address generator the number of elements in a vector.
    // Use this value to set the VECLEN member of the SA setup record.
    out_tmplt.VECLEN = c7x::sa_veclen<c7x::int_vec>::value;

    // Modify the SA setup record to indicate to the SA how many total
    // elements we want to generate (in the first and only dimension).
    // Note that this does not need to be a multiple of the number of
    // elements in a vector.
    out_tmplt.ICNT0 = len;

    // Modify the SE setup record for the appropriate length and vector
    // type
    in_tmplt.VECLEN  = c7x::se_veclen<c7x::int_vec>::value;
    in_tmplt.ELETYPE = c7x::se_eletype<c7x::int_vec>::value;
    in_tmplt.ICNT0   = len;

    // Tell the streaming engine the pattern is 1-dimensional
    in_tmplt.DIMFMT  = __SE_DIMFMT_1D;

    // Open the streaming engine 0 with base pointer "in"
    __SE0_OPEN(in, in_tmplt);

    // Open the streaming address generator 0 (SA0)
    __SA0_OPEN(out_tmplt);

    // Perform the copy, including any remainder
    int i;
    for (i = 0; i < cnt; i++)
    {
        // Load an int vector's worth of data from the array "in"
        c7x::int_vec data = c7x::strm_eng<0, c7x::int_vec>::get_adv();

        // Obtain a vector predicate from the streaming address generator 0
        // (SA0).
        __vpred pred = c7x::strm_agen<0, c7x::int_vec>::get_vpred();

        // Obtain an address for the location we will store to next
        // by obtaining the offset of the SA0 and adding it to the
        // address "out" by using the strm_agen get_adv() operator.
        // get_adv() also advances SA0 to the next offset.
        c7x::int_vec * addr = c7x::strm_agen<0, c7x::int_vec>::get_adv(out);

        // Store the data into the location in out, possibly predicated
        // based on the addressing pattern in SA0
        __vstore_pred(pred, addr, data);
    }

    __SE0_CLOSE();
    __SA0_CLOSE();
}
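
The loop count and the effect of the predicated final store can be checked with a small host-side model. The sketch below is plain C++ and uses no C7000 intrinsics. LANES, vector_iterations(), active_lanes() and model_copy() are hypothetical names, and the value 16 is an assumption standing in for c7x::element_count_of<c7x::int_vec>::value.

```cpp
#include <cassert>

// ASSUMPTION: 16 int lanes per vector; on a real target this would come from
// c7x::element_count_of<c7x::int_vec>::value.
constexpr int LANES = 16;

// Number of vector iterations needed to cover len elements
// (ceiling division, written the same way as in the listing above).
int vector_iterations(int len)
{
    assert(len >= 0);
    int cnt = len / LANES;
    cnt += (len % LANES > 0);
    return cnt;
}

// Number of lanes the store predicate would enable on iteration i
// in this model: a full vector, except possibly on the last iteration.
int active_lanes(int len, int i)
{
    int remaining = len - i * LANES;
    if (remaining <= 0)
        return 0;
    return remaining < LANES ? remaining : LANES;
}

// Copy len ints one "vector" at a time, writing only the active lanes.
// Returns the total number of elements written.
int model_copy(const int *in, int *out, int len)
{
    int stored = 0;
    int cnt = vector_iterations(len);
    for (int i = 0; i < cnt; i++)
    {
        int n = active_lanes(len, i);
        for (int lane = 0; lane < n; lane++)
            out[i * LANES + lane] = in[i * LANES + lane];
        stored += n;
    }
    return stored;
}
```

For len = 37, this model runs 3 iterations and enables 16, 16 and 5 lanes, so nothing past out[36] is written even though the last iteration handles a full vector's worth of lanes.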

Now, let's modify the example above to (1) template the element type, (2) use the scalable vector programming model, and (3) make a number of other readability changes, most notably adding a using namespace c7x directive.

#include <c7x_scalable.h>
using namespace c7x;

/*
 * memcpy_scalable_strm<typename S>(const S *in, S *out, int len)
 *
 * S - A basic data type such as short or float.
 * in - The input buffer.
 * out - The output buffer.
 * len - The number of elements to copy.
 */
template<typename S>
void memcpy_scalable_strm(const S *restrict in, S *restrict out, int len)
{
    /*
     * Create scalable vector type V, where the elements are of type S
     */
    using V = typename make_full_vector<S>::type;

    /*
     * Find the maximum number of vector loads/stores needed to copy the buffer,
     * including any remainder.
     */
    int cnt = len / element_count_of<V>::value;
    cnt += (len % element_count_of<V>::value > 0);

    /*
     * Initialize the SE for a linear read in and the SA for a linear write
     * out.
     */
    __SE_TEMPLATE_v1 in_tmplt  = __gen_SE_TEMPLATE_v1();
    __SA_TEMPLATE_v1 out_tmplt = __gen_SA_TEMPLATE_v1();

    in_tmplt.VECLEN  = se_veclen<V>::value;
    in_tmplt.ELETYPE = se_eletype<V>::value;
    in_tmplt.DIMFMT  = __SE_DIMFMT_1D;
    in_tmplt.ICNT0   = len;

    out_tmplt.VECLEN = sa_veclen<V>::value;
    out_tmplt.DIMFMT = __SA_DIMFMT_1D;
    out_tmplt.ICNT0  = len;

    __SE0_OPEN(in, in_tmplt);
    __SA0_OPEN(out_tmplt);

    /*
     * Perform the copy. If there is a remainder, the last store will be
     * predicated.
     */
    int i;
    for (i = 0; i < cnt; i++)
    {
        V        tmp  = strm_eng<0, V>::get_adv();
        __vpred  pred = strm_agen<0, V>::get_vpred();
        V       *addr = strm_agen<0, V>::get_adv(out);
        __vstore_pred(pred, addr, tmp);
    }

    __SE0_CLOSE();
    __SA0_CLOSE();
}
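
Because the vector type V is derived from the element type S, the number of loop iterations depends on both the element size and the vector width of the target. The host-side sketch below models that relationship in plain C++. model_element_count() and model_loop_count() are hypothetical helpers, not part of c7x_scalable.h, and the vector widths of 64 and 32 bytes are assumptions chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of element_count_of<V>::value for a vector that is
// VEC_BYTES wide and holds elements of type S.
template <typename S, std::size_t VEC_BYTES>
constexpr int model_element_count()
{
    return static_cast<int>(VEC_BYTES / sizeof(S));
}

// Hypothetical model of the loop count computed in memcpy_scalable_strm:
// ceiling of len divided by the number of elements per vector.
template <typename S, std::size_t VEC_BYTES>
constexpr int model_loop_count(int len)
{
    return len / model_element_count<S, VEC_BYTES>()
         + (len % model_element_count<S, VEC_BYTES>() > 0 ? 1 : 0);
}

// Worked values, checked at compile time.
static_assert(model_element_count<std::int16_t, 64>() == 32,
              "32 16-bit elements fit in an assumed 64-byte vector");
static_assert(model_loop_count<std::int16_t, 64>(100) == 4,
              "100 elements need 4 iterations at 32 elements per vector");
static_assert(model_loop_count<std::int16_t, 32>(100) == 7,
              "100 elements need 7 iterations at 16 elements per vector");
```

The source of memcpy_scalable_strm does not change between these cases; only the values derived from V change, which is the point of templating on the element type.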