The Scalable Vector Programming Model section introduced the scalable vector programming model and ended with an example that began using scalable vector types to copy data from one pointer to another. That example used the C++ type trait c7x::element_count_of<T>::value to determine how many times the copy loop needed to iterate:

void memcpy_scalable_v1 (const c7x::int_vec *restrict in,
                         c7x::int_vec *restrict out,
                         int len /* number of 32-bit elements */)
{
    // Find the number of vector loads/stores needed to copy the
    // buffer. This code assumes the number of elements to be copied
    // is evenly divisible by the element count of c7x::int_vec!
    int cnt = len / c7x::element_count_of<c7x::int_vec>::value;

    // Perform the copy
    for (int i = 0; i < cnt; i++)
    {
        out[i] = in[i];
    }
}

Experienced software engineers will note that this example could be improved in at least three ways. First, the loop doesn't handle a length that isn't an exact multiple of the vector size: any leftover data at the end of the buffer is not copied. Second, the Streaming Address Generator's predicate feature and a predicated store could be used to handle that leftover data after the maximum number of full-vector-width copies have been made. Third, the function could be templatized so that it works with any scalable vector type.
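To make the first point concrete, here is a minimal portable C++ sketch (no C7x headers) of how a length splits into full-vector copies plus a leftover. It assumes the length is counted in 32-bit elements and that a vector holds 16 of them (a 512-bit vector). On a real target the lane count would come from c7x::element_count_of<c7x::int_vec>::value. The names kLanes, Split, and split_length are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Assumption: 16 x 32-bit ints per vector (512-bit vector). This is a
// stand-in for c7x::element_count_of<c7x::int_vec>::value.
constexpr int kLanes = 16;

// Result of splitting an element count into whole vectors and leftovers.
struct Split {
    int full_vectors;  // iterations of the full-width copy loop
    int remainder;     // elements left for a final, partial store
};

// Hypothetical helper: split an element count into full vectors + remainder.
inline Split split_length(int len) {
    assert(len >= 0);  // a negative length is a caller error
    return Split{len / kLanes, len % kLanes};
}
```

Under these assumptions, split_length(37) gives 2 full vectors and a remainder of 5, so a loop that only performs full-vector copies would leave the last 5 elements uncopied.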

7.1. Creating Vector Predicates

Let's refine the memory copy example to take care of any remaining bytes to be copied, first without using the Streaming Address Generator.

On the last iteration of the copy loop, we want to copy only the remaining 32-bit integers, which may be fewer than a full vector. We can do this with a predicated vector store intrinsic, which accepts a vector predicate that tells it which bytes to store to memory. Each bit in a vector predicate controls whether the corresponding byte is "on" when the predicate is used. In the normal case, the predicate indicates that every byte is to be stored. On the last iteration, it indicates only those bytes that should be stored.

We can set up a vector predicate by using the __mask_int() intrinsic. The __mask_int() intrinsic is passed a value between 0 and 63 and creates a vector predicate whose bits indicate which bytes should be stored to memory. This vector predicate is then passed to the predicated vector store intrinsic along with the store address and the source data. The example code is shown below.

#include <c7x.h>
#include <c7x_scalable.h>
#include <stdio.h>   // printf

void memcpy_scalable_v2 (const c7x::int_vec *restrict in,
                         c7x::int_vec *restrict out,
                         int len)
{
    // Find the number of full-vector loads/stores needed to copy the
    // buffer, and the number of 32-bit elements left over for a final
    // partial store.
    int cnt = len / c7x::element_count_of<c7x::int_vec>::value;
    uint remainder = (len % c7x::element_count_of<c7x::int_vec>::value);

    // Perform most of the copy using full-vector-length loads and stores
    int i;
    for (i = 0; i < cnt; i++)
    {
        out[i] = in[i];
    }

    // Handle any remainder
    if (remainder)
    {
        // Create a vector predicate to store only the bytes we want to copy
        __vpred pred = __mask_int(remainder);
        printf("__vpred is 0x%lx\n", __create_scalar(pred));
        // Use a predicated store instruction and our created predicate to
        // store only the remaining bytes that we want to store.
        __vstore_pred(pred, &out[i], in[i]);
    }
}
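
For readers without access to a C7x target, the effect of the predicate can be explored with a small portable model. The sketch below is not the C7x implementation. It assumes a 64-byte vector of 32-bit integers and a predicate layout in which bit b enables byte b, following the description above that each predicate bit controls one byte. The names model_mask_int and model_store_pred are hypothetical stand-ins for illustration and are not TI intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Assumed geometry: a 64-byte vector holding 16 x 32-bit ints.
constexpr unsigned kVecBytes = 64;
constexpr unsigned kIntBytes = 4;

// Hypothetical model of a byte predicate for n_ints 32-bit elements:
// the low n_ints * 4 bits are set, one bit per enabled byte.
inline std::uint64_t model_mask_int(unsigned n_ints) {
    assert(n_ints <= kVecBytes / kIntBytes);
    const unsigned n_bytes = n_ints * kIntBytes;
    if (n_bytes >= 64) {
        return ~std::uint64_t{0};  // all bytes on; avoids a shift by 64
    }
    return (std::uint64_t{1} << n_bytes) - 1;
}

// Hypothetical model of a predicated store: copy only the bytes whose
// predicate bit is set, leaving all other destination bytes untouched.
inline void model_store_pred(std::uint64_t pred, void *dst, const void *src) {
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    for (unsigned b = 0; b < kVecBytes; ++b) {
        if ((pred >> b) & 1u) {
            d[b] = s[b];
        }
    }
}
```

In this model, model_mask_int(3) yields 0xFFF: twelve bits set, one for each byte of three 32-bit integers. Passing that value to model_store_pred writes the first three integers of the source and leaves the rest of the destination unchanged, which is the behavior the final iteration of the copy relies on.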