The Scalable Vector Programming Model section introduced the scalable vector
programming model and ended with an example that began using elements of
that model to copy data from one pointer to another. This example used the
C++ type trait c7x::element_count_of<T>::value to help determine how many
times the copy loop needed to iterate:
void memcpy_scalable_v1 (const c7x::int_vec *restrict in,
                         c7x::int_vec *restrict out,
                         int len /* number of 32-bit elements */)
{
    // Find the number of vector loads/stores needed to copy the
    // buffer. This code assumes the length of the array to be copied
    // is evenly divisible by the number of elements in c7x::int_vec!
    int cnt = len / c7x::element_count_of<c7x::int_vec>::value;

    // Perform the copy, one full vector per iteration
    for (int i = 0; i < cnt; i++)
    {
        out[i] = in[i];
    }
}
Experienced software engineers will note that this example could be improved in
at least three ways. First, the loop won't properly handle a length that isn't
evenly divisible by the number of elements in a c7x::int_vec. In other words, it
won't copy any remaining elements that the caller wants copied. Second, the
Streaming Address Generator's predicate feature and a predicated store could
be used to handle the remaining elements after the maximum number of
full-vector-width copies are made. Third, the function could be templatized so
that it can use any scalable vector type.
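To make the first shortcoming concrete, the following host-side sketch splits a length into the part the full-vector loop covers and the part it leaves behind. It is an illustration only: `CopyPlan` and `plan_copy` are hypothetical names that do not appear in the C7000 headers, `len` is treated as a count of 32-bit elements (which is what dividing by the element count implies), and the lane count is passed in as a plain integer instead of being read from c7x::element_count_of.

```cpp
#include <cassert>

// Hypothetical helper, not part of the C7000 headers: describes how a
// copy of `len` elements divides into whole vectors plus leftovers.
struct CopyPlan
{
    int full_vectors;  // iterations the full-vector loop performs
    int leftover;      // elements the full-vector loop never copies
};

// `lanes` stands in for c7x::element_count_of<c7x::int_vec>::value.
inline CopyPlan plan_copy(int len, int lanes)
{
    assert(len >= 0 && lanes > 0);
    return CopyPlan{len / lanes, len % lanes};
}
```

For example, assuming a vector with 16 lanes, a request to copy 35 elements yields two full-vector iterations and three leftover elements that the v1 loop silently skips.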
7.1. Creating Vector Predicates
Let's refine the memory copy example to take care of any remaining bytes to be copied, first without using the Streaming Address Generator.
On the last iteration of the copy loop, we only want to copy the remaining
32-bit integers. That is, we may not want to copy an entire vector. We can
do this by using a predicated vector store intrinsic that accepts a vector predicate.
The vector predicate will tell the predicated vector store intrinsic which
bytes to store to memory. In the normal case, the predicate will indicate
every byte is to be copied to memory. In the last iteration, it will indicate
only those bytes that should be stored to memory.
We can set up a vector predicate by using the __mask_int() intrinsic. The
__mask_int() intrinsic is passed a value between 0 and 63 and creates a vector
predicate whose bits indicate which bytes should be stored to memory. Each bit
in a vector predicate corresponds to one byte of the vector and controls whether
that byte is "on" when the vector predicate is used. This vector predicate
is then passed to the predicated vector store intrinsic along with the store
address and the source data. The example code is shown below.
#include <stdio.h>
#include <c7x.h>
#include <c7x_scalable.h>

void memcpy_scalable_v2 (const c7x::int_vec *restrict in,
                         c7x::int_vec *restrict out,
                         int len)
{
    // Find the maximum number of full-vector loads/stores needed to copy
    // the buffer, and how many elements are left over afterwards.
    int cnt = len / c7x::element_count_of<c7x::int_vec>::value;
    unsigned int remainder = len % c7x::element_count_of<c7x::int_vec>::value;

    // Perform most of the copy using full-vector-length loads and stores
    int i;
    for (i = 0; i < cnt; i++)
    {
        out[i] = in[i];
    }

    // Handle any remainder
    if (remainder)
    {
        // Create a vector predicate to store only the bytes we want to copy
        __vpred pred = __mask_int(remainder);

        // Print the predicate so its bit pattern can be inspected
        printf("__vpred is 0x%lx\n", __create_scalar(pred));

        // Use a predicated store instruction and our created predicate to
        // store only the remaining bytes that we want to store.
        __vstore_pred(pred, &out[i], in[i]);
    }
}
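The effect of the predicate on the final store can be pictured with a small host-side model. This is a sketch under stated assumptions, not the behavior of the real intrinsics: `model_mask_int` and `model_store_pred` are hypothetical names, the model assumes that bit k of the predicate governs byte k of the vector, that a mask for n 32-bit integers enables the lowest 4 * n bytes, and that the vector is at most 64 bytes wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side model only; none of these names exist in the C7000 headers.

// Models a predicate that enables the lowest n_ints 32-bit elements,
// i.e. the lowest 4 * n_ints bytes, with one predicate bit per byte.
inline std::uint64_t model_mask_int(unsigned n_ints)
{
    assert(n_ints <= 16);  // at most 64 bytes in the modeled vector
    const unsigned bytes = 4u * n_ints;
    // Shifting a 64-bit value by 64 is undefined, so handle that case.
    return (bytes >= 64u) ? ~std::uint64_t{0}
                          : ((std::uint64_t{1} << bytes) - 1u);
}

// Models a predicated store: byte k of src is written to dst only when
// bit k of pred is set; every other byte of dst is left untouched.
inline void model_store_pred(std::uint64_t pred, unsigned char *dst,
                             const unsigned char *src, std::size_t vec_bytes)
{
    assert(vec_bytes <= 64);
    for (std::size_t k = 0; k < vec_bytes; ++k)
    {
        if ((pred >> k) & 1u)
        {
            dst[k] = src[k];
        }
    }
}
```

In this model, a remainder of three integers produces the mask 0xfff, so only the first 12 bytes of the destination are overwritten and the bytes beyond them keep their previous contents. That is the property the last iteration of memcpy_scalable_v2 relies on.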