Readme for C7000 Code Generation Tools v4.1.0

0 Introduction to the C7000 Code Generation Tools v4.1.x LTS
1 Documentation
2 TI E2E Community - Where to get help
3 Defect Tracking Database
4 Host Emulation Support and Breaking Changes
5 Boolean Vector Types
- 5.1 Supported Syntax and Functions for Boolean Vectors
- 5.2 Boolean Vector vs. Vector Predicate Type
6 A Note on Intrinsics and Header Files
- Supported Intrinsics
7 C7x scalable vector programming
8 Silicon errata i2376 workaround support
9 Link-Time Optimization not supported between targets
10 LUT interface change
11 Restrict advice
12 Predicate-generating comparison intrinsics
13 Automatic use of streaming engine and streaming address generator
14 Removed features
15 Resolved defects
16 Known defects

0 Introduction to the C7000 Code Generation Tools v4.1.x LTS

This C7000 compiler release is a “Long-Term Support” (LTS) release.

This release supports the C7100, C7120, C7504, and C7524 ISA cores. To compile code for a given core, use the corresponding -mv option or, equivalently, the corresponding --silicon_version option:

C7100: -mv7100 or --silicon_version=7100
C7120: -mv7120 or --silicon_version=7120
C7504: -mv7504 or --silicon_version=7504
C7524: -mv7524 or --silicon_version=7524

For definitions and explanations of STS, LTS, and the version numbering scheme, please see SDTO Compiler Version Numbers.

1 Documentation

The following documents provide information on how to use, program, and migrate to the C7000 CPU.

SPRUIG8***.PDF: C7000 C/C++ Optimizing Compiler Users Guide

SPRUIV4***.PDF: C7000 Optimization Guide

SPRUIG4***.PDF: C7000 Embedded Application Binary Interface (EABI) Reference Guide

SPRUIG5***.PDF: C6000-to-C7000 Migration User’s Guide

SPRUIG6***.PDF: C7000 Host Emulation User’s Guide

2 TI E2E Community - Where to get help

Post compiler-related questions to the TI E2E design community forum and select the TI device being used.

The E2E Design Support Forum Website

If submitting a defect report, please attach a scaled-down test case with command-line options and the compiler version number to allow us to reproduce the issue easily.

The following is the top-level webpage for all of TI’s Code Generation Tools.

Code Generation Tools Landing Page

3 Defect Tracking Database

Compiler defect reports can be tracked at the Development Tools bug database, SIR. SIR is a JIRA-based view into all public tools defects.

SIR Development Tools Defect Tracking Website

A my.ti.com account is required to access this page. To find an issue in SIR, enter your defect ID in the top-right search box once logged in. Alternatively, from the top red navigation bar, select “Issues” then “Search for Issues”.

4 Host Emulation Support and Breaking Changes

To improve the stability of host emulation and to improve the consistency of the implementation compared to cl7x, some API and syntax changes have been made in the 3.0.0 compiler. In some cases, these changes break compatibility with previous compilers.

4.1 Signed 8-bit Types (Compatibility Break)

In cl7x, plain “char” is a signed type with the same format as “signed char”, but the C standard says that these two types are incompatible types. This means you can’t pass a “char *” to a function expecting a “signed char *” or vice versa. This difference is why SPRUIG6 revision H “C7000 Host Emulation User’s Guide” recommends the use of standard integer types such as “int8_t” in Section 3.2. Edge case incompatibilities between “signed char” and “char” are also why SPRUIG6 revision H contains workaround notes for intrinsics that accept “char *” in Section 4.2.1.

To resolve this inconsistency and to ensure code compiles correctly for both cl7x and host emulation without workarounds, intrinsics that previously accepted “char”, “char *”, or similar now accept “signed char”, “signed char *”, or similar. In the following cases, this will result in an incompatibility between the 3.0.0 compiler and previous versions:

__vload_dup(const signed char*)
__vload_pred_dup(__vpred, const signed char*)
__vload_unpack_short(const signed char*)
__vload_pred_unpack_short(__vpred, const signed char*)
__vload_unpack_int(const signed char*)
__vload_pred_unpack_int(__vpred, const signed char*)
__vload_unpack_long(const signed char*)
__vload_pred_unpack_long(__vpred, const signed char*)
__max_circ_pred(signed char, signed char&, __vpred&)
__max_index(signed char, signed char&, __vpred&)
__min_index(signed char, signed char&, __vpred&)

Other intrinsics not listed above, including vector versions of those intrinsics, should not result in incompatibility with previous compiler versions.
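
The distinction between “char” and “signed char” can be checked with any standard C++ compiler. The following is a minimal sketch rather than TI code: first_element is a hypothetical stand-in for an intrinsic declared with “const signed char *”, and no C7000 header is involved.

```cpp
#include <cassert>
#include <type_traits>

// char, signed char, and unsigned char are three distinct types in C++,
// even on targets where plain char is signed.
static_assert(!std::is_same<char, signed char>::value,
              "char and signed char are distinct types");

// Hypothetical stand-in for an intrinsic declared with "const signed char *".
inline int first_element(const signed char *p)
{
    assert(p != nullptr);
    return *p;
}

inline int read_signed_buffer()
{
    const signed char buf[4] = { -1, 2, 3, 4 };
    return first_element(buf);      // OK: the pointer type matches exactly
}

// const char text[4] = { 1, 2, 3, 4 };
// first_element(text);             // error: no implicit conversion from
//                                  // "const char *" to "const signed char *"
```

Declaring buffers as “signed char” (or a fixed-width signed 8-bit type) avoids the need for pointer casts at such call sites.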

4.2 Function Style Vector Swizzles (Compatibility Break)

Previously, in cl7x and host emulation, vector swizzles always used a data member style syntax, for example “vec.s0” or “vec.lo”. Host emulation now requires a function style syntax, for example “vec.s0()” or “vec.lo()”. cl7x accepts either syntax.

int4 x = int4(0, 1, 2, 3);
int4 y = int4(4, 5, 6, 7);
x.lo() = y.lo(); /* Function style syntax, legal in cl7x and host emulation. */
x.lo = y.lo; /* Data member style syntax, legal only in cl7x. */

4.3 Vector Subscript Syntax (Enhancement)

Previously, in cl7x, vector subscript access using .s[n] restricted n to an integer literal. However, host emulation accepts .s[n] without additional restrictions due to being implemented as an array. To address the occasional need to loop over vector elements and to align cl7x more closely with host emulation, the following are now allowed:

Variable index vector subscript when the subscript is not applied to a vector swizzle.
Address of a vector subscript when the subscript is not applied to a vector swizzle.
Reference to a vector subscript when the subscript is not applied to a vector swizzle.

The following are examples of cases that are now allowed:

int16 x;
int32_t i = 0;
x.s[i] = 0; /* Now legal. */
int32_t *x_ptr_1 = &x.s[0]; /* Now legal. */
int32_t *x_ptr_2 = &x.s[i]; /* Now legal. */
int32_t &x_ref_1 = x.s[0]; /* Now legal. */
int32_t &x_ref_2 = x.s[i]; /* Now legal. */

The following examples are still illegal in cl7x. Host emulation still does not diagnose them:

int16 x;
int32_t i = 0;
x.even().s[i] = 0; /* Still illegal. */
int32_t *x_ptr_1 = &x.even().s[0]; /* Still illegal. */
int32_t *x_ptr_2 = &x.odd().s[i]; /* Still illegal. */
int32_t &x_ref_1 = x.lo().s[0]; /* Still illegal. */
int32_t &x_ref_2 = x.hi().s[i]; /* Still illegal. */

4.4 Vector Constructor Style Initialization (Enhancement)

Previously, in cl7x, vectors needed to be initialized with a “cast” style syntax. In host emulation, vectors needed to be initialized with a constructor syntax. This resulted in code similar to the following:

#if defined(__C7X_HOSTEM__)
/* Host emulation syntax */
int4 x = int4(0, 1, 2, 3); /* Illegal in cl7x */
#else
/* cl7x syntax */
int4 x = (int4)(0, 1, 2, 3); /* HE has results equivalent to (int4)(3) */
#endif

To address the need for consistent initialization styles and to align cl7x more closely with host emulation, “constructor” style initializations are now allowed in cl7x. The following examples are now accepted in both cl7x and host emulation, with equivalent behavior:

int4 x = int4(0, 1, 2, 3); /* Legal in cl7x and host emulation. */
int4 y = int4(0); /* Legal in cl7x and host emulation. */
int4 z = int4(int2(0, 1), int2(2, 3)); /* Legal in cl7x and host emulation. */
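
The reason the cast style initializer behaves like (int4)(3) in host emulation follows from standard C++ parsing: the parenthesized list is a comma expression, and only its last value reaches the cast. The following is a minimal sketch of that rule. model_int4 is a hypothetical type written for this illustration; it is not the host emulation implementation.

```cpp
#include <cassert>

// Hypothetical four-lane type used only to model how a standard C++
// compiler parses the two initializer styles; it is not a TI type.
struct model_int4 {
    int s[4];
    explicit model_int4(int v) : s{v, v, v, v} {}
    model_int4(int a, int b, int c, int d) : s{a, b, c, d} {}
};

inline model_int4 cast_style()
{
    // Parsed as a cast applied to the comma expression (0, 1, 2, 3).
    // The comma operator discards 0, 1, and 2, so this is model_int4(3).
    return (model_int4)(0, 1, 2, 3);
}

inline model_int4 constructor_style()
{
    // Four constructor arguments: lanes are 0, 1, 2, 3.
    return model_int4(0, 1, 2, 3);
}
```

A host compiler may warn that the left operands of the comma operator have no effect, which is a useful hint that cast style code is being built for host emulation.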

4.5 Const and Pointer-To Vector Typedefs (Enhancement, Deprecation)

The compiler and host emulation provide const and pointer-to vector typedefs. These were previously needed for compatibility with host emulation. However, these are no longer required for use with host emulation.

int4 x = int4(0, 1, 2, 3);
const int4 y = int4(0, 1, 2, 3); /* Now supported in host emulation. */
int4 *x_ptr = &x; /* Now supported in host emulation. */
const int4 *y_ptr = &y; /* Now supported in host emulation. */

As such, the const and pointer-to vector typedefs provided by host emulation and the compiler are deprecated. “const_int16”, “int16_ptr”, “const_int16_ptr”, and similar typedefs for other vector types may be removed in a future release.

5 Boolean Vector Types

Previously, boolean vectors could only be used as a high-level abstraction for vector predicates on intrinsics. Boolean vectors were not supported as a vector type in the C7000 compiler and host emulation.

Boolean vectors are now a vector data type on C7000 and host emulation. Vector data types are described in the section titled “Vector Data Types” in C7000 C/C++ Optimizing Compiler Users Guide (Rev. F). The boolean vector type can hold a maximum of 64 elements.

5.1 Supported Syntax and Functions for Boolean Vectors

Boolean vectors can be initialized and accessed like other vector types. More detailed information on vector operations and functions can be found in the section titled “Operations and Functions for Vector Data Types” in C7000 C/C++ Optimizing Compiler Users Guide (Rev. F).

The following example shows a subset of the supported syntax for initializing and accessing a boolean vector.

bool4 x = bool4(0,1,0,1); /* Initializing with constants. */
bool4 y = bool4(0);

bool8 z = bool8(x,y); /* Initializing with vectors. */

bool a = z.s[0]; /* Accessors */
bool4 b = z.even();
bool2 c = b.lo();
etc.

Boolean vectors can be converted to and from other vector types. More information on conversion functions for vectors can be found in the section titled “Conversion Functions for Vectors” in C7000 C/C++ Optimizing Compiler Users Guide (Rev. F). The following example shows how non-zero values result in a 1 (true) when converted to a boolean vector.

int4 x = int4(0,9,-10,1);
bool4 y = convert_bool4(x); // (0,1,1,1)
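
The conversion rule can be modeled on a host with standard containers. This is a minimal sketch, assuming lane-by-lane semantics as described above; model_convert_bool is a hypothetical name and does not use any C7000 type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Portable model of an element-wise conversion to a boolean vector:
// each lane becomes true when the source lane is non-zero.
template <std::size_t N>
std::array<bool, N> model_convert_bool(const std::array<int, N> &v)
{
    std::array<bool, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = (v[i] != 0);
    return out;
}
```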

Boolean vectors can be reinterpreted to and from other vector types. More information on re-interpretation functions for vectors can be found in the section titled “Re-Interpretation Functions for Vectors” in C7000 C/C++ Optimizing Compiler Users Guide (Rev. F). The following example shows that a reinterpret to a boolean vector is undefined unless every resulting element contains exactly 0x0 or 0x1.

ushort2 myshort2_0 = (ushort2)(0,1);
bool4 mybool4_0 = as_bool4(myshort2_0); // Defined

ushort2 myshort2_1 = (ushort2)(2,3);
bool4 mybool4_1 = as_bool4(myshort2_1); // Undefined

bool8 mybool8_0 = (bool8)(0,1,0,1,0,1,0,1);
float2 myfloat2_0 = as_float2(mybool8_0); // Defined

float2 myfloat2_1 = (float2)(1.0,2.0);
bool8 mybool8_1 = as_bool8(myfloat2_1); // Undefined
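
The “Defined” and “Undefined” cases above can be reproduced by inspecting the bytes of the source value. The following is a minimal host-side sketch, assuming one byte per boolean element (as the bool4 and ushort2 pairing implies); model_as_bool_is_defined is a hypothetical helper, not a TI function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Portable model: a reinterpret to a boolean vector is treated as well
// defined only when every byte of the source is exactly 0x00 or 0x01.
template <typename T>
bool model_as_bool_is_defined(const T &src)
{
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &src, sizeof(T));
    for (std::size_t i = 0; i < sizeof(T); ++i)
        if (bytes[i] > 1)
            return false;
    return true;
}
```

For example, the two 16-bit values (0,1) contain only the bytes 0x00 and 0x01, while (2,3) and the IEEE-754 encodings of 1.0 and 2.0 contain other byte values.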

Unlike other integral vector types, boolean vectors cannot be used as the condition of the vector ternary operator. More information on the vector ternary operator can be found in the section titled “Ternary Operators for Vectors (?:)” in C7000 C/C++ Optimizing Compiler Users Guide (Rev. F). Standard boolean operations such as &&, ||, &, &=, |, |=, ^, ^=, !, ~, ==, !=, <=, <, >=, and > are not currently supported on the boolean vector type.

5.2 Boolean Vector vs. Vector Predicate Type

Boolean vectors may be used as an abstract alternative to the low-level vector predicate type on most predicated intrinsics on the C7000.

The use of boolean vectors as vector predicates is encouraged. However, the boolean vector type is not fully interchangeable with the low-level vector predicate type. More information on the low-level vector predicate type can be found in the section titled “Vector Predicate Type” in C7000 C/C++ Optimizing Compiler Users Guide (Rev. F).

The following example shows a difference in the capability of vector predicates and boolean vectors. Boolean vectors predicate input data by lane regardless of the element type.

When using a vector predicate type, it is the responsibility of the user to insert proper scaling on predicates. Vector predicates can be scaled up or down by a factor k={0-63} through the intrinsics “__expand_vpred(__vpred, k)” and “__pack_vpred(__vpred, k)”.

The two functions in the example below achieve the same result, one with a boolean vector and the other with a low-level vector predicate type.

// Boolean vector example
void foo(int4 *ptr, int4 data, char4 *ptr2, char4 data2)
{
    bool4 pred = bool4(0,1,1,0);
    __vstore_pred(pred, ptr, data); // Word-based store
    __vstore_pred(pred, ptr2, data2); // Byte-based store
}

// Vector predicate example
void bar(int4 *ptr, int4 data, char4 *ptr2, char4 data2)
{
    __vpred pred = _mvrp(0x0000000000000ff0); // Word-scaled predicate
    __vstore_pred(pred, ptr, data);

    pred = __pack_vpred(pred, 2); // Byte-scaled predicate
    __vstore_pred(pred, ptr2, data2);
}
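
The relationship between the word-scaled and byte-scaled predicates in bar() can be modeled with plain integer arithmetic. This is a sketch under assumptions inferred from the 0x0000000000000ff0 value above: one predicate bit per byte, with lane 0 in the least significant bits. model_predicate_mask is a hypothetical helper and does not model the k argument of the TI intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable model, not a TI intrinsic: build a 64-bit predicate mask in which
// every vector lane owns elem_bytes consecutive bits (one bit per byte).
inline std::uint64_t model_predicate_mask(const bool *lanes,
                                          std::size_t lane_count,
                                          std::size_t elem_bytes)
{
    assert(elem_bytes > 0 && lane_count * elem_bytes <= 64);
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < lane_count; ++lane)
        if (lanes[lane])
            for (std::size_t b = 0; b < elem_bytes; ++b)
                mask |= std::uint64_t{1} << (lane * elem_bytes + b);
    return mask;
}
```

With lanes (0,1,1,0), a 4-byte element size gives the word-scaled mask used for the int4 store, and a 1-byte element size gives the mask that the char4 store needs. A boolean vector hides this difference from the caller.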

6 A Note on Intrinsics and Header Files

Supported Intrinsics

The included top-level header files “c7x.h” and “c6x_migration.h” list the supported intrinsics for C7x and C6x, respectively. Note that you must include these header files in your source in order to use many of the C7x intrinsics and all of the legacy C6x intrinsics. “c7x.h” includes other useful header files that document the supported intrinsics:

c7x_vpred.h: List of intrinsics supporting low-level __vpred vector predicate type.
c7x_direct.h: List of intrinsics that map directly to instructions.
c7x_strm.h: List of intrinsics and flags for C7x Streaming Engine and Stream Address Generator.
c7x_mma.h: List of intrinsics and associated structures and enumerations for the C7x MMA.
c7x_luthist.h: List of intrinsics and flags for C7x Lookup Table / Histogram support.

7 C7x scalable vector programming

A set of utilities is provided in the compiler library for writing vector-width-independent code for C7000. To make use of these utilities, include c7x_scalable.h in source code.

These utilities are available in C++ mode only due to use of C++ language features in their implementation.

These utilities are available when using the TI C7000 compiler or when compiling with TI C7000 Host Emulation.

The following APIs are available, all of which are described in further detail in c7x_scalable.h:

Vector type query and construction
- c7x::max_simd<T>::value
- c7x::element_count_of<T>::value
- c7x::element_type_of<T>::type
- c7x::component_type_of<T>::type
- c7x::make_vector<T,N>::type
- c7x::make_full_vector<T>::type
- c7x::is_target_vector<T>::value
Full vector types
- c7x::char_vec
- c7x::short_vec
- etc
Half vector types
- c7x::char_hvec
- c7x::short_hvec
- etc
Quarter vector types
- c7x::char_qvec
- c7x::short_qvec
- etc
Templated vector reinterprets and conversions
- c7x::reinterpret<T>(v)
- c7x::convert<T>(v)
Vector reinterprets and conversions
- c7x::as_char_vec(v)
- c7x::convert_short_vec(v)
- etc
Streaming engine and streaming address generator helpers
- c7x::se_veclen<T>::value
- c7x::se_eletype<T>::value
- c7x::se_eledup<T1,T2>::value
- c7x::sa_veclen<T>::value
- c7x::strm_eng<I,T>::get()
- c7x::strm_eng<I,T>::get_adv()
- c7x::strm_agen<I,T>::get(p)
- c7x::strm_agen<I,T>::get_adv(p)
- c7x::strm_agen<I,T>::get_vpred()

The following macros are available for programming the MMA, all of which are described in c7x_mma.h:

```
  __MMA_A_MAT_BYTES__
  __MMA_A_ROW_WIDTH_BYTES__
  __MMA_A_ROWS__
  __MMA_A_COLS(ebytes)
  __MMA_A_ENTRIES__
  __MMA_B_MAT_BYTES__
  __MMA_B_ROW_WIDTH_BYTES__
  __MMA_B_ROWS(ebytes)
  __MMA_B_COLS(ebytes)
  __MMA_C_MAT_BYTES__
  __MMA_C_ROW_WIDTH_BYTES__
  __MMA_C_ROWS__
  __MMA_C_COLS(ebytes)
  __MMA_C_ENTRIES__
```

As a moderate-complexity example, the following is an implementation of a memcpy that is templated on the input type and uses a streaming engine and a streaming address generator:

#include <c7x_scalable.h>

using namespace c7x;

/*
 * memcpy_scalable_strm<typename S>(const S*in, S *out, int len)
 *
 * S - A basic data type such as short or float.
 * in - The input buffer.
 * out - The output buffer.
 * len - The number of elements to copy.
 *
 * Defaulted template arguments:
 * V - A full vector type of S
 */
template<typename S,
         typename V  = typename make_full_vector<S>::type>
void memcpy_scalable_strm(const S *restrict in, S *restrict out, int len)
{
    /*
     * Find the maximum number of vector loads/stores needed to copy the buffer,
     * including any remainder.
     */
    int cnt = len / element_count_of<V>::value;
    cnt += (len % element_count_of<V>::value > 0);

    /*
     * Initialize the SE for a linear read in and the SA for a linear write
     * out.
     */
    __SE_TEMPLATE_v1 in_tmplt = __gen_SE_TEMPLATE_v1();
    __SA_TEMPLATE_v1 out_tmplt = __gen_SA_TEMPLATE_v1();

    in_tmplt.VECLEN = se_veclen<V>::value;
    in_tmplt.ELETYPE = se_eletype<V>::value;
    in_tmplt.ICNT0 = len;

    out_tmplt.VECLEN = sa_veclen<V>::value;
    out_tmplt.ICNT0 = len;

    __SE0_OPEN(in, in_tmplt);
    __SA0_OPEN(out_tmplt);

    /*
     * Perform the copy. If there is remainder, the last store will be
     * predicated.
     */
    int i;
    for (i = 0; i < cnt; i++)
    {
        V tmp = strm_eng<0, V>::get_adv();
        __vpred pred = strm_agen<0, V>::get_vpred();
        V *addr = strm_agen<0, V>::get_adv(out);
        __vstore_pred(pred, addr, tmp);
    }

    __SE0_CLOSE();
    __SA0_CLOSE();
}
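
The iteration count used by memcpy_scalable_strm is a ceiling division of the element count by the vector width. The following host-side sketch repeats that arithmetic with plain integers so it can be checked without the C7000 tools. The function names are hypothetical, and the lane count is passed as a parameter instead of being taken from element_count_of<V>::value.

```cpp
#include <cassert>

// Number of vector iterations needed to cover len elements when each vector
// holds lanes elements; same arithmetic as cnt in memcpy_scalable_strm.
inline int model_vector_iterations(int len, int lanes)
{
    assert(len >= 0 && lanes > 0);
    int cnt = len / lanes;
    cnt += (len % lanes > 0);
    return cnt;
}

// Elements covered by the final iteration: a full vector when len is a
// multiple of lanes, otherwise the remainder that the predicate must select.
inline int model_last_iteration_elements(int len, int lanes)
{
    assert(len > 0 && lanes > 0);
    int rem = len % lanes;
    return rem == 0 ? lanes : rem;
}
```

For instance, copying 100 elements with 16 lanes per vector takes 7 iterations, and only 4 lanes of the last store should be enabled.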

8 Silicon errata i2376 workaround support

The compiler option --silicon_errata_i2376 generates code that automatically works around silicon errata i2376 on devices with the C7504 CPU core. The workaround should not significantly affect performance. When the -mv7504 or --silicon_version=7504 compiler option is specified, the --silicon_errata_i2376 option is turned on automatically. To turn off the workaround, use --silicon_errata_i2376=off. Turning off the workaround is not recommended; doing so is intended only for advanced users in specific situations.

9 Link-Time Optimization not supported between targets

A clarification on Link-Time Optimization use:

When using Link-Time Optimization, use only source and object files compiled with the same --silicon_version and --mma_version options. Link-Time Optimization is not supported between source and/or object files compiled with different --silicon_version or --mma_version options. In this case, the compilation may fail.

For more information on Link-Time Optimization, see the C7000 C/C++ Optimizing Compiler Users Guide.

10 LUT interface change

The macros __LUT_SET_LTER, __LUT_SET_LTBR, and __LUT_SET_LTCR defined in c7x_luthist.h have been changed so that the definitions do not end with a semicolon. This is in accordance with the best practice for function-like macros: source code which invokes them should treat them just like function calls, in particular by following the macro invocation with a semicolon.

Before this change, if the source code invoked the macro as part of a containing statement such as an if/else statement, it was not allowed to use semicolons, leading to confusing code:

if (test) __LUT_SET_LTER(a)
else __LUT_SET_LTER(b)

After this change, the code must use semicolons as if the macro were a normal function call:

if (test) __LUT_SET_LTER(a);
else __LUT_SET_LTER(b);
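
The effect of a trailing semicolon inside a function-like macro can be reproduced with the standard preprocessor alone. The following is a minimal sketch: MODEL_SET_REG, MODEL_SET_REG_OLD, and g_reg are hypothetical names for this illustration and are not the definitions in c7x_luthist.h.

```cpp
#include <cassert>

static int g_reg = 0;   // hypothetical stand-in for a control register

// New style: no trailing semicolon, so the caller supplies it.
#define MODEL_SET_REG(v) (g_reg = (v))

// Old style, shown for contrast only: the semicolon is part of the macro.
#define MODEL_SET_REG_OLD(v) (g_reg = (v));

static int model_select(bool test, int a, int b)
{
    if (test) MODEL_SET_REG(a);
    else MODEL_SET_REG(b);
    // Writing "if (test) MODEL_SET_REG_OLD(a); else ..." would expand to
    // "(g_reg = (a));; else ...". The extra empty statement ends the if,
    // so the else has nothing to attach to and the code does not compile.
    return g_reg;
}
```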

11 Restrict advice

An advice-severity diagnostic message was added that identifies opportunities for qualifying function parameters with restrict if doing so is likely to improve loop performance. See Section 4.16 of the C7000 C/C++ Optimizing Compiler Users Guide.

The diagnostic can be disabled with --diag_suppress=35000, which is also supported in #pragma FUNCTION_OPTIONS.
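
As a sketch of the kind of change this advice points toward, the function below qualifies both pointer parameters so the compiler may assume the buffers do not overlap. The qualifier is spelled __restrict here, a common compiler extension, so that the example also builds with typical host C++ compilers; the example in section 7 uses restrict, and this README does not state which spellings cl7x accepts. scale_by_two is a hypothetical function.

```cpp
#include <cassert>

// Without the qualifiers the compiler must assume in and out may overlap,
// which can limit how aggressively the loop is optimized.
inline void scale_by_two(const int *__restrict in, int *__restrict out, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        out[i] = in[i] * 2;
}
```

The qualifier is a promise made by the caller: passing overlapping buffers to such a function results in undefined behavior.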

12 Predicate-generating comparison intrinsics

The __cmp_{ge,gt,le,lt}_{pred,bool} intrinsics are now overloaded to support integer and floating-point arguments. Previously, the greater-than versions only supported integer arguments and the less-than versions only supported floating-point arguments.

13 Automatic use of streaming engine and streaming address generator

13.1 Overview

Version 4.0.0 of the compiler adds support for automatic use of the streaming engines (SE) and the streaming address generators (SA). This behavior can be controlled with the --auto_stream option:

--auto_stream=off Disables automatic use of the SE and SA.
--auto_stream=saving Enables automatic use of the SE and SA with context saving. This option should be used if an SE or SA may be open when a function call is made. This option is safe, but may be slightly slower than --auto_stream=no_saving and may increase stack usage.
--auto_stream=no_saving Enables automatic use of the SE and SA without context saving. This option should be used if an SE or SA will never be open when a function call is made. This option is less safe than --auto_stream=saving but may be slightly faster and may reduce stack usage.

For C7100 and C7120, this optimization must be enabled manually with --auto_stream=no_saving because C7100 and C7120 have no SE or SA context switching support. For later parts, such as C7504, --auto_stream=saving is enabled by default.

--auto_stream will convert memory accesses in loop nests with addressing patterns that are guaranteed to fit into an SE or SA configuration template. For example:

void example1(char *in, char *restrict out, int len1, int len2)
{
    for (int i = 0; i < len1; i++)
        for (int j = 0; j < len2; j++)
            out[i*len1 + j] = in[i*len1 + j];
}

will be transformed to be equivalent to the following SE configuration on C7504 after being vectorized:

__SE_TEMPLATE_v1 tmplt = __gen_SE_TEMPLATE_v1();
tmplt.ICNT0 = 32;
tmplt.ICNT1 = (len2>>5)+((len2&0x1f) != 0);
tmplt.DIM1 = 32;
tmplt.ICNT2 = len1;
tmplt.DIM2 = len1;
tmplt.VECLEN = __SE_VECLEN_32ELEMS;
tmplt.DIMFMT = __SE_DIMFMT_3D;
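
The mapping between the loop nest in example1 and the template fields can be checked with integer arithmetic on a host. This is a sketch under an assumption that this README does not spell out: that the element offset of a stream position is i0 + i1*DIM1 + i2*DIM2, with i0 < ICNT0, i1 < ICNT1, and i2 < ICNT2. model_stream_covers_loop_nest is a hypothetical helper and does not configure any hardware.

```cpp
#include <cassert>

// Portable model of the 3D addressing arithmetic: verify that every access
// out[i*len1 + j] in example1 corresponds to a position inside the stream.
inline bool model_stream_covers_loop_nest(int len1, int len2)
{
    assert(len1 >= 0 && len2 >= 0);
    const int icnt0 = 32;                                  // one vector
    const int icnt1 = (len2 >> 5) + ((len2 & 0x1f) != 0);  // vectors per row
    const int dim1  = 32;
    const int icnt2 = len1;                                // rows
    const int dim2  = len1;                                // row stride

    for (int i = 0; i < len1; i++)
        for (int j = 0; j < len2; j++)
        {
            int i0 = j % icnt0, i1 = j / icnt0, i2 = i;
            if (i1 >= icnt1 || i2 >= icnt2)
                return false;            // access falls outside the stream
            if (i0 + i1 * dim1 + i2 * dim2 != i * len1 + j)
                return false;            // offsets disagree
        }
    return true;
}
```

ICNT1 is the number of 32-element vectors needed per row, rounded up, so the last vector of a row is only partially used when len2 is not a multiple of 32.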

13.2 Legality and correctness

The following will not be transformed because len1 and len2 may not fit in the 32-bit fields of the SE, and the loop counters may exceed 32-bit values:

void example2(char *in, char *restrict out, long len1, long len2)
{
    for (long i = 0; i < len1; i++)
        for (long j = 0; j < len2; j++)
            out[i*len1 + j] = in[i*len1 + j];
}

For situations such as the above, addressing patterns will almost always map to a stream in practice, although edge cases are possible. Such cases include, but are not limited to:

ICNT values exceeding the range of unsigned 32-bit.
DIM values exceeding the range of signed 32-bit.
Additions or multiplies in addressing exceeding the range of signed 32-bit.
Addressing exceeding the range of INT_MIN to INT_MAX elements.
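
The first two conditions in the list can be expressed as simple range checks. The following host-side sketch shows what “fitting” means for a 64-bit loop bound or stride; the function names are hypothetical, and the compiler's actual legality analysis is not described in this README.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// An ICNT value must be representable as an unsigned 32-bit integer.
inline bool model_icnt_fits(long long icnt)
{
    return icnt >= 0 &&
           icnt <= static_cast<long long>(
                       std::numeric_limits<std::uint32_t>::max());
}

// A DIM value must be representable as a signed 32-bit integer.
inline bool model_dim_fits(long long dim)
{
    return dim >= std::numeric_limits<std::int32_t>::min() &&
           dim <= std::numeric_limits<std::int32_t>::max();
}
```

When the loop bounds are of type long, as in example2, the compiler cannot prove these conditions at compile time, which is why the transformation is skipped by default.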

The --assume_addresses_ok_for_stream option is available to allow the compiler to ignore edge cases such as those above. Using this option will allow example2 to be transformed in the same way as example1.

If the --auto_stream=no_saving option is used when an SE or SA is open when a function call is made, incorrect code may be generated. In this case, the state of the SE or SA that is open will be lost if that SE or SA is used automatically by the compiler.

--auto_stream may generate incorrect code if L1D is configured and used as SRAM. In this case, attempting to use the SE to access L1D will fail.

13.3 Profitability and tuning

Automatic use of the SE and SA will only occur if the compiler believes transforming within a loop or loop nest to be profitable, which is primarily related to loop iteration counts. As such, using #pragma PROB_ITERATE and #pragma MUST_ITERATE will help guide this transformation.

Additionally, the compiler will not use an SE or SA if an SE or SA is already used in a function.

#pragma FUNCTION_OPTIONS may be used to control the behavior of automatic SE and SA on a function-by-function basis. For example, #pragma FUNCTION_OPTIONS("--auto_stream=no_saving --assume_addresses_ok_for_stream") could be used to enable automatic SE and SA for a single function on C7100.

14 Removed features

The enum values __MMA_OPEN_FSM_MINRESET and __MMA_OPEN_FSM_MAXRESET, used as the third argument to the __HWAOPEN intrinsic, have been removed from the c7x_mma.h header.

15 Resolved defects

Resolved defects in v4.1.0:

ID	Summary
CODEGEN-11774	HWAOPEN instruction illegally appears in parallel with HWALDAB instruction
CODEGEN-11515	Compiler hangs on compile-time access of non-zero indices

16 Known defects

The up-to-date known defects in v4.1.0 can be found here (dynamically generated):

Known defects in v4.1.0

End Of File