Readme for C7000 Code Generation Tools v3.1.0.LTS

0 Introduction to the C7000 Code Generation Tools v3.1.x LTS

This C7000 compiler release is a “Long-Term Support” (LTS) release.

This release supports the C7100, C7120, and C7504 ISA cores. Select the target core with the -mv option or, equivalently, the --silicon_version option:

 -mv7100   --silicon_version=7100   Compile for the C7100 core
 -mv7120   --silicon_version=7120   Compile for the C7120 core
 -mv7504   --silicon_version=7504   Compile for the C7504 core

For definitions and explanations of STS, LTS, and the version numbering scheme, please see SDTO Compiler Version Numbers.

1 Documentation

The following documents provide information on how to use, program, and migrate to the C7000 CPU. (As of v2.0.0, these documents are no longer included with the compiler tools installer. They can always be found on the TI website.)

SPRUIG8***.PDF: C7000 C/C++ Optimizing Compiler User’s Guide

SPRUIV4***.PDF: C7000 Optimization Guide

SPRUIG4***.PDF: C7000 Embedded Application Binary Interface (EABI) Reference Guide

SPRUIG5***.PDF: C6000-to-C7000 Migration User’s Guide

SPRUIG3***.PDF: VCOP Kernel-C to C7000 Migration Tool User’s Guide

SPRUIG6***.PDF: C7000 Host Emulation User’s Guide

2 TI E2E Community - Where to get help

Post compiler-related questions to the TI E2E design community forum and select the TI device being used.

If submitting a defect report, please attach a scaled-down test case with command-line options and the compiler version number to allow us to reproduce the issue easily.

The following is the top-level webpage for all of TI’s Code Generation Tools.

3 Defect Tracking Database

Compiler defect reports can be tracked at the Development Tools bug database, SIR. SIR is a JIRA-based view into all public tools defects.

An account is required to access this page. To find an issue in SIR, enter your defect ID in the top right search box once logged in. Alternatively, from the top red navigation bar, select “Issues” then “Search for Issues”.

4 Host Emulation Support and Breaking Changes

To improve the stability of host emulation and to improve the consistency of the implementation compared to the C7000 C/C++ compiler (cl7x), some API and syntax changes have been made in the 3.0.0 compiler. In some cases, these changes break compatibility with previous compilers.

4.1 Signed 8-bit Types (Compatibility Break)

In cl7x, the elements of a char vector are explicitly signed. However, the “char” type is not explicitly required to be signed or unsigned, and in certain cases “signed char” is incompatible with “char”. This difference is why SPRUIG6 revision H “C7000 Host Emulation User’s Guide” recommends the use of standard integer types such as “int8_t” in Section 3.2. Edge case incompatibilities between “signed char” and “char” are also why SPRUIG6 revision H contains workaround notes for intrinsics that accept “char *” in Section 4.2.1.

To resolve this inconsistency and to ensure code compiles correctly for both cl7x and host emulation without workarounds, intrinsics that previously accepted “char”, “char *”, or similar now accept “signed char”, “signed char *”, or similar. In the following cases, this will result in an incompatibility between the 3.0.0 compiler and previous versions:

Other intrinsics not listed above, including vector versions of those intrinsics, should not result in incompatibility with previous compiler versions.

4.2 Function Style Vector Swizzles (Compatibility Break)

Previously, in cl7x and host emulation, vector swizzles always used a data member style syntax, for example “vec.s0” or “vec.lo”. Host emulation now requires a function style syntax, for example “vec.s0()” or “vec.lo()”. cl7x accepts either syntax.

int4 x = int4(0, 1, 2, 3);
int4 y = int4(4, 5, 6, 7);
x.lo() = y.lo(); /* Function style syntax, legal in cl7x and host emulation. */
x.lo = y.lo; /* Data member style syntax, legal only in cl7x. */

4.3 Vector Subscript Syntax (Enhancement)

Previously, in cl7x, vector subscript access using .s[n] restricted n to an integer literal. However, host emulation accepts .s[n] without additional restrictions due to being implemented as an array. To address the occasional need to loop over vector elements and to align cl7x more closely with host emulation, the following are now allowed:

The following are examples of cases that are now allowed:

int16 x;
int32_t i = 0;
x.s[i] = 0; /* Now legal. */
int32_t *x_ptr_1 = &x.s[0]; /* Now legal. */
int32_t *x_ptr_2 = &x.s[i]; /* Now legal. */
int32_t &x_ref_1 = x.s[0]; /* Now legal. */
int32_t &x_ref_2 = x.s[i]; /* Now legal. */

The following examples are still illegal in cl7x; host emulation still does not enforce this restriction:

int16 x;
int32_t i = 0;
x.even().s[i] = 0; /* Still illegal. */
int32_t *x_ptr_1 = &x.even().s[0]; /* Still illegal. */
int32_t *x_ptr_2 = &x.odd().s[i]; /* Still illegal. */
int32_t &x_ref_1 = x.lo().s[0]; /* Still illegal. */
int32_t &x_ref_2 = x.hi().s[i]; /* Still illegal. */

4.4 Vector Constructor Style Initialization (Enhancement)

Previously, in cl7x, vectors needed to be initialized with a “cast” style syntax. In host emulation, vectors needed to be initialized with a constructor syntax. This resulted in code similar to the following:

#if defined(__C7X_HOSTEM__)
/* Host emulation syntax */
int4 x = int4(0, 1, 2, 3); /* Illegal in cl7x */
#else
/* cl7x syntax */
int4 x = (int4)(0, 1, 2, 3); /* HE has results equivalent to (int4)(3) */
#endif

To address the need for consistent initialization styles and to align cl7x more closely with host emulation, “constructor” style initializations are now allowed in cl7x. The following examples are now accepted in both cl7x and host emulation, with equivalent behavior:

int4 x = int4(0, 1, 2, 3); /* Legal in cl7x and host emulation. */
int4 y = int4(0); /* Legal in cl7x and host emulation. */
int4 z = int4(int2(0, 1), int2(2, 3)); /* Legal in cl7x and host emulation. */

4.5 Const and Pointer-To Vector Typedefs (Enhancement, Deprecation)

The compiler and host emulation provide const and pointer-to vector typedefs. These were previously needed for compatibility with host emulation. However, these are no longer required for use with host emulation.

int4 x = int4(0, 1, 2, 3);
const int4 y = int4(0, 1, 2, 3); /* Now supported in host emulation. */
int4 *x_ptr = &x; /* Now supported in host emulation. */
const int4 *y_ptr = &y; /* Now supported in host emulation. */

As such, the const and pointer-to vector typedefs provided by host emulation and the compiler are deprecated. “const_int16”, “int16_ptr”, “const_int16_ptr”, and similar typedefs for other vector types may be removed in a future release.

5 Boolean Vector Types

Previously, boolean vectors could only be used as a high-level abstraction for vector predicates on intrinsics. Boolean vectors were not supported as a vector type in the C7000 compiler and host emulation.

Boolean vectors are now a vector data type on C7000 and host emulation. Vector data types are described in the section titled “Vector Data Types” in C7000 C/C++ Optimizing Compiler User’s Guide (Rev. F). The boolean vector type can hold a maximum of 64 elements.

5.1 Supported Syntax and Functions for Boolean Vectors

Boolean vectors can be initialized and accessed like other vector types. More detailed information on vector operations and functions can be found in the section titled “Operations and Functions for Vector Data Types” in C7000 C/C++ Optimizing Compiler User’s Guide (Rev. F).

The following example shows a subset of the supported syntax for initializing and accessing a boolean vector.

bool4 x = bool4(0,1,0,1); /* Initializing with constants. */ 
bool4 y = bool4(0); 

bool8 z = bool8(x,y); /* Initializing with vectors. */

bool a = z.s[0]; /* Accessors */ 
bool4 b = z.even();
bool2 c = b.lo(); 

Boolean vectors can be converted to and from other vector types. More information on conversion functions for vectors can be found in the section titled “Conversion Functions for Vectors” in C7000 C/C++ Optimizing Compiler User’s Guide (Rev. F). The following example shows how non-zero values result in a 1 (true) when converted to a boolean vector.

int4 x = int4(0,9,-10,1); 
bool4 y = convert_bool4(x); // (0,1,1,1) 
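
The conversion rule can also be modelled in portable C++ for quick experiments on a host machine. The sketch below is not the TI implementation: convert_to_bool is a hypothetical helper, and std::array stands in for the int4 and bool4 vector types. It only illustrates that an element converts to true exactly when it is non-zero.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side model of the element-wise rule (not a TI API):
// an element converts to true exactly when it is non-zero.
template <std::size_t N>
std::array<bool, N> convert_to_bool(const std::array<int, N> &v)
{
    std::array<bool, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = (v[i] != 0);
    return out;
}
```

Calling convert_to_bool on the values (0, 9, -10, 1) gives (false, true, true, true), matching the comment in the example above.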

Boolean vectors can be reinterpreted to and from other vector types. More information on re-interpretation functions for vectors can be found in the section titled “Re-Interpretation Functions for Vectors” in C7000 C/C++ Optimizing Compiler User’s Guide (Rev. F). The following example shows that a reinterpretation to a boolean vector is undefined unless every resulting element contains exactly 0x0 or 0x1.

ushort2 myshort2_0 = (ushort2)(0,1);
bool4 mybool4_0 = as_bool4(myshort2_0); // Defined

ushort2 myshort2_1 = (ushort2)(2,3);
bool4 mybool4_1 = as_bool4(myshort2_1); // Undefined

bool8 mybool8_0 = (bool8)(0,1,0,1,0,1,0,1);
float2 myfloat2_0 = as_float2(mybool8_0); // Defined

float2 myfloat2_1 = (float2)(1.0,2.0);
bool8 mybool8_1 = as_bool8(myfloat2_1); // Undefined
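
A host-side check for this rule might look like the sketch below. It is an illustration only: reinterpret_as_bool_is_defined is a hypothetical helper, not part of the compiler or host emulation, and it assumes that each boolean element occupies one byte, which is consistent with the sizes used above (ushort2 to bool4, float2 to bool8). It also assumes IEEE-754 floats when applied to float data.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Hypothetical host-side check (not a TI API). Assumes one byte per boolean
// element. Returns true when every byte of the source object is exactly
// 0x0 or 0x1, i.e. when the reinterpretation would be well defined.
template <typename T, std::size_t N>
bool reinterpret_as_bool_is_defined(const std::array<T, N> &src)
{
    static_assert(N > 0, "source vector must have at least one element");
    unsigned char bytes[sizeof(T) * N];
    std::memcpy(bytes, src.data(), sizeof(bytes));
    for (unsigned char b : bytes)
        if (b > 1)
            return false;
    return true;
}
```

With this model, the ushort values (0, 1) pass the check, while (2, 3) and the float values (1.0, 2.0) do not, which mirrors the Defined and Undefined comments above.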

Unlike other integral vector types, boolean vectors cannot be used as the condition of the vector ternary operator. More information on the vector ternary operator can be found in the section titled “Ternary Operators for Vectors (?:)” in C7000 C/C++ Optimizing Compiler User’s Guide (Rev. F). Standard boolean operations such as &&, ||, &, &=, |, |=, ^, ^=, !, ~, ==, !=, <=, <, >=, and > are not currently supported on the boolean vector type.

5.2 Boolean Vector vs. Vector Predicate Type

Boolean vectors may be used as an abstract alternative to the low-level vector predicate type on most predicated intrinsics on the C7000.

The use of boolean vectors as vector predicates is encouraged. However, the boolean vector type is not fully interchangeable with the low-level vector predicate type. More information on the low-level vector predicate type can be found in the section titled “Vector Predicate Type” in C7000 C/C++ Optimizing Compiler User’s Guide (Rev. F).

The following example shows a difference in the capability of vector predicates and boolean vectors. Boolean vectors predicate input data by lane regardless of the element type.

When using a vector predicate type, it is the responsibility of the user to insert proper scaling on predicates. Vector predicates can be scaled up or down by a factor k={0-63} through the intrinsics “__expand_vpred(__vpred, k)” and “__pack_vpred(__vpred, k)”.

The two functions in the example below achieve the same result, one with a boolean vector and the other with a low-level vector predicate type.

// Boolean vector example
void foo(int4 *ptr, int4 data, char4 *ptr2, char4 data2)
{
    bool4 pred = bool4(0,1,1,0);
    __vstore_pred(pred, ptr, data); // Word-based store
    __vstore_pred(pred, ptr2, data2); // Byte-based store
}

// Vector predicate example
void bar(int4 *ptr, int4 data, char4 *ptr2, char4 data2)
{
    __vpred pred = _mvrp(0x0000000000000ff0); // Word-scaled predicate
    __vstore_pred(pred, ptr, data);

    pred = __pack_vpred(pred, 2); // Byte-scaled predicate
    __vstore_pred(pred, ptr2, data2);
}
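
To see why the same lane pattern needs different low-level predicate values for int4 and char4 data, the following portable C++ sketch may help. It is a host-side model under a stated assumption, not a TI API: it assumes the low-level predicate carries one bit per byte of vector data, as the word-scaled value 0x0ff0 in the example above suggests. The function name lanes_to_byte_predicate and its parameters are hypothetical.

```cpp
#include <cstdint>

// Hypothetical host-side model (not a TI API). Assumes the low-level
// predicate holds one bit per byte of vector data. Lane i of the boolean
// vector is bit i of lane_mask. Requires lane_count * bytes_per_lane <= 64.
std::uint64_t lanes_to_byte_predicate(std::uint64_t lane_mask,
                                      unsigned lane_count,
                                      unsigned bytes_per_lane)
{
    std::uint64_t pred = 0;
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        if ((lane_mask >> lane) & 1u) {
            // Turn on one predicate bit for every byte of an active lane.
            for (unsigned b = 0; b < bytes_per_lane; ++b)
                pred |= std::uint64_t{1} << (lane * bytes_per_lane + b);
        }
    }
    return pred;
}
```

Under this model, the lane pattern (0,1,1,0) maps to 0x0ff0 for 4-byte int lanes and to 0x6 for 1-byte char lanes. A boolean vector hides this difference, whereas a __vpred value has to be rescaled by the user.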

6 A Note on Intrinsics and Header Files

Supported Intrinsics

The included top-level header files “c7x.h” and “c6x_migration.h” list the supported intrinsics for C7x and C6x, respectively. Note that you must include these header files in your source in order to use many of the C7x intrinsics and all of the legacy C6x intrinsics. “c7x.h” includes other useful header files that document the supported intrinsics:

7 C7x scalable vector programming

A set of utilities is provided in the compiler library for writing vector-width-independent code for C7000. To make use of these utilities, include c7x_scalable.h in source code.

In v3.0.x of the C7000 compiler, the MMA portion of these scalable vector programming utilities was under development and was therefore not accessible by default unless the __C7X_UNSTABLE_API macro was defined. As of this compiler release (v3.1.x.LTS), defining __C7X_UNSTABLE_API is no longer required in order to use these macros.

These utilities are available in C++ mode only due to use of C++ language features in their implementation.

These utilities are available when using the TI C7000 compiler or when compiling with TI C7000 Host Emulation.

The following APIs are available, all of which are described in further detail in c7x_scalable.h:

The following macros are available for programming the MMA, all of which are described in c7x_mma.h:

As a moderate complexity example, the following is an implementation of a memcpy that is templated on the input type and uses a streaming engine and a streaming address generator:

#include <c7x_scalable.h>

using namespace c7x;

/*
 * memcpy_scalable_strm<typename S>(const S*in, S *out, int len)
 *
 * S - A basic data type such as short or float.
 * in - The input buffer.
 * out - The output buffer.
 * len - The number of elements to copy.
 *
 * Defaulted template arguments:
 * V - A full vector type of S
 */
template<typename S,
         typename V  = typename make_full_vector<S>::type>
void memcpy_scalable_strm(const S *restrict in, S *restrict out, int len)
{
    /*
     * Find the maximum number of vector loads/stores needed to copy the buffer,
     * including any remainder.
     */
    int cnt = len / element_count_of<V>::value;
    cnt += (len % element_count_of<V>::value > 0);

    /*
     * Initialize the SE for a linear read in and the SA for a linear write
     * out.
     */
    __SE_TEMPLATE_v1 in_tmplt = __gen_SE_TEMPLATE_v1();
    __SA_TEMPLATE_v1 out_tmplt = __gen_SA_TEMPLATE_v1();

    in_tmplt.VECLEN = se_veclen<V>::value;
    in_tmplt.ELETYPE = se_eletype<V>::value;
    in_tmplt.ICNT0 = len;

    out_tmplt.VECLEN = sa_veclen<V>::value;
    out_tmplt.ICNT0 = len;

    __SE0_OPEN(in, in_tmplt);
    __SA0_OPEN(out_tmplt); /* The SA must also be opened before use. */

    /*
     * Perform the copy. If there is remainder, the last store will be
     * predicated.
     */
    int i;
    for (i = 0; i < cnt; i++)
    {
        V tmp = strm_eng<0, V>::get_adv();
        __vpred pred = strm_agen<0, V>::get_vpred();
        V *addr = strm_agen<0, V>::get_adv(out);
        __vstore_pred(pred, addr, tmp);
    }

    /* Release the streams once the copy is complete. */
    __SE0_CLOSE();
    __SA0_CLOSE();
}
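
The loop bookkeeping in this example can be checked on a host machine with plain C++. The sketch below is not part of c7x_scalable.h; CopyPlan and plan_copy are hypothetical names, and the lanes parameter stands in for element_count_of<V>::value. It assumes len >= 0 and lanes > 0, and it assumes that the predicated final store writes only the remaining len % lanes elements.

```cpp
// Hypothetical host-side model of the loop bookkeeping (not a TI API).
struct CopyPlan {
    int iterations;  // number of vector loads/stores issued
    int last_lanes;  // elements written by the final store (0 when len == 0)
};

// Assumes len >= 0 and lanes > 0.
CopyPlan plan_copy(int len, int lanes)
{
    CopyPlan plan;
    // Same computation as cnt in the example: full vectors plus one
    // extra iteration when there is a remainder.
    plan.iterations = len / lanes;
    plan.iterations += (len % lanes > 0);

    if (len == 0)
        plan.last_lanes = 0;
    else if (len % lanes > 0)
        plan.last_lanes = len % lanes;  // partial, predicated store
    else
        plan.last_lanes = lanes;        // final store is a full vector
    return plan;
}
```

For example, copying 100 elements with 16 lanes per vector takes 7 iterations, and the last store covers 4 elements; copying 64 elements takes 4 iterations with a full final store.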


8 New --mma_version compiler option

There is a new command-line option, --mma_version, as of C7000 C/C++ Compiler v2.0. This option tells the compiler which version of the Matrix Multiply Accelerator (MMA) the compiler should compile for. It also causes the compiler to set certain predefined macros which turn on the appropriate MMA API configuration structures and enumeration values in include/c7x_mma.h.

 --mma_version=1       Enables use of MMA version 1     (C7100)
 --mma_version=2       Enables use of MMA version 2     (C7120)
 --mma_version=2_256   Enables use of MMA version 2_256 (C7504)
 --mma_version=NONE    Disables use of the MMA

The compiler will place an appropriate MMA version build attribute in the object files that are generated. If the MMA is not used, an MMA version build attribute will be placed in the object file that indicates that the MMA is not used. MMA version build attributes ensure that linking of object files with incompatible versions of the MMA is disallowed. For more details, please see the C7000 Embedded Application Binary Interface (EABI) Reference Guide.

9 SE/SA/MMA Interface Changes

Beginning in the C7000 v2.0 Compiler, some reserved fields in the __SA_TEMPLATE_v1 configuration structure and some reserved fields in the MMA __HWA_CONFIG_REG_v1 configuration structure were renamed or split and renamed. This has been done in order to use those reserved fields for added functionality that has been implemented in the C7120 ISA or in the MMA v2 hardware. This means that any use of those reserved fields in code that was compiled with the 1.4.x compiler must either be replaced by the new struct member names or replaced with a function call that sets default values as described below. The latter approach is the one we recommend.

In the future, as we’ve done with the v2.x compiler tools, existing reserved fields in the SE/SA/MMA configuration structures may be used for additional features on future devices. Therefore, in future releases of the C7000 compiler, we may again (1) change the name of a reserved field to support new features, (2) split the reserved field into two or more fields, or both (1) and (2).

A consequence of this is that directly using named reserved fields may not work with a future version of the C7000 Compiler. Therefore, it is recommended to set reserved fields with the __gen_SA_TEMPLATE_v1(), __gen_MMA_TEMPLATE_v1(), and similar functions, which set up defaults for the given configuration structures for SE/SA/MMA. See include/c7x_strm.h and include/c7x_mma.h for details on the functions that set up safe default values for these configuration structs.

Examples of direct use of named reserved fields, which is not recommended:

     sa_params.reserved2 = 0;       // named struct field
     { . . ., .reserved2 = 0, . . } // named struct field in named struct instantiation

Also note that “ordered struct instantiation” (where struct member fields are not named) may also break if a reserved field has its type changed (e.g. int64_t bitfield to an enum type).

Recommended approach:

     // Sets defaults including zeroing-out reserved fields:
     __SA_TEMPLATE_v1 sa0_config = __gen_SA_TEMPLATE_v1();
     // Now set up necessary fields
     sa0_config.ICNT0 = 32;
     // Continue setup not using reserved fields

10 Streaming Address Generator supports predicated loads on C7120

On the C7120 ISA variant, implicit predication occurs on loads that use streaming address generator (SA) operands. If an SA may be used as an operand to a load and that SA may generate predicates with one or more predicate bits off, then a predicated load must be used to avoid unexpected behavior. Use the following idioms with implicitly predicated SA loads:

Well-defined behavior with normal predicated loads:

__vpred vp = __SA0_VPRED(int16);
int16 *ptr = __SA0ADV(int16, baseptr);
int16 x = __vload_pred(vp, ptr); // Normal load with explicit predication

In addition, specialized loads predicated with an SA predicate can be generated with the following idiom, which has well-defined behavior:

__vpred vp = __SA0_VPRED(uchar32);
uchar32 *ptr = __SA0ADV(uchar32, baseptr);
ushort32 x = __vload_pred_unpack_short(vp, ptr); // Specialized load with explicit predication

(Note that vector load intrinsics that have boolean vector arguments are also available.)

The compiler may optimize the above sequences to take advantage of the C7120 ISA’s implicit predication feature.

If implicit predication is not available (C7100), the idiom is malformed, or the compiler fails to optimize the idiom, an equivalent series of instructions will be generated instead to perform the load and then predicate the result.

After configuring an SA for predication, beware that some C/C++ idioms have unspecified behavior:

ushort32 x = __vload_unpack_short(__SA0ADV(uchar32, baseptr)); // May be predicated, or not!

int16 *ptr = __SA0ADV(int16, baseptr);
int16 x = *ptr; // May be predicated, or not!

Please see the section titled “Using the Streaming Address Generator” in the C7000 C/C++ Optimizing Compiler User’s Guide for more information.

11 Compiler does not enforce rate-limit of MMA bias, scale, and shift register loading

This section describes an issue the user may encounter when compiling code that utilizes the Matrix Multiply Accelerator (MMA) and the __HWA_LOAD_2REG intrinsic.

11.1 Description of hardware behavior

The Matrix Multiply Accelerator (MMA) paired with the C7120 or C7504 CPUs allows the user to send values into bias, scale, and shift registers within the MMA that affect the operation of the MMA.

The MMA will issue a hardware exception when more than one load of each of a bias, scale, or shift register pair is issued in a 24-cycle period.

A programmer who wants to load a value into the bias, scale, or shift registers will use the __HWA_LOAD_2REG intrinsic in C/C++ code. The use of this intrinsic results in an HWAOPEN instruction with a special immediate operand (0x8, 0x9, 0xa, or 0xb) in the compiler-generated assembly.

The C7000 compiler does not ensure that any two loads to the same MMA register pair do not execute within 24 cycles. Therefore, if the source code has two loads to the same MMA register pair, the compiler may produce code that results in the exception described above. This could also occur if a single load to an MMA register appears in a loop.

This issue is tracked in SIR: EXT_EP-10662

There are no plans to address this issue in the compiler.

The MMALIB software package that is delivered with the PSDK is tested to ensure that this condition does not occur.

11.2 Potential workaround

The programmer can ensure that 24 cycles elapse between two loads to the same MMA register pair by placing the following C code between loads of the same MMA bias/scale/shift register pair:

__asm(" NOP 0x8 ; rate-limit MMA load bias/scale/shift pairs (8) ");
__asm(" NOP 0x8 ; rate-limit MMA load bias/scale/shift pairs (16)");
__asm(" NOP 0x8 ; rate-limit MMA load bias/scale/shift pairs (24)");

This technique may have undesirable performance effects.

The user is responsible for ensuring that any two loads to the same MMA register pair do not execute within 24 cycles, otherwise the MMA will cause the C7x CPU to throw an exception.

12 Removal of MISRA 2004 compiler command-line options

The C7000 C/C++ Compiler does not support MISRA 2004 checking as some other Texas Instruments compilers do. Therefore, the command-line options for MISRA 2004 checking have been removed and are no longer accepted by the compiler.

13 Silicon errata i2117 workaround support

The compiler option --silicon_errata_i2117 generates code that automatically works around silicon errata i2117 on devices with the C7100 CPU core. MMA performance may be negatively impacted by the use of this option in edge cases.

14 Silicon errata i2376 workaround support

The compiler option --silicon_errata_i2376 generates code that automatically works around silicon errata i2376 on devices with the C7504 CPU core. Performance should not be significantly affected by this workaround. When the -mv7504 or --silicon_version=7504 compiler option is specified, the --silicon_errata_i2376 option is turned on automatically. To turn off the workaround, use --silicon_errata_i2376=off. Turning off the workaround is not recommended; it is intended only for advanced users in specific situations.

15 Link-Time Optimization not supported between targets

A clarification on Link-Time Optimization use:

When using Link-Time Optimization, use only source and object files compiled with the same --silicon_version and --mma_version options. Link-Time Optimization is not supported between source and/or object files compiled with different --silicon_version or --mma_version options. In such cases, the compilation may fail.

For more information on Link-Time Optimization, see the C7000 C/C++ Optimizing Compiler User’s Guide.

16 Resolved defects

Resolved defects in v3.1.0:

ID             Summary
CODEGEN-10525  Compiler fails with INTERNAL ERROR: Decomposition error
CODEGEN-10334  Conditions on certain instructions may be dropped by compiler
CODEGEN-10294  Compiler creates illegal instruction with non-vector SE access and operation
CODEGEN-10249  HE lost implicit conversions on vector with scalar operators
CODEGEN-8177   Change documentation of --symbol_map
CODEGEN-6995   Using __builtin_expect gives internal error
CODEGEN-2534   Compiler terminates abnormally with incomplete complex initialization

17 Known defects

The up-to-date known defects in v3.1.0 can be found here (dynamically generated):

Known defects in v3.1.0
