Readme for C7000 Code Generation Tools v2.1.1.LTS

0 Introduction to the C7000 Code Generation Tools v2.1.1 LTS
1 Documentation
2 TI E2E Community - Where to get help
3 Defect Tracking Database
4 New --mma_version compiler option
5 EABI change between v1.4 and v2.0
6 SE/SA/MMA Interface Changes
7 Streaming Address Generator supports predicated loads on C7120
8 Link-Time Optimization not supported between targets
9 Notes on Host Emulation Support
- 9.1 Host Emulation is experimental
- 9.2 Additional Host Emulation Pointer Operations Supported
10 A Note on Intrinsics and Header Files
- Supported Intrinsics
11 Compiler does not enforce rate-limit of MMA bias, scale, and shift register loading
- 11.1 Description of hardware behavior
- 11.2 Potential workaround
12 Removal of MISRA 2004 compiler command-line options
13 Silicon errata i2117 workaround support
14 C7x scalable vector programming
15 Resolved defects
16 Known defects

0 Introduction to the C7000 Code Generation Tools v2.1.1 LTS

This C7000 compiler release is a “Long-Term Support” (LTS) release.

This release supports the C7100 and C7120 ISA cores. To compile code for the C7100 core, use the compiler command-line option -mv7100 or equivalently, --silicon_version=7100. To compile code for the C7120 core, use the compiler command-line option -mv7120 or equivalently, --silicon_version=7120.

For definitions and explanations of STS, LTS, and the version numbering scheme, please see SDTO Compiler Version Numbers.

1 Documentation

The following documents provide information on how to use, program, and migrate to the C7000 CPU. (As of v2.0.0, these documents are no longer included with the compiler tools installer. They can always be found on the TI website.)

SPRUIG8***.PDF: C7000 C/C++ Optimizing Compiler User’s Guide

SPRUIV4***.PDF: C7000 Optimization Guide

SPRUIG4***.PDF: C7000 Embedded Application Binary Interface (EABI) Reference Guide

SPRUIG5***.PDF: C6000-to-C7000 Migration User’s Guide

SPRUIG3***.PDF: VCOP Kernel-C to C7000 Migration Tool User’s Guide

SPRUIG6***.PDF: C7000 Host Emulation User’s Guide (NOTE: Host Emulation is an experimental feature)

2 TI E2E Community - Where to get help

Post compiler-related questions to the TI E2E design community forum and select the TI device being used.

The E2E Design Support Forum Website

If submitting a defect report, please attach a scaled-down test case with command-line options and the compiler version number to allow us to reproduce the issue easily.

The following is the top-level webpage for all of TI’s Code Generation Tools.

Code Generation Tools Landing Page

3 Defect Tracking Database

Compiler defect reports can be tracked at the Development Tools bug database, SIR. SIR is a JIRA-based view into all public tools defects.

SIR Development Tools Defect Tracking Website

A my.ti.com account is required to access this page. To find an issue in SIR, enter your defect ID in the top-right search box once logged in. Alternatively, from the top red navigation bar, select “Issues” then “Search for Issues”.

4 New --mma_version compiler option

There is a new command-line option, --mma_version, as of C7000 C/C++ Compiler v2.0. This option tells the compiler which version of the Matrix Multiply Accelerator (MMA) to compile for. It also causes the compiler to set certain predefined macros which turn on the appropriate MMA API configuration structures and enumeration values in include/c7x_mma.h.

 --mma_version=1       Enables use of MMA version 1 (C7100)
 --mma_version=2       Enables use of MMA version 2 (C7120)
 --mma_version=NONE    Disables use of the MMA

The compiler will place an appropriate MMA version build attribute in the object files that are generated. If the MMA is not used, an MMA version build attribute will be placed in the object file that indicates that the MMA is not used. MMA version build attributes ensure that linking of object files with incompatible versions of the MMA is disallowed. For more details, please see the C7000 Embedded Application Binary Interface (EABI) Reference Guide.

5 EABI change between v1.4 and v2.0

The C7000 Compiler v2.0 STS release implements a change in the way boolean vectors (e.g. bool16) are represented. Therefore, object code compiled with the v1.4 and earlier compilers from C/C++ source code that uses boolean vectors will not be compatible with object code compiled with the v2.0 and later compiler from C/C++ source code that uses boolean vectors. Unexpected results and program execution crashes could occur if this takes place. If boolean vectors are not used, there is no incompatibility.

To prevent any issues, the user should ensure that all code to be executed on a C7000 CPU is compiled with a 2.0.x or 2.1.x version of the C7000 compiler if any boolean vectors are used in the source code.

Note that there is no representation change for the __vpred type in the v2.0 compiler and therefore the above statements do not apply for the __vpred type.

6 SE/SA/MMA Interface Changes

Beginning in the C7000 v2.0 Compiler, some reserved fields in the __SA_TEMPLATE_v1 configuration structure and some reserved fields in the MMA __HWA_CONFIG_REG_v1 configuration structure were renamed or split and renamed. This has been done in order to use those reserved fields for added functionality that has been implemented in the C7120 ISA or in the MMA v2 hardware. This means that any use of those reserved fields in code that was compiled with the 1.4.x compiler must either be replaced by the new struct member names or replaced with a function call that sets default values as described below. The latter approach is the one we recommend.

In the future, as we’ve done with the v2.x compiler tools, existing reserved fields in the SE/SA/MMA configuration structures may be used for additional features on future devices. Therefore, in future releases of the C7000 compiler, we may again (1) change the name of a reserved field to support new features, (2) split the reserved field into two or more fields, or both (1) and (2).

A consequence of this is that directly using named reserved fields may not work with a future version of the C7000 Compiler. Therefore, it is recommended to set reserved fields with the __gen_SA_TEMPLATE_v1(), __gen_MMA_TEMPLATE_v1(), and similar functions, which set up defaults for the given SE/SA/MMA configuration structures. See include/c7x_strm.h and include/c7x_mma.h for details on the functions that set up safe default values for these configuration structs.

Examples of direct use of a named reserved field (not recommended):

     sa_params.reserved2 = 0;       // named struct field
     { . . ., .reserved2 = 0, . . } // named struct field in named struct instantiation

Also note that “ordered struct instantiation” (where struct member fields are not named) may also break if a reserved field has its type changed (e.g. int64_t bitfield to an enum type).

Recommended approach:

     // Sets defaults including zeroing-out reserved fields:
     __SA_TEMPLATE_v1 sa0_config = __gen_SA_TEMPLATE_v1();
     // Now setup necessary fields
     sa0_config.ICNT0 = 32;
     // Continue setup not using reserved fields
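
The robustness of this pattern can be illustrated with a small host-side model. This is only a sketch: sa_template_model, its fields, gen_sa_template_model(), and make_linear_config() are hypothetical stand-ins, not the real __SA_TEMPLATE_v1 layout or the API in c7x_strm.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a configuration struct; NOT the real
// __SA_TEMPLATE_v1 layout from c7x_strm.h.
struct sa_template_model {
    std::uint32_t ICNT0;      // innermost iteration count
    std::uint32_t VECLEN;     // vector length selector
    std::uint32_t reserved1;  // could be renamed or split in a later release
    std::uint32_t reserved2;  // could be renamed or split in a later release
};

// Hypothetical analogue of __gen_SA_TEMPLATE_v1(): returns a struct with
// every field, reserved ones included, set to a default of zero.
inline sa_template_model gen_sa_template_model()
{
    sa_template_model t = {};  // value-initialization zeroes all members
    return t;
}

// Caller code names only the documented fields, so it keeps compiling if
// the reserved members are later renamed, split, or retyped.
inline sa_template_model make_linear_config(std::uint32_t icnt0)
{
    assert(icnt0 > 0);
    sa_template_model cfg = gen_sa_template_model();
    cfg.ICNT0 = icnt0;
    return cfg;
}
```

Because make_linear_config() never mentions reserved1 or reserved2, a change to those members in this model would require editing only the generator function, not every call site.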

In addition, as of version 1.4.0, the streaming address generator (SA) API has been modified to give greater compatibility with typedefs and const pointers. For example, __SA0ADV(int2_typedef, ptr) was disallowed, but is now legal. Similarly, __SA0ADV(const_int2, ptr) is also now legal and will return a const pointer. For more information, consult the descriptions in “c7x_strm.h”.

7 Streaming Address Generator supports predicated loads on C7120

On the C7120 ISA variant, implicit predication occurs on loads that use streaming address generator (SA) operands. If an SA may be used as an operand to a load and that SA may generate predicates with one or more predicate bits off, then a predicated load must be used to avoid unexpected behavior. Use the following idioms with implicitly predicated SA loads:

Well-defined behavior with normal predicated loads:

__vpred vp = __SA0_VPRED(int16);
int16 *ptr = __SA0ADV(int16, baseptr);
int16 x = __vload_pred(vp, ptr); // Normal load with explicit predication

In addition, specialized loads predicated with an SA predicate can be generated with the following idiom, which has well-defined behavior:

__vpred vp = __SA0_VPRED(uchar32);
uchar32 *ptr = __SA0ADV(uchar32, baseptr);
ushort32 x = __vload_pred_unpack_short(vp, ptr); // Specialized load with explicit predication

(Note that vector load intrinsics that have boolean vector arguments are also available.)

The compiler may optimize the above sequences to take advantage of the C7120 ISA’s implicit predication feature.

If implicit predication is not available (C7100), or the idiom is malformed, or the compiler fails to optimize the idiom, an equivalent series of instructions will be generated instead to perform the load and then predicate the result.
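
To make the effect of a predicated load concrete, here is a plain C++ host model. It is a sketch under stated assumptions, not the C7x implementation: load_pred_model and kLanes are hypothetical names, one predicate bit per lane is assumed, and lanes whose predicate bit is off are assumed to read as zero. The property it illustrates is that memory behind a predicated-off lane is never accessed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;  // hypothetical lane count

// Host-side model of a predicated vector load. Bit i of `pred` enables
// lane i. Disabled lanes are never read from memory and are assumed
// (for this model only) to produce zero.
inline std::array<std::int32_t, kLanes>
load_pred_model(std::uint32_t pred, const std::int32_t *src)
{
    assert(src != nullptr);
    std::array<std::int32_t, kLanes> out{};  // all lanes start at zero
    for (std::size_t i = 0; i < kLanes; ++i) {
        if ((pred >> i) & 1u) {
            out[i] = src[i];  // only enabled lanes touch memory
        }
    }
    return out;
}
```

In this model, a predicate of 0x7 applied to a three-element buffer reads exactly three elements, which mirrors why a predicated load is needed when an SA may turn predicate bits off near the end of a buffer.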

After configuring an SA for predication, beware that some C/C++ idioms have unspecified behavior:

ushort32 x = __vload_unpack_short(__SA0ADV(uchar32, baseptr)); // May be predicated, or not!

int16 *ptr = __SA0ADV(int16, baseptr);
int16 x = *ptr; // May be predicated, or not!

Please see the section titled “Using the Streaming Address Generator” in the C7000 C/C++ Optimizing Compiler User’s Guide for more information.

8 Link-Time Optimization not supported between targets

A clarification on Link-Time Optimization use:

When using Link-Time Optimization, use only source and object files compiled with the same --silicon_version and --mma_version options. Link-Time Optimization is not supported between source and/or object files compiled with different --silicon_version or --mma_version options; in that case, the compilation may fail.

For more information on Link-Time Optimization, see the C7000 C/C++ Optimizing Compiler User’s Guide.

9 Notes on Host Emulation Support

9.1 Host Emulation is experimental

Host Emulation is an experimental feature and may not work as intended or expected in certain situations. In addition, there may be limitations that are not disclosed in the Host Emulation User’s Guide, SPRUIG6.

9.2 Additional Host Emulation Pointer Operations Supported

The C7000 Host Emulation User’s Guide is being updated to reflect additional supported operations on pointer types when used with Host Emulation.

In addition to those arithmetic operations listed in “Vector and Complex Element Pointer Types”, the minus (“-”) operation should also be listed for pointer types that were created based on a conversion from a scalar pointer to memory.

An additional list will be added to the “Vector and Complex Element Pointer Types” section. This list will contain the pointer comparison operations that are supported:

Equal (“==”) pointer
Not-equal (“!=”) pointer
Less-than (“<”) pointer
Greater-than (“>”) pointer
Less-than-or-equal (“<=”) pointer
Greater-than-or-equal (“>=”) pointer

10 A Note on Intrinsics and Header Files

Supported Intrinsics

The included top-level header files “c7x.h” and “c6x_migration.h” list the supported C7x intrinsics and legacy C6x intrinsics, respectively. Note that you must include these header files in your source in order to use many of the C7x intrinsics and all of the legacy C6x intrinsics. “c7x.h” includes other useful header files that document the supported intrinsics:

c7x_vpred.h: List of intrinsics supporting low-level __vpred vector predicate type.
c7x_direct.h: List of intrinsics that map directly to instructions.
c7x_strm.h: List of intrinsics and flags for C7x Streaming Engine and Stream Address Generator.
c7x_mma.h: List of intrinsics and associated structures and enumerations for the C7x MMA.
c7x_luthist.h: List of intrinsics and flags for C7x Lookup Table / Histogram support.

11 Compiler does not enforce rate-limit of MMA bias, scale, and shift register loading

This section describes an issue that may arise when compiling code that uses the __HWA_LOAD_2REG intrinsic. It applies only to users who program the Matrix Multiply Accelerator (MMA) manually with that intrinsic.

The MMALIB and TIDL software packages that are delivered with the PSDK are tested to ensure that this condition does not occur. Therefore, if the user is using the MMA via routines in the PSDK/TIDL/MMALIB software, the issue described below does not occur.

11.1 Description of hardware behavior

The Matrix Multiply Accelerator (MMA) paired with the C7120 CPU allows the user to send values into bias, scale, and shift registers within the MMA that affect the operation of the MMA.

The MMA will issue a hardware exception when more than one load to the same bias, scale, or shift register pair is issued within a 24-cycle period.

A programmer who wants to load a value into the bias, scale, or shift registers uses the __HWA_LOAD_2REG intrinsic in C/C++ code. Use of this intrinsic results in an HWAOPEN instruction with a special immediate operand (0x8, 0x9, 0xa, or 0xb) in the compiler-generated assembly.

The C7000 compiler does not ensure that any two loads to the same MMA register pair do not execute within 24 cycles. Therefore, if the source code has two loads to the same MMA register pair, the compiler may produce code that results in the exception described above. This could also occur if a single load to an MMA register appears in a loop.
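
A planned sequence of loads can be sanity-checked against this rule with a small host-side model. The sketch below is not part of the compiler or of any TI tool: mma_load_event, find_rate_limit_violation, and the cycle numbers are hypothetical, and it assumes that two loads to the same register pair are safe once their issue cycles are at least 24 cycles apart.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical record of one bias/scale/shift register-pair load.
struct mma_load_event {
    int reg_pair;      // identifier of the register pair being loaded
    long issue_cycle;  // cycle on which the load issues
};

constexpr long kMinCyclesBetweenLoads = 24;

// Returns the index of the first event that issues fewer than 24 cycles
// after the previous load to the same register pair, or -1 if none does.
// Events must be sorted by non-decreasing issue_cycle.
inline long find_rate_limit_violation(const std::vector<mma_load_event> &events)
{
    std::map<int, long> last_issue;  // reg_pair -> most recent issue cycle
    for (std::size_t i = 0; i < events.size(); ++i) {
        const mma_load_event &e = events[i];
        assert(i == 0 || events[i - 1].issue_cycle <= e.issue_cycle);
        auto it = last_issue.find(e.reg_pair);
        if (it != last_issue.end() &&
            e.issue_cycle - it->second < kMinCyclesBetweenLoads) {
            return static_cast<long>(i);
        }
        last_issue[e.reg_pair] = e.issue_cycle;
    }
    return -1;
}
```

A load inside a loop can be modeled by listing one event per iteration. Real issue cycles depend on the compiler's schedule, so a model like this is only as accurate as the cycle numbers fed into it.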

This issue is tracked in SIR: EXT_EP-10662

There are no plans to address this issue in the compiler.

The MMALIB software package that is delivered with the PSDK is tested to ensure that this condition does not occur.

11.2 Potential workaround

When manually writing code for the MMA (i.e., when not using MMALIB/TIDL/PSDK routines that use the MMA), the user is responsible for ensuring that any two loads to the same MMA register pair do not execute within 24 cycles; otherwise, the MMA will cause the C7x CPU to throw an exception.

The programmer can ensure that 24 cycles elapse between two loads to the same MMA register pair by placing the following C code between loads of the same MMA bias/scale/shift register pair:

__asm(" NOP 0x8 ; rate-limit MMA load bias/scale/shift pairs (8) ");
__asm(" NOP 0x8 ; rate-limit MMA load bias/scale/shift pairs (16)");
__asm(" NOP 0x8 ; rate-limit MMA load bias/scale/shift pairs (24)");

This technique may have undesirable performance effects.

12 Removal of MISRA 2004 compiler command-line options

The C7000 C/C++ Compiler does not support MISRA 2004 checking as some other Texas Instruments compilers do. Therefore, the command-line options for MISRA 2004 checking have been removed and are no longer accepted by the compiler.

13 Silicon errata i2117 workaround support

The compiler option --silicon_errata_i2117 generates code that automatically works around silicon errata i2117 on devices with the C7100 CPU core. MMA performance may be negatively impacted by the use of this option in edge cases.

14 C7x scalable vector programming

A set of utilities is provided in the compiler library for writing vector-width-independent code for C7000. These utilities are under development and may change in the future. As such, they are hidden by default until development is completed. To make early use of these utilities, define the macro __C7X_UNSTABLE_API at the command line and include c7x_scalable.h in source code. When these utilities are ready for general use in a future release, they will be available without defining __C7X_UNSTABLE_API.

These utilities are C++ only because they rely on features of the C++ language.

Currently, the following APIs are available, all of which are described in further detail in c7x_scalable.h:

Vector type query and construction
- c7x::max_simd<T>::value
- c7x::element_count_of<T>::value
- c7x::element_type_of<T>::type
- c7x::component_type_of<T>::type
- c7x::make_vector<T,N>::type
- c7x::make_full_vector<T>::type
- c7x::make_pointer<T>::type
- c7x::make_const<T>::type
- c7x::is_target_vector<T>::value
Full vector types
- c7x::char_vec
- c7x::short_vec
- etc
Half vector types
- c7x::char_hvec
- c7x::short_hvec
- etc
Quarter vector types
- c7x::char_qvec
- c7x::short_qvec
- etc
Host emulation compatible types for pointers
- c7x::char_vec_ptr
- c7x::const_short_vec_ptr
- etc
Templated vector reinterprets and conversions
- c7x::reinterpret<T>(v)
- c7x::convert<T>(v)
OpenCL style vector reinterprets and conversions
- c7x::as_char_vec(v)
- c7x::convert_short_vec(v)
- etc
Streaming engine and streaming address generator helpers
- c7x::se_veclen<T>::value
- c7x::se_eletype<T>::value
- c7x::sa_veclen<T>::value
- c7x::strm_eng<I,T>::get()
- c7x::strm_eng<I,T>::get_adv()
- c7x::strm_agen<I,T>::get(p)
- c7x::strm_agen<I,T>::get_adv(p)
- c7x::strm_agen<I,T>::get_vpred()

As a moderately complex example, the following is an implementation of a memcpy that is templated on the input type and leverages both the streaming engine and the streaming address generator:

#include <c7x_scalable.h>

using namespace c7x;

/*
 * memcpy_scalable_strm<typename S>(const S*in, S *out, int len)
 *
 * S - A basic data type such as short or float.
 * in - The input buffer.
 * out - The output buffer.
 * len - The number of elements to copy.
 *
 * Defaulted template arguments:
 * V - A full vector type of S
 * VP - A pointer to type V
 */
template<typename S,
         typename  V  = typename make_full_vector<S>::type,
         typename  VP = typename make_pointer<V>::type>
void memcpy_scalable_strm(const S *restrict in, S *restrict out, int len)
{
    /*
     * Find the maximum number of vector loads/stores needed to copy the buffer,
     * including any remainder.
     */
    int cnt = len / element_count_of<V>::value;
    cnt += (len % element_count_of<V>::value > 0);

    /*
     * Initialize the SE for a linear read in and the SA for a linear write
     * out.
     */
    __SE_TEMPLATE_v1 in_tmplt = __gen_SE_TEMPLATE_v1();
    __SA_TEMPLATE_v1 out_tmplt = __gen_SA_TEMPLATE_v1();

    in_tmplt.VECLEN = se_veclen<V>::value;
    in_tmplt.ELETYPE = se_eletype<V>::value;
    in_tmplt.ICNT0 = len;

    out_tmplt.VECLEN = sa_veclen<V>::value;
    out_tmplt.ICNT0 = len;

    __SE0_OPEN(in, in_tmplt);
    __SA0_OPEN(out_tmplt);

    /*
     * Perform the copy. If there is remainder, the last store will be
     * predicated.
     */
    int i;
    for (i = 0; i < cnt; i++)
    {
        V tmp = strm_eng<0, V>::get_adv();
        __vpred pred = strm_agen<0, V>::get_vpred();
        VP addr = strm_agen<0, V>::get_adv(out);
        __vstore_pred(pred, addr, tmp);
    }

    __SE0_CLOSE();
    __SA0_CLOSE();
}
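
The trip-count arithmetic at the top of memcpy_scalable_strm can be checked on a host. The sketch below is plain C++ with no C7x headers; vector_trip_count and last_store_lanes are hypothetical helper names, and the lane count is passed in explicitly rather than taken from element_count_of<V>::value.

```cpp
#include <cassert>

// Number of vector loads/stores needed to cover `len` elements when each
// vector holds `lanes` elements (ceiling division, as in the example).
inline int vector_trip_count(int len, int lanes)
{
    assert(len >= 0 && lanes > 0);
    int cnt = len / lanes;
    cnt += (len % lanes > 0);  // one extra, partially filled, iteration
    return cnt;
}

// Number of lanes actually written by the final store. When this is less
// than `lanes`, the last store relies on predication to mask the rest.
inline int last_store_lanes(int len, int lanes)
{
    assert(len > 0 && lanes > 0);
    int rem = len % lanes;
    return rem == 0 ? lanes : rem;
}
```

For instance, with 32 lanes per vector, copying 100 elements takes 4 iterations and the final store writes only 4 lanes, while copying 64 elements takes 2 iterations with no partial store.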

15 Resolved defects

Resolved defects in v2.1.1:

ID	Summary
CODEGEN-9607	c7x_scalable.h is incompatible with Windows Host Emulation
CODEGEN-9599	Some compiler diagnostic ID numbers changed in releases after 2019
CODEGEN-9082	Optimizer drops part of a compound conditional expression controlling a loop
CODEGEN-7503	Host Emulation: __vsel_pvkv with float16 arguments gives incorrect results

16 Known defects

The up-to-date, known defects in v2.1.1 can be found here (dynamically generated):

Known defects in v2.1.1
