6.1. Optimizing applications with MCAL SITARA MCU

6.1.1. Optimizations

Here is a short summary of the performance optimizations discussed in detail in this document.

6.1.1.1. Optimize memory placement

Note: For optimal performance, update the linker command file to place control ISR code and data in TCM (Tightly Coupled Memory), which provides single-cycle read/write access.

Placement of the interrupt handler (which invokes the user-defined ISR function), the ISR function itself, and its data in memory (TCM - Tightly Coupled Memory, or OCRAM - On-Chip RAM) is a crucial performance factor.

Check the linker command file, the compiler section attributes of the relevant functions, the assembly listing file and the generated .map file to find out where they are placed.

Function: HwiP_irq_handler or vimRegisterInterrupt()

Snippet:

/* FUNCTION DEF: void HwiP_irq_handler(void) */
.global HwiP_irq_handler
.type   HwiP_irq_handler, %function
.section ".text.hwi", "ax", %progbits
.arm
.align 2

As shown, this function is part of the .text.hwi section.

In the linker command file (linker.cmd), check which MEMORY region this SECTION is mapped to.

For example, in the below snippet from the linker.cmd file, .text.hwi is placed in TCM A (R5F_TCMA).

GROUP {
   .text.hwi:   palign(8)
   .text.cache: palign(8)
   .text.mpu:   palign(8)
   .text.boot:  palign(8)
   .text:abort: palign(8) /* this helps in loading symbols when using XIP mode */
} > R5F_TCMA
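The relevant functions and data can also be tagged for a TCM-mapped output section directly from C source using section attributes. Below is a minimal sketch; .text.hwi is the section shown above, while the .bss.tcm section name and the App_controlIsr function are illustrative and must match whatever your linker.cmd actually maps to TCM.

/* Minimal sketch: place a control ISR and its data in TCM-mapped sections.
 * ".bss.tcm" and App_controlIsr are illustrative names; the sections used
 * here must be mapped to R5F_TCMA/R5F_TCMB in the linker command file. */
#include <stdint.h>

/* ISR state kept in TCM for single-cycle access */
__attribute__((section(".bss.tcm")))
static volatile uint32_t gIsrCount;

/* Control ISR body placed alongside the HwiP handler in .text.hwi */
__attribute__((section(".text.hwi")))
void App_controlIsr(void *args)
{
    (void)args;
    gIsrCount++;
}

After building, confirm the placement in the generated .map file as described above.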

Note: Refer to this link for information on how to capture and use core trace data to visualize code profiling and coverage data in CCS: https://www.youtube.com/watch?v=4hEY0sZToUE

6.1.1.2. Optimize R5F MPU settings

Note: Recommended R5 MPU settings for control peripherals: Device, Non-Shareable.

Note: MPU settings for On-Chip RAM shall not be kept shareable unless the region is specifically intended as a shared region for IPC.

Modify the R5F MPU configuration for the ControlSS register space to enable posted writes.

Figure: MPU settings for EPWM peripherals

6.1.1.2.1. Background info on MPU and Cache settings

This section provides an overview of various features available in the R5F MPU. Cortex-R5 has a built-in Memory Protection Unit (MPU) module that helps configure the memory types and attributes defined in the processor’s memory ordering model. The MPU is specific to each core in the system and can only modify the memory ordering model of the CPU to which it is attached.

Figure: MPU settings

6.1.1.2.2. ARM Cortex-R5 MPU Settings

Memory Types

  • Strongly Ordered

    • All memory accesses to Strongly Ordered memory occur in the program order.

    • An access to memory marked as Strongly Ordered acts as a memory barrier to all other explicit accesses from that processor, until the point at which the access is complete.

    • All Strongly Ordered accesses are assumed to be shared

    • It is recommended to configure the external peripheral or FIFO logic (accessible via EMIF) as strongly ordered

  • Device

    • Defined for memory locations where an access to the location can cause side effects

    • The load or store instruction to or from a Device memory always generates AXI transactions of the same size as implied by the instruction

    • Can be shared or non-shared

    • It is recommended to configure the peripheral register spaces as device type

  • Normal

    • Defined for memories that store information without side-effects. For example, RAM, Flash

    • Can be shared or non-shared

    • Can be cached or non-cached

If there are multiple accesses to normal memory, the CPU might optimize them, leading to a different size or number of accesses, and it may also reorder them; the CPU assumes that the order and number of accesses to normal memory are not significant. For example, two 16-bit accesses to consecutive normal memory locations may be combined into a single 32-bit access. In contrast, for device and strongly-ordered memories the CPU always performs the accesses in the order specified by the instructions and does not alter their order, size or number. Device accesses are only ordered with respect to other device accesses, while strongly-ordered accesses are ordered with respect to all other explicit accesses. Note that strongly-ordered memory therefore carries the largest performance penalty.
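The MPU memory type only controls how the hardware treats an access; the compiler is still free to merge or reorder accesses to objects it considers ordinary data. The short sketch below (the register address is a placeholder, not a real peripheral base) illustrates why peripheral registers are accessed through volatile pointers, while normal RAM buffers are left for the compiler and CPU to optimize.

#include <stdint.h>

/* Placeholder address for illustration only */
#define HYPOTHETICAL_PERIPH_REG   (*(volatile uint32_t *)0x50260000u)

static uint16_t gSamples[2];   /* normal (cacheable) RAM */

void accessExample(void)
{
    /* volatile: the compiler must issue both writes in this order; the MPU
     * Device attribute ensures the hardware does not merge or reorder them. */
    HYPOTHETICAL_PERIPH_REG = 1u;
    HYPOTHETICAL_PERIPH_REG = 0u;

    /* normal memory: these two 16-bit writes may be combined by the compiler
     * or CPU into a single 32-bit access. */
    gSamples[0] = 0x1234u;
    gSamples[1] = 0x5678u;
}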

6.1.1.2.3. Shared and Non-Shared Memories

The shared memory attribute permits normal memory to be accessed by multiple processors or other system masters, whereas non-shared memory can only be accessed by the host CPU. The processor’s L1 cache does not cache shared normal regions, which means a region marked as shared is always a non-cached region (this device does not support L2 cache).

6.1.1.2.4. Cache Settings

This device only supports L1 cache. Cache property is only applicable for normal memories. Due to the unavailability of L2 cache, cache is applicable only for normal non-shared memories. The following are various configurations available for cache:

  • WTNOWA - Write-Through, No Write-Allocate

  • WBNOWA - Write-Back, No Write-Allocate

  • WBWA - Write-Back, Write-Allocate.

If an access to a cached, non-shared normal memory is performed, the cache controller does a lookup in the cache table. If the location is already present in the cache (a cache hit), the data is read from or written to the cache. If the location is not present (a cache miss), a cache line is allocated for the memory location; that is, the cache is always Read-Allocate (RA). In addition, the data cache can allocate on a write access if the memory is marked as Write-Allocate (WA). Write accesses that hit in the cache are always written to the cache locations. If the memory is marked as Write-Through (WT), the write is performed in the actual memory as well. If the memory is marked as Write-Back (WB), the cache line is marked as dirty, and the write is only performed on the actual memory when the line is evicted.
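Because shared regions are not cached on this device, buffers exchanged with other cores or DMA masters are typically either placed in a non-cached (shared) region or kept cache-line aligned so that cache maintenance on one buffer cannot affect neighbouring data. A minimal sketch, assuming the linker maps a hypothetical .bss.nocache section to such a region and using the 32-byte Cortex-R5F L1 cache line size:

#include <stdint.h>

#define CACHE_LINE_SIZE   (32u)   /* Cortex-R5F L1 cache line size */

/* Buffer shared with another core: ".bss.nocache" is an illustrative section
 * name and must be mapped to a non-cached (shareable) region in linker.cmd
 * and the MPU configuration. */
__attribute__((section(".bss.nocache"), aligned(CACHE_LINE_SIZE)))
uint32_t gIpcRxBuffer[64];

/* Private working buffer in normal cached memory, cache-line aligned so a
 * write-back/invalidate of this buffer never touches adjacent variables. */
__attribute__((aligned(CACHE_LINE_SIZE)))
uint32_t gLocalWorkBuffer[64];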

6.1.1.2.5. MPU Regions

The MPU present in this device supports up to 16 regions. The memory accessed by the CPU can be partitioned into up to 16 regions (with region 0 having the lowest priority and region 15 the highest). Each region can be configured with a specific memory type and assigned the required permissions. When the CPU performs a memory access, the MPU compares the memory address with the programmed memory regions. If a matching memory region is found, it checks whether the required permissions are set; if not, it signals a Permissions Fault memory abort. If no matching memory region is found, the access is mapped onto the background region. If the background region is not enabled, it signals a Background Fault memory abort.

MPU settings are available under examples/Utils.

6.1.1.2.5.1. Region Base address

The base address defines the start of the memory region. It must be aligned to a region-sized boundary. For example, if a region size of 8KB is programmed for a given region, the base address must be a multiple of 8KB.

Note: If the region is not aligned correctly, this results in Unpredictable behavior.
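For reference, the sketch below programs one MPU region directly through the Cortex-R5 CP15 registers as Device, Non-Shareable (TEX=0b010, C=0, B=0) with full access, using a 4 KB region whose base address is a multiple of the region size. The region number, base address and size are placeholders; in an MCAL SITARA MCU application, the MPU configuration utilities under examples/Utils are the intended way to set this up.

#include <stdint.h>

/* Placeholder values for illustration only */
#define REGION_NUM        (4u)
#define REGION_BASE       (0x50000000u)   /* must be a multiple of the region size */
#define REGION_SIZE_4KB   (11u << 1)      /* SIZE field: region size = 2^(SIZE+1) = 4 KB */
#define REGION_ENABLE     (1u << 0)
/* Access control: AP=0b011 (full access), TEX=0b010, S=0, C=0, B=0
 * => Device, Non-Shareable */
#define REGION_ATTR       ((0x3u << 8) | (0x2u << 3))

static void mpuSetDeviceRegion(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c6, c2, 0" :: "r"(REGION_NUM));   /* region number select  */
    __asm__ volatile ("mcr p15, 0, %0, c6, c1, 0" :: "r"(REGION_BASE));  /* region base address   */
    __asm__ volatile ("mcr p15, 0, %0, c6, c1, 4" :: "r"(REGION_ATTR));  /* region access control */
    __asm__ volatile ("mcr p15, 0, %0, c6, c1, 2" :: "r"(REGION_SIZE_4KB | REGION_ENABLE)); /* size + enable */
    __asm__ volatile ("dsb");
    __asm__ volatile ("isb");
}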

6.1.1.3. Optimize compiler settings

Note: Recommended optimization level to use with MCAL SITARA MCU: -Os. This option gives a good balance between code size and performance. Note that the -O3 option is recommended for optimizing performance, but it is likely to increase the compiler-generated code size. Refer to https://software-dl.ti.com/codegen/docs/tiarmclang/rel2_1_0_LTS/compiler_manual/using_compiler/compiler_options/optimization_options.html for more information.

Choose compiler code generation settings (optimization levels -O[0|1|2|3|fast|g|s|z]) according to application needs.

The tiarmclang compiler supports a variety of different optimization options, including:

  • -O0 - no optimization; generates code that is debug-friendly

  • -O1 - restricted optimizations, providing a good trade-off between code size and debuggability

  • -O2 or -O - most optimizations enabled with an eye towards preserving a reasonable compile-time

  • -O3 - in addition to optimizations available at -O2, -O3 enables optimizations that take longer to perform, trading an increase in compile-time for potential performance improvements

  • -Ofast - enables all optimizations from -O3 along with other aggressive optimizations that may realize additional performance gains, but also may violate strict compliance with language standards

  • -Os - enables all optimizations from -O2 plus some additional optimizations intended to reduce code size while mitigating negative effects on performance

  • -Oz - enables all optimizations from -Os plus additional optimizations to further reduce code size with the risk of sacrificing performance

  • -Og - enables most optimizations from -O1, but may disable some optimizations to improve debuggability.

Offload floating-point calculations to the R5 FPU:

  • Use -mfpu=vfpv3-d16

Decide Thumb vs ARM mode based on profiling:

  • Compiler option -mthumb - Instruct the compiler to generate THUMB mode instructions

  • Compiler option -marm - Instruct the compiler to generate ARM mode instructions

6.1.1.4. Optimize application code

  1. Unroll loops – for loops and nested if statements

  2. Avoid repeated peripheral register reads by maintaining a local copy (in TCM) of register values and using that copy in the ISR (see the sketch after this list).

  3. Use MCAL SITARA MCU R5 trig math library functions for trigonometric calculations.

  4. Allocate ADC results contiguously to take advantage of a single 32-bit read returning two 16-bit ADC results.

  5. Program specific parts of the control loop using R5F assembly to get greater control over the access pattern and sequence to achieve better utilization of R5F cycles.

  6. Allocate IPC read buffers from memory closer to the CPU: R5F TCM, ICSS DMEM, M4 TCM, etc.
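A minimal sketch of items 2 and 4 above; the ADC result address, the .bss.tcm section name and the function name are placeholders, not SDK definitions:

#include <stdint.h>

/* Placeholder address of two consecutive 16-bit ADC result registers */
#define HYPOTHETICAL_ADC_RESULT01_ADDR   (0x52000000u)

/* Local shadow of a peripheral register and the latest ADC results, kept in a
 * TCM-mapped section (".bss.tcm" is an illustrative section name). */
__attribute__((section(".bss.tcm"))) static uint32_t gEpwmCmpShadow;
__attribute__((section(".bss.tcm"))) static uint16_t gAdcResult0;
__attribute__((section(".bss.tcm"))) static uint16_t gAdcResult1;

void App_controlLoopIsr(void)
{
    /* One 32-bit read returns two contiguous 16-bit ADC results */
    uint32_t packed = *(volatile uint32_t *)HYPOTHETICAL_ADC_RESULT01_ADDR;
    gAdcResult0 = (uint16_t)(packed & 0xFFFFu);
    gAdcResult1 = (uint16_t)(packed >> 16);

    /* Work on the local shadow instead of reading the peripheral again */
    gEpwmCmpShadow = (gEpwmCmpShadow & ~0xFFFFu) | gAdcResult0;
}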

SoC hardware features:

  1. Enhanced topology for non-conflicting access to the significant number of control peripherals from multiple R5F cores (discussed below)

  2. Enhanced interconnect for

    • Low latency read access to analog sensing peripherals (ADC results)

    • Low latency write access to control/actuation peripherals (EPWM comparators)

    • Supporting higher throughput (back to back writes) to control peripherals

6.1.1.4.1. Peripheral Bus Architecture & Mapping

The CONTROLSS interconnect is divided into the separate interconnects listed below, each connected to the CORE VBUSP interconnect individually. Since these are connected to the CORE VBUSP interconnect separately, each interconnect can be accessed in parallel by different initiators without any arbitration. Accesses to a single CONTROLSS interconnect by multiple initiators at the same time will be arbitrated.

  • MISC PERIPH

  • MISC CONFIG

  • FSI0 (FSITX[0:1] and FSIRX[0:1])

  • FSI1 (FSITX[2:3] and FSIRX[2:3])

  • G0_EPWM, G1_EPWM, G2_EPWM, G3_EPWM

  • ADC0, ADC1, ADC2, ADC3, ADC4, ADC5

Each of these has its own connection to the CORE VBUSP interconnect.

MISC PERIPH, MISC CONFIG, FSI0 and FSI1 are single-initiator, multiple-target interconnects as shown in the diagram. The EPWM interconnects are divided into 4 groups (G0_EPWM, G1_EPWM, G2_EPWM and G3_EPWM) accessed using different address regions in the memory map. Each interconnect has n target ports depending on the number of EPWMs in the design. After the interconnect, a 4:1 static mux can be configured per EPWM using the CONTROLSS_CTRL.EPWM_STATICXBAR_SEL0 and CONTROLSS_CTRL.EPWM_STATICXBAR_SEL1 registers.
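A minimal sketch of selecting the static mux input for one EPWM instance; the base address, register offset and 2-bit field layout below are placeholders and must be taken from the device TRM / CSL header:

#include <stdint.h>

/* Placeholder macros for illustration only; instances beyond those covered by
 * SEL0 use the SEL1 register. */
#define HYPOTHETICAL_CONTROLSS_CTRL_BASE        (0x50200000u)
#define HYPOTHETICAL_EPWM_STATICXBAR_SEL0_OFF   (0x100u)
#define HYPOTHETICAL_SEL_BITS_PER_EPWM          (2u)   /* 4:1 mux -> 2 select bits */

static inline void epwmStaticXbarSelect(uint32_t epwmInstance, uint32_t muxSel)
{
    volatile uint32_t *sel0 = (volatile uint32_t *)
        (HYPOTHETICAL_CONTROLSS_CTRL_BASE + HYPOTHETICAL_EPWM_STATICXBAR_SEL0_OFF);
    uint32_t shift = epwmInstance * HYPOTHETICAL_SEL_BITS_PER_EPWM;
    uint32_t mask  = 0x3u << shift;

    /* Read-modify-write the select field for this EPWM instance */
    *sel0 = (*sel0 & ~mask) | ((muxSel & 0x3u) << shift);
}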

ADC0, ADC1, ADC2, ADC3, ADC4 and ADC5 have a separate interconnect per initiator (R5FSS0-0_AHB, R5FSS0-1_AHB, R5FSS1-0_AHB, R5FSS1-1_AHB, CORE VBUSP (Port0), and CORE VBUSP (Port1)). The number of target ports depends on the number of ADCs in the design. Each initiator can independently access any ADC register without any arbitration.

6.1.1.5. Memory optimization

Code Size optimizations

  • The -Oz option is recommended for optimizing code size.

  • The -O3 option is recommended for optimizing performance, but it is likely to increase compiler-generated code size.

  • ARM/Thumb

    • -mthumb - Instruct the compiler to generate THUMB mode instructions (16-bit THUMB or T32 THUMB depending on which processor variant is selected) for current compilation

    • -marm - Instruct the compiler to generate ARM mode instructions for current compilation