Here is a short summary of the performance optimizations discussed in detail in this document.
Memory optimizations
In the MCU SDK Driver Porting Layer (DPL), the NORTOS IRQ handler for R5F uses a common dispatcher (HwiP_irq_handler and HwiP_irq_handler_c) for all interrupts. The dispatcher looks up and calls the user-specified ISR function. This results in relatively high interrupt latency.
The optimized approach uses a dedicated IRQ handler per interrupt, which then calls the user-specified ISR function. This reduces interrupt latency.
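The difference between the two approaches can be sketched on a host machine as follows. This is a simplified model, not the SDK implementation: the table layout, NUM_IRQS, common_dispatch, pwm_isr and pwm_irq_handler are hypothetical names chosen for illustration, and the real handlers also save and restore CPU context in assembly.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_IRQS 8u  /* hypothetical table size for this sketch */

typedef void (*IsrFxn)(void *args);

/* Hypothetical registration table, standing in for the dispatcher's state. */
typedef struct {
    IsrFxn isr[NUM_IRQS];
    void  *args[NUM_IRQS];
} IsrTable;

IsrTable gIsrTable;
uint32_t gPwmIsrCount;

/* User ISR used by both approaches. */
void pwm_isr(void *args)
{
    (void)args;
    gPwmIsrCount++;
}

/* Common dispatcher: every interrupt pays for a range check, a table
 * look-up and an indirect call before the user ISR runs. */
int common_dispatch(uint32_t irqNum)
{
    if (irqNum >= NUM_IRQS || gIsrTable.isr[irqNum] == NULL) {
        return -1;
    }
    gIsrTable.isr[irqNum](gIsrTable.args[irqNum]);
    return 0;
}

/* Dedicated handler: the ISR is known at build time, so the handler
 * calls it directly without any look-up. */
void pwm_irq_handler(void)
{
    pwm_isr(NULL);
}
```

In this model the dedicated handler saves the look-up and the indirect call; on the target, the actual gain depends on how much context the handler has to save.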
This section describes the interrupt and exception handling of the ARM Cortex-R5 processor as implemented on Sitara MCU microcontrollers, as well as the related operating modes of the processor.
Exceptions are interruptions of the normal program flow; this includes peripheral interrupts. The Cortex-R5 processor usually takes care to preserve the critical parts of the current processor state (saving and restoring the CPSR and the banked stack pointers), so that the normal program flow can be resumed after the application has handled the exception. The Cortex-R5 processor implements the following exceptions:
Interrupt Request (IRQ)
nIRQ is an input to the processor core; a low level on nIRQ causes the processor to take the IRQ exception, if it is not masked. On Sitara MCUs, nIRQ is connected to the VIM. It is usually used as the “general-purpose” interrupt line, and interrupt dispatching is usually handled by the VIM. The Cortex-R5 processor offers a so-called VIC port that supplies the interrupt vector address directly to the processor, in order to reduce interrupt latency for IRQs.
Vectored Interrupt Manager (VIM) Module
The Vectored Interrupt Manager (VIM) module provides hardware assistance for prioritizing and controlling the many interrupt sources present on Sitara family devices. Interrupts are caused by events outside of the normal flow of program execution. Normally, these events require a timely response from the central processing unit (CPU); therefore, when an interrupt occurs, the CPU switches execution from the normal program flow to an ISR.
The VIM module has the following features:
Hardware Vectored Interrupts (only IRQ)
The ARM Cortex-R5 (ARMv7-R architecture) processor does not support interrupt nesting in hardware, as some Cortex-M (ARMv7-M architecture) processors do. Only two-level nesting is possible when using IRQ and FIQ, where the FIQ can interrupt the IRQ.
This is mainly because the core has only one Saved Program Status Register (SPSR) and one Link Register (LR) for IRQ mode. If an IRQ were interrupted by another IRQ, these registers would be overwritten (corrupted), and the processor state could not be restored later. In addition, the ARM C calling convention has to be taken into account: R0-R3 and R12 are not callee-saved, so a C function called from the handler may change them. To work around these limitations, the ISR handler has to preserve the CPU registers (SPSR, LR, R0-R3 and R12) on the stack. Furthermore, before IRQs are re-enabled, the CPU has to be switched to another mode, usually System mode; otherwise, a nested IRQ taken while the CPU is still in IRQ mode would overwrite the current LR (used by subroutine calls).
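The register corruption described above can be illustrated with a toy host-side model. This is only a sketch under simplifying assumptions: IrqBank, irq_entry and the two helper functions are hypothetical names, a struct copy stands in for pushing registers to the stack, and the real handler does this work in assembly.

```c
#include <stdint.h>

#define MODE_IRQ 0x12u  /* CPSR mode bits for IRQ mode */
#define MODE_SYS 0x1Fu  /* CPSR mode bits for System mode */

/* Toy model of the IRQ-mode banked registers: exactly one LR and one SPSR. */
typedef struct {
    uint32_t lr_irq;
    uint32_t spsr_irq;
} IrqBank;

/* Models IRQ exception entry: the single banked copy is overwritten
 * with the return address and the CPSR of the interrupted code. */
void irq_entry(IrqBank *bank, uint32_t returnAddr, uint32_t interruptedCpsr)
{
    bank->lr_irq = returnAddr;
    bank->spsr_irq = interruptedCpsr;
}

/* Nested IRQ taken while still in IRQ mode, nothing saved: the outer
 * return address is lost. Returns the address the outer handler would use. */
uint32_t outer_return_without_save(uint32_t outerRet, uint32_t nestedRet)
{
    IrqBank bank;
    irq_entry(&bank, outerRet, MODE_SYS);   /* first IRQ interrupts System-mode code */
    irq_entry(&bank, nestedRet, MODE_IRQ);  /* nested IRQ clobbers LR_irq and SPSR_irq */
    return bank.lr_irq;
}

/* Reentrant pattern: save the banked registers before re-enabling IRQs,
 * restore them before the exception return. */
uint32_t outer_return_with_save(uint32_t outerRet, uint32_t nestedRet)
{
    IrqBank bank;
    IrqBank saved;
    irq_entry(&bank, outerRet, MODE_SYS);
    saved = bank;                           /* stands in for pushing LR_irq and SPSR_irq */
    irq_entry(&bank, nestedRet, MODE_SYS);  /* nested IRQ arrives after the switch to System mode */
    bank = saved;                           /* stands in for popping them again */
    return bank.lr_irq;
}
```

With an outer return address of 0x1000 and a nested one of 0x2000, the first function returns 0x2000 (the outer context is lost) and the second returns 0x1000.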
Because the ISR then executes in User or System mode rather than in IRQ mode, it uses the “main” stack and not the IRQ-exclusive stack. System mode is usually used, as it offers privileged access similar to IRQ mode. Before interrupts can be enabled again, the current interrupt source must be cleared or masked, so that the ISR is not immediately interrupted by “itself”. ARM suggests a flow for implementing a reentrant IRQ handler. However, this flow should be optimized to get the shortest interrupt latency and has to be extended with special Vectored Interrupt Manager (VIM) handling to work on Sitara MCUs. Both flows are described below.
Suggested Flow by ARM
Modified IRQ Handler Flow to Work With VIM
The OS abstraction layer (DPL) of the MCU PLUS SDK is architected for networking applications and introduces performance overhead (interrupt latency). To support real-time control use cases, the IRQ handlers are optimized, reducing software overhead down to 150 ns. Use the steps below to utilize the optimized IRQ handlers.
Steps to register a custom interrupt using MCU SDK (default approach)
Steps to register a custom interrupt using optimized approach
Choose IRQ handler macro based on application need:
For FIQ, use the default approach in MCU SDK. The Fast Interrupt operating mode has eight banked processor registers (R8 - R12, the SP, the LR and the SPSR), which has the advantage that these registers do not have to be saved to the stack before an interrupt handler uses them. This improves interrupt latency for FIQ. If only one interrupt is mapped to the FIQ, the whole interrupt service handler can be placed at the FIQ vector address (0x1C) to further improve interrupt latency (avoiding unnecessary branches).
The placement of the interrupt handler (which invokes the user-defined ISR function), the ISR function and its data in memory (TCM - tightly coupled memory, or OCRAM - on-chip RAM) is a crucial factor for interrupt latency.
Check the linker command file, the compiler attributes of the relevant functions, the assembly listing file and the generated .map file to find out where each of them is placed.
Function: HwiP_irq_handler (assembly function defined in HwiP_armv7r_handlers_nortos_asm.S)
Snippet:
As shown, this function is part of the .text.hwi section.
In the linker command file (linker.cmd), check which MEMORY region this section is assigned to under SECTIONS.
For example, in the snippet below from the linker.cmd file, .text.hwi is placed in TCM A.
Modify R5F MPU configuration for the ControlSS register space to enable posted writes.
This section provides an overview of the features available in the R5F MPU. The Cortex-R5 has a built-in Memory Protection Unit (MPU) module that is used to configure the memory types and attributes defined in the processor’s memory ordering model. The MPU is specific to each core in the system and can only modify the memory ordering model of the CPU to which it is attached.
ARM Cortex-R5 MPU Settings
Memory Types
When there are multiple accesses to normal memory, the CPU may optimize them into a different set of accesses of different size or number. The CPU may also alter the order of the accesses. It assumes that the order and number of accesses to normal memory are not significant. For example, two 16-bit accesses to consecutive normal memory locations may be combined into a single 32-bit access. For device and strongly-ordered memory, in contrast, the CPU always performs the accesses in the order specified by the instructions and does not alter their order, size or number. Device accesses are only ordered with respect to other device accesses, while strongly-ordered memory accesses are ordered with respect to all other explicit accesses. Note that strongly-ordered memory incurs a larger performance penalty.
Shared and Non-Shared Memories
The shared memory attribute permits normal memory access by multiple processors or other system masters, whereas non-shared memory can only be accessed by the host CPU. The processor’s L1 cache does not cache shared normal regions. This means that a region marked as shared is always a non-cached region (this device does not support L2 cache).
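The rule above can be written as a small predicate. This is a minimal sketch for reasoning about region attributes on a host machine; MemType, RegionAttr and region_is_cached are hypothetical names and do not correspond to SDK or hardware register definitions. It assumes that only normal memory can be cached at all.

```c
#include <stdbool.h>

/* Memory types as described in the text. */
typedef enum {
    MEM_STRONGLY_ORDERED,
    MEM_DEVICE,
    MEM_NORMAL
} MemType;

/* Hypothetical, simplified view of one region's attributes. */
typedef struct {
    MemType type;
    bool    shared;                /* region marked as shared */
    bool    cachePolicyRequested;  /* a cacheable policy was selected for the region */
} RegionAttr;

/* A region is effectively cached by L1 only if it is normal memory,
 * not shared, and a cacheable policy was requested. */
bool region_is_cached(const RegionAttr *attr)
{
    return (attr->type == MEM_NORMAL) &&
           !attr->shared &&
           attr->cachePolicyRequested;
}
```

For example, a normal region that requests a cacheable policy but is also marked as shared evaluates to not cached, which matches the behaviour described for this device.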
Cache Settings
This device only supports L1 cache. The cache property applies only to normal memory. Because there is no L2 cache, caching applies only to normal non-shared memory. The following are various configurations available for cache:
MPU Regions
The MPU present in this device supports up to 16 regions. The memory accessed by the CPU can be partitioned into up to 16 regions (with region 0 having the lowest priority and region 15 having the highest). Each region can be configured with a specific memory type and assigned the required permissions. When the CPU performs a memory access, the MPU compares the memory address with the programmed memory regions. If a matching memory region is found, the MPU checks whether the required permissions are set. If not, it signals a Permissions Fault memory abort. If no matching memory region is found, the access is mapped onto a background region. If the background region is not enabled, the MPU signals a Background Fault memory abort.
MPU settings can be configured easily using SysConfig.
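The look-up described above can be sketched as a host-side model. This is a simplification and not the hardware behaviour in full: MpuRegion, MpuResult and mpu_check are hypothetical names, the permission model is reduced to a single writable flag, and details such as privilege levels and sub-regions are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MPU_NUM_REGIONS 16

typedef enum {
    MPU_OK,
    MPU_PERMISSION_FAULT,
    MPU_BACKGROUND_FAULT
} MpuResult;

/* Hypothetical, simplified region descriptor. */
typedef struct {
    bool     enabled;
    uint32_t base;      /* start address */
    uint32_t size;      /* size in bytes */
    bool     writable;  /* simplified permissions: reads are always allowed */
} MpuRegion;

/* Resolve one access. Regions are scanned from 15 down to 0 so that the
 * highest-numbered matching region takes priority. */
MpuResult mpu_check(const MpuRegion regions[MPU_NUM_REGIONS],
                    bool backgroundEnabled, uint32_t addr, bool isWrite)
{
    for (int i = MPU_NUM_REGIONS - 1; i >= 0; i--) {
        const MpuRegion *r = &regions[i];
        if (r->enabled && addr >= r->base && (addr - r->base) < r->size) {
            if (isWrite && !r->writable) {
                return MPU_PERMISSION_FAULT;
            }
            return MPU_OK;
        }
    }
    /* No region matched: fall back to the background region if enabled. */
    return backgroundEnabled ? MPU_OK : MPU_BACKGROUND_FAULT;
}
```

In this model, a read-only region 5 that overlaps a read-write region 0 makes writes to the overlapping addresses fault, because region 5 has the higher priority.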
Recommended MPU settings for the memories:
Flash
RAM
Peripherals
External memories (accessed via EMIF module)
Choose compiler code generation settings (optimization levels O[0|1|2|3|fast|g|s|z]) according to application need.
The tiarmclang compiler supports a variety of different optimization options, including:
Offload floating point calculations to R5 FPU
Decide Thumb vs ARM mode based on profiling
Disable assertions used in the MCU PLUS SDK driver libraries. Debug assertions (DebugP_assert) validate the inputs or arguments passed to driver functions, which adds latency to the execution of those functions. Once the application code is validated (tested for passing valid inputs/arguments), the debug assertions can be disabled to reduce the overhead. To disable the debug assertions,
\source\kernel\dpl\DebugP.h
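The general technique of removing assertions at compile time can be sketched as follows. This is a generic illustration only, not the SDK's code: APP_ASSERT_ENABLED, APP_assert, gAssertFailCount and app_set_duty are hypothetical names, and the actual switch used by the SDK should be taken from DebugP.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compile-time switch; 0 compiles the checks out. */
#ifndef APP_ASSERT_ENABLED
#define APP_ASSERT_ENABLED 0
#endif

uint32_t gAssertFailCount;  /* counts failed checks when assertions are enabled */

#if APP_ASSERT_ENABLED
#define APP_assert(expr) do { if (!(expr)) { gAssertFailCount++; } } while (0)
#else
/* Disabled: the expression is not evaluated and no code is generated. */
#define APP_assert(expr) ((void)0)
#endif

/* Hypothetical driver-style function with argument validation. */
int32_t app_set_duty(uint32_t percent, uint32_t *dutyReg)
{
    APP_assert(dutyReg != NULL);
    APP_assert(percent <= 100u);
    *dutyReg = percent;
    return 0;
}
```

With the switch set to 0, an out-of-range value such as 150 is written without any check, which is why assertions should only be disabled after the application has been validated.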
SoC hardware features:
The CONTROLSS interconnect is divided into the separate interconnects listed below, each connected individually to the CORE VBUSP interconnect. Because they are connected separately, different initiators can access different interconnects in parallel without any arbitration. Accesses to a single CONTROLSS interconnect by multiple initiators at the same time are arbitrated.
The diagram below shows the different interconnect connections.
MISC PERIPH, MISC CONFIG, FSI0 and FSI1 are single-initiator, multiple-target interconnects, as shown in the diagram. The EPWM interconnect is divided into 4 groups, G0_EPWM, G1_EPWM, G2_EPWM and G3_EPWM, which are accessed through different address regions in the memory map. Each interconnect has n target ports, depending on the number of EPWMs in the design. After the interconnect, a 4:1 static mux can be configured per EPWM using the CONTROLSS_CTRL.EPWM_STATICXBAR_SEL0 and CONTROLSS_CTRL.EPWM_STATICXBAR_SEL1 registers.
ADC0, ADC1, ADC2, ADC3, ADC4 and ADC5 are separate interconnects, one per initiator (R5FSS0-0_AHB, R5FSS0-1_AHB, R5FSS1-0_AHB, R5FSS1-1_AHB, CORE VBUSP (Port0), and CORE VBUSP (Port1)). The number of target ports depends on the number of ADCs in the design. Each initiator can independently access any ADC register without any arbitration.
Code Size optimizations