About this document
This guide describes the performance optimizations of real-time applications on the Texas Instruments AM263x platform.
TI’s Sitara AM263x MCU is designed for real-time control applications such as traction inverters, onboard chargers, charging stations, DC-DC converters, industrial AC inverters, string inverters and industrial communication. A crucial aspect of such applications is signal chain performance: the latency of the signal chain determines the efficiency and performance of the system.
The goal of this document is to provide a comprehensive guide to performance optimization on the AM263x platform and to help real-time application developers achieve their performance goals. It covers:
- Signal chain performance considerations. See Real Time Control Loop
- List of Optimizations
- Software, SoC hardware and application
- Best practices, recommendations and step-by-step guide
Real Time Control Loop
- A crucial aspect of real-time control applications is signal chain performance, which covers sensing, processing and actuation. The latency of the signal chain determines the efficiency and performance of the system (for example, the RPM of the traction motor in an automotive EV application).
- Sensing uses on-chip analog-to-digital converters, comparators and similar peripherals
- Actuation uses PWM generation modules
- Processing involves the CPU core (R5F in AM263x) running the control loop algorithm
Signal chain
Control Interrupts and ISR
Interrupt Service Routines, commonly referred to as ISRs, play a critical role in real-time control applications. Control ISRs are used in real-time control systems to read sensed inputs (typically ADC results), run calculations or algorithms (called the control loop or control algorithm in this document) and write to outputs (typically PWM).
Various operations in Real Time Control interrupt
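As shown in the figure above, a control interrupt involves five operations: latching and responding to the interrupt, saving the R5F context, reading the sensed inputs, executing the calculations/algorithm, and writing to the outputs.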
Operation 2 to 5 in a typical application
The following sections cover the optimizations relevant to each operation performed in the control loop ISR. Each section provides:
- A description of the operation and its performance considerations
- Software and hardware optimization options
Operation 1: Latch and respond to interrupt
- Note
- Latency to latch and respond to an interrupt from ControlSS is 29 cycles (worst case) at a 400 MHz R5F core clock.
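The cycle counts quoted in this guide can be converted to time for a given core clock. The helper below is a minimal sketch of that arithmetic; the function name `cycles_to_ns` is hypothetical and is not part of any TI SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a CPU cycle count to nanoseconds
 * for a core clock given in MHz. */
double cycles_to_ns(uint32_t cycles, uint32_t core_clock_mhz)
{
    assert(core_clock_mhz > 0u);
    /* One cycle lasts (1000 / MHz) nanoseconds. */
    return ((double)cycles * 1000.0) / (double)core_clock_mhz;
}
```

At 400 MHz, the 29-cycle worst case above corresponds to `cycles_to_ns(29, 400)`, which is 72.5 ns.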
This latency depends on
- Propagation of interrupt from peripheral to R5F. This is predominantly dependent on hardware latencies.
- Instruction under execution in the R5F. The worst case is when the R5F has just started executing a multi-cycle instruction and is unable to respond to the interrupt. Example: a memory load instruction (from OCRAM) executing in the background loop or in a low-priority task while an ADC interrupt is expected
Optimization options:
- ADC interrupt
- Use the Early interrupt feature of the ADC to trigger the R5F interrupt ahead of completion of the ADC conversion. Part of the interrupt latency can then be absorbed in the ADC conversion latency
- EPWM interrupt
- Use Counter Compare to offset the EPWM interrupt
Interrupt route
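Both options above raise the interrupt slightly ahead of the event of interest, so that the response latency overlaps with hardware activity. The sketch below illustrates only the arithmetic for choosing such a lead time for an EPWM counter-compare event. It assumes an up-count time base and takes the time-base clock as a parameter. The function names, and the 200 MHz time-base clock used in the usage note, are assumptions for illustration, not SDK APIs or values confirmed by this guide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a latency in CPU cycles to EPWM
 * time-base counts, rounding up so the lead time is never too short. */
uint32_t latency_to_tb_counts(uint32_t cpu_cycles, uint32_t cpu_hz,
                              uint32_t tbclk_hz)
{
    assert(cpu_hz > 0u);
    uint64_t scaled = (uint64_t)cpu_cycles * tbclk_hz;
    return (uint32_t)((scaled + cpu_hz - 1u) / cpu_hz);
}

/* Hypothetical helper: compare value that occurs lead_counts before
 * event_count in an up-count period of period_counts counts,
 * wrapping around the start of the period if needed. */
uint32_t early_compare_value(uint32_t event_count, uint32_t lead_counts,
                             uint32_t period_counts)
{
    assert(period_counts > 0u);
    uint64_t event = event_count % period_counts;
    uint64_t lead = lead_counts % period_counts;
    return (uint32_t)((event + period_counts - lead) % period_counts);
}
```

For example, with a 400 MHz core and an assumed 200 MHz time-base clock, the 29-cycle response latency rounds up to 15 counts, so `early_compare_value(1000, 15, 20000)` gives 985. How much lead to apply is an application trade-off, since firing too early means the ISR may start before the data it needs is valid.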
Operation 2: R5F Context save
- Note
- Latency to save context is 53 cycles (best case) at 400MHz R5F core clock.
This latency depends on
- Saving/preserving the context of interrupted task
- R5F Program status register, LR, core registers
- This latency is predominantly dependent on software/compiler generated code.
- Support for floating point in the interrupt routine
- VFP control status: FPEXC, FPSCR, VFP double-precision float: D0-D7
- Support for nesting of interrupts
Refer to Optimizations for the list of optimization options and trade-offs for this operation.
Code for context save
Context save
Operation 3: R5F read accesses to sensed inputs
- Note
- Latency to read inputs (one 16-bit ADC result) is 18 cycles (best case) at 400MHz R5F core clock
This latency depends on
- Hardware latency. ControlSS peripheral register read access
- Software/code. Pattern of accessing the registers
- Data type conversion (for example: int to float)
For example, in motor control use cases the R5F reads 16-bit ADC results, converts them to floating-point values and writes the values to memory for further calculation.
Optimization options and trade-offs:
- A 2-byte and a 4-byte ADC read from the R5F (a 32-bit core) have the same latency (18 cycles).
- As an optimization, use consecutive ADC SOCs so that two 16-bit ADC results are stored consecutively and can be fetched with a single 32-bit read. This replaces two read accesses with one.
- Refer to the code below, which splits the 32-bit result into two 16-bit values
- For conversion to floating point, use R5 FPU.
- Write to TCM (1 cycle) instead of L2 OCRAM
- Alternate approach is to use DMA to bring the ADC results to TCM. This can absorb overheads for R5 to read ADC results.
ADC read access optimization
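The combined read described above can be sketched as follows. This is an illustration under stated assumptions, not TI SDK code: the function and type names are hypothetical, the register address is passed in by the caller, and the result of the lower-numbered SOC is assumed to sit in the lower 16 bits of the 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Two ADC results converted to float (hypothetical type). */
typedef struct {
    float first;   /* result from the lower 16 bits (assumed ordering) */
    float second;  /* result from the upper 16 bits (assumed ordering) */
} adc_pair_t;

/* Read two adjacent 16-bit ADC results with a single 32-bit access,
 * split them, and scale them to floating point. */
adc_pair_t read_adc_pair(const volatile uint32_t *result_pair_reg, float scale)
{
    assert(result_pair_reg != 0);

    uint32_t raw = *result_pair_reg;           /* one peripheral read */
    uint16_t lo = (uint16_t)(raw & 0xFFFFu);   /* first result */
    uint16_t hi = (uint16_t)(raw >> 16);       /* second result */

    adc_pair_t out;
    out.first = (float)lo * scale;             /* int-to-float conversion */
    out.second = (float)hi * scale;
    return out;
}
```

Only the single `*result_pair_reg` access reaches the peripheral bus. The masking, shifting and conversion run on register values inside the core, and the returned values can then be stored wherever the application keeps its working data.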
Operation 4: R5F executes calculations/algorithm
- Note
- Execution time of the control algorithm is specific to the application use case. For example, the Field Oriented Control (FOC) computations for an AM263x traction inverter execute in 985 ns. Refer to application note SPRAD32: https://www.ti.com/lit/an/sprad32/sprad32.pdf
This depends on
- Number of inputs and outputs required. For example: number of channels, phases, stages
- Number of floating point/trigonometric calculations
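To give a sense of the floating-point work involved, the sketch below shows the textbook Clarke and Park transforms that commonly appear in Field Oriented Control. It is a generic illustration and not the implementation measured in SPRAD32. The type and function names are hypothetical, a balanced three-phase system is assumed, and the sine and cosine of the rotor angle are assumed to be computed once per ISR and passed in.

```c
#include <assert.h>

typedef struct { float alpha; float beta; } ab_t;  /* stationary frame */
typedef struct { float d; float q; } dq_t;         /* rotating frame */

/* Clarke transform (amplitude-invariant form) from two measured phase
 * currents, assuming ia + ib + ic = 0. */
ab_t clarke(float ia, float ib)
{
    const float inv_sqrt3 = 0.57735027f;  /* 1 / sqrt(3) */
    ab_t out;
    out.alpha = ia;
    out.beta = (ia + 2.0f * ib) * inv_sqrt3;
    return out;
}

/* Park transform using precomputed sin/cos of the rotor angle. */
dq_t park(ab_t in, float sin_theta, float cos_theta)
{
    /* Debug sanity check: the pair should lie close to the unit circle. */
    float norm = sin_theta * sin_theta + cos_theta * cos_theta;
    assert(norm > 0.99f && norm < 1.01f);

    dq_t out;
    out.d = in.alpha * cos_theta + in.beta * sin_theta;
    out.q = -in.alpha * sin_theta + in.beta * cos_theta;
    return out;
}
```

Each additional phase, channel or stage repeats work of this kind, which is why the number of inputs and the number of trigonometric evaluations dominate the execution time.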
Optimization options and trade-offs:
Execution time for FOC computation from SPRAD32
Operation 5: R5F writes to outputs
- Note
- Throughput to write outputs (for example, PWM compare values) is 2 cycles per write (best case) at a 400 MHz R5F core clock. Refer to section CHAPTER_OPTIMIZATION_SECTION_3 for the settings needed to achieve this throughput.
This latency depends on
- Hardware latency (write access latency)
- Software/Generated code (pattern of read/write)
- Load from TCM consumes 1 R5 cycle. Store consumes 2 R5 cycles (1 bus cycle @200MHz).
- So 3 cycles for read+write.
Optimization options and trade-offs:
- Optimize compiler settings
- Set up the PWM region as device memory (not strongly ordered). Refer to section CHAPTER_OPTIMIZATION_SECTION_3. This helps to achieve 2-cycle write throughput.
- Store all the PWM data (compare values for modulation) in TCM consecutively for quick access
- Interleave reads and writes (interleave load and store instructions). TI ARM CLANG (used by the SDK) supports this optimization
- When the PWM writes are posted writes, the second, otherwise unused cycle of the store instruction can be used for the next read operation. This brings the overall cost down to 2 cycles per write
PWM write latency optimization
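The access pattern described above can be sketched in C as a run of independent load/store pairs over consecutively stored compare values, which leaves the compiler free to interleave them. This is an illustration only: the function names are hypothetical, the register pointers are supplied by the caller, and placing the value array in TCM and configuring the PWM region as device memory are build and MPU settings not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy consecutively stored compare values to PWM compare registers.
 * Each iteration is one load (from the value array) and one store
 * (to a peripheral register), with no dependency between iterations. */
void write_pwm_compares(volatile uint16_t *const cmp_regs[],
                        const uint16_t cmp_values[], size_t count)
{
    assert(cmp_regs != NULL && cmp_values != NULL);
    for (size_t i = 0; i < count; ++i) {
        *cmp_regs[i] = cmp_values[i];
    }
}

/* Rough cycle estimate for a burst of writes (hypothetical helper). */
uint32_t pwm_write_cycles(uint32_t count, uint32_t cycles_per_write)
{
    return count * cycles_per_write;
}
```

Using the figures quoted above, six compare updates cost about `pwm_write_cycles(6, 3)` = 18 cycles when each read and write is serialized, and about `pwm_write_cycles(6, 2)` = 12 cycles when the loads overlap with posted stores.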