4.1. Memory
4.1.1. Executing from flash
The TMS320F28xxx family is designed for standalone operation in embedded controller applications. The on-chip flash usually eliminates the need for external non-volatile memory and for a host processor from which to boot-load. For details on running applications from internal flash memory, refer to the application note Running an Application from Internal Flash Memory on the TMS320F28xxx DSP.
Executing code from RAM is faster than executing it from flash. However, C2000 MCUs support code-prefetch and data caching while executing from flash to minimize overhead. For details on these features, refer to the “Flash and OTP Memory” section in the device Technical Reference Manual (TRM).
Note
Both code-prefetch and data caching are disabled at power-up. Application software must enable code-prefetch and configure the wait states appropriately. It also needs to enable the data cache. Refer to the InitFlash() function in C2000Ware for details. For example, InitFlash() for F28004x is defined in <C2000Ware install directory>/device_support/f28004x/common/source/f28004x_sysctrl.c.
Table 4.1 lists cycle counts for executing the loop in Listing 4.1 from flash and from RAM. Compiler options used: -O3 --opt_for_speed=5 --abi=eabi. The --ramfunc=on option and a corresponding linker command file were used to execute code from RAM.
| Description | Cycles on F28004x |
|---|---|
| flash without enabling code-prefetch | 72006 |
| flash with code-prefetch enabled | 59996 |
| RAM | 54002 |
| Cycles to execute the loop (calculated) | 54000 |
int i;
// Per iteration: 10 x (RPT + 4 NOPs) at 5 cycles each, plus 4 cycles for BANZ
// Total cycles = (5 * 10 + 4) * 1000 iterations = 54000 cycles
for (i = 0; i < 1000; i++)
{
    asm(" RPT #3 || NOP;"); // 5 cycles - RPT + 4 NOPs
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
    asm(" RPT #3 || NOP;");
}
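The figures in Table 4.1 can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original measurement code. It recomputes the calculated loop count and expresses a measured count as a percentage overhead relative to a baseline. The names expected_loop_cycles and overhead_percent are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Calculated cycle count for the loop in Listing 4.1:
 * (cycles per RPT block * blocks per iteration + branch cycles) * iterations.
 * Hypothetical helper, introduced for illustration only. */
uint32_t expected_loop_cycles(uint32_t cycles_per_block,
                              uint32_t blocks_per_iteration,
                              uint32_t branch_cycles,
                              uint32_t iterations)
{
    return (cycles_per_block * blocks_per_iteration + branch_cycles) * iterations;
}

/* Overhead of a measured cycle count relative to a baseline, in percent.
 * Hypothetical helper, introduced for illustration only. */
double overhead_percent(uint32_t measured_cycles, uint32_t baseline_cycles)
{
    assert(baseline_cycles > 0u);
    return 100.0 * ((double)measured_cycles - (double)baseline_cycles)
                 / (double)baseline_cycles;
}
```

With the Table 4.1 values and the calculated count of 54000 as the baseline, flash without code-prefetch comes out roughly 33% higher, flash with code-prefetch roughly 11% higher, and RAM well under 0.01% higher.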
4.1.2. Executing from RAM
4.1.2.1. Code
As seen from the data in Table 4.1, it is beneficial to copy time critical code from its load address in flash to RAM for execution.
The ramfunc attribute is a TI compiler feature that lets source code specify that a function is to be placed in, and executed from, RAM. The attribute is applied to a function with GCC attribute syntax, as follows:
__attribute__((ramfunc)) void f(void) { ... }
The --ramfunc=on option is equivalent to specifying the attribute on all functions in source files compiled with the option, with no source modification required.
Note
Fast branch instructions (SBF/BF) are generated for RAM functions. These instructions take advantage of the dual prefetch queue on the C28x core, which reduces the cycles for a taken branch from 7 to 4.
The ramfunc attribute and option are available in C2000 compiler versions 15.6 and above. For older compilers that do not support this feature, the CODE_SECTION pragma may be used in combination with linker command file modifications.
#pragma CODE_SECTION(f, ".TI.ramfunc")
void f(void) { ... }
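As an illustration, the attribute can be hidden behind a macro so that the same source also builds with compilers that do not know it. This is a minimal sketch under stated assumptions: RAMFUNC and dot_product are hypothetical names, and the version test assumes that __TI_COMPILER_VERSION__ encodes version 15.6.0 as 15006000. On older TI compilers the macro expands to nothing, so the CODE_SECTION pragma shown above would still be needed there.

```c
#include <stddef.h>
#include <stdint.h>

/* RAMFUNC is a hypothetical convenience macro, not a TI-provided name.
 * Assumption: __TI_COMPILER_VERSION__ encodes 15.6.0 as 15006000. */
#if defined(__TI_COMPILER_VERSION__) && (__TI_COMPILER_VERSION__ >= 15006000)
#define RAMFUNC __attribute__((ramfunc))
#else
/* Other compilers (for example a host build) and older TI compilers:
 * the macro expands to nothing and the function stays where the linker puts it. */
#define RAMFUNC
#endif

/* Example of a time-critical routine: multiply-accumulate over two arrays. */
RAMFUNC int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i;

    for (i = 0; i < n; i++)
    {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Keeping the attribute in one macro means that a single definition changes if the placement mechanism changes, rather than every function declaration.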
The linker command file is set up to create symbols corresponding to the load and run addresses for the .TI.ramfunc section.
.TI.ramfunc : LOAD = FLASH_BANK0_SEC1,
RUN = RAMLS0to7,
LOAD_START(RamfuncsLoadStart),
LOAD_SIZE(RamfuncsLoadSize),
LOAD_END(RamfuncsLoadEnd),
RUN_START(RamfuncsRunStart),
RUN_SIZE(RamfuncsRunSize),
RUN_END(RamfuncsRunEnd),
ALIGN(4)
Code in the application uses memcpy to copy the .TI.ramfunc section from its load address in flash to its run address in RAM. The copy must complete before any function in the section is called.
memcpy(&RamfuncsRunStart, &RamfuncsLoadStart, (size_t)&RamfuncsLoadSize);
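The copy step can be wrapped in a small helper that is easy to exercise off-target. The following is a sketch under stated assumptions, not the C2000Ware implementation: copy_section and copy_ramfuncs are hypothetical names, and the extern declarations assume the symbol names from the linker command file entry above. Only the addresses of those symbols are meaningful, so the declared type is an assumption too.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* copy_section is a hypothetical helper, not a C2000Ware function.
 * It copies a section image from its load address to its run address.
 * size is in the units memcpy uses on the target (16-bit words on C28x,
 * where char is 16 bits wide). */
void copy_section(void *run_start, const void *load_start, size_t size)
{
    if (size != 0u && run_start != load_start)
    {
        memcpy(run_start, load_start, size);
    }
}

#if defined(__TMS320C28XX__)
/* Linker-generated symbols from the .TI.ramfunc entry shown above.
 * Assumption: declared as uint16_t; only their addresses are used. */
extern uint16_t RamfuncsLoadStart;
extern uint16_t RamfuncsLoadSize;
extern uint16_t RamfuncsRunStart;

/* Hypothetical wrapper: call early in startup, before any function
 * placed in .TI.ramfunc is invoked. */
void copy_ramfuncs(void)
{
    copy_section(&RamfuncsRunStart, &RamfuncsLoadStart,
                 (size_t)&RamfuncsLoadSize);
}
#endif
```

The target-only part is guarded by __TMS320C28XX__ so that the helper itself can be compiled and checked on a host, where the linker symbols do not exist.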
4.1.2.2. Data
Constant arrays - if access to a constant array is time critical, then consider copying the array from its load address in flash to a RAM address to reduce access time.
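One way to do this is to keep the constant array as the load image and copy it into an ordinary RAM array at startup. The sketch below assumes that the linker command file places const data in flash and non-const data in RAM. All names (gain_table_flash, gain_table_ram, init_gain_table, lookup_gain) and the table contents are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_LEN 8u /* power of two, so the index can be masked */

/* Constant table. Assumption: const data is linked to flash. */
static const int16_t gain_table_flash[TABLE_LEN] = {
    0, 100, 200, 300, 400, 500, 600, 700
};

/* RAM copy read by time-critical code (non-const, so it is placed in RAM). */
static int16_t gain_table_ram[TABLE_LEN];

/* Call once at startup, before the time-critical code runs. */
void init_gain_table(void)
{
    memcpy(gain_table_ram, gain_table_flash, sizeof(gain_table_flash));
}

/* Time-critical lookup: reads only the RAM copy. */
int16_t lookup_gain(uint16_t index)
{
    return gain_table_ram[index & (TABLE_LEN - 1u)];
}
```

The trade-off is RAM usage: the table now occupies both flash and RAM, so this is worth doing only for arrays on a time-critical path.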
4.1.3. Other considerations
If code and the data it accesses reside in the same physical memory block, performance degrades due to resource conflicts. Place code and the data it accesses in separate blocks to improve performance.
Wait states degrade performance. Most SARAM is zero-wait on 28x MCUs. Always check the data manual for the wait states of each physical memory block and whether they apply to program or data accesses.
If code makes extensive use of two data buffers, placing each buffer in a different RAM block may improve performance. The goal is to reduce the pipeline stalls that occur when a read from one buffer and a write to the other fall in the same cycle and target the same RAM block. Refer to Data allocation for instructions with two memory operands for an example.
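As a rough sketch of the buffer placement described above, each buffer can be assigned to its own named section with the DATA_SECTION pragma, and the linker command file can then map the two sections to different RAM blocks. The section names "bufA_sect" and "bufB_sect", the buffer names, and the add_offset routine are hypothetical; the linker command file entries that map the sections are not shown and depend on the device memory map.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 4u

/* Section names are hypothetical; the linker command file must map
 * each one to a different physical RAM block. The pragmas are only
 * emitted when building with a TI compiler. */
#if defined(__TI_COMPILER_VERSION__)
#pragma DATA_SECTION(input_buf, "bufA_sect")
#pragma DATA_SECTION(output_buf, "bufB_sect")
#endif
int16_t input_buf[BUF_LEN];
int16_t output_buf[BUF_LEN];

/* Reads from one buffer and writes to the other on every iteration,
 * so the two accesses can target different RAM blocks. */
void add_offset(int16_t offset)
{
    size_t i;

    for (i = 0; i < BUF_LEN; i++)
    {
        output_buf[i] = (int16_t)(input_buf[i] + offset);
    }
}
```

Whether this helps in practice depends on the device memory map and the generated code, so it is worth confirming with a cycle count before and after the change.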