AM263Px MCU+ SDK  10.01.00
Optishare

Introduction

This document explains what OptiShare is and how it can be enabled in a project.

Optishare (Summary)

The following image shows what optishare does at a very high level.

OptiShare Summary

As shown, the OptiShare tool takes as input the application binaries (ELF format) for the different cores. Its output is one binary per core plus one additional binary called the shared object. In the output, all common functions (text) and read-only data are removed from each CPU binary and placed in the shared object.

IPC Notify Echo Example With OptiShare is an SDK example that uses OptiShare.

Problem Statement

  1. AM2x devices are multicore devices.
  2. When an application is built, each core is compiled separately.
  3. Libraries are linked statically into each project.

When the projects of two or more cores use the same library, that library is stored in memory more than once, which wastes memory.

OptiShare as a solution

The solution here is to have a concept of "Shared" code/data.

Space Saving with Optishare

However, consider the following code:

#define PmuP_SETUP_COUNTER_DIVIDER_VAL (64ULL)

// global variable
uint64_t gCounterFreqHz = 0;

void PMU_TEXT_SECTION CycleCounterP_init(const uint64_t cpuFreqHz)
{
    gCounterFreqHz = cpuFreqHz/PmuP_SETUP_COUNTER_DIVIDER_VAL;
    CycleCounterP_reset();
}

Assume this code is present on all 4 cores. Without OptiShare, the addresses of the above variable and function on each core are:

Symbol              Core 0      Core 1      Core 2      Core 3
gCounterFreqHz      0x7005cc78  0x7008a1c0  0x700ca1c0  0x7010a1c0
CycleCounterP_init  0x70054f84  0x700866f8  0x700c66f8  0x701066f8

The idea is to keep the above function at just one location, for example 0x7007014c.

However, there are some technical challenges with the above technique:

  1. If CycleCounterP_init is placed just once and all 4 cores call the same function, how does the function know which gCounterFreqHz variable to access? Should it access gCounterFreqHz at 0x7005cc78 or at 0x700ca1c0?
  2. When CycleCounterP_init calls CycleCounterP_reset, which core's CycleCounterP_reset function should be called?

Implementation

In ti-arm-clang, as in GCC, the basic units of layout are called "sections". A section is a byte array that cannot be split. Because sections are atomic units, each function needs its own section before common functions can be identified as shared functions (in GCC, -ffunction-sections and -fdata-sections are the flags that do this). However, ti-arm-clang by default creates a section for each function and data object, so no extra step is required.

OptiShare has 2 parts: compile time and run time.

Compile time

In this implementation, special flags are passed to the linker to make it generate an XML file. This XML file has

  1. the complete link information.
  2. a list of all functions and their hashes. A unique hash is associated with each function and corresponds to the content of that function. If 2 functions have the same hash, those 2 functions are the same.
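The idea of a content hash can be illustrated with a small sketch. The hash that the linker actually uses is not described here, so the code below substitutes 32-bit FNV-1a purely as a stand-in, and the function names content_hash and same_function are hypothetical. Byte arrays stand in for the machine code of a function.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a byte range. This is only a stand-in for the
 * linker's real function hash, which is not documented here. */
uint32_t content_hash(const uint8_t *bytes, size_t len)
{
    uint32_t h = 2166136261u;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= bytes[i];
        h *= 16777619u;            /* FNV prime */
    }
    return h;
}

/* Two function bodies are treated as the same function when their
 * content hashes match (hypothetical helper). */
int same_function(const uint8_t *a, size_t a_len,
                  const uint8_t *b, size_t b_len)
{
    return content_hash(a, a_len) == content_hash(b, b_len);
}
```

With this kind of hash, two cores that link the same library function produce the same value for it, while a function whose body differs in even one byte is very likely to produce a different value.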

Optishare Flow

In the above diagram, the green blocks are the usable objects. The red blocks should be discarded. The bottom of the diagram shows the output binaries.

To apply OptiShare to an existing project, it is required to

  1. generate the XML files by passing the special flag --gen_xml_func_hash to the linker.
  2. pass those XML files to the OptiShare script.
  3. link the OptiShare script output to generate SSO.out.
  4. re-link each project's output ELF file with SSO.out.

This implementation of OptiShare uses the Region Address Translation (RAT) hardware.

RAT Hardware being used for Optishare

In this specific scenario, the RAT hardware performs the following translation:

(Output Address) = (Input Address) + Offset
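To make the formula concrete, the following is a minimal software model of a single RAT window. It is a sketch only: the RatWindow type and the rat_translate function are hypothetical names used for illustration, not SDK APIs, and the model ignores the size and alignment rules of the real hardware.

```c
#include <stdint.h>

/* One translation window: [local_base, local_base + size) is mapped
 * onto [system_base, system_base + size). Hypothetical model type. */
typedef struct {
    uint32_t local_base;   /* input (virtual) address of the window */
    uint32_t system_base;  /* output (physical) address of the window */
    uint32_t size;         /* window size in bytes */
} RatWindow;

/* Returns addr + offset when addr is inside the window, otherwise
 * returns addr unchanged. */
uint32_t rat_translate(const RatWindow *w, uint32_t addr)
{
    if (addr >= w->local_base && (addr - w->local_base) < w->size) {
        uint32_t offset = w->system_base - w->local_base;
        return addr + offset;
    }
    return addr;
}
```

For example, with a window whose local base is 0x100, size is 0x8000 and system base is 0x70280000, the input address 0x10C translates to 0x7028000C. On a core whose window targets 0x70288000 instead, the same input translates to 0x7028800C.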

In the above flow, the shared text/data ends up in sso.out, one of the output binaries generated by OptiShare. Here, text means the functions.

When the OptiShare script runs, it does the following:

  1. Read all XML files.
  2. Find all functions with a common hash and mark them as potentially_shared.
  3. For each potentially_shared function, if all functions in its call graph are also potentially_shared, mark that function as shared.

To see the above algorithm in action, take the CycleCounterP_init example from earlier. CycleCounterP_init calls CycleCounterP_reset, so CycleCounterP_reset is in the call graph of CycleCounterP_init. By the above algorithm, CycleCounterP_init is marked as shared only if CycleCounterP_reset is also shared across all cores. However, if CycleCounterP_reset itself calls another function that is not shared, then neither CycleCounterP_reset nor CycleCounterP_init is shared. In other words, the entire call graph of a function must be shared among all the cores for that function to be shared.
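The marking rule can be sketched as a fixed-point iteration over a small function table. This is an illustration of the rule as described, not the actual script: the FuncNode type, the MAX_CALLEES limit and the mark_shared function are hypothetical, and how the real tool walks the call graph is not specified here.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* Hypothetical record for one function found in the link XML files. */
typedef struct {
    bool   potentially_shared;     /* same hash found on every core */
    size_t num_callees;
    size_t callees[MAX_CALLEES];   /* indices into the function table */
    bool   shared;                 /* result of mark_shared() */
} FuncNode;

void mark_shared(FuncNode *funcs, size_t n)
{
    /* Start optimistic: every candidate is assumed to be shared. */
    for (size_t i = 0; i < n; i++) {
        funcs[i].shared = funcs[i].potentially_shared;
    }

    /* Drop any function that calls a non-shared function, and repeat
     * until nothing changes, so the rule applies to the whole call graph. */
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            if (!funcs[i].shared) {
                continue;
            }
            for (size_t c = 0; c < funcs[i].num_callees; c++) {
                if (!funcs[funcs[i].callees[c]].shared) {
                    funcs[i].shared = false;
                    changed = true;
                    break;
                }
            }
        }
    }
}
```

If entry 0 models CycleCounterP_init calling entry 1 (CycleCounterP_reset), and entry 1 calls an entry that is not potentially_shared, both entries 0 and 1 end up with shared set to false, which matches the behaviour described above.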

To enable OptiShare in an SDK example:

  1. Compile with the new flags: all cores must be built with -Wl,--gen_xml_func_hash and -Wl,--xml_link_info.
  2. Run the OptiShare script: once there is a linkxml file for every core, run the OptiShare script on these linkxml files.
  3. Relink the example: relink all the cores with the SSO file, using the new linker flag -Wl,--import_sso.

To enable optishare in CCS

In .projectspec, the -Wl,--gen_xml_func_hash and -Wl,--xml_link_info flags should be present so that CCS is able to build without any issues.

Runtime

At run time, OptiShare needs special hardware features.

The technical challenges highlighted earlier are solved using a virtual memory region: all the functions in sso.out access .data and .bss through a virtual memory region. Each core has its own RAT hardware, which maps this virtual memory region to a physical memory region that contains core-specific data.

Optishare Runtime (Note that 0x100 is only taken as an example)

So, at run time, each core configures its own RAT to map that virtual memory region to a physical memory address in SRAM.

However, this technique imposes one more constraint on the layout. Suppose the shared code assumes the following layout of the .data section:

Offset Symbol Name
0 var1
10 var2
12 var3
22 var4

Because the RAT hardware simply translates the address, var1 to var4 must be at the same offsets on every core.
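The constraint can be expressed as a simple check over per-core offset tables. The sketch below is illustrative only: offsets_match and NUM_VARS are hypothetical names, and the offsets are the values from the table above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_VARS 4   /* var1 .. var4 from the table above */

/* True when every core places each variable at the same offset
 * inside its own copy of the .data layout. Row 0 is the reference. */
bool offsets_match(const uint32_t offsets[][NUM_VARS], size_t num_cores)
{
    for (size_t core = 1; core < num_cores; core++) {
        for (size_t v = 0; v < NUM_VARS; v++) {
            if (offsets[core][v] != offsets[0][v]) {
                return false;
            }
        }
    }
    return true;
}
```

If one core swapped var2 and var3, shared code reading offset 10 would pick up a different variable on that core, so the check would fail for that layout.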

How to Implement in a project

The IPC Notify Echo Example With OptiShare example is used here.

Build System Changes

As described previously, add new flags to generate the XML file. The following image shows the additional linker flags to be added to each core's build.

Optishare link time flags

These flags generate a .lnkxml file, which contains all the link information in XML format.

In addition, add a new rule that links the application again, this time with the --import_sso flag.

Relinking of binary

Memory Map Changes

Each core's linker file needs to be changed.

For AM263Px, the last 512KB of the memory is used as the shared memory of OptiShare.

C0_SSO_LCL : ORIGIN = 0x70280000 , LENGTH = 0x8000
C1_SSO_LCL : ORIGIN = 0x70288000 , LENGTH = 0x8000
C2_SSO_LCL : ORIGIN = 0x70290000 , LENGTH = 0x8000
C3_SSO_LCL : ORIGIN = 0x70298000 , LENGTH = 0x8000
SSO_SHM : ORIGIN = 0x702A0000 , LENGTH = 0x8000
USER_SHM : ORIGIN = 0x702A8000 , LENGTH = 0x58000

This looks as follows:

Memory Map of Shared Region

Here

  1. C0_SSO_LCL is the physical memory for Core 0 of the shared code's virtual memory.
  2. SSO_SHM is where the actual shared code is placed.
  3. USER_SHM is the general purpose shared memory for the user application.
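The memory map above can be sanity-checked with a few lines of C. The sketch below copies the six regions from the linker MEMORY entries and checks that they are back to back and add up to 512KB. The MemRegion type and the helper names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t origin;
    uint32_t length;
} MemRegion;

/* Regions copied from the linker MEMORY entries above. */
const MemRegion optishare_regions[] = {
    { "C0_SSO_LCL", 0x70280000u, 0x8000u  },
    { "C1_SSO_LCL", 0x70288000u, 0x8000u  },
    { "C2_SSO_LCL", 0x70290000u, 0x8000u  },
    { "C3_SSO_LCL", 0x70298000u, 0x8000u  },
    { "SSO_SHM",    0x702A0000u, 0x8000u  },
    { "USER_SHM",   0x702A8000u, 0x58000u },
};
#define NUM_REGIONS (sizeof optishare_regions / sizeof optishare_regions[0])

/* True when each region starts exactly where the previous one ends. */
bool regions_contiguous(const MemRegion *r, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (r[i].origin != r[i - 1].origin + r[i - 1].length) {
            return false;
        }
    }
    return true;
}

/* Sum of all region lengths in bytes. */
uint32_t regions_total(const MemRegion *r, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += r[i].length;
    }
    return total;
}
```

For the values above, the regions are contiguous from 0x70280000 and the total is 0x80000 bytes, which is the 512KB mentioned earlier.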

In the SECTIONS block of core 0's linker command file, add the following as is:

.shared.text : {
} > C0_SSO_LCL , palign(4096)
.shared.rodata : {
} > C0_SSO_LCL , palign(4096)
.shared.data : {
} > C0_SSO_LCL , palign(4096)
.shared.bss : {
} > C0_SSO_LCL , palign(4096)

For core 1, the memory region would be C1_SSO_LCL, and so on.

MPU settings

For this memory region, make sure that each core marks this 512KB of L2 memory as shared in its MPU configuration.

The following image shows this:

MPU Of Shared Region

Selecting Non-Cached makes the region shared.

Cx_SSO_LCL needs to be non-cached because the data in this region is mostly global variables. When shared code updates such a global variable, it issues a virtual address, and the RAT in effect updates the corresponding physical memory address. If caches were enabled for this region, this would cause coherency issues.

Shared Memory Specification File (mem_spec.json)

In this example, the OptiShare script is run with the following command:

node <compiler path>/opti-share/opti-share.js \
    -o sso.cmd.tmp \
    ../r5fss0-0_freertos/ti-arm-clang/ipc_notify_echo_optishare.release.lnkxml \
    ../r5fss0-1_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.lnkxml \
    ../r5fss1-0_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.lnkxml \
    ../r5fss1-1_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.lnkxml \
    -s sso.info.tmp \
    --mem_spec optishare_memmap.json

Notice the --mem_spec flag. This flag is used to pass the memory specification to the OptiShare script. As of now, the OptiShare script cannot deduce the shared memory region, so the shared memory specification must be passed explicitly.

In this example, it is defined as:

{
    "mem_spec":
    {
        "device_mem_regions" : [
            {
                "name": "OCRAM",
                "origin": "0x70000000",
                "length": "0x300000",
                "kind":"system"
            },
            {
                "name": "FLASH",
                "origin": "0x60000000",
                "length": "0x8000000",
                "kind":"system"
            },
            {
                "name": "TCMA",
                "origin": "0x0",
                "length": "0x8000",
                "kind":"local"
            },
            {
                "name": "TCMB",
                "origin": "0x80000",
                "length": "0x8000",
                "kind":"local"
            },
            {
                "name": "CUSTOM",
                "origin": "0x0",
                "length": "0xffffffff",
                "kind":"system"
            }
        ],
        "shared_mem_regions" : [
            {
                "name" : "SSO_SHM_RX",
                "origin" : "0x702A0000",
                "length" : "0x3000"
            },
            {
                "name" : "SSO_SHM_RO",
                "origin" : "0x702A3000",
                "length" : "0x1000"
            },
            {
                "name" : "SSO_SHM_RW",
                "origin" : "0x702A4000",
                "length" : "0x4000"
            }
        ],
        "shared_os_placement_instrs" : [
            {
                "name" : ".shared.text",
                "placement" : "> SSO_SHM_RX, palign(4096)"
            },
            {
                "name" : ".shared.rodata",
                "placement" : "> SSO_SHM_RO, palign(4096)"
            },
            {
                "name" : ".shared.bss",
                "placement" : "> SSO_SHM_RW, palign(4096)"
            },
            {
                "name" : ".shared.data",
                "placement" : "> SSO_SHM_RW, palign(4096)"
            }
        ]
    }
}

device_mem_regions gives general information about the different memories available in a device. For any device, the device_mem_regions structure should be the same as given here.

shared_mem_regions contains the shared memory specification. Here it splits SSO_SHM into different regions. This split into RX, RO and RW regions is required, and .shared.bss and .shared.data should be placed only in the RW region.

shared_os_placement_instrs specifies the section placement. It should be kept as is.

The only part that needs to be changed is shared_mem_regions.
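When shared_mem_regions is edited, it can help to check the new values before running the script. The sketch below checks that a region lies inside SSO_SHM and is a multiple of 4096 bytes. The alignment check is an assumption that follows from the palign(4096) placements rather than a stated requirement, and region_fits is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_ALIGN 4096u

/* True when [origin, origin + length) is 4096-byte aligned and lies
 * entirely inside [parent_origin, parent_origin + parent_length). */
bool region_fits(uint32_t origin, uint32_t length,
                 uint32_t parent_origin, uint32_t parent_length)
{
    if ((origin % PAGE_ALIGN) != 0u || (length % PAGE_ALIGN) != 0u) {
        return false;
    }
    if (origin < parent_origin) {
        return false;
    }
    /* 64-bit sums avoid wrap-around near the top of the address space. */
    return (uint64_t)origin + length <= (uint64_t)parent_origin + parent_length;
}
```

With the values of this example, SSO_SHM_RX (0x702A0000, 0x3000), SSO_SHM_RO (0x702A3000, 0x1000) and SSO_SHM_RW (0x702A4000, 0x4000) all fit inside SSO_SHM (0x702A0000, 0x8000), and together they fill it exactly.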

Code Changes

The C code needs to be changed as follows:

Run time code changes block diagram

As shown above, before OptiShare is enabled (that is, before the RAT is programmed), the application must make sure that the functions that run, and the global variables they access, are not shared. This can be ensured by adding the do_not_share attribute to a function:

void __attribute__((do_not_share)) AddrTranslateP_init (AddrTranslateP_Params *params);

When the code is relinked with the --import_sso flag, the linker generates symbols that can be used to program the RAT. The following code shows how to do that:

/*
 * The following symbols are generated by the linker.
 * The address of each symbol (not its value) carries the information.
 */
extern int __TI_ATRegion0_src_addr;
extern int __TI_ATRegion0_trg_addr;
extern int __TI_ATRegion0_region_sz;
extern int __TI_ATRegion1_src_addr;
extern int __TI_ATRegion1_trg_addr;
extern int __TI_ATRegion1_region_sz;
extern int __TI_ATRegion2_src_addr;
extern int __TI_ATRegion2_trg_addr;
extern int __TI_ATRegion2_region_sz;

/* main() runs before the RAT is programmed, so it must not be shared. */
__attribute__((do_not_share)) int main(void)
{
    AddrTranslateP_Params params;
    AddrTranslateP_RegionConfig region[3];

    AddrTranslateP_Params_init(&params);

    /* A region is configured only when the linker reports a non-zero size. */
    if((uint32_t)(&__TI_ATRegion0_region_sz) > 0)
    {
        params.numRegions++;
        region[0].size = 0;
        uint32_t actualSize = (uint32_t)(&__TI_ATRegion0_region_sz);
        region[0].localAddr = (uint32_t)&__TI_ATRegion0_src_addr;
        region[0].systemAddr = (uint32_t)&__TI_ATRegion0_trg_addr;
        /* size becomes the smallest n such that (1 << n) >= actualSize */
        for(uint32_t sz = 1; sz < actualSize; region[0].size++)
        {
            sz = sz << 1;
        }
    }
    if((uint32_t)(&__TI_ATRegion1_region_sz) > 0)
    {
        params.numRegions++;
        region[1].size = 0;
        region[1].localAddr = (uint32_t)&__TI_ATRegion1_src_addr;
        region[1].systemAddr = (uint32_t)&__TI_ATRegion1_trg_addr;
        for(uint32_t sz = 1; sz < (uint32_t)(&__TI_ATRegion1_region_sz); sz <<= 1, region[1].size++);
    }
    if((uint32_t)(&__TI_ATRegion2_region_sz) > 0)
    {
        params.numRegions++;
        region[2].size = 0;
        region[2].localAddr = (uint32_t)&__TI_ATRegion2_src_addr;
        region[2].systemAddr = (uint32_t)&__TI_ATRegion2_trg_addr;
        for(uint32_t sz = 1; sz < (uint32_t)(&__TI_ATRegion2_region_sz); sz <<= 1, region[2].size++);
    }

    /* Program this core's RAT, then start the application. */
    params.ratBaseAddr = CSL_RL2_REGS_R5SS0_CORE0_U_BASE + CSL_RL2_OF_R5FSS0_CORE0_RAT_CTL(0) - 0x20;
    params.regionConfig = region;
    AddrTranslateP_init(&params);

    return AppStart();
}

The above code programs the RAT before starting the application.
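The three size loops in the listing compute the same quantity: the smallest exponent n such that 2^n is at least the region size. As a sketch, that calculation could be factored into one helper. The name rat_size_exponent is hypothetical and is not part of the SDK.

```c
#include <stdint.h>

/* Smallest n such that (1 << n) >= size_bytes, mirroring the loops
 * in the listing above. A 64-bit counter keeps the loop finite for
 * sizes above 0x80000000. */
uint32_t rat_size_exponent(uint32_t size_bytes)
{
    uint32_t n = 0;
    for (uint64_t sz = 1; sz < size_bytes; sz <<= 1) {
        n++;
    }
    return n;
}
```

For example, a region of 0x8000 bytes gives 15, and a region of 4097 bytes is rounded up to the next power of two and gives 13.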

Performance Of OptiShare

The compiler comes with another program that shows the memory savings achieved.

node <compiler-path>/opti-share/utils/opti-save.js ../r5fss0-0_freertos/ti-arm-clang/ipc_notify_echo_optishare.release.lnkxml ../r5fss0-0_freertos/ti-arm-clang/ipc_notify_echo_optishare.release.optishare.lnkxml > ../r5fss0-0_freertos/ti-arm-clang/ipc_notify_echo_optishare.release.ossr

The above command compares the link XML of the application as built before OptiShare with the one built after OptiShare.

The output is a text file, in this case with the extension *.ossr (OptiShare Savings Report).

Its contents look like the following:

Section              ipc_notify_echo_optishare.release.lnkxml  ipc_notify_echo_optishare.release.optishare.lnkxml  Saving
.text.hwi            2472                                      2360                                                112
.text.cache          1072                                      240                                                 832
.text.mpu            520                                       400                                                 120
.text.boot           392                                       368                                                 24
.text:abort          8                                         0                                                   8
.text                30256                                     28496                                               1760
.rodata              5856                                      5152                                                704
.data                1000                                      688                                                 312
.bss.log_shared_mem  16384                                     0                                                   16384
.shared.data         0                                         4096                                                -4096
.shared.bss          0                                         12288                                               -12288

The above is for core R5F0-0. Run the above script for each core to get the savings for that core.
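The Saving column is the size before minus the size after, and the rows can be added up to get a net figure for the core. The sketch below does this for the report above. The OssrRow type and net_saving function are hypothetical, and the total only covers this core's own binary, not the contents of sso.out.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *section;
    int32_t before;   /* size in the original .lnkxml */
    int32_t after;    /* size in the .optishare.lnkxml */
} OssrRow;

/* Rows copied from the R5F0-0 report above. */
const OssrRow r5f0_0_report[] = {
    { ".text.hwi",            2472,  2360 },
    { ".text.cache",          1072,   240 },
    { ".text.mpu",             520,   400 },
    { ".text.boot",            392,   368 },
    { ".text:abort",             8,     0 },
    { ".text",               30256, 28496 },
    { ".rodata",              5856,  5152 },
    { ".data",                1000,   688 },
    { ".bss.log_shared_mem", 16384,     0 },
    { ".shared.data",            0,  4096 },
    { ".shared.bss",             0, 12288 },
};
#define NUM_ROWS (sizeof r5f0_0_report / sizeof r5f0_0_report[0])

/* Sum of (before - after) over all rows, in bytes. */
int32_t net_saving(const OssrRow *rows, size_t n)
{
    int32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += rows[i].before - rows[i].after;
    }
    return total;
}
```

For this report the rows add up to a net saving of 3872 bytes in the R5F0-0 binary, once the 4096 and 12288 bytes of .shared.data and .shared.bss are subtracted.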

Building MulticoreELF Binaries with Optishare

MulticoreELF (Understanding Multicore ELF image format) is the image format that the SDK uses to boot from flash. tools/boot/multicore-elf/genimage.py is the Python script that takes the .out files of the cores as input and produces .mcelf and .mcelf_xip as output. The command looks like the following when OptiShare is not enabled:

python3 <sdkPath>/tools/boot/multicore-elf/genimage.py \
    --core-img=0:../r5fss0-0_freertos/ti-arm-clang/ipc_notify_echo_optishare.release.out \
    --core-img=1:../r5fss0-1_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.out \
    --core-img=2:../r5fss1-0_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.out \
    --core-img=3:../r5fss1-1_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.out \
    --output=ipc_notify_echo_optishare_system.release.mcelf --merge-segments=true --tolerance-limit=0 \
    --ignore-context=false --xip=0x60000000:0x68000000 --xlat="" --max_segment_size=8192

To enable OptiShare, the --sso flag is passed:

python3 <sdkPath>/tools/boot/multicore-elf/genimage.py \
    --core-img=0:../r5fss0-0_freertos/ti-arm-clang/ipc_notify_echo_optishare.release.optishare.out \
    --core-img=1:../r5fss0-1_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.optishare.out \
    --core-img=2:../r5fss1-0_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.optishare.out \
    --core-img=3:../r5fss1-1_nortos/ti-arm-clang/ipc_notify_echo_optishare.release.optishare.out \
    --output=ipc_notify_echo_optishare_system.release.optishare.mcelf \
    --merge-segments=true --tolerance-limit=0 --ignore-context=false --xip=0x60000000:0x68000000 \
    --xlat="" \
    --max_segment_size=8192 \
    --sso=sso.out

sso.out contains the shared code and data.

Final Remark

Implementing OptiShare is somewhat complex, as it requires some understanding of linkers, the ARM Memory Protection Unit (MPU), the ARM assembly addressing model, SoC-level address translation using RAT, caches, and so on. However, if implemented correctly, it can lead to significant memory savings. In a use case where 2 OS instances run on different cores, almost all of the OS code becomes shared, leaving more space for the user application.