11.1. Benefits of Using LTO - Enabling Inter-Module Optimizations

11.1.1. A Simple Example

Consider a simple example application that demonstrates just one of the potential benefits of using LTO to enable inter-module optimization.

Suppose we have a set of source files in which many of the same string constants are referenced repeatedly, both within a file and across multiple source files.
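The contents of the example's source files are not shown here. As a purely hypothetical sketch of the pattern, two of the files might each return the same string literal. The function names and the string below are assumptions made for illustration, and the two fragments are shown in one listing for brevity although they are meant to live in separate source files.

```c
/* --- s10.c (hypothetical contents, not the actual example source) --- */
const char *s10_status(void)
{
    return "status: operation completed successfully";
}

/* --- ic_s10.c (hypothetical contents, not the actual example source) --- */
const char *ic_s10_status(void)
{
    /* Same literal as in s10.c, but defined in a different module. */
    return "status: operation completed successfully";
}
```

When each file is compiled separately, each object file typically carries its own copy of the literal in its read-only data, which is the kind of duplication that an inter-module optimization can remove.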

If we compile and link without LTO turned on:

%> tiarmclang -mcpu=cortex-m4 -Oz constant_merge_test.c ic_s10.c ic_s20.c ic_s30.c ic_s40.c s10.c s20.c s30.c s40.c -o no_lto.out -Wl,-llnk.cmd,-mno_lto.map

The linker-generated map file, no_lto.map, shows that the .rodata section, where all of the string constants are defined, occupies 0x4ad2 (19154) bytes:

...
SEGMENT ALLOCATION MAP

run origin  load origin   length   init length attrs members
----------  ----------- ---------- ----------- ----- -------
00000020    00000020    00007a4c   00007a4c    r-x
  00000020    00000020    00004ad2   00004ad2    r-- .rodata
  ...
...

But if we then compile with LTO enabled:

%> tiarmclang -mcpu=cortex-m4 -flto -Oz constant_merge_test.c ic_s10.c ic_s20.c ic_s30.c ic_s40.c s10.c s20.c s30.c s40.c -o with_lto.out -Wl,-llnk.cmd,-mwith_lto.map

The map file, with_lto.map, then shows that the .rodata output section is significantly smaller in the LTO-enabled build, at 0x1674 (5748) bytes:

...
SEGMENT ALLOCATION MAP

run origin  load origin   length   init length attrs members
----------  ----------- ---------- ----------- ----- -------
00000020    00000020    00005b84   00005b84    r-x
  ...
  00004530    00004530    00001674   00001674    r-- .rodata
...

The use of LTO in this example enables the compiler to perform an inter-module constant merging optimization that saves 0x4ad2 - 0x1674 = 0x345e (13406) bytes in the .rodata section. Note that in this example the savings in the .rodata section are partly offset by increased code size in other sections such as .text. The net savings is 0x7a4c - 0x5b84 = 0x1ec8 (7880) bytes.
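The arithmetic above can be checked with a short C sketch. The sizes are copied from the two map file excerpts; the struct and function names are illustrative only and are not part of the example application.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes in bytes, taken from the SEGMENT ALLOCATION MAP excerpts. */
typedef struct {
    uint32_t segment_len;  /* length of the r-x segment       */
    uint32_t rodata_len;   /* length of .rodata within it     */
} map_sizes;

static const map_sizes NO_LTO   = { 0x7a4cu, 0x4ad2u };
static const map_sizes WITH_LTO = { 0x5b84u, 0x1674u };

/* Bytes saved in .rodata by the second build. */
uint32_t rodata_savings(map_sizes before, map_sizes after)
{
    assert(before.rodata_len >= after.rodata_len);
    return before.rodata_len - after.rodata_len;
}

/* Net bytes saved across the whole segment. */
uint32_t net_savings(map_sizes before, map_sizes after)
{
    assert(before.segment_len >= after.segment_len);
    return before.segment_len - after.segment_len;
}

/* Growth of everything in the segment other than .rodata. */
uint32_t other_growth(map_sizes before, map_sizes after)
{
    uint32_t other_before = before.segment_len - before.rodata_len;
    uint32_t other_after  = after.segment_len - after.rodata_len;
    assert(other_after >= other_before);
    return other_after - other_before;
}
```

With these inputs, rodata_savings(NO_LTO, WITH_LTO) returns 0x345e (13406), net_savings(NO_LTO, WITH_LTO) returns 0x1ec8 (7880), and other_growth(NO_LTO, WITH_LTO) returns 0x1596 (5526), which is the amount by which the rest of the segment grew in the LTO build.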

11.1.2. Code Size Reduction Due to Use of LTO

Simply enabling LTO in the build of an application can yield significant code size savings. Across a collection of Cortex-M0+, Cortex-M4, and Cortex-R5 example applications, builds with LTO enabled produced smaller compiler-generated code than builds of the same applications without LTO enabled.

Table 1: Code Size Reduction Due to Use of LTO

TI Arm Processor   % Code Size Reduction   Example Applications (count)
----------------   ---------------------   ----------------------------
Cortex-M0+         23-25%                  M0+ Driver Libraries (191)
Cortex-M4          6-11%                   M4 SDK Examples (362)
Cortex-R5          4-8%                    EEMBC AutoBench (15)

  • % Code Size Reduction = (1-(LTO code size / non-LTO code size))*100

These code size measurements were taken over 568 Cortex-M0+, Cortex-M4, and Cortex-R5 example applications. All of these applications were compiled with the -Oz compiler option to prioritize code size reduction optimizations. The -flto compiler option was used to enable LTO.
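The "% Code Size Reduction" formula can be written as a small C helper. This is a minimal sketch: the function name is an assumption, and the sizes used in the usage note are made-up values rather than measurements from the table.

```c
#include <assert.h>

/* % Code Size Reduction = (1 - (LTO code size / non-LTO code size)) * 100 */
double code_size_reduction_pct(double lto_size, double non_lto_size)
{
    assert(non_lto_size > 0.0);  /* avoid dividing by zero */
    return (1.0 - (lto_size / non_lto_size)) * 100.0;
}
```

For example, a hypothetical application that occupies 1000 bytes without LTO and 900 bytes with LTO would show a reduction of about 10%.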

11.1.3. Performance Improvement Due to Use of LTO

Enabling LTO during an application build can also provide a significant speedup. Example applications for Cortex-M0+, Cortex-M4, and Cortex-R5 ran faster when built with LTO enabled than when the same applications were built without LTO enabled.

Table 2: Performance Improvement Due to Use of LTO

TI Arm Processor   Speedup Factor   % Reduction in Time/Cycles
----------------   --------------   --------------------------
Cortex-M0+         1.20             3.8%
Cortex-M4          1.12             4.4%
Cortex-R5          1.17             9.3%

  • Speedup Factor = (non-LTO cycles/LTO cycles)

  • % Reduction in Time/Cycles = ((non-LTO cycles - LTO cycles)/non-LTO cycles)*100

The performance results shown here were derived from measurements over 15 EEMBC AutoBench applications running on Cortex-M0+, Cortex-M4, and Cortex-R5 hardware. For the Cortex-M0+ measurements, an application's execution time was calculated using a SysTick interrupt service routine. For the Cortex-M4 and Cortex-R5 measurements, a simple cycle count for each application's execution time was collected. All example applications were compiled with the -O3 compiler option to prioritize performance improvement optimizations. The -flto compiler option was used to enable LTO.
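The two metrics defined for Table 2 can be sketched as C helpers operating on a single pair of measurements. The function names are assumptions, and the cycle counts in the usage note are made-up values rather than benchmark results.

```c
#include <assert.h>

/* Speedup Factor = non-LTO cycles / LTO cycles */
double speedup_factor(double non_lto_cycles, double lto_cycles)
{
    assert(lto_cycles > 0.0);  /* avoid dividing by zero */
    return non_lto_cycles / lto_cycles;
}

/* % Reduction = ((non-LTO cycles - LTO cycles) / non-LTO cycles) * 100 */
double cycle_reduction_pct(double non_lto_cycles, double lto_cycles)
{
    assert(non_lto_cycles > 0.0);  /* avoid dividing by zero */
    return ((non_lto_cycles - lto_cycles) / non_lto_cycles) * 100.0;
}
```

For example, a hypothetical application that takes 1000 cycles without LTO and 800 cycles with LTO has a speedup factor of 1.25 and a reduction of about 20%.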

Note

Increased Function Inlining

Using LTO may result in increased function inlining, which may generally improve both performance and code size but may also result in larger stack frames. This may require the user either to increase the size of the stack or to prevent the inlining of certain functions that are known to require large stack frames.

To debug this, it is recommended that users use the CCS Stack Usage View to see the static stack usage of each function in the application. See Stack Usage View in CCS for more information. Using the Stack Usage View requires that source code be built with debug enabled. This feature relies on the --call_graph capability provided by the tiarmofd Object File Display utility.
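One way to prevent a function with a large stack frame from being inlined is the Clang/GCC noinline function attribute, which a Clang-based compiler is assumed here to accept. The following is a minimal sketch: the function names and the buffer size are hypothetical and are not taken from any example application.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_SIZE 512  /* hypothetical size of the large local buffer */

/* noinline asks the compiler to keep this function out of line, so the
 * 512-byte buffer is not folded into a caller's stack frame. */
__attribute__((noinline))
size_t copy_via_scratch(const char *src, char *dst, size_t dst_len)
{
    char scratch[SCRATCH_SIZE];
    size_t n;

    assert(src != NULL && dst != NULL);
    if (dst_len == 0) {
        return 0;
    }

    /* Clamp the length to fit both the scratch and destination buffers. */
    n = strlen(src);
    if (n >= SCRATCH_SIZE) {
        n = SCRATCH_SIZE - 1;
    }
    if (n >= dst_len) {
        n = dst_len - 1;
    }

    memcpy(scratch, src, n);
    scratch[n] = '\0';
    memcpy(dst, scratch, n + 1);
    return n;  /* number of characters copied, excluding the terminator */
}

/* Caller with a small frame of its own. */
size_t log_message(const char *msg, char *out, size_t out_len)
{
    return copy_via_scratch(msg, out, out_len);
}
```

Whether the attribute is the right choice depends on the application: it trades the potential speed and size benefit of inlining that call for a smaller frame in each caller.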