11.1. Benefits of Using LTO - Enabling Inter-Module Optimizations
11.1.1. A Simple Example
Consider a simple example application that demonstrates just one of the potential benefits of using LTO to enable inter-module optimization …
Suppose we have a series of source files in which many of the same string constants are referenced repeatedly, both within a file and across multiple files.
If we compile and link without LTO turned on:
%> tiarmclang -mcpu=cortex-m4 -Oz constant_merge_test.c ic_s10.c ic_s20.c ic_s30.c ic_s40.c s10.c s20.c s30.c s40.c -o no_lto.out -Wl,-llnk.cmd,-mno_lto.map
The linker-generated map file, no_lto.map, reveals that the .rodata section, where all of the string constants are defined, is reasonably large:
...
SEGMENT ALLOCATION MAP
run origin load origin length init length attrs members
---------- ----------- ---------- ----------- ----- -------
00000020 00000020 00007a4c 00007a4c r-x
00000020 00000020 00004ad2 00004ad2 r-- .rodata
...
...
But if we then compile with LTO enabled:
%> tiarmclang -mcpu=cortex-m4 -flto -Oz constant_merge_test.c ic_s10.c ic_s20.c ic_s30.c ic_s40.c s10.c s20.c s30.c s40.c -o with_lto.out -Wl,-llnk.cmd,-mwith_lto.map
Then the map file, with_lto.map, shows that the .rodata output section is significantly smaller in the LTO-enabled build:
...
SEGMENT ALLOCATION MAP
run origin load origin length init length attrs members
---------- ----------- ---------- ----------- ----- -------
00000020 00000020 00005b84 00005b84 r-x
...
00004530 00004530 00001674 00001674 r-- .rodata
...
The use of LTO in this example enables the compiler to perform an inter-module constant merging optimization that saves 0x4ad2 - 0x1674 = 0x345e (13406) bytes in the .rodata section. Note that in this example, the reduction in the size of the .rodata section is offset somewhat by increased code size in other sections like .text. The net savings is 0x7a4c - 0x5b84 = 0x1ec8 (7880) bytes.
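The arithmetic above can be double-checked with a short script. This is only a minimal sketch and not part of the toolchain: the helper name `savings` is hypothetical, and the hex lengths are copied by hand from the two map file excerpts shown earlier.

```python
def savings(no_lto_length: int, with_lto_length: int) -> int:
    """Return the number of bytes saved by the LTO-enabled build."""
    return no_lto_length - with_lto_length


# Lengths copied by hand from no_lto.map and with_lto.map.
rodata_saved = savings(0x4AD2, 0x1674)  # .rodata output section
net_saved = savings(0x7A4C, 0x5B84)     # segment that contains .rodata

print(f".rodata savings: {rodata_saved:#x} ({rodata_saved}) bytes")
print(f"net savings:     {net_saved:#x} ({net_saved}) bytes")
```

Comparing the segment lengths as well as the .rodata lengths is what exposes the offsetting growth in other sections.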
11.1.2. Code Size Reduction Due to Use of LTO
Simply enabling LTO in an application build can yield significant code size savings. Across a collection of Cortex-M0+, Cortex-M4, and Cortex-R5 example applications, builds with LTO enabled produced smaller compiler-generated code than builds of the same applications without LTO.
Table 1: Code Size Reduction Due to Use of LTO
| TI Arm Processor | % Code Size Reduction | Example Applications (count) |
|---|---|---|
| Cortex-M0+ | 23-25% | M0+ Driver Libraries (191) |
| Cortex-M4 | 6-11% | M4 SDK Examples (362) |
| Cortex-R5 | 4-8% | EEMBC AutoBench (15) |
% Code Size Reduction = (1-(LTO code size / non-LTO code size))*100
These code size measurements were taken over 568 Cortex-M0+, Cortex-M4, and Cortex-R5 example applications. All of these applications were compiled with the -Oz compiler option to prioritize code size reduction optimizations. The -flto compiler option was used to enable LTO.
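The % Code Size Reduction formula given under Table 1 can be sketched as a small function. The function name and the byte counts below are hypothetical and chosen only for illustration; they are not measurements from the applications listed in the table.

```python
def code_size_reduction_pct(lto_size: int, non_lto_size: int) -> float:
    """(1 - (LTO code size / non-LTO code size)) * 100"""
    if non_lto_size <= 0:
        raise ValueError("non-LTO code size must be positive")
    return (1 - lto_size / non_lto_size) * 100


# Hypothetical code sizes in bytes, for illustration only.
reduction = code_size_reduction_pct(lto_size=8_750, non_lto_size=10_000)
print(f"{reduction:.1f}% smaller with LTO")
```

A positive result means the LTO build is smaller; a negative result would mean the LTO build grew.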
11.1.3. Performance Improvement Due to Use of LTO
Enabling LTO during an application build can also provide significant speedup. Example applications built for Cortex-M0+, Cortex-M4, and Cortex-R5 ran significantly faster with LTO enabled than the same applications built without LTO.
Table 2: Performance Improvement Due to Use of LTO
| TI Arm Processor | Speedup Factor | % Reduction in Time/Cycles |
|---|---|---|
| Cortex-M0+ | 1.20 | 3.8% |
| Cortex-M4 | 1.12 | 4.4% |
| Cortex-R5 | 1.17 | 9.3% |
Speedup Factor = (non-LTO cycles/LTO cycles)
% Reduction in Cycles = ((non-LTO cycles - LTO cycles)/non-LTO cycles)*100
The performance results shown here were derived from measurements over 15 EEMBC AutoBench applications running on Cortex-M0+, Cortex-M4, and Cortex-R5 hardware. For the Cortex-M0+ measurements, each application's execution time was calculated using a SysTick interrupt service routine. For the Cortex-M4 and Cortex-R5 measurements, a simple cycle count was collected for each application's execution time. All example applications were compiled with the -O3 compiler option to prioritize performance improvement optimizations. The -flto compiler option was used to enable LTO.
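The two formulas given under Table 2 can be sketched as follows. The function names and cycle counts are hypothetical and serve only to show how each metric is computed from a pair of measurements; they are not taken from the benchmark runs described above.

```python
def speedup_factor(non_lto_cycles: int, lto_cycles: int) -> float:
    """non-LTO cycles / LTO cycles"""
    if lto_cycles <= 0:
        raise ValueError("LTO cycle count must be positive")
    return non_lto_cycles / lto_cycles


def cycle_reduction_pct(non_lto_cycles: int, lto_cycles: int) -> float:
    """((non-LTO cycles - LTO cycles) / non-LTO cycles) * 100"""
    if non_lto_cycles <= 0:
        raise ValueError("non-LTO cycle count must be positive")
    return (non_lto_cycles - lto_cycles) / non_lto_cycles * 100


# Hypothetical cycle counts for one application, for illustration only.
non_lto, lto = 1_000_000, 800_000
print(f"speedup factor: {speedup_factor(non_lto, lto):.2f}")
print(f"cycle reduction: {cycle_reduction_pct(non_lto, lto):.1f}%")
```

The same functions apply to execution times instead of cycle counts, provided both measurements use the same unit.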
Note
Increased Function Inlining
Using LTO may result in increased function inlining, which generally may improve both performance and code size but may also result in larger stack frames. This may require the user to either increase the size of the stack or prevent inlining of certain functions that are known to require large stack frames.
To debug this, users are advised to open the CCS Stack Usage View, which shows the static stack usage of each function in the application. See Stack Usage View in CCS for more information. Using the Stack Usage View requires that source code be built with debug enabled. This feature relies on the --call_graph capability provided by the tiarmofd - Object File Display Utility.