Build a Leaner and Faster TI Arm® Application with Link-Time Optimization (LTO)

# Introduction

This article describes Link-Time Optimization (LTO), a feature of the tiarmclang compiler that was introduced in version 2.1.0.LTS.

# Call to Action

- Always compile with -flto, especially when creating a library
- Always link with -flto

## Configuration in CCS

To request LTO in a CCS project, you only need to set it in one place. CCS uses that setting both when compiling and linking.

![LTO CCS Configuration](images/lto_ccs_configuration.png)

# Code Size Reduction Due to Use of LTO

Simply enabling LTO in the build of your application can yield significant code size savings. Across a collection of Cortex-M0+, Cortex-M4, and Cortex-R5 example applications, builds with LTO enabled produced significantly smaller compiler-generated code than builds of the same applications without LTO, as summarized in Table 1.

| TI Arm® Processor | % Code Size Reduction | Embedded Applications (count) |
|---|:---:|:---:|
| Cortex-M0+ | 23-25% | M0+ Driver Libraries (191) |
| Cortex-M4 | 6-11% | Connectivity SDK (362) |
| Cortex-R5 | 4-8% | EEMBC AutoBench (15) |

**Table 1: Code Size Reduction Due to Use of LTO**

**% Code Size Reduction = (1 - (LTO code size / Non-LTO code size)) x 100**

The code size measurements vary with the nature of the example applications and the TI Arm® processor used. All of these applications were compiled with the **-Oz** compiler option to prioritize code size reduction optimizations. The **-flto** compiler option was used to enable LTO.

# Performance Improvement Due to Use of LTO

Enabling LTO during your application build can also provide significant speedup. With LTO enabled, example applications built for Cortex-M0+, Cortex-M4, and Cortex-R5 ran faster than the same applications built without LTO, as summarized in Table 2.

| TI Arm® Processor | Speedup Factor | % Reduction in Time/Cycles |
|---|:---:|:---:|
| Cortex-M0+ | 1.20 | 3.8% |
| Cortex-M4 | 1.12 | 4.4% |
| Cortex-R5 | 1.17 | 9.3% |

**Table 2: Performance Improvement Due to Use of LTO**

**Speedup Factor = (Non-LTO cycles / LTO cycles)**

**% Reduction in Cycles = ((Non-LTO cycles - LTO cycles) / Non-LTO cycles) x 100**

The performance results shown here were derived from cycle count measurements over 15 EEMBC AutoBench applications running on Cortex-M0+, Cortex-M4, and Cortex-R5 hardware. For the Cortex-M0+ measurements, an application's execution time was calculated using a SysTick interrupt service routine. For the Cortex-M4 and Cortex-R5 measurements, a simple cycle count for each application's execution time was collected. All example applications were compiled with the **-O3** compiler option to prioritize performance improvement optimizations. The **-flto** compiler option was used to enable LTO.

# How LTO Works

A key advantage of using LTO in your application build is that the linker enables the compiler to optimize across C/C++ compilation unit boundaries.

If LTO is not enabled during the build of an application, the compiler's visibility into the application's source code is limited to the C/C++ source file that is currently being compiled. Consequently, the compiler must make conservative assumptions about functions and variables that are referenced from that source file but defined elsewhere, and it must exercise restraint in the optimizations it applies while compiling a given C/C++ source file.

When you enable LTO, the linker takes the internal representation (IR) modules embedded in the incoming object files that the compiler generated from C/C++ source files and combines them into a single, merged IR module representing all of the functions and variables from all of the C/C++ source files in your application.
The compiler then has visibility across all C/C++ source files via this merged IR module and is able to apply inter-module optimizations, such as aggressive inlining, constant merging, and aggressive machine outlining. For example, if multiple source files require access to the same string constant, then with LTO enabled they can all share a single instance of storage for that string constant, as opposed to each source file requiring its own copy.

# LTO Development Flow Overview

LTO is easy to incorporate into an application build. An overview of the LTO development flow is shown in **Figure 1** below. The LTO development flow can be divided into two phases:

1. Compilation Phase
   * Compile with the **-flto** option to enable LTO
   * This instructs the compiler to embed a bitcode encoding of the *internal representation* (IR) into the object file that the compiler generates from each C/C++ source file, **including library source files**
   * No additional compile time is incurred with LTO enabled
2. Link Phase
   * Link with the **-flto** option to enable LTO
   * This instructs the linker to extract embedded IR from any object file submitted directly to the linker or pulled in from an object library
   * Object files that do not contain embedded IR are forwarded to the final link step
   * Extracted IR modules are merged and linked together to form a combined IR representation of the whole application
   * The compiler performs inter-module optimizations on the combined IR and generates a single temporary object file, **\<lto\>.o**
   * The linker performs a traditional link, combining **\<lto\>.o** with the input object files that did not contain embedded IR, to produce a linked ELF executable

![LTO Development Flow Overview](images/lto_dev_flow_overview.jpg)

The LTO feature is available in version 2.1.0.LTS and later versions of the TI Arm® Clang Compiler Tools, which you can download from the first link below. For more details about using LTO, please see the online user guide.

* [TI Arm® Clang Compiler Tools](https://www.ti.com/tool/ARM-CGT)
* [TI Arm® Clang Compiler Tools User Guide](https://software-dl.ti.com/codegen/docs/tiarmclang/rel2_1_0_LTS/)

<div id="footer"></div>
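
# Appendix: A Cross-File Optimization Sketch

To make the cross-file optimization described under *How LTO Works* more concrete, here is a minimal C sketch. The file names `sensor_scale.c` and `filter.c` and the functions `scale_reading` and `sum_scaled` are hypothetical, invented purely for illustration; the two parts are shown in one listing for brevity but are meant to live in separate source files. Without LTO, the compiler building `filter.c` sees only the declaration of `scale_reading` and generally has to emit a real call inside the loop. When both files are compiled and linked with **-flto**, the merged IR exposes the function body, so the optimizer may inline it. Whether it actually does so depends on the optimization level and the compiler's heuristics.

```c
/* ---- sensor_scale.c (hypothetical file) ------------------------------- */
/* A small helper defined in its own translation unit. */
int scale_reading(int raw)
{
    return raw * 3 + 1;
}

/* ---- filter.c (hypothetical file) -------------------------------------- */
/* In a separate file, this declaration is all the compiler can see of
 * scale_reading() unless LTO provides the merged IR at link time. */
int scale_reading(int raw);

/* Sums the scaled value of each raw reading. The call in the loop crosses
 * a file boundary, which makes it a candidate for LTO inlining. */
int sum_scaled(const int *raw, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++) {
        total += scale_reading(raw[i]);
    }
    return total;
}
```

The result is the same with or without LTO: calling `sum_scaled` on the readings 1, 2, 3 returns 21 (4 + 7 + 10). LTO changes only how the code is generated, not what it computes.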