Using C++ instead of C typically increases the memory footprint. Since embedded systems usually have limited computing and memory resources, this is bad news! Luckily, various optimizations can reduce the binary size or improve computing performance.
In this post I size-optimize the code of the Object Oriented Programming with Embedded Systems (C++ w/ STL) project. I reduce the binary size from 100.4 KB to merely 3.4 KB. The combined reduction factor is about 29.5, which is quite a nice achievement. Read on for the details! 🙂
- Ubuntu 14.04 LTS (x86 architecture).
- STM32F4 Discovery Board (ARM architecture, costs less than 20 EUR).
- Complete the OOP tutorial.
- Or at least go through chapter 2. Below is a quick-and-dirty instruction summary if you decide to stay on this page.
cd ~
# Remove the official package
sudo apt-get purge binutils-arm-none-eabi \
                   gcc-arm-none-eabi \
                   gdb-arm-none-eabi \
                   libnewlib-arm-none-eabi
# Add 3rd party repository
sudo add-apt-repository ppa:terry.guo/gcc-arm-embedded
sudo apt-get update
# Check the GCC package version in the PPA repository
sudo apt-cache policy gcc-arm-none-eabi
# Install software requirements
sudo apt-get install build-essential git openocd \
                     gcc-arm-none-eabi qemu-system-arm \
                     symlinks expect
# Clone my git repository & init submodules
git clone https://github.com/istarc/stm32.git
cd ~/stm32
git submodule update --init
2. Install Software Dependencies
Python 3 and Octave are required to automate the benchmarks and visualize the results.
sudo apt-get install python3 octave
2. Memory Footprint Optimization
The non-optimized binary code size of the OOP project (C/C++ including STL) is huge – 100.4 KB! 😯 It has to be reduced in order to be suitable for devices with limited resources. Among others, you can use the following optimizations (see Makefile):
- O1. Override the new, new[], delete and delete[] operators globally using malloc() and free() in main.cpp (see lines 165 – 179). Caveat. Mixing new and delete with malloc and free is generally not a good idea. The standard new throws an exception upon failure, as required by the C++ standard, whereas the overridden new returns NULL. Special care is needed wherever the code relies on catching these exceptions. You have been warned! 🙂
- O2. Disable exception handling support using the -fno-exceptions flag. With this optimization you have to adjust the source code accordingly, i.e. replace try, throw and catch statements with C-style error handling (e.g. return codes).
- O3. Use the size optimization flag -Os.
- O4. Place each function or data item into its own isolated section in the output file via -ffunction-sections -fdata-sections flags. Remove isolated unused sections at link time using -Wl,-gc-sections flag. Caveat. May break things if linked to 3rd party static libraries that use magic sections.
- O5. Don’t use GCC’s built-in functions via -fno-builtin. It may or may not improve performance/reduce code size.
- O6. Perform link time optimization using -flto flag.
- O7. Disable type introspection using -fno-rtti flag, i.e. disable generation of information about every class with virtual functions.
- O8. Use the size-optimized newlib via the --specs=nano.specs flag.
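The O1 item above can be sketched roughly as follows. This is an illustrative sketch, not the code from the project's main.cpp. The counter `g_alloc_count` is a hypothetical addition that only makes the override observable. The sketch also deviates from the override described above in one respect: instead of returning NULL on failure it calls `std::abort()`, because a plain new-expression is allowed to assume that the allocation function never returns a null pointer.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical counter, only here to make the override observable.
static std::size_t g_alloc_count = 0;

// Global replacement: route every plain `new` through malloc().
void* operator new(std::size_t size) {
    ++g_alloc_count;
    // malloc(0) may return NULL, so always request at least one byte.
    void* p = std::malloc(size != 0 ? size : 1);
    if (p == nullptr) {
        // Assumption of this sketch: trap instead of throwing std::bad_alloc,
        // so no exception machinery is pulled in.
        std::abort();
    }
    return p;
}

// Array form forwards to the scalar form.
void* operator new[](std::size_t size) {
    return operator new(size);
}

// Matching deallocation functions: route every `delete` through free().
void operator delete(void* p) noexcept { std::free(p); }
void operator delete[](void* p) noexcept { std::free(p); }

// Sized variants (used by C++14 and later) forward to free() as well.
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
void operator delete[](void* p, std::size_t) noexcept { std::free(p); }
```

With these definitions linked into the program, `new` and `delete` no longer need the throwing allocation path from the C++ runtime library. Any code that used to catch `std::bad_alloc` has to be reviewed, since that exception is never thrown by this version.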
2.1 Overall results
Build with and without the optimizations enabled.
cd ~/stm32/examples/Optimization
# Build without optimizations
make clean
make release
#   text    data     bss     dec     hex filename
#  96712    2324    3728  102764   1916c bin/outp.elf
sudo make deploy
# Build with all optimizations enabled
make clean
make release-memopt
#   text    data     bss     dec     hex filename
#   2276     108    1076    3460     d84 bin/outp.elf
sudo make deploy
The total binary size (dec) is the sum of the text, data and BSS segments. Text is the size of the code and read-only data. Data is the size of initialized read/write data. BSS is the size of zero-initialized static or global variables.
The experimental results show that I reduced the code size from 100.4 KB to 3.4 KB. The combined reduction factor is about 29.5, which is quite a nice achievement! 🙂
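The arithmetic behind these numbers can be checked with a small sketch. The struct and function names below are hypothetical; only the byte counts are taken from the size output above. Note that dividing the exact byte totals gives a factor of about 29.7, while dividing the rounded values 100.4 KB and 3.4 KB gives about 29.5.

```cpp
#include <cstdint>

// Hypothetical record holding one line of the size output shown above.
struct SizeReport {
    std::uint32_t text;  // code and read-only data
    std::uint32_t data;  // initialized read/write data
    std::uint32_t bss;   // zero-initialized statics and globals
};

// The "dec" column is simply the sum of the three segments.
constexpr std::uint32_t totalBytes(SizeReport r) {
    return r.text + r.data + r.bss;
}

// Convert bytes to KB (1 KB = 1024 bytes, as used in this post).
constexpr double toKB(std::uint32_t bytes) {
    return bytes / 1024.0;
}

// Ratio of the total size before and after optimization.
constexpr double reductionFactor(SizeReport before, SizeReport after) {
    return static_cast<double>(totalBytes(before)) / totalBytes(after);
}

// Values copied from the two builds above.
constexpr SizeReport kBaseline{96712, 2324, 3728};
constexpr SizeReport kOptimized{2276, 108, 1076};

static_assert(totalBytes(kBaseline) == 102764, "matches the dec column");
static_assert(totalBytes(kOptimized) == 3460, "matches the dec column");
```

Here `toKB(totalBytes(kBaseline))` evaluates to roughly 100.4 and `toKB(totalBytes(kOptimized))` to roughly 3.4, which is where the figures quoted in the text come from.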
Q: Deployment fails repeatedly due to some OpenOCD issue. 😦 Is there a workaround?
A: Yes, the current official Ubuntu package (Aug 2014) contains a prehistoric OpenOCD version. You should build a newer version from source. I provide step-by-step instructions here (2nd section; it takes less than 3 minutes to build). When you are done, just return here and continue as if nothing happened. It will work out of the box. 🙂
2.2 Isolated Optimization
In this section I analyse the effect of each optimization option in isolation on the binary size, to identify which optimization gives the largest reduction. One or more optimizations can be selected as follows.
# Build with O1, O2, O3, O4, O8
make clean
make release-memopt OPT='O1 O2 O3 O4 O8'
#   text    data     bss     dec     hex filename
#   2372     108    1076    3556     de4 bin/outp.elf
sudo make deploy
To automate benchmarking I provide a Python script that is used as follows. The experimental results are visualized using Octave.
python3 benchmark-singleopt.py
octave benchmark.m # Generates results.png
- The size-optimized newlib library achieves the best result, reducing the size from 100 KB (NA, i.e. no optimization) to approx. 60 KB (O8).
- There are many unused sections in the original, which are removed by the second best optimization O4.
- The size optimization (O3) performs similarly well.
- Some overhead can be reduced by disabling exceptions (O2). The effect is diminished by the fact that exception handling is not removed from the (already compiled) libraries.
- Minimal advantage is obtained by replacing new and delete with malloc and free (O1).
- The other optimizations (O5, O7) are ineffective in our test case and perform as if no optimization were used (NA). O6 performs even worse than NA.
Caveat. I don’t claim this code is representative for benchmarking optimization performance; it is used only as a proof of concept. These results are specific to this experimental setup and cannot be generalized.
2.3 Optimization Combinations
I also analyse the cumulative effect on the binary size of combining the best optimization options (O1, O2, O3, O4 and O8). To automate benchmarking I provide a Python script that is used as follows.
python3 benchmark-multiopt.py
octave benchmark.m # Generates results.png
- O1 (mem. allocation), O2 (exception handling), O3 (size optimization) reduce the code size from 100.4 KB to 10.7 KB.
- O8 (size optimized library) combined with O1, O2, O3 reduces the code size to 6.5 KB.
- O4 (remove unused code) combined with O1, O2, O3, O8 reduces the code size to 3.5 KB.
- O1-O8 reduce the code size to 3.4 KB.
- C++ exception handling is space consuming. The best optimization without O2 is 40.6 KB (see O1 O3 O4 O8).
- C++ memory management without exception handling is relatively cheap. O2 O3 O4 O8 reduce the code size to 3.7 KB.
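Since exception handling dominates the cost, it helps to see what the source code adjustment for O2 can look like. Below is a minimal sketch of C-style error handling with return codes, assuming a build with -fno-exceptions. The `Status` enum, the `SampleBuffer` class and its capacity are hypothetical and are not part of the OOP project.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical status codes that take the place of thrown exceptions.
enum class Status { Ok, Full, OutOfRange };

// Hypothetical capacity, chosen small for illustration.
constexpr std::size_t kCapacity = 4;

// Hypothetical fixed-size buffer; no heap and no exceptions involved.
class SampleBuffer {
public:
    // Instead of throwing when the buffer is full, report it to the caller.
    Status push(std::uint16_t sample) {
        if (count_ == kCapacity) {
            return Status::Full;
        }
        data_[count_++] = sample;
        return Status::Ok;
    }

    // Instead of throwing std::out_of_range, return a status and
    // deliver the value through an output parameter.
    Status at(std::size_t index, std::uint16_t& out) const {
        if (index >= count_) {
            return Status::OutOfRange;
        }
        out = data_[index];
        return Status::Ok;
    }

private:
    std::uint16_t data_[kCapacity] = {};
    std::size_t count_ = 0;
};
```

A caller checks the returned status where it would otherwise have written a try/catch block, for example `if (buf.at(i, value) != Status::Ok) { /* handle the error */ }`. The trade-off is that every call site has to check the result explicitly.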
3. Identify space-wasting code
I recommend identifying space-wasting code before proceeding with optimization. This way, optimization decisions can be tuned to the specific problem.
For this purpose, I created a handy script that lists the top 10 space-consuming symbols. The script also identifies the corresponding source files to blame (see Makefile, lines 107-120). I demonstrate this script on the OOP project, where I use the O1, O2 and O3 optimizations but omit the (very effective) O8 optimization. The experimental results are the following.
make clean
make release-memopt-blame OPT='O1 O2 O3'
#   text    data     bss     dec     hex filename
#   7608    2188    1124   10920    2aa8 bin/outp.elf
#
# Top 10 space consuming symbols
# 1 bin/outp.elf:00001336 T _malloc_r
# 2 bin/outp.elf:00001064 d impure_data
# 3 bin/outp.elf:00001032 D __malloc_av_
# 4 bin/outp.elf:00000412 T _free_r
# 5 bin/outp.elf:00000240 T __call_exitprocs
# 6 bin/outp.elf:00000236 T TimeDelay::TimeDelay()
# 7 bin/outp.elf:00000236 T TimeDelay::TimeDelay() stm32/examples/Optimization/src/TimeDelay.cpp:12
# 8 bin/outp.elf:00000236 T SystemInit stm32/examples/Optimization/src/system_stm32f4xx.c:204
# 9 bin/outp.elf:00000224 W std::mersenne_twister_engine<unsigned int, 32u, 624u, 397u, 31u, 2567483615u, 11u, 4294967295u, 7u, 2636928640u, 15u, 4022730752u, 18u, 1812433253u>::operator()() /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:453
# 10 bin/outp.elf:00000216 T GPIO_DeInit stm32/examples/Optimization/src/stm32f4xx_gpio.c:120
#
# ... and corresponging source files to blame.
# 1 ??:?
# 2 impure.c:?
# 3 mallocr.c:?
# 4 ??:?
# 5 ??:?
# 6 stm32/examples/Optimization/src/TimeDelay.cpp:12
# 7 stm32/examples/Optimization/src/TimeDelay.cpp:12
# 8 stm32/examples/Optimization/src/system_stm32f4xx.c:208
# 9 /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:458
# 10 stm32/examples/Optimization/src/stm32f4xx_gpio.c:121
sudo make deploy
Obviously, memory management (see the top 4: _malloc_r, __malloc_av_, _free_r) consumes a huge amount of space relative to the other parts of the code. This problem can be mitigated by using the size-optimized C library (O8).
make clean
make release-memopt-blame OPT='O1 O2 O3 O8'
#   text    data     bss     dec     hex filename
#   5416     172    1080    6668    1a0c bin/outp.elf
#
# Top 10 space consuming symbols
# 1 bin/outp.elf:00000236 T TimeDelay::TimeDelay()
# 2 bin/outp.elf:00000236 T TimeDelay::TimeDelay() stm32/examples/Optimization/src/TimeDelay.cpp:12
# 3 bin/outp.elf:00000236 T SystemInit stm32/examples/Optimization/src/system_stm32f4xx.c:204
# 4 bin/outp.elf:00000224 W std::mersenne_twister_engine<unsigned int, 32u, 624u, 397u, 31u, 2567483615u, 11u, 4294967295u, 7u, 2636928640u, 15u, 4022730752u, 18u, 1812433253u>::operator()() /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:453
# 5 bin/outp.elf:00000216 T GPIO_DeInit stm32/examples/Optimization/src/stm32f4xx_gpio.c:120
# 6 bin/outp.elf:00000188 T _reclaim_reent
# 7 bin/outp.elf:00000172 T main stm32/examples/Optimization/src/main.cpp:61
# 8 bin/outp.elf:00000168 T _malloc_r
# 9 bin/outp.elf:00000136 T RCC_GetClocksFreq stm32/examples/Optimization/src/stm32f4xx_rcc.c:855
# 10 bin/outp.elf:00000136 T _free_r
#
# ... and corresponging source files to blame.
# 1 stm32/examples/Optimization/src/TimeDelay.cpp:12
# 2 stm32/examples/Optimization/src/TimeDelay.cpp:12
# 3 stm32/examples/Optimization/src/system_stm32f4xx.c:208
# 4 /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:458
# 5 stm32/examples/Optimization/src/stm32f4xx_gpio.c:121
# 6 ??:?
# 7 stm32/examples/Optimization/src/main.cpp:62
# 8 ??:?
# 9 stm32/examples/Optimization/src/stm32f4xx_rcc.c:860
# 10 ??:?
sudo make deploy
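The ranking step itself is simple and can be sketched in a few lines. This is not the script from the Makefile; it is a hypothetical host-side helper that assumes each input line has the form `<file>:<decimal size> <type letter> <symbol name>`, as in the listings above, and returns the largest entries. The names `SymbolSize`, `parseSymbols` and `topSymbols` are invented for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one symbol line.
struct SymbolSize {
    unsigned long size;  // size in bytes (decimal in the listing)
    char type;           // symbol type letter, e.g. 'T' or 'D'
    std::string name;    // rest of the line (symbol and optional source)
};

// Parse lines of the assumed form "<file>:<size> <type> <name...>".
// Lines that do not match (headers, blank lines) are skipped.
std::vector<SymbolSize> parseSymbols(const std::string& listing) {
    std::vector<SymbolSize> out;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string first;
        char type = 0;
        if (!(fields >> first >> type)) {
            continue;
        }
        // Drop an optional "<file>:" prefix in front of the size.
        const std::size_t colon = first.rfind(':');
        const std::string digits =
            (colon == std::string::npos) ? first : first.substr(colon + 1);
        if (digits.empty() ||
            digits.find_first_not_of("0123456789") != std::string::npos) {
            continue;
        }
        std::string name;
        std::getline(fields >> std::ws, name);
        out.push_back({std::stoul(digits), type, name});
    }
    return out;
}

// Return the n largest symbols, biggest first (ties keep input order).
std::vector<SymbolSize> topSymbols(std::vector<SymbolSize> symbols,
                                   std::size_t n) {
    std::stable_sort(symbols.begin(), symbols.end(),
                     [](const SymbolSize& a, const SymbolSize& b) {
                         return a.size > b.size;
                     });
    if (symbols.size() > n) {
        symbols.resize(n);
    }
    return symbols;
}
```

Feeding it the symbol lines of the first listing and asking for the top entry would yield `_malloc_r` with 1336 bytes, which matches the observation that memory management dominates that build.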
4. Further Reading
Performance analysis of C++ language (including STL) is available here and here. Additional information on the space- and time-complexity of STL algorithms and containers can be found here, here, here and here.