STM32F4 – Code Optimization

Using C++ typically increases the memory footprint compared to C. Since embedded systems usually have limited computing and memory resources, this is very bad news! Luckily, we can use various optimizations to reduce the binary size or improve computing performance.

In this post I size-optimize the code of the Object Oriented Programming with Embedded Systems (C++ w/ STL) project. I reduce the binary size from 100.4 KB to merely 3.4 KB. The combined reduction factor is 29.5, which is quite a nice achievement. Read on for the details of how to do it! 🙂

1. Prerequisites

  • Ubuntu 14.04 LTS (x86 architecture).
  • STM32F4 Discovery Board (ARM architecture, costs less than 20 EUR).
  • Complete the OOP tutorial.
    • Or at least go through chapter 2. Below is a quick-and-dirty instruction summary if you decide to stay on this page.
cd ~
# Remove the official package
sudo apt-get purge binutils-arm-none-eabi \
                   gcc-arm-none-eabi \
                   gdb-arm-none-eabi \
                   libnewlib-arm-none-eabi

# Add 3rd party repository
sudo add-apt-repository ppa:terry.guo/gcc-arm-embedded
sudo apt-get update
# Check the GCC package version in the PPA repository
sudo apt-cache policy gcc-arm-none-eabi

# Install software requirements
sudo apt-get install build-essential git openocd \
 gcc-arm-none-eabi qemu-system-arm \
 symlinks expect

# Clone my git repository & init submodules
git clone https://github.com/istarc/stm32.git
cd ~/stm32
git submodule update --init

1.1 Install Software Dependencies

Python 3 and Octave are required to automate and visualize the benchmarking results.

sudo apt-get install python3 octave

2. Memory Footprint Optimization

The non-optimized binary code size of the OOP project (C/C++ including STL) is huge – 100.4 KB! 😯 It has to be reduced in order to be suitable for devices with limited resources. Among others, you can use the following optimizations (see Makefile):

    • O1. Override the new, new[], delete and delete[] operators globally using malloc() and free() in main.cpp (see lines 165 – 179). Caveat. It’s not a good idea to mix new and delete with malloc and free, respectively. The standard new throws an exception upon failure, as required by the C++ standard, whereas the overridden new returns NULL. Special care has to be taken to check for and handle this failure appropriately. You have been warned! 🙂
    • O2. Disable exception handling support using the -fno-exceptions flag. When this optimization is used, you have to adjust the source code accordingly, i.e. replace try, throw and catch statements with C-style error handling.
    • O3. Use the size optimization flag -Os.
    • O4. Place each function or data item into its own isolated section in the output file via -ffunction-sections -fdata-sections flags. Remove isolated unused sections at link time using -Wl,-gc-sections flag. Caveat. May break things if linked to 3rd party static libraries that use magic sections.
    • O5. Don’t use GCC’s built-in functions via -fno-builtin. It may or may not improve performance/reduce code size.
    • O6. Perform link time optimization using -flto flag.
    • O7. Disable type introspection using -fno-rtti flag, i.e. disable generation of information about every class with virtual functions.
    • O8. Use size optimized newlib via --specs=nano.specs flag.
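
For reference, the flag-based options above can be collected into a small lookup table. The sketch below is only an illustration in Python (the language of the benchmarking scripts); the names OPT_FLAGS and flags_for are hypothetical, and it does not reproduce how the project's Makefile actually wires the flags. O1 is a source-level change, so it maps to no flag.

```python
# Hypothetical lookup table: labels from this post mapped to the flags they name.
OPT_FLAGS = {
    "O1": [],  # source-level change (operator overrides in main.cpp), no flag
    "O2": ["-fno-exceptions"],
    "O3": ["-Os"],
    "O4": ["-ffunction-sections", "-fdata-sections", "-Wl,-gc-sections"],
    "O5": ["-fno-builtin"],
    "O6": ["-flto"],
    "O7": ["-fno-rtti"],
    "O8": ["--specs=nano.specs"],
}


def flags_for(opt):
    """Return the flags for a space-separated selection such as 'O2 O3'."""
    flags = []
    for label in opt.split():
        if label not in OPT_FLAGS:
            raise ValueError("unknown optimization label: " + label)
        flags.extend(OPT_FLAGS[label])
    return flags


print(" ".join(flags_for("O2 O3 O8")))
```

Keeping the mapping in one place makes it easy to see which labels affect compilation, which affect linking, and which (O1) require touching the source instead.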

2.1 Overall results

Build the project with and without the optimizations enabled.

cd ~/stm32/examples/Optimization

# Build without optimizations
make clean
make release
#  text data  bss    dec   hex filename
# 96712 2324 3728 102764 1916c bin/outp.elf

sudo make deploy

# Build with all optimizations enabled
make clean
make release-memopt
# text data  bss  dec hex filename
# 2276  108 1076 3460 d84 bin/outp.elf

sudo make deploy

The total binary size (dec) comprises the text, data and BSS segments. Text describes the size of the code and read-only data. Data describes the size of initialized read/write data. BSS describes the size of zero-initialized static or global variables.

The experimental results show that I reduced the code size from 100.4 KB to 3.4 KB. The combined reduction factor is 29.5, which is quite a nice achievement! 🙂
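
As a quick check of the arithmetic, the following Python sketch recomputes the totals from the size output listed above. The helper name total_size is made up for illustration. Note that the factor of 29.5 follows from the rounded KB values; the exact byte counts give a slightly higher ratio.

```python
def total_size(text, data, bss):
    """Return the 'dec' column: the sum of the three segment sizes in bytes."""
    return text + data + bss


baseline = total_size(96712, 2324, 3728)   # values reported for `make release`
optimized = total_size(2276, 108, 1076)    # values reported for `make release-memopt`

print(baseline, optimized)                 # 102764 3460
print(round(baseline / 1024, 1))           # 100.4 (KB)
print(round(optimized / 1024, 1))          # 3.4 (KB)
print(round(100.4 / 3.4, 1))               # 29.5, from the rounded KB values
print(round(baseline / optimized, 1))      # 29.7, from the exact byte counts
```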

Q: Deployment fails repeatedly due to some OpenOCD issue. 😦 Is there a workaround?
A: Yes, the current official Ubuntu package (Aug 2014) contains a prehistoric OpenOCD version. You should build a newer version from scratch. I provide step-by-step instructions here (2nd section, it takes less than 3 minutes to build it). When you are done, just return here and continue as if nothing happened. It will work out of the box. 🙂

2.2 Isolated Optimization

In this section I analyse the effect of each isolated optimization option on the binary size to identify which optimization gives the best result. One or more optimizations can be selected as follows.

# Build with O1, O2, O3, O4, O8
make clean
make release-memopt OPT='O1 O2 O3 O4 O8'
# text data  bss  dec hex filename
# 2372  108 1076 3556 de4 bin/outp.elf

sudo make deploy

To automate benchmarking I provide a Python script that is used as follows. The experimental results are visualized using Octave.

python3 benchmark-singleopt.py
octave benchmark.m # Generates results.png
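
The internals of benchmark-singleopt.py are not shown in this post. As a rough idea of what such automation could look like, here is a minimal Python sketch; it is not the actual script. It assumes that the make targets print the size table shown above to standard output, and all function names are hypothetical.

```python
import subprocess


def parse_size_output(output):
    """Extract text/data/bss/dec (in bytes) from Berkeley-style `size` output."""
    lines = output.splitlines()
    for i, line in enumerate(lines):
        is_header = line.split()[:4] == ["text", "data", "bss", "dec"]
        if is_header and i + 1 < len(lines):
            fields = lines[i + 1].split()
            return dict(zip(("text", "data", "bss", "dec"), map(int, fields[:4])))
    raise ValueError("no size table found in build output")


def build_and_measure(opt):
    """Rebuild with the given OPT selection and return the parsed sizes.

    Assumption: `make release-memopt` prints the size table to stdout.
    """
    subprocess.run(["make", "clean"], check=True, capture_output=True)
    result = subprocess.run(
        ["make", "release-memopt", "OPT=" + opt],
        check=True, capture_output=True, text=True,
    )
    return parse_size_output(result.stdout)


def benchmark_single(labels=("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8")):
    """Measure each optimization in isolation; returns {label: total bytes}."""
    return {label: build_and_measure(label)["dec"] for label in labels}
```

Calling benchmark_single() from ~/stm32/examples/Optimization would then return a mapping from each option to its total size in bytes, which can be written to a file for plotting.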

[Figure: results_nocomb]

The experimental results show the following:

  • The size-optimized newlib library achieves the best result, reducing the size from 100 KB (NA) to approx. 60 KB (O8).
  • There are many unused sections in the original binary, which are removed by the second best optimization, O4.
  • The size optimization (O3) performs similarly well.
  • Some overhead can be reduced by disabling exceptions (O2). The effect is diminished by the fact that exception handling is not removed from the (already compiled) libraries.
  • A minimal advantage is obtained by replacing new and delete with malloc and free (O1).
  • The other optimizations (O5, O7) are ineffective in our test case and perform as if no optimization were used (NA). O6 performs even worse than NA.

Caveat. I don’t claim this is representative code for benchmarking optimization performance; it is used only as a proof of concept. These results are specific to this experimental setup and cannot be generalized.

2.3 Optimization Combinations

I also analyse the cumulative effect of combining the best optimization options (O1, O2, O3, O4 and O8) on the binary size. To automate benchmarking I provide a Python script that is used as follows.

python3 benchmark-multiopt.py
octave benchmark.m # Generates results.png

[Figure: results_comb]

The experimental results show that the following optimization combinations are the most effective:

  • O1 (mem. allocation), O2 (exception handling) and O3 (size optimization) reduce the code size from 100.4 KB to 10.7 KB.
  • O8 (size optimized library) with O1, O2, O3 reduces the code size to 6.5 KB.
  • O4 (remove unused code) with O1, O2, O3, O8 reduces the code size to 3.5 KB.
  • O1-O8 reduce the code size to 3.4 KB.
  • C++ exception handling is space consuming. The best result without O2 is 40.6 KB (see O1 O3 O4 O8).
  • C++ memory management without exception handling is relatively cheap. O2 O3 O4 O8 reduce the code size to 3.7 KB.
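
The combinations compared above can be enumerated mechanically. The following Python sketch generates every non-empty subset of the five best options in the format expected by the OPT variable. Whether benchmark-multiopt.py enumerates all subsets in this way is an assumption; the function name is hypothetical.

```python
from itertools import combinations

BEST_OPTIONS = ["O1", "O2", "O3", "O4", "O8"]


def opt_combinations(labels):
    """Yield every non-empty subset of labels as a space-separated OPT string."""
    for count in range(1, len(labels) + 1):
        for combo in combinations(labels, count):
            yield " ".join(combo)


all_combos = list(opt_combinations(BEST_OPTIONS))
print(len(all_combos))    # 31 non-empty subsets of five options
print(all_combos[-1])     # O1 O2 O3 O4 O8
```

Each string can be passed to the build as make release-memopt OPT='...', one clean build per combination.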

3. Identify space-wasting code

I recommend identifying space-wasting code before proceeding with optimization. This way, optimization decisions can be tuned to a specific problem.

For this purpose, I created a handy script that lists the top 10 space-consuming symbols. The script also identifies the corresponding source files to blame (see Makefile, lines 107-120). I demonstrate the script on the OOP project, where I use the O1, O2 and O3 optimizations but omit the (very effective) O8 optimization. The results are as follows.

make clean
make release-memopt-blame OPT='O1 O2 O3'
# text data  bss   dec  hex filename
# 7608 2188 1124 10920 2aa8 bin/outp.elf
#
# Top 10 space consuming symbols
# 1  bin/outp.elf:00001336 T _malloc_r
# 2  bin/outp.elf:00001064 d impure_data
# 3  bin/outp.elf:00001032 D __malloc_av_
# 4  bin/outp.elf:00000412 T _free_r
# 5  bin/outp.elf:00000240 T __call_exitprocs
# 6  bin/outp.elf:00000236 T TimeDelay::TimeDelay()
# 7  bin/outp.elf:00000236 T TimeDelay::TimeDelay() stm32/examples/Optimization/src/TimeDelay.cpp:12
# 8  bin/outp.elf:00000236 T SystemInit             stm32/examples/Optimization/src/system_stm32f4xx.c:204
# 9  bin/outp.elf:00000224 W std::mersenne_twister_engine<unsigned int, 32u, 624u, 397u, 31u, 2567483615u, 11u, 4294967295u, 7u, 2636928640u, 15u, 4022730752u, 18u, 1812433253u>::operator()() /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:453
# 10 bin/outp.elf:00000216 T GPIO_DeInit            stm32/examples/Optimization/src/stm32f4xx_gpio.c:120
#
# ... and corresponging source files to blame.
# 1  ??:?
# 2  impure.c:?
# 3  mallocr.c:?
# 4  ??:?
# 5  ??:?
# 6  stm32/examples/Optimization/src/TimeDelay.cpp:12
# 7  stm32/examples/Optimization/src/TimeDelay.cpp:12
# 8  stm32/examples/Optimization/src/system_stm32f4xx.c:208
# 9  /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:458
# 10 stm32/examples/Optimization/src/stm32f4xx_gpio.c:121

sudo make deploy

Obviously, memory management (see the top 4: _malloc_r, __malloc_av_, _free_r) consumes a huge amount of space relative to the other parts of the code. This problem can be mitigated by using the size-optimized C library (O8).

make clean
make release-memopt-blame OPT='O1 O2 O3 O8'
# text data  bss  dec  hex filename
# 5416  172 1080 6668 1a0c bin/outp.elf
#
# Top 10 space consuming symbols
# 1  bin/outp.elf:00000236 T TimeDelay::TimeDelay()
# 2  bin/outp.elf:00000236 T TimeDelay::TimeDelay() stm32/examples/Optimization/src/TimeDelay.cpp:12
# 3  bin/outp.elf:00000236 T SystemInit             stm32/examples/Optimization/src/system_stm32f4xx.c:204
# 4  bin/outp.elf:00000224 W std::mersenne_twister_engine<unsigned int, 32u, 624u, 397u, 31u, 2567483615u, 11u, 4294967295u, 7u, 2636928640u, 15u, 4022730752u, 18u, 1812433253u>::operator()()    /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:453
# 5  bin/outp.elf:00000216 T GPIO_DeInit            stm32/examples/Optimization/src/stm32f4xx_gpio.c:120
# 6  bin/outp.elf:00000188 T _reclaim_reent
# 7  bin/outp.elf:00000172 T main                   stm32/examples/Optimization/src/main.cpp:61
# 8  bin/outp.elf:00000168 T _malloc_r
# 9  bin/outp.elf:00000136 T RCC_GetClocksFreq      stm32/examples/Optimization/src/stm32f4xx_rcc.c:855
# 10 bin/outp.elf:00000136 T _free_r
#
# ... and corresponging source files to blame.
# 1  stm32/examples/Optimization/src/TimeDelay.cpp:12
# 2  stm32/examples/Optimization/src/TimeDelay.cpp:12
# 3  stm32/examples/Optimization/src/system_stm32f4xx.c:208
# 4  /usr/arm-none-eabi/include/c++/4.8.4/bits/random.tcc:458
# 5  stm32/examples/Optimization/src/stm32f4xx_gpio.c:121
# 6  ??:?
# 7  stm32/examples/Optimization/src/main.cpp:62
# 8  ??:?
# 9  stm32/examples/Optimization/src/stm32f4xx_rcc.c:860
# 10 ??:?

sudo make deploy
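
The ranking step used in this section can be approximated in a few lines of Python. This is a sketch under assumptions, not the Makefile rule itself: it assumes symbol lines in the form shown in the listings above (file, colon, size in decimal bytes, symbol type, then the symbol and an optional source location), and the function names are hypothetical.

```python
import re

# Matches lines such as "bin/outp.elf:00001336 T _malloc_r".
SYMBOL_LINE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<size>\d+)\s+(?P<kind>[A-Za-z])\s+(?P<rest>.+)$"
)


def parse_symbol_line(line):
    """Return (size_in_bytes, kind, symbol_and_location), or None if no match."""
    match = SYMBOL_LINE.match(line.strip())
    if match is None:
        return None
    # Assumption: the size field is decimal, as the listings above suggest.
    return int(match.group("size")), match.group("kind"), match.group("rest")


def top_symbols(lines, n=10):
    """Return the n largest symbols, biggest first."""
    parsed = [entry for entry in map(parse_symbol_line, lines) if entry is not None]
    return sorted(parsed, key=lambda entry: entry[0], reverse=True)[:n]


listing = [
    "bin/outp.elf:00000412 T _free_r",
    "bin/outp.elf:00001336 T _malloc_r",
    "bin/outp.elf:00001032 D __malloc_av_",
    "bin/outp.elf:00001064 d impure_data",
]
top = top_symbols(listing)
top4_bytes = sum(size for size, _, _ in top)
print(top[0])         # (1336, 'T', '_malloc_r')
print(top4_bytes)     # 3844
```

For the O1 O2 O3 build, these four symbols alone account for 3844 of the 10920 bytes reported, which is why switching the C library has such a large effect.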

4. Further Reading

Performance analysis of C++ language (including STL) is available here and here. Additional information on the space- and time-complexity of STL algorithms and containers can be found here, here, here and here.

About istarc

Embedded Systems Developer.