Why this blog

Short story

Well, if you work on a Cortex-M4 or Cortex-M7, you are probably using it as a Cortex-M3. The following picture shows the part of the M4F instruction set that the Keil ARM Compiler is able to use:
[Figure: m4f_instruction_compiler]

As you can see, you have a lot of power left within your reach, provided you can write target-specific code.
This blog intends to demonstrate how you can use this power and what gains you can expect from unleashing the beast:

[Figure: m4f_instruction_unleashed]

Oh, one other thing: if you are not willing to turn your compiler optimizations up high, don’t bother trying to optimize your algorithm. Stack usage, function calls and loop management will probably cost a lot compared to your computations!

Long story

In my work as an embedded software developer, I got my hands on an ARM Cortex-M4 microcontroller and needed to do some signal processing computations with it.

While evaluating a couple of algorithms (IIR and FIR filtering, Fourier transform …) and trying to trade off core frequency against current consumption, I quickly looked into what sets the Cortex-M4 apart from the Cortex-M3: its DSP extension.

CMSIS DSP library

All CM4 and CM7 manufacturers point to the CMSIS DSP library as the way to use this extension, so I looked into it.
It contains a lot of commonly used algorithms, and they are written using compiler intrinsics, so it’s easy to think that you’re getting the most out of your target.
Well, that’s not entirely true. I will not go into details here, but the bottom line is:

  • Some algorithms may be very efficient in terms of execution time but not so much in terms of RAM/ROM usage.
  • Some others, when you look at a particular need, are not very efficient at all.

Both come from a choice the developers made (and I can’t blame them for that): favoring generic over specific.

Assembly radicalization

In the meantime, I had to write some application-specific algorithms. After writing them a first time in plain C, I tried to race my compiler by writing the same functionality in pure assembly and, after spending a shameful amount of time, I finally managed to beat it.

In the end, my personal experience has shown that there may be a few exceptions (mainly OS-related), but in general it is not a good idea to write assembly, for the following reasons:

  • Assembler-specific (unlike C, it is almost impossible to use the same assembly file with two different assemblers)
  • Long and painful work on loops and branches
  • Prevents any cross-module optimization

Conclusion

Ultimately, what worked (and still works) fine for my needs was letting the compiler do its job (manage calls, loops, stacks, etc.) and using intrinsics!

The idea is to look at the compiler's optimized output (assembler listing) and to analyse where operations can be merged into a single instruction.

The Cortex-M4 and Cortex-M7 DSP extension has a couple of parallel arithmetic instructions. These are not very simple to express in C and usually require arranging data in memory accordingly, so computations that would benefit from such parallelism need to be designed directly with the instruction set in mind (or in front of your eyes: ARM instruction set Cheat Sheet).
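
To make this a bit more concrete, here is a minimal sketch in C of what "merging operations" can look like for a 16-bit dot product. It is only an illustration under assumptions, not code from a real project: smlad_emul, load_pair, dot_q15_plain and dot_q15_dual are hypothetical names, and smlad_emul is a portable stand-in so the sketch also runs on a PC. On a real Cortex-M4/M7 target you would call the CMSIS __SMLAD intrinsic instead, which performs the two multiplies and both additions in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for SMLAD (hypothetical helper, for host-side runs only):
   returns acc + x[0]*y[0] + x[1]*y[1], where x and y each hold two signed
   16-bit values packed in one 32-bit word.
   Assumption: inputs are small enough that the 32-bit sum does not overflow. */
static inline int32_t smlad_emul(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t xs[2], ys[2];
    memcpy(xs, &x, sizeof xs);   /* unpack the two halfwords of x */
    memcpy(ys, &y, sizeof ys);   /* unpack the two halfwords of y */
    return acc + (int32_t)xs[0] * ys[0] + (int32_t)xs[1] * ys[1];
}

/* Load two consecutive 16-bit samples as one 32-bit word.
   This is the "arrange data in memory" part: samples must be contiguous. */
static inline uint32_t load_pair(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Plain C version: one multiply-accumulate per sample. */
static int32_t dot_q15_plain(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

/* Dual-MAC version: two samples per iteration, one SMLAD-like step each.
   Assumption: n is even (a real version would handle the odd tail). */
static int32_t dot_q15_dual(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 2 == 0);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 2) {
        /* On target, this line would use the CMSIS __SMLAD intrinsic. */
        acc = smlad_emul(load_pair(&a[i]), load_pair(&b[i]), acc);
    }
    return acc;
}
```

With a = {1, -2, 3, -4} and b = {5, 6, -7, 8}, both versions return -60. The dual version only pays off if the samples are already stored as contiguous 16-bit values, which is exactly why the data layout has to be decided with the instruction set in mind.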

So now I’m here with this blog up and running, and I will try to give you a taste of writing for Cortex-M4/M7.

 

Oh, and by the way, my name is Thibaut and I’m very pleased to meet you !!
