What you gain from intrinsics

If you use intrinsics, your main objective is probably to lower execution time. That is indeed what happens, but it is not the whole picture.

The obvious

When you use parallel arithmetic, multiply-accumulate or any other combined operation, you gain because the same operations execute in fewer cycles.
Usually the cycle savings over what a C compiler generates are not that impressive, except for a few very specialized operations (RBIT, CLZ …).
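To make "combined operation" concrete, here is a minimal sketch of what a dual 16-bit multiply-accumulate computes. It is a portable model written for illustration, not the intrinsic itself: on cores with the DSP extension (Cortex-M4 and Cortex-M7, for example) CMSIS exposes this operation as `__SMLAD`, which does the whole thing in one instruction. The names `smlad_model` and `pack16` are hypothetical helpers.

```c
#include <stdint.h>

/* Hypothetical helper: pack two signed 16-bit values into one 32-bit word. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable model (assumption: mirrors the documented behaviour of __SMLAD):
 * acc + x.lo * y.lo + x.hi * y.hi, with both halves treated as signed. */
uint32_t smlad_model(uint32_t x, uint32_t y, uint32_t acc)
{
    int32_t x_lo = (int16_t)(x & 0xFFFFu);
    int32_t x_hi = (int16_t)(x >> 16);
    int32_t y_lo = (int16_t)(y & 0xFFFFu);
    int32_t y_hi = (int16_t)(y >> 16);

    /* Accumulate in unsigned arithmetic so that wrap-around is well defined. */
    return acc + (uint32_t)(x_lo * y_lo) + (uint32_t)(x_hi * y_hi);
}
```

Written in plain C, the same computation needs two multiplies and two adds, plus the sign extensions; that is the kind of sequence a combined instruction replaces.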

The required

To use parallel arithmetic and/or multiply-accumulate intrinsics, you need to load "vectors" from memory.

As all ARM Cortex-M cores are 32-bit, loading fewer than 4 bytes at a time costs the same as loading 4 bytes. Instead of performing two 16-bit or four 8-bit memory accesses, you load everything into one register in a single access (such data are often described as "packed").

This brings significant gains in execution cycles, but also in code size!
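As a rough sketch of a packed load, the code below reads two adjacent 16-bit samples with one 32-bit access and uses them in a dot product. The function names `load_pair16` and `dot16_packed` are made up for this example, and the lane arithmetic is spelled out in plain C so the code runs anywhere; on a DSP-capable core the two products and the accumulation would typically be a single `__SMLAD`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read two adjacent int16_t samples as one 32-bit word.
 * memcpy sidesteps alignment and aliasing pitfalls; compilers
 * typically turn it into a single word load. */
uint32_t load_pair16(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Dot product of two int16_t arrays; n must be even.
 * Each iteration does one word load per array instead of two halfword loads. */
int32_t dot16_packed(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 2 == 0);

    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 2) {
        uint32_t wa = load_pair16(&a[i]);
        uint32_t wb = load_pair16(&b[i]);

        /* Both words are packed the same way, so matching lanes line up
         * whatever the byte order. */
        acc += (int16_t)(wa & 0xFFFFu) * (int16_t)(wb & 0xFFFFu);
        acc += (int16_t)(wa >> 16) * (int16_t)(wb >> 16);
    }
    return acc;
}
```

For example, with `a = {1, -2, 3, 4}` and `b = {5, 6, -7, 8}` the function returns 4, using four word loads where the halfword version needs eight.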

Side effects

Using combined operations and packed data allows your code to use fewer registers.
Without combined operations, you need registers to hold intermediate results; with them, those registers are no longer needed.
And since loading an 8- or 16-bit value takes up a full 32-bit register, packed data take two to four times fewer registers.
If fewer registers are needed, less stack is used:

  • fewer temporary variables need to be stored on the stack
  • fewer registers need to be backed up on the stack
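The register argument can be illustrated with a parallel 16-bit add: four operands sit in two 32-bit words instead of four, and the two sums come back in one word. This is a portable model for illustration only; CMSIS provides the operation as `__SADD16` on cores with the DSP extension, and the name `sadd16_model` is hypothetical. The model ignores the status flags that the real instruction also updates.

```c
#include <stdint.h>

/* Portable model of a parallel add on two 16-bit lanes.
 * Each lane wraps independently: no carry crosses from the low
 * halfword into the high one. */
uint32_t sadd16_model(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;          /* low lanes, carry discarded */
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;  /* high lanes, overflow shifted out */
    return hi | lo;
}
```

Calling `sadd16_model(0x0001FFFF, 0x00020001)` gives `0x00030000`: the low lane wraps to 0 and the high lane is 1 + 2, all in a single word.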

On top of that, if your routine becomes small enough, the compiler may choose to inline it, which brings further gains by removing two branch instructions and some stack usage!

In the end

You can expect gains on:

  • Execution time (on 16-bit data, I often end up with execution times 3 to 5 times lower than optimized C)
  • RAM/Stack usage
  • ROM size
