I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call `pow(a,2)` by compiling it into `a*a`, but the call `pow(a,6)` is not optimized and will actually call the library function `pow`, which greatly slows down the performance. (In contrast, the Intel C++ Compiler, executable `icc`, will eliminate the library call for `pow(a,6)`.)

What I am curious about is that when I replaced `pow(a,6)` with `a*a*a*a*a*a` using GCC 4.5.1 and options "`-O3 -lm -funroll-loops -msse4`", it uses 5 `mulsd` instructions:

```
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
```

while if I write `(a*a*a)*(a*a*a)`, it will produce

```
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm13, %xmm13
```

which reduces the number of multiply instructions to 3. `icc` has similar behavior.

Why do compilers not recognize this optimization trick?

Because floating-point math is not associative. The way you group the operands in a floating-point multiplication affects the numerical accuracy of the answer.
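To make the grouping effect concrete, here is a minimal sketch in C. The helper names `mul_left` and `mul_right` are made up for illustration, and the `volatile` temporaries are only there so that the intermediate product is really rounded to `double` (an assumption that matters on targets that keep intermediates in wider registers).

```c
#include <math.h>   /* isinf, used in the usage note below */

/* Computes (a*b)*c. The volatile store forces the intermediate
 * product to be rounded (or to overflow) as a double before the
 * second multiply. */
double mul_left(double a, double b, double c) {
    volatile double ab = a * b;
    return ab * c;
}

/* Computes a*(b*c), the other grouping of the same three operands. */
double mul_right(double a, double b, double c) {
    volatile double bc = b * c;
    return a * bc;
}
```

With `a = 1e200`, `b = 1e200`, `c = 1e-200`, `mul_left` overflows to infinity in the intermediate `a*b` (so `isinf` on the result is true), while `mul_right` returns a finite value near `1e200`. That is an extreme case; for ordinary inputs the two groupings typically differ only in the last bits, if at all, but the compiler has to assume they can differ.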

As a result, most compilers are very conservative about reordering floating-point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example, gcc's `-fassociative-math` option allows it to reassociate floating-point operations, and the `-ffast-math` option allows even more aggressive tradeoffs of accuracy against speed.
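If you would rather not relax floating-point semantics for the whole translation unit, one alternative is to spell out the grouping you want in the source. The sketch below assumes a hypothetical helper named `pow6` (not a library function); `pow6_chain` is included only to show the left-to-right grouping that C assigns to `a*a*a*a*a*a`.

```c
/* Hypothetical helper: a^6 grouped as (a*a*a)*(a*a*a).
 * The grouping is explicit in the source, so the compiler
 * needs no permission to reassociate. */
static inline double pow6(double a) {
    double a3 = a * a * a;  /* two multiplies */
    return a3 * a3;         /* third multiply */
}

/* The same power written as a plain chain, which C parses as
 * ((((a*a)*a)*a)*a)*a: five dependent multiplies. */
static inline double pow6_chain(double a) {
    return a * a * a * a * a * a;
}
```

For inputs where every intermediate is exactly representable (for example `2.0` or `-1.5`) both functions return the same value. In general the two may differ in the last bits, which is exactly why the compiler will not turn one into the other on its own.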