Asked  7 Months ago    Answers:  5   Viewed   31 times

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will produce

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?



Because Floating Point Math is not Associative. The way you group the operands in floating point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example: the -fassociative-math option of gcc which allows gcc to reassociate floating point operations, or even the -ffast-math option which allows even more aggressive tradeoffs of accuracy against speed.

Tuesday, June 1, 2021
answered 7 Months ago

In C, a function declared with an empty parameter list accepts an arbitrary number of arguments when being called, which are subject to the usual arithmetic promotions. It is the responsibility of the caller to ensure that the arguments supplied are appropriate for the definition of the function.

To declare a function taking zero arguments, you need to write void foo(void);.

This is for historic reasons; originally, C functions didn't have prototypes, as C evolved from B, a typeless language. When prototypes were added, the original typeless declarations were left in the language for backwards compatibility.

To get gcc to warn about empty parameter lists, use -Wstrict-prototypes:

Warn if a function is declared or defined without specifying the argument types. (An old-style function definition is permitted without a warning if preceded by a declaration which specifies the argument types.)

Sunday, June 13, 2021
answered 6 Months ago

It's using the library sqrt function for error handling. See glibc's documentation: 20.5.4 Error Reporting by Mathematical Functions: math functions set errno for compatibility with systems that don't have IEEE754 exception flags. Related: glibc's math_error(7) man page.

As an optimization, it first tries to perform the square root by the inlined sqrtsd instruction, then checks the result against itself using the ucomisd instruction which sets the flags as follows:

   UNORDERED:    ZF,PF,CF  111;
   LESS_THAN:    ZF,PF,CF  001;
   EQUAL:        ZF,PF,CF  100;

In particular, comparing a QNaN to itself will return UNORDERED, which is what you will get if you try to take the square root of a negative number. This is covered by the jp branch. The je check is just paranoia, checking for exact equality.

Also note that gcc has a -fno-math-errno option which will sacrifice this error handling for speed. This option is part of -ffast-math, but can be used on its own without enabling any result-changing optimizations.

sqrtsd on its own correctly produces NaN for negative and NaN inputs, and sets the IEEE754 Invalid flag. The check and branch is only to preserve the errno-setting semantics which most code doesn't rely on.

-fno-math-errno is the default on Darwin (OS X), where the math library never sets errno, so functions can be inlined without this check.

Saturday, July 31, 2021
answered 5 Months ago

First of all, the problem is not the if; as you saw, gcc sees through the if and manages to pass 30 straight to printf.

Now, gcc does have some logic to handle special cases of printf (in particular, it does optimize printf("somethingn") and even printf("%sn", "something") to puts("something")), but it is extremely specific and doesn't go much further; printf("Hello %sn", "world"), for example, is left as-is. Even worse, any of the variants above without a trailing newline are left untouched, even if they could be transformed to fputs("something", stdout).

I imagine that this comes down to two main problems:

  • the two cases above are extremely easy patterns to implement and happen quite frequently, but for the rest probably it's rarely worth the effort; if the string is constant and the performance is important, the programmer can take care of it easily - actually, if the performance of printf is critical he shouldn't be relying on this kind of optimization, which may break at the slightest change of format string.

    If you ask me, even just the puts optimizations above are already "going for the style points": you are not really going to gain serious performance in anything but artificial test cases.

  • When you start to go outside the realm of %sn, printf is a minefield, because it has a strong dependency on the runtime environment; in particular, many printf specifiers are (unfortunately) affected by the locale, plus there are a host of implementation-specific quirks and specifiers (and gcc can work with printf from glibc, musl, mingw/msvcrt, ... - and at compile time you cannot invoke the target C runtime - think when you are cross-compiling).

    I agree that this simple %d case is probably safe, but I can see why they probably decided to avoid being overly smart and only perform the dumbest and safest optimizations here.

For the curious reader, here is where this optimization is actually implemented; as you can see, the function matches a restricted number of very simple cases (and GIMPLE aside, hasn't changed a lot since this nice article outlining them was written). Incidentally, the source actually explains why they couldn't implement the fputs variant for the non-newline case (there's no easy way to reference the stdout global at that compilation stage).

Friday, August 6, 2021
answered 4 Months ago

No, the compiler is not allowed to do that for the general case: the two operations could produce results that are not bit-identical due to the representation error of the reciprocal.

In your example, 0.1 does not have an exact representation as float. This causes the results of multiplication by 0.1 and division by 10 to differ:

float f = 21736517;
float a = f / 10.f;
float b = f * 0.1f;
cout << (a == b) << endl; // Prints zero


Note: As njuffa correctly notes in the comment below, there are situations when the compiler could make some optimizations for a wide set of numbers, as described in this paper. For example, multiplying or dividing by a power of two is equivalent to addition to the exponent portion of the IEEE-754 float representation.

Tuesday, August 10, 2021
answered 4 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :