Two lines of code = 40% faster

posted Nov 13, 2012, 5:35 AM by Axel Kohlmeyer   [ updated Nov 13, 2012, 9:38 AM ]

You never know what you'll run across when you profile random jobs on a cluster...

This story starts with learning about the perf tool, a kernel-level profiler that allows function-level profiling via the performance counters of modern CPUs, without having to instrument the binaries (as is required, e.g., for gprof). Particularly on a cluster, it turns out to be an extremely useful tool for finding out what is going on with applications that you don't know. Logging in as root and then running perf top gives a quick overview of how well a node is utilized. For example, the screenshot on the left shows a profile of a node full of MATLAB calculations. One can see that MATLAB utilizes its bundled copy of Intel's MKL, but also that apparently a lot of memory management operations are going on.
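(Quoting the commands from memory, so the exact options may differ between perf versions: perf top gives the live, system-wide view used here, while perf record -g ./application followed by perf report records a profile of a single job and lets you browse it, including call graphs.)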

About a month ago, I noticed an unusual calculation running on some of the cluster nodes: it used the R statistical toolkit and made heavy use of the log() function in the math library. In fact, more than 50% of the total time was spent in the symbol __ieee754_log of the libm shared library (see the screenshot on the right). Now the logarithm is a very "expensive" math function, and when teaching classes about optimizing codes I always remind people to avoid calling it, and to also avoid the pow() function, since it implicitly uses log(). After an exchange with the cluster user running these calculations, it turned out that this was indeed a case where the power function was used a lot, but could not be avoided without abandoning R and a massive programming effort.
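To make the "avoid pow()" advice concrete, here is a generic classroom-style illustration (my own sketch, not the R user's code): for small integer exponents the transcendental detour through exp() and log() can be replaced by plain multiplications, unless the compiler already sees through it.

#include <math.h>

/* conceptually, pow(x, 3.0) is evaluated as exp(3.0 * log(x)),
   i.e. two expensive transcendental function calls ... */
double cube_slow(double x) { return pow(x, 3.0); }

/* ... while three multiplications give the same result at a
   small fraction of the cost */
double cube_fast(double x) { return x * x * x; }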

Any time a code spends this large a percentage of its total time in a single function, it is worth researching options for optimization. Since avoiding the calls to log() was not an option, one could try to replace it with a faster version. A web search turned up several SSE-vectorized versions of the single precision logarithm (and other expensive math functions), as well as the AMD LibM library, which provides such vectorized functions along with accelerated scalar transcendental functions. In artificial tests, its accelerated log function proved to be more than twice as fast as the one in the system math library. Now the remaining challenge was to figure out a way to replace the calls to log() as provided by the libm shared library in /lib64/ with calls to amd_log() as provided by libamdlibm.so. The solution was to write a little wrapper and then use the LD_PRELOAD environment variable to override libm. As it turns out, this can be done with two lines of C code:
#include "amdlibm.h"
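/* when this shared object is preloaded, the dynamic linker resolves
   calls to log() here first, and we simply forward them to AMD's
   faster implementation */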

double log(double x) { return amd_log(x); }

This is compiled with: gcc -shared -fPIC -o fasterlog.so fasterlog.c -L. -lamdlibm
And then activated with: export LD_PRELOAD=/path/to/fasterlog.so LD_LIBRARY_PATH=/path/to:${LD_LIBRARY_PATH-/lib64} (the LD_LIBRARY_PATH entry is needed so that the dynamic linker can also find libamdlibm.so).
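A quick way to convince yourself that the override actually takes effect is a little micro-benchmark like the one below (a minimal sketch added here for illustration; the program name is made up). Compile it with gcc -O2 -o logbench logbench.c -lm, then run it once normally and once with the LD_PRELOAD setting above active, and compare the timings.

#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    double sum = 0.0;
    int i;
    clock_t start = clock();

    /* accumulate the results so the compiler cannot
       optimize the log() calls away */
    for (i = 1; i <= 50000000; ++i)
        sum += log((double) i);

    printf("sum = %g  time = %.2f s\n", sum,
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}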

The result is quite remarkable, as shown in the second screenshot on the right side. The percentage of time spent in the log() function (the symbol is now called __amd_bas64_log) has dropped significantly, to about 20% of the total time. The perf tool also nicely illustrates that the function is now taken from libamdlibm.so instead of the system-provided libm. In a full run of the same (complete) R calculation, the wall time dropped from 7h 45m to 4h 35m on a 3.5GHz dual quad-core Intel Xeon X5677 system. Not a bad improvement for writing two lines of C code.
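As an aside: since the R user's hot path actually goes through the power function, the same trick should extend naturally, as AMD's library also ships an accelerated power function. One more line in the wrapper (an untested sketch on my part) would be:

double pow(double x, double y) { return amd_pow(x, y); }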


By the way... if you believe that knowing how to write assembly code will make your programs blisteringly fast, you may have to reconsider. At least in this case. After some research on the web, I found out that the x86 floating point unit provides an instruction to compute the logarithm to base 2. Of course, this can easily be converted into a natural logarithm by multiplying with a constant. In fact, there is an instruction for just this combination, and even the necessary constant is stored away in the FPU and can be loaded into a floating point register with a special instruction. So the code for computing the natural logarithm in x86 inline assembly would look like this:

double log(double x)
{
    register double result;

    /* fldln2 pushes the constant ln(2) onto the FPU stack,
       fxch swaps it with the argument x, and fyl2x computes
       st(1) * log2(st(0)), i.e. ln(2) * log2(x) = ln(x) */
    __asm__ __volatile__ ("fldln2; fxch; fyl2x"
                          : "=t" (result)
                          : "0" (x)
                          : "st(1)");
    return result;
}
This function is even slower than what is provided in the math library of the GNU libc, or than other codes I found for computing log(). The likely explanation (my interpretation): instructions like fyl2x are microcoded and take dozens of cycles on modern CPUs, while a good library implementation gets the same result faster from table lookups and polynomial approximations on the regular floating point hardware. Ouch!