You never know what you run across, if you profile random jobs on a cluster... This story starts with learning about the perf tool, the kernel level profiling tool, that allows to do function level profiling using performance counters on modern CPUs without having to instrument binaries (e.g. for use with gprof). Particularly on a cluster it turns out to be an extremely useful tool to learn "what is going on" with applications, that you don't know. Logging in as root and then running perf top gives a quick overview of how well a node is utilized. For example, the screenshot on the left shows a profile of a node full of MATLAB calculations. One can see that it utilizes its bundled copy of Intel's MKL, but also that apparently there are a About a month ago, I noticed that there was some unusual calculation on some of the cluster nodes using the R statistical toolkit, that made heavy use of the `log()` is not an option, one could try to replace it with a faster version. A web search turned up several SSE vectorized versions of the single precision logarithm (and other expensive math functions) and the AMD libM library, which also provides these vectorized functions and accelerated irrational math functions. In artificial tests, the contained accelerated log function proved to be more than twice as fast as the one in the system math library. Now the remaining challenge was to figure out a way to replace the calls to log() as provided by the libm shared library in /lib64/libm-2.12.so with calls to amd_log() as provided by libamdlibm.so. The solution was to write a little wrapper and then use the LD_PRELOAD environment variable to override libm. As it turns out, this can be done with two lines of C code:
`#include "amdlibm.h"`
`double log(double x) { return amd_log(x); }` This is compiled with:
`gcc -shared -fPIC -o fasterlog.so fasterlog.c -L. -lamdlibm` And then activated with:
`export LD_PRELOAD=/path/to/fasterlog.so LD_LIBRARY_PATH=/path/to:${LD_LIBRARY_PATH-/lib64}` The result is quite remarkable as shown in the second screenshot on the right side. The percentage of time spent in the ## Update:By the way... if you believe that knowing how to write assembly code, will make your programs blisteringly fast, you may have to reconsider. At least in this case. After some research on the web, i found out that the x86 floating point unit provides an instruction to compute the logarithm to the base of 2. Of course, this can easily be converted into a natural logarithm by multiplying it with a constant. In fact, there is an instruction for just this, and even the necessary constant is stored away and can be loaded into a floating point register with a special instruction. So the code for computing the natural logarithm in x86 inline assembly would look like this: `double log(double x) ` `{` ` register double result;` ` __asm __volatile__ ("fldln2; fxch; fyl2x" ` ` : "=t" (result) ` ` : "0" (x) : "st(1)");` ` return result;` `}` This function is |

Axel's Old News >