Parallelization and Caching

NOTES:

Summer 2020: The caching tools have been completely rewritten in LabVIEW 2020 and now use two maps, the second one for the age lookup. On typical hardware there is no real difference because the calculation time is typically dominant, but the caching overhead is about 3x lower and it scales much better with cache size. Since the cache lookup is the critical section for parallelization, a faster cache will be more important for future computers with much higher core counts.
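To make the idea concrete, here is a rough C++ sketch of such a two-map design (this is not the LabVIEW code; the string key, the LRU-style age refresh on a hit, and all names are assumptions for illustration only):

```
// Illustrative two-map cache: one hash map holds the cached subspectra,
// a second sorted map keyed by an age stamp makes finding the oldest
// entry for eviction a single O(log n) lookup.
// NOTE: sketch only; key format, LRU refresh and names are assumptions.
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct SpectrumCache {
    // map 1: parameter key -> (age stamp, cached subspectrum)
    std::unordered_map<std::string,
                       std::pair<std::uint64_t, std::vector<double>>> values;
    // map 2: age stamp -> parameter key (sorted, oldest entry is ages.begin())
    std::map<std::uint64_t, std::string> ages;
    std::uint64_t stamp = 0;
    std::size_t capacity = 4096;   // default cache size of 4k spectra (adjustable)

    const std::vector<double>* lookup(const std::string& key) {
        auto it = values.find(key);
        if (it == values.end()) return nullptr;      // cache miss
        ages.erase(it->second.first);                // refresh age on a hit
        it->second.first = ++stamp;
        ages[stamp] = key;
        return &it->second.second;
    }

    void insert(const std::string& key, std::vector<double> spectrum) {
        auto it = values.find(key);
        if (it != values.end()) {
            ages.erase(it->second.first);            // key already cached: drop stale age entry
        } else if (values.size() >= capacity) {
            auto oldest = ages.begin();              // evict the oldest entry
            values.erase(oldest->second);
            ages.erase(oldest);
        }
        ages[++stamp] = key;
        values[key] = {stamp, std::move(spectrum)};
    }
};
```

Evicting the oldest entry is a single lookup in the sorted age map instead of a scan over the whole cache, which is presumably why the overhead scales much better with cache size.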

The implementation discussed here applies to v.600 and newer (~2013). Further code tightening has been done since. Always upgrade to the newest version!

Work has focused on several improvements to speed up fitting by using all available computing resources under all possible conditions. For this purpose, the MOMD calculations have been taken out of the Fortran DLL so that the various orientations can be processed in parallel. In addition, the calculation of the partial derivatives (for Levenberg-Marquardt) has also been fully parallelized. High-performance memoization has been implemented as an associative memory cache, avoiding duplicate recalculation of any subspectrum. The default cache size is 4k spectra, but it is adjustable.
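As a rough illustration of the parallel part (again a C++ sketch under assumptions, not the actual LabVIEW/Fortran code: Params, make_key and compute_subspectrum are placeholders for the component parameters, the cache key, and the per-orientation calculation):

```
// Illustrative parallel MOMD loop: each orientation is simulated as an
// independent task; only the cache lookup/insert is a critical section.
// Uses the SpectrumCache sketch from above.
#include <cstddef>
#include <future>
#include <mutex>
#include <string>
#include <vector>

struct Params;                                                    // placeholder: component parameters
std::string make_key(const Params& p, int orientation);          // placeholder: encodes all parameters
std::vector<double> compute_subspectrum(const Params& p, int k); // placeholder: the expensive simulation

std::vector<double> momd_spectrum(const Params& p, int n_orientations,
                                  SpectrumCache& cache, std::mutex& cache_lock) {
    std::vector<std::future<std::vector<double>>> jobs;
    for (int k = 0; k < n_orientations; ++k) {
        jobs.push_back(std::async(std::launch::async, [&, k]() -> std::vector<double> {
            const std::string key = make_key(p, k);
            {   // the cache lookup is the only critical section, so a fast
                // cache matters more and more as core counts grow
                std::lock_guard<std::mutex> guard(cache_lock);
                if (const std::vector<double>* hit = cache.lookup(key)) return *hit;
            }
            std::vector<double> s = compute_subspectrum(p, k);    // runs in parallel
            std::lock_guard<std::mutex> guard(cache_lock);
            cache.insert(key, s);
            return s;
        }));
    }
    // sum the orientation subspectra into the MOMD powder spectrum
    std::vector<double> total;
    for (auto& j : jobs) {
        std::vector<double> s = j.get();
        if (total.empty()) total.assign(s.size(), 0.0);
        for (std::size_t i = 0; i < s.size(); ++i) total[i] += s[i];
    }
    return total;
}
```

The partial derivatives for Levenberg-Marquardt can presumably be parallelized the same way, by launching one such model calculation per perturbed parameter.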

History: The code of older versions (<v.600) only implemented mild parallelization. In the model calculation, each spectral component was assigned to a CPU core. For example, fitting a single-component MOMD spectrum could only use a single core. Caching was limited to a single entry per component.

The model calculation of the new version is implemented as follows:

Caching also improves the responsiveness during manual simulation. For example, if the rate R is increased by one and then decreased by one (returning to the earlier parameter set), the spectrum is returned in sub-ms time because it is still cached. Changing the amplitude of components is instantaneous, even for very complicated models.
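Continuing the same sketch (purely illustrative, and an assumption about where the scaling happens): the amplitudes can be applied when the cached component subspectra are summed, so changing an amplitude never triggers a new simulation.

```
// Illustrative model assembly: amplitudes scale the cached component
// subspectra during the final sum, so an amplitude change needs no
// recomputation. (Assumption for illustration; reuses the sketches above.)
std::vector<double> model_spectrum(const std::vector<const Params*>& components,
                                   const std::vector<double>& amplitudes,
                                   int n_orientations,
                                   SpectrumCache& cache, std::mutex& cache_lock) {
    std::vector<double> total;
    for (std::size_t c = 0; c < components.size(); ++c) {
        std::vector<double> s = momd_spectrum(*components[c], n_orientations,
                                              cache, cache_lock);
        if (total.empty()) total.assign(s.size(), 0.0);
        for (std::size_t i = 0; i < s.size(); ++i)
            total[i] += amplitudes[c] * s[i];     // scaling only, no recomputation
    }
    return total;
}
```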

Here is a flow diagram of the process:

The performance of the new version has been extensively tested on a wide variety of hardware, from an Atom processor with a single hyper-threaded core to a Dual Xeon E5-2687W workstation with 16 hyper-threaded cores (32 virtual cores). The following are some preliminary results comparing four different computers:

As a baseline, we force sequential calculation:

Here only a single core is used for the spectral calculation. The I7 is fastest because it runs at a higher clock rate than the otherwise very similar E5 (3.4 vs. 3.1 GHz). The T7600 is an older design. The Atom is slow, as expected for a processor optimized for low power consumption.

The units are spectra/second [Hz] (each MOMD orientation counts as one spectrum).

Once we allow parallelization, the picture changes dramatically:

Note that the sequential results from above are still shown for comparison on the same scale (blue). The E5 is 17.6x faster due to its high core count, easily beating the I7, which only gains 4.4x. Even the Atom gains about 1.7x due to hyperthreading. (If hyperthreading is disabled in the BIOS for the E5 or I7, the speed increase is slightly less than the number of cores, as expected (15.8x and 3.8x, resp.), clearly showing that hyperthreading gives an additional boost.) The speed of the E5 is over 1k spectra/second, dramatically speeding up the fitting of MOMD spectra. For typical work, even the I7 performance is excellent, considering that its price is only 10% of that of the E5 workstation.

The units are spectra/second [Hz] (each MOMD orientation counts as one spectrum).

Important: In the real world, the parallel performance is what matters most and should be the deciding factor.

In a third test, we repeat the benchmark after ensuring that all spectra already exist in the cache. This tests the raw cache performance.

At the time of these benchmarks, the associative memory cache used the stock LabVIEW red-black tree implementation of variant attributes (see the 2020 rewrite note above), and several further improvements are possible. Still, even that cache performance is orders of magnitude faster than a re-computation, which is more than sufficient at the moment.

The units are spectra/second [Hz] (each MOMD orientation counts as one spectrum).

(This is data from an older version; the caching improvements described above have increased the cached speed to well over 100'000 spectra/second, e.g. >280'000 Hz on the i9-13900K!)

As can be seen from the image below, all cores of the two E5 processors are 100% busy during EPR fitting.

NOTE: The I7-2600K system was home-built. The Dual Xeon E5-2687W workstation was custom built by @Xi computers.

Update: AMD has released several new CPUs that are highly competitive again.

Update (Mar 2018): I had the pleasure of testing an AMD Ryzen 7 1700X and it performs great! It is basically tied for first place in single-thread performance and is currently the top single-CPU contender in parallel performance, beaten only by multi-CPU Xeon systems costing thousands of dollars more. Great job, AMD!

Note on very old AMD CPUs: I have also tested several of the AMD multicore chips, and the performance is abysmal. For example, I tested a Quad AMD Opteron 6274 system (64 cores, 2.2 GHz, 4x115W). The sequential performance of an AMD Bulldozer core is about 4-5x lower than that of the Intel Sandy Bridge chips for the calculation of EPR spectra. The caching performance is horrible, and the parallel performance running on all 64 cores was about equivalent to a 6-core I7.

(Part of this work has been presented at NI-Week 2012 under the title "Parallelizing the Unparallelizable")