The graphs in the image attached to this post demonstrate how much can be achieved with rather little effort. They show performance data and memory consumption for a handwritten C code used in research on management information systems. Both the memory consumption and the performance of the software had reached the point where running on an HPC platform was seen as the only way to continue the research. At that point, a copy of the code and some test data sets were handed to me to find out how it would best be run on the local HPC facility.
Some exploratory profiling showed that about 85% of the time was spent in just one rather small subroutine. The data computed in this subroutine also turned out to be the major contributor to memory consumption. Since the range of numbers to be stored was limited, the data elements were changed from 'int' to 'char', which significantly reduced the memory consumption. Some further rearrangements were made to this most time-consuming subroutine to enable the compiler to optimize the code better, and finally OpenMP directives were added.
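To illustrate the kind of change described above, here is a minimal sketch, not the actual research code. The names `cell_t` and `fill_and_sum` and the arithmetic inside the loop are made up for illustration. The only things taken from the post are the narrowing of the stored element type from 'int' to a 'char' type and the use of an OpenMP directive on the hot loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element type: the original code stored 'int'; a narrow
   type suffices once the value range is known to be small. */
typedef unsigned char cell_t;

/* Fill 'out' with small values derived from 'in' and return their sum.
   The arithmetic is a stand-in, not the computation of the research code. */
long fill_and_sum(const int *restrict in, cell_t *restrict out, size_t n)
{
    assert(in != NULL && out != NULL);
    long sum = 0;

    /* Without OpenMP support enabled (e.g. -fopenmp for GCC), the pragma
       is ignored and the loop simply runs serially. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < (long)n; ++i) {
        int v = in[i] % 100;   /* magnitude is at most 99 */
        if (v < 0)
            v = -v;            /* now guaranteed to be in 0..99 */
        out[i] = (cell_t)v;    /* one byte per element instead of sizeof(int) */
        sum += out[i];
    }
    return sum;
}
```

The `restrict` qualifiers (C99) are one example of a rearrangement that can help the compiler: they promise that `in` and `out` do not overlap, which makes vectorization easier. Whether the original code used them is not stated in the post.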
The result is pretty stunning. Overall, there is rather little computation between memory accesses, so the performance is bounded by memory bandwidth. Precisely because of that, significantly reducing the size of the storage elements resulted in a significant performance increase. Aggressive compiler optimization enhanced this gain massively. The high demand for memory bandwidth also explains the limited impact of multi-threading, since concurrent threads accessing different parts of the data increase the demand for memory bandwidth further. Nevertheless, a speedup of up to 12x and a reduction in memory use to almost 1/4 of the original size were achieved in just a couple of days and with rather localized changes.
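The bandwidth argument can be put into a back-of-the-envelope model. The sketch below assumes a loop limited purely by memory bandwidth, which is a simplification. The function names and the bandwidth figure in the usage note are hypothetical and not measurements from this project.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated time in seconds for one pass over n elements when the loop
   is limited purely by memory bandwidth (a simplifying assumption). */
double sweep_seconds(size_t n, size_t elem_size, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return (double)n * (double)elem_size / bytes_per_second;
}

/* Speedup expected in this model from narrowing the element type alone:
   the ratio of bytes moved before and after the change. */
double narrowing_speedup(size_t old_size, size_t new_size)
{
    assert(new_size > 0);
    return (double)old_size / (double)new_size;
}
```

For example, `narrowing_speedup(sizeof(int), sizeof(char))` evaluates to 4 on platforms where 'int' occupies 4 bytes, which matches the reduction in memory use to roughly 1/4. In this model the time per pass, as computed by `sweep_seconds` with an assumed bandwidth such as `1.0e10` bytes per second, shrinks by the same factor. The model covers only the effect of the smaller elements. The gains from compiler optimization and threading reported in the post come on top of it.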