Post date: Nov 05, 2019 4:42:20 AM
Background
When I posted "Suggestions for Improvement of STREAM OpenMP Code" on 25 July 2014 and then shared the link in various LinkedIn groups, it created a firestorm of controversy.
Some group admins either deleted the post or did not allow it. People attacked me, even though they probably had not bothered to look at the full disclosure results of the STREAM runs. There were comments like "compiler will take care of multiple parallel regions". Basically, some people were saying that all the programmer inefficiencies in the STREAM source code would be optimized away by the compiler.
I wrote the piece below in July 2014 as a reply to one person's comments. I was thinking of adding more information before posting. However, with the raging controversy and the hostile environment it created, I never got to complete the post. What I learnt from this experience is that, to some people, John D. McCalpin's STREAM is a very sacred cow that is beyond critique. Such people refuse to look at the objective facts presented, but fiercely attack any ideas that criticize STREAM.
More than five years have passed since; the controversy has died down and been forgotten. I have decided to finally post the work-in-progress reply below, verbatim and unedited. The original comments are not included. I do not know where that group post is, as LinkedIn does not seem to have an easy way to search for postings in groups. This time I will keep a low profile and not post in any LinkedIn group, so that few people will notice this post.
_____________________________________________________________________________________________
1) AMD Opterons have integrated memory controllers. The memory bandwidth scales with the number of CPUs as every CPU has its own integrated memory controller and its own banks of DIMMs.
Please read the blog post carefully. The Sun Ultra 40 M2 with two Opteron 2224 SE (same clock speed and specs as the 8224 SE in the Fire X4600 M2, except for the number of HyperTransport links) already has slightly higher stock STREAM results than the Sun Fire X4600 M2 with eight Opteron 8224 SE. The DIMMs on the Ultra 40 M2 are 2 GB DDR2 667 MHz (4 DIMMs on each CPU, total of 16 GB); those on the Fire X4600 M2 are 4 GB DDR2 667 MHz (4 DIMMs on each CPU, total of 128 GB). The memory bandwidth of the X4600 M2 is definitely much higher than that of the Ultra 40 M2.
What you say is OK for 2007 numbers might be applicable to Intel Xeons with a Front Side Bus and a shared memory controller on the chipset. Those Xeons are notorious for bottlenecked memory bandwidth, and depend on big CPU caches for benchmark results. Xeons with the memory controller on the chipset do not scale memory bandwidth with the number of CPUs. In fact, the systems with four CPUs have lower-clocked DIMMs than the two-CPU models, as the Front Side Bus has to be clocked lower.
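As a rough illustration of the scaling argument in point 1, the theoretical peak bandwidth can be worked out from the DIMM data rate. This is a minimal sketch, not a measurement: the function name `peak_gb_per_s` is hypothetical, 667 MT/s is the nominal DDR2-667 rate, and two 64-bit memory channels per CPU is an assumption about these Opterons.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s for a system in which every
 * CPU socket has its own integrated memory controller.
 *   mega_transfers     : DIMM data rate in MT/s (667 for DDR2-667, nominal)
 *   bytes_per_transfer : channel width in bytes (8 for a 64-bit channel)
 *   channels           : memory channels per socket (ASSUMED to be 2 below)
 *   sockets            : number of CPU sockets
 * This is an upper bound; it ignores cache coherency and other overheads. */
double peak_gb_per_s(double mega_transfers, int bytes_per_transfer,
                     int channels, int sockets)
{
    assert(bytes_per_transfer > 0 && channels > 0 && sockets > 0);
    return mega_transfers * 1e6 * bytes_per_transfer * channels * sockets / 1e9;
}
```

Under these assumptions, `peak_gb_per_s(667.0, 8, 2, 2)` gives about 21.3 GB/s for a two-CPU system and `peak_gb_per_s(667.0, 8, 2, 8)` about 85.4 GB/s for an eight-CPU system, which is why a two-CPU result that beats an eight-CPU result looks suspicious.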
2) What you understand about STREAM design intent is not documented anywhere. There is no such mention in the STREAM FAQ.
I will be very blunt. STREAM code is a textbook example of how NOT to write an OpenMP program. STREAM code is written exactly the way a naive non-knowledgeable programmer would write an OpenMP program, without any consideration for memory locality and implementation overheads.
A benchmark that deliberately uses poor programming does not measure anything useful, since a better written program can be a few times faster, as I have demonstrated.
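The "implementation overheads" point can be pictured with a small sketch. It is an illustration only, not the STREAM source and not the modified code from the earlier blog post: the function names are hypothetical, and the timing calls that a real benchmark needs between kernels are left out. The first function opens and closes a parallel region for every kernel; the second keeps one parallel region open across all four.

```c
#include <assert.h>
#include <stddef.h>

/* Style A: every kernel opens and closes its own parallel region. */
void kernels_separate_regions(double *a, double *b, double *c,
                              size_t n, double s)
{
    assert(n > 0);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) c[i] = a[i];            /* Copy  */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) b[i] = s * c[i];        /* Scale */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) c[i] = a[i] + b[i];     /* Add   */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) a[i] = b[i] + s * c[i]; /* Triad */
}

/* Style B: one parallel region; each kernel is only a worksharing loop.
 * The implicit barrier at the end of each "omp for" keeps the kernels
 * ordered, so the results are identical to style A. */
void kernels_single_region(double *a, double *b, double *c,
                           size_t n, double s)
{
    assert(n > 0);
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (long i = 0; i < (long)n; i++) c[i] = a[i];            /* Copy  */
        #pragma omp for schedule(static)
        for (long i = 0; i < (long)n; i++) b[i] = s * c[i];        /* Scale */
        #pragma omp for schedule(static)
        for (long i = 0; i < (long)n; i++) c[i] = a[i] + b[i];     /* Add   */
        #pragma omp for schedule(static)
        for (long i = 0; i < (long)n; i++) a[i] = b[i] + s * c[i]; /* Triad */
    }
}
```

Both functions compute the same values; the difference is only in how often the runtime has to start and end a parallel region. Whether that difference is measurable depends on the compiler, the OpenMP runtime and the array size.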
3) No, I have not informed John D. McCalpin. I emailed him twice in 2011; he had almost 3 1/2 years to respond but did not do so.
1,2) Agree.
3) There is only one memory controller on every CPU, so running on all 16 cores will not scale memory bandwidth 16X. This is not an Intel Xeon with FSB, so memory bandwidth should scale with CPUs. There is still the overhead of cache coherency, so of course the raw bandwidth will not scale linearly with CPUs. And of course there is the overhead of threads and so on.
4) If the bandwidth of a two-CPU workstation is higher, something is not right with the measurement. We had Fluent CFD simulations running 24x7 on the Fire X4600 M2 and Ultra 40 M2. The Fluent parallel licenses (MPI) are not cheap. My colleagues would have found out if running with more than 2 cores gave no speedup. That is not the case.
5) I made full disclosure of my experiments. Most of the files are actual run results, raw data to show these are real experiments with each code version. I have already summarized the findings. If you read the blog post, you only need to look at a small number of files, and can ignore most of them. For posting on the blog, I had to re-jog my memory and study all the relevant files again (it has been 3.5 years since, and I had already forgotten the details of course). I have not looked at most of the files again this time as I do not need to.
Have you ever read academic papers that do not make full disclosures? You wonder how they really got their findings as the actual raw results and codes are not posted. I have posted everything here, so no one should be complaining.
6) Yes, that is correct. Quote from STREAM FAQ:
"What is STREAM?
The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels."
The sustainable memory bandwidth is a lot higher than what STREAM suggests, due to the way it is written.
7) I have already given a very detailed explanation of all the code changes I have made. Modern computers with more than one CPU are all NUMA nowadays. People who have studied OpenMP and HPC code optimization will understand the reasons for the proposed code changes to have local memory access and minimum OpenMP overhead.
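To make "local memory access" concrete, here is a generic sketch of the first-touch idea. It is not a reconstruction of the code changes described in the earlier blog post, and the function names are hypothetical. It assumes an operating system that places each memory page on the NUMA node of the thread that first writes to it, and threads that stay bound to their CPUs. The arrays are initialized with the same static schedule that the kernel later uses, so each thread mostly works on pages in its own local memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate n doubles and initialize them in parallel. With a first-touch
 * page placement policy (ASSUMED), each page ends up on the NUMA node of
 * the thread that writes it here. Returns NULL if allocation fails. */
double *alloc_first_touch(size_t n, double value)
{
    assert(n > 0);
    double *p = malloc(n * sizeof *p);
    if (p == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        p[i] = value;
    return p;
}

/* Triad with the same static schedule as the initialization, so each
 * thread gets the same index range it touched first. */
void triad_static(double *a, const double *b, const double *c,
                  size_t n, double s)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = b[i] + s * c[i];
}
```

The pairing only helps if the thread count, the schedule and the thread-to-CPU binding are the same in both loops; changing any of them between initialization and the kernel can turn local accesses into remote ones.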
8) As I have pointed out, STREAM is not a true measurement of sustainable memory bandwidth; that is the flaw. All modern HPC programs need to have NUMA considerations in mind. In fact, an HPC code optimization course will start off by teaching how to minimize memory access latency. A program that assumes uniform memory access latency and no OpenMP implementation overhead is a poorly written program.
9) I have said everything I need to say. Unless there is something new other than defending STREAM as it is written, I will not respond.
The way the rules for tuned codes are written, one calls independent "tuned" routines. However, in order to have a single parallel region in OpenMP, one has to modify the source code outside the "tuned" routines, which from what I understand, does not conform to the rules.
Therefore, even with "tuned" version, one will still face the same issues as the vanilla version.
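The constraint can be sketched as follows. The routine names are hypothetical and are not the names used in the STREAM source. A "tuned" routine can contain an orphaned worksharing loop, but the parallel region that the loop binds to has to be opened by the caller, which is code outside the tuned routines.

```c
#include <assert.h>

/* Hypothetical tuned routines: each contains only an orphaned "omp for".
 * Called from inside a parallel region, the loop is shared among the
 * threads; called from serial code, it simply runs on one thread. */
void tuned_copy_orphaned(double *c, const double *a, long n)
{
    #pragma omp for schedule(static)
    for (long i = 0; i < n; i++)
        c[i] = a[i];
}

void tuned_triad_orphaned(double *a, const double *b, const double *c,
                          long n, double s)
{
    #pragma omp for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

/* Driver code: the single parallel region lives here, outside the tuned
 * routines. This is the part that would have to change in the caller. */
void driver_single_region(double *a, double *b, double *c, long n, double s)
{
    assert(n > 0);
    #pragma omp parallel
    {
        tuned_copy_orphaned(c, a, n);
        tuned_triad_orphaned(a, b, c, n, s);
    }
}
```

If only the bodies of the tuned routines may be changed, each routine has to open its own parallel region, which brings back one region per kernel.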