MKL performs poorly on AMD CPUs

Updated on Aug 10, 2022: Intel removed the MKL_DEBUG_CPU_TYPE flag in MKL versions later than 2020.1. After MKL 2020.2 it added a code path specifically for AMD Zen, and according to some benchmarks the dgemm routine no longer has a performance problem on AMD, though sgemm has not improved. If you care about sgemm, a workaround that disguises an AMD CPU as Intel for MKL versions later than 2020.1 is discussed in Agner's blog and Daniel's blog, and some benchmarks were reported on this German website, though I have not tested it personally. MATLAB users reportedly no longer need to worry about this issue; however, there are still reports of slowdowns on AMD EPYC CPUs.

-------------------------------

Updated on Nov 18, 2019: There is a trick to make MKL run faster on AMD CPUs. On Linux, set the environment variable with "export MKL_DEBUG_CPU_TYPE=5", according to a GitHub issue and a Reddit discussion. (Thanks to @Smartcom for bringing this to my attention!) With this setting, the gigaflops doubled and MKL's performance became comparable to OpenBLAS on AMD, though MKL on Intel (i7-8086K) is still 1.4 times faster.
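For programs launched from a script rather than a shell, the same setting can be passed through the child process environment. The following is a minimal sketch, not something from the original test setup: the helper names `mkl_env` and `run_with_mkl_workaround` are made up for illustration, and the variable only has an effect on MKL versions that still honor the flag.

```python
import os
import subprocess
import sys


def mkl_env(base=None):
    """Return a copy of the environment with MKL_DEBUG_CPU_TYPE=5 added.

    `base` defaults to the current process environment. The helper name is
    hypothetical; only the variable name and value come from the text above.
    """
    env = dict(os.environ if base is None else base)
    env["MKL_DEBUG_CPU_TYPE"] = "5"
    return env


def run_with_mkl_workaround(cmd):
    """Run `cmd` (a list of arguments) with the flag set for that child only."""
    return subprocess.run(
        cmd, env=mkl_env(), check=True, capture_output=True, text=True
    )


# Example: confirm that the child process actually sees the variable.
result = run_with_mkl_workaround(
    [sys.executable, "-c", "import os; print(os.environ['MKL_DEBUG_CPU_TYPE'])"]
)
print(result.stdout.strip())
```

Setting the variable per child process keeps the workaround from leaking into other programs, which is convenient when comparing runs with and without it.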

-------------------------------


Intel MKL is not as well optimized on AMD as on Intel CPUs. Moreover, contrary to what some MKL engineers have claimed, while it still far outperforms the reference BLAS/LAPACK, MKL actually performs worse than OpenBLAS on an AMD CPU.


Some answers on StackExchange say that two AMD cores share one FPU, so AMD is not suitable for scientific computation. That reason no longer holds as of AMD's Ryzen series, which changed the architecture. The major obstacle that now keeps people doing scientific computation from using AMD CPUs is the suboptimal performance of Intel MKL on AMD cores, considering that MKL is still the best mathematical library in the world and much software uses it by default. For now, MKL plus an Intel CPU is still the optimal choice for scientific computation.


To compare MKL's performance on an Intel CPU (Core i5-8400, 2.8 GHz, 6 cores/6 threads) and an AMD CPU (Ryzen 7 2700X, 3.7 GHz, 8 cores/16 threads), I ran an n-by-n matrix multiplication, which uses only MKL's dgemm routine, with thread affinity set to 1 core and to 6 cores respectively. I used the gcc compiler. The results are shown below.


Comparison of wall time (s) on 1 core between AMD and Intel.
Comparison of wall time (s) on 6 cores between AMD and Intel.
Ratio of wall time on 1 core to wall time on 6 cores, AMD.
Ratio of wall time on 1 core to wall time on 6 cores, Intel.
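The quantities in these comparisons (wall time, gigaflops, and the 1-core to 6-core ratio) can be reproduced with a small timing harness. The sketch below is not the gcc/dgemm program used for the results above. It assumes NumPy is installed and linked against the BLAS library under test, and it uses the conventional 2n^3 flop count for an n-by-n matrix product. All function names are illustrative.

```python
import time


def gflops(n, seconds):
    """Rate for an n-by-n matrix product, using the usual 2*n**3 flop count."""
    return 2.0 * n ** 3 / seconds / 1e9


def best_wall_time(func, repeats=3):
    """Call func() `repeats` times and return the shortest wall time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(t_one_core, t_many_cores):
    """Ratio of the 1-core wall time to the multi-core wall time."""
    return t_one_core / t_many_cores


def benchmark_numpy(n=2000, repeats=3):
    """Time a double-precision n-by-n product through NumPy.

    Assumption: NumPy dispatches this product to the dgemm routine of
    whichever BLAS it was built against (MKL, OpenBLAS, ...).
    """
    import numpy as np  # imported here so the helpers above need only the stdlib

    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    seconds = best_wall_time(lambda: a @ b, repeats)
    return seconds, gflops(n, seconds)
```

Calling `benchmark_numpy(2000)` returns the best wall time and the corresponding gigaflops. To imitate the 1-core and 6-core runs on Linux, the script can be started under `taskset -c 0` or `taskset -c 0-5`, and the two wall times passed to `speedup`.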

I also ran some tests comparing MKL with OpenBLAS on the AMD Ryzen 2700X. OpenBLAS is about three times as fast as MKL on AMD for dgemm up to n=20000, but still not as fast as MKL on the Intel CPU, which actually has a lower base frequency. Others have run similar comparisons on the AMD Ryzen Threadripper 1950X and reached the same conclusion.