Performance of Matrix Multiplication

Sep 19, 2016


I have been busy with the Xilinx PYNQ project, so I have not been able to update my personal website. One side product of that project was a performance evaluation across various platforms, languages, and libraries. For example, I have been studying the performance of matrix multiplication. Here I just want to show some basic numbers from that evaluation.


The basic setup is two matrices of 128*128 elements each. The elements are generated randomly, and 10 iterations are run to get the average execution time of a single iteration. The functions and APIs are kept identical across platforms to make the comparison fair.
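As a sketch of this setup (the original benchmark harness is not shown here, so the details below are assumptions), the timing loop for the numpy version might look like this:

```python
import time
import numpy as np

N = 128          # matrix dimension, as described above
ITERATIONS = 10  # number of timed runs to average

total = 0.0
for _ in range(ITERATIONS):
    # Fresh random inputs each iteration
    a = np.random.rand(N, N)
    b = np.random.rand(N, N)
    start = time.time()
    c = a @ b    # the operation under test
    total += time.time() - start

print("average time per multiply: %.3f ms" % (total / ITERATIONS * 1000))
```

Only the multiplication itself is timed; random generation happens outside the measured region, so it does not skew the average.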


The following table shows the performance results. In this table, the Python code can be a naive object-based implementation, one built on an existing library such as numpy, or one bound to a C program. On Zybo, some of the C programs are based on SDSoC projects. PL in this table means the Programmable Logic is used to accelerate the computation, which is expected to give high performance.

There are a couple of interesting things to notice in this table:

  1. The naive C or Python code uses three nested loops, which gives the worst performance on all platforms.

  2. On the same platform, it is easy to see the overhead introduced by the OS (Ubuntu > PetaLinux > bare metal).

  3. Using hardware accelerators is always beneficial. With the PL-accelerated version of the Python code, the performance gap between an embedded system and a shared workstation narrows considerably (6.34 ms versus 2.01 ms).
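For reference, the naive three-loop implementation mentioned in point 1 looks roughly like this in pure Python (a sketch; the benchmarked versions were written separately for each platform):

```python
def matmul_naive(a, b):
    """Naive O(n^3) matrix multiply over nested Python lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]  # dot product of row i and column j
            c[i][j] = s
    return c
```

Every element access here goes through the Python object layer, which is why this version is orders of magnitude slower than a BLAS-backed numpy `a @ b` on the same machine.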

Admittedly, matrix multiplication is a very specific application that is particularly well suited to hardware acceleration. But it at least gives us an idea of what performance looks like across the various approaches.