Performance of Matrix Multiplication

Sep 19, 2016


I have been busy with the Xilinx PYNQ project, so I have not been able to update my personal website. One side product of that project was a performance evaluation across various platforms, languages, and libraries. For example, I have been studying the performance of matrix multiplication. Here I just want to show some basic numbers from that evaluation.


The basic setup is two matrices of 128*128 elements each. The elements are generated randomly, and 10 iterations are run to get the average execution time of a single iteration. The functions and APIs are kept identical across platforms to make the comparison fair.
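As a sketch of this setup (the original benchmark harness is not shown here, so the details below are assumptions), the timing loop for the numpy version might look like this:

```python
import time
import numpy as np

N = 128          # matrix dimension, as described above
ITERATIONS = 10  # number of timed runs to average

total = 0.0
for _ in range(ITERATIONS):
    # Fresh random inputs each iteration
    a = np.random.rand(N, N)
    b = np.random.rand(N, N)
    start = time.time()
    c = a @ b    # the operation under test
    total += time.time() - start

print("average time per multiply: %.3f ms" % (total / ITERATIONS * 1000))
```

Only the multiplication itself is timed; random generation happens outside the measured region, so it does not skew the average.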


The following table shows the performance results. In this table, the Python code can be a naive object-based implementation, one built on an existing library such as numpy, or one bound to a C program. On Zybo, some of the C programs are based on SDSoC projects. PL in this table means the Programmable Logic is used to accelerate the computation, which is expected to give high performance.

There are a couple of interesting things to notice in this table:

  1. The naive C or Python code uses three nested loops, which gives the worst performance on all platforms.

  2. On the same platform, it is easy to see the overhead introduced by the OS (Ubuntu > PetaLinux > bare metal).

  3. Using hardware accelerators is always beneficial. With the PL-accelerated version of the Python code, the performance gap between an embedded system and a shared workstation narrows considerably (6.34 ms versus 2.01 ms).
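For reference, the naive three-loop implementation mentioned in point 1 looks roughly like this in pure Python (a sketch; the benchmarked versions were written separately for each platform):

```python
def matmul_naive(a, b):
    """Naive O(n^3) matrix multiply over nested Python lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]  # dot product of row i and column j
            c[i][j] = s
    return c
```

Every element access here goes through the Python object layer, which is why this version is orders of magnitude slower than a BLAS-backed numpy `a @ b` on the same machine.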

Admittedly, matrix multiplication is a very specific application that is particularly well suited to hardware acceleration. But it at least gives us an idea of what performance looks like across the various approaches.