|Goal||Implementing a large matrix-matrix multiplication on FPGA|
|Approach||Using divide-and-conquer techniques to describe the matrix multiplication algorithm and then using SDSoC for high-level synthesis|
|Benefits||High-performance implementation, short time-to-market design|
|Credit||This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol.|
Matrix multiplication is an operator with a wide range of applications in image processing, scientific computing, simulation, robotics, and more. Providing a fast implementation on a CPU, GPU, or FPGA has therefore always been a challenge.
Here, I briefly explain how to implement this operator on an FPGA.
Because FPGAs have limited internal memory and logic, it is not possible to transfer all the data of a large matrix to the FPGA and then perform the multiplication there. The FPGA must instead work together with the main memory to complete the task. However, the high latency of data accesses to the main memory is the main bottleneck in this collaboration. The approach explained here therefore tries to minimize this overhead.
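The key idea is to use the divide-and-conquer technique. The following figure shows this approach: matrix A is divided horizontally into blocks of rows and matrix B is divided vertically into blocks of columns. Multiplying one row block of A by one column block of B yields one block of the result, so each such product can be computed independently on data small enough to fit on the FPGA.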
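The partitioning described above can be sketched in plain C++. This is only an illustrative software model, not the project's actual source: the names `multiply_strips` and `blocked_matmul`, the `tile` parameter, and the row-major `std::vector<float>` layout are all assumptions made for this example. In an SDSoC flow, a function like `multiply_strips` would be the natural candidate for hardware acceleration, while the outer loops that copy strips in and tiles out would stay on the processor.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage (assumption for this sketch)

// Hypothetical stand-in for the hardware kernel: multiplies a (rows x k) strip
// of A by a (k x cols) strip of B, both already copied into local buffers.
void multiply_strips(const Matrix& a_strip, const Matrix& b_strip, Matrix& c_tile,
                     std::size_t rows, std::size_t k, std::size_t cols) {
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a_strip[i * k + p] * b_strip[p * cols + j];
            }
            c_tile[i * cols + j] = sum;
        }
    }
}

// Computes C (n x m) = A (n x k) * B (k x m) one tile at a time.
// A is cut horizontally into strips of at most `tile` rows,
// B is cut vertically into strips of at most `tile` columns.
Matrix blocked_matmul(const Matrix& A, const Matrix& B,
                      std::size_t n, std::size_t k, std::size_t m,
                      std::size_t tile) {
    Matrix C(n * m, 0.0f);
    for (std::size_t r0 = 0; r0 < n; r0 += tile) {
        const std::size_t rows = std::min(tile, n - r0);
        // Horizontal strip of A: rows r0 .. r0+rows-1 are contiguous in memory.
        Matrix a_strip(A.begin() + r0 * k, A.begin() + (r0 + rows) * k);

        for (std::size_t c0 = 0; c0 < m; c0 += tile) {
            const std::size_t cols = std::min(tile, m - c0);
            // Vertical strip of B: gather columns c0 .. c0+cols-1 into a dense buffer.
            Matrix b_strip(k * cols);
            for (std::size_t p = 0; p < k; ++p) {
                for (std::size_t j = 0; j < cols; ++j) {
                    b_strip[p * cols + j] = B[p * m + c0 + j];
                }
            }

            // Each (A strip, B strip) pair produces one independent tile of C.
            Matrix c_tile(rows * cols);
            multiply_strips(a_strip, b_strip, c_tile, rows, k, cols);

            // Write the finished tile back into the full result matrix.
            for (std::size_t i = 0; i < rows; ++i) {
                for (std::size_t j = 0; j < cols; ++j) {
                    C[(r0 + i) * m + c0 + j] = c_tile[i * cols + j];
                }
            }
        }
    }
    return C;
}
```

In this sketch the strip of A is loaded once per outer iteration and reused for every column strip of B, which is one way to cut down on main-memory traffic. The tile size would in practice be chosen so that both strips and the result tile fit in the FPGA's on-chip memory.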
The execution times of this implementation on the Zynq in two different modes, standalone and under Linux, are shown in the following figures.
The source code of this implementation can be found here.