Goal: Implementing a large matrix-matrix multiplication on FPGA
Approach: Using divide and conquer techniques to describe the matrix multiplication algorithm and then using SDSoC for high-level synthesis
Benefits: High-performance implementation, short time-to-market design
Credit: This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

Matrix multiplication is an operator with a wide range of applications in image processing, scientific computing, simulation, robotics, and so on. Providing a fast implementation on a CPU, GPU, or FPGA has therefore always been a challenge.

Here, I briefly explain how to implement this operator on an FPGA.

Because FPGAs have limited internal memory and logic resources, it is not possible to transfer all the data of a large matrix multiplication to the FPGA and then perform the multiplication there. The FPGA must therefore work together with the main memory to complete the task. However, the high latency of data access in the main memory is the main bottleneck in this collaboration. To tackle this problem, the approach explained here tries to minimize this collaboration overhead.

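The key idea is to use the divide and conquer technique: matrix A is divided horizontally into row strips and matrix B is divided vertically into column strips, as the following figure shows. Multiplying one row strip of A by one column strip of B yields one independent block of the result, so the FPGA can work on one such pair at a time.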

large_matrix_mult
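
The strip-based division can be sketched in plain C as follows. This is only a minimal illustration under stated assumptions, not the code of the actual implementation: it assumes square N x N matrices of `float` stored in row-major order and a strip size BS that divides N. The names `strip_mult` and `large_matrix_mult_sketch`, the sizes, and the local buffers (which stand in for the FPGA's on-chip memory) are hypothetical, and no SDSoC pragmas are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N  8   /* matrix dimension (assumed square, for illustration) */
#define BS 4   /* strip height/width (assumed to divide N) */

/* Multiply one row strip of A (BS x N) by one column strip of B (N x BS).
 * The result is one BS x BS block of C. On an FPGA, this is the part
 * that would run on the local copies of the strips. */
static void strip_mult(const float *a_strip, const float *b_strip,
                       float *c_blk)
{
    for (size_t i = 0; i < BS; i++) {
        for (size_t j = 0; j < BS; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += a_strip[i * N + k] * b_strip[k * BS + j];
            c_blk[i * BS + j] = acc;
        }
    }
}

/* Hypothetical top-level function: C = A * B, computed block by block. */
void large_matrix_mult_sketch(const float *A, const float *B, float *C)
{
    /* Local buffers standing in for on-chip memory (an assumption). */
    static float a_strip[BS * N];
    static float b_strip[N * BS];
    static float c_blk[BS * BS];

    assert(N % BS == 0);

    for (size_t bi = 0; bi < N; bi += BS) {
        /* A row strip is contiguous in row-major storage. */
        memcpy(a_strip, &A[bi * N], sizeof a_strip);

        for (size_t bj = 0; bj < N; bj += BS) {
            /* Gather the column strip of B row by row. */
            for (size_t k = 0; k < N; k++)
                memcpy(&b_strip[k * BS], &B[k * N + bj],
                       BS * sizeof(float));

            strip_mult(a_strip, b_strip, c_blk);

            /* Write the finished block back to the result matrix. */
            for (size_t i = 0; i < BS; i++)
                memcpy(&C[(bi + i) * N + bj], &c_blk[i * BS],
                       BS * sizeof(float));
        }
    }
}
```

In this sketch each row strip of A is copied once and reused for every column strip of B, which is one simple way to cut down on main-memory traffic; the real design may schedule the transfers differently.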

The following figures show the execution times of this implementation on the Zynq in two different modes: standalone and under Linux.

Standalone:

matrix-mult-result-standalone

Under Linux:

matrix-mult-result-linux

The source code of this implementation can be found here.