Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits
| Goal | Implementing a large matrix-matrix multiplication on FPGA |
| Approach | Using divide-and-conquer techniques to describe the matrix multiplication algorithm, then using SDSoC for high-level synthesis |
| Benefits | High-performance implementation, short time to market |
| Credit | This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol. |
Matrix multiplication is an operator with a wide range of applications in image processing, scientific computing, simulation, robotics, and so on. Therefore, providing a fast implementation on a CPU, GPU, or FPGA has always been a challenge.

Here, I briefly explain how to implement this operator on an FPGA.
As FPGAs have limited resources in terms of internal memory and logic, transferring all the data to the FPGA and then performing the multiplication is not possible. Therefore, the FPGA has to collaborate with the main memory to complete the task. However, the high latency of data access to the main memory is the main bottleneck in this collaboration. To tackle this problem, the approach explained here tries to minimize this collaboration overhead.
The key idea is to use a divide-and-conquer technique. The following figure shows this approach: matrix A is divided horizontally into row strips and matrix B vertically into column strips, so each pair of strips produces one tile of the result, as sketched in the code below.
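To make the idea concrete, here is a minimal C sketch of the blocking scheme. The function name mmult_blocked and the block sizes are illustrative, not the actual accelerator code; the point is that only one strip of A and one strip of B must be resident on the FPGA at a time.

#include <stddef.h>

#define BLOCK_ROWS 128   /* rows of A per strip (illustrative) */
#define BLOCK_COLS 32    /* columns of B per strip (illustrative) */

/* Sketch: C[n x p] = A[n x m] * B[m x p], computed tile by tile.
   Each (row strip of A, column strip of B) pair yields one tile of C. */
void mmult_blocked(const float *A, const float *B, float *C,
                   size_t n, size_t m, size_t p)
{
    for (size_t i0 = 0; i0 < n; i0 += BLOCK_ROWS)
        for (size_t j0 = 0; j0 < p; j0 += BLOCK_COLS)
            /* compute the tile of C starting at (i0, j0) */
            for (size_t i = i0; i < i0 + BLOCK_ROWS && i < n; i++)
                for (size_t j = j0; j < j0 + BLOCK_COLS && j < p; j++) {
                    float acc = 0.0f;
                    for (size_t k = 0; k < m; k++)
                        acc += A[i * m + k] * B[k * p + j];
                    C[i * p + j] = acc;
                }
}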

The execution times of running this implementation on the Zynq in two different modes, standalone and under Linux, are shown in the following figures.
Standalone:
Under Linux:
The source code of this implementation can be found here.
I have tried it with a matrix of size 4096×4096. I changed the parameters as follows:
#define N 1024*4
#define M 1024*4
#define P 1024*4
#define A_HEIGHT N
#define A_WIDTH M
#define B_HEIGHT M
#define B_WIDTH P
#define C_HEIGHT N
#define C_WIDTH P
#define A_HEIGHT_BLOCK 512/4
#define B_WIDTH_BLOCK 128/4
It builds without problems, but after running it, the software and hardware results do not match:
Exeution time running matrix multiplication in software: 5.85704e+06 msec
Error at index=256 sw = 1021.82, hw=0
Speed up: 4974.81
TEST FAILED
Bye Matrix Mult
Please enclose the expressions defined by macros (e.g., #define) in parentheses; then everything should be OK, and the output will be as follows:
--------------------------------------
sh-4.3# ./large_latmux_sdsoc_linux_new.elf
Hello Matrix Mult
floating point matrix multiplication
Exeution time running matrix multiplication in hardware: 300918 msec
Exeution time running matrix multiplication in software: 6.04064e+06 msec
Speed up: 20.074
TEST PASSED
Bye Matrix Mult
sh-4.3#
--------------------------------------
….
#define N (1024*4)
#define M (1024*4)
#define P (1024*4)
#define A_HEIGHT N
#define A_WIDTH M
#define B_HEIGHT M
#define B_WIDTH P
#define C_HEIGHT N
#define C_WIDTH P
#define A_HEIGHT_BLOCK (512/4)
#define B_WIDTH_BLOCK (128/4)
…..
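For context, the mismatch arises because C macro expansion is purely textual, so an unparenthesized macro body can change operator precedence at its use site. A minimal illustration (the expression is hypothetical, not taken from the project code), assuming N is 4096:

#define A_HEIGHT_BLOCK 512/4        /* unparenthesized: expands textually */
#define A_HEIGHT_BLOCK_OK (512/4)   /* parenthesized: safe in expressions */

int strips    = N / A_HEIGHT_BLOCK;    /* N / 512 / 4 = (4096/512)/4 = 2, wrong   */
int strips_ok = N / A_HEIGHT_BLOCK_OK; /* N / (512/4) = 4096/128     = 32, correct */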
Thanks,
Mohammad
Thanks. For some reason, when I enclose the expressions in parentheses, compilation fails in my SDSoC 2016.2, but if I just write the number directly, i.e. 4096 instead of (1024*4), it compiles and works.
I have noticed this problem in SDSoC 2016.2 before.
If you add "#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)" (or a similar pragma) before the "mmult_accel" function prototype in matrix_mult.h, then the code compiles even with complex expressions passed to the zero_copy pragma.
It seems this version of SDSoC cannot decide on the memory port when the array size in the zero_copy pragma is an expression. This will probably be corrected in future versions.
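For reference, a hedged sketch of how such a prototype might look in matrix_mult.h; the array-size macros used here are assumptions, and the actual prototype may differ:

#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)
#pragma SDS data zero_copy(A[0:A_HEIGHT*A_WIDTH], B[0:B_HEIGHT*B_WIDTH], C[0:C_HEIGHT*C_WIDTH])
void mmult_accel(float A[A_HEIGHT*A_WIDTH],
                 float B[B_HEIGHT*B_WIDTH],
                 float C[C_HEIGHT*C_WIDTH]);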
Hi.
I was trying your method.
What is the difference between the standalone code and the code under Linux?
Is there any difference in the approach or method used?
Please explain.
Hi,
Conceptually, both are the same. However, under Linux we had a limitation in allocating cached contiguous memory to be accessed by the accelerator. It is possible to use the standalone version under Linux if cache coherency is not necessary.
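As a hedged sketch of that allocation difference, using the standard SDSoC runtime API from sds_lib.h (buffer sizes and names are illustrative):

#include "sds_lib.h"

void allocate_buffers(void)
{
    /* Cached, physically contiguous buffer; coherency is maintained
       (e.g., via the ACP port), subject to allocation limits under Linux. */
    float *A = (float *)sds_alloc(1024 * 1024 * sizeof(float));

    /* Non-cacheable contiguous buffer; usable when cache coherency
       between the CPU and the accelerator is not required. */
    float *B = (float *)sds_alloc_non_cacheable(1024 * 1024 * sizeof(float));

    /* ... pass A and B to the accelerator ... */

    sds_free(A);
    sds_free(B);
}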
Can you please tell me what golden_mm and fast_mm are, what the difference between them is, and where exactly OpenMP and NEON are used?