Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits
| Goal | Implementing a large matrix-matrix multiplication on FPGA |
| Approach | Using divide-and-conquer techniques to describe the matrix multiplication algorithm, then using SDSoC for high-level synthesis |
| Benefits | High-performance implementation, short time to market |
| Credit | This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol. |
Matrix multiplication is an operator with a wide range of applications in image processing, scientific computing, simulation, robotics, and so on. Therefore, providing a fast implementation on a CPU, GPU, or FPGA has always been a challenge.

Here, I briefly explain how to implement this operator on an FPGA.
As FPGAs have limited resources in terms of internal memory and logic, transferring all the data to the FPGA and then performing the multiplication is not possible. Therefore, the FPGA has to collaborate with the main memory to complete the task. However, the high latency of data access to the main memory is the main bottleneck in this collaboration. To tackle this problem, the approach explained here tries to minimize this collaboration overhead.
The key idea is to use a divide-and-conquer technique. The following figure shows this approach: matrix A is divided horizontally into row strips and matrix B vertically into column strips, so each pair of strips produces one tile of the result, as sketched in the code below.
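To make the idea concrete, here is a minimal C sketch of the blocking scheme. The function name mmult_blocked and the block sizes are illustrative, not the actual accelerator code; the point is that only one strip of A and one strip of B must be resident on the FPGA at a time.

#include <stddef.h>

#define BLOCK_ROWS 128   /* rows of A per strip (illustrative) */
#define BLOCK_COLS 32    /* columns of B per strip (illustrative) */

/* Sketch: C[n x p] = A[n x m] * B[m x p], computed tile by tile.
   Each (row strip of A, column strip of B) pair yields one tile of C. */
void mmult_blocked(const float *A, const float *B, float *C,
                   size_t n, size_t m, size_t p)
{
    for (size_t i0 = 0; i0 < n; i0 += BLOCK_ROWS)
        for (size_t j0 = 0; j0 < p; j0 += BLOCK_COLS)
            /* compute the tile of C starting at (i0, j0) */
            for (size_t i = i0; i < i0 + BLOCK_ROWS && i < n; i++)
                for (size_t j = j0; j < j0 + BLOCK_COLS && j < p; j++) {
                    float acc = 0.0f;
                    for (size_t k = 0; k < m; k++)
                        acc += A[i * m + k] * B[k * p + j];
                    C[i * p + j] = acc;
                }
}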

The execution times of running this implementation on the Zynq in two different modes, standalone and under Linux, are shown in the following figures.
Standalone:
Under Linux:
The source code of this implementation can be found here.
I have tried it with a matrix of size 4096×4096. I changed the parameters as follows:
#define N 1024*4
#define M 1024*4
#define P 1024*4
#define A_HEIGHT N
#define A_WIDTH M
#define B_HEIGHT M
#define B_WIDTH P
#define C_HEIGHT N
#define C_WIDTH P
#define A_HEIGHT_BLOCK 512/4
#define B_WIDTH_BLOCK 128/4
It builds without problems, but after running it, the software and hardware results do not match:
Exeution time running matrix multiplication in software: 5.85704e+06 msec
Error at index=256 sw = 1021.82, hw=0
Speed up: 4974.81
TEST FAILED
Bye Matrix Mult
Please enclose the expressions defined by macros (e.g., #define) in parentheses; then everything should be OK, and the output will be as follows:
--------------------------------------
sh-4.3# ./large_latmux_sdsoc_linux_new.elf
Hello Matrix Mult
floating point matrix multiplication
Exeution time running matrix multiplication in hardware: 300918 msec
Exeution time running matrix multiplication in software: 6.04064e+06 msec
Speed up: 20.074
TEST PASSED
Bye Matrix Mult
sh-4.3#
--------------------------------------
….
#define N (1024*4)
#define M (1024*4)
#define P (1024*4)
#define A_HEIGHT N
#define A_WIDTH M
#define B_HEIGHT M
#define B_WIDTH P
#define C_HEIGHT N
#define C_WIDTH P
#define A_HEIGHT_BLOCK (512/4)
#define B_WIDTH_BLOCK (128/4)
…..
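For context, the mismatch arises because C macro expansion is purely textual, so an unparenthesized macro body can change operator precedence at its use site. A minimal illustration (the expression is hypothetical, not taken from the project code), assuming N is 4096:

#define A_HEIGHT_BLOCK 512/4        /* unparenthesized: expands textually */
#define A_HEIGHT_BLOCK_OK (512/4)   /* parenthesized: safe in expressions */

int strips    = N / A_HEIGHT_BLOCK;    /* N / 512 / 4 = (4096/512)/4 = 2, wrong   */
int strips_ok = N / A_HEIGHT_BLOCK_OK; /* N / (512/4) = 4096/128     = 32, correct */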
Thanks,
Mohammad
Thanks. For some reason, when I enclose the expressions in parentheses, compilation fails in my SDSoC 2016.2, but if I just write the number directly, i.e. 4096 instead of (1024*4), it compiles and works.
I have noticed this problem in SDSoC 2016.2 before.
If you add "#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)" (or a similar pragma) before the "mmult_accel" function prototype in matrix_mult.h, then the code compiles even with complex expressions passed to the zero_copy pragma.
It seems this version of SDSoC cannot decide on the memory port when the array size in the zero_copy pragma is an expression. This will probably be corrected in future versions.
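For reference, a hedged sketch of how such a prototype might look in matrix_mult.h; the array-size macros used here are assumptions, and the actual prototype may differ:

#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)
#pragma SDS data zero_copy(A[0:A_HEIGHT*A_WIDTH], B[0:B_HEIGHT*B_WIDTH], C[0:C_HEIGHT*C_WIDTH])
void mmult_accel(float A[A_HEIGHT*A_WIDTH],
                 float B[B_HEIGHT*B_WIDTH],
                 float C[C_HEIGHT*C_WIDTH]);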
Hi.
I was trying your method.
What is the difference between the standalone code and the code under Linux?
Is there any difference in the approach or method used?
Please explain.
Hi,
Conceptually, both are the same. However, under Linux we had a limitation in allocating cached contiguous memory to be accessed by the accelerator. It is possible to use the standalone version under Linux if cache coherency is not necessary.
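As a hedged sketch of that allocation difference, using the standard SDSoC runtime API from sds_lib.h (buffer sizes and names are illustrative):

#include "sds_lib.h"

void allocate_buffers(void)
{
    /* Cached, physically contiguous buffer; coherency is maintained
       (e.g., via the ACP port), subject to allocation limits under Linux. */
    float *A = (float *)sds_alloc(1024 * 1024 * sizeof(float));

    /* Non-cacheable contiguous buffer; usable when cache coherency
       between the CPU and the accelerator is not required. */
    float *B = (float *)sds_alloc_non_cacheable(1024 * 1024 * sizeof(float));

    /* ... pass A and B to the accelerator ... */

    sds_free(A);
    sds_free(B);
}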
Can you please tell me what golden_mm and fast_mm are, what the difference between them is, and where exactly OpenMP and NEON are used?