Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits

GoalImplementing a large matrix-matrix multiplication on FPGA
ApproachUsing divide-and-conquer techniques to describe the matrix multiplication algorithm and then using SDSoC for high-level synthesis
BenefitsHigh-performance implementation, short time-to-market design
CreditThis work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

Matrix multiplication is one of the operators that have a wide range of applications in image processing, scientific computing, simulation, robotics, and so on. Therefore, providing a fast speed implementation using CPU, GPU or FPGA has always been a challenge.

Here, I briefly explain how to implement this operator on FPGA.

As FPGAs have limited resources in terms of internal memory or logics, transferring all the data to an FPGA and then performing the multiplication is not possible.  Therefore, the FPGA should collaborate with the main memory to complete the task. However, the low latency of the data access in the main memory is the main bottleneck in this collaboration.  To tackle this problem, the approach explained here tries to minimize this collaboration overhead.

The key idea is using the divide-and-conquer technique. The following figure shows this approach by horizontally and vertically dividing matrix A and B, respectively.


The execution times of running this implementation on the Zynq in two different modes, standalone and under Linux are shown in the following figures.



Under Linux:


The source code of this implementation can be found here.

Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits

8 thoughts on “Large Matrix-Matrix Multiplication on FPGA”
  1. I have tried with a matrix with size 4096×4096. I changed the parameters like:

    #define N 1024*4
    #define M 1024*4
    #define P 1024*4

    #define A_HEIGHT N
    #define A_WIDTH M

    #define B_HEIGHT M
    #define B_WIDTH P

    #define C_HEIGHT N
    #define C_WIDTH P

    #define A_HEIGHT_BLOCK 512/4
    #define B_WIDTH_BLOCK 128/4

    It implements without problems but after running the software and hardware results do not match:

    Exeution time running matrix multiplication in software: 5.85704e+06 msec
    Error at index=256 sw = 1021.82, hw=0
    Speed up: 4974.81
    Bye Matrix Mult

    1. Please enclose expressions defined by macros (e.g, #define) in parenthesis, then everything would be OK. and the output would be as follows
      sh-4.3# ./large_latmux_sdsoc_linux_new.elf
      Hello Matrix Mult
      floating point matrix multiplication
      Exeution time running matrix multiplication in hardware: 300918 msec
      Exeution time running matrix multiplication in software: 6.04064e+06 msec
      Speed up: 20.074
      Bye Matrix Mult
      #define N (1024*4)
      #define M (1024*4)
      #define P (1024*4)

      #define A_HEIGHT N
      #define A_WIDTH M

      #define B_HEIGHT M
      #define B_WIDTH P

      #define C_HEIGHT N
      #define C_WIDTH P

      #define A_HEIGHT_BLOCK (512/4)
      #define B_WIDTH_BLOCK (128/4)

      1. Thanks,

        For some reason when I enclose in parenthesis compilation fails in my sdsoc 2016.2 but if I just write the number directly so 4096 instead of (1024*4) it compiles and works.

      2. I have realised this problem in the SDSoC 2016.2 before.
        If you add “#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)” (or similar pragma) before the “mmult_accel” function prototype in the matrix_mult.h then it can be compiled even with complex expressions passed to the zero_copy pragma.
        It seems the new version of the SDSoC cannot decide on the memory port when the array size in the zero_copy pragma is an expression. Probably they will correct that in future versions.

  2. Hi.
    I was trying your method.
    What is the difference between standalone and under linux code?
    Is there any difference in approach or method used?
    Please explain.

    1. Hi,
      Conceptually, both are the same. However, in Linux we had a limitation for allocating the cached contiguous memory to be accessed in the accelerator. It is possible to use the standalone version in the Linux if cache coherency is not necessary.

  3. Can you please tell me what is golden_mm and fast_mm what is the differnce between them and where openmp and neon is used exactly

Leave a Reply to Shikha GoelCancel reply