**Goal** |
Implementing a large matrix-matrix multiplication on FPGA |

**Approach** |
Using divide and conquer techniques to describe the matrix multiplication algorithm and then using SDSoC for high-level synthesis |

**Benefits** |
High-performance implementation, short time-to-market design |

**Credit** |
This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol. |

Matrix multiplication is one of the operators that has a wide range of applications in image processing, scientific computing, simulation, robotics, and so on. Therefore, providing a fast speed implementation using CPU, GPU, or FPGA has always been a challenge.

Here, I briefly explain how to implement this operator on FPGA.

As FPGAs have limited resources in terms of internal memory or logics, transferring all the data to an FPGA and then performing the multiplication is not possible. Therefore, the FPGA should collaborate with the main memory to complete the task. However, the low latency of the data access in the main memory is the main bottleneck in this collaboration. To tackle this problem, the approach explained here tries to minimize this collaboration overhead.

The key idea is using the divide and conquer technique. The following figure shows this approach by horizontally and vertically dividing matrix A and B, respectively.

The execution times of running this implementation on the Zynq in two different modes, standalone and under Linux are shown in the following figures.

Standalone:

Under Linux:

The source code of this implementation can be found here.

### Like this:

Like Loading...

I have tried with a matrix with size 4096×4096. I changed the parameters like:

#define N 1024*4

#define M 1024*4

#define P 1024*4

#define A_HEIGHT N

#define A_WIDTH M

#define B_HEIGHT M

#define B_WIDTH P

#define C_HEIGHT N

#define C_WIDTH P

#define A_HEIGHT_BLOCK 512/4

#define B_WIDTH_BLOCK 128/4

It implements without problems but after running the software and hardware results do not match:

Exeution time running matrix multiplication in software: 5.85704e+06 msec

Error at index=256 sw = 1021.82, hw=0

Speed up: 4974.81

TEST FAILED

Bye Matrix Mult

LikeLike

Please enclose expressions defined by macros (e.g, #define) in parenthesis, then everything would be OK. and the output would be as follows

——————————————–

sh-4.3# ./large_latmux_sdsoc_linux_new.elf

Hello Matrix Mult

floating point matrix multiplication

Exeution time running matrix multiplication in hardware: 300918 msec

Exeution time running matrix multiplication in software: 6.04064e+06 msec

Speed up: 20.074

TEST PASSED

Bye Matrix Mult

sh-4.3#

———————————————-

….

#define N (1024*4)

#define M (1024*4)

#define P (1024*4)

#define A_HEIGHT N

#define A_WIDTH M

#define B_HEIGHT M

#define B_WIDTH P

#define C_HEIGHT N

#define C_WIDTH P

#define A_HEIGHT_BLOCK (512/4)

#define B_WIDTH_BLOCK (128/4)

…..

Thanks,

Mohammad

LikeLike

Thanks,

For some reason when I enclose in parenthesis compilation fails in my sdsoc 2016.2 but if I just write the number directly so 4096 instead of (1024*4) it compiles and works.

LikeLike

I have realised this problem in the SDSoC 2016.2 before.

If you add “#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)” (or similar pragma) before the “mmult_accel” function prototype in the matrix_mult.h then it can be compiled even with complex expressions passed to the zero_copy pragma.

It seems the new version of the SDSoC cannot decide on the memory port when the array size in the zero_copy pragma is an expression. Probably they will correct that in future versions.

LikeLike

Hi.

I was trying your method.

What is the difference between standalone and under linux code?

Is there any difference in approach or method used?

Please explain.

LikeLike

Hi,

Conceptually, both are the same. However, in Linux we had a limitation for allocating the cached contiguous memory to be accessed in the accelerator. It is possible to use the standalone version in the Linux if cache coherency is not necessary.

LikeLike