**Goal:** Fast matrix-matrix multiplication in software
**Approach:** Cache-friendly code, using the dual-core (with OpenMP) and the NEON vector processor
**Benefits:** Very fast matrix-matrix multiplication
**Credit:** This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

In one of my previous posts I introduced an implementation of large matrix-matrix multiplication on an FPGA, and it was much faster than the software implementation.

However, the software implementation was quite simple. To make the comparison fair, we have to utilise almost all the software resources available on the hardware platform. Here, I am going to introduce a software implementation that uses the dual-core Cortex-A9 and the NEON vector processor. In addition, the code is cache-friendly.
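For reference, the simple baseline behaves like the textbook triple loop below (a sketch with my own names, not the original source). Its inner product walks down a column of b, so every access jumps a whole row ahead in memory, which is what makes it cache-unfriendly for large matrices:

```c
#include <stddef.h>

/* Naive baseline: c (N x P) += a (N x M) * b (M x P), all row-major.
   The inner loop reads b[k*p + j] with a stride of p floats, so each
   access touches a different cache line once the matrices are large. */
void naive_mmm(const float *a, const float *b, float *c,
               size_t n, size_t m, size_t p) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < p; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < m; k++)
                sum += a[i * m + k] * b[k * p + j];
            c[i * p + j] += sum;
        }
}
```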

This is the code:

[code language=”c”]
#define BLOCK 8

/* c (N x P) += a (N x M) * b (M x P), all row-major.
   N, M and P are assumed to be defined elsewhere as multiples of BLOCK. */
unsigned Fast_MMM(float *a, float *b, float *c) {
    int i_m, j_m, k_m, i_block, j_block, k_block;
    float *c_p, *b_p, *a_p;

#pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block)
    {
#pragma omp for schedule(static)
        for (i_m = 0; i_m < N; i_m += BLOCK) {
            for (j_m = 0; j_m < P; j_m += BLOCK) {
                for (k_m = 0; k_m < M; k_m += BLOCK) {
                    /* multiply one BLOCK x BLOCK tile of a by one tile of b */
                    c_p = c + i_m * P + j_m;
                    a_p = a + i_m * M + k_m;
                    for (i_block = 0; i_block < BLOCK; i_block++) {
                        b_p = b + k_m * P + j_m;
                        for (j_block = 0; j_block < BLOCK; j_block++) {
                            /* unit-stride updates over a row of c and a row
                               of b: this is the loop the compiler vectorises */
                            for (k_block = 0; k_block < BLOCK; k_block++) {
                                c_p[k_block] += a_p[j_block] * b_p[k_block];
                            }
                            b_p += P;
                        }
                        c_p += P;
                        a_p += M;
                    }
                }
            }
        }
    }

    return 1;
}
[/code]
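A minimal way to exercise the routine might look like this (a sketch: the kernel is reproduced with N, M and P shrunk to a single 8×8 block so the example is self-contained; the OpenMP pragmas are simply ignored if you compile without -fopenmp):

```c
#include <string.h>

#define BLOCK 8
#define N 8   /* rows of a and c; must be a multiple of BLOCK */
#define M 8   /* cols of a, rows of b */
#define P 8   /* cols of b and c */

/* The blocked kernel from the post, unchanged apart from layout. */
unsigned Fast_MMM(float *a, float *b, float *c) {
    int i_m, j_m, k_m, i_block, j_block, k_block;
    float *c_p, *b_p, *a_p;
#pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block)
    {
#pragma omp for schedule(static)
        for (i_m = 0; i_m < N; i_m += BLOCK)
            for (j_m = 0; j_m < P; j_m += BLOCK)
                for (k_m = 0; k_m < M; k_m += BLOCK) {
                    c_p = c + i_m * P + j_m;
                    a_p = a + i_m * M + k_m;
                    for (i_block = 0; i_block < BLOCK; i_block++) {
                        b_p = b + k_m * P + j_m;
                        for (j_block = 0; j_block < BLOCK; j_block++) {
                            for (k_block = 0; k_block < BLOCK; k_block++)
                                c_p[k_block] += a_p[j_block] * b_p[k_block];
                            b_p += P;
                        }
                        c_p += P;
                        a_p += M;
                    }
                }
    }
    return 1;
}

/* Driver: multiply the identity by b, so the result must equal b. */
int run_example(void) {
    static float a[N * M], b[M * P], c[N * P];
    memset(c, 0, sizeof c);                      /* kernel accumulates into c */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i * M + k] = (i == k) ? 1.0f : 0.0f;
    for (int k = 0; k < M; k++)
        for (int j = 0; j < P; j++)
            b[k * P + j] = (float)(k * P + j);
    Fast_MMM(a, b, c);
    for (int i = 0; i < N * P; i++)
        if (c[i] != b[i]) return 0;
    return 1;
}
```

Note that c must be zeroed before the call, since the kernel accumulates into it.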

To compile, use one of the following commands:

Without NEON:

gcc -O3  -std=gnu9x  -mcpu=cortex-a9 -fopenmp  -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp

With NEON:

gcc -O2  -std=gnu9x  -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp  -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon

For execution, run:

./matrix_mult_openmp_neon

or

./matrix_mult_openmp

Execution Time:

| Matrix size | exe-time (OpenMP) | exe-time (OpenMP+NEON) | gain |
|-------------|-------------------|------------------------|------|
| 1024×1024   | 11003.892 ms      | 7310.619 ms            | 1.50 |
| 2048×2048   | 103217.05 ms      | 65076.317 ms           | 1.59 |
| 4096×4096   | 836098.79 ms      | 547190.87 ms           | 1.52 |

You can find the code here.

##### 15 thoughts on “Large Matrix-Matrix Multiplication on Dual-Core Cortex-A9+NEON”
1. Jose Nunez-Yanez says:

I imagine I need to install some openmp libraries in the board for this to work.

Do you have some instructions for this for the zc702 board, please ?

I thought that for NEON it is necessary to use some intrinsics so that NEON instructions are generated, but in your case this is not necessary?

1. Mohammad says:

Thanks for your question.

As the SDSoC environment doesn’t support OpenMP directly, we should use OpenMP as a library.

One quick option is compiling the MM code on Ubuntu Linux on ARM, or using a cross compiler that supports OpenMP, and then using the library file along with the OpenMP library files, which are libgomp.*

For NEON I don’t think we need to install anything: I tested the elf file on the SDSoC-generated Linux, and just adding the OpenMP library was enough.

2. Jose Nunez-Yanez says:

It seems that overall this optimised implementation is around 5X faster than the standard C ?

1. Mohammad says:

It seems that for larger matrices it shows better performance (based on this compilation):

1024: simple C = 66545 ms, fast version = 11003.892 ms, gain = 6.04
2048: simple C = 1454058.990 ms, fast version = 103217.05 ms, gain = 14.21

I think I should do more research and compare different matrix sizes, as the gain increases significantly for larger matrices.

1. Jose Nunez-Yanez says:

Thanks,
In my experiments standard C I get :

1024 => 66555 ms (as you)
2048 => 594035 ms ( your version seems to be 10x slower ? )

I might have to double check these numbers.
I get 6 million ms for 4096 in C.

2. Mohammad says:

Many thanks for your comment.

I corrected the compiler option for NEON; sorry for the mistake.

The execution times have also been updated.

2. Mohammad says:

Thanks
You’re right, 6.04064e+06 ms is for 4096×4096.

1. Jose Nunez-Yanez says:

Thanks, it works on the ARM device.

As a test I modified the compilation flags and removed NEON, so:

gcc matrix_mult_openmp_neon.c -O3 -std=gnu9x -O3 -mcpu=cortex-a9 -mfloat-abi=hard -ffast-math -fopenmp -lm -o matrix_mult_openmp_neon

performance is the same, so it looks like NEON instructions are not generated and vectorization is not working.

Probably that needs some hand-written assembly. In any case, I have seen some examples with NEON and small 4×4 matrices that fit the vector length.

3. Jose Nunez-Yanez says:

Hello,

I seem to be having problems getting the neon results. After using the new compilation command I get the results below:

Comparing the old with the new commands for neon :

old:
gcc matrix_mult_openmp_neon.c -O3 -std=gnu9x -O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm -o matrix_mult_openmp_neon

new:
gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon

The only difference seems to be that in the new one you replaced -O3 with -O2, but my result below is slower? Have you modified the code as well?

root:~/matrix_openmp# ./matrix_mult_openmp_neon
Hello Large MM
Matrix size= 2048 * 2048
Fast MM execution time 486209.122000 ms elapsed

Thanks for your help.

1. Mohammad says:

1- Could you please let me know your gcc version?
2- And which Linux are you using?
3- I left two bin files at

Please check them on the Zynq to find out whether the problem is with the libraries/runtime system or with the compiler.

1. Jose Nunez-Yanez says:

Thanks,

I upgraded from gcc 4.6 to gcc 4.9 and got your results with neon as well.
It is strange that to get it to use vectorization it is necessary to switch the optimization level from -O3 to -O2?

root:~/matrix_openmp# ./matrix_mult_openmp_neon
Hello Large MM
Matrix size= 1024 * 1024
Fast MM execution time 7371.335000 ms elapsed

4. Mohammad says:

Congratulations!

5. Thaus says:

I am working on LDPC encoding and decoding for a Kintex board, and afterwards I need to communicate with a MicroBlaze. I wrote the LDPC in C and it’s working. But in HLS, I don’t know how to declare the matrix. Please help me out.

HLS source code

void LDPC_Encoding(int H_Matrix,int msg_length, int message, int Generator, int dout)

H_Matrix = [1 1 0 0 1 0
1 0 0 1 0 1
1 1 1 0 0 1]
Row =3, columns =6
message length =3
message =011

That is, expressing it as

Hsys = [I| P]

This is my H parity check matrix

H= [1 1 0 0 1 0;
1 0 0 1 0 1;
1 1 1 0 0 1 ];

1. Arranging the parity check matrix in systematic form using row and column operations

Hsys = [I| P]

systematic parity check matrix, Hsys= [0 0 1 0 1 1
0 1 0 1 1 1;
1 0 0 1 0 1;]
2. Rearranging the systematic parity check matrix

Generator matrix G =[Ptranspose\I];

Therefore, G= [0 1 1 1 0 0 ;
1 1 0 0 1 0 ;
1 1 1 0 0 1]

3. Generate the codeword in by multiplying message with generator matrix G

c=m.G //c =codeword

Through C code, I have completed coding tasks 1 & 2, but when generating the codeword it shows the wrong result.

for e.g. m=011
c=  [0 1 1 1 0 0 ;
1 1 0 0 1 0 ;
1 1 1 0 0 1]
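As an aside, step 3 (c = m·G over GF(2)) can be sketched in plain C; the names below are illustrative, not from any Xilinx template. Each codeword bit is the XOR (addition mod 2) of the message bits selected by the corresponding column of G:

```c
/* GF(2) codeword generation, c = m * G (mod 2).
   K_MSG message bits, N_CODE codeword bits; bit j of the codeword is
   the XOR of the message bits m[i] for which G[i][j] == 1. */
#define K_MSG  3
#define N_CODE 6

void encode_gf2(int m[K_MSG], int G[K_MSG][N_CODE], int c[N_CODE]) {
    for (int j = 0; j < N_CODE; j++) {
        int s = 0;                 /* fresh accumulator per column */
        for (int i = 0; i < K_MSG; i++)
            s ^= m[i] & G[i][j];   /* AND = multiply, XOR = add mod 2 */
        c[j] = s;
    }
}
```

With the G and m = 011 above, this gives c = 001011. Keeping the accumulator XOR-based (or resetting it per column before taking % 2) avoids a common source of wrong codewords.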

6. Michael says:

Hi,

I’m relatively new to Vivado HLS. I am attempting to create and initialize a 2-dimensional array using C++ arbitrary precision types:

I have written an error-correction code which uses a binary generator matrix (declared and initialized within the code), in which each individual element of the matrix is declared as an int (as far as I understand, the size of int in the Xilinx Software Development Kit (XSDK) is 32 bits; please correct me if I am wrong). What I want to do is declare a binary matrix that represents each element as a bit rather than a word, to optimise the code and reduce its memory footprint. Is there a bit type in XSDK? Please advise. If not, please suggest whether this can be achieved in any other way. I’ve attached my matrix declaration code; please check.

#define ROWS 102 //k
#define COLS 204
void GeneratorM(int msg[ROWS], int dout[COLS])
{
    int Generator[ROWS][COLS] = {0};
    int G[ROWS][COLS] = {0};
    int i, j, k, r, c, n;

    k = ROWS;
    r = ROWS;
    c = COLS;
    n = COLS;
    static int H[ROWS][COLS] = {0};   /* the original was missing the semicolon */

    int Codeword[COLS] = {0};
    int s = 0;

    for (i = 0; i < k; i++)
        for (j = 0; j < k; j++)
            if (i == j)
                G[i][j] = 1;

    for (i = 0; i < r; i++)
        for (j = 0; j < k; j++)
            G[j][k + i] = H[i][j];

    for (i = 0; i < r; i++)
    {
        for (j = 0; j < c; j++)
            Generator[i][j] = G[i][j];
    }

    //Code word generation

    for (j = 0; j < n; j++)
    {
        for (i = 0; i < k; i++)
        {
            s = s + msg[i] * Generator[i][j];
        }
        Codeword[j] = s % 2;
        s = 0;
    }
}
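On the bit-type question: plain C in XSDK has no 1-bit integer type, but a binary matrix can be packed one bit per element into 32-bit words, as in the sketch below (the names are illustrative). In Vivado HLS itself, the arbitrary-precision ap_uint<1> type described in UG902 serves the same purpose:

```c
#include <stdint.h>

/* Pack a binary matrix one bit per element: row i, column j lives in
   bit (j % 32) of word bits[i][j / 32].  For a 102 x 204 matrix this
   cuts storage from 102*204 ints to 102*7 words. */
#define B_ROWS 102
#define B_COLS 204
#define WORDS_PER_ROW ((B_COLS + 31) / 32)

typedef struct {
    uint32_t bits[B_ROWS][WORDS_PER_ROW];
} BitMatrix;

static inline void bm_set(BitMatrix *m, int i, int j, int v) {
    uint32_t mask = 1u << (j % 32);
    if (v) m->bits[i][j / 32] |= mask;
    else   m->bits[i][j / 32] &= ~mask;
}

static inline int bm_get(const BitMatrix *m, int i, int j) {
    return (int)((m->bits[i][j / 32] >> (j % 32)) & 1u);
}
```

A further benefit for GF(2) arithmetic is that one XOR of two words then adds 32 matrix elements at once.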

1. Mohammad says:

Hi,

If you are going to use Vivado HLS, then have a look at page 605 of the Xilinx UG902 document.