Large Matrix-Matrix Multiplication on Dual-Core Cortex-A9+NEON

Date: November 24, 2016
Author: Mohammad

Goal	Fast matrix-matrix multiplication in software
Approach	Cache-friendly code, using both cores (with OpenMP) and the NEON vector processor
Benefits	Very fast matrix-matrix multiplication
Credit	This work was done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

In one of my previous posts I introduced an FPGA implementation of large matrix-matrix multiplication, which was much faster than the software implementation.

However, that software implementation was quite simple. To make the comparison fair, we have to use almost all of the software resources available on the hardware platform. Here, I introduce a software implementation that uses both cores of the dual-core Cortex-A9 and the NEON vector processor. In addition, the code is cache-friendly.

This is the code

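[code language="c"]
#define BLOCK 8

/* Computes C += A * B for row-major A (N x M), B (M x P) and C (N x P).
 * N, M and P come from the rest of the source file and are assumed
 * to be multiples of BLOCK; c must already hold zeros if a plain
 * product is wanted. */
unsigned Fast_MMM(float *a, float *b, float *c) {
    int i_m, j_m, k_m, i_block, j_block, k_block;
    float *c_p, *b_p, *a_p;

    /* c_p, b_p and a_p are listed as private as well: left shared,
     * both threads would overwrite each other's tile pointers. */
    #pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block,c_p,b_p,a_p)
    {
        /* Each thread gets a contiguous range of BLOCK-row strips of C. */
        #pragma omp for schedule(static)
        for (i_m = 0; i_m < N; i_m += BLOCK) {
            for (j_m = 0; j_m < P; j_m += BLOCK) {
                for (k_m = 0; k_m < M; k_m += BLOCK) {
                    c_p = c + i_m*P + j_m;   /* top-left of the C tile */
                    a_p = a + i_m*M + k_m;   /* top-left of the A tile */
                    for (i_block = 0; i_block < BLOCK; i_block++) {
                        b_p = b + k_m*P + j_m;   /* top-left of the B tile */
                        for (j_block = 0; j_block < BLOCK; j_block++) {
                            /* Unit-stride inner loop, a candidate for auto-vectorization. */
                            for (k_block = 0; k_block < BLOCK; k_block++) {
                                c_p[k_block] += a_p[j_block] * b_p[k_block];
                            }
                            b_p += P;   /* next row of the B tile */
                        }
                        c_p += P;   /* next row of the C tile */
                        a_p += M;   /* next row of the A tile */
                    }
                }
            }
        }
    }

    return 1;
}
[/code]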
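
Before trusting any timing, it may be worth checking a blocked kernel against a plain triple loop on a small input. The sketch below is not part of the original source: `naive_mmm`, `blocked_mmm` and `max_abs_diff` are hypothetical helper names, the matrix sizes are passed as arguments instead of using the `N`, `M`, `P` macros, and the blocked version is serial so that it compiles with or without OpenMP.

```c
#include <stddef.h>

#define BLK 8  /* same tile size as BLOCK in the post */

/* Reference: c (n x p) += a (n x m) * b (m x p), all row-major. */
void naive_mmm(const float *a, const float *b, float *c,
               int n, int m, int p)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < m; k++)
            for (int j = 0; j < p; j++)
                c[i * p + j] += a[i * m + k] * b[k * p + j];
}

/* Serial blocked version using the same tiling order as Fast_MMM.
 * Assumption: n, m and p are multiples of BLK. */
void blocked_mmm(const float *a, const float *b, float *c,
                 int n, int m, int p)
{
    for (int i0 = 0; i0 < n; i0 += BLK)
        for (int j0 = 0; j0 < p; j0 += BLK)
            for (int k0 = 0; k0 < m; k0 += BLK)
                for (int i = i0; i < i0 + BLK; i++)
                    for (int k = k0; k < k0 + BLK; k++)
                        for (int j = j0; j < j0 + BLK; j++)
                            c[i * p + j] += a[i * m + k] * b[k * p + j];
}

/* Largest absolute element-wise difference between x and y. */
float max_abs_diff(const float *x, const float *y, size_t len)
{
    float worst = 0.0f;
    for (size_t i = 0; i < len; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Filling two output buffers with zeros, running both routines on the same inputs and comparing them with `max_abs_diff` gives a quick sanity check. With `-ffast-math` the compiler may reorder additions, so a small tolerance is probably safer than an exact comparison.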

To compile, use one of the following commands.

Without NEON:

gcc -O3 -std=gnu9x -mcpu=cortex-a9 -fopenmp -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp

With NEON:

gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon
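
The NEON command relies on GCC's auto-vectorizer (`-ftree-vectorize`) rather than on hand-written vector code. For comparison, a hand-vectorized version of the innermost loop could look roughly like the sketch below. It is not taken from the original source: `row_update8` is a hypothetical helper, it assumes a tile width of 8 floats, and it falls back to scalar code when the compiler is not targeting NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* c_row[0..7] += a_val * b_row[0..7]
 * This is the k_block loop of the blocked kernel for BLOCK == 8. */
void row_update8(float *c_row, const float *b_row, float a_val)
{
#if HAVE_NEON
    /* Two quad-word steps, four floats each. */
    for (int j = 0; j < 8; j += 4) {
        float32x4_t c4 = vld1q_f32(c_row + j);   /* load 4 floats of C */
        float32x4_t b4 = vld1q_f32(b_row + j);   /* load 4 floats of B */
        c4 = vmlaq_n_f32(c4, b4, a_val);         /* c4 += b4 * a_val   */
        vst1q_f32(c_row + j, c4);                /* store back to C    */
    }
#else
    /* Scalar fallback for builds without NEON. */
    for (int j = 0; j < 8; j++)
        c_row[j] += a_val * b_row[j];
#endif
}
```

Whether this beats the auto-vectorized loop would have to be measured on the board; the sketch only shows the shape of the operation.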

To run:

export OMP_NUM_THREADS=2

./matrix_mult_openmp_neon

./matrix_mult_openmp
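
When two binaries behave the same, it can help to confirm what each one was actually built with. The sketch below uses the compiler's predefined macros (`__ARM_NEON` / `__ARM_NEON__` and `_OPENMP`); the function names are hypothetical and not part of the original source. Note that the NEON macro only says the compiler was allowed to emit NEON instructions, not that a particular loop was vectorized.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if this file was compiled with NEON enabled, else 0. */
int built_with_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Upper bound on the threads a parallel region will use.
 * Honours OMP_NUM_THREADS; returns 1 when built without -fopenmp. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Prints a two-line summary, e.g. at program start-up. */
void print_build_info(void)
{
    printf("NEON enabled at compile time: %s\n",
           built_with_neon() ? "yes" : "no");
    printf("OpenMP max threads: %d\n", max_threads());
}
```

Calling `print_build_info()` before the timed section makes it easier to tell the two executables apart in saved logs.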

Execution Time:

Matrix size	Execution time (OpenMP)	Execution time (OpenMP + NEON)	Gain
1024×1024	11003.892 ms	7310.619 ms	1.50
2048×2048	103217.05 ms	65076.317 ms	1.59
4096×4096	836098.79 ms	547190.87 ms	1.52
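
To compare these timings with other platforms, it can be convenient to convert them to throughput. A square n×n multiplication performs n³ multiplications and n³ additions, so about 2n³ floating-point operations. The helpers below are a small sketch with hypothetical names; for the 1024×1024 NEON run they give roughly 0.29 GFLOP/s.

```c
/* Floating-point operations in an n x n by n x n multiply:
 * n^3 multiplications plus n^3 additions. */
double mmm_flops(double n)
{
    return 2.0 * n * n * n;
}

/* Throughput in GFLOP/s for matrix size n and elapsed time in ms. */
double gflops(double n, double elapsed_ms)
{
    double seconds = elapsed_ms * 1e-3;
    return mmm_flops(n) / seconds / 1e9;
}

/* Gain of a new timing over a baseline timing (both in ms). */
double speedup(double baseline_ms, double new_ms)
{
    return baseline_ms / new_ms;
}
```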

You can find the code here.

15 thoughts on “Large Matrix-Matrix Multiplication on Dual-Core Cortex-A9+NEON”


Jose Nunez-Yanez says:

November 24, 2016 at 2:35 pm

I imagine I need to install some OpenMP libraries on the board for this to work.

Do you have instructions for this for the ZC702 board, please?

I thought that NEON instructions are only generated when intrinsics are used, but in your case that is not necessary?

1. Mohammad says:
  
  November 24, 2016 at 2:42 pm
  
  Thanks for your question.
  
  As the SDSoC environment doesn’t support OpenMP directly, we have to use OpenMP as a library.
  
  One quick option is to compile the MM code as a library, either natively on Ubuntu Linux on the ARM or with a cross compiler that supports OpenMP, and then use that library file along with the OpenMP library files, which are libgomp.*
  
  For NEON I don’t think we need to install anything. I tested the ELF file on the SDSoC-generated Linux, and just adding the OpenMP library is enough.
  
Jose Nunez-Yanez says:

November 24, 2016 at 2:37 pm

It seems that overall this optimised implementation is around 5x faster than the standard C version?

1. Mohammad says:
  
  November 24, 2016 at 2:48 pm
  
  It seems to perform better for larger matrices (based on this compilation):
  
  1024 simple c = 66545 msec Fast version = 11003.892 msec gain=6.04
  2048 simple c = 1454058.990 msec Fast Version = 103217.05 msec gain=14.21
  
  I think I should do more research and compare different matrix sizes, as the gain increases significantly for larger matrices.
  
  1. Jose Nunez-Yanez says:
    
    November 24, 2016 at 2:56 pm
    
    Thanks,
    In my experiments with standard C I get:
    
    1024 => 66555 ms (same as you)
    2048 => 594035 ms (your version seems to be 10x slower?)
    
    I might have to double-check these numbers.
    I get 6 million ms for 4096 in C.
    
  2. Mohammad says:
    
    November 24, 2016 at 5:08 pm
    
    Many thanks for your comment.
    
    I corrected the compiler options for NEON. Sorry for the mistake.
    
    The execution times have also been updated.
    
2. Mohammad says:
  
  November 24, 2016 at 3:02 pm
  
  Thanks.
  You’re right, 6.04064e+06 msec is the time for 4096×4096.
  
  1. Jose Nunez-Yanez says:
    
    November 24, 2016 at 3:44 pm
    
    Thanks, it works on the ARM device.
    
    As a test I modified the compilation flags and removed the NEON options:
    
    gcc matrix_mult_openmp_neon.c -O3 -std=gnu9x -O3 -mcpu=cortex-a9 -mfloat-abi=hard -ffast-math -fopenmp -lm -o matrix_mult_openmp_neon
    
    Performance is the same, so it looks like NEON instructions are not being generated and vectorization is not working.
    
    That probably needs some hand-written assembly. In any case, I have seen some NEON examples with small 4×4 matrices that fit the vector length.
    
Jose Nunez-Yanez says:

November 25, 2016 at 9:30 am

Hello,

I seem to be having problems getting the neon results. After using the new compilation command I get the results below:

Comparing the old with the new commands for neon :

old:
gcc matrix_mult_openmp_neon.c -O3 -std=gnu9x -O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm -o matrix_mult_openmp_neon

new:
gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon

The only difference seems to be that in the new command you replaced -O3 with -O2, but my result below is slower. Have you modified the code as well?

root:~/matrix_openmp# export OMP_NUM_THREADS=2
root:~/matrix_openmp# ./matrix_mult_openmp_neon
Hello Large MM
Matrix size= 2048 * 2048
Fast MM execution time 486209.122000 ms elapsed

Thanks for your help.

1. Mohammad says:
  
  November 25, 2016 at 10:32 am
  
  1- Could you please let me know which gcc version you are using?
  2- Which Linux are you using?
  3- I left two binary files at
  https://github.com/Hosseinabady/SDSoC-Benchmarks/tree/master/large_matrix_mult/fast_software_implementation/bin
  
  Please run them on the Zynq to find out whether the problem is with the libraries/runtime system or with the compiler.
  
  1. Jose Nunez-Yanez says:
    
    November 25, 2016 at 12:45 pm
    
    Thanks,
    
    I upgraded from gcc 4.6 to gcc 4.9 and got your results with NEON as well.
    Isn't it strange that, to get vectorization, it is necessary to lower the optimization level from -O3 to -O2?
    
    root:~/matrix_openmp# ./matrix_mult_openmp_neon
    Hello Large MM
    Matrix size= 1024 * 1024
    Fast MM execution time 7371.335000 ms elapsed
    
Mohammad says:

November 25, 2016 at 7:22 pm

Congratulations!

Thaus says:

June 19, 2017 at 7:04 am

I am working on LDPC encoding and decoding for a Kintex board, and afterwards I need to communicate with a MicroBlaze. I wrote the LDPC code in C and it works, but in HLS I don't know how to declare a matrix. Please help me out.

HLS source code

void LDPC_Encoding(int H_Matrix[10][10],int msg_length, int message[10][10], int Generator[10][10], int dout[10][10])

H_Matrix = [1 1 0 0 1 0
1 0 0 1 0 1
1 1 1 0 0 1]
Row =3, columns =6
message length =3
message =011

That is, express it as

Hsys = [I| P]

This is my H parity check matrix

H= [1 1 0 0 1 0;
1 0 0 1 0 1;
1 1 1 0 0 1 ];

1. Arranging the parity check matrix in systematic form using row and column operations

Hsys = [I| P]

systematic parity check matrix, Hsys= [0 0 1 0 1 1
0 1 0 1 1 1;
1 0 0 1 0 1;]
2. Rearranging the systematic parity check matrix

Generator matrix G =[Ptranspose\I];

Therefore, G= [0 1 1 1 0 0 ;
1 1 0 0 1 0 ;
1 1 1 0 0 1]

3. Generate the codeword by multiplying the message with the generator matrix G

c=m.G //c =codeword

In C code I have completed tasks 1 and 2, but the codeword generation gives the wrong result.

For example, m=011
c= [011] [0 1 1 1 0 0 ;
1 1 0 0 1 0 ;
1 1 1 0 0 1]

Michael says:

November 20, 2017 at 12:24 pm

Hi,

I’m relatively new to Vivado HLS. I am attempting to create and initialize a two-dimensional array using the C++ arbitrary precision types.

I have written an error correction code that uses a binary generator matrix (declared and initialized within the code), in which each element is declared as an int. As far as I understand, the size of an int in the Xilinx Software Development Kit (XSDK) is 32 bits; please correct me if I am wrong. What I want to do is declare a binary matrix that represents each element as a bit rather than a word, to reduce the memory footprint. Is there a bit type in XSDK? If not, please suggest whether this can be achieved in any other way. I’ve attached my matrix declaration code; please check.

#define ROWS 102 //k
#define COLS 204
void GeneratorM(int msg[ROWS], int dout[COLS])
{
    int Generator[ROWS][COLS] = {0};
    int G[ROWS][COLS] = {0};
    int i, j, k, r, c, n;

    k = ROWS;
    r = ROWS;
    c = COLS;
    n = COLS;
    static int H[ROWS][COLS] = {};

    int Codeword[COLS] = {0};
    int s = 0;

    for (i = 0; i < k; i++)
        for (j = 0; j < k; j++)
            if (i == j)
                G[i][j] = 1;

    for (i = 0; i < r; i++)
        for (j = 0; j < k; j++)
            G[j][k+i] = H[i][j];

    for (i = 0; i < r; i++)
    {
        for (j = 0; j < c; j++)
            Generator[i][j] = G[i][j];
    }

    //Code word generation

    for (j = 0; j < n; j++)
    {
        for (i = 0; i < k; i++)
        {
            s = s + msg[i]*Generator[i][j];
        }
        Codeword[j] = s % 2;
        s = 0;
    }
}

1. Mohammad says:
  
  November 20, 2017 at 1:06 pm
  
  Hi,
  
  If you are going to use Vivado-HLS then have a look at page 605 of Xilinx UG902 document
  
  https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_3/ug902-vivado-high-level-synthesis.pdf
  
  it explains how to define and use C/C++ Arbitrary Precision Types.
  
