Large Matrix-Matrix Multiplication on Dual-Core Cortex-A9+NEON

Goal: Fast matrix-matrix multiplication in software
Approach: Cache-friendly code using both cores (with OpenMP) and the NEON vector processor
Benefits: Very fast matrix-matrix multiplication
Credit: This work was done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

In one of my previous posts I introduced an implementation of large matrix-matrix multiplication on an FPGA, and it was much faster than the software implementation.

However, that software implementation was quite simple. To make the comparison fair, we should utilise almost all of the software resources available on the hardware platform. Here, I am going to introduce a software implementation that uses both Cortex-A9 cores and the NEON vector processor. In addition, the code is cache-friendly.

This is the code:

#define BLOCK 8
/* a is N x M, b is M x P, c is N x P */
unsigned Fast_MMM(float *a, float *b, float *c) {
  int i_m, j_m, k_m, i_block, j_block, k_block;
  float *c_p, *b_p, *a_p;
#pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block)
#pragma omp for schedule(static)
  for (i_m = 0; i_m < N; i_m += BLOCK) {
    for (j_m = 0; j_m < P; j_m += BLOCK) {
      for (k_m = 0; k_m < M; k_m += BLOCK) {
        c_p = c + i_m*P + j_m;
        a_p = a + i_m*M + k_m;
        for (i_block = 0; i_block < BLOCK; i_block++) {
          b_p = b + k_m*P + j_m;
          for (j_block = 0; j_block < BLOCK; j_block++) {
            for (k_block = 0; k_block < BLOCK; k_block++) {
              c_p[k_block] += a_p[j_block] * b_p[k_block];
            }
            b_p += P;
          }
          c_p += P;
          a_p += M;
        }
      }
    }
  }
  return 1;
}

For compilation, use the following commands.

without NEON

gcc -O3 -std=gnu9x -mcpu=cortex-a9 -fopenmp -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp

with NEON

gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon

For execution, set the number of OpenMP threads and run the binary:

export OMP_NUM_THREADS=2
./matrix_mult_openmp_neon
Execution Time:

Matrix size   exe-time (OpenMP)   exe-time (OpenMP+NEON)   gain
1024×1024     11003.892 ms        7310.619 ms              1.50
2048×2048     103217.05 ms        65076.317 ms             1.59
4096×4096     836098.79 ms        547190.87 ms             1.52

You can find the code here.

  1. I imagine I need to install some OpenMP libraries on the board for this to work.

    Do you have instructions for this for the ZC702 board, please?

    I thought that for NEON it is necessary to use some intrinsics so that NEON instructions are generated, but in your case this is not necessary?

    1. Thanks for your question.

      As the SDSoC environment doesn't support OpenMP directly, we should use OpenMP as a library.

      One quick option is compiling the MM as a library on Ubuntu Linux on ARM, or using a cross-compiler that supports OpenMP, and then using the library file along with the OpenMP library files, which are libgomp.*

      For NEON I don't think we need to install anything. I tested the elf file on the SDSoC-generated Linux, and just adding the OpenMP library was enough.

  2. It seems that overall this optimised implementation is around 5X faster than standard C?

    1. It seems to show better performance for larger matrices (based on this compilation):

      1024: simple C = 66545 msec, fast version = 11003.892 msec, gain = 6.04
      2048: simple C = 1454058.990 msec, fast version = 103217.05 msec, gain = 14.21

      I think I should do more research and comparison across different matrix sizes, as the gain increases significantly with matrix size.

      1. Thanks,
        In my experiments with standard C I get:

        1024 => 66555 ms (same as you)
        2048 => 594035 ms (your version seems to be 10x slower?)

        I might have to double-check these numbers.
        I get 6 million ms for 4096 in C.

      2. Many thanks for your comment.

        I corrected the compiler options for NEON; sorry for the mistake.

        The execution times are also updated.

    2. Thanks,
      you're right, 6.04064e+06 msec is for 4096×4096.

      1. Thanks, it works on the ARM device.

        As a test I modified the compilation flags and removed NEON, so:

        gcc matrix_mult_openmp_neon.c -O3 -std=gnu9x -O3 -mcpu=cortex-a9 -mfloat-abi=hard -ffast-math -fopenmp -lm -o matrix_mult_openmp_neon

        The performance is the same, so it looks like NEON instructions are not generated and vectorisation is not working.

        That probably needs some hand assembly. In any case, I have seen some examples with NEON and small 4×4 matrices that fit the vector length.

  3. Hello,

    I seem to be having problems getting the NEON results. After using the new compilation command I get the results below.

    Comparing the old with the new commands for NEON:

    gcc matrix_mult_openmp_neon.c -O3 -std=gnu9x -O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm -o matrix_mult_openmp_neon

    gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon

    The only difference seems to be that in the new one you replaced -O3 with -O2, but my result below is slower? Have you modified the code as well?

    root:~/matrix_openmp# export OMP_NUM_THREADS=2
    root:~/matrix_openmp# ./matrix_mult_openmp_neon
    Hello Large MM
    Matrix size= 2048 * 2048
    Fast MM execution time 486209.122000 ms elapsed

    Thanks for your help.

    1. 1- Could you please let me know your gcc version?
      2- And which Linux are you using?
      3- I left two bin files at

      Please check them on the Zynq to find out whether the problem is with the libraries/runtime system or with the compiler.

      1. Thanks,

        I upgraded from gcc 4.6 to gcc 4.9 and got your results with NEON as well.
        It is strange that to get vectorisation it is necessary to drop the optimisation level from -O3 to -O2?

        root:~/matrix_openmp# ./matrix_mult_openmp_neon
        Hello Large MM
        Matrix size= 1024 * 1024
        Fast MM execution time 7371.335000 ms elapsed

  4. Congratulations!

  5. I am working on LDPC encoding and decoding for a Kintex board, and afterwards I need to communicate with a MicroBlaze. I wrote the LDPC in C and it works, but in HLS I don't know how to declare the matrix. Please help me out.

    HLS source code

    void LDPC_Encoding(int H_Matrix[10][10],int msg_length, int message[10][10], int Generator[10][10], int dout[10][10])

    H_Matrix = [1 1 0 0 1 0
    1 0 0 1 0 1
    1 1 1 0 0 1]
    Row =3, columns =6
    message length =3
    message =011

    That is, express it as

    Hsys = [I| P]

    This is my H parity check matrix

    H= [1 1 0 0 1 0;
    1 0 0 1 0 1;
    1 1 1 0 0 1 ];

    1. Arranging the parity check matrix in systematic form using row and column operations

    Hsys = [I| P]

    systematic parity check matrix, Hsys= [0 0 1 0 1 1
    0 1 0 1 1 1;
    1 0 0 1 0 1;]
    2. Rearranging the systematic parity check matrix

    Generator matrix G =[Ptranspose\I];

    Therefore, G= [0 1 1 1 0 0 ;
    1 1 0 0 1 0 ;
    1 1 1 0 0 1]

    3. Generate the codeword in by multiplying message with generator matrix G

    c=m.G //c =codeword

    Through C code I have completed tasks 1 & 2, but the codeword generation shows a wrong result.

    for e.g. m=011
    c= [011] [0 1 1 1 0 0 ;
    1 1 0 0 1 0 ;
    1 1 1 0 0 1]

  6. Hi,

    I'm relatively new to Vivado HLS. I am attempting to create and initialise a two-dimensional array using C++ arbitrary precision types:

    I have written an error-correction code which uses a binary generator matrix (declared and initialised within the code) in which each element of the matrix is declared as an int (as far as I understand, the size of an int in the Xilinx Software Development Kit (XSDK) is 32 bits; please correct me if I am wrong). What I want to do is declare a binary matrix that represents each element as a bit rather than a word, to optimise the code and improve memory usage. Is there a bit type in XSDK? Please advise. If not, please suggest whether this can be achieved in any other way. I've attached my matrix declaration code; please check.

    #define ROWS 102 //k
    #define COLS 204
    void GeneratorM(int msg[ROWS], int dout[COLS])

    int Generator[ROWS][COLS]= {0};
    int G[ROWS][COLS] = {0};
    int i,j,k,r,c,n;

    k = ROWS;
    r = ROWS;
    c = COLS;
    n = COLS;
    static int H[ROWS][COLS] = {}

    int Codeword[COLS]= {0};
    int s = 0;

    for (i=0;i<k;i++)
    if(i == j)
    G[i][j] = 1;

    G[j][k+i] = H[i][j];


    //Code word generation

    s = s + msg[i]*Generator[i][j];
    Codeword[j] = s % 2;
    s = 0;

    1. Hi,

      If you are going to use Vivado HLS, have a look at page 605 of the Xilinx UG902 document; it explains how to define and use C/C++ arbitrary precision types.
