Goal Fast Matrix-Matrix Multiplication on Software
Approach Cache-friendly code, using dual-core (with OpenMP) and NEON vector processor
Benefits Very fast Matrix-Matrix Multiplication
Credit  This work was done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

In one of my previous posts I introduced an FPGA implementation of large matrix-matrix multiplication, which was much faster than the software implementation.

However, that software implementation was quite simple. To make the comparison fair, the software version has to use almost all of the resources available to it on the hardware platform. Here, I introduce a software implementation that uses both cores of the dual-core Cortex-A9 (through OpenMP) and the NEON vector processor. In addition, the code is cache-friendly: it works on small BLOCK×BLOCK tiles so that the data being reused stays in the cache.

This is the code:

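#define BLOCK 8

/* Computes c += a * b for row-major matrices:
   a is N x M, b is M x P, c is N x P.
   N, M and P are not shown here; they must be multiples of BLOCK.
   The kernel only accumulates, so the caller must zero c beforehand. */
unsigned Fast_MMM(float *a, float *b, float *c) {
  int i_m,j_m,k_m,i_block,j_block,k_block;
  float *c_p, *b_p, *a_p;
  /* The tile pointers must be private as well; if they were shared,
     the threads would overwrite each other's pointers (a data race). */
#pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block,a_p,b_p,c_p)
#pragma omp for schedule(static)
    for (i_m = 0; i_m < N; i_m += BLOCK) {
      for (j_m = 0; j_m < P; j_m += BLOCK) {
        for (k_m = 0; k_m < M; k_m += BLOCK) {
          c_p = c+i_m*P+j_m;   /* top-left corner of the C tile */
          a_p = a+i_m*M+k_m;   /* top-left corner of the A tile */
          for (i_block = 0; i_block < BLOCK; i_block++ ) {
            b_p = b+k_m*P+j_m; /* top-left corner of the B tile */
            /* j_block walks the shared (k) dimension of the tile,
               k_block walks the columns of the C and B tiles. */
            for (j_block = 0; j_block < BLOCK; j_block++) {
              for (k_block = 0; k_block < BLOCK; k_block++) {
                c_p[k_block] += a_p[j_block] * b_p[k_block];
              }
              b_p += P;        /* next row of the B tile */
            }
            c_p += P;          /* next row of the C tile */
            a_p += M;          /* next row of the A tile */
          }
        }
      }
    }

  return 1;
}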
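
Before timing a tiled kernel like this one, it is worth checking it against a plain triple loop. The following is a minimal sketch of such a check, not part of the original code: the names mmm_reference, mmm_tiled and max_abs_diff are hypothetical, the dimensions are passed as arguments instead of using N, M and P, and the tiled variant is a single-threaded restatement of the same tiling idea.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 8  /* same tile edge as BLOCK in the post */

/* Naive reference: c (n x p) = a (n x m) * b (m x p), all row-major. */
static void mmm_reference(const float *a, const float *b, float *c,
                          int n, int m, int p) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < p; j++) {
      float sum = 0.0f;
      for (int k = 0; k < m; k++)
        sum += a[i * m + k] * b[k * p + j];
      c[i * p + j] = sum;
    }
}

/* Tiled variant (single-threaded). It accumulates into c, so c must be
   zeroed by the caller, and n, m, p must be multiples of TILE. */
static void mmm_tiled(const float *a, const float *b, float *c,
                      int n, int m, int p) {
  assert(n % TILE == 0 && m % TILE == 0 && p % TILE == 0);
  for (int i0 = 0; i0 < n; i0 += TILE)
    for (int j0 = 0; j0 < p; j0 += TILE)
      for (int k0 = 0; k0 < m; k0 += TILE)
        for (int i = i0; i < i0 + TILE; i++)
          for (int k = k0; k < k0 + TILE; k++) {
            float a_ik = a[i * m + k];
            for (int j = j0; j < j0 + TILE; j++)
              c[i * p + j] += a_ik * b[k * p + j];
          }
}

/* Largest element-wise absolute difference between two arrays. */
static float max_abs_diff(const float *x, const float *y, size_t len) {
  float worst = 0.0f;
  for (size_t i = 0; i < len; i++) {
    float d = x[i] - y[i];
    if (d < 0.0f) d = -d;
    if (d > worst) worst = d;
  }
  return worst;
}
```

A caller would fill a and b, zero the output of the tiled version, run both functions and compare the results with max_abs_diff against a small tolerance. With -ffast-math the summation order may change, so an exact match should not be expected for large matrices.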

For compilation, use one of the following commands:

without NEON

gcc -O3  -std=gnu9x  -mcpu=cortex-a9 -fopenmp  -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp


with NEON

gcc -O2  -std=gnu9x  -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp  -lm matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon
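
The NEON build relies on GCC's auto-vectorizer (-ftree-vectorize) to turn the innermost loop into vector instructions. For comparison, here is a minimal sketch of what that innermost loop (one row of a C tile updated with a scalar from A and a row of a B tile) might look like when written by hand with intrinsics from arm_neon.h. This is not the code used in this post: the function name row8_multiply_add is hypothetical, it assumes a tile width of 8 floats, and a plain C fallback is included so that the sketch also compiles on machines without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: c_row[0..7] += a_scalar * b_row[0..7].
   This corresponds to the k_block loop of the kernel for BLOCK == 8. */
static void row8_multiply_add(float *c_row, const float *b_row,
                              float a_scalar) {
  assert(c_row != NULL && b_row != NULL);
#if HAVE_NEON
  /* Two 128-bit (quad) registers cover the 8 floats of one tile row. */
  for (int j = 0; j < 8; j += 4) {
    float32x4_t c_vec = vld1q_f32(c_row + j);    /* load 4 floats of C */
    float32x4_t b_vec = vld1q_f32(b_row + j);    /* load 4 floats of B */
    c_vec = vmlaq_n_f32(c_vec, b_vec, a_scalar); /* c += b * a_scalar  */
    vst1q_f32(c_row + j, c_vec);                 /* store back to C    */
  }
#else
  /* Scalar fallback with the same result, for non-NEON targets. */
  for (int j = 0; j < 8; j++)
    c_row[j] += a_scalar * b_row[j];
#endif
}
```

In the kernel, such a helper would be called once per (i_block, j_block) pair with c_p, b_p and a_p[j_block]. Whether it beats the auto-vectorized loop would have to be measured.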

For execution





Execution Time:

Matrix size   Execution time, OpenMP (ms)   Execution time, OpenMP + NEON (ms)   Gain
1024×1024     11003.892                     7310.619                             1.50
2048×2048     103217.05                     65076.317                            1.59
4096×4096     836098.79                     547190.87                            1.52
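
The gain column is the OpenMP time divided by the OpenMP + NEON time. The same timings can also be turned into a rough throughput figure, using the conventional count of 2·n³ floating-point operations for an n×n multiplication. The sketch below shows both calculations; the function names speedup and gflops are hypothetical, and the throughput figure assumes the measured time covers only the multiplication itself.

```c
#include <assert.h>

/* Ratio of the baseline time to the optimized time (same units). */
static double speedup(double baseline_ms, double optimized_ms) {
  assert(baseline_ms > 0.0 && optimized_ms > 0.0);
  return baseline_ms / optimized_ms;
}

/* Approximate throughput in GFLOP/s for an n x n by n x n multiplication,
   counting one multiply and one add per inner-loop iteration (2 * n^3). */
static double gflops(int n, double elapsed_ms) {
  assert(n > 0 && elapsed_ms > 0.0);
  double flops = 2.0 * (double)n * (double)n * (double)n;
  double seconds = elapsed_ms * 1e-3;
  return flops / seconds / 1e9;
}
```

For the 1024×1024 row, speedup(11003.892, 7310.619) gives a gain of about 1.5, and gflops(1024, 7310.619) gives roughly 0.29 GFLOP/s for the OpenMP + NEON build.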

You can find the code here.