Goal: Fast Matrix-Matrix Multiplication on Software

Approach: Cache-friendly code, using dual-core (with OpenMP) and NEON vector processor

Benefits: Very fast Matrix-Matrix Multiplication

Credit: This work has been done under the ENPOWER project (funded by EPSRC) at University of Bristol.

In one of my previous posts, I introduced an FPGA implementation of large matrix-matrix multiplication that was much faster than the software implementation.

However, that software implementation was quite simple. To make the comparison fair, we have to use almost all of the software resources available on the hardware platform. Here, I introduce a software implementation that uses both cores of the dual-core Cortex-A9 and the NEON vector processor. In addition, the code is cache-friendly.

This is the code:

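```
#define BLOCK 8

/* Blocked (tiled) C += A * B with row-major A (N x M), B (M x P), C (N x P).
   N, M and P are assumed to be defined elsewhere as multiples of BLOCK,
   and C is assumed to be zero-initialised by the caller. */
unsigned Fast_MMM(float *a, float *b, float *c) {
    int i_m, j_m, k_m, i_block, j_block, k_block;
    float *c_p, *b_p, *a_p;
    /* The row pointers must be private as well; if they are shared,
       the threads race on them and corrupt each other's blocks. */
    #pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block,c_p,b_p,a_p)
    {
        /* Each thread gets a contiguous range of block rows of C. */
        #pragma omp for schedule(static)
        for (i_m = 0; i_m < N; i_m += BLOCK) {
            for (j_m = 0; j_m < P; j_m += BLOCK) {
                for (k_m = 0; k_m < M; k_m += BLOCK) {
                    c_p = c + i_m*P + j_m;   /* top-left of the C block */
                    a_p = a + i_m*M + k_m;   /* top-left of the A block */
                    for (i_block = 0; i_block < BLOCK; i_block++) {
                        b_p = b + k_m*P + j_m;   /* top-left of the B block */
                        /* j_block walks the shared (k) dimension,
                           k_block walks the columns of the C and B rows. */
                        for (j_block = 0; j_block < BLOCK; j_block++) {
                            for (k_block = 0; k_block < BLOCK; k_block++) {
                                c_p[k_block] += a_p[j_block] * b_p[k_block];
                            }
                            b_p += P;   /* next row of B */
                        }
                        c_p += P;   /* next row of C */
                        a_p += M;   /* next row of A */
                    }
                }
            }
        }
    }

    return 1;
}
```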
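One way to sanity-check the blocked kernel is to compare its result against a plain triple loop. The sketch below is not part of the original project: `naive_mmm` and `max_abs_diff` are hypothetical helper names, the dimensions are passed as arguments instead of using the `N`, `M`, `P` constants, and row-major storage (A is n x m, B is m x p, C is n x p) is assumed because that is what the pointer arithmetic in `Fast_MMM` implies.

```c
#include <stddef.h>

/* Reference product C = A * B for row-major A (n x m), B (m x p), C (n x p).
   Hypothetical helper, used only to check the blocked kernel. */
void naive_mmm(const float *a, const float *b, float *c, int n, int m, int p) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            float sum = 0.0f;
            for (int k = 0; k < m; k++) {
                sum += a[i * m + k] * b[k * p + j];
            }
            c[i * p + j] = sum;
        }
    }
}

/* Largest element-wise absolute difference between two arrays of length len. */
float max_abs_diff(const float *x, const float *y, size_t len) {
    float worst = 0.0f;
    for (size_t i = 0; i < len; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

Because `Fast_MMM` accumulates into `c` with `+=`, the output matrix should be zeroed before the call. The blocked version also adds the partial products in a different order than the reference loop, so the two results are better compared with a small tolerance on `max_abs_diff` than with exact equality.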

To compile, use one of the following commands.

Without NEON:

gcc -O3 -std=gnu9x -mcpu=cortex-a9 -fopenmp matrix_mult_openmp_neon.c -o matrix_mult_openmp -lm

With NEON:

gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon -lm
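The NEON build relies on GCC's auto-vectorizer to turn the innermost loop into vector instructions. To give a feel for what that means, here is a hand-written sketch of one pass of that loop using NEON intrinsics. It is an illustration only, not the code GCC actually emits and not part of the original source: `axpy8` is a hypothetical name, a block width of 8 floats is assumed (matching `BLOCK`), and a scalar fallback is included so the snippet also builds on non-ARM machines.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* c_row[0..7] += a_val * b_row[0..7]
   One pass of the innermost loop for a block width of 8 (hypothetical helper). */
void axpy8(float *c_row, const float *b_row, float a_val) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Two quad-word registers (4 floats each) cover the 8 columns. */
    float32x4_t c_lo = vld1q_f32(c_row);
    float32x4_t c_hi = vld1q_f32(c_row + 4);
    /* vmlaq_n_f32(acc, v, s) computes acc + v * s lane by lane. */
    c_lo = vmlaq_n_f32(c_lo, vld1q_f32(b_row), a_val);
    c_hi = vmlaq_n_f32(c_hi, vld1q_f32(b_row + 4), a_val);
    vst1q_f32(c_row, c_lo);
    vst1q_f32(c_row + 4, c_hi);
#else
    /* Portable fallback: same arithmetic, one element at a time. */
    for (size_t k = 0; k < 8; k++) {
        c_row[k] += a_val * b_row[k];
    }
#endif
}
```

This also hints at why `-ffast-math` appears in the NEON command line: according to the GCC documentation, on ARMv7 the auto-vectorizer does not generate NEON floating-point code unless unsafe math optimizations are enabled, because NEON there flushes denormal values to zero and so is not fully IEEE 754 compliant.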

For execution