Large Matrix-Matrix Multiplication on Dual-Core Cortex-A9+NEON

Mohammad

7 years ago

Goal	Fast matrix-matrix multiplication in software
Approach	Cache-friendly code that uses both cores (with OpenMP) and the NEON vector unit
Benefits	Very fast matrix-matrix multiplication
Credit	This work was done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

In a previous post I introduced an FPGA implementation of large matrix-matrix multiplication, which was much faster than the software implementation.

However, that software implementation was quite simple. To make the comparison fair, the software version should use almost all of the resources available on the hardware platform. Here, I introduce a software implementation that uses both cores of the dual-core Cortex-A9 and the NEON vector unit. In addition, the code is cache-friendly: it works on 8×8 blocks of the matrices.

This is the code:

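[code language="c"]
#define BLOCK 8

/* Blocked multiply for row-major matrices: a is N x M, b is M x P, c is N x P.
 * N, M and P are constants defined elsewhere in the file and must be
 * multiples of BLOCK. The kernel accumulates into c, so c must be zeroed
 * beforehand to obtain the plain product. */
unsigned Fast_MMM(float *a, float *b, float *c) {
    int i_m, j_m, k_m, i_block, j_block, k_block;
    float *c_p, *b_p, *a_p;

    /* The block pointers must be private as well; if they are shared,
     * the two threads race on them. */
    #pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block,c_p,b_p,a_p)
    {
        /* Each thread works on its own row blocks of c. */
        #pragma omp for schedule(static)
        for (i_m = 0; i_m < N; i_m += BLOCK) {
            for (j_m = 0; j_m < P; j_m += BLOCK) {
                for (k_m = 0; k_m < M; k_m += BLOCK) {
                    c_p = c + i_m*P + j_m;   /* top-left of the c block */
                    a_p = a + i_m*M + k_m;   /* top-left of the a block */
                    for (i_block = 0; i_block < BLOCK; i_block++) {
                        b_p = b + k_m*P + j_m;   /* top-left of the b block */
                        /* j_block walks the shared (k) dimension,
                         * k_block walks the columns of c and b. */
                        for (j_block = 0; j_block < BLOCK; j_block++) {
                            for (k_block = 0; k_block < BLOCK; k_block++) {
                                c_p[k_block] += a_p[j_block] * b_p[k_block];
                            }
                            b_p += P;   /* next row of the b block */
                        }
                        c_p += P;   /* next row of the c block */
                        a_p += M;   /* next row of the a block */
                    }
                }
            }
        }
    }

    return 1;
}
[/code]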
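One way to gain confidence in a blocked kernel like the one above is to compare it with a plain triple loop on a small input. The sketch below is not part of the original code: `naive_mmm`, `blocked_mmm` and `max_abs_diff` are hypothetical helper names, the dimensions are passed as arguments instead of the `N`, `M` and `P` macros, and the blocked version is single-threaded to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 8

/* Reference triple loop: c (n x p) = a (n x m) * b (m x p), all row-major. */
void naive_mmm(const float *a, const float *b, float *c, int n, int m, int p) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            float sum = 0.0f;
            for (int k = 0; k < m; k++)
                sum += a[i * m + k] * b[k * p + j];
            c[i * p + j] = sum;
        }
    }
}

/* Single-threaded blocked variant with the same block traversal order.
 * It accumulates into c, so the caller must zero c first. */
void blocked_mmm(const float *a, const float *b, float *c, int n, int m, int p) {
    assert(n % BLOCK == 0 && m % BLOCK == 0 && p % BLOCK == 0);
    for (int i0 = 0; i0 < n; i0 += BLOCK)
        for (int j0 = 0; j0 < p; j0 += BLOCK)
            for (int k0 = 0; k0 < m; k0 += BLOCK)
                for (int i = i0; i < i0 + BLOCK; i++)
                    for (int k = k0; k < k0 + BLOCK; k++) {
                        float a_ik = a[i * m + k];
                        for (int j = j0; j < j0 + BLOCK; j++)
                            c[i * p + j] += a_ik * b[k * p + j];
                    }
}

/* Largest absolute element-wise difference between two arrays. */
float max_abs_diff(const float *x, const float *y, size_t len) {
    float worst = 0.0f;
    for (size_t i = 0; i < len; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

For large matrices the two results will generally differ by a small rounding error, because the blocked version adds the partial products in a different order, so a tolerance is more appropriate than an exact comparison.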

To compile, use the following commands.

Without NEON:

gcc -O3 -std=gnu9x -mcpu=cortex-a9 -fopenmp matrix_mult_openmp_neon.c -o matrix_mult_openmp -lm

With NEON:

gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon -lm
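The NEON build relies on GCC's auto-vectoriser rather than hand-written intrinsics. The loop it has to recognise is the innermost one, which adds a scaled row of `b` to a row of `c`. The sketch below isolates that loop shape; `row_axpy` is a hypothetical name, and the `restrict` qualifiers are an assumption that the two rows never overlap, which is not stated in the original code.

```c
#include <assert.h>
#include <stddef.h>

/* c_row[j] += a_ik * b_row[j] for j in [0, len).
 * Unit-stride accesses and a loop-invariant scalar make this a natural
 * candidate for auto-vectorisation; `restrict` (an assumption here)
 * tells the compiler the two rows do not alias. */
void row_axpy(float *restrict c_row, const float *restrict b_row,
              float a_ik, size_t len) {
    assert(c_row != NULL && b_row != NULL);
    for (size_t j = 0; j < len; j++)
        c_row[j] += a_ik * b_row[j];
}
```

A NEON quad register holds four single-precision values, so with `-mvectorize-with-neon-quad` an 8-element row could in principle be covered by two vector multiply-accumulate steps. The GCC documentation notes that floating-point code is not auto-vectorised for NEON unless `-funsafe-math-optimizations` is in effect, which `-ffast-math` enables. On GCC versions that support it, `-fopt-info-vec` reports which loops were actually vectorised.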

To run on both cores:

export OMP_NUM_THREADS=2

./matrix_mult_openmp_neon

./matrix_mult_openmp

Execution Time:

Matrix size	Execution time (OpenMP)	Execution time (OpenMP + NEON)	Gain
1024×1024	11003.892 ms	7310.619 ms	1.50
2048×2048	103217.05 ms	65076.317 ms	1.59
4096×4096	836098.79 ms	547190.87 ms	1.53
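The gain column is the OpenMP time divided by the OpenMP + NEON time. The same timings can also be turned into a throughput figure, since an n×n by n×n product performs n³ multiplications and n³ additions. The helpers below are a minimal sketch of that arithmetic; `speedup` and `gflops` are hypothetical names, not functions from the original code.

```c
#include <assert.h>

/* Ratio of the baseline time to the optimised time (both in ms). */
double speedup(double baseline_ms, double optimized_ms) {
    assert(optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}

/* Throughput in GFLOP/s for an n x n by n x n product,
 * counting n^3 multiplications plus n^3 additions. */
double gflops(double n, double time_ms) {
    assert(time_ms > 0.0);
    double flops = 2.0 * n * n * n;
    return flops / (time_ms * 1.0e-3) / 1.0e9;
}
```

For example, `speedup(11003.892, 7310.619)` reproduces the gain in the first row of the table.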

You can find the code here.
