Goal | Fast matrix-matrix multiplication in software |
Approach | Cache-friendly code that uses both cores (with OpenMP) and the NEON vector processor |
Benefits | Matrix-matrix multiplication about 1.5 times faster with NEON than with OpenMP alone |
Credit | This work was done under the ENPOWER project (funded by EPSRC) at the University of Bristol. |
In one of my previous posts I introduced an FPGA implementation of large matrix-matrix multiplication, which was much faster than the software implementation.
However, that software implementation was quite simple. To make the comparison fair, the software version has to use almost all of the resources available on the hardware platform. Here I introduce a software implementation that uses both cores of the dual-core Cortex-A9 and the NEON vector processor. In addition, the code is cache-friendly: it multiplies the matrices in small BLOCK×BLOCK tiles so that the data being worked on stays in the cache.
This is the code:
[code language="c"]
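#define BLOCK 8

/* Computes c += a*b for row-major matrices: a is N x M, b is M x P, c is N x P.
   N, M and P are assumed to be defined elsewhere in the file and to be
   multiples of BLOCK; c must be zero-initialised by the caller. */
unsigned Fast_MMM(float *a, float *b, float *c) {
    int i_m, j_m, k_m, i_block, j_block, k_block;
    float *c_p, *b_p, *a_p;

    /* c_p, b_p and a_p must be private as well, otherwise the threads race on them */
    #pragma omp parallel shared(a,b,c) private(i_m,j_m,k_m,i_block,j_block,k_block,c_p,b_p,a_p)
    {
        /* split the block rows of c between the threads */
        #pragma omp for schedule(static)
        for (i_m = 0; i_m < N; i_m += BLOCK) {
            for (j_m = 0; j_m < P; j_m += BLOCK) {
                for (k_m = 0; k_m < M; k_m += BLOCK) {
                    c_p = c+i_m*P+j_m;   /* top-left element of the c block */
                    a_p = a+i_m*M+k_m;   /* top-left element of the a block */
                    for (i_block = 0; i_block < BLOCK; i_block++) {       /* row within the block */
                        b_p = b+k_m*P+j_m;   /* top-left element of the b block */
                        for (j_block = 0; j_block < BLOCK; j_block++) {   /* inner-product index */
                            /* unit-stride loop over one row of c, suitable for vectorisation */
                            for (k_block = 0; k_block < BLOCK; k_block++) {
                                c_p[k_block] += a_p[j_block] * b_p[k_block];
                            }
                            b_p += P;   /* next row of the b block */
                        }
                        c_p += P;   /* next row of the c block */
                        a_p += M;   /* next row of the a block */
                    }
                }
            }
        }
    }
    return 1;
}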
[/code]
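A blocked kernel like this is easy to get subtly wrong, so it is worth comparing it with a plain triple loop on small matrices. The following is a minimal sketch of such a check, not part of the original code: the names `blocked_mmm`, `naive_mmm` and `max_abs_diff` are hypothetical, and the matrix dimensions are passed as arguments rather than taken from `N`, `M` and `P`. The test values are chosen so that every partial sum is exactly representable in a `float`, so the two results should agree exactly.

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK 8

/* Blocked product c += a*b; a is n x m, b is m x p, c is n x p, all row-major.
   Passing n, m and p as arguments is an assumption made for this sketch. */
static void blocked_mmm(const float *a, const float *b, float *c,
                        int n, int m, int p) {
    assert(n % BLOCK == 0 && m % BLOCK == 0 && p % BLOCK == 0);
    /* Variables declared inside the loops are private to each thread. */
    #pragma omp parallel for schedule(static)
    for (int i_m = 0; i_m < n; i_m += BLOCK) {
        for (int j_m = 0; j_m < p; j_m += BLOCK) {
            for (int k_m = 0; k_m < m; k_m += BLOCK) {
                float *c_p = c + i_m * p + j_m;
                const float *a_p = a + i_m * m + k_m;
                for (int i = 0; i < BLOCK; i++) {
                    const float *b_p = b + k_m * p + j_m;
                    for (int k = 0; k < BLOCK; k++) {
                        for (int j = 0; j < BLOCK; j++)
                            c_p[j] += a_p[k] * b_p[j];
                        b_p += p;
                    }
                    c_p += p;
                    a_p += m;
                }
            }
        }
    }
}

/* Straightforward triple loop used as the reference result. */
static void naive_mmm(const float *a, const float *b, float *c,
                      int n, int m, int p) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            float sum = 0.0f;
            for (int k = 0; k < m; k++)
                sum += a[i * m + k] * b[k * p + j];
            c[i * p + j] = sum;
        }
    }
}

/* Multiplies two test matrices both ways and returns the largest absolute
   difference between the results (-1.0f if an allocation fails). */
float max_abs_diff(int n, int m, int p) {
    float *a = malloc(sizeof *a * (size_t)n * m);
    float *b = malloc(sizeof *b * (size_t)m * p);
    float *c_blocked = calloc((size_t)n * p, sizeof *c_blocked); /* zeroed */
    float *c_naive = calloc((size_t)n * p, sizeof *c_naive);
    float worst = -1.0f;

    if (a && b && c_blocked && c_naive) {
        /* Small multiples of 0.5 and 0.25 keep all sums exact in float. */
        for (int i = 0; i < n * m; i++) a[i] = (float)(i % 7 - 3) * 0.5f;
        for (int i = 0; i < m * p; i++) b[i] = (float)(i % 5 - 2) * 0.25f;

        blocked_mmm(a, b, c_blocked, n, m, p);
        naive_mmm(a, b, c_naive, n, m, p);

        worst = 0.0f;
        for (int i = 0; i < n * p; i++) {
            float d = c_blocked[i] - c_naive[i];
            if (d < 0.0f) d = -d;
            if (d > worst) worst = d;
        }
    }
    free(a); free(b); free(c_blocked); free(c_naive);
    return worst;
}
```

Calling `max_abs_diff(16, 24, 8)` exercises non-square matrices, which would expose a mix-up between the row strides of `a`, `b` and `c`. Building the check with and without `-fopenmp` also helps to separate indexing bugs from threading bugs.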
To compile, use one of the following commands.
Without NEON:
gcc -O3 -std=gnu9x -mcpu=cortex-a9 -fopenmp matrix_mult_openmp_neon.c -o matrix_mult_openmp -lm
With NEON:
gcc -O2 -std=gnu9x -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mvectorize-with-neon-quad -mfloat-abi=hard -ffast-math -fopenmp matrix_mult_openmp_neon.c -o matrix_mult_openmp_neon -lm
To run with two threads:
export OMP_NUM_THREADS=2
./matrix_mult_openmp_neon
or
./matrix_mult_openmp
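Before timing anything, it can be useful to confirm that the program really runs on both cores. The sketch below is an illustration rather than part of the original code: the helper name `worker_threads` is hypothetical, and it only uses the standard OpenMP call `omp_get_num_threads()`. It returns 1 when the file is built without `-fopenmp`.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns how many threads a parallel region actually gets,
   or 1 when the file is compiled without -fopenmp. */
int worker_threads(void) {
    int count = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Only one thread writes the shared variable. */
        #pragma omp single
        count = omp_get_num_threads();
    }
#endif
    assert(count >= 1);
    return count;
}
```

With `OMP_NUM_THREADS=2` exported, the function is expected to return 2 on the dual-core Cortex-A9, provided the OpenMP runtime grants the full team.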
Execution Time:
Matrix size | Execution time (OpenMP) | Execution time (OpenMP + NEON) | Gain |
1024×1024 | 11003.892 ms | 7310.619 ms | 1.51 |
2048×2048 | 103217.05 ms | 65076.317 ms | 1.59 |
4096×4096 | 836098.79 ms | 547190.87 ms | 1.53 |
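The gain column is the OpenMP time divided by the OpenMP + NEON time. The same measurements can also be turned into a rough throughput figure. The sketch below assumes the usual count of 2·n³ floating-point operations for an n×n product (one multiply and one add per inner iteration); the function names `gain` and `gflops` are hypothetical.

```c
#include <assert.h>

/* Speed-up of the faster run over the baseline run. */
double gain(double base_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return base_ms / fast_ms;
}

/* Throughput in GFLOP/s for an n x n product that took ms milliseconds,
   assuming 2*n^3 floating-point operations. */
double gflops(int n, double ms) {
    assert(ms > 0.0);
    double flops = 2.0 * n * n * n;   /* evaluated in double, so no int overflow */
    return flops / (ms * 1e-3) / 1e9;
}
```

For the 1024×1024 row, `gflops(1024, 7310.619)` comes out at roughly 0.29 GFLOP/s under this counting.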
You can find the code here.