const unsigned int N = 1024;
void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{
    float s = 0;
    // Unsigned index avoids a signed/unsigned comparison with n.
    for (unsigned int i = 0; i < n; i++) {
#pragma HLS pipeline
        s += a[i]*b[i];
    }
    acc = s;
}

Xilinx Vitis HLS synthesises the for-loop into a pipelined microarchitecture with II=1 (one new iteration starts every clock cycle), so the whole design takes about n cycles to finish.

Now, let’s increase the performance by partially unrolling the loop by a factor of B.

One way is to use the HLS UNROLL pragma, as follows:

const unsigned int N = 1024;
const unsigned int B = 32;
void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{
    float s = 0;
    for (int i = 0; i < n; i++) {
#pragma HLS UNROLL factor=B
#pragma HLS pipeline
        s += a[i]*b[i];
    }
    acc = s;
}

However, synthesising this code results in a pipelined loop with II=160, which is not desirable.
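To see why II=160 is worse than not unrolling at all, the usual first-order estimate for a pipelined loop, cycles = II * (trip_count - 1) + depth, can be sketched as below. This is only a rough model: the function name `pipelined_loop_cycles` is hypothetical, and the pipeline depth of 10 used in the note is an assumed placeholder, not a value from a synthesis report.

```cpp
#include <cassert>
#include <cstdint>

// First-order latency estimate for a pipelined loop:
//   cycles = II * (trip_count - 1) + depth
// trip_count: number of loop iterations after any unrolling
// ii:         initiation interval reported by the tool
// depth:      latency of one iteration in cycles (assumed, not measured)
std::uint64_t pipelined_loop_cycles(std::uint64_t trip_count,
                                    std::uint64_t ii,
                                    std::uint64_t depth)
{
    assert(ii >= 1);
    if (trip_count == 0)
        return 0;
    return ii * (trip_count - 1) + depth;
}
```

With n = 1024 and an assumed depth of 10, the original loop (1024 iterations, II=1) comes out at about 1033 cycles, while the unrolled loop (1024/32 = 32 iterations, II=160) comes out at about 4970 cycles, so under this model the unrolled version is several times slower than the baseline.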

Now let’s do that manually by using a temporary variable and array partitioning, as shown below.

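const unsigned int N = 1024;
const unsigned int B = 32;
void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{
// Cyclic partitioning into B/2 banks; with two ports per bank this allows B reads per cycle.
#pragma HLS ARRAY_PARTITION variable=a dim=1 factor=B/2 cyclic
#pragma HLS ARRAY_PARTITION variable=b dim=1 factor=B/2 cyclic
    float s = 0;
    // Note: n is assumed to be a multiple of B; otherwise the last n % B elements are skipped.
    for (int i = 0; i < n/B; i++) {
#pragma HLS pipeline
        float s_tmp = 0;  // partial sum of one block of B products
        for (int j = 0; j < B; j++)
            s_tmp += a[i*B+j]*b[i*B+j];
        s += s_tmp;
    }
    acc = s;
}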

Now, after synthesis, the loop II is 1. Since the loop runs only n/B iterations instead of n, the final design is roughly B times faster than the original implementation (here, about 32 times faster), provided n is a multiple of B.
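Before trusting the restructured loop, it may help to check it against the original in plain software. The sketch below is not part of the HLS design: `dot_sequential` and `dot_blocked` are hypothetical helper names, `std::vector` is used only for convenience on the host side, and the tail loop for leftover elements is an addition of this sketch for the case where n is not a multiple of B.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential dot product, mirroring the original single loop.
float dot_sequential(const std::vector<float> &a, const std::vector<float> &b)
{
    assert(a.size() == b.size());
    float s = 0;
    for (std::size_t i = 0; i < a.size(); i++)
        s += a[i] * b[i];
    return s;
}

// Blocked dot product with block size B, mirroring the manual version.
// The tail loop is an addition of this sketch: it handles the last
// n % B elements, which the HLS code above would skip.
float dot_blocked(const std::vector<float> &a, const std::vector<float> &b,
                  std::size_t B)
{
    assert(a.size() == b.size() && B > 0);
    const std::size_t n = a.size();
    float s = 0;
    for (std::size_t i = 0; i < n / B; i++) {
        float s_tmp = 0;  // partial sum of one block
        for (std::size_t j = 0; j < B; j++)
            s_tmp += a[i * B + j] * b[i * B + j];
        s += s_tmp;
    }
    for (std::size_t k = (n / B) * B; k < n; k++)  // leftover elements
        s += a[k] * b[k];
    return s;
}
```

Because floating-point addition is not associative, the blocked sum can differ from the sequential one in the last bits for general data, so a testbench would normally compare the two results with a small tolerance rather than for exact equality.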