Thu. Oct 6th, 2022

Let’s consider the following code snippet.

```cpp
const unsigned int N = 1024;

void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{
    float s = 0;

    for (int i = 0; i < n; i++) {
#pragma HLS pipeline
        s += a[i] * b[i];
    }

    acc = s;
}
```

Xilinx Vitis HLS synthesises the for-loop into a pipelined microarchitecture with an initiation interval (II) of 1, so a new iteration starts every cycle. Therefore, the whole design takes about n cycles to finish, plus the depth of the pipeline.
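The "about n cycles" estimate can be made concrete with the usual rough model for a pipelined loop. This is a minimal sketch, not something taken from the tool: the helper name `pipelined_loop_cycles` and the `depth` argument are hypothetical, and the real depth has to be read from the synthesis report.

```cpp
#include <cstdint>

// Rough cycle estimate for a pipelined loop:
//   cycles ~= depth + II * (trip_count - 1)
// `depth` is the latency of a single iteration. It is an assumed input here;
// the actual value comes from the synthesis report.
constexpr std::uint64_t pipelined_loop_cycles(std::uint64_t trip_count,
                                              std::uint64_t ii,
                                              std::uint64_t depth)
{
    return trip_count == 0 ? 0 : depth + ii * (trip_count - 1);
}
```

For example, with n = 1024, II = 1 and an illustrative depth of 10 cycles, `pipelined_loop_cycles(1024, 1, 10)` gives 1033, which is close to n.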

Now, let’s increase the performance by partially unrolling the loop by a factor of B.

One way is using the HLS pragma as follows:

```cpp
const unsigned int N = 1024;
const unsigned int B = 32;

void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{
    float s = 0;

    for (int i = 0; i < n; i++) {
#pragma HLS UNROLL factor=B
#pragma HLS pipeline
        s += a[i] * b[i];
    }

    acc = s;
}
```

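However, synthesising this code results in a pipelined loop with II=160, which is not desirable: the unrolled loop runs about n/B iterations, so with B = 32 it needs roughly 160·n/32 = 5n cycles, which is slower than the original loop.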

Now let’s do that manually by using a temporary variable and array partitioning, as shown below.

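```cpp
const unsigned int N = 1024;
const unsigned int B = 32;

void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{
// B/2 cyclic banks per array; assuming two read ports per bank
// (dual-port memory), this allows B reads per array in each cycle.
#pragma HLS ARRAY_PARTITION variable=a dim=1 factor=B/2 cyclic
#pragma HLS ARRAY_PARTITION variable=b dim=1 factor=B/2 cyclic

    float s = 0;

    // n is assumed to be a multiple of B; otherwise the last n % B
    // elements are not included in the sum.
    for (int i = 0; i < n/B; i++) {
#pragma HLS pipeline
        float s_tmp = 0;  // partial sum of one block of B products
        for (int j = 0; j < B; j++)
            s_tmp += a[i*B+j] * b[i*B+j];
        s += s_tmp;
    }

    acc = s;
}
```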

Now, after synthesis, the loop II is 1. Since the loop runs only n/B iterations, the final design is about B times faster than the original implementation (here, about 32 times faster), ignoring the pipeline depth.
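One detail worth checking in C simulation: the manual version iterates n/B times, so when n is not a multiple of B the last n % B elements are left out. The sketch below is a software-only check and one possible fix, not the synthesised design. The names `dot_product_ref` and `dot_product_blocked` are hypothetical, the HLS pragmas are left out on purpose, and the effect of the extra epilogue loop on the synthesis result is not claimed here.

```cpp
const unsigned int N = 1024;
const unsigned int B = 32;

// Plain reference loop, used only to check results in software.
float dot_product_ref(const float a[N], const float b[N], unsigned int n)
{
    float s = 0;
    for (unsigned int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

// Blocked accumulation as in the manual version, plus an epilogue
// that handles the n % B leftover elements.
void dot_product_blocked(const float a[N], const float b[N],
                         unsigned int n, float &acc)
{
    float s = 0;
    const unsigned int full = n / B;  // number of complete blocks

    for (unsigned int i = 0; i < full; i++) {
        float s_tmp = 0;
        for (unsigned int j = 0; j < B; j++)
            s_tmp += a[i * B + j] * b[i * B + j];
        s += s_tmp;
    }

    // Epilogue: elements that the blocked loop did not reach.
    for (unsigned int k = full * B; k < n; k++)
        s += a[k] * b[k];

    acc = s;
}
```

Note that the blocked version adds the products in a different order than the reference loop, so for general floating-point inputs the two results can differ by rounding; a testbench would normally compare them with a small tolerance rather than for exact equality.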