Let’s consider the following code snippet.

const unsigned int N = 1024;

void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{

	float s = 0;

	for (int i = 0; i < n; i++) {
#pragma HLS pipeline
		s += a[i]*b[i];
	}

	acc = s;
}

“High-Level Synthesis for FPGA” Online Courses

Xilinx Vitis HLS synthesises the for-loop into a pipelined microarchitecture with II=1, meaning that a new iteration starts every clock cycle. Therefore, the whole design takes about n cycles to finish, plus a few cycles of pipeline depth.

Now, let’s increase the performance by partially unrolling the loop by a factor of B.

One way is using the HLS pragma as follows:

const unsigned int N = 1024;
const unsigned int B = 32;

void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{

	float s = 0;

	for (int i = 0; i < n; i++) {
#pragma HLS UNROLL factor=B
#pragma HLS pipeline
		s += a[i]*b[i];
	}

	acc = s;
}

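However, synthesising this code results in a pipelined loop with II=160, which is not desirable: each iteration now processes B elements, but a new iteration can start only every 160 cycles. A likely cause is that all B unrolled additions update the same variable s, so they form one long dependency chain that must finish before the next iteration can begin. In addition, the arrays a and b are not partitioned, which limits how many elements can be read per cycle.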
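To see why II=160 is a problem, it helps to estimate the cycle count of a pipelined loop. The sketch below uses the usual approximation, cycles = (trip count - 1) * II + depth. The function name pipelined_loop_cycles and the depth values are assumptions made for illustration; the real iteration latency should be read from the synthesis report.

```cpp
#include <cassert>
#include <cstdint>

// Rough cycle estimate for a pipelined loop:
//   cycles = (trip_count - 1) * ii + depth
// trip_count: number of loop iterations after any unrolling
// ii:         initiation interval reported by the tool
// depth:      latency of a single iteration in cycles
//             (assumed here; take the real value from the synthesis report)
std::uint64_t pipelined_loop_cycles(std::uint64_t trip_count,
                                    std::uint64_t ii,
                                    std::uint64_t depth)
{
    assert(ii >= 1);              // an initiation interval below 1 is meaningless
    if (trip_count == 0) return 0;
    return (trip_count - 1) * ii + depth;
}
```

For n = 1024 and an assumed depth of 10 cycles, the original loop gives pipelined_loop_cycles(1024, 1, 10) = 1033 cycles. If the unrolled loop runs n/B = 32 iterations at II=160, the same estimate gives pipelined_loop_cycles(32, 160, 10) = 4970 cycles, so under these assumptions the pragma-based version would be several times slower than the original, not faster.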

Now let’s do that manually by using a temporary variable and array partitioning, as shown below.

const unsigned int N = 1024;
const unsigned int B = 32;

void dot_product(float a[N], float b[N], unsigned int n, float &acc)
{
#pragma HLS ARRAY_PARTITION variable=a dim=1 factor=B/2 cyclic
#pragma HLS ARRAY_PARTITION variable=b dim=1 factor=B/2 cyclic

	float s = 0;

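	// Each outer iteration handles one block of B elements.
	for (int i = 0; i < n/B; i++) {
#pragma HLS pipeline
		// Partial sum of the current block; it is local to the iteration,
		// so it carries no dependency between outer iterations.
		float s_tmp = 0;
		// Pipelining the outer loop fully unrolls this inner loop.
		for (int j = 0; j < B; j++)
			s_tmp += a[i*B+j]*b[i*B+j];
		// Only this single addition is carried across outer iterations.
		s += s_tmp;
	}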

	acc = s;
}

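Now, after synthesis, the loop II is 1. The loop runs n/B iterations instead of n, so the final design is roughly B times faster than the original implementation (here, 32 times faster), ignoring the pipeline depth. The cyclic partitioning by B/2 presumably relies on each memory bank providing two ports, so that B elements of each array can be read per cycle. Note that the code assumes n is a multiple of B; otherwise the last n mod B elements are not accumulated.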
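The manual version covers only whole blocks of B elements, and it adds the products in a different order than the original loop, which can change the floating-point result slightly. The following is a minimal software-only sketch, not the synthesised design, of how a testbench might check the blocked summation against the plain one. The names dot_reference, dot_blocked and close_enough, the tail loop for leftover elements, and the tolerance value are all assumptions added for illustration.

```cpp
#include <cassert>
#include <cmath>

constexpr unsigned int B = 32;  // block size, same value as in the article

// Reference: plain sequential accumulation, as in the first snippet.
float dot_reference(const float *a, const float *b, unsigned int n)
{
    assert(a != nullptr && b != nullptr);
    float s = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

// Blocked accumulation in the same order as the manual version,
// plus a tail loop (an addition, not in the article) for the
// n % B elements that do not fill a whole block.
float dot_blocked(const float *a, const float *b, unsigned int n)
{
    assert(a != nullptr && b != nullptr);
    float s = 0.0f;
    const unsigned int full_blocks = n / B;
    for (unsigned int i = 0; i < full_blocks; i++) {
        float s_tmp = 0.0f;
        for (unsigned int j = 0; j < B; j++)
            s_tmp += a[i * B + j] * b[i * B + j];
        s += s_tmp;
    }
    float tail = 0.0f;
    for (unsigned int k = full_blocks * B; k < n; k++)
        tail += a[k] * b[k];
    return s + tail;
}

// Relative comparison; the default tolerance is an arbitrary choice.
bool close_enough(float x, float y, float rel_tol = 1e-4f)
{
    const float scale = std::fmax(1.0f, std::fmax(std::fabs(x), std::fabs(y)));
    return std::fabs(x - y) <= rel_tol * scale;
}
```

A testbench could fill two arrays with test data, call both functions, and compare the results with close_enough instead of ==, since reordering floating-point additions generally does not give bit-identical sums.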
