The data type of a loop index can have a significant impact on the performance of an accelerator described in HLS.
Here I use an example to show this impact.
Let’s consider the following code, which describes an IIR filter.

extern "C" {
void iir_filter_kernel(
		DATA_TYPE x[1024],
		DATA_TYPE y[1024],
		DATA_TYPE a[3],
		DATA_TYPE b[3],
		int p,
		int q,
		int n)
{
	for (int i = 0; i < n; i++) {
		for (SIZE_TYPE j = 0; j < p; j++) {
			y[i] += (i >= j)? b[j]*x[i-j] : 0;
		}
		for (SIZE_TYPE j = 1; j < q; j++) {
			y[i] -= (i >= j)? a[j]*y[i-j] : 0;
		}
	}
}
}
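
For reference, the two inner loops above appear to implement the usual IIR difference equation. It is written out here under two assumptions that the listing does not state: y is zero on entry (the code accumulates into y[i] with +=), and samples with a negative index count as zero.

```latex
y[i] = \sum_{j=0}^{p-1} b[j]\, x[i-j] \;-\; \sum_{j=1}^{q-1} a[j]\, y[i-j],
\qquad x[k] = y[k] = 0 \ \text{for } k < 0
```

Here p and q are the numbers of feed-forward and feedback coefficients. Note that a[0] is never read, which matches the second inner loop starting at j = 1.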

If we synthesise this code with Vitis-HLS for the Ultra96v2 board at 150 MHz, we get the report shown in the following figure.

We have two problems in this report: timing violation and II violation.

The main reason for these violations is that the code accesses the vectors x, y, a, and b multiple times, while these vectors are located in main memory and the access patterns are not sequential.

To solve this problem, we can define four buffers in the BRAMs and modify the code as follows:

extern "C" {
void iir_filter_kernel(
		DATA_TYPE *x,
		DATA_TYPE *y,
		DATA_TYPE *a,
		DATA_TYPE *b,
		int p,
		int q,
		int n)
{
	DATA_TYPE x_buffer[P_MAX];
	DATA_TYPE y_buffer[Q_MAX];
	DATA_TYPE a_buffer[Q_MAX];	// filled and read with indices below q
	DATA_TYPE b_buffer[P_MAX];	// filled and read with indices below p
	for (int i = 0; i < p; i++) {
		x_buffer[i] = 0;
		b_buffer[i] = b[i];
	}
	for (int i = 0; i < q; i++) {
		y_buffer[i] = 0;
		a_buffer[i] = a[i];
	}
	for (unsigned int i = 0; i < n; i++) {
		for (int j = 1; j < p; j++) {	// start at 1 so x_buffer[p] is never written
			x_buffer[p-j] = x_buffer[p-j-1];
		}
		x_buffer[0] = x[i];
		DATA_TYPE y_local_x = 0;
		for (int j = 0; j < p; j++) {
			y_local_x += b_buffer[j]*x_buffer[j];
		}
		DATA_TYPE y_local_y = 0;
		for (int j = 1; j < q; j++) {
			y_local_y += a_buffer[j]*y_buffer[j];
		}
		DATA_TYPE y_local = y_local_x - y_local_y;
		for (int j = q-1; j > 0; j--) {
			y_buffer[j] = y_buffer[j-1];
		}
		y_buffer[1] = y_local;
		y[i] = y_local;
	}
}
}
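
One way to gain confidence in a rewrite like this is to check it against the direct form in ordinary software before synthesis. The following is a minimal sketch in plain C++, not the article's kernel: `iir_direct`, `iir_buffered`, and `max_abs_diff` are hypothetical helper names, `DATA_TYPE` is assumed to be `double`, and `P_MAX`/`Q_MAX` are assumed to be 128, since the real definitions live in `iir_filter_kernel.h`, which is not shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for definitions kept in iir_filter_kernel.h.
using DATA_TYPE = double;
constexpr int P_MAX = 128;
constexpr int Q_MAX = 128;
static_assert(P_MAX >= 1 && Q_MAX >= 2, "buffers must hold the newest samples");

// Direct form: every output reads x and y at arbitrary past indices.
std::vector<DATA_TYPE> iir_direct(const std::vector<DATA_TYPE>& x,
                                  const std::vector<DATA_TYPE>& a,
                                  const std::vector<DATA_TYPE>& b) {
    const int n = static_cast<int>(x.size());
    const int p = static_cast<int>(b.size());
    const int q = static_cast<int>(a.size());
    std::vector<DATA_TYPE> y(x.size(), 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) y[i] += (i >= j) ? b[j] * x[i - j] : 0;
        for (int j = 1; j < q; j++) y[i] -= (i >= j) ? a[j] * y[i - j] : 0;
    }
    return y;
}

// Buffered form: x and y are each touched once per sample; the history
// lives in two small shift registers, as in the listing above.
std::vector<DATA_TYPE> iir_buffered(const std::vector<DATA_TYPE>& x,
                                    const std::vector<DATA_TYPE>& a,
                                    const std::vector<DATA_TYPE>& b) {
    const int p = static_cast<int>(b.size());
    const int q = static_cast<int>(a.size());
    assert(p <= P_MAX && q <= Q_MAX);
    DATA_TYPE x_buffer[P_MAX] = {0};
    DATA_TYPE y_buffer[Q_MAX] = {0};
    std::vector<DATA_TYPE> y(x.size(), 0);
    for (std::size_t i = 0; i < x.size(); i++) {
        for (int j = p - 1; j > 0; j--) x_buffer[j] = x_buffer[j - 1];
        x_buffer[0] = x[i];
        DATA_TYPE sum_x = 0;
        for (int j = 0; j < p; j++) sum_x += b[j] * x_buffer[j];
        DATA_TYPE sum_y = 0;
        for (int j = 1; j < q; j++) sum_y += a[j] * y_buffer[j];
        const DATA_TYPE y_local = sum_x - sum_y;
        for (int j = q - 1; j > 0; j--) y_buffer[j] = y_buffer[j - 1];
        y_buffer[1] = y_local;  // slot 0 is unused, slot 1 holds y[i]
        y[i] = y_local;
    }
    return y;
}

// Largest element-wise difference between two equally long vectors.
DATA_TYPE max_abs_diff(const std::vector<DATA_TYPE>& u,
                       const std::vector<DATA_TYPE>& v) {
    assert(u.size() == v.size());
    DATA_TYPE worst = 0;
    for (std::size_t i = 0; i < u.size(); i++) {
        const DATA_TYPE d = std::fabs(u[i] - v[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Calling both functions on the same input and comparing them with `max_abs_diff` against a small tolerance is usually enough. The two forms sum in a different order, so with floating-point data the results can differ in the last bits.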

Now the following figure shows the synthesis report. We have improved the II, but timing and II violations still exist.

Now let’s do the magic trick (this is probably a bug in Vitis-HLS 2020.2): let’s change the data type of the inner loop indices. These loops iterate over internal buffers, and the buffer sizes in an IIR filter are limited (I have assumed 128), so the loop indices do not need to be int; an 8-bit unsigned value can do the task. So let’s change the loop index data type, as shown in the following code.

If you are interested in learning HLS, these online courses are designed for you.

High-Level Synthesis for FPGA Online Courses
#include "iir_filter_kernel.h"
#define SIZE_TYPE ap_uint<8>
extern "C" {
void iir_filter_kernel(
		DATA_TYPE *x,
		DATA_TYPE *y,
		DATA_TYPE *a,
		DATA_TYPE *b,
		SIZE_TYPE p,
		SIZE_TYPE q,
		unsigned int n)
{
	DATA_TYPE x_buffer[P_MAX];
	DATA_TYPE y_buffer[Q_MAX];
	DATA_TYPE a_buffer[Q_MAX];	// filled and read with indices below q
	DATA_TYPE b_buffer[P_MAX];	// filled and read with indices below p
	for (SIZE_TYPE i = 0; i < p; i++) {
		x_buffer[i] = 0;
		b_buffer[i] = b[i];
	}
	for (SIZE_TYPE i = 0; i < q; i++) {
		y_buffer[i] = 0;
		a_buffer[i] = a[i];
	}
	for (unsigned int i = 0; i < n; i++) {
		for (SIZE_TYPE j = 1; j < p; j++) {	// start at 1 so x_buffer[p] is never written
			x_buffer[p-j] = x_buffer[p-j-1];
		}
		x_buffer[0] = x[i];
		DATA_TYPE y_local_x = 0;
		for (SIZE_TYPE j = 0; j < p; j++) {
			y_local_x += b_buffer[j]*x_buffer[j];
		}
		DATA_TYPE y_local_y = 0;
		for (SIZE_TYPE j = 1; j < q; j++) {
			y_local_y += a_buffer[j]*y_buffer[j];
		}
		DATA_TYPE y_local = y_local_x - y_local_y;
		for (SIZE_TYPE j = q-1; j > 0; j--) {
			y_buffer[j] = y_buffer[j-1];
		}
		y_buffer[1] = y_local;
		y[i] = y_local;
	}
}
}
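
When an index is narrowed to 8 bits, it is worth checking that every value it can take still fits. The sketch below uses the standard `std::uint8_t` as a stand-in for `ap_uint<8>`, which may differ in detail. `INDEX_TYPE`, `next_index`, and `count_down_iterations` are hypothetical names, and `P_MAX = 128` is taken from the buffer size assumed in the text.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

constexpr int P_MAX = 128;        // buffer size assumed in the text
using INDEX_TYPE = std::uint8_t;  // stand-in for ap_uint<8>

// The largest index into a P_MAX-element buffer must fit in 8 bits.
static_assert(P_MAX - 1 <= std::numeric_limits<INDEX_TYPE>::max(),
              "8-bit index cannot address the whole buffer");

// Incrementing an 8-bit unsigned index wraps from 255 back to 0.
INDEX_TYPE next_index(INDEX_TYPE j) {
    return static_cast<INDEX_TYPE>(j + 1);
}

// Counts the iterations of `for (INDEX_TYPE j = q - 1; j > 0; j--)`.
// With q == 0 the start value wraps to 255 instead of being negative.
int count_down_iterations(int q) {
    assert(q >= 0 && q <= 255);
    int iterations = 0;
    for (INDEX_TYPE j = static_cast<INDEX_TYPE>(q - 1); j > 0; j--) {
        iterations++;
    }
    return iterations;
}
```

For the filter orders used here (p = q = 3), none of these corner cases is reached. They only matter if a caller passes 0 or a value above the buffer size.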

Now the following figure shows the new synthesis report.

No violation is reported. C simulation passes the test, but RTL/C co-simulation fails, which I think is probably a bug in Vitis-HLS 2020.2.

Anyway, the following code modification improves the performance without any problem.

#include "iir_filter_kernel.h"
extern "C" {
void iir_filter_kernel(
		DATA_TYPE x[1024],
		DATA_TYPE y[1024],
		DATA_TYPE a[3],
		DATA_TYPE b[3],
		int  p,
		int  q,
		int n)
{
	DATA_TYPE x_buffer[P_MAX];
#pragma HLS ARRAY_PARTITION variable=x_buffer dim=1 complete
	DATA_TYPE y_buffer[Q_MAX];
#pragma HLS ARRAY_PARTITION variable=y_buffer dim=1 complete
	DATA_TYPE a_buffer[Q_MAX];	// filled and read with indices below q
#pragma HLS ARRAY_PARTITION variable=a_buffer dim=1 complete
	DATA_TYPE b_buffer[P_MAX];	// filled and read with indices below p
#pragma HLS ARRAY_PARTITION variable=b_buffer dim=1 complete
	for (int i = 0; i < p; i++) {
		x_buffer[i] = 0;
		b_buffer[i] = b[i];
	}
	for (int i = 0; i < q; i++) {
		y_buffer[i] = 0;
		a_buffer[i] = a[i];
	}
	for (int i = 0; i < n; i++) {
		for (int j = 1; j < p; j++) {	// start at 1 so x_buffer[p] is never written
#pragma HLS pipeline
			x_buffer[p-j] = x_buffer[p-j-1];
		}
		x_buffer[0] = x[i];
		DATA_TYPE y_local_x = 0;
		for (int j = 0; j < P_MAX; j++) {
#pragma HLS unroll
			if (j < p)
				y_local_x += b_buffer[j]*x_buffer[j];
			else
				y_local_x += 0;
		}
		DATA_TYPE y_local_y = 0;
		for (int j = 1; j < Q_MAX; j++) {
#pragma HLS unroll
			if (j < q)
				y_local_y += a_buffer[j]*y_buffer[j];
			else
				y_local_y += 0;
		}
		DATA_TYPE y_local = y_local_x - y_local_y;
		for (int j = 1; j < q; j++) {	// start at 1 so y_buffer[q] is never written
#pragma HLS pipeline
			y_buffer[q-j] = y_buffer[q-j-1];
		}
		y_buffer[1] = y_local;
		y[i] = y_local;
	}
}
}
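
The two unrolled loops in this version run to the compile-time bounds P_MAX and Q_MAX and guard the body with `if (j < p)` or `if (j < q)`, presumably because a loop can only be fully unrolled when its trip count is known at compile time. The plain C++ sketch below illustrates that the guarded fixed-bound loop computes the same sum as the variable-bound loop. `dot_variable_bound` and `dot_fixed_bound` are hypothetical names, and `P_MAX = 8` and an integer `DATA_TYPE` are chosen only to keep the example small and exact.

```cpp
#include <cassert>

constexpr int P_MAX = 8;  // small hypothetical bound for illustration
using DATA_TYPE = int;    // integer data so the comparison is exact

// Variable trip count: the loop bound depends on the run-time value p.
DATA_TYPE dot_variable_bound(const DATA_TYPE* coeff, const DATA_TYPE* samples, int p) {
    DATA_TYPE acc = 0;
    for (int j = 0; j < p; j++) {
        acc += coeff[j] * samples[j];
    }
    return acc;
}

// Fixed trip count: the loop always runs P_MAX times and the guard
// skips the taps at or beyond p, so the result is unchanged.
DATA_TYPE dot_fixed_bound(const DATA_TYPE (&coeff)[P_MAX],
                          const DATA_TYPE (&samples)[P_MAX], int p) {
    assert(p >= 0 && p <= P_MAX);
    DATA_TYPE acc = 0;
    for (int j = 0; j < P_MAX; j++) {
        if (j < p) {
            acc += coeff[j] * samples[j];
        }
    }
    return acc;
}
```

The trade-off is area: the fixed-bound form instantiates hardware for all P_MAX taps even when p is small, so P_MAX should be chosen no larger than the filter orders actually needed.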

Source Code on GitHub. (https://github.com/highlevelsynthesis/ReducingIIinHLS/tree/main/ReducingIIinHLS-05)
