Compiling a software-oriented implementation of an algorithm with HLS does not usually deliver the performance expected from a hardware accelerator. Therefore, the code must be modified to achieve hardware-class performance.

Let’s consider the following code, taken from the benchmark here. According to the reference, this code “performs dot product of two matrices, which skips the operation when the weight is zero.”

```
void sparse_matrix_power(int data[200], int all_zero[100], int w[10000]) {
    int j, i, temp;
    for (j = 0; j < 100; j++) {
#pragma HLS PIPELINE
        temp = all_zero[j];
        if (temp != 1) {
            for (i = 0; i < 100; i++) {
                data[i+100] += w[i*100+j] * data[i];
            }
        }
    }
}
```

In the corresponding paper, the authors argue that this code has a dynamic behaviour that a static scheduler cannot handle efficiently, and they propose dynamically scheduled HLS instead.

Here, I am going to show that, with a suitable code modification, the static scheduler can still produce a high-performance hardware accelerator for this code.

This is the Vitis-HLS report after synthesising the original code.

As can be seen, the design takes a fixed number of clock cycles to finish, which means the static schedule throws away the benefit of skipping the computation for zero weights.
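One way to picture why the latency is fixed (a software sketch of my own, not taken from the paper or the report): under a static schedule the guard around the inner loop effectively becomes predication, so the multiply-accumulate stage is occupied for all 100×100 iterations whether or not a column is skipped.

```c
/* Hypothetical software model of the statically scheduled behaviour:
   the branch becomes a predicate on the datapath, so every iteration
   still occupies its slot in the pipeline. */
void sparse_predicated(int data[200], int all_zero[100], int w[10000]) {
    int j, i;
    for (j = 0; j < 100; j++) {
        int skip = (all_zero[j] == 1);              /* guard folded into datapath */
        for (i = 0; i < 100; i++) {
            int prod = skip ? 0 : w[i*100 + j] * data[i];
            data[i+100] += prod;                    /* slot is used either way */
        }
    }
}
```

The results are identical to the original code; only the schedule’s view of the work changes.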

The code can be synthesised at 500 MHz, i.e., a 2 ns clock period. Therefore, the design needs 10101*2 ns = 20.202 us.

Now let’s modify the code. The idea is to first gather the indices of the columns that must actually be processed, and then iterate only over those columns. The following code shows this modification.

```
void sparse_matrix_power(int data[200], int all_zero[100], int w[10000]) {
    int j, i, temp;
    int iter = 0;
    int all_zero_index[100];
    j = 0;
    // first pass: record the indices of the columns that must be processed
    for (i = 0; i < 100; i++) {
#pragma HLS PIPELINE
        temp = all_zero[i];
        if (temp != 1) {
            iter += 1;
            all_zero_index[j++] = i;
        }
    }
    int k = 0;
    // second pass: iterate only over the recorded columns
    for (k = 0; k < iter; k++) {
        for (i = 0; i < 100; i++) {
#pragma HLS PIPELINE
            j = all_zero_index[k];
            data[i+100] += w[i*100+j] * data[i];
        }
    }
}
```
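Before looking at timing, it is worth confirming that the transformation preserves the results. The following plain-C check (my own sketch; the function names are mine, not from the benchmark) restates both versions side by side. Since data[0..99] is only read and the surviving columns are processed in the same order, the two functions must produce identical output:

```c
/* Original version: test the guard on every column. */
void sparse_original(int data[200], int all_zero[100], int w[10000]) {
    for (int j = 0; j < 100; j++)
        if (all_zero[j] != 1)
            for (int i = 0; i < 100; i++)
                data[i+100] += w[i*100+j] * data[i];
}

/* Modified version: gather the surviving column indices first,
   then iterate only over those. */
void sparse_modified(int data[200], int all_zero[100], int w[10000]) {
    int idx[100];
    int iter = 0;
    for (int i = 0; i < 100; i++)
        if (all_zero[i] != 1)
            idx[iter++] = i;
    for (int k = 0; k < iter; k++)
        for (int i = 0; i < 100; i++)
            data[i+100] += w[i*100 + idx[k]] * data[i];
}
```

Feeding both functions identical inputs and comparing all 200 outputs should show no difference.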

The following figure shows the synthesis report generated by Vitis-HLS.

As can be seen, the first loop needs *100* cycles to finish, and the nested loops require *200\*iter* cycles (100 inner iterations at II=2 per surviving column). Therefore, at the *500 MHz* design clock frequency, the second implementation requires *t = (100 + 200\*iter)\*2 ns = 0.2 + 0.4\*iter us*. As *0 < iter < 100*, then *0.2 us < t < 40.2 us*.

As can be seen, the more zeros the data contains, the less time the execution takes. The modified code is much faster than the dynamically scheduled design proposed in the original paper.
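The arithmetic above can be sanity-checked in plain C. The cycle counts come from the synthesis reports discussed in this post; the helper name and the 2 ns cycle time (500 MHz) are simply restated here:

```c
/* Convert a cycle count to execution time in microseconds,
   assuming the 2 ns (500 MHz) clock used throughout this post. */
static double exec_time_us(long cycles) {
    return cycles * 2.0 / 1000.0;   /* 2 ns per cycle, ns -> us */
}
```

With this helper, the original design gives exec_time_us(10101) = 20.202 us, and the modified design gives exec_time_us(100 + 200*iter), i.e. 0.2 us at iter = 0 and 40.2 us at iter = 100.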

It is easy to reduce the initiation interval (II) of the nested loop from 2 to 1. If you are interested, you can play with the code and reduce the II.
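As a starting point, here is one plausible way to get to II=1 (a sketch under my own assumptions, not a verified Vitis-HLS result): keep the accumulator column in a local, fully partitioned buffer so that the data array is accessed at most once per pipeline slot instead of three times (two reads plus one write). The function name and pragma choices below are mine.

```c
/* Sketch only: reach II=1 by accumulating into a local buffer.
   Names and pragmas are this post's assumptions, not from the benchmark. */
void sparse_matrix_power_ii1(int data[200], int all_zero[100], int w[10000]) {
    int idx[100];
    int acc[100];
#pragma HLS ARRAY_PARTITION variable=acc complete dim=1
    int iter = 0, i, j, k;

    for (i = 0; i < 100; i++) {
#pragma HLS PIPELINE II=1
        acc[i] = data[i + 100];       /* pull the accumulator column on-chip */
    }
    for (j = 0; j < 100; j++) {
#pragma HLS PIPELINE II=1
        if (all_zero[j] != 1)
            idx[iter++] = j;          /* record columns that must be processed */
    }
    for (k = 0; k < iter; k++) {
        j = idx[k];
        for (i = 0; i < 100; i++) {
#pragma HLS PIPELINE II=1
            acc[i] += w[i*100 + j] * data[i];   /* one data[] read per slot */
        }
    }
    for (i = 0; i < 100; i++) {
#pragma HLS PIPELINE II=1
        data[i + 100] = acc[i];       /* write the results back */
    }
}
```

Functionally this is equivalent to the modified code above; whether the tool actually schedules the MAC loop at II=1 should be confirmed in the synthesis report.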

**Summary**:

- A software-based description of an algorithm usually requires a set of code modifications to achieve high performance in HLS
- The aforementioned modified code can be synthesised at a 500 MHz clock frequency, while the dynamic scheduling proposed in the original paper synthesises the code at a clock frequency of about 60 MHz
- Proper code modification can improve an HLS design by increasing parallelism and raising the achievable clock frequency.