Let’s consider the following code snippet.
    const unsigned int N = 1024;

    void dot_product(float a[N], float b[N], unsigned int n, float &acc) {
      float s = 0;
      for (int i = 0; i < n; i++) {
    #pragma HLS pipeline
        s += a[i]*b[i];
      }
      acc = s;
    }
Xilinx Vitis HLS synthesises the for-loop into a pipelined microarchitecture with an initiation interval (II) of 1, meaning a new iteration starts every clock cycle. Therefore, the whole design takes about n cycles to finish.
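To make the "about n cycles" estimate concrete, here is a minimal sketch of the usual rule of thumb for a pipelined loop: the first iteration needs the full pipeline depth, and every later iteration starts II cycles after the previous one. The function name and the depth value used below are assumptions for illustration only. The real numbers come from the synthesis report.

```cpp
#include <cstdint>

// Rough cycle estimate for a pipelined loop (a rule of thumb, not the tool's exact report):
// the first iteration takes `depth` cycles, each following iteration adds `ii` cycles.
std::uint64_t pipelined_loop_cycles(std::uint64_t trip_count,
                                    std::uint64_t ii,
                                    std::uint64_t depth) {
    if (trip_count == 0) {
        return 0;  // an empty loop contributes no cycles in this simplified model
    }
    return (trip_count - 1) * ii + depth;
}
```

For example, with n = 1024, II = 1 and a hypothetical depth of 10 cycles, the estimate is 1023 + 10 = 1033 cycles, which is why the run time is dominated by n.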
Now, let’s increase the performance by partially unrolling the loop by a factor of B.
One way is to use the HLS UNROLL pragma as follows:
    const unsigned int N = 1024;
    const unsigned int B = 32;

    void dot_product(float a[N], float b[N], unsigned int n, float &acc) {
      float s = 0;
      for (int i = 0; i < n; i++) {
    #pragma HLS UNROLL factor=B
    #pragma HLS pipeline
        s += a[i]*b[i];
      }
      acc = s;
    }
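However, synthesising this code results in a pipelined loop with II=160, which is not desirable. The unrolled loop has about n/B iterations, so with n = 1024 it needs roughly 32 × 160 = 5120 cycles, about five times slower than the original loop.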
Now let’s do that manually by using a temporary variable and array partitioning, as shown below.
    const unsigned int N = 1024;
    const unsigned int B = 32;

    void dot_product(float a[N], float b[N], unsigned int n, float &acc) {
    #pragma HLS ARRAY_PARTITION variable=a dim=1 factor=B/2 cyclic
    #pragma HLS ARRAY_PARTITION variable=b dim=1 factor=B/2 cyclic
      float s = 0;
      for (int i = 0; i < n/B; i++) {
    #pragma HLS pipeline
        float s_tmp = 0;
        for (int j = 0; j < B; j++)
          s_tmp += a[i*B+j]*b[i*B+j];
        s += s_tmp;
      }
      acc = s;
    }
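The cyclic partitioning in the code above can be pictured with a small software model. With a cyclic factor F, element k of the array is stored in bank k % F. The sketch below counts how many reads of one block of B consecutive elements land in the same bank. The helper name is hypothetical, and the link to the B/2 factor rests on an assumption: that each bank is a dual-port memory able to serve two reads per cycle.

```cpp
#include <algorithm>
#include <vector>

// Software model of cyclic partitioning: element idx lives in bank (idx % factor).
// Returns the largest number of reads that hit a single bank when one block of
// `block_size` consecutive elements is read. Assumes factor > 0.
unsigned int max_reads_per_bank(unsigned int block,
                                unsigned int block_size,
                                unsigned int factor) {
    std::vector<unsigned int> reads(factor, 0);
    for (unsigned int j = 0; j < block_size; j++) {
        unsigned int idx = block * block_size + j;  // same index as a[i*B+j]
        reads[idx % factor]++;
    }
    return *std::max_element(reads.begin(), reads.end());
}
```

With B = 32 and factor = 16, every bank is read exactly twice per block, which would fit a dual-port memory in a single cycle. A smaller factor such as 8 gives four reads per bank and would likely force a larger II.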
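Now, after synthesising the manually unrolled version, the loop II is 1. Since the outer loop runs only n/B iterations, the final design is about B times faster than the original implementation (here about 32 times faster). Note that this version only covers all elements when n is a multiple of B.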
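Before trusting the speed-up, it is worth checking in plain C++ that the blocked accumulation computes the same result as the simple loop. The sketch below is a software-only check, not HLS code. The tail loop for leftover elements and the helper names are assumptions added for illustration, and adding such a tail loop to the synthesised design may change the reported II.

```cpp
#include <vector>

constexpr unsigned int B = 32;  // block size, same value as in the post

// Reference: plain sequential accumulation.
float dot_reference(const float *a, const float *b, unsigned int n) {
    float s = 0;
    for (unsigned int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

// Blocked accumulation mirroring the manual unrolling, plus a tail loop
// (an addition, not in the original code) for the n % B leftover elements.
float dot_blocked(const float *a, const float *b, unsigned int n) {
    float s = 0;
    const unsigned int full_blocks = n / B;
    for (unsigned int i = 0; i < full_blocks; i++) {
        float s_tmp = 0;
        for (unsigned int j = 0; j < B; j++) s_tmp += a[i * B + j] * b[i * B + j];
        s += s_tmp;
    }
    for (unsigned int k = full_blocks * B; k < n; k++) s += a[k] * b[k];
    return s;
}

// Hypothetical self-check. The inputs are small integers so every partial sum
// is exact in float; with real data, compare against a tolerance instead,
// because blocking changes the order of the floating-point additions.
bool blocked_matches_reference(unsigned int n) {
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    return dot_blocked(a.data(), b.data(), n) == dot_reference(a.data(), b.data(), n);
}
```

Calling blocked_matches_reference(1000) exercises the tail loop, since 1000 is not a multiple of 32. Without the tail loop, the last 8 elements would be silently dropped.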