Abstract: This post shows how loop interchange, a classic loop optimisation technique, can improve design performance in high-level synthesis (HLS). If you would like to learn how to describe hardware in HLS, please refer here.

Loop interchange is one of the traditional optimisation techniques used by advanced compilers to improve locality of reference when accessing memory elements. In software programs running on a CPU, it can improve cache performance.
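As a rough software illustration of the locality point, the sketch below sums a 2D array in two loop orders. The names sum_column_first and sum_row_first and the sizes ROWS and COLS are hypothetical and chosen only for this example. C stores arrays row by row, so the interchanged version touches memory contiguously, which is the access pattern caches favour.

```c
#define ROWS 256 /* hypothetical sizes, for illustration only */
#define COLS 256

/* Column-first order: consecutive accesses are COLS elements apart in memory. */
long sum_column_first(int a[ROWS][COLS]) {
  long sum = 0;
  for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
      sum += a[i][j];
    }
  }
  return sum;
}

/* Interchanged order: the inner loop now walks each row contiguously. */
long sum_row_first(int a[ROWS][COLS]) {
  long sum = 0;
  for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
      sum += a[i][j];
    }
  }
  return sum;
}
```

Both functions return the same value; only the order in which memory is visited changes. Any actual speed difference depends on the CPU and the array size.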

Loop interchange can also improve the initiation interval (II) of a pipelined loop in HLS. To see this, let’s consider the following example.

#define N 128
#define M 128
void loop_interchange_example(float A[N][M], float c) {
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < M-1; j++) {
#pragma HLS pipeline
      A[i][j+1] = A[i][j] + c;
    }
  }
}

The following figures show the Vivado HLS synthesis report.

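As can be seen, the II of the pipelined nested loops is 7 and the total function latency is 113794 clock cycles. The II cannot reach 1 because the inner loop carries a dependency: iteration j writes A[i][j+1], which iteration j+1 reads, so each iteration has to wait for the previous floating-point addition to finish.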


Now let’s swap the order of the two loops, as shown in the following code.

#define N 128
#define M 128
void loop_interchange_example(float A[N][M], float c) {
  for (int j = 0; j < M-1; j++) {
    for (int i = 0; i < N; i++) {
#pragma HLS pipeline
      A[i][j+1] = A[i][j] + c;
    }
  }
}

The following figures show the Vivado HLS synthesis report for this code.

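As can be seen, the II of the pipelined nested loops is 1 and the total function latency is 16263 clock cycles. With i as the inner loop, consecutive iterations work on different rows, so no iteration needs the result of the one just before it.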
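Interchanging loops is only valid if the result stays the same. A minimal software check, assuming a plain C compiler rather than the HLS tool, could look like the sketch below. The function names original_order, interchanged_order and orders_match are hypothetical, and the HLS pragmas are left out because they do not affect the computed values.

```c
#include <string.h>

#define N 128
#define M 128

/* Loop order of the first version: j is the inner loop. */
void original_order(float A[N][M], float c) {
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < M - 1; j++) {
      A[i][j + 1] = A[i][j] + c;
    }
  }
}

/* Loop order of the second version: i is the inner loop. */
void interchanged_order(float A[N][M], float c) {
  for (int j = 0; j < M - 1; j++) {
    for (int i = 0; i < N; i++) {
      A[i][j + 1] = A[i][j] + c;
    }
  }
}

/* Returns 1 when both loop orders leave bit-identical arrays. */
int orders_match(float c) {
  static float a[N][M], b[N][M];
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
      a[i][j] = b[i][j] = (float)(i * M + j) * 0.5f; /* arbitrary test data */
    }
  }
  original_order(a, c);
  interchanged_order(b, c);
  return memcmp(a, b, sizeof a) == 0;
}
```

The two versions agree because every element A[i][j+1] is still computed from A[i][j] of the same row, and that value is always written before it is read in either order.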

This shows that the second code is 113794/16263 ≈ 7 times faster, which matches the improvement in II from 7 to 1.
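The reported latencies can be cross-checked with a back-of-the-envelope estimate. The sketch below assumes the loop nest behaves as one pipelined loop of N*(M-1) iterations and that the latency is dominated by the II times the iteration count. It ignores pipeline depth and function overhead, so it only approximates the report. The helper approx_latency is hypothetical.

```c
#define N 128
#define M 128

/* Dominant term of the pipelined latency: II * number of iterations.
   Pipeline depth and function entry/exit cycles are deliberately ignored. */
long approx_latency(long ii) {
  long iterations = (long)N * (M - 1); /* 128 * 127 = 16256 */
  return ii * iterations;
}
```

With II = 7 this gives 113792 cycles and with II = 1 it gives 16256 cycles, both within a few cycles of the reported 113794 and 16263.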