Fundamentals of High-Level Synthesis – Part 4: Dependency, Concurrency and parallelism

In the last two blogs, I explicitly talked about concurrency and parallelism. I also explained the dependency. However, the dependency concept and its relation to concurrency and parallelism need more discussion. Here, I am going to talk about this relationship.

Maximising parallelism is our ultimate goal in accelerating an algorithm using FPGA. This goal is achievable mainly due to the abundant computational and memory resources available. However, utilising parallelism is not easily accessible if enough concurrency doesn’t exist in the task description. As I mentioned in the previous post, one way of describing the concurrency in the code is using proper programming patterns such as map. However, it is not, all the time, easy to find a suitable programming pattern. In this case, the concept of dependency can be helpful. In addition, the concept of dependency can help to find the proper concurrency pattern.

Dependency is a technique that is usually used by compilers to exploit concurrency. Dependency analysis is a formal theory investigating the concurrency and order of execution among statements in a program. Directed graphs are conventional approaches to model dependencies in a program. The node in this graph represents instructions, and the links denote the dependency among instructions.

There are two types of dependencies in a piece of code: data dependency and control dependency.

Two instructions are data-dependent if the execution of one requires the results of the other.

High-Level Synthesis for FPGA: Online Courses

A data dependency graph (or dataflow graph) can show this relationship. The following figure shows the relation between two instructions S1 and S2. In this graph, the S2 instruction (or destination node) requires some data from its predecessor instruction, i.e., S1 (or source node).

Note that, in a dataflow graph, execution of the source node results in the execution of the destination node, and there is no way to stop destination node execution. This is in contrast to the control dependency.

In control dependency, the execution of a node depends on the results of another node. For example, there is a control dependency between S1, S2 and S3, as the S1 instruction controls the execution of S2 and S3.

Dependencies among instructions or code blocks are more critical when they are combined with iterative structures like for-loops. In an iterative structure, not only the dependency between instructions inside an iteration is important, but also the dependency between instruction in two or more adjacent iterations needs special attention.

High-Level Synthesis for FPGA: Online Courses

But, how can we use dependency analysis to develop a better design in HLS? I address this question with an example.

Understanding and reducing dependencies among instructions and code blocks can help programmers to develop more efficient code. Let us consider the sgemv operator from BLAS. This operator is a matrix-vector multiplication shown by the following equation, where a and b are two float constants, x and y are two vectors, and A is a matrix of size NxM.

y = a*A*x + b*y

The following code shows an implementation of this operator in Xilinx Vivado-HLS. It consists of two nested for-loop.

#define M 32
#define N 32
void sgemv_accel(float A[N][M], float x[M], float y[N], 
                float alpha, float beta) {

#pragma HLS ARRAY_PARTITION variable=A complete dim=1
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=x complete dim=1
#pragma HLS ARRAY_PARTITION variable=y complete dim=1

  for (int i = 0; i < N; i++) {
    y[i] = beta * y[i];
    for (int j = 0; j < M; j++) {
#pragma HLS UNROLL
      y[i] += alpha*x[j]*A[i][j];
    }
  }
}

The inner loop is unrolled for parallel execution, and however, the data dependency between two adjacent iterations prevents parallelism. After synthesising this function, considering the ZynqMPSoC-based Ultra96v2 Board, the latency of the inner loop is 134 clock cycles, and the latency of the entire function is 4289 clock cycles.

High-Level Synthesis for FPGA: Online Courses

It is possible to reduce the data dependency by using the loop interchange transformation. The following code shows the transformed code. The first loop initialises the y vector with the second term in the sgemv equation. The second nested loops perform the first term in the sgemv equation.

#define M 32
#define N 32

void sgemv_accel(float A[N][M], float x[M], float y[N], 
                float alpha, float beta) {
#pragma HLS ARRAY_PARTITION variable=A complete dim=1
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=x complete dim=1
#pragma HLS ARRAY_PARTITION variable=y complete dim=1

  for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
    y[i] = beta * y[i];
  }

  for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
      y[i] += alpha*x[j]*A[i][j];
    }
  }
}

As there is no data dependency between adjacent iterations of the inner-loop, it can be parallelised by unrolling the loop. After synthesising this code with Xilinx Vivado-HLS and considering the ZynqMPSoC-based Ultra96v2 Board, the latency of the inner loop would be 11 clock cycles, and the total latency is 356 clock cycles, which is 4289/356=12.05 times faster than the first implementation.

In summary, the result of every dependency in a code is poor parallelism; analyse the code, find the dependencies and resolve them for more performance.