Fundamentals of High-Level Synthesis Part 2: Concurrency vs Parallelism

Concurrency and parallelism are two main concepts in high-level synthesis (HLS) design flow that their understanding is crucial in implementing an algorithm efficiently on FPGAs.

Whereas concurrency is a concept at the level of algorithm, parallelism is a hardware-dependent concept. For example, Let’s consider the two L1 and L2 for-loops in the following snippet code. As there is not any data or control dependency between them, they are concurrent and potentially can be executed in parallel. However, their parallel execution depends on the underlying computational platform.

L1:for (int i=0; i < N; i++)
     c[i]= a[i]+b[i]
L2:for (int j=0; j < M; j++)
     d[j]= e[j]*f[j]

If there is only one ALU to perform the addition and multiplication, then these loops should be executed sequentially one after the other or their iterations can be interleaved and run serially. If there is more ALUs available, then the two loops can be performed in parallel. Therefore, the amount of parallelism extracted from a concurrent code depends on the underlying hardware resources. Note that as loop iterations are independent, the concurrency can also be explored among the iterations of each loop. The interested reader can think about different cases of parallelism in implementing the loop iterations. Fig. 2 illustrates only a few cases.

The main goal of an ideal HLS tool is transforming the concurrency in an algorithm into parallelism in the underlying hardware, automatically, without any help from designers. If there were an infinite amount of hardware resources without any overhead (or cost), then transforming the concurrency to parallelism would be a trivial task. But, because of limited computational and memory resources on the target FPGAs and their associated costs, the HLS tool should solve difficult optimisation problems, which several of them are NP-hard.

As shown in Fig. 3, constraints describe the hardware resources limitations, and the HLS tools use them as a model of the hardware (along with the cost model). Since HLS tools are not ideal, designers usually should help the compilers to do their tasks efficiently. This help is generally added to the code in two forms: Firstly, the code should follow a coding style suitable for the hardware. Secondly, designers should add compiler directives (e.g., pragmas) to the algorithm code to tell the compiler more information about the appropriate underlying hardware parallelism of the final hardware microarchitecture.

Fig. 3 Concurrency-Parallelism transformation

Note that, it is better to understand the concurrency in an algorithm and techniques to model that. Having these models in mind, designers can systematically choose an efficient coding style and proper compiler directives.

Concurrency is all about the variable dependency among statements, instructions, or blocks in an algorithm that determines they can perform together. This dependency comes in two different forms: data dependency and control dependency.

Data dependency: If two statements access the data in the same variable (whether for reading or writing), then there is a data dependency between them. This dependency can have different forms. The simplest one is when one statement writes to a variable, and another read that variable. For example, consider the following instructions in an algorithm code. The first instruction writes into the x variable. And the second one reads the result. This dependency is called flow (or true) dependency and prevents the two statements from performing at the same time. Therefore, they are generally not concurrent.

…
 s1: x = b + c;
 s2: y = a + e;
…

The data dependency can be modelled by a graph illustrated in Fig. 5, which shows the execution flow of statements. As can be seen, statement s2 must be run after s1.

Control dependency: If the execution of two statements is controlled by a third statement such as a conditional if, then two statements may not perform concurrently. The following code is a simple example.

…
 s1: if (a < 2) {
 s2:   x = y + z;
 s3: } else {
 s4:   t = p + t
 s5: }
 s6: z=x+t ;
…

Although the s2 and s3 statements are independent, they cannot be concurrent as always only one of them should be executed. In this case, the s1 instruction controls this selection.