In the previous blog, I explained the difference between concurrency and parallelism concepts in high-level synthesis. Whereas concurrency is a concept at the algorithmic or functional level, parallelism is a realisation of the concurrency at the hardware level. Software programmers employ compiler tools to transform the algorithmic description into a representation suitable for hardware. Accordingly, in the high-synthesis context, the role of the HLS compiler is transforming the concurrency to parallelism to maximise the efficiency, which can be throughput, latency, energy consumption, among others. Therefore, parallelism brings the performance, and concurrency enables the parallelism exploitation. The desired performance cannot be achieved if designers cannot describe the concurrency adequately to enable efficient parallelism. Now, how does this transformation work? And how can designers describe the concurrency to facilitate this transformation? To answer these questions, I try to explain different types of concurrency and parallelism patterns and suitable transformations among them.
How can we describe concurrency?
Algorithmic patterns are the fundamental techniques to describe and model the concurrency in an algorithm. Traditionally, patterns are defined as recurrent structures that can be used to describe, model, predict, solve or even implement problems systematically.
Programming algorithmic patterns such as map, reduce, scan, stencil, just to name a few, can be used to describe the concurrency in an algorithm. In this post, I only consider the map pattern and try to explain how to implement that efficiently using HLS.
Map: This pattern receives an input dataset, applies a function (or a few functions) to its elements independently, and generates an output dataset. There is no data or control dependency among these functions. Therefore, they can be easily modelled by a data dependency graph. Although the concurrency in this pattern seems trivial, finding the most efficient parallel implementation considering the hardware constraints and the resource cost functions can be tricky and challenging.
For example, let’s consider the grey image thresholding as an example which is the most straightforward image segmentation algorithm. The naïve algorithm replaces each image pixel with black colour if its density is less than a fixed constant threshold T, otherwise with white colour. This algorithm can be modelled by a map pattern in which the following function is applied to each pixel independently, where I(x,y) is the input pixel density at the (x,y) location.
The corresponding C code is as follows.
for (int x = 0; x < N; x++) for (int y = 0; y < M; y++) if (I[x][y] > T) O[X][y]= 0; else O[x][y]=255;
The following graph depicts the map pattern that applies the thr function to each element in the input dataset and generates an output element.
How can we describe parallelism?
Parallelism means running statements together. If two or more consecutive statements are independent, then the compiler schedules them to be executed simultaneously if enough computation and memory resources are available. For successive individual statements, the programmer does not need to do anything except writing the code to increase the concurrency among them. In loop structures, the situation is a bit complex, which is explained in the sequel.
Two main techniques implement parallelism in iterative structures, such as for loop: loop unrolling and loop pipelining.
Loop unrolling is a common loop transformation technique in compilers to exploit the parallelism among loop iterations.
This technique enables the compiler to schedule loop iterations together. In the ideal case, a complete loop unrolling can provide maximum parallelism if unlimited resources are available. However, partially unrolling loops are desirable because of limited memory ports and computational resources. Finding the optimal number of loop unrolling is a challenge that varies for each problem.
Generally, the number of unrolled iterations (also known as loop unrolling factor) depends on the concurrency among loop iterations and the number of resources available in the underlying hardware.
Let’s consider the map concurrency pattern explained earlier. As all iterations are independent or there is the maximum concurrency, the number of FPGA resources defines the maximum loop unrolling factor that improves the performance. The number of memory ports available for reading and writing is usually one of the main factors restricting the map pattern’s parallelism. If the memory has p ports for reading and p ports for writing, then unrolling the inner loop of the image thresholding code with a factor of p improves the performance by a factor of p. Any unrolling factor more than p does not impact the performance. Note that there is a direct relationship between the number of memory ports and the maximum number of loop unrolling that can improve performance in the map pattern.
Loop pipelining is another technique to parallelise loop iterations. In this technique, a loop iteration is divided into multiple phases, and iterations are executed, having some overlap among these phases.
Let’s consider the image thresholding example if we divide each iteration into three phases: read I, run thr() function and write O. Then the pipeline timing diagram would be as follows:
The interested reader can think about using both loop-pipelining and partial loop-unrolling with the factor of p for the image thresholding example.