Map Pattern in High-Level Synthesis

Data parallelism is often one of the first parallel execution patterns implemented in hardware and processor-based systems. In its simplest form, also called the map pattern, a single function is independently applied to the elements of a data set. Because there are no dependencies among the different function instances applied to the data set, they can execute in parallel.

The map pattern is one of the most common parallel patterns and appears in many algorithms. It applies a function (usually known as an elemental function) to individual, independent input data elements and generates a set of outputs.

Overview of the map pattern

The map pattern can be written as y = map(x, f): it receives an input vector (or array) of elements, denoted by x, and a function f that is applied to the input vector element by element, generating an output vector (or array) of elements, denoted by y.

The map pattern can be implemented in a serial programming paradigm using an iterative structure such as a for-loop. The iterations of the map pattern are independent and can be executed without any knowledge of each other.

Describing a map pattern in HLS is straightforward: a simple loop statement captures it. However, optimising the synthesis process can be challenging.

for (int i = 0; i < n; i++)
  y[i] = f(x[i]);
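
As a point of reference, a complete synthesisable version might look like the following sketch; the elemental function f (a simple multiply-add here) and the array size N are placeholders chosen only for illustration.

#define N 1024

// Hypothetical elemental function: any side-effect-free computation will do.
static int f(int v) {
  return 3 * v + 1;
}

// Top-level map: apply f to every element of x and write the result to y.
void map_top(const int x[N], int y[N]) {
  for (int i = 0; i < N; i++) {
    y[i] = f(x[i]);
  }
}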

HLS optimisation techniques can define the execution model or determine the type and number of resources used in the hardware implementation. The execution model can be sequential, parallel, pipelined, or hybrid.

Sequential

An HLS tool can synthesise the for-loop directly. Each iteration comprises three operations: reading data from the input memory, executing the function, and writing the result to the output memory. The generated hardware repeats these three operations sequentially until all iterations have completed.

Sequential execution model of the map pattern
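
The sketch below shows this baseline version, assuming Vitis HLS-style pragmas and reusing f and N from the earlier sketch; PIPELINE off is used only to pin down the purely sequential schedule described above, since recent tools may otherwise pipeline the loop automatically.

// Sequential baseline: one iteration at a time (read, compute, write).
void map_sequential(const int x[N], int y[N]) {
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE off   // keep the plain sequential loop schedule
    y[i] = f(x[i]);
  }
}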

Parallel

Since the iterations of a map pattern are independent, they can, in theory, be executed in parallel. This implementation can be achieved by adding the UNROLL synthesis directive to the code. However, the number of resources used in a fully parallel implementation is directly proportional to the data set size, so it is not feasible for large data sets due to limited hardware resources.

Parallel execution model of the map pattern
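
A sketch of the fully unrolled version, again assuming Vitis HLS-style pragmas; ARRAY_PARTITION is added because full unrolling only helps if every element can be accessed in the same clock cycle.

// Fully parallel map: replicate the loop body for every element.
void map_parallel(const int x[N], int y[N]) {
#pragma HLS ARRAY_PARTITION variable=x complete dim=1
#pragma HLS ARRAY_PARTITION variable=y complete dim=1
  for (int i = 0; i < N; i++) {
#pragma HLS UNROLL   // instantiate N copies of f, one per iteration
    y[i] = f(x[i]);
  }
}

As noted above, completely partitioning and unrolling a large array is rarely practical; the sketch is only meant to show where the directives go.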

Pipeline

One of the main challenges in fully or partially unrolling the map pattern in HLS is the memory-port limitation, especially when the data must be read from the main memory, where the memory-partitioning technique cannot be applied. In this case, pipelining the for-loop that implements the map pattern is a promising technique.

Pipeline execution model of the map pattern
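
A sketch of the pipelined version, assuming Vitis HLS-style pragmas; the II=1 target is only a request, and with a single-port memory the tool may settle for a larger initiation interval.

// Pipelined map: overlap the read, compute and write of consecutive iterations.
void map_pipelined(const int x[N], int y[N]) {
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1   // start a new iteration every cycle if memory ports allow
    y[i] = f(x[i]);
  }
}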

Hybrid

If the memory can provide multiple read and write ports, then a hybrid pipelining and unrolling technique can provide the maximum performance achievable for implementing a map pattern in HLS.

Hybrid execution model of the map pattern
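
A sketch of the hybrid version, assuming Vitis HLS-style pragmas and an illustrative unrolling factor of 4; cyclic partitioning by the same factor supplies the extra memory ports that the partially unrolled body needs.

// Hybrid map: process 4 elements per loop iteration and pipeline the loop.
void map_hybrid(const int x[N], int y[N]) {
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=y cyclic factor=4 dim=1
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1    // pipeline the partially unrolled body
#pragma HLS UNROLL factor=4  // four copies of f work side by side
    y[i] = f(x[i]);
  }
}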

Summary

The following table shows the map pattern latency, in clock cycles, under the different optimisation techniques. Here, n is the number of iterations, l is the loop iteration latency, II is the pipelined loop initiation interval, and p is the unrolling factor.
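
Assuming the usual cost model for these schedules (and an unrolling factor p that divides n), the latencies are approximately:

Optimisation                       Latency (clock cycles)
Sequential                         n * l
Pipelined                          (n - 1) * II + l
Fully parallel (complete unroll)   l
Hybrid (unroll by p + pipeline)    (n/p - 1) * II + l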

Online Courses on HLS for FPGA

If you are interested in learning high-level synthesis for FPGA, please refer to my online course.
