Goal General view of synthesis tools
Credit  This work has been done under the ENPOWER project (funded by EPSRC) at University of Bristol.

Fig. 1 shows the hierarchy of an advanced FPGA synthesis tool set. It consists of three main parts: Logic Synthesis (LS), High-Level synthesis (HLS) and OpenCL.

Synthesis-tools-small
Fig. 1  FPGA synthesis tool set

Logic Synthesis

At the bottom of the tool set shown in Fig.1 is logic synthesis which traditionally gets a design description in an HDL language and transforms it into a network of gates. Different technology-independent optimisation techniques are used to optimise this network of gates. Technology mapping techniques implement the network of gates by utilising primitive elements in the FPGA such as look-up tables (LUTs), registers and memories. Then, they are placed and routed for a given FPGA and finally a bitstream is generated for the FPGA configuration. At this step, the final design resource utilisation and clock frequency are determined.

High-Level Synthesis

On the top of the logic synthesis, HLS tools receive the design description in a high-level language such as C/C++ and generate the corresponding HDL code which will be synthesised later by LS to generate the bitstream. HLS tools should add timing and parallelism to the C/C++ un-timed and unparalleled code. Therefore, they heavily utilise variable analysis (e.g., load/store analysis) to extract dependencies among variables in order to exploit the concurrency among instructions. Then these concurrencies are realised by the full-parallel or pipelined execution of instructions. Loop structures in the high-level language are the main part of the codes that HLS tools apply the dependence analysis. In the full-parallel execution of a loop, all iterations are executed in parallel. This technique also known as loop unrolling potentially provides the minimum latency for the given loop. However, different types of dependencies among loop iterations and the lack of sufficient parallel ports to provide access to the data located in memories or registers restrict the full-parallel execution. In addition, this technique is not scalable as it requires a large amount of resources to implement iterations independently. To solve the limitations of this method, pipelining is a powerful technique that is used by HLS tools and is applicable to loop iterations and functions. In this technique, loop iterations are executed by having overlap cycles (or stages). The distance between two consecutive iterations (in terms of the clock cycles) in a pipelined execution is called initiation interval (II).

The following figure shows a simple loop and its corresponding pipelined timing diagram. If we assume the array a is located in a memory with only two ports to read from, then in each cycle, at most two elements can be read which means two iterations cannot be executed in parallel. This resource constraint dictates at least one cycle distance between two consecutive iterations. Therefore, the initiation interval of the pipelined loop in this figure is 1. Note that data dependencies among variables in different iterations can induce a specific distance between iterations.

pipelined-loop
Fig. 2 Pipelined loop

In order to help HLS tools to facilitate the dependence analysis and extracting the parallelism, compiler directives are provided by the tools that designers can use to optimise their codes. Designers should be familiar with these directives and use them efficiently to provide a high-performance design.

FPGA OpenCL

Software engineers have proposed a formal description of parallelism in an algorithm by introducing the OpenCL framework. Hardware designers have realised the potential of this framework to intensify the effectiveness of their HLS tools. Therefore, as shown in Fig. 1, they proposed another layer of synthesis techniques on top of the HLS tools to integrate the formal parallelism described in the OpenCL code into their tool chain.

FPGAs allow the designer to design their own hardware architectures for kernels to increase the performance and reduce the power consumption. In contrast to GPUs or multi-core CPUs in which the programmer should develop their own code to take most of the predefined hardware architecture, in the FPGA environment designers can choose and define an architecture suitable for each individual OpenCL kernel. This gives a good opportunity to the FPGA OpenCL framework to beat GPU performance in some applications that fit the features available in the FPGA. This paper reveals the effectiveness of this architecture design during code development for the histogram operation.