With the emergence of compute-intensive applications in areas such as Artificial Intelligence (AI), Machine Learning (ML), Convolutional Neural Networks (CNNs), the Internet of Things (IoT), image processing, computer vision, and advanced driver-assistance systems (ADAS), to name a few, the demand for application-specific accelerators has risen. The main goal of these accelerators is to improve performance by reducing latency and increasing throughput. However, reducing energy consumption is also a key factor for using accelerators in today's embedded systems, which commonly draw their power from batteries. To address this demand, designers have been exploring three computational platforms: FPGAs, GPUs, and ASICs.

GPUs are hardware platforms that have been adapted for accelerating compute-intensive parallel algorithms, which cover most of the new applications. Availability, easy programming, and easy debugging are the main benefits of these platforms. However, high energy consumption is the main drawback that prevents their use in energy-critical embedded systems. Therefore, researchers are working to make them energy-efficient for such systems. A sound track record of such research and development products is the Nvidia Jetson family.

ASICs have recently been the main focus of the industry for providing AI-related hardware accelerators. Examples are the Intel® Movidius™ Neural Compute Stick and the Google Coral USB Accelerator. Although these accelerators have been successful in reducing energy consumption, they are not as versatile as GPUs.

FPGAs are hardware platforms that provide high performance, low energy consumption, and versatility. Traditionally, FPGAs were reconfigurable hardware platforms for prototyping and low-level controller design. However, with significant advancements in new FPGA architectures (especially the integration of floating-point operator modules and embedded hardware processors), their use as computing accelerators has become more popular.

Although FPGAs can potentially deliver high-performance accelerators, programming them is not straightforward. Conventionally, hardware description languages (HDLs) such as VHDL or Verilog and their related design flows were the only way to design effectively for FPGAs. The design complexity associated with HDLs has prevented FPGAs from becoming ubiquitous. The main reason behind this is the cycle accuracy required by synthesizable HDLs. To support cycle accuracy, designers must learn specific design methodologies, techniques, and templates, along with an in-depth knowledge of hardware architectures. This hardware-close implementation style makes the design flow tedious, error-prone, and hard to debug. The essential technique for overcoming these problems is removing cycle accuracy from the input description by raising the design level from the register-transfer level (RTL) to the functional level and using high-level languages such as C/C++ instead of HDLs. This approach leaves the task of generating the corresponding cycle-accurate HDL to compilers, known as high-level synthesis (HLS) tools. Figure 1 depicts the relationship between a typical HLS compiler and the low-level compilers.

Modern HLS tools can efficiently transform a high-level description into equivalent cycle-accurate HDL. However, they still need some help from designers, who must augment the algorithm code with compiler directives to achieve a high-performance implementation.

Figure 1 FPGA design flow

Let us take the following dotProduct function as an example to clarify the impact of compiler directives. Also, let us assume the latency of each addition or multiplication is three clock cycles.

void dotProduct(float a[N], float b[N], float &c) {
  float d = 0;
  for (int i = 0; i < N; i++) {
    d += a[i] * b[i];
  }
  c = d;
}

An HLS tool can generate the corresponding HDL code for this naïve software implementation of the dotProduct function. However, all the loop iterations run sequentially. As each iteration requires two read operations that can be performed in parallel, followed by a multiplication and then an addition, it takes l = (1 + 3 + 3) = 7 clock cycles to complete, as shown in Figure 2. Considering this timing, the latency of the function would be about N*l clock cycles, which means 71.68 μs if N = 1024 and f = 100 MHz (the design frequency).

Figure 2 One Loop iteration timing

Now, if we add a pipeline directive to the loop, as shown in the following code, the HLS compiler uses a pipelined microarchitecture to implement the loop; the resulting timing diagram is shown in Figure 3. The pipelined microarchitecture allows loop iterations to execute with some overlap, in parallel, if enough computing and memory resources are available and the overlap does not violate the data dependencies within and among iterations.

void dotProduct(float a[N], float b[N], float &c) {
  float d = 0;
  for (int i = 0; i < N; i++) {
#pragma HLS pipeline
    d += a[i] * b[i];
  }
  c = d;
}
Figure 3 Pipelined loop timing diagram

In this example, the loop-carried dependency (LCD) on variable d (an LCD is a data dependency among iterations) results in a three-clock-cycle gap between the executions of two consecutive iterations. This timing interval is called the initiation interval (II), which is the main factor determining the performance of a pipelined loop. In the ideal case, the II should be 1 for maximum performance. Here, if N = 1024 and f = 100 MHz, the loop requires ((N-1)*II + l) clock cycles to finish, which means 30.76 μs, about 2.33 times faster than the naïve implementation.

The necessity of adding compiler directives to the C/C++ description reduces the adoption of the HLS design flow by software engineers. To alleviate this impact, framework-specific synthesis (FSS) approaches have been introduced by academia and industry, considering frameworks such as OpenCL, CUDA, and P4, among others. These new high-level descriptions reduce the number of compiler directives required for a high-performance implementation.

Here, I leave a question for interested readers: what is the minimum number of clock cycles required to implement the above dotProduct function if the a and b vectors are stored in the FPGA BRAMs? (You can consider the Ultra96 FPGA board as the target hardware.)