The final project in my online course “Function Acceleration on FPGA with Vitis-Part 1: Fundamental” recently won the September Project Hero prize in the Big Xcellent Adventure with Xilinx. Here I briefly explain the key technique behind such a project.
FPGAs provide unique features for running compute-intensive algorithms. These features include lower energy consumption and higher performance compared to CPUs and GPUs. In contrast to CPUs and GPUs, which use a fixed hardware architecture, FPGAs can provide application-specific hardware structures, which is the main reason behind the lower energy consumption and higher performance.
However, using traditional approaches to map a software-oriented algorithm onto FPGAs is a tedious process. High-Level Synthesis (HLS) is a promising technique for addressing this challenge. This design technique converts an algorithm written in C/C++ into its equivalent HDL code, which can later be synthesised into an FPGA bitstream.

The Xilinx Vitis software tool enables the HLS design flow for various applications and FPGA platforms.
Vitis
The Xilinx Vitis Unified Software Platform provides an environment to efficiently map complex algorithms described in C/C++ onto FPGAs. An application in this environment consists of two main parts: the host and the kernels.

The host program uses OpenCL APIs to communicate with the kernels. The kernels can also communicate with each other using the OpenCL APIs.
We can use the Vitis development environment to describe both the host program and the kernels. Vitis uses the V++ compiler to synthesise the kernel code into the corresponding RTL hardware. Note that V++ is the Xilinx compiler that drives Vitis HLS and Vivado. The Xilinx Vivado toolset is then used to integrate, or link, the generated RTL hardware into the underlying hardware platform and generate the FPGA bitstream. Finally, Vitis uses g++ to compile the host program.
FPGA-Based Embedded Systems
Vitis technology targets FPGA hardware platforms, such as the Alveo™ Data Center accelerator cards and Versal, Zynq® UltraScale+™ MPSoC and Zynq 7000 SoC-based embedded system platforms. Here, let us consider the Zynq UltraScale+ MPSoC embedded system.
Zynq UltraScale+ consists of two main parts: the Programmable Logic (PL) and the Processing System (PS). The system requires a DDR memory to store data and programs. There are six memory ports between the PL and the PS, and each port is 128 bits wide. The memory controller in the PS shares the DDR4 memory between the PS and these six ports. The DDR4 memory bandwidth is around 38.4 GByte/s.
As mentioned before, an application consists of two parts: the host and the kernels. The host program runs on the PS, whereas the PL executes the kernels. The performance of an accelerator running on the PL depends on how efficiently it uses the hardware resources. For example, utilising the maximum memory bandwidth available in the system can lead to a high-performance task on the FPGA.

Performance
One of the main goals of function-acceleration in HLS is utilising the maximum memory bandwidth available on the FPGA.
Let us consider the HLS code snippet shown in the following figure to understand the memory bandwidth utilisation. The read function contains a for-loop to transfer the float data elements of the A vector located in the DDR4 memory into the x_local variable located in the BRAM inside the FPGA.
The HLS tool synthesises this for-loop into a circuit that uses a burst data transfer scheme to read data from the DDR4 memory. That means one data element can be transferred to the BRAM in each clock cycle.
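A read function of the kind described above might look like the following minimal sketch. The signature, the N_MAX bound and the PIPELINE pragma are assumptions made for illustration, not the course's actual code. A regular C++ compiler ignores the HLS pragma, so the function can also be checked in plain software.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical upper bound on the vector length (assumption, not from the post).
constexpr std::size_t N_MAX = 2048;

// Copies n floats from A into x_local.
// On the target, A would live in DDR4 and x_local in on-chip BRAM.
void read(const float *A, float x_local[N_MAX], std::size_t n) {
    assert(n <= N_MAX);
read_loop:
    for (std::size_t i = 0; i < n; i++) {
// Assumed pragma: asks the HLS tool to start one loop iteration per clock cycle.
#pragma HLS PIPELINE II=1
        // Sequential addresses are what allow the tool to infer a burst read.
        x_local[i] = A[i];
    }
}
```

The loop accesses A at consecutive addresses with no data-dependent control flow, which is the access pattern a burst transfer relies on.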
Now let us assume that n is 2048 and f is 150 MHz. As the design clock frequency is 150 MHz, its period is 6.67 ns. The timing diagram shows the burst data transfer. Note that, for the sake of simplicity, I have ignored the overhead of the burst protocol here.
In each clock cycle, one floating-point value is read from the main memory. Therefore, reading the whole array takes 2048 clock cycles, or about 13.65 µs.
Now let us calculate the memory bandwidth. Bandwidth is defined as the number of bytes transferred per second. In this example, one floating-point number, which consists of four bytes, is read from memory in each clock cycle, so the bandwidth usage is 4 bytes × 150 MHz = 600 MByte/s.
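The arithmetic above can be reproduced with a few helper functions. This is only a sketch under the same simplifying assumption used in the text (burst protocol overhead ignored); the function names are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Clock period in seconds for a clock frequency given in Hz.
double clock_period_s(double f_hz) {
    assert(f_hz > 0.0);
    return 1.0 / f_hz;
}

// Time to transfer n elements when one element moves per clock cycle.
double transfer_time_s(std::size_t n, double f_hz) {
    return static_cast<double>(n) * clock_period_s(f_hz);
}

// Bandwidth in bytes per second when bytes_per_cycle bytes move per clock cycle.
double bandwidth_bytes_per_s(unsigned bytes_per_cycle, double f_hz) {
    return static_cast<double>(bytes_per_cycle) * f_hz;
}
```

With n = 2048 and f = 150 MHz, transfer_time_s(2048, 150e6) gives roughly 13.65 µs, and bandwidth_bytes_per_s(4, 150e6) gives 600 MByte/s, matching the figures in the text.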

Here, the design uses only 32 bits of one 128-bit memory port. To increase the performance and provide faster data transactions, we should use the full bit-width of all memory ports available on the Zynq MPSoC. In addition, these data transactions should run in parallel with the other tasks in the application to minimise the total execution time.
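One way to picture the first of these ideas is to read a full 128-bit word, holding four floats, in each loop iteration. The sketch below is an illustration under assumptions rather than the course's implementation: Word128, read_wide and N_MAX are hypothetical names, and a plain struct stands in for the wide data types an HLS tool would normally provide.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t WORD_FLOATS = 4;  // 128 bits / 32 bits per float
constexpr std::size_t N_MAX = 2048;     // hypothetical upper bound on the vector length

// Hypothetical 128-bit word: four packed floats, matching one port's width.
struct Word128 {
    float lane[WORD_FLOATS];
};

// Copies n floats (n must be a multiple of 4) from A into x_local,
// fetching one 128-bit word per outer-loop iteration.
void read_wide(const Word128 *A, float x_local[N_MAX], std::size_t n) {
    assert(n <= N_MAX && n % WORD_FLOATS == 0);
    for (std::size_t w = 0; w < n / WORD_FLOATS; w++) {
// Assumed pragma: one outer iteration (one wide word) per clock cycle.
#pragma HLS PIPELINE II=1
        Word128 word = A[w];
        for (std::size_t k = 0; k < WORD_FLOATS; k++) {
            x_local[w * WORD_FLOATS + k] = word.lane[k];
        }
    }
}
```

Under the same simplification as before, moving 16 bytes per cycle at 150 MHz corresponds to 2.4 GByte/s per port, or 14.4 GByte/s if all six ports were driven this way. In real hardware the on-chip array would also have to accept four writes per cycle, which typically means partitioning it across several memories.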
Applying these techniques when implementing a complex application in Vitis is the key to developing a high-performance design that can compete with its rivals on CPUs and GPUs.
If you are interested in learning some of these techniques, please refer to my online course, titled:
Function Acceleration on FPGA with Vitis-Part 1: Fundamental (Embedded System Accelerators)