The Xilinx Vitis toolset uses the OpenCL programming model to support heterogeneous computing.
In this model, the application code is divided into two parts: host code and kernels.
Kernels are the tasks that should be accelerated by the underlying FPGA platform; therefore, kernel code executes on the FPGA.
The host code runs on the processor (x86 or ARM) and is responsible for preparing data and calling the kernels running on the FPGA.
Here, I consider the ZynqMP-SoC platform provided by the Ultra96v2 board. Therefore, the host processor is a 64-bit ARM, and the FPGA is the ZynqMP PL. These two components are connected through a set of AXI4 interfaces.
The kernel code can be written in C/C++, RTL, or OpenCL.

There are three kernel execution modes: sequential, pipelined, and free-running.

In the sequential mode, the host program starts the kernel by calling the proper API. The kernel starts its execution and, when it has finished its task, informs the host program. Only after that can the kernel be started again. (Figure a)

In the pipelined mode, the host calls the kernel to start. Whenever the kernel can accept new data, the host can launch a new instance of the kernel, which runs along with the previous one in a pipelined fashion. (Figure b)

In the free-running mode, the kernel starts running as soon as the FPGA is programmed. The kernel runs continuously alongside the host program, and the availability of data synchronises the data transactions between the host and kernel code. (Figure c)

Example: Let’s consider image thresholding as a kernel and map it onto the ZynqMP FPGA using the sequential kernel execution mode.

Kernel code: I am using C/C++ to describe the kernel. The following code reads the input image, applies the thresholding formula to each pixel, and writes the result to the output image.

extern "C" {
void image_thresholding (
    unsigned char input_image[N_MAX*M_MAX],
    unsigned char output_image[N_MAX*M_MAX],
    unsigned int  n,
    unsigned int  m,
    unsigned int  threshold,
    unsigned int  maxVal) {
  unsigned char input_pixel;
  unsigned char output_pixel;
  for (unsigned int i = 0; i < n*m; i++) {
    input_pixel = input_image[i];
    output_pixel = (input_pixel > threshold) ? maxVal : 0;
    output_image[i] = output_pixel;
  }
}
}

This kernel code can be synthesised by the Vitis HLS tool.

Each loop iteration reads a pixel from memory, performs the thresholding function, and writes the result back to memory.

As the code contains a simple for-loop, the tool automatically applies the loop pipelining optimisation. The following diagram shows the pipelined execution of the loop iterations. The resulting initiation interval (II) is one, which means the kernel can accept one pixel per clock cycle. Therefore, the kernel execution time can be formulated as t = ((n-1)·II + l)·T + O, where n is the number of pixels in the input image, II = 1, l is the loop latency (4, based on the Vitis HLS report), T is the design clock period, and O is the overhead caused by the runtime system and our coding style.

Host code: The OpenCL host code running on the ARM Cortex-A53 consists of 10 main parts.

1- Preparing the input data
2- Detecting the accelerator device
3- Creating and configuring the context
4- Creating the input/output buffers
5- Setting the kernel arguments
6- Configuring the data transfer from the input buffers to the device
7- Calling the kernel to execute
8- Configuring the data transfer from the output buffers to the host
9- Waiting for the kernel to finish its task
10- Checking the result data

The following code shows these 10 steps.

int main(int argc, char* argv[]) {
  int status = 0;

//--1- Preparing the input data
  Mat src_image;
  Mat grey_image;
  src_image = imread("input.jpg");  // example file name; use your own image here
  if (src_image.empty()) {
    cout << "Could not open image" << endl;
    return EXIT_FAILURE;
  }
  cvtColor(src_image, grey_image, cv::COLOR_BGR2GRAY);
  Mat dst;
  dst = grey_image.clone();
  unsigned int threshold_type = 0;
  unsigned int threshold_value = 128;
  unsigned int max_binary_value = 255;

  unsigned int DATA_SIZE = grey_image.rows * grey_image.cols;
  size_t size_in_bytes = DATA_SIZE * sizeof(unsigned char);
  std::cout << " size_in_bytes = '" << size_in_bytes << "'\n";

//--2- Detecting the accelerator device
  if (argc != 2) {
    std::cout << "Usage: " << argv[0] << " <xclbin>" << std::endl;
    return EXIT_FAILURE;
  }
  char* xclbinFilename = argv[1];

  std::vector<cl::Device> devices;
  cl::Device device;
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  bool found_device = false;
  for (size_t i = 0; (i < platforms.size()) && (found_device == false); i++) {
    cl::Platform platform = platforms[i];
    std::string platformName = platform.getInfo<CL_PLATFORM_NAME>();
    if (platformName == "Xilinx") {
      platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
      if (devices.size()) {
        device = devices[0];
        found_device = true;
      }
    }
  }
  if (found_device == false) {
    std::cout << "Error: Unable to find Target Device" << std::endl;
    return EXIT_FAILURE;
  }

//--3- Create and configure the Context
  cl::Context context(device);
  cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
  std::cout << "Loading: '" << xclbinFilename << "'\n";
  std::ifstream bin_file(xclbinFilename, std::ifstream::binary);
  bin_file.seekg(0, bin_file.end);
  unsigned nb = bin_file.tellg();
  bin_file.seekg(0, bin_file.beg);
  char* buf = new char[nb];
  bin_file.read(buf, nb);
  cl::Program::Binaries bins;
  bins.push_back({buf, nb});
  cl::Program program(context, devices, bins);
  cl::Kernel krnl_image_thresholding(program, "image_thresholding");

//--4- Creating the input/output buffers
  cl::Buffer buffer_in(context,  CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR,
                       size_in_bytes, grey_image.data, NULL);
  cl::Buffer buffer_out(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                        size_in_bytes, dst.data, NULL);

//--5- Set the kernel arguments
  int narg = 0;
  krnl_image_thresholding.setArg(narg++, buffer_in);
  krnl_image_thresholding.setArg(narg++, buffer_out);
  krnl_image_thresholding.setArg(narg++, grey_image.cols);
  krnl_image_thresholding.setArg(narg++, grey_image.rows);
  krnl_image_thresholding.setArg(narg++, threshold_value);
  krnl_image_thresholding.setArg(narg++, max_binary_value);

//--6- Configure the data transfer from input buffers to the device
  q.enqueueMapBuffer(buffer_in, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes);

//--7- Call the kernel to execute
  q.enqueueTask(krnl_image_thresholding);

//--8- Configure the data transfer from the output buffers to the host
  q.enqueueMapBuffer(buffer_out, CL_TRUE, CL_MAP_READ, 0, size_in_bytes);

//--9- Wait for the kernel to finish its task
  q.finish();

//--10- Check the result data
  imwrite("grey_threshold.jpg", dst);
  cout << "Created grey image" << endl;

  return status;
}

The following figure shows the resulting RTL design in Vivado. As can be seen, our kernel IP is connected to the HP0 port. As the HP0 port has two separate channels for read and write, the kernel can read and write data from/to the DDR memory simultaneously in a streaming fashion.

If we consider an 800×502 image and a design frequency of 100 MHz, then, based on the formula mentioned earlier, the kernel execution time would be t = ((800×502−1)×1+4)×10 ns + overhead = (4,016,030 + overhead) ns.

However, the execution time measured after running the code on the Ultra96v2 board is 6.099 ms, which means the overhead is 2.08297 ms, or 34.15% of the total. In future blogs, I will try to reduce this overhead and increase the kernel speed by increasing the kernel's bandwidth utilisation.