Goal: Implementing a large vector addition on an FPGA
Approach: Stream computing
Benefits: Utilising high memory bandwidth
Credit: This work was done under the ENPOWER project (funded by EPSRC) at the University of Bristol.


Vector addition is one of the simplest tasks for demonstrating how efficiently a parallel implementation technique can utilise the memory bandwidth provided by a platform.

An implementation of vector addition has three main stages: reading data from memory, performing the addition, and writing the result back to memory. These three stages are explained below using a simple implementation.

The following code shows a simple implementation of vector addition:

__kernel void __attribute__ ((reqd_work_group_size(2048, 1, 1)))
vector_addition(__global float* arrayA, __global float* arrayB, __global float* arrayC) {
    int globalIndex = get_global_id(0);                              // this work-item's index in the 1D NDRange
    arrayC[globalIndex] = arrayA[globalIndex] + arrayB[globalIndex]; // one addition per work-item
}

Phase 1: Reading data

In this example, the data is stored in global memory and each element is accessed by its index in the NDRange space, which here has only one dimension.

OpenCL Global Memory

Global memory is accessible to both the OpenCL host and the device. The host is responsible for managing (allocating and deallocating) this memory. At any given time, only the host or the device has access to the global memory. First, the host creates the global memory and transfers the data from the host memory space into it; it then relinquishes control while the device runs the corresponding kernel, during which the device has full read/write access to the global memory. When the kernel execution has finished, the host takes back control and can read any results.

In the Xilinx OpenCL platform, the global memory is part of the device memory and is divided into off-chip and on-chip global memory.

Accessing individual data items in the global memory is a slow process and can cause bottlenecks in kernels that access the data frequently. There are three main techniques to address this issue:

  • Accessing a group of data items at once, for example by using the OpenCL vector data types
  • Transferring the data using the burst transfer protocol, which is possible inside kernels via the async_work_group_copy OpenCL function
  • Using streaming data transfer for a large data set whose elements are each used once and in order by the kernels

Phase 2: Computation

Line 4 of the OpenCL code above performs the computation, which is a simple addition. However, performing multiple additions in parallel can improve performance. This can be achieved by correctly configuring the OpenCL NDRange, as the kernel is executed in parallel over the NDRange.

Phase 3: Writing data

The discussion is much the same as for phase 1: writing to global memory has the same latency and bandwidth considerations as reading from it.

As reading/writing data from/to the global memory is the bottleneck in this example, I focus on optimising the data transfer by using the streaming data transfer scheme along with the OpenCL vector type. The following code shows this optimisation.

// FIFO pipes connecting the three kernels; xcl_reqd_pipe_depth sets the FIFO depth.
pipe DATA_TYPE16 pa __attribute__((xcl_reqd_pipe_depth(PIPE_DEPTH)));
pipe DATA_TYPE16 pb __attribute__((xcl_reqd_pipe_depth(PIPE_DEPTH)));
pipe DATA_TYPE16 pc __attribute__((xcl_reqd_pipe_depth(PIPE_DEPTH)));

__kernel void __attribute__ ((reqd_work_group_size(WORK_GROUP_SIZE, 1, 1)))
read_data_kernel(__global DATA_TYPE16* vectorA, __global DATA_TYPE16* vectorB) {
    __attribute__((xcl_pipeline_loop))
    for (int i = 0; i < THREAD_IN_WORK_ITEM_SIZE; i++) {
        // Read one wide (16-element) word from each input vector and
        // stream it to the computation kernel through the pipes.
        DATA_TYPE16 a = vectorA[i];
        DATA_TYPE16 b = vectorB[i];

        write_pipe_block(pa, &a);
        write_pipe_block(pb, &b);
    }
}

__kernel void __attribute__ ((reqd_work_group_size(WORK_GROUP_SIZE, 1, 1)))
add_data_kernel(int inc) { // inc is unused in this version
    __attribute__((xcl_pipeline_loop))
    for (int i = 0; i < THREAD_IN_WORK_ITEM_SIZE; i++) {
        DATA_TYPE16 a;
        DATA_TYPE16 b;

        // Receive one wide word from each input pipe, add them,
        // and stream the result to the write kernel.
        read_pipe_block(pa, &a);
        read_pipe_block(pb, &b);

        DATA_TYPE16 c = a + b;
        write_pipe_block(pc, &c);
    }
}

__kernel void __attribute__ ((reqd_work_group_size(WORK_GROUP_SIZE, 1, 1)))
write_data_kernel(__global DATA_TYPE16* vectorC) {
    __attribute__((xcl_pipeline_loop))
    for (int i = 0; i < THREAD_IN_WORK_ITEM_SIZE; i++) {
        // Receive one wide result word from the pipe and write it
        // back to global memory.
        DATA_TYPE16 c;
        read_pipe_block(pc, &c);
        vectorC[i] = c;
    }
}

The source code for the vector addition can be found here.