DDR Memory Transactions in High-Level Synthesis

Communicating with DDR memory on an FPGA can be a performance bottleneck for hardware accelerators. Implementing a data transaction protocol that satisfies the accelerator's requirements can be a real challenge in hardware description languages such as VHDL. High-level synthesis (HLS), however, lets designers choose and configure the required protocol simply by following the proper C/C++ coding style.

For example, the Xilinx Vitis-HLS toolset uses the concepts of ports and coding style to describe DDR memory transactions. Designers with little or no hardware background can therefore implement highly efficient data movement logic between their FPGA design and the system memory.

Let us have a quick look at these techniques in HLS.

Ports

Ports are the hardware entities that connect an FPGA design to off-chip resources such as DDR memories. Ports can have different bus widths. For example, a 32-bit port can read one single-precision floating-point value per transaction, whereas a 128-bit port can read four such values in one transaction.
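The arithmetic behind these numbers can be written down as a minimal sketch. The function and parameter names below are illustrative only and are not part of any HLS library.

```cpp
#include <cstddef>

// Number of whole elements delivered by one bus transaction.
// bus_bits and elem_bits are illustrative parameter names.
constexpr std::size_t elements_per_transaction(std::size_t bus_bits,
                                               std::size_t elem_bits) {
  return bus_bits / elem_bits;
}

// Number of transactions needed to read n elements, rounded up.
constexpr std::size_t transactions_needed(std::size_t n,
                                          std::size_t bus_bits,
                                          std::size_t elem_bits) {
  return (n + elements_per_transaction(bus_bits, elem_bits) - 1) /
         elements_per_transaction(bus_bits, elem_bits);
}

// A 32-bit float fits once in a 32-bit port and four times in a 128-bit port.
static_assert(elements_per_transaction(32, 32) == 1, "one float per transaction");
static_assert(elements_per_transaction(128, 32) == 4, "four floats per transaction");
```

Under these assumptions, reading 1024 floats takes 1024 transactions on the 32-bit port but only 256 on the 128-bit port.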

FPGAs usually support multiple ports with wide buses, which together can provide high memory bandwidth for accelerators.

To utilise all the available memory bandwidth, the code should be written so that it

1- Enables the burst data transaction protocol

2- Uses a wide bus width

3- Utilises multiple ports

Burst Data Transaction Protocol

Code can support the burst protocol if it accesses the elements of an array located in DDR memory in monotonically increasing order. This figure shows a for-loop that accesses the array A in increasing order, which can be implemented with the burst data transfer protocol.
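As a minimal sketch of such a loop (the function name sum_sequential and the accumulation are assumptions made for illustration, not the original figure), the index grows by one in every iteration, so the addresses sent to memory increase monotonically. A standard C++ compiler ignores the HLS pragma, so the function can also be checked in software.

```cpp
// Hypothetical kernel: sums n floats stored contiguously in DDR memory.
// The index i increases by one per iteration, so the memory addresses
// are monotonically increasing, which is the burst-friendly pattern.
float sum_sequential(const float *A, int n) {
  float acc = 0.0f;
  for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
    acc += A[i];  // consecutive addresses: A[0], A[1], A[2], ...
  }
  return acc;
}
```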

However, the following for-loop does not provide the proper memory access pattern.
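One assumed example of such a pattern (not necessarily the one in the original figure) is indirect indexing, where the address of each read depends on run-time data. The names sum_indirect and idx are hypothetical. Because the tool cannot prove at compile time that the addresses increase, it would typically issue individual transactions rather than a burst.

```cpp
// Hypothetical kernel: sums n floats selected through an index array.
// The address A[idx[i]] is only known at run time and may jump backwards,
// so the access order is not monotonically increasing.
float sum_indirect(const float *A, const int *idx, int n) {
  float acc = 0.0f;
  for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
    acc += A[idx[i]];  // data-dependent address
  }
  return acc;
}
```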

Wide-Bus Width

There are different techniques to enable this feature in C/C++ code. One way is to use a struct. For example, the following code reads n floating-point data elements in n iterations, one element per iteration.

void data_read(float *A, int n, ...) {
  for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
    float a = A[i];
    ...
  }
}

However, the following code uses a 128-bit bus to read four floating-point data elements in one transaction.

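// Four consecutive floats packed into one 128-bit element
struct four_float {
  float a;
  float b;
  float c;
  float d;
};

void data_read(four_float *A, int n, ...) {
  for (int i = 0; i < n/4; i++) {
#pragma HLS PIPELINE
    four_float v = A[i];  // one 128-bit read per iteration
    float a1 = v.a;
    float a2 = v.b;
    float a3 = v.c;
    float a4 = v.d;
    ...
  }
}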
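To make the struct-based approach concrete, here is a self-contained sketch that can be compiled and checked in software. The function name sum_wide and the summation are assumptions chosen for illustration, and the sketch assumes that n is a multiple of 4.

```cpp
#include <cassert>

// Four consecutive floats packed into one 128-bit (16-byte) element.
struct four_float {
  float a;
  float b;
  float c;
  float d;
};

static_assert(sizeof(four_float) == 16, "four_float should span 128 bits");

// Hypothetical kernel: sums n floats by reading them four at a time.
// Assumption: n is a multiple of 4.
float sum_wide(const four_float *A, int n) {
  assert(n % 4 == 0);
  float acc = 0.0f;
  for (int i = 0; i < n / 4; i++) {
#pragma HLS PIPELINE
    four_float v = A[i];  // one 128-bit element per iteration
    acc += (v.a + v.b) + (v.c + v.d);
  }
  return acc;
}
```

The loop now runs n/4 times instead of n times, and each iteration consumes a full 128-bit word.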

Another way is to partially unroll the for-loop as follows.

void data_read(float *A, int n, ...) {
  for (int i = 0; i < n; i+=4) {
#pragma HLS PIPELINE
    for (int j = 0; j < 4; j++) {
#pragma HLS UNROLL
      float a = A[i+j];
      ...
    } 
  }
}

Multiple Ports

We can modify the data_read function or use a few copies of this function to utilise multiple memory ports. However, as multiple ports read from memory separately, they should access different parts of the input data. The input data can be divided into multiple chunks to address this constraint.
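A minimal sketch of this idea with two ports is shown below. In Vitis HLS, pointer arguments are commonly mapped to separate AXI master ports by giving each INTERFACE pragma a distinct bundle name. The function name sum_two_ports, the bundle names gmem0 and gmem1, and the split into two equal halves are assumptions made for illustration.

```cpp
// Hypothetical top-level function with two memory ports.
// A0 and A1 point to two disjoint chunks of the input data,
// each holding n_half elements.
float sum_two_ports(const float *A0, const float *A1, int n_half) {
#pragma HLS INTERFACE m_axi port=A0 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=A1 offset=slave bundle=gmem1
  float acc0 = 0.0f;
  float acc1 = 0.0f;
  for (int i = 0; i < n_half; i++) {
#pragma HLS PIPELINE
    acc0 += A0[i];  // first chunk, read through the first port
    acc1 += A1[i];  // second chunk, read through the second port
  }
  return acc0 + acc1;
}
```

On the host side, an array data of n elements could be passed as sum_two_ports(data, data + n/2, n/2), so that each port reads its own half of the input.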