
Convolutional Neural-Network on Zynq –part 00: Convolution in Caffe

Recently, I have started to use an FPGA (e.g. Zynq) to run neural networks (NNs) defined in Caffe. My first step is to perform NN inference on the FPGA. To do this, and to be able to integrate the FPGA platform into Caffe, I have started to study the Caffe C++ code and structure. Through a series of blogs, I will try to explain my understanding using a few simple examples.

The goal of these blogs is not to use Caffe efficiently to implement an application but to get familiar with the Caffe code. Therefore, this blog is written for code developers, not for application developers. In addition, I assume that the reader is already familiar with basic Caffe concepts such as net, blob, layer and so on, which are described on its website.

In this first blog, I am going to introduce the convolutional neural network (CNN). Although there are many books and articles explaining CNNs, their concepts and applications, here I try to keep everything simple, with just enough detail to understand the Caffe structure and how to add an FPGA back-end to it.

Almost all articles explaining CNNs start from the neural-network (NN) concept. Here, however, I decided to start with convolution. This approach helps people who have no background knowledge of NNs to start experimenting early.

What is an image convolution?

First, what is a convolution? In general, convolution is a binary operator that combines two input functions and generates a new function highlighting a feature of one of the input functions. The function whose features are going to be highlighted is called the main function, and the second function is called the kernel.

In image processing, convolution is used to apply different filters to an image, such as blurring, sharpening, edge detection and so on.

The following figure shows how to apply a kernel of size  on an input image of size .

Fig.1 Image convolution
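To make the sliding-window idea concrete before moving to Caffe, here is a minimal plain C++ sketch. The function name convolve2d and its interface are my own illustration, not part of Caffe or OpenCV. It assumes a single-channel row-major image, stride 1 and no padding, and it applies the kernel without flipping it, which is also what Caffe's convolution layer does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a Caffe or OpenCV function).
// Slides a kh x kw kernel over a row-major h x w image with stride 1 and no
// padding, and returns the (h - kh + 1) x (w - kw + 1) result in row-major
// order. The kernel is applied as-is (no flipping).
std::vector<float> convolve2d(const std::vector<float>& image, int h, int w,
                              const std::vector<float>& kernel, int kh, int kw) {
  assert(static_cast<int>(image.size()) == h * w);
  assert(static_cast<int>(kernel.size()) == kh * kw);
  assert(h >= kh && w >= kw);

  const int out_h = h - kh + 1;
  const int out_w = w - kw + 1;
  std::vector<float> out(out_h * out_w, 0.0f);

  for (int r = 0; r < out_h; ++r) {
    for (int c = 0; c < out_w; ++c) {
      float acc = 0.0f;
      // Multiply the kernel with the image window whose top-left corner is (r, c).
      for (int i = 0; i < kh; ++i) {
        for (int j = 0; j < kw; ++j) {
          acc += image[(r + i) * w + (c + j)] * kernel[i * kw + j];
        }
      }
      out[r * out_w + c] = acc;
    }
  }
  return out;
}
```

For example, applying the horizontal Sobel kernel {-1, 0, 1, -2, 0, 2, -1, 0, 1} to a 3 x 4 image whose left half is 0 and right half is 1 yields a 1 x 2 output, with a strong response at the vertical edge.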

How to write a simple convolution in Caffe?

Step 00: Include required header files

#include <caffe/caffe.hpp>
#include <opencv2/highgui/highgui.hpp>

Step 01: Select CPU or GPU

#ifdef CPU_ONLY
  Caffe::set_mode(Caffe::CPU);
#else
  Caffe::set_mode(Caffe::GPU);
#endif

Step 02: Define a network

shared_ptr<Net<float> > net_;

Step 03: Load the network from a file

net_.reset(new Net<float>(model_file, TEST));

Step 04: assign weights for the Sobel filter

shared_ptr<Layer<float> > conv_layer = net_->layer_by_name("conv");
float* weights = conv_layer->blobs()[0]->mutable_cpu_data();

weights[0] = -1;     weights[1] = 0;      weights[2] = 1;
weights[3] = -2;     weights[4] = 0;      weights[5] = 2;
weights[6] = -1;     weights[7] = 0;      weights[8] = 1;

Step 05: read the input image

string image_file = argv[2];
cv::Mat img = cv::imread(image_file, -1);

Step 06: reshape the input blob to the size of the input image

shared_ptr<Blob<float> > input_blob = net_->blob_by_name("data");
num_channels_ = input_blob->channels();
input_blob->Reshape(1, num_channels_, img.rows, img.cols);

Step 07: reshape the whole network correspondingly

net_->Reshape();

Step 08: copy the input image to the network input blob

int width = input_blob->width();
int height = input_blob->height();

float* input_data = input_blob->mutable_cpu_data();
cv::Mat channel(height, width, CV_32FC1, input_data);
img.convertTo(channel, CV_32FC1);
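Step 08 relies on the fact that a Caffe blob stores its data contiguously in row-major (N, C, H, W) order, so a single-channel plane of the blob has the same memory layout as a continuous CV_32FC1 cv::Mat of the same height and width. The following stand-alone sketch shows that index calculation. The helper name nchw_offset is hypothetical and only mirrors the ordering. It does not call Caffe.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: linear index of element (n, c, h, w) in a buffer laid
// out in row-major (N, C, H, W) order, the ordering used by Caffe blobs.
std::size_t nchw_offset(int n, int c, int h, int w,
                        int channels, int height, int width) {
  assert(n >= 0 && c >= 0 && h >= 0 && w >= 0);
  assert(c < channels && h < height && w < width);
  return ((static_cast<std::size_t>(n) * channels + c) * height + h) * width + w;
}
```

With one image and one channel, the index reduces to h * width + w, which is exactly how a continuous single-channel matrix is addressed.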

Step 09: run the NN inference

net_->Forward();

Step 10: get the output and save in a file

Blob<float>* output_layer = net_->output_blobs()[0];
int num_out_channels_ = output_layer->channels();

width = output_layer->width();
height = output_layer->height();
float* output_data = output_layer->mutable_cpu_data();
cv::Mat outputImage(height, width, CV_32FC1, output_data);
imwrite("outout_Image.jpg", outputImage);
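The width and height read from the output blob generally differ from those of the input image, because they depend on the kernel size, padding and stride set in the model file. The model file is not shown here, so the values in the usage note are assumptions. The sketch below uses the usual output-size formula for a convolution layer without dilation, and conv_output_dim is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical helper: output size along one dimension of a convolution
// with the given kernel size, zero padding and stride (no dilation).
int conv_output_dim(int input_dim, int kernel, int pad, int stride) {
  assert(stride > 0);
  assert(input_dim + 2 * pad >= kernel);
  return (input_dim + 2 * pad - kernel) / stride + 1;
}
```

For instance, a 3 x 3 kernel with no padding and stride 1 shrinks a 480-pixel dimension to 478, while a padding of 1 keeps it at 480.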

If the input image is

Then the output would be

The complete code can be found here.

Sparse Matrix-Vector Multiplication (SpMV) on Zynq FPGA

Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits

Reference for this blog:
M. Hosseinabady and J. L. Nunez-Yanez, “A Streaming Dataflow Engine for Sparse Matrix-Vector Multiplication Using High-Level Synthesis,” in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 6, pp. 1272-1285, June 2020, doi: 10.1109/TCAD.2019.2912923.

Sparse matrices, in which most of the elements are zero, arise in many computational applications, including simulations, machine learning and so on. The matrix in Equ. (1) can be considered a sparse matrix in which only 8 elements out of 32 are nonzero.

A = \begin{pmatrix}1&2&0&0&0&0&0&0\\ 0&0&1&2&0&0&0&0\\ 0&0&0&0&3&1&0&0\\ 0&0&0&0&0&0&2&1\\ \end{pmatrix} (1)

A collection of sparse matrices can be found here.

Fig. 1 highlights the nonzero elements of a sparse matrix of size 223×472 with 2768 nonzero elements.

Fig. 1 A sparse matrix arising in a linear programming problem

Sparse Matrix Vector Multiplication (SpMV)

SpMV is one of the basic operators for manipulating sparse matrices in real applications. It is the performance bottleneck in many large-scale iterative algorithms, such as the conjugate gradient (CG) method for solving linear systems, eigenvalue solvers and so on. Equs. (2) and (3) represent an SpMV in which A is a sparse matrix of size n \times m, and x and y are two vectors of size m and n, respectively.

y=Ax (2)
y(i)=\sum_{j=0}^{m-1}{A(i,j)x(j)} (3)

In a naïve scheme, m multiplications and additions are required to compute one element of y. Therefore, n \times m multiplications and additions are required to calculate the result.

In a sparse matrix, most of these multiplications and additions involve a zero matrix element and should be removed from the computation to improve performance. However, removing them is not an easy task and requires a proper sparse matrix representation and a matching computation scheme. In the general case, the irregular distribution of nonzero elements in a sparse matrix makes the computation problematic.
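To see how much work the naïve scheme wastes, the following sketch computes y = Ax on a dense row-major copy of the matrix and counts the multiply-adds it performs. The struct and function names are my own and serve only as an illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative types and names, not taken from the original code.
struct DenseSpmvResult {
  std::vector<float> y;      // result vector of size n
  std::size_t total_macs;    // multiply-adds executed (n * m)
  std::size_t nonzero_macs;  // multiply-adds whose matrix element is nonzero
};

// Naive y = A * x on a dense row-major n x m matrix.
DenseSpmvResult dense_spmv(const std::vector<float>& A, std::size_t n,
                           std::size_t m, const std::vector<float>& x) {
  assert(A.size() == n * m);
  assert(x.size() == m);
  DenseSpmvResult res{std::vector<float>(n, 0.0f), 0, 0};
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < m; ++j) {
      const float a = A[i * m + j];
      res.y[i] += a * x[j];
      ++res.total_macs;
      if (a != 0.0f) ++res.nonzero_macs;
    }
  }
  return res;
}
```

For the 4 x 8 matrix in Equ. (1), this performs 32 multiply-adds, of which only 8 contribute to the result.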




Sparse Matrix Representation

Researchers have proposed different approaches for representing a sparse matrix. Each representation comes with an associated computational structure that aims to increase performance.

Coordinate format

One of the simplest representations is the coordinate (COO) format, in which each nonzero element of the sparse matrix is stored in memory along with its indices. In an implementation, three arrays called rows, cols and values hold the row indices, column indices and data of the nonzero elements, respectively. The vectors shown in Equs. (4), (5) and (6) are the COO representation of the sparse matrix in Equ. (1).

  rows = \begin{pmatrix}0&0&1&1&2&2&3&3\\ \end{pmatrix} (4)
 cols = \begin{pmatrix}0&1&2&3&4&5&6&7\\ \end{pmatrix} (5)
 values = \begin{pmatrix}1&2&1&2&3&1&2&1\\ \end{pmatrix} (6)
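A minimal software sketch of SpMV on the COO format is given below. The function name spmv_coo and the use of std::vector are my own choices for illustration. Each nonzero contributes one multiply-add to the output element selected by its row index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative COO SpMV: y = A * x, where A has n rows and is described by
// the rows, cols and values arrays of its nonzero elements.
std::vector<float> spmv_coo(const std::vector<unsigned>& rows,
                            const std::vector<unsigned>& cols,
                            const std::vector<float>& values,
                            const std::vector<float>& x, std::size_t n) {
  assert(rows.size() == values.size() && cols.size() == values.size());
  std::vector<float> y(n, 0.0f);
  for (std::size_t k = 0; k < values.size(); ++k) {
    assert(rows[k] < n && cols[k] < x.size());
    y[rows[k]] += values[k] * x[cols[k]];
  }
  return y;
}
```

With the arrays of Equs. (4) to (6) and x = (1, 2, ..., 8), only 8 multiply-adds are executed instead of 32.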

Compressed sparse row format

The compressed sparse row (CSR) format, the most popular representation, is similar to COO but requires less storage. In this format, the cols and values vectors are the same as in COO, but the rows vector is compressed into a vector called ptrs that holds only (n+1) elements. The i^{th} element of the ptrs vector points to the position of the first element of the i^{th} row in the values vector, and the last element holds the total number of nonzero elements.

  ptrs = \begin{pmatrix}0&2&4&6&8\\ \end{pmatrix} (7)
 cols = \begin{pmatrix}0&1&2&3&4&5&6&7\\ \end{pmatrix} (8)
 values = \begin{pmatrix}1&2&1&2&3&1&2&1\\ \end{pmatrix} (9)

In this case, ptrs[i+1]-ptrs[i] gives the number of nonzero elements in the i^{th} row of the original sparse matrix.
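The way ptrs delimits each row can be sketched in software as follows. The function name spmv_csr and the container types are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSR SpMV: row i owns the nonzeros at positions
// ptrs[i] .. ptrs[i+1]-1 of the cols and values arrays.
std::vector<float> spmv_csr(const std::vector<unsigned>& ptrs,
                            const std::vector<unsigned>& cols,
                            const std::vector<float>& values,
                            const std::vector<float>& x) {
  assert(!ptrs.empty());
  assert(cols.size() == values.size());
  assert(ptrs.back() == values.size());
  const std::size_t n = ptrs.size() - 1;  // number of rows
  std::vector<float> y(n, 0.0f);
  for (std::size_t i = 0; i < n; ++i) {
    float sum = 0.0f;
    for (unsigned k = ptrs[i]; k < ptrs[i + 1]; ++k) {
      assert(cols[k] < x.size());
      sum += values[k] * x[cols[k]];
    }
    y[i] = sum;
  }
  return y;
}
```

The inner loop length changes from row to row, which is the irregularity that makes an efficient hardware implementation harder.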




Modified compressed sparse row format

To implement CSR on an FPGA, we slightly modify the CSR format (the result is called MCSR) and replace the ptrs vector with a rowLengths vector of length n, which holds the number of nonzero elements in each row. The i^{th} element of the rowLengths vector can be calculated as ptrs[i+1]-ptrs[i].

 rowLengths = \begin{pmatrix}2&2&2&2\\ \end{pmatrix} (10)
 cols = \begin{pmatrix}0&1&2&3&4&5&6&7\\ \end{pmatrix} (11)
 values = \begin{pmatrix}1&2&1&2&3&1&2&1\\ \end{pmatrix} (12)
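The conversion between the CSR ptrs vector and the MCSR rowLengths vector is a simple difference in one direction and a prefix sum in the other. The helper names below are hypothetical and only illustrate the relation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: rowLengths[i] = ptrs[i+1] - ptrs[i].
std::vector<unsigned> ptrs_to_row_lengths(const std::vector<unsigned>& ptrs) {
  assert(!ptrs.empty());
  std::vector<unsigned> lengths(ptrs.size() - 1);
  for (std::size_t i = 0; i + 1 < ptrs.size(); ++i) {
    assert(ptrs[i + 1] >= ptrs[i]);
    lengths[i] = ptrs[i + 1] - ptrs[i];
  }
  return lengths;
}

// Hypothetical helper: ptrs is the prefix sum of rowLengths, starting at 0.
std::vector<unsigned> row_lengths_to_ptrs(const std::vector<unsigned>& lengths) {
  std::vector<unsigned> ptrs(lengths.size() + 1, 0);
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    ptrs[i + 1] = ptrs[i] + lengths[i];
  }
  return ptrs;
}
```

For the matrix in Equ. (1), ptrs = (0, 2, 4, 6, 8) maps to rowLengths = (2, 2, 2, 2) and back.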



The following code implements the SpMV on the Xilinx Zynq SoC using SDSoC. Here the rows array holds the rowLengths vector of the MCSR format.


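void spmv_accel(
  // The array arguments were missing from this listing. Their names are the
  // ones used in the body; DATA_TYPE and the *_MAX bounds are assumed
  // definitions (e.g. typedef float DATA_TYPE).
  DATA_TYPE values[DATA_SIZE_MAX],  // nonzero values
  u32       cols[DATA_SIZE_MAX],    // column index of each nonzero
  u32       rows[ROW_SIZE_MAX],     // rowLengths: nonzeros per row (MCSR)
  DATA_TYPE x[COL_SIZE_MAX],        // input vector
  DATA_TYPE y[ROW_SIZE_MAX],        // output vector
  u32 max_n,
  u32 row_size,
  u32 col_size,
  u32 data_size
) {
  DATA_TYPE x_local[COL_SIZE_MAX];  // on-chip copy of x (assumed declaration)
  u32 col_left = 0;                 // nonzeros still to process in the current row
  DATA_TYPE sum = 0;
  DATA_TYPE value;
  u32 col;
  int row_index = 0;

  // Copy x into local memory so that it can be read once per nonzero.
  for (u32 i = 0; i < col_size; i++) {
    #pragma HLS PIPELINE
    x_local[i] = x[i];
  }

  // Stream through the nonzeros; assumes every row has at least one nonzero.
  for (u32 r = 0; r < data_size; r++) {
    #pragma HLS PIPELINE
    if (col_left == 0) {
      // Start of a new row: load its length and reset the accumulator.
      col_left = rows[row_index];
      sum = 0;
    }
    value = values[r];
    col = cols[r];
    sum += value * x_local[col];
    col_left--;
    if (col_left == 0) {
      // End of the row: write the result and move to the next row.
      y[row_index++] = sum;
    }
  }
}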

The complete code can be found here.



Synthesis Tools

Goal: General view of synthesis tools
Credit: This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

Fig. 1 shows the hierarchy of an advanced FPGA synthesis tool set. It consists of three main parts: Logic Synthesis (LS), High-Level synthesis (HLS) and OpenCL.

Fig. 1  FPGA synthesis tool set

Logic Synthesis

At the bottom of the tool set shown in Fig.1 is logic synthesis which traditionally gets a design description in an HDL language and transforms it into a network of gates. Different technology-independent optimisation techniques are used to optimise this network of gates. Technology mapping techniques implement the network of gates by utilising primitive elements in the FPGA such as look-up tables (LUTs), registers and memories. Then, they are placed and routed for a given FPGA and finally a bitstream is generated for the FPGA configuration. At this step, the final design resource utilisation and clock frequency are determined.

High-Level Synthesis

On top of logic synthesis, HLS tools receive the design description in a high-level language such as C/C++ and generate the corresponding HDL code, which is later synthesised by LS to generate the bitstream. HLS tools have to add timing and parallelism to the untimed, sequential C/C++ code. Therefore, they rely heavily on variable analysis (e.g., load/store analysis) to extract dependencies among variables in order to exploit the concurrency among instructions. This concurrency is then realised by the fully parallel or pipelined execution of instructions. Loop structures are the main part of the code to which HLS tools apply dependence analysis.

In the fully parallel execution of a loop, all iterations are executed in parallel. This technique, also known as loop unrolling, potentially provides the minimum latency for the given loop. However, dependencies among loop iterations and the lack of sufficient parallel ports to access data located in memories or registers restrict fully parallel execution. In addition, this technique is not scalable, as it requires a large amount of resources to implement the iterations independently.

To overcome these limitations, HLS tools use pipelining, a powerful technique that is applicable to loop iterations and functions. In this technique, loop iterations are executed with overlapping cycles (or stages). The distance between two consecutive iterations (in clock cycles) in a pipelined execution is called the initiation interval (II).

The following figure shows a simple loop and its corresponding pipelined timing diagram. If we assume the array a is located in a memory with only two ports to read from, then in each cycle, at most two elements can be read which means two iterations cannot be executed in parallel. This resource constraint dictates at least one cycle distance between two consecutive iterations. Therefore, the initiation interval of the pipelined loop in this figure is 1. Note that data dependencies among variables in different iterations can induce a specific distance between iterations.

Fig. 2 Pipelined loop
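The benefit of pipelining can be estimated with a simple cycle-count model. The sketch below assumes that every iteration takes the same number of cycles (depth) and that no stalls occur, so it is only a first-order approximation. The function names are my own.

```cpp
#include <cassert>

// Simplified model: n iterations, each taking `depth` cycles.

// Iterations run one after another without any overlap.
long sequential_cycles(long n, long depth) {
  assert(n >= 0 && depth >= 1);
  return n * depth;
}

// A new iteration starts every `ii` cycles; the last one still needs
// `depth` cycles to drain through the pipeline.
long pipelined_cycles(long n, long depth, long ii) {
  assert(n >= 0 && depth >= 1 && ii >= 1);
  return n == 0 ? 0 : depth + (n - 1) * ii;
}
```

For 100 iterations with a depth of 3 cycles, the sequential loop takes 300 cycles under this model, whereas the pipelined loop takes 102 cycles with II = 1 and 201 cycles with II = 2.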

To help HLS tools with dependence analysis and parallelism extraction, the tools provide compiler directives that designers can use to optimise their code. Designers should be familiar with these directives and use them efficiently to produce a high-performance design.
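As a small illustration of such directives, the sketch below uses two Vivado HLS pragmas, PIPELINE and UNROLL, on two toy loops. The functions, the MAX_N bound and the chosen factor are assumptions made for this example, and the II or unroll factor actually achieved depends on the tool and on the memory ports available. A regular C++ compiler ignores the pragmas (possibly with a warning), so the code can also be checked in software.

```cpp
#include <cassert>

const int MAX_N = 1024;  // assumed upper bound on the array length

// Pipelined reduction: the tool is asked to start one iteration per cycle.
int sum_array(const int a[MAX_N], int n) {
  assert(n >= 0 && n <= MAX_N);
  int sum = 0;
  for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
    sum += a[i];
  }
  return sum;
}

// Partially unrolled loop: the tool is asked to replicate the body 4 times,
// which only pays off if the arrays provide enough parallel ports.
void scale_array(const int in[MAX_N], int out[MAX_N], int n, int factor) {
  assert(n >= 0 && n <= MAX_N);
  for (int i = 0; i < n; ++i) {
#pragma HLS UNROLL factor=4
    out[i] = in[i] * factor;
  }
}
```

The iterations of scale_array are independent, so unrolling is limited only by memory ports, while sum_array carries a dependency on sum from one iteration to the next, which is the kind of dependency that can force a larger II.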


OpenCL

Software engineers have proposed a formal description of the parallelism in an algorithm by introducing the OpenCL framework. Hardware designers have realised the potential of this framework to increase the effectiveness of their HLS tools. Therefore, as shown in Fig. 1, they have proposed another layer of synthesis techniques on top of the HLS tools to integrate the formal parallelism described in the OpenCL code into their tool chain.

FPGAs allow designers to build their own hardware architectures for kernels in order to increase performance and reduce power consumption. In contrast to GPUs or multi-core CPUs, where programmers have to write their code to make the most of a predefined hardware architecture, in the FPGA environment designers can choose and define an architecture suitable for each individual OpenCL kernel. This gives the FPGA OpenCL framework a good opportunity to beat GPU performance in applications that fit the features available in the FPGA. This paper reveals the effectiveness of this architecture design during code development for the histogram operation.
