Sparse Matrix-Vector Multiplication (SpMV) on Zynq FPGA

Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits

Reference for this blog:
M. Hosseinabady and J. L. Nunez-Yanez, “A Streaming Dataflow Engine for Sparse Matrix-Vector Multiplication Using High-Level Synthesis,” in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 6, pp. 1272-1285, June 2020, doi: 10.1109/TCAD.2019.2912923.

Spars matrices in which most of the elements are zeros arise in many computational applications including simulations, machine learning and so on. The equation of Equ. ‎(1) can be considered as a sparse matrix in whcih only 8 elments out of 32 are nonzero.

\begin{pmatrix}1&2&0&0&0&0&0&0\\ 0&0&1&2&0&0&0&0\\ 0&0&0&0&3&1&0&0\\0&0&0&0&0&0&2&1\\ \end{pmatrix}

A collection of sparse matrices can be found at here.

Fig. 1 highlights the nonzero elements of a sparse matrix with the size of 223×472 and $2768$ nonzero elements.

lp_e226 — Fig. 1 A sparse matrix occurs in a linear programming problem

Sparse Matrix Vector Multiplication (SpMV)

The SpMV is one of the basic operators in manipulating sparse matrices in real applications. SpMV operator is the performance bottleneck in many large scale iterative algorithms such as conjugate gradient (CG) solving linear systems, eigenvalue solvers and so on. Equs. ‎(2) and ‎(3) represent an SpMV in which A is an sparse matrix of size $n \times m$ , $x$ and $y$ are two vectors of size $m$ and $n$ , respectively.

In a naïve scheme, $m$ additions and multiplications are required to compute an element of $y$ . Therefore, $n \times m$ additions and multiplications are required for calculating the result.

In a sparse matrix, most of these additions and multiplications are zeros and should be removed from the computation to improve performance. However, removing these zeros is not an easy task and requires a proper sparse matrix representation and computation. In general case, irregularity in non-zero elements distribution in a sparse matrix makes the computation problematic.

Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits

Sparse Matrix Representation

Different approaches have been proposed by researchers for representing a sparse matrix. Associated to each representation there is a computational structure to increase the performance.

Coordinate format

One of the simple representation is the coordinate (COO) format in which each nonzero element of the sparse matrix along with its indices are stored in the memory. Therefore, in an implementation, three arrays called $rows$ , $cols$ and $values$ save the row indices, column indices and data of nonzero elements, respectively. Vectors shown in Equs.‎(4), ‎(5) and ‎(6) are the COO representation of the sparse matrix in Equ.‎(1).

$rows = \begin{pmatrix}0&0&1&1&2&2&3&3\\ \end{pmatrix}$	(4)
$cols = \begin{pmatrix}0&1&2&3&4&5&6&7\\ \end{pmatrix}$	(5)
$values = \begin{pmatrix}1&2&1&2&3&1&2&1\\ \end{pmatrix}$	(6)

Compressed sparse row format

The compressed sparse row (CSR), as the most popular representation, is similar to the COO but with less storage. In this format, $cols$ and $values$ vectors are the same as COO however the $rows$ vector is compressed and called $ptrs$ to save only $(n+1)$ elements. Each element in the $ptrs$ vector points to place of the first element of each row in the $values$ vector.

$ptrs = \begin{pmatrix}0&2&4&6&8\\ \end{pmatrix}$	(7)
$cols = \begin{pmatrix}0&1&2&3&4&5&6&7\\ \end{pmatrix}$	(8)
$values = \begin{pmatrix}1&2&1&2&3&1&2&1\\ \end{pmatrix}$	(9)

In this case, $ptrs[i+1]-ptrs[i]$ shows the number of nonzero elements in $i^{th}$ row in the original sparse matrix.

Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits

Modified compressed sparse row format

To implement the CSR in FPGA, we slightly modify the CSR format (called MCSR) and replace the $ptrs$ vector with $rowLengths$ vector of length $n$ which keeps the number of nonzero elements in each row. The $i^{th}$ element in the $rowLengths$ vector can be calculated as $ptrs[i+1]-ptrs[i]$ .

$rowLengths = \begin{pmatrix}2&2&2&2\\ \end{pmatrix}$	(10)
$cols = \begin{pmatrix}0&1&2&3&4&5&6&7\\ \end{pmatrix}$	(11)
$values = \begin{pmatrix}1&2&1&2&3&1&2&1\\ \end{pmatrix}$	(12)

Implementation

The following code can implement the SpMV on the Xilinx Zynq SoC using SDSoC

[code language="C"]
void spmv_accel(
  REAL_DATA_TYPE values[MODIFIED_DATA_LENGTH],
  INT_DATA_TYPE  cols[COLS],
  INT_DATA_TYPE  rows[ROWS],
  REAL_DATA_TYPE x[COLS],
  REAL_DATA_TYPE y[ROWS],

  u32 max_n,
  u32 row_size,
  u32 col_size,
  u32 data_size
) {
  REAL_DATA_TYPE x_local[MAX_COL_SIZE];
  u32 col_left=0;
  REAL_DATA_TYPE um;
  REAL_DATA_TYPE value;

  u32 col;
  int row_index = 0;

  for(u32 i = 0; i &amp;lt; col_size; i++) {
    #pragma HLS PIPELINE
    x_local[i] = x[i];
  }
  for(u32 r = 0; r &amp;lt; data_size; r++) {
  #pragma HLS PIPELINE
    if (col_left == 0) {
      col_left = rows[row_index];
      sum = 0;
    }
    value = values[r];
    col = cols[r];
    sum += value * x_local[col];
    col_left--;
    if(col_left == 0) {
      y[row_index++] = sum;
    }
  }
}
[/code]

The competer code can be found at here.