This week’s problem is the traditional matrix-vector multiplication kernel used in several applications such as machine learning and image processing.

This figure shows the software-oriented code that describes the kernel.

The code consists of a two-level loop nest that reads the matrix A and vector  and generates the output elements located in

If we assume n is 4096 and m is 2048, then this figure shows the synthesis report after synthesising the code with Vitis 2020.2.

As can be seen, the first loop is not pipelined, but the inner loop is pipelined with the initiation interval of 1.

After executing the code on Ultra96v2 using the Vitis-2020.2 software tool, the execution time would be 683.491 ms.

Now the question is, “How can we improve the performance?”

Please follow the solution in the video.