Image Blending in High-Level Synthesis

Online Courses on HLS for FPGA

If you are interested in learning high-level synthesis for FPGA, please refer to my online course.

The image blending operator combines two images of the same size to generate a third image. This blog post explains how to describe this operator in HLS using the Xilinx Vitis unified software platform.

This operator reads pixels of two input images and generates the output pixels using the following equation, in which 0≤α≤1 is the blending ratio.

outImage[i][j]=α×inImage1[i][j]+(1-α)×inImage2[i][j]
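As a point of comparison for the hardware versions, the equation above can be sketched as a plain software reference model. This is only a minimal sketch, not part of the original design: the names blend_pixel and blend_reference are hypothetical, and the pixels are assumed to be unsigned 8-bit values (the HLS listings in this post use char, whose signedness is implementation-defined).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Blend one pair of 8-bit pixels: alpha*p1 + (1-alpha)*p2.
// Assumption: pixels are unsigned (0..255); the fractional part of the
// result is truncated when it is converted back to 8 bits.
inline std::uint8_t blend_pixel(std::uint8_t p1, std::uint8_t p2, float alpha) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    return static_cast<std::uint8_t>(alpha * p1 + (1.0f - alpha) * p2);
}

// Blend two images stored as flat row-major buffers of equal size.
std::vector<std::uint8_t> blend_reference(const std::vector<std::uint8_t> &in1,
                                          const std::vector<std::uint8_t> &in2,
                                          float alpha) {
    assert(in1.size() == in2.size());
    std::vector<std::uint8_t> out(in1.size());
    for (std::size_t i = 0; i < in1.size(); i++) {
        out[i] = blend_pixel(in1[i], in2[i], alpha);
    }
    return out;
}
```

A model like this can serve as the golden output in a C simulation testbench, with the hardware result compared against it pixel by pixel.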

Implementing this operator in HLS comprises three tasks: reading the images, performing the blending, and writing the resulting image into memory.

The following code shows this description. It consists of a top function called image_blending and three subfunctions, one for each of the three tasks. The tasks are connected through three FIFOs, which supports the task parallelism feature in HLS [GitHub].

#include <hls_stream.h>

void read_images(
		char *inImage1,
		char *inImage2,
		hls::stream<char> &inImage1_fifo,
		hls::stream<char> &inImage2_fifo,
		int n, int m)
{

	for (int i = 0; i < n*m; i++) {
		inImage1_fifo << inImage1[i];
		inImage2_fifo << inImage2[i];
	}
}


void blending(
		hls::stream<char> &inImage1_fifo,
		hls::stream<char> &inImage2_fifo,
		hls::stream<char> &outImage_fifo,
		float alpha,
		int n, int m)
{
	for (int i = 0; i < n*m; i++) {
		char inPixel1 = inImage1_fifo.read();
		char inPixel2 = inImage2_fifo.read();
		char outPixel = alpha*inPixel1+(1-alpha)*inPixel2;
		outImage_fifo << outPixel;
	}
}

void write_images(
		char               *outImage,
		hls::stream<char>  &outImage_fifo,
		int                 n,
		int                 m)
{
	for (int i = 0; i < n*m; i++) {
		outImage[i] = outImage_fifo.read();
	}
}

void image_blending(
		char *inImage1,
		char *inImage2,
		char *outImage,
		float alpha,
		int n,
		int m)
{
#pragma HLS INTERFACE mode=m_axi bundle=gmem0 port=inImage1
#pragma HLS INTERFACE mode=m_axi bundle=gmem1 port=inImage2
#pragma HLS INTERFACE mode=m_axi bundle=gmem0 port=outImage

#pragma HLS DATAFLOW
	// FIFOs connecting the three tasks
	hls::stream<char> inImage1_fifo;
	hls::stream<char> inImage2_fifo;
	hls::stream<char> outImage_fifo;
	read_images(inImage1, inImage2, inImage1_fifo, inImage2_fifo, n, m);
	blending(inImage1_fifo, inImage2_fifo, outImage_fifo, alpha, n, m);
	write_images(outImage, outImage_fifo, n, m);
}

The following figure shows the Vitis-HLS 2021.1 synthesis report. As can be seen, the for-loops in the subfunctions are pipelined with an initiation interval of 1. This initiation interval leads to optimum task parallelism in the resulting hardware. However, the code uses only 8 bits of a memory port to transfer data between the FPGA and DDR memory.

The Vitis-HLS 2021.1 synthesis report for the first code.

To address this issue, we can modify the code to utilise a wide-bus memory port, for example, a 128-bit-wide bus. The following code shows this new implementation [GitHub].
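To make the benefit concrete, the arithmetic behind the wide bus can be sketched as follows. A 128-bit word carries 16 8-bit pixels, so an image needs 16 times fewer bus words than pixels. The constant and function names here are hypothetical and only illustrate the calculation.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBusBits = 128;  // assumed memory port width
constexpr std::size_t kPixelBits = 8;  // one pixel per byte
constexpr std::size_t kPixelsPerWord = kBusBits / kPixelBits;  // 16

// Number of 128-bit words needed to hold an n x m image,
// rounded up so that a trailing partial word is still counted.
constexpr std::size_t words_per_image(std::size_t n, std::size_t m) {
    return (n * m + kPixelsPerWord - 1) / kPixelsPerWord;
}

static_assert(kPixelsPerWord == 16, "16 pixels fit in one 128-bit word");
```

Note that the loops in the listing below use n*m/CHUNK_SIZE, which is an integer division. If n*m is not a multiple of 16, the trailing pixels are not processed, so the image buffers would need to be padded to a multiple of 16 pixels.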

#include <hls_stream.h>

#define CHUNK_SIZE 16

struct char16 {
	char pixels[CHUNK_SIZE];
};
void read_images(
		char16 *inImage1,
		char16 *inImage2,
		hls::stream<char16> &inImage1_fifo,
		hls::stream<char16> &inImage2_fifo,
		int n, int m)
{

	for (int i = 0; i < n*m/CHUNK_SIZE; i++) {
		inImage1_fifo << inImage1[i];
		inImage2_fifo << inImage2[i];
	}
}
void rearange_input_images(
		hls::stream<char16> &inImage1_16_fifo,
		hls::stream<char16> &inImage2_16_fifo,
		hls::stream<char>   *inImage1_fifo,
		hls::stream<char>   *inImage2_fifo,
		int n, int m)
{

	for (int i = 0; i < n*m/CHUNK_SIZE; i++) {
		char16 p1 = inImage1_16_fifo.read();
		char16 p2 = inImage2_16_fifo.read();
		for (int j = 0; j < CHUNK_SIZE; j++) {
			inImage1_fifo[j] << p1.pixels[j];
			inImage2_fifo[j] << p2.pixels[j];
		}
	}
}

void blending(
		hls::stream<char> &inImage1_fifo,
		hls::stream<char> &inImage2_fifo,
		hls::stream<char> &outImage_fifo,
		float alpha,
		int n, int m)
{
	for (int i = 0; i < n*m/CHUNK_SIZE; i++) {
		char inPixel1 = inImage1_fifo.read();
		char inPixel2 = inImage2_fifo.read();
		char outPixel = alpha*inPixel1+(1-alpha)*inPixel2;
		outImage_fifo << outPixel;
	}
}

void rearange_output_images(
		hls::stream<char>    *outImage_fifo,
		hls::stream<char16>  &outImage_16_fifo,
		int n, int m)
{
	for (int i = 0; i < n*m/CHUNK_SIZE; i++) {
		char16 p;
		for (int j = 0; j < CHUNK_SIZE; j++) {
			p.pixels[j] = outImage_fifo[j].read();
		}
		outImage_16_fifo << p;
	}
}
void write_images(
		char16               *outImage,
		hls::stream<char16>  &outImage_fifo,
		int                 n,
		int                 m)
{
	for (int i = 0; i < n*m/CHUNK_SIZE; i++) {
		char16 p = outImage_fifo.read();
		outImage[i] = p;
	}
}

void image_blending(
		char16 *inImage1,
		char16 *inImage2,
		char16 *outImage,
		float alpha,
		int n,
		int m)
{
#pragma HLS INTERFACE mode=m_axi bundle=gmem0 port=inImage1 offset=slave
#pragma HLS INTERFACE mode=m_axi bundle=gmem1 port=inImage2 offset=slave
#pragma HLS INTERFACE mode=m_axi bundle=gmem0 port=outImage offset=slave

#pragma HLS DATAFLOW
	hls::stream<char16> inImage1_16_fifo;
	hls::stream<char16> inImage2_16_fifo;

	hls::stream<char>   inImage1_fifo[CHUNK_SIZE];
	hls::stream<char>   inImage2_fifo[CHUNK_SIZE];
	hls::stream<char>   outImage_fifo[CHUNK_SIZE];

	hls::stream<char16> outImage_16_fifo;
	read_images(inImage1, inImage2, inImage1_16_fifo, inImage2_16_fifo, n, m);
	rearange_input_images(inImage1_16_fifo, inImage2_16_fifo, inImage1_fifo, inImage2_fifo, n, m);
	for (int i = 0; i < CHUNK_SIZE; i++) {
#pragma HLS UNROLL
		blending(inImage1_fifo[i], inImage2_fifo[i], outImage_fifo[i], alpha, n, m);
	}
	rearange_output_images(outImage_fifo, outImage_16_fifo, n, m);
	write_images(outImage, outImage_16_fifo, n, m);
}

The following figure shows the Vitis-HLS 2021.2 report for this code. As can be seen, all loops are pipelined with II=1. In addition, dataflow task parallelism orchestrates the execution of all tasks.
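The data ordering of the wide-bus design can also be checked on a host machine without the HLS headers. The following is a minimal sketch under stated assumptions: std::queue stands in for hls::stream (it models ordering only, not timing or FIFO depth), pixels are treated as unsigned 8-bit values, and all names (Chunk16, Lanes, pack, split_to_lanes, blend_lanes, blend_wide) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

constexpr std::size_t kChunk = 16;  // mirrors CHUNK_SIZE in the listing

struct Chunk16 {
    unsigned char pixels[kChunk];
};

// One queue per lane, standing in for the array of 16 pixel FIFOs.
using Lanes = std::array<std::queue<unsigned char>, kChunk>;

// Pack a flat pixel buffer into 16-pixel chunks (the 128-bit words).
std::vector<Chunk16> pack(const std::vector<unsigned char> &flat) {
    assert(flat.size() % kChunk == 0);
    std::vector<Chunk16> chunks(flat.size() / kChunk);
    for (std::size_t i = 0; i < flat.size(); i++) {
        chunks[i / kChunk].pixels[i % kChunk] = flat[i];
    }
    return chunks;
}

// Model of the input rearrangement: pixel j of every chunk goes to lane j.
void split_to_lanes(const std::vector<Chunk16> &chunks, Lanes &lanes) {
    for (const Chunk16 &c : chunks) {
        for (std::size_t j = 0; j < kChunk; j++) {
            lanes[j].push(c.pixels[j]);
        }
    }
}

// Model of the 16 blending instances followed by the output rearrangement:
// one pixel is taken from each lane in turn and blended.
std::vector<unsigned char> blend_lanes(Lanes &l1, Lanes &l2, float alpha) {
    std::vector<unsigned char> out;
    while (!l1[0].empty()) {
        for (std::size_t j = 0; j < kChunk; j++) {
            unsigned char p1 = l1[j].front();
            l1[j].pop();
            unsigned char p2 = l2[j].front();
            l2[j].pop();
            out.push_back(
                static_cast<unsigned char>(alpha * p1 + (1.0f - alpha) * p2));
        }
    }
    return out;
}

// Whole pipeline: pack, split into lanes, blend, merge back in order.
std::vector<unsigned char> blend_wide(const std::vector<unsigned char> &a,
                                      const std::vector<unsigned char> &b,
                                      float alpha) {
    assert(a.size() == b.size());
    Lanes l1, l2;
    split_to_lanes(pack(a), l1);
    split_to_lanes(pack(b), l2);
    return blend_lanes(l1, l2, alpha);
}
```

Because lane j always receives pixel j of each chunk and the merge step reads the lanes in the same order, the output pixels come out in their original positions, which is the property the two rearrange functions rely on.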