HLS Tricks
06/13/2023: Array of AXI Stream Interfaces on Kernels/Functions
In version 2021.2, defining an array of AXI stream interfaces on a kernel (or function) is not supported; version 2022.2 adds this feature. See the following code (with N = 3, for example):
void splitter(hls::stream<data_t>& stream_in, hls::stream<data_t> stream_out[N]){
#pragma HLS INTERFACE ap_ctrl_none port=return
for (ap_uint<4> i = 0; i < N; i++){
#pragma HLS PIPELINE
data_t temp;
stream_in >> temp;
stream_out[i] << temp;
}
}
This function is a demultiplexer that distributes data round-robin to 3 output AXI stream interfaces. The synthesis output is shown below:
As you can see, the II is one and three AXI stream interfaces are inferred on the module, named stream_out_<index>. This feature allows a more parameterized coding style in HLS, which can make switching between debugging and implementation easier.
06/20/2023: Data-driven DATAFLOW in Vitis 2022.2+
In Vitis 2022.2, a new library, hls::task, is introduced. It allows linking several free-running functions inside a top function/kernel, just like using a linker file to link several free-running kernels in earlier versions. This library makes hardware-friendly design much easier to achieve. Here, we use a simple Quad Vector Add (QVadd) as an example, which demonstrates all the new features well.
The goal is to add four arrays together element-wise. Say we have four integer arrays A[N], B[N], C[N], and D[N] to be summed into R[N], where N is a parameter. In addition, we may only use one adder to achieve the goal. Therefore, the plan is: first, add A[i] and B[i] to get E[i]; second, add C[i] and D[i] to get F[i]; and finally, add E[i] and F[i] to get the final R[i].
Obviously, as only one adder is available and 3 additions are required per result, the minimum average II of the whole design is 3. However, it is not that straightforward once you consider that the adder has one clock cycle of latency (you can replace the add with a multiply or any other function to get an even longer latency). The scheduling is shown in the following chart.
The latency prevents clock cycle 2 from doing any computation, as F[0] is not yet available, so the II increases to 4. You could solve this problem by moving A[1] + B[1] from clock cycle 4 to clock cycle 2, but then you would need a very complex scheduler to order the computations. In addition, what if you don't know the latency of your computation, which is the normal case in HLS? Hence, we need a passive scheduler that is entirely data-driven.
To achieve that, let's define two types of results: temporary results (E[i] and F[i]) and final results (R[i]). If both a pair of A~D[i] operands and an E[i]/F[i] pair are available, we prefer to compute E[i] + F[i] first, as it produces the final result R[i], which is sent back to the host and never used again. In other words, E[i] + F[i] has higher priority. In contrast, A[i] + B[i] and C[i] + D[i] are equivalent to each other, and when no E[i]/F[i] pair is available, we compute them alternately. Therefore, we simply make two task queues: the first contains the alternately issued low-priority tasks; the second contains the high-priority tasks. In the beginning, since no E[i] or F[i] has been produced yet, the scheduler always lets the adder do the low-priority adds, and E[i] and F[i] are created in pairs. Once an E[i]/F[i] pair is available, it goes into the high-priority queue; the scheduler then sends the adder the high-priority task and pauses the low-priority tasks automatically. In this scheme, you don't have to worry about the latency or how to order the adds: everything is scheduled automatically, the adder is always busy, and the equivalent II ends up at 3.
The key point of this project is a loop-back stream connection in DATAFLOW, where the output of the adder may be sent back to the adder again. This construct is not allowed in Vitis 2021.2. The code is shown below.
// typedef.hpp
#ifndef __TYPEDEF_HPP__
#define __TYPEDEF_HPP__
#include "hls_task.h"
#include "ap_fixed.h"
#include "hls_stream.h"
typedef struct{
int A;
int B;
bool destination;
} adder_in_pack;
typedef struct{
int R;
bool destination;
} adder_out_pack;
#endif
// QVadd.cpp
#include "../common/typedef.hpp"
void adder(hls::stream<adder_in_pack>& in, hls::stream<adder_out_pack>& out){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE axis port=in register_mode=off
#pragma HLS PIPELINE II=1 style=flp
adder_in_pack i_temp;
in >> i_temp;
int sum;
sum = i_temp.A + i_temp.B;
adder_out_pack o_temp;
o_temp.R = sum;
o_temp.destination = i_temp.destination;
out << o_temp;
}
void data_router(hls::stream<adder_out_pack>& in, hls::stream<int>& temp_result, hls::stream<int>& final_result){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE axis port=in register_mode=off
#pragma HLS PIPELINE II=1 style=flp
adder_out_pack i_temp;
in >> i_temp;
if (i_temp.destination == 0){ // temp result
int o_temp = i_temp.R;
temp_result << o_temp;
}
else{ // final result
int o_temp = i_temp.R;
final_result << o_temp;
}
}
// the template ID is not used; it only ensures that different function calls become independent hardware modules
template<int ID>
void stream_merge(hls::stream<int>& in1, hls::stream<int>& in2, hls::stream<adder_in_pack>& out){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS PIPELINE II=1 style=flp
int temp1, temp2;
in1 >> temp1;
in2 >> temp2;
adder_in_pack o_temp;
o_temp.A = temp1;
o_temp.B = temp2;
o_temp.destination = 0; // temp result
out << o_temp;
}
void stream_s2p(hls::stream<int>& in, hls::stream<adder_in_pack>& out){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS PIPELINE II=1 style=flp
static int last_val = 0;
static bool T_reg = 0;
int i_temp;
in >> i_temp;
if (T_reg == 1){
adder_in_pack o_temp;
o_temp.A = i_temp;
o_temp.B = last_val;
o_temp.destination = 1; // final_result
out << o_temp;
}
T_reg = !T_reg;
last_val = i_temp;
}
void Scheduler(hls::stream<adder_in_pack>& to_adder, hls::stream<adder_in_pack>& ab_stream, hls::stream<adder_in_pack>& cd_stream, hls::stream<adder_in_pack>& ef_stream){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS PIPELINE II=1 style=flp
static bool lr = 0; // left-right ping-pong
bool ab_valid = !ab_stream.empty();
bool cd_valid = !cd_stream.empty();
bool ef_valid = !ef_stream.empty();
if (ef_valid){ // ef has higher priority as it produces the final result
adder_in_pack temp;
ef_stream >> temp;
to_adder << temp;
}
else{ // if there are no ef available, calculate a+b=e and c+d=f interleavely
if(lr == 0){
if (ab_valid){
lr = 1;
adder_in_pack temp;
ab_stream >> temp;
to_adder << temp;
}
}
else{
if (cd_valid){
lr = 0;
adder_in_pack temp;
cd_stream >> temp;
to_adder << temp;
}
}
}
}
extern "C" {
// R[i] = A[i] + B[i] + C[i] + D[i]
void QVadd(hls::stream<int>& in1, hls::stream<int>& in2, hls::stream<int>& in3, hls::stream<int>& in4, hls::stream<int>& out){
#pragma HLS INTERFACE ap_ctrl_none port=return
hls_thread_local hls::stream<adder_in_pack, 8> ab_pack;
hls_thread_local hls::stream<adder_in_pack, 8> cd_pack;
hls_thread_local hls::stream<adder_in_pack, 8> ef_pack;
hls_thread_local hls::stream<adder_in_pack, 8> to_adder;
hls_thread_local hls::stream<adder_out_pack, 8> from_adder;
hls_thread_local hls::stream<int, 8> back_to_adder;
hls_thread_local hls::task ab_creator(stream_merge<0>, in1, in2, ab_pack); // generate ab_pack
hls_thread_local hls::task cd_creator(stream_merge<1>, in3, in4, cd_pack); // generate cd_pack
hls_thread_local hls::task ef_creator(stream_s2p, back_to_adder, ef_pack); // generate ef_pack, serial to parallel
hls_thread_local hls::task CTL(Scheduler, to_adder, ab_pack, cd_pack, ef_pack); // scheduler
hls_thread_local hls::task real_adder(adder, to_adder, from_adder); // the adder
//
hls_thread_local hls::task router(data_router, from_adder, back_to_adder, out);
}
}
Notice that the back_to_adder stream has its producer behind its consumer, which would be illegal in a Vitis 2021.2 DATAFLOW region. In addition, with hls::task used, the DATAFLOW pragma shouldn't be added unless some functions have non-stream interfaces. The DATAFLOW viewer is shown below:
With this design style, the only thing we have to make sure of is that all sub-kernels, such as stream_merge and adder, have an II of 1. This is much easier to achieve because all functions are synthesized independently, and the data dependencies between them are handled by the designer rather than by HLS. If everything were written in a single function, the dependency caused by the loopback would be bound to increase the II to 2.
The simulation result is shown in the Figure below. You can obviously see that the average II is 3.
The essence of the hls::task-based design process that can be inferred from this example is listed below:
All kernels use AXI stream interfaces. DMA is kept outside the computation kernels (using mm2s and s2mm).
If possible, pass control signals along with the data. For example, the destination field in this example controls where the output of the adder should be sent; it travels through the adder together with the data. There are several benefits to this design style:
All kernels can run passively; no global controlling state machine is required.
Since the pipeline latency usually cannot be determined, and there are unexpected latencies caused by the registers in AXI stream interfaces, this style avoids unexpected halts and bubbles in the pipeline.
Just like in Verilog/VHDL, modularizing the design, i.e., breaking the entire function into several simple and straightforward small function units, helps HLS synthesize the design and stops it from considering non-existent data dependencies. In this example, 5 kernels are used, each doing a very simple task. Therefore, the hardware implementation is almost apparent just from reading the C code.
Pay attention to the depth of the real FIFOs in the design (such as ab_pack in the QVadd top function). If you don't specify the FIFO depth, the default is 2, which can cause problems if it is too small.
06/23/2023: Static variables in free-running kernels
Sometimes, static variables are necessary to implement a free-running function/kernel. This is fine when the function is called only once in the top. However, when it is called multiple times, we have to make sure each call leads to an independent hardware kernel, each with its own static variables. C/C++ won't understand this, and the static variables will be shared by all kernels. This can fail the DATAFLOW check, as it is a strong dependency.
To solve this, one possible way is to use a class: an object has its own member variables, and if the object is static, those members are also preserved across multiple function calls. Here is an example:
// base function
int shift(int input){
static int REG[32];
#pragma HLS ARRAY_PARTITION type=complete variable=REG dim=1
int temp = REG[31];
for (int i = 31; i > 0; i--){
#pragma HLS UNROLL
REG[i] = REG[i - 1];
}
REG[0] = input;
return temp; // the value shifted out
}
class SHIFT_REG{
public:
int REG[32];
SHIFT_REG(){
// pragmas applied to member variables go in the constructor
#pragma HLS ARRAY_PARTITION type=complete variable=this->REG dim=1
}
int shift(int input){
int temp = REG[31];
for (int i = 31; i > 0; i--){
#pragma HLS UNROLL
REG[i] = REG[i - 1];
}
REG[0] = input;
return temp; // the value shifted out
}
};
extern "C" {
void shift_reg(hls::stream<int>& in, hls::stream<int>& out){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS DATAFLOW
static SHIFT_REG chain1;
static SHIFT_REG chain2;
int i_temp;
int o_temp;
in >> i_temp;
o_temp = chain1.shift(i_temp);
int o_temp_1;
o_temp_1 = chain2.shift(o_temp);
out << o_temp_1;
}
}
In this example, if the shift function were called twice, both calls would access the same integer array. With two objects, each object has its own integer array, so they don't interfere with each other.
The second way is much simpler. It tricks the HLS tool by adding an unused template parameter:
template<int ID>
int shift(int input){
static int REG[32];
#pragma HLS ARRAY_PARTITION type=complete variable=REG dim=1
int temp = REG[31];
for (int i = 31; i > 0; i--){
#pragma HLS UNROLL
REG[i] = REG[i - 1];
}
REG[0] = input;
return temp; // the value shifted out
}
extern "C" {
void shift_reg(hls::stream<int>& in, hls::stream<int>& out){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS DATAFLOW
int i_temp;
int o_temp;
in >> i_temp;
o_temp = shift<0>(i_temp);
int o_temp_1;
o_temp_1 = shift<1>(o_temp);
out << o_temp_1;
}
}
Although the ID isn't used at all inside the shift function, it tells HLS to generate two independent kernels. The drawback is that HLS isn't smart enough to reuse the same synthesis result for the two copies; it basically treats them as two separate functions, which may increase the total synthesis time. The synthesis result is shown below: as you can see, there are two 'shift' units, each with its own 1026 registers (1024 for the 32 x 32-bit array and 2 for control).
07/02/2023: Be careful when using Circular State-Machine
A "BUG" was found when using a circular state machine (SM) in Vitis HLS. A circular state machine is one whose state advances periodically, which is mostly a counter. To demonstrate, I use a simple example called left-correction, which aims at finding the element located to the left of the current element. For example, if the series is '0,1,2,3,4,5,6,7,8,9', the left series should be a placeholder -1 followed by the first nine elements: '-1,0,1,2,3,4,5,6,7,8' ('-1' is on the left of '0', '0' is on the left of '1'). To realize this, two functions are used. The code is shown below.
#include "hls_task.h"
#define N 10
void last_pack_dropper(hls::stream<int>& i_stream, hls::stream<int>& o_stream){
#pragma HLS interface ap_ctrl_none port=return
#pragma HLS INTERFACE mode=axis port=i_stream register_mode=off
#pragma HLS INTERFACE mode=axis port=o_stream register_mode=off
#pragma HLS pipeline ii=1 style=flp
static int counter = 0;
if (counter == (N - 1)){
i_stream.read();
}
else{
int temp;
i_stream >> temp;
o_stream << temp;
}
if (counter == (N - 1)){
counter = 0;
}
else{
counter++;
}
}
void first_data_place_holder(hls::stream<int>& i_stream, hls::stream<int>& o_stream){
#pragma HLS interface ap_ctrl_none port=return
#pragma HLS INTERFACE mode=axis port=i_stream register_mode=off
#pragma HLS INTERFACE mode=axis port=o_stream register_mode=off
#pragma HLS pipeline ii=1 style=flp
static int counter = 0;
if (counter == 0){
o_stream << -1;
}
else{
int temp;
i_stream >> temp;
o_stream << temp;
}
if (counter == (N - 1)){
counter = 0;
}
else{
counter++;
}
}
extern "C" {
void left_corrector(hls::stream<int>& i_stream, hls::stream<int>& o_stream){
#pragma HLS interface ap_ctrl_none port=return
#pragma HLS INTERFACE mode=axis port=i_stream register_mode=off
#pragma HLS INTERFACE mode=axis port=o_stream register_mode=off
hls_thread_local hls::stream<int, 8> data_path;
hls_thread_local hls::task remove_last(last_pack_dropper, i_stream, data_path);
hls_thread_local hls::task add_place_holder(first_data_place_holder, data_path, o_stream);
}
}
The first function, last_pack_dropper, simply drops the last element of every 10. If the input is '0,1,2,3,4,5,6,7,8,9', the output is '0,1,2,3,4,5,6,7,8'. The second function, first_data_place_holder, inserts a placeholder '-1' at the front of the series, turning '0,1,2,3,4,5,6,7,8' into '-1,0,1,2,3,4,5,6,7,8', which is what we want. However, if you implement the code and run it in hardware emulation or on hardware, the output becomes '0,-1,1,2,3,4,5,6,7,8'. It may look like an HLS BUG, but it is not.
First, the "BUG" must be in first_data_place_holder, as the last element '9' is correctly removed from the original series. So let's first look at the scheduling of the second kernel.
The flow is very straightforward. However, the work is divided into two clock cycles. The first cycle compares the counter with 0 and saves the result in a register; the second cycle uses it to determine which data should be sent out and whether the kernel should read from the input FIFO. This can be verified in the generated Verilog file.
assign icmp_ln74_fu_51_p2 = ((counter_1 == 32'd0) ? 1'b1 : 1'b0);
always @ (posedge ap_clk) begin
if (......) begin
counter_1 <= select_ln83_fu_69_p3;
icmp_ln74_reg_83 <= icmp_ln74_fu_51_p2;
end
end
always @ (*) begin
if ((1'b1 == ap_condition_91)) begin
if ((icmp_ln74_reg_83 == 1'd1)) begin
o_stream_TDATA = 32'd4294967295; // -1
end else if ((icmp_ln74_reg_83 == 1'd0)) begin
o_stream_TDATA = data_path4_dout; // data from FIFO
end else begin
o_stream_TDATA = 'bx;
end
end else begin
o_stream_TDATA = 'bx;
end
end
It is this delay that causes the problem. The waveform is shown below:
Due to the delay, before the first data arrives, the register holding the comparison between counter and 0 is initialized to '0'. Therefore, when the output condition is first satisfied, the kernel sends out the first data in the input FIFO rather than the placeholder '-1'. In the second clock cycle in which the kernel can run, icmp_ln74_reg_83 is updated (counter == 0 in the previous period) and the placeholder is sent out. This is why the output of this kernel is wrong.
It is hard to say whether this is a BUG or not. The key question is whether the icmp_ln74_reg_83 register should be used to determine the output logic. If it merely holds an initial value, it shouldn't drive the output; if it holds a valid result from the last comparison between counter and 0, it should. This could be decided by the control signal that updates icmp_ln74_reg_83: only if that control signal has been asserted can the value in icmp_ln74_reg_83 in the next clock period be used to control the output logic. Now we can clearly see the problem: this control signal cannot be deduced from any information available in the current implementation, because the SM is periodic. Though we initialize the counter to 0, the hardware does not treat this 0 specially. Hence, HLS assumes the 0 in icmp_ln74_reg_83 comes from the last period, which finally causes the error.
One way to solve the problem is to use 'for' loops instead of static counter variables. However, the 'rewind' pragma must be added if the trip count is small, to remove the overhead caused by loop initialization. Here is the corrected code, followed by the simulation waveforms with and without rewind:
void first_data_place_holder(hls::stream<int>& i_stream, hls::stream<int>& o_stream){
#pragma HLS interface ap_ctrl_none port=return
#pragma HLS INTERFACE mode=axis port=i_stream register_mode=off
#pragma HLS INTERFACE mode=axis port=o_stream register_mode=off
for (int i = 0; i < N;i++){
#pragma HLS pipeline ii=1 rewind
if (i == 0){
o_stream << -1;
}
else{
int temp;
i_stream >> temp;
o_stream << temp;
}
}
}
'for' loop implementation without rewind. Bubbles show up.
'for' loop implementation with rewind. Bubbles are removed.
In some special cases, such as systolic array implementations, static counter variables may be unavoidable because an iterating 'for' loop cannot be used. In that case, we can use the following code:
void first_data_place_holder(hls::stream<int>& i_stream, hls::stream<int>& o_stream){
#pragma HLS interface ap_ctrl_none port=return
#pragma HLS INTERFACE mode=axis port=i_stream register_mode=off
#pragma HLS INTERFACE mode=axis port=o_stream register_mode=off
#pragma HLS pipeline ii=1 style=flp
static int counter = 0;
static bool first = true;
if (first){
if (i_stream.empty() == false){
first = false;
}
counter = 0;
}
else{
if (counter == 0){
o_stream << -1;
}
else{
int temp;
i_stream >> temp;
o_stream << temp;
}
if (counter == (N - 1)){
counter = 0;
}
else{
counter++;
}
}
}
In this implementation, the 'first' register clearly tells the hardware when the period begins and what the initial value of the counter is. Therefore, the hardware can behave correctly in the two possible situations when counter == 0. In the waveform below, you can clearly see that before the first data is valid, the counter == 0 register is 'X' rather than '0'.
Both solutions work. However, there is a trade-off between FF and LUT utilization. The initial design (which doesn't run correctly) uses 9 FFs and 107 LUTs; the rewind-for-loop solution uses 14 FFs and 143 LUTs; and the final solution with the global 'first' signal uses 43 FFs and 113 LUTs. The rewind for loop tends to use more LUTs because its control logic is more complex. The global-'first' solution uses fewer LUTs but more FFs; however, 32 of its 43 FFs buffer the 32-bit input integer, so the control part only uses 2 extra FFs (43 - 32 - 9 = 2). Since buffering the input may be inevitable in more complex examples, I personally prefer adding the global 'first' signal over the rewinding for loop, as it is much more flexible.
08/22/2023: Structure Initialization Problem
In C/C++, {} can be used to initialize a structure with all zeros, and this works in Vitis HLS in most cases. However, in some rare cases, using {} to create a zeroed structure may yield all 'x'. This may not cause any problem on hardware, where 'x' is realized as 0, but it can cause problems in debugging (hardware emulation), since 'x' propagates through operations and can finally turn everything into 'x'. Here is a small example:
template<int ELES>
void sorting(hls::stream<data_t>& i_stream, hls::stream<data_t>& o_stream){
#pragma HLS interface ap_ctrl_none port=return
#pragma HLS PIPELINE II=1 style=flp
#pragma HLS INTERFACE mode=axis port=i_stream register_mode=off
#pragma HLS INTERFACE mode=axis port=o_stream register_mode=off
static int i = 0;
static bool inited = false;
int i_temp;
if (!inited){
if (!i_stream.empty()){
inited = true;
}
i = 0;
}
else{
data_t o_temp;
if (i < (ELES - 1)){
i_stream >> o_temp;
}
else{
o_temp = {};
}
o_stream << o_temp;
if (i == (ELES - 1)){
i = 0;
}
else{
i++;
}
}
}
This kernel pads a zero to a series. For example, if the input is 0,1,2,...,8,9, the output is 0,1,2,...,8,9,0. If you run hardware emulation, the waveform is shown below:
Clearly, the last cycle gives an 'x' instead of 0. This can also be verified from the generated Verilog.
assign ap_condition_65 = ...; // ap_ctrl block level interface signal. It can be viewed as always '1'.
assign icmp_ln26_fu_95_p2 = (($signed(i) < $signed(32'd10)) ? 1'b1 : 1'b0);
assign ap_phi_reg_pp0_iter0_o_temp_1_reg_69 = 'bx;
always @ (posedge ap_clk) begin
if ((1'b1 == ap_condition_65)) begin
if (((icmp_ln26_fu_95_p2 == 1'd1) & (inited == 1'd1))) begin
ap_phi_reg_pp0_iter1_o_temp_1_reg_69 <= i_stream_TDATA;
end else if ((1'b1 == 1'b1)) begin
ap_phi_reg_pp0_iter1_o_temp_1_reg_69 <= ap_phi_reg_pp0_iter0_o_temp_1_reg_69;
end
end
end
According to the code, when the counter 'i' is smaller than 10, the output is i_stream_TDATA. Otherwise, it equals all 'x', which should have been 0.
This BUG is very rare, but it can be worked around in several ways:
Avoid using {} to initialize a structure:
C++ now supports designated initializers, which assign values to members by name. For example, if a structure is defined like this:
typedef struct{
int A;
int B;
}data_t;
Then, it can be initialized as this:
data_t temp = {
.A = 0,
.B = 0,
};
However, this may not always work: if there is a large array inside the structure, listing every array element is too bulky (and using {} to initialize an array can trigger the same bug).
A weird way:
The problem can also be solved in a seemingly nonsensical way. Simply modify the code a little:
if (i < (ELES - 1)){
i_stream >> o_temp;
o_stream << o_temp;
}
else{
o_temp = {};
o_stream << o_temp;
}
This code should theoretically be no different from the original, but it works. Therefore, it must be a bug.