Low-latency and energy-efficient object detection and tracking on FPGA
Advised by Prof. Callie Hao at Georgia Tech
Our main idea is to add a lightweight branch that generates a rectangular mask separating the object from the background, so that the accelerator can skip computation on background regions. As a case study, we use SkyNet as the backbone and attach the new branch after the third bundle. By sharing convolution layers with the SkyNet backbone, the new branch reuses the preliminary features already extracted from the image; the confidence mask is then produced by a two-layer fully convolutional network, which keeps the branch's parameter count small.
[Figure: SkyNet backbone with the new mask-generation branch]
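A minimal PyTorch sketch of such a branch is below. The class name `MaskBranch`, the channel counts, and the patch-grid size are illustrative assumptions, not SkyNet's exact configuration; only the structure (a shared-feature input followed by a two-layer fully convolutional head producing per-patch confidence scores) follows the description above.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Lightweight two-layer fully convolutional branch that turns shared
    backbone features into a coarse patch-level confidence mask.
    Channel counts and grid size here are assumed for illustration."""
    def __init__(self, in_ch=96, mid_ch=32, grid=8):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(mid_ch, 1, kernel_size=1)  # 1-channel confidence map
        self.pool = nn.AdaptiveAvgPool2d(grid)            # one score per patch

    def forward(self, feats):
        # `feats` are the features shared with the backbone (e.g. after bundle 3)
        x = self.relu(self.conv1(feats))
        x = self.conv2(x)
        return torch.sigmoid(self.pool(x))  # grid x grid scores in [0, 1]
```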
Binarization of the confidence mask: After obtaining the confidence mask, we apply a threshold to each patch's score. If the score exceeds the threshold, we keep the patch and reset its score to 1; otherwise, we treat the patch as background and reset its score to 0.
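As a sketch, binarization is a single thresholding operation; the default threshold of 0.5 below is an assumed placeholder, not a value from this project.

```python
import torch

def binarize_mask(conf_mask: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Reset each patch score to 1 if it exceeds the threshold, else 0.
    The threshold is a tunable hyperparameter (0.5 is assumed here)."""
    return (conf_mask > threshold).to(conf_mask.dtype)
```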
All-pass mechanism: To make mask generation more robust and avoid mis-predictions on hard-to-detect cases, if no patch in a confidence mask exceeds the threshold, we assume the new branch has failed to locate the object. In that case, the scores of all patches in the mask are reset to 1, so the whole frame is processed. We believe this fallback helps preserve IoU.
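A minimal sketch of this fallback, applied to the binarized mask from the previous step:

```python
import torch

def apply_all_pass(binary_mask: torch.Tensor) -> torch.Tensor:
    """If no patch survived the threshold, fall back to processing the whole
    frame by setting every patch score to 1 (the all-pass mechanism)."""
    if binary_mask.sum() == 0:
        return torch.ones_like(binary_mask)
    return binary_mask
```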
Mask shape regularization: We propose shape regularization to convert irregular mask shapes into regular ones, yielding a rectangular mask defined by four values: hmin, hmax, wmin, and wmax. These boundary values can be mapped directly to the corresponding feature-map regions by scaling, making it easy for the FPGA to compute only the part of the image where the object may exist.
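One straightforward way to realize this, sketched below, is to take the tightest enclosing rectangle of the active patches. The function name and return convention are assumptions for illustration; the mask is assumed to have at least one active patch, which the all-pass mechanism guarantees.

```python
import torch

def regularize_mask(binary_mask: torch.Tensor):
    """Convert an irregular binary patch mask (H_patches x W_patches) into its
    tightest enclosing rectangle, returning the filled rectangular mask and
    the four boundaries (hmin, hmax, wmin, wmax) for the accelerator.
    Assumes at least one active patch (ensured by the all-pass mechanism)."""
    mask = binary_mask.bool()
    rows = mask.any(dim=1).nonzero().flatten()  # patch rows containing the object
    cols = mask.any(dim=0).nonzero().flatten()  # patch columns containing the object
    hmin, hmax = int(rows.min()), int(rows.max())
    wmin, wmax = int(cols.min()), int(cols.max())
    rect = torch.zeros_like(binary_mask)
    rect[hmin:hmax + 1, wmin:wmax + 1] = 1
    # Boundaries map to feature-map coordinates by scaling,
    # e.g. hmin_feat = hmin * feat_h // grid (illustrative).
    return rect, (hmin, hmax, wmin, wmax)
```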
Channel shuffle: To reduce the computation overhead and parameter count of the new branch, we use group convolution. To reduce data movement between DDR and on-chip memory, we apply channel shuffle, which lets us load the required data into on-chip memory in a specific order.
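The sketch below pairs a group convolution with the standard ShuffleNet-style channel shuffle; the channel counts, group number, and tensor shape are illustrative assumptions. Group convolution cuts parameters and MACs by roughly the group count, and the shuffle keeps information flowing across groups between layers.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups (ShuffleNet-style) so the next
    group convolution mixes information from every group."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Illustrative usage: a 3x3 group convolution followed by a shuffle.
conv = nn.Conv2d(96, 96, kernel_size=3, padding=1, groups=4)
x = torch.randn(1, 96, 40, 80)   # assumed feature-map shape
y = channel_shuffle(conv(x), groups=4)
```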