Player detection is one of the more challenging tasks in this project because of the camera setup used to capture the entire football match. The video footage is captured by two cameras placed in close proximity, and the frames are later stitched so that the entire football pitch appears in a single frame.
The YOLO deep learning model was used to detect person-class objects in every frame. Frames in the Statmetrix videos are roughly 1K to 2K pixels high and 4K to 6K pixels wide, so a single frame covers the entire field. Resizing the full frame to the YOLO network size (a square input whose side is a multiple of 32) and then detecting person-class objects causes problems such as a single bounding box covering several persons and irregular bounding boxes. This is due to the uneven resizing of the original image, whose ratio is about 1:3 (h:w), into a square image during inference.
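The effect of the uneven resize can be illustrated with a little arithmetic. The sketch below assumes the frame is stretched directly to a square network input with no letterboxing; the frame size, network size, and player box dimensions are illustrative values, not measurements from the Statmetrix videos, and the function names are hypothetical.

```python
def resize_scales(frame_h, frame_w, net_size):
    """Return (vertical, horizontal) scale factors for a direct
    resize of a frame to a net_size x net_size square input."""
    if net_size % 32 != 0:
        raise ValueError("network size must be a multiple of 32")
    return net_size / frame_h, net_size / frame_w


def resized_box(box_h, box_w, frame_h, frame_w, net_size):
    """Return the (height, width) of a bounding box after the resize."""
    sy, sx = resize_scales(frame_h, frame_w, net_size)
    return box_h * sy, box_w * sx


# Illustrative values only: a 1:3 (h:w) stitched frame and a distant player.
frame_h, frame_w = 1500, 4500
net_size = 608

sy, sx = resize_scales(frame_h, frame_w, net_size)
box_h, box_w = resized_box(60, 20, frame_h, frame_w, net_size)

print(f"vertical scale   = {sy:.3f}")
print(f"horizontal scale = {sx:.3f}")
print(f"player box after resize: {box_h:.1f} px high, {box_w:.1f} px wide")
```

With a 1:3 frame, the horizontal axis is compressed three times more than the vertical axis, so a distant player ends up only a few pixels wide at the network input and neighbouring players can merge into one box.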
To overcome the problems that arise with full-size images, we split each image into parts whose ratio is close to 1:1.5 (h:w). Along with splitting, increasing the network size allows person-class objects that are far away from the camera to be detected. This is because the anchors used in the YOLO model cannot detect the features of such objects once the image is resized to a low resolution. At a larger network size, the feature dimensions are preserved and detected by the trained anchors, which increases the overall Average Precision (AP). Using splits gives better results than the full-size image and improves the AP measured against the ground-truth annotations.
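A minimal sketch of the splitting step follows, under the assumption that the frame is cut into equal vertical strips without overlap and that detections from each strip are shifted back into full-frame coordinates. The function names and the `(x, y, w, h)` box layout are hypothetical, chosen only for illustration.

```python
def split_columns(frame_h, frame_w, target_ratio=1.5):
    """Return (x_start, x_end) pixel ranges that cut the frame into
    equal vertical strips whose w/h ratio is close to target_ratio."""
    n = max(1, round(frame_w / (target_ratio * frame_h)))
    edges = [round(i * frame_w / n) for i in range(n + 1)]
    return [(edges[i], edges[i + 1]) for i in range(n)]


def to_frame_coords(box, x_offset):
    """Shift a box (x, y, w, h) detected inside a strip back into
    the coordinate system of the full stitched frame."""
    x, y, w, h = box
    return (x + x_offset, y, w, h)


# Illustrative 1:3 (h:w) frame; each strip comes out at 1:1.5.
tiles = split_columns(1500, 4500)
for x0, x1 in tiles:
    print(f"strip x = [{x0}, {x1}), ratio w/h = {(x1 - x0) / 1500:.2f}")
```

Each strip would then be passed to the detector separately. A player standing on a strip boundary can be cut in two with this simple scheme, so a practical pipeline may need overlapping strips and a merging step.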
AC1 and AC5 are the videos used for testing. Average precision is shown for different network sizes and splits.