MARIO: Modular and Extensible Architecture for Computing Visual Statistics in Robocup SPL
Domenico D. Bloisi, Andrea Pennisi, Cristian Zampino, Flavio Biancospino,
Francesco Laus, Gianluca Di Stefano, Michele Brienza, and Rocchina Romano
Introduction
RoboCup and the 2050 Challenge
RoboCup is an annual international robotics initiative founded in 1996 by a group of university professors. The aim of the competition is to promote robotics and AI research by offering an appealing and formidable challenge. One of the most effective ways to promote science and engineering research is to set a challenging long-term goal. The long-term goal – the "dream" – is in this case to create, by the middle of the 21st century, a team of fully autonomous humanoid robots capable of playing and winning a soccer game against the winner of the most recent World Cup, in compliance with the official FIFA rules.
RoboCup 2022 Open Research Challenge
Since its foundation, one of the objectives of the RoboCup initiative has been to push the boundaries of research by offering high-level challenges. For the year 2022, the Open Research Challenge proposed by the RoboCup Standard Platform League (often referred to as RoboCup SPL) is about generating statistics from external video data captured using a GoPro-like camera. Attempts to generate consistent statistics were made in the past [3] using only GameController/TeamCom data. Unfortunately, these data are not sufficient, which is why an Open Research Challenge addressing this issue was proposed.
With reference to the Open Research Challenge 2022, two major goals exist: the short-term goal is to calculate the extrinsic camera parameters (camera matrix) from the camera feed and to locate/track all moving objects (ball, robots) on the field; the long-term goal involves the creation of game statistics (time under control, successful/unsuccessful shots on goal, passes, etc.) based on the objects and positions located for the short-term goal.
MARIO Introduction
MARIO is an end-to-end architecture for computing visual statistics in RoboCup SPL. One of the best qualities of the MARIO project lies in its modularity: each module performs a specific task, and together the modules carry out the proposed challenge.
The SPQR team of La Sapienza University of Rome, in collaboration with the UNIBAS WOLVES team of the University of Basilicata, took part in the Open Research Challenge of RoboCup 2022, held in Bangkok (Thailand), with the MARIO project. It is with great pride that we mention the victory (ex-aequo with the B-Human team) in this competition; Table 1 presents the competition ranking.
The winners were decided through a vote among the SPL teams using the Borda counting mechanism. Each participating SPL team ranked its five best teams in order. For the voting, the teams were encouraged to evaluate the performances according to the following criteria:
Achievement of long/short term goals;
Execution time and hardware requirements;
Metrics (accuracy/precision/recall);
Technical strength;
Novelty.
MARIO System Overview
The architecture of MARIO is modular. This means that each module performs a specific task and all modules work in synergy in order to achieve the Open Research Challenge goals. The figure 1 shows the architecture diagram of MARIO.
In this diagram each module has been divided appropriately according to the type of goal pursued, whether belonging to one of the short-term goals or belonging to one of the long-term goals. In detail:
Short-Term Goals
Camera Calibration. A preliminary camera calibration is performed in order to remove the camera lens distortion;
Background Subtraction. A background subtraction technique is applied;
Homography. A homography matrix is used to compute a plan view of the field;
Tracking and Localization. A combination of the YOLOv5 and StrongSORT models is used to track and localize the players and the ball.
Long-Term Goals
Pose Estimation. A Convolutional Neural Network (CNN) model is used to perform the pose estimation of the robots. A custom dataset is created specifically for this task;
Fall Detection. Based on the skeletal information obtained, a Spatial-Temporal Graph Convolutional Network (ST-GCN) model is used to perform the fall detection;
Illegal Defender. The tracking results are used to check that no more than three players from the same team are in the same penalty area;
Data Association and Statistics. Game data containing player and ball information are extracted and used to compute the statistics about the game.
Camera Calibration
A camera calibration procedure is performed in order to remove the distortion caused by the camera lenses. To undistort a distorted image, it is necessary to know the intrinsic parameters of the camera that took the shots. The intrinsic parameters include information such as the focal length, (fx, fy), and the optical center, (ox, oy). The focal length and the optical center can be used to create a camera matrix, which in turn can be used to remove distortion from the shot.
The camera matrix is unique to the specific camera and, once calculated, it can be reused on other images taken by the same camera. Formally, the camera matrix is a 3 x 3 matrix with the following structure:

K = | fx  0  ox |
    |  0  fy  oy |
    |  0   0   1 |
Generally, the camera calibration process uses images of a 3D object with a geometrical pattern (e.g., checker board). The pattern is called the calibration grid. The 3D coordinates of the pattern are matched to 2D image points. The correspondences are used to obtain the camera parameters.
In the context of MARIO, we have at our disposal only the individual images captured by the camera. We have no information about the camera model used and, consequently, we do not know anything about its parameters. However, the images of the field do contain some patterns such as the corners, the intersections between the penalty areas and the goal lines, the intersection between the midfield line and the sidelines, the penalty area corners, etc. Taking this into account, an association can be built using these points. In this way, it is possible to compute the camera matrix, the distortion coefficients, and the rotation and translation vectors. This calibration process is repeated for all the different cameras used to capture the match shots. In the end, a calibration file was created for each of these different cameras and saved accordingly.
Figure 3 shows an image before and after the calibration process.
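As an illustration of this step, a minimal sketch of the calibration and undistortion with OpenCV is given below, assuming the 3D field landmarks and their 2D image correspondences have already been matched; all point values, file names, and the image size are illustrative, and in practice many more correspondences (ideally from several views) are needed for a stable estimate.

```python
# Minimal calibration/undistortion sketch with OpenCV; the point values,
# file names, and image size below are illustrative assumptions.
import numpy as np
import cv2

# 3D coordinates of known field landmarks (corners, line intersections, ...),
# in meters, all lying on the field plane (z = 0).
field_points_3d = np.array([[0.0, 0.0, 0.0], [0.0, 6.0, 0.0],
                            [4.5, 0.0, 0.0], [4.5, 6.0, 0.0],
                            [9.0, 0.0, 0.0], [9.0, 6.0, 0.0]], dtype=np.float32)

# Corresponding 2D pixel coordinates detected in the camera image.
image_points_2d = np.array([[102.0, 210.0], [ 98.0, 540.0],
                            [640.0, 215.0], [645.0, 545.0],
                            [1180.0, 205.0], [1175.0, 550.0]], dtype=np.float32)

image_size = (1920, 1080)  # (width, height) of the input frames

# calibrateCamera expects one array of object/image points per view; here a
# single view is used for brevity, which gives only a rough estimate.
ret, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    [field_points_3d], [image_points_2d], image_size, None, None)

# Undistort a frame with the estimated parameters and save the result.
frame = cv2.imread("frame.png")
undistorted = cv2.undistort(frame, camera_matrix, dist_coeffs)
cv2.imwrite("frame_undistorted.png", undistorted)
```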
Background Subtraction
Background Subtraction (BS) is a popular and widely used technique that represents a fundamental building block for different Computer Vision applications, ranging from automatic monitoring of public spaces to augmented reality. The BS process is carried out by comparing the current input frame with the model of the scene background and considering as foreground points the pixels that differ from the model. Thus, the fundamental problem is to generate a background model that is as reliable as possible and consistent with the observed scene. BS has been largely studied and many techniques have been developed for tackling the different aspects of the problem.
IMBS (Independent Multimodal Background Subtraction) is a BS method that has been designed for dealing with highly dynamic scenarios characterized by non-regular and high frequency noise. IMBS is a per-pixel, non-recursive, and non-predictive BS method, meaning that:
each pixel signal is regarded as an independent process (per-pixel);
a set of input frames is analysed to estimate the background model based on a statistical analysis of those frames (non-recursive);
the order of the input frames is considered not significant (non-predictive).
The design choices listed above are fundamental for achieving a very fast computation, since (i) working at pixel level and (ii) considering each background model as independent from the previously computed ones allow for carrying out the BS process in parallel.
IMBS-MT (Independent Multimodal Background Subtraction MultiThread) is an enhanced version of IMBS. IMBS-MT differs from the original IMBS in two aspects:
The background formation and foreground extraction processes are carried out in parallel on a disjoint set of subimages from the original input frame;
The background model is initialized incrementally, i.e., the quality of the model increases as soon as more frame samples are available.
IMBS-MT is designed for performing an accurate foreground extraction in real time for full HD images. IMBS-MT can deal with illumination changes, camera jitter, movements of small background elements, and changes in the background geometry.
Figure 3 shows an image before and after the application of IMBS-MT.
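As an illustration of the general per-pixel BS pipeline (this is a stand-in using OpenCV's built-in MOG2 subtractor, not the IMBS-MT implementation used by MARIO), a minimal sketch could look as follows; the input file name and parameter values are illustrative.

```python
# Illustrative per-pixel background subtraction loop with OpenCV's MOG2;
# a stand-in to show the general BS pipeline, not IMBS-MT itself.
import cv2

cap = cv2.VideoCapture("match.mp4")  # hypothetical input video
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                   varThreshold=16,
                                                   detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels that differ from the background model are marked as foreground.
    fg_mask = bg_subtractor.apply(frame)
    # Remove small noise blobs with a morphological opening.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```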
Homography
To calculate the homography matrix, which is useful for applying the perspective transformation of the soccer field, we used: as the source image, an image of the soccer field as seen from the camera point of view; as the destination image, a reference image of the field as seen from above.
To calculate a homography between two images, it is necessary to know some points of correspondence between them. To do this, we trained a neural network to recognize patterns such as the corners, the intersections between the penalty areas and the goal lines, the intersection between the midfield line and the sidelines, the penalty area corners, etc. The network model used is YOLOv5 (a more detailed explanation of this model is given in Section 4.1.3) and the dataset used for training was created using images of all field types.
Figure 4 shows the patterns identified by the detector.
Using this model, it is possible to:
infer which version of the field is being used. In this case, the labels indicating penalty areas and goal areas are used as discriminants;
calculate the set of source points used for the application of homography, which correspond to the points of intersection and joining of field lines.
The set of destination points is fixed. A K-Means algorithm is then applied in order to create a correspondence between source and destination points.
The homography and the field version are computed in a fully automatic fashion. However, it is also possible to perform the homography manually. In this case, the set of source points is still computed through the YOLOv5 model, while the set of destination points is chosen by the user with the mouse. This method is usually adopted when the reprojection error of the automatic homography exceeds a threshold beyond which a good projection result cannot be guaranteed.
Figure 5 shows an image of the soccer field as seen from the camera point of view and the same image after the application of the perspective transformation.
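Assuming the matched source and destination point sets described above are available, a minimal OpenCV sketch of this perspective transformation could look as follows; the point coordinates, output size, and file names are illustrative.

```python
# Homography estimation and field re-projection sketch; the point values,
# output size, and file names are illustrative assumptions.
import numpy as np
import cv2

# Pixel coordinates of field landmarks detected in the camera view.
src_points = np.array([[310, 220], [1620, 235], [1750, 820], [180, 805]],
                      dtype=np.float32)
# Corresponding coordinates on the top-view field model image.
dst_points = np.array([[0, 0], [900, 0], [900, 600], [0, 600]],
                      dtype=np.float32)

# RANSAC makes the estimate robust to a few wrong correspondences.
H, inlier_mask = cv2.findHomography(src_points, dst_points, cv2.RANSAC, 5.0)

frame = cv2.imread("frame_undistorted.png")
top_view = cv2.warpPerspective(frame, H, (900, 600))  # plan view of the field
cv2.imwrite("top_view.png", top_view)
```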
Tracking and Localization
To carry out robot and ball detection, the database of images provided by the RoboCup SPL is used. Each team provided 5000 images captured from matches of RoboCup SPL 2019 and, for each one, carried out the labelling as explained in the rulebook. As there were seven participating teams, the dataset used consists of 35000 images.
The dataset was divided into three parts so that the training, validation, and testing operations could be carried out, using a script that randomly chooses the images in order to obtain a variety of data within each group. 24000 images are used for training, 3000 for validation, and the remaining 8000 for testing.
The dataset used for the detection is distributed as-is at the following link.
The training of the neural network was carried out using the YOLOv5 model. In order to perform the detection, the dataset was appropriately converted to the YOLO format. The training lasted 270 epochs.
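For illustration, a trained YOLOv5 model can be loaded and run on a single frame through the Ultralytics hub API as sketched below; the weight file name and confidence threshold are illustrative assumptions.

```python
# Sketch of running a trained YOLOv5 detector on one frame via torch.hub;
# the weight file name ("best.pt") and threshold are illustrative.
import torch

model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.conf = 0.4  # confidence threshold

results = model('frame_undistorted.png')
# Each row: x1, y1, x2, y2, confidence, class index (e.g., robot or ball).
detections = results.xyxy[0].cpu().numpy()
print(detections)
```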
Tracking is the task of predicting the positions of objects throughout a video using their spatial and temporal features. More technically, tracking is getting the initial set of detections, assigning unique IDs, and tracking them throughout frames of the video feed while maintaining the assigned IDs. Tracking is generally a two-step process that involves the use of a detection module for target localization - in our case, the YOLOv5 model - and a motion predictor.
Trackers can be classified based on the number of objects to be tracked: a Single Object Tracker (also referred to as SOT) tracks only a single object, even if many other objects are present in the frame; a Multiple Object Tracker (also referred to as MOT) tracks multiple objects present in a frame.
The tracker used in the present work is a MOT. In detail, we used a tracker called StrongSORT, an enhanced version of the popular DeepSORT algorithm, which in turn is an extension of the SORT (Simple Online Realtime Tracking) algorithm.
SORT is an approach to object tracking in which rudimentary techniques such as the Kalman filter and the Hungarian algorithm are used to track objects, while still performing better than many online trackers. SORT is made of 4 key components, which are as follows:
Detection. In this step, an object detector detects the objects in the frame that are to be tracked. Detectors like FrRCNN, YOLO, and more are most frequently used;
Estimation. In this step detections are propagated from the current frame to the next using a constant velocity model. When a detection is associated with a target, the detected bounding box is used to update the target state where the velocity components are optimally solved via the Kalman filter framework;
Data association. A cost matrix is computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets (see the sketch after this list). The assignment is solved optimally using the Hungarian algorithm. If the IOU of a detection and a target is less than a certain threshold, then that assignment is rejected. This technique helps handle occlusions and maintain the IDs;
Creation and Deletion of Track Identities. This module is responsible for the creation and deletion of IDs. Unique identities are created and destroyed according to the threshold defined in the previous step: if the overlap between a detection and the existing targets is less than the threshold, the detection signifies a new, untracked object. Tracks are terminated if they are not detected for a specified number of frames. Should an object reappear, tracking implicitly resumes under a new identity.
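The following minimal sketch illustrates the IOU-based data association step described above, building the cost matrix and solving it with the Hungarian algorithm via scipy; the threshold value is illustrative and this is not the actual SORT/StrongSORT code.

```python
# SORT-style data association sketch: IOU cost matrix between detections and
# predicted target boxes, solved with the Hungarian algorithm.
# Boxes are (x1, y1, x2, y2); the threshold value is illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(detections, predictions, iou_threshold=0.3):
    cost = np.zeros((len(detections), len(predictions)))
    for i, d in enumerate(detections):
        for j, p in enumerate(predictions):
            cost[i, j] = 1.0 - iou(d, p)            # lower cost = better overlap
    rows, cols = linear_sum_assignment(cost)         # optimal assignment
    matches = [(i, j) for i, j in zip(rows, cols)
               if 1.0 - cost[i, j] >= iou_threshold]  # reject weak overlaps
    return matches
```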
SORT performs very well in terms of tracking precision and accuracy, but returns tracks with a high number of ID switches and fails in case of occlusion. This is because of the association metric used. DeepSORT uses a better association metric that combines both motion and appearance descriptors: it can be defined as a tracking algorithm that tracks objects not only based on the velocity and motion of the object, but also on its appearance.
The appearance network is trained on a large-scale person re-identification dataset, making it suitable for the tracking context. To train the deep association metric model, DeepSORT adopts a cosine metric learning approach.
In the context of MARIO, an enhanced version of DeepSORT is used, the StrongSORT algorithm. StrongSORT algorithm upgrades the DeepSORT algorithm in various aspects such as detection, embedding and association.
Pose Estimation
Multi-person pose estimation is an important task and may be used in different domains, such as action recognition, motion capture, sports, etc. The task consists of predicting a pose skeleton for every person in an image. The skeleton consists of keypoints, or joints, that identify specific body parts such as ankles, knees, hips, and elbows. The multi-person pose estimation problem can usually be approached in two ways. The first one, called top-down, applies a person detector and then runs a pose estimation algorithm for every detected person. In this way, the pose estimation problem is decoupled into two sub-problems, and the state-of-the-art achievements from both areas can be used. The inference speed of this approach strongly depends on the number of people detected in the image. The second one, called bottom-up, is more robust to the number of people: at first, all keypoints are detected in a given image, then they are grouped by person instances. Such an approach is usually faster than the previous one, since it finds the keypoints once and does not rerun pose estimation for each person.
OpenPose is the first real-time multi-person system to jointly detect human body, foot, hand, and facial keypoints on single images. OpenPose uses a CNN model such as VGG-19 for feature map extraction. The feature map is then processed in a multi-stage CNN pipeline in order to generate the Part Confidence Maps (PCM) and the Part Affinity Fields (PAF). In the last step, the Confidence Maps and the Part Affinity Fields are processed by a greedy bipartite matching algorithm to obtain the poses for each entity in the image.
In the context of MARIO, a slightly optimized version of OpenPose, called Lightweight OpenPose, is used. This optimized method allows real-time inference on the CPU with a negligible accuracy drop.
Some of the improvements achieved by the Lightweight OpenPose method concern the following points:
Lightweight Backbone. A lighter but still performant network is used. The pre-trained network used is MobileNet v1;
Lightweight Refinement Stage. To produce new estimations of the PCMs and PAFs, the refinement stage takes features from the backbone, concatenated with the previous estimation of the PCMs and PAFs. In order to share the computations between PCMs and PAFs and, consequently, achieve a consistent speed-up, a single prediction branch is used in both the initial and the refinement stage. All layers are shared except for the last two, which produce the model outputs;
Fast Post-processing. Keypoint extraction is performed in a parallel fashion.
In figure 8 we show an example of pose estimation performed on a single frame of a RoboCup soccer game.
In the following, we explain how the training dataset was created and how the Lightweight OpenPose method was adapted to the case of NAO robots.
NAO robots share the same body structure as human beings. Consequently, one might think that little to no effort is needed to use the Lightweight OpenPose method with NAO robots. However, the PCMs and PAFs calculated by this method show that substantial differences exist between robots and humans – at least from the model's perspective. For this reason, a specialized dataset, named the UNIBAS NAO Pose Dataset, was built.
The tool used to create the dataset is COCO Annotator, a web-based image annotation tool designed to label images in a versatile and efficient way, in order to create training data for image localization and object detection. All annotations share the same basic data structure. The pose may contain up to 18 keypoints: ears, eyes, nose, neck, shoulders, elbows, wrists, hips, knees, and ankles. The annotations are stored using JSON.
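For illustration, a single annotation in this COCO-style format could look roughly as follows; the numeric values are invented and the keypoint list is truncated for brevity.

```python
# Illustrative structure of a single COCO-style keypoint annotation as used by
# COCO Annotator; values are made up. Each keypoint is stored as (x, y, v),
# with visibility v: 0 = not labeled, 1 = labeled but occluded, 2 = visible.
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 1,                  # the NAO robot category
    "num_keypoints": 18,
    "keypoints": [
        612, 233, 2,                   # nose
        615, 250, 2,                   # neck
        590, 252, 2,                   # right shoulder
        # ... the remaining keypoints (eyes, ears, elbows, wrists,
        # hips, knees, ankles) follow in the same (x, y, v) layout
    ],
    "bbox": [560, 210, 110, 260],      # x, y, width, height
}
```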
The dataset is distributed as-is at the following link.
Fall Detection
Human action recognition has become an active research area in recent years, as it plays a significant role in video understanding. In general, human action can be recognized from multiple modalities, and one of these modalities involves the use of dynamic skeletal information. The dynamic skeletal information can be naturally represented by a time series of human joint locations, in the form of 2D or 3D coordinates. Human actions can then be recognized by analyzing the motion patterns thereof. Skeletons and joint trajectories of human bodies are robust to illumination changes and scene variations, and they are easy to obtain via accurate depth sensors or pose estimation algorithms. Because of these significant advantages, there is a wide range of skeleton-based approaches to action recognition. The approaches can be categorized into:
feature-based methods, which design several handcrafted features to capture the dynamics of joint motion. These could be covariance matrices of joint trajectories, relative positions of joints, or rotations and translations between body parts;
deep learning methods, which involve the use of recurrent neural networks (RNNs) and temporal CNNs.
Using the skeletal information retrieved through the Lightweight OpenPose method and observing the dynamics of the single skeleton over time it is possible to infer the type of action that is being performed by the robots. In detail, we are interested in capturing the falling action of the robots.
The Spatial-Temporal Graph Convolutional Network (often referred to as ST-GCN) is a graph convolutional network (GCN) – a generalization of the CNN – able to estimate actions from the spatio-temporal graph of a skeleton sequence.
An example of spatio-temporal graph is shown in figure 9.
There are two ways to perform convolution on graphs: the first approach operates in the spatial domain, the second in the spectral domain. The ST-GCN method uses convolution in the spatial domain, where the data are represented in the form of graph nodes and connections between them. In this context, convolution, in the strict sense of the term, cannot be applied directly: a rule is needed to turn the graph convolution problem into a standard convolution problem. For ST-GCN, the rule is the following: a kernel operates only on the direct neighbors of a node. Moreover, the neighbor set is divided into several subsets, and different weights are applied to these subsets during convolution.
ST-GCN proposes three partition strategies to divide the neighbor set:
Uni-labeling. All nodes in a neighborhood are in the same subset;
Distance partitioning. The root node is put in a subset (distance 0) and the remaining neighbors into another subset (distance 1).
Spatial configuration. The nodes are partitioned according to their distances to the skeleton gravity center, compared with the distance of the root node. Three subsets are then created: one subset for the root node; one subset for the centripetal nodes, which have a shorter distance than the root node; and one subset for the centrifugal nodes, which have a longer distance than the root node (a small sketch of this rule is given below).
Figure 10 shows the partitioning methods just described.
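A minimal sketch of the spatial-configuration partitioning rule follows: each neighbor of a root joint is labeled as root, centripetal, or centrifugal by comparing its distance to the skeleton gravity center with that of the root joint; the joint coordinates are illustrative.

```python
# Spatial-configuration partitioning sketch: label each joint of a neighborhood
# as root (0), centripetal (1), or centrifugal (2) with respect to a root joint.
# Joint coordinates below are illustrative, not from a real skeleton.
import numpy as np

def partition_label(joint, root, gravity_center):
    if np.allclose(joint, root):
        return 0                                    # root subset
    d_joint = np.linalg.norm(np.asarray(joint) - np.asarray(gravity_center))
    d_root = np.linalg.norm(np.asarray(root) - np.asarray(gravity_center))
    return 1 if d_joint < d_root else 2             # centripetal / centrifugal

skeleton = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.5], [-1.0, 1.5]])
center = skeleton.mean(axis=0)                      # skeleton gravity center
root = skeleton[1]
labels = [partition_label(j, root, center) for j in skeleton]
print(labels)
```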
The network contains several spatio-temporal convolutional blocks. Each of these blocks performs four actions: temporal convolution, partition, graph convolution and, in order to get optimized results, a second temporal convolution.
Figure 11 shows the ST-GCN architectural scheme.
In the context of MARIO, the ST-GCN method is used to identify fallen robots. Due to the low resolution of the images, the robot skeleton may be missing some of its limbs. To solve this issue, an additional fallback method was implemented to detect falls. In detail, the shape of the bounding box obtained in the robot detection phase is used: if the aspect ratio of the bounding box is below a specific threshold, the robot is considered to have fallen.
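A minimal sketch of this fallback check is given below, assuming bounding boxes in (x1, y1, x2, y2) format and an aspect ratio defined as height over width; the threshold value is illustrative, not the one used by MARIO.

```python
# Aspect-ratio fallback for fall detection: a box that is much wider than it is
# tall (ratio below the threshold) is treated as a fallen robot.
# The threshold value and the height/width definition are assumptions.
def is_fallen(bbox, ratio_threshold=0.8):
    x1, y1, x2, y2 = bbox
    width, height = x2 - x1, y2 - y1
    aspect_ratio = height / max(width, 1e-9)   # height/width of the detection
    return aspect_ratio < ratio_threshold

print(is_fallen((100, 300, 260, 380)))  # wide, low box -> True (fallen)
print(is_fallen((100, 100, 160, 300)))  # tall box      -> False (standing)
```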
Illegal Defender
The tracking results are used to check that no more than three players from the same team are in the same penalty area. Specifically, the illegal defender module works as follows: the data of interest are taken from the csv file - in this case, the positions of the robots on the field; the robots of each team are identified and their positions taken into account. If more than three robots from the same team are within the range of coordinates representing a penalty area, then an illegal defender foul is counted. At the end, the total number of illegal defender fouls committed by the two teams is returned (a sketch of this check is given below). Figure 12 shows two images that visually explain when an illegal defender foul occurs and when it does not.
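A simplified sketch of this check on the tracking csv is shown below, assuming columns named frame, team, x, and y and a penalty area expressed as an axis-aligned coordinate range; the column names and bounds are illustrative, and in practice consecutive frames of the same violation should be merged into a single foul.

```python
# Illegal-defender check sketch on the tracking CSV; column names, file name,
# and penalty-area bounds are illustrative assumptions.
import pandas as pd

PENALTY_X = (-4.5, -2.85)   # example bounds of one penalty area (meters)
PENALTY_Y = (-2.0, 2.0)

def count_illegal_defender_frames(csv_path):
    data = pd.read_csv(csv_path)
    fouls = 0
    for (frame, team), group in data.groupby(["frame", "team"]):
        in_area = group[group.x.between(*PENALTY_X) &
                        group.y.between(*PENALTY_Y)]
        if len(in_area) > 3:        # more than three robots of the same team
            fouls += 1              # counted per frame in this simplified sketch
    return fouls

print(count_illegal_defender_frames("game_data.csv"))
```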
Data Association and Statistics
Data association is the alignment of the player tracking data with the data contained in the GameController. This operation makes it possible to improve the accuracy of the available data and to obtain additional information about the players, such as the team, the jersey number, whether the robot is leaving the field, whether the robot is dropped or inactive, whether it committed a foul, etc.
In MARIO, the data association process occurs in the first few frames of the game.
Positions obtained through the use of tracking and positions extracted from the GameController are associated through the use of K-Means. In some cases, this association process turns out to be inaccurate. This is due to the inherent inaccuracy of the position data contained in the GameController. In fact, the positions contained within the GameController are positions calculated by the robots themselves during the game and thus calculated from a limited perspective.
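As a simplified illustration of the association step (using a one-to-one Hungarian assignment on Euclidean distances instead of the K-Means scheme described above), consider the following sketch; the coordinates are invented.

```python
# Simplified one-to-one association between tracked positions and
# GameController-reported positions; Hungarian assignment on Euclidean
# distances is used here for illustration, not the K-Means based scheme.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

tracked = np.array([[1.2, 0.4], [-2.8, 1.1], [0.3, -3.0]])   # from tracking
reported = np.array([[1.0, 0.5], [0.2, -2.7], [-3.1, 1.3]])  # from GameController

cost = cdist(tracked, reported)                 # pairwise Euclidean distances
rows, cols = linear_sum_assignment(cost)        # minimal-cost assignment
for i, j in zip(rows, cols):
    print(f"track {i} -> GameController robot {j} (distance {cost[i, j]:.2f} m)")
```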
The computation of match statistics and analysis is one of the long-term goals of the Open Research Challenge of RoboCup SPL 2022. The statistics module focuses on estimating heatmaps and trackmaps of robots and ball, pass and shot maps, and ball possession. All the statistics are computed using the game_data.csv file, turned into a pandas DataFrame on which the data analysis operations are carried out. The heatmap, trackmap, and pass-shot map graphs are created with the matplotlib library, and OpenCV is used for their visualization in the app.
The heatmap shows the locations most occupied by each robot and by the ball on the 2D plan: all the data about a chosen robot are graphed on a plot whose density depends on the areas it occupied most.
The trackmap shows all the points touched by each robot and by the ball on the 2D plan. It is estimated in a similar way to the heatmap but, instead of a density graph, all the (x, y) positions of the chosen robot or ball are drawn on the field model.
The pass-shot map shows all passes, shots, shots on target, and goals by each team in different colors.
For ball possession, numerical values are calculated for each team.
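As an illustration of the heatmap computation, the following sketch builds an occupancy density plot for a single robot from the tracking data with pandas and matplotlib; the column names, robot id, and bin counts are illustrative assumptions.

```python
# Heatmap sketch from the tracking CSV with pandas and matplotlib;
# column names ("id", "x", "y"), robot id, and bin counts are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("game_data.csv")
robot = data[data["id"] == 3]                    # positions of a chosen robot

plt.figure(figsize=(9, 6))
plt.hist2d(robot["x"], robot["y"], bins=(45, 30), cmap="hot")  # occupancy density
plt.title("Heatmap of robot 3")
plt.xlabel("x (field model)")
plt.ylabel("y (field model)")
plt.colorbar(label="visits")
plt.savefig("heatmap_robot3.png")
```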
MARIO GUI
A graphic interface has been implemented in Python, using the tkinter graphics module and a custom theme called azure. The app exploits the main feature of MARIO, its modularity: thanks to it, the whole MARIO flow of execution can be run, from calibration to the final stats. The interface consists of 3 windows (Configuration, Tracking, and Analysis), navigable through buttons.
The main window, called MARIO, is the configuration window. Here you can choose the video of the match to analyze, the extrinsic parameters (optional), the game controller data (optional), and the calibration file (if your video is not already calibrated); then you can run calibration and background subtraction, or go to the tracking phase directly by toggling the calibrated switch button.
In addition, there are 3 icon buttons which link to the UNIBAS WOLVES, SPQR team, and UNIBAS web pages.
After the calibration phase, which outputs the calibrated video, a second window opens, called MARIO-Tracking.
This is the section in which robot tracking takes place. The homography is evaluated before tracking starts. From the MARIO-Tracking window, two videos are shown: the first one is the tracking video, with bounding boxes and pose estimation (only for robots) drawn on each robot and ball detected and tracked; the second one is the tracking on the 2D field model.
At the end of this phase, the game_data.csv file is created with all the data about position, team, jersey number, id, and frame of every tracked robot.
You can subsequently open the analysis window by pressing the GO TO ANALYSIS button in the MARIO-Tracking window. In this third window, called MARIO-Analysis, you can estimate robot heatmaps and trackmaps, pass and shot maps, and calculate the numerical stats of the match such as goals, ball possession, total attempts, attempts on target, and total passes, which are shown in a window with a soccer match scoreboard design.
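As an illustration of the three-window navigation, a minimal tkinter sketch (not the actual MARIO GUI code, which additionally uses the azure theme) could look as follows; widget names and callbacks are illustrative.

```python
# Minimal tkinter sketch of the Configuration -> Tracking -> Analysis
# navigation; names and callbacks are illustrative, not the MARIO code.
import tkinter as tk

def open_tracking():
    win = tk.Toplevel(root)
    win.title("MARIO-Tracking")
    tk.Button(win, text="GO TO ANALYSIS", command=open_analysis).pack(padx=20, pady=20)

def open_analysis():
    win = tk.Toplevel(root)
    win.title("MARIO-Analysis")
    tk.Label(win, text="Heatmaps, trackmaps, pass-shot maps, numerical stats").pack(padx=20, pady=20)

root = tk.Tk()
root.title("MARIO")  # configuration window
tk.Button(root, text="Run calibration", command=lambda: print("calibrating...")).pack(padx=20, pady=5)
tk.Button(root, text="GO TO TRACKING", command=open_tracking).pack(padx=20, pady=5)
root.mainloop()
```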
MARIO in Action
Environment Setup and Codebase
You can download the MARIO project from the following GitHub repository.
Poster Open Research Challenge 2022
Team