Training Set
The Training Set describes basic soccer match data from soccer leagues worldwide (see Table 2). Currently, the Training Set has about 300,000 entries. Future updates are expected to add more matches. We expect to have three versions of the Training Set:
(1) The current, initial Training Set (released in December 2022).
(2) An updated Training Set will be released at the end of January 2023. This will include additional match data from additional leagues.
(3) The final Training Set covers matches played up to 04.04.2023. This version narrows the chronological gap between the matches in the Training Set and the matches in the Prediction Set (the latter covers matches played after 13.04.2023).
Table 1 illustrates the data covered by the Training Set.
Table 1: Illustration of entries in the Training Set and key to the field names
The data in the Lge column provides a code for the soccer league. For example, ENG1 refers to the first division in England (Premier League), ENG2 to the 2nd division (Championship), ENG3 to the 3rd division (League One), and so on.
The Sea column records the soccer season in which the match was played. For example, 21-22 in Table 1 refers to the 2021/2022 season of the Premier League in England. In some leagues, a soccer season covers a single calendar year. For example, the top division in Norway, NOR1, typically starts in March or April and finishes in November or December. In this case, 21-22 would refer to the season played in 2021.
The remaining fields in the Training Set are described under Key in Table 1.
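The Lge and Sea codes described above can be unpacked with a few lines of Python. The following is a minimal sketch, not part of the challenge materials. The function names are hypothetical, and it assumes that every league code consists of letters followed by a division number and that two-digit years of 90 or above belong to the 1900s.

```python
import re


def parse_league_code(lge: str) -> tuple[str, int]:
    """Split a league code such as 'ENG1' into ('ENG', 1).

    Assumption: codes are upper-case letters followed by a division number.
    """
    match = re.fullmatch(r"([A-Z]+)(\d+)", lge)
    if match is None:
        raise ValueError(f"unexpected league code: {lge!r}")
    return match.group(1), int(match.group(2))


def parse_season(sea: str, pivot: int = 90) -> tuple[int, int]:
    """Expand a season code such as '21-22' into (2021, 2022).

    Assumption: two-digit years >= pivot are in the 1900s, all others
    in the 2000s. The pivot value is a guess, not taken from the data.
    """
    first, second = (int(part) for part in sea.split("-"))

    def expand(two_digit_year: int) -> int:
        century = 1900 if two_digit_year >= pivot else 2000
        return century + two_digit_year

    return expand(first), expand(second)


country, division = parse_league_code("ENG1")
start_year, end_year = parse_season("21-22")
```

For calendar-year leagues such as NOR1, only the first year of the pair (here `start_year`) is the year in which the season was played.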
Notice that the Training Set has been put together over a period of more than a decade. Therefore, when you use the data in the Training Set, you should keep a few things in mind.
Although we have tried to make the data set as accurate as possible, the data may still contain errors. If you spot such errors, please notify us so that we can fix them.
One of the challenges in putting together such a data set is that club names vary and change. First, the data is based on sources that may spell the name of one and the same club differently, for example, “Manchester City” and “Manchester City FC”. We have therefore tried to use a single canonical name for each club consistently. Second, clubs change their names for various reasons, for example, when a new sponsor strikes a deal with the club. In such cases, we have tried to keep the club name that was first recorded in the data set as the canonical name. This may lead to the odd situation that more recent matches use a club name that is no longer the club’s preferred name. However, this should not affect the modeling and prediction challenge.
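If you combine the Training Set with data from other sources, you may need a similar name mapping on your side. The following is a minimal sketch of the "first recorded name is canonical" idea. It is not the procedure used to build the Training Set. The alias group and all function names are hypothetical and only reuse the example from the text.

```python
def build_canonical_map(alias_groups):
    """Map every spelling in a group to the first name listed in that group."""
    canonical = {}
    for group in alias_groups:
        first_recorded = group[0]
        for name in group:
            canonical[name] = first_recorded
    return canonical


# Hypothetical alias list: the first entry of each group is treated as canonical.
ALIAS_GROUPS = [
    ["Manchester City", "Manchester City FC"],
]
CANONICAL = build_canonical_map(ALIAS_GROUPS)


def canonical_name(name):
    """Return the canonical spelling, or the trimmed input if it is unknown."""
    cleaned = name.strip()
    return CANONICAL.get(cleaned, cleaned)
```

Names that are not in the mapping pass through unchanged, so unknown clubs are never silently merged.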
There are cases where league matches are decided at the “green table” (by the governing body of the league). Generally, we have removed these entries from the database, as they may not reflect the actual performance of the teams. As a result, the final league tables one could derive from a season’s worth of matches may not be fully consistent with the official tables.
Because league formats differ, some leagues play playoff matches towards the end of the season. These may be knock-out matches that involve extra time and penalty shoot-outs. Generally, the results in the data do not indicate whether the final result was reached with these additional elements. So, a final score of 4‑3 may be the result of regular time, extra time, or a penalty shoot-out. However, matches that extend beyond regular time are very rare in the data set.
The Covid-19 pandemic disrupted the match schedule of the 2019/2020 season in some leagues. Typically, the running season was terminated prematurely. Keep this in mind when you work with the data.
To access the Training Set, please register for the 2023 Soccer Prediction Challenge.
Table 2: Countries and leagues covered by the Training Set
Prediction Set
The task of the Soccer Prediction Challenge 2023 is to predict the outcome of soccer matches from various soccer leagues worldwide in terms of the precise score and the probabilities of the three outcomes win, draw, and lose. The matches to be predicted will be provided in the Prediction Set (Table 3), to be released on 31.01.2023. The Prediction Set is first released on 31.01.2023 because the precise schedule of 2023 matches for some leagues is not known until late 2022 / early 2023. Since schedule changes and changes due to other reasons are always possible, we may release updates of the Prediction Set if such changes affect it after 31.01.2023.
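The two prediction targets can be sanity-checked before submission. The following is a minimal sketch, assuming that win, draw, and lose are read from the perspective of the first-named (home) team and that the three probabilities must sum to 1. Neither assumption is stated in this description, and the function names are hypothetical.

```python
import math


def outcome_from_score(home_goals, away_goals):
    """Derive the outcome label from a predicted score.

    Assumption: the label is given from the home team's perspective.
    """
    if home_goals > away_goals:
        return "win"
    if home_goals < away_goals:
        return "lose"
    return "draw"


def probabilities_are_valid(p_win, p_draw, p_lose, tol=1e-6):
    """Check that each value lies in [0, 1] and that the three sum to 1."""
    probs = (p_win, p_draw, p_lose)
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    return in_range and math.isclose(sum(probs), 1.0, abs_tol=tol)
```

Checking both targets separately makes it easier to spot a predicted score that contradicts the most probable outcome.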
The Prediction Set covers matches played after the submission deadline on 13.04.2023. This means that, at the time when the predictions are made, the actual results of the predicted matches are unknown. Table 3 illustrates the content of the Prediction Set.
Notice that the Prediction Set may include one or two matches played by the same team after 13.04.2023. Also, we do not rule out that teams featuring in the Prediction Set will be involved in matches played between the release of the final Training Set (04.04.2023) and the submission deadline (13.04.2023).
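The three time windows implied by these dates can be made explicit. The following is a minimal sketch that parses dates in the DD.MM.YYYY form used in this description. It assumes that "up to 04.04.2023" includes that day and that matches played on 13.04.2023 itself fall into the gap. The function names are hypothetical, and the date format of the data files may differ.

```python
from datetime import date, datetime

FINAL_TRAINING_CUTOFF = date(2023, 4, 4)   # final Training Set covers matches up to here
SUBMISSION_DEADLINE = date(2023, 4, 13)    # Prediction Set matches are played after this


def parse_date(text):
    """Parse a DD.MM.YYYY string such as '13.04.2023'."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def classify_match_date(text):
    """Return 'training', 'gap', or 'prediction' for a match date."""
    played = parse_date(text)
    if played <= FINAL_TRAINING_CUTOFF:
        return "training"
    if played <= SUBMISSION_DEADLINE:
        return "gap"
    return "prediction"


gap_days = (SUBMISSION_DEADLINE - FINAL_TRAINING_CUTOFF).days
```

Results of matches in the gap window are not part of the final Training Set, so models that rely on recent form will not see them unless you collect them yourself.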
To access the Prediction Set, please register for the 2023 Soccer Prediction Challenge.
Table 3: Illustration of Prediction Set.