Predicting the outcomes of sporting events has been the subject of intensive research for many years (Büchner et al. 1997). One obvious motivation for this is betting. Sports betting has become a global multi-billion-dollar industry (Forrest et al. 2005). At least since the late 1960s, statistical forecasting models have been developed for association football (Hill 1974; Maher 1982; Dixon and Coles 1997; Goddard 2005; Angelini and Angelis 2017), also known as soccer. One of the earliest studies on soccer analysis concluded that chance dominates the game (Reep and Benjamin 1968), which makes outcome prediction very difficult.

To date, relatively few studies have investigated machine learning methods for soccer outcome prediction. We speculate that one reason is the lack of readily available open soccer data. Here, we present the Open International Soccer Database to bridge this gap.


Keeping its future usability in mind, we designed the database to provide a very large set of precisely the data that are widely and publicly available on a regular basis for practically all soccer leagues around the world. Specifically, for users who are interested in making predictions for their favorite team(s), it is important to have long historical records, as this is a key factor for reliable predictions. Simply put, such users need to add the latest results of their target teams to the Database, train predictive models, and make the desired predictions. This is of course only feasible if the historical data have the same format as the new data that the users can easily access.

Here, we describe the Open International Soccer Database, as well as the 2017 Soccer Prediction Challenge and its results. All materials related to the Database and Challenge are publicly available under the CC0 1.0 Universal license through the Open Science Framework project sites. An updated version of the Database with more entries and some corrections has already been made available at the project website. Future updates will also provide references and links to machine learning research that uses it.

This article is organized as follows. First, we review related work on soccer outcome prediction and available soccer databases. We also briefly discuss the need for reproducible research in machine learning, which motivated us to choose the Open Science Framework (OSF) (Foster and Deardorff 2017) as the accompanying repository for the Open International Soccer Database and the 2017 Soccer Prediction Challenge. Then, we describe the Database and the Challenge, and finally conclude the paper with a discussion and an outlook on future work.

These challenges differ from the Open International Soccer Database and associated 2017 Soccer Prediction Challenge described in this paper in several ways. First, both our Database and Challenge are based on regular league soccer only. Other soccer games and competitions, such as tournaments of national teams and clubs, friendly games, etc., are not covered. The Challenge task was to predict the outcome of the next match of the teams for leagues that met certain conditions at the Challenge deadline by the end of March 2017. Predicting the outcome of multiple matches of the same team was not part of the Challenge. The task of the Challenge was to construct models based on the Challenge learning set only, which is identical to v1.0 of the Database. Previous challenges in this area left it open to the participants what data to use.

One reason for the relatively low number of data-driven studies in soccer might be the lack of publicly available databases. Data on soccer results are of course available from various online sources. To our knowledge, one of the most comprehensive open databases for soccer analytics is the European Soccer Database (Mathien 2017), an SQL database of about 25,000 soccer matches from the top leagues of 11 countries, covering the seasons from 2000 to 2016. In addition to match statistics (e.g., goals, ball possession, corners, cards, etc.), this database also includes data about team formations and statistics for over 10,000 players. This database is hosted at Kaggle and specifically designed for machine learning analyses. Kaggle is an interesting open data platform whose mission is to bring together data, people, discussion, and code.

Although the importance of replicability and reproducibility has been pointed out for many years (Hirsh 2008; Drummond 2009; Manolescu et al. 2008; Vanschoren et al. 2012; Berrar 2017; Berrar et al. 2017b), we believe that these issues have not yet received due attention in the machine learning community. For example, the UCI Machine Learning Repository (Lichman 2013) hosts numerous benchmark data sets, but no analytical results, experiments, reproducible code, or any other materials that establish a context, so that pertinent questions remain open, including: How have the data been pre-processed, analyzed, and perhaps enriched so far? What is the state-of-the-art performance on these data sets? Which analytical approaches did not work (i.e., negative results that are usually not published)? Vanschoren et al. (2012) deplored the immense effort that is required to replicate earlier studies on benchmark data sets, simply because in practice it is not feasible to publish all details about the experiments.

There are open source software repositories for machine learning, such as Machine Learning Open Source Software (MLOSS). Another repository is OpenML, an open platform for hosting data sets, code, and analytical workflows, with the aim to facilitate reproducible research in machine learning (Vanschoren et al. 2013). For each project, OpenML also provides visualization tools (e.g., boxplots), a wiki, user discussions, and tasks, which are machine-readable containers for data subsamples (training and test sets). Furthermore, OpenML is integrated with machine learning environments such as Weka (Hall et al. 2009). OpenML is a very interesting platform; however, only a limited number of data formats are currently supported (e.g., ARFF for tabular data).

In contrast to Kaggle, the Open Science Framework (OSF) is maintained by a non-profit organization, the Center for Open Science (Foster and Deardorff 2017). OSF supports reproducible research by providing a user-friendly, free archive for data, experimental protocols, supplementary materials, code, etc. and allows the generation of persistent identifiers, i.e., digital object identifiers (DOI) and archival resource keys (ARK), which make projects citable resources. Like Kaggle, OSF has a discussion board. Last but not least, OSF provides a very user-friendly frontend. In our view, OSF currently offers the simplest solution to scientists who wish to host all relevant materials in one public, citable repository. We therefore decided to make the Open International Soccer Database and all materials related to the 2017 Soccer Prediction Challenge available at OSF.

One aspect that makes soccer so popular (and prediction based on goals alone so difficult) is that the final outcome of the majority of soccer matches is uncertain until the end. This is because goals are relatively rare, and the margin of victory for the winning team is relatively low for most matches (Table 5). From the Database, we estimate the average number of home and away goals as 1.48 and 1.11, respectively (see Table 4). This means that, on average, the home team prevails over its opponent by a margin of 0.372 goals, reflecting the home advantage in league soccer. Moreover, when we look at the distribution of the margin of victory in the Database, we find that \(86.71\%\) of all matches end either in a draw or in a victory for either team with a goal difference of \(\le 2\), and \(95.47\%\) are either draws or wins for either team by 3 or fewer goals (Table 5a).
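The summary statistics quoted above can be computed from raw scores along the following lines. This is only a sketch: the `matches` list is a hypothetical toy sample rather than data from the Database, and the function names are ours.

```python
# Hypothetical toy sample of (home_goals, away_goals); NOT taken from the Database.
matches = [(2, 1), (0, 0), (1, 3), (4, 0), (1, 1), (2, 0), (0, 1), (3, 2)]


def goal_summary(results):
    """Return (mean home goals, mean away goals, mean home margin)."""
    n = len(results)
    mean_home = sum(h for h, _ in results) / n
    mean_away = sum(a for _, a in results) / n
    return mean_home, mean_away, mean_home - mean_away


def share_within_margin(results, max_margin):
    """Fraction of matches whose absolute goal difference is <= max_margin.

    Draws have a margin of 0 and are therefore always included.
    """
    close = sum(1 for h, a in results if abs(h - a) <= max_margin)
    return close / len(results)


mean_home, mean_away, home_margin = goal_summary(matches)
print(f"home {mean_home:.2f}, away {mean_away:.2f}, margin {home_margin:.3f}")
print(f"share with margin <= 2: {share_within_margin(matches, 2):.2%}")
```

Applied to the full Database instead of the toy sample, the same two functions should yield figures comparable to those reported in Tables 4 and 5.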

The Open International Soccer Database comprises 216,743 entries, each describing the most commonly available and consistently reported data about the outcome of a league soccer match in terms of the goals scored by each team, teams involved, and league, season and date on which the match was played. The beauty of this type of soccer data is that it is readily available for most soccer leagues worldwide, including lower leagues. Thus, an important research question is to determine the limits of predictability for this type of data. In order to find this out, we invited the machine learning community to develop predictive models based on the version v1.0 of the Database.
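To make the structure of an entry concrete, one possible in-memory representation is sketched below. The field names, the season string format, and the team names are illustrative assumptions and need not match the column names used in the Database files.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MatchRecord:
    """One league match; field names are hypothetical, not the Database's own."""
    league: str
    season: str
    played_on: date
    home_team: str
    away_team: str
    home_goals: int
    away_goals: int

    @property
    def goal_difference(self) -> int:
        """Home goals minus away goals."""
        return self.home_goals - self.away_goals

    @property
    def outcome(self) -> str:
        """Match result from the home team's perspective."""
        if self.goal_difference > 0:
            return "home win"
        if self.goal_difference < 0:
            return "away win"
        return "draw"


# Illustrative entry with made-up teams and score.
example = MatchRecord("ENG3", "16-17", date(2017, 3, 25), "Team A", "Team B", 2, 1)
print(example.outcome, example.goal_difference)
```

Everything a model can learn from this type of data has to be derived from these seven fields, which is what makes the question of predictability limits interesting.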

The 2017 Soccer Prediction Challenge was part of the special issue on Machine Learning for Soccer in the Machine Learning journal. The Challenge description was published together with the call for papers for this special issue on 17 January 2017 (see supplementary material on the Challenge website). Figure 2 shows the overall time frame of the challenge. The participants contacted us by email to express their interest in the Challenge and then received a web link to download the data.

The final Challenge learning set is identical to v1.0 of the Open International Soccer Database presented in this article. In the remainder of this text we will use the term (final) Challenge learning set instead of Database.

xHS, xAS: Predicted goals scored by the home and away team, respectively, expressed as a non-negative real number. This was an optional task of the Challenge, which did not count towards the ranking of the submitted predictions.

xGD: Predicted goal difference expressed as a real number. This was another optional task of the Challenge. Note that the goal difference may not necessarily be the difference between xHS and xAS because a model might compute the goal difference without explicitly calculating the actual number of goals scored by each team.
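As an illustration of the constraints on these optional fields, the following sketch checks a single prediction. The function `check_prediction` is hypothetical and was not part of the Challenge materials; it deliberately does not require xGD to equal xHS minus xAS, in line with the note above.

```python
import math


def check_prediction(xHS=None, xAS=None, xGD=None):
    """Return a list of problems found in the optional goal predictions.

    Hypothetical helper: xHS and xAS must be non-negative reals if given;
    xGD may be any finite real and is NOT compared with xHS - xAS.
    """
    problems = []
    for name, value in (("xHS", xHS), ("xAS", xAS)):
        if value is not None and (not math.isfinite(value) or value < 0):
            problems.append(f"{name} must be a non-negative real number")
    if xGD is not None and not math.isfinite(xGD):
        problems.append("xGD must be a real number")
    return problems


# xGD = 0.5 differs from xHS - xAS (about 0.3) and is still accepted.
print(check_prediction(xHS=1.4, xAS=1.1, xGD=0.5))
print(check_prediction(xHS=-0.2, xAS=1.0))
```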

In order to facilitate a realistic and hard prediction challenge, the matches in the prediction set had to be carefully selected. First and foremost, we required that at the time the submissions were due (midnight, 30/03/2017 CET), the actual outcomes could not be known to anyone (including us, the organizers). Thus, only matches played after the submission deadline could be used for the prediction set. Second, since several of the leagues appearing in the learning set were not in progress at the submission deadline, we could not include games from these leagues. Third, as explained in Sect. 1, matches from leagues that did not suspend regular league play in the period from 22/03/2017 to 30/03/2017 could not be included. For example, a full match day was played in the ENG3 league on 25/03/2017 and 26/03/2017. Fourth, each team in the prediction set had to appear only once; otherwise, the participants would have to predict the outcomes of two or more matches involving the same team. Thus, only 28 of the 52 leagues from the learning set could be used to select a total of 206 matches for the prediction set.
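The selection criteria can be sketched as a simple filter. This is a simplified illustration under stated assumptions, not the procedure actually used by the organizers: the tuple layouts, the function name, the league code GER1, and the team names are hypothetical, and the "league in progress" criterion is assumed to be satisfied by whatever candidate fixtures are passed in.

```python
from datetime import date

DEADLINE = date(2017, 3, 30)      # submission deadline
WINDOW_START = date(2017, 3, 22)  # start of the required break in league play


def select_prediction_set(fixtures, recent_matches):
    """Pick candidate fixtures that satisfy the selection criteria.

    fixtures: list of (league, played_on, home_team, away_team) future matches.
    recent_matches: list of (league, played_on) for matches already played.
    """
    # Leagues that played between WINDOW_START and DEADLINE are excluded.
    busy = {lg for lg, d in recent_matches if WINDOW_START <= d <= DEADLINE}
    seen = set()  # (league, team) pairs already used
    selected = []
    for league, played_on, home, away in sorted(fixtures, key=lambda f: f[1]):
        if played_on <= DEADLINE or league in busy:
            continue  # outcome could be known, or league did not pause
        if (league, home) in seen or (league, away) in seen:
            continue  # each team may appear only once
        seen.update({(league, home), (league, away)})
        selected.append((league, played_on, home, away))
    return selected


# Hypothetical fixtures; only the first one passes every criterion.
fixtures = [
    ("GER1", date(2017, 3, 31), "A", "B"),
    ("GER1", date(2017, 4, 4), "A", "C"),   # team A already selected
    ("ENG3", date(2017, 4, 1), "D", "E"),   # ENG3 played on 25/03/2017
    ("GER1", date(2017, 3, 29), "F", "G"),  # before the deadline
]
recent = [("ENG3", date(2017, 3, 25)), ("GER1", date(2017, 3, 19))]
print(select_prediction_set(fixtures, recent))
```

Sorting by date means that, when a team has several upcoming fixtures, its next match is the one retained, which mirrors the Challenge task of predicting each team's next match.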
