March Madness Model


Inspired by non-transitive probability distributions, I tried to think of a scheme that could (with enough data) model team X being better than team Y, Y being better than Z, and Z being better than X. I built this model for march madness, where the large number of teams provides an ideal scenario for modeling non-transitive team performances. This did well; predicted winners correlated strongly with actual winners. However I bought some historical lines and simulated betting, and I would not have beaten the market wisdom / house edge.

Some code snippets are included below. For additional code, contact me.

First Pass - Quadratic Curve

Idea / Math

My first thought was to model each team as a die with unknown side values and games could be simulated with each team rolling their respective die; the probability of team X beating team Y should be the same as the probability that X's die would be larger than Y's die. I thought I would find the dice sides of best fit, but this is a complex problem. I came up with a general model where I would build an n-dimensional cube, [0, 1]^n, where n is the number of teams. I would create a curve from the origin, (0, 0, ..., 0), to (1, 1, ..., 1). To model the probability that team i would beat team j, I looked at the curve projected onto those two variables:

The area of this face gets split into two, with the two pieces adding to an area of 100%. I said that the piece closer to the i-th axis should be the probability that the i-th team would win a match-up. My thinking with this model was that a curve projecting onto different faces would have enough degrees of freedom to capture non-transitive, but that we could also require enough smoothness on the curve so that being close to an axis when projected onto some dimensions would make the curve generally close to that axis to influence other match-ups. Today I don't know if that's true.

To implement this, I parameterized the curve in each dimension with a function on t, with t ranging from 0 to 1.

The projection to a face i-j would simply be (f_i(t), f_j(t)). For simplicity, I assumed that each of these functions are quadratic. That is,

Where the second equality follows because f_i(0)=0 and f_i(1)=1. We can then easily calculate the portion of the projected curve that is closer to the i-th axis, using the well-known Formula I from this page:

This form is simple, and it makes sense that a win would be related to the difference of the teams parameters (parameter being a proxy for strength). I found the parameters a_i of best fit by minimizing the loss function that summed the square difference between predicted portion of the time that team i beat team j and the actual portion of the time that i beat j, where the sum is taken over all i and j that have played each other at least once in the season.


I ran this algorithm, and got variables a that made sense; that is, better teams generally had higher parameters, as expected. I trusted this to rank teams, but doubted the precision of the probabilities predicted. So I simulated betting the money line when the team I favored was an underdog (but less than a 10:1 underdog), but only in the first two rounds.

At first it looked promising:

MM Table.png

I simulated picking 46 underdogs at random, and picking at random only did as well 1% of the time. However I knew that my profitability was sensitive to a few, large wins. [And probably, I had cherry-picked the exact criteria I would bet on.] So I bought 30 years of casino lines data, and simulated. On this larger set, I saw no profit. [I have lost the results of this larger experiment.]


What this model did well was to relate the probability of winning to difference of parameters. [This idea is revisited in the Final Thoughts section below.] It's not clear that this specific relation is a good one, and my loss function is clearly unusual. If I was to redo this, I would fit a logistic regression where the target is and indicator if team i beats team j, and use predictors of +1 for x_i and -1 for x_j; then fit the coefficients of these, using all data.

This approach ended up not modeling non-transitivity, since each team had only one parameter! If we used cubics instead of quadratics, we could get non-transitivity. The implementation would be much harder as the formula doesn't simplify as nicely, but it could be interesting to see.

I learned the value of testing against a lot of data. If I was doing this today, I would bootstrap to estimate variance of profitability, and take measures to avoid cherry-picking.

Second Pass - Genetic Algorithm


After a few years I had learned about genetic algorithms, and I had an idea to revisit my original dice idea. My thought was that the specimens in a population could be a candidate set of dice, one for each team; the fitness could be the likelihood of the observed season with the candidate dice; and breeding would be to loop through teams and build each team a die selecting sides randomly from the two parents.


Here is select code. Full code is available upon request.

First, here is the code to breed all the specimens.

Which makes calls to the breed function, which then loops through all the dice.

Here's the function that breeds two dice.

Next we mutate and sort the specimens by fitness. [We elsewhere kill all but the most fit specimens.]

Mutate and random_fitness are functions on the Specimen class. Mutate is described here:

Random_fitness loops through all games, many times if specified. It rolls this class' dice for the two teams and adds 1 each time the real winner wins a dice roll. This is a random sampling of the real likelihood. It is more efficient and (I think) a good enough approximation.


I ran this on a few years of data. I simulated placing line bets everywhere that I thought the line was off by more than a few points. [There's a conversion between probability (from money line bets) to point spread.] At the end I had lost about 11% of money I would have bet.

Final Thoughts - MCMC

Recently I did some modeling for a different sport using a logistic regression as described above, but I allowed the parameters to have general distributions. Since dice rolls can be made (approximately) into distributions by putting spikes on each side value, it follows dice rolls are specific distributions belonging to this more general model. [This was some informal work to experiment with pymc3 and to see if this approach would work.] I found that this added complexity improved little over a simple model (like take the team with the better record), and still fell short of beating the wisdom of crowds + casino house advantage. This MCMC model has a difficult time fitting the distributions with the small amount of data available (number of games), but was returning meaningful results.

I think it's probably the case that a model based on records alone cannot be profitable. I think it's likely that many of the approaches we've tried here come close to capturing all the possible predictive value from records alone. But I also think it's likely that a much simpler model would do well.