
Is the scoring fair?

There's lots of discussion (to put it nicely) in the Leader Board's comment section about the fairness of the scoring of the Open.  The Open ranks each athlete in each WOD, with the score being the number of people above you (plus 1).  The sum of your ranks across the WODs is your total score, and low score wins.  This scoring system works well, except in a few cases: where there's a lot of bunching (see WOD 12.2), or where someone really runs away with the win and second place, though far behind, gets the same score as if they were just barely behind the winner (or conversely, where 2nd place is just barely behind 1st and loses a whole place even though they did almost the same work).  Though, I posit that if the goal is to find the top athletes, they won't be bunched, and at the top there will be enough discrimination to tell who is better.

There are a bunch of different scoring systems we could use, and they all have their pluses and minuses; I don't want to debate that.  I want to see whether the same people would be on top across a bunch of scoring systems.  If so, then the scoring is probably pretty fair (for its purpose of picking the fittest athletes).

Scoring Methods

Here's a quick summary of the scoring systems I'll test.

Open: what I just discussed, the Open's current rank-based scoring system
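To make the rank-based rule concrete, here is a minimal Python sketch of it as I described it above. The function names and the rep counts are made up for illustration, and I'm assuming tied athletes share a rank, which is what "number of people above you, plus 1" implies.

```python
def open_ranks(reps):
    """Rank for one WOD: number of athletes with strictly more reps, plus 1."""
    return [1 + sum(other > r for other in reps) for r in reps]


def open_totals(wods):
    """Sum each athlete's per-WOD ranks; the lowest total wins."""
    per_wod = [open_ranks(reps) for reps in wods]
    return [sum(ranks) for ranks in zip(*per_wod)]


# Hypothetical reps for four athletes across two WODs (made-up numbers).
wods = [
    [100, 90, 90, 60],
    [300, 310, 250, 200],
]
totals = open_totals(wods)
```

Note that the two athletes tied at 90 reps both get rank 2, and the next athlete gets rank 4, so a big bunch of ties pushes everyone just below it a long way down.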

Z-score: Skip at Front Range CrossFit holds the Colorado Open every year, and he uses this scoring system.  Basically, for every WOD the mean score and standard deviation are calculated, and your score for that WOD is the number of standard deviations you were from the mean (i.e., your Z-score).  Over the course of the weekend, your total score is the sum of your Z-scores for each WOD, and high score wins.  This allows someone who finishes really far ahead (or behind) to be rewarded (or punished) for their finish, and it doesn't penalize you for falling in the bunched middle: you'll all just have scores around 0, since you did about average.
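As a rough sketch of the Z-score idea (not necessarily how Skip computes it), the Python below scores each WOD as standard deviations from the mean and sums across WODs. The function names are mine, and using the population standard deviation rather than the sample one is an assumption.

```python
from statistics import mean, pstdev


def z_scores(reps):
    """Standard deviations above (+) or below (-) the WOD mean.

    Assumes the population standard deviation and that not everyone
    got the same score (otherwise the deviation is zero).
    """
    m = mean(reps)
    s = pstdev(reps)
    return [(r - m) / s for r in reps]


def z_totals(wods):
    """Sum each athlete's Z-scores across WODs; the highest total wins."""
    per_wod = [z_scores(reps) for reps in wods]
    return [sum(zs) for zs in zip(*per_wod)]
```

Within a single WOD the Z-scores always sum to zero, so an athlete sitting right at the mean neither gains nor loses anything from that WOD.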

Quartile: This is very similar to Z-score.  BJ Tyler posted it on the Leader Board (I can't link to a specific comment there).  Your score is (x-m)/R, where x is your reps, m is the mean reps, and R is the difference between the 75th and 25th percentiles.  I bet this ranks just like the Z-score method, since it does the same thing: it measures your distance from the mean, normalized by a spread intrinsic to the data.
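Here is a small Python sketch of the (x-m)/R formula. The function name is hypothetical, and the percentile convention is an assumption: I use the "inclusive" method of `statistics.quantiles`, and a different convention would give slightly different values of R.

```python
from statistics import mean, quantiles


def quartile_scores(reps):
    """(x - m) / R, with R the spread between the 75th and 25th percentiles.

    The 'inclusive' percentile convention is an assumption; the
    original proposal does not say which one to use.
    """
    m = mean(reps)
    q1, _median, q3 = quantiles(reps, n=4, method="inclusive")
    r_spread = q3 - q1
    return [(x - m) / r_spread for x in reps]


# Made-up reps for five athletes.
scores = quartile_scores([10, 20, 30, 40, 50])
```

As with the Z-score, the total would be the sum of these per-WOD scores, with high score winning.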

Decathlon:  Basically, this just normalizes the winner's score to 1,000 and proportionally doles out points for each competitor based off of that.  This also rewards those who finish far ahead and allows for disparate scores to be weighted equally.  Someone posted it on the Leader Board but I can't seem to find it now.
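A minimal Python sketch of this proportional scheme, as I understand it: the best score in a WOD is worth 1,000 points and everyone else gets the same fraction of 1,000 as their fraction of the winner's reps. The function names are made up, and I'm assuming rep-based (AMRAP) scores where higher is better.

```python
def decathlon_points(reps):
    """Scale one WOD so the winner gets 1,000 and others get a proportional share."""
    best = max(reps)
    return [1000 * r / best for r in reps]


def decathlon_totals(wods):
    """Sum each athlete's points across WODs; the highest total wins."""
    per_wod = [decathlon_points(reps) for reps in wods]
    return [sum(points) for points in zip(*per_wod)]


# Made-up reps for three athletes in one WOD.
points = decathlon_points([100, 50, 25])
```

Because every WOD tops out at 1,000, a WOD where scores run into the hundreds counts no more than one where they max out at 100.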

Every rep counts: I'm going to skip this one; it's just a sum of your scores for each WOD.  This does poorly because some WODs will have inherently higher scores than others, and we, I believe, want all WODs to be treated equally.  For example, the scores in 12.2 max out at 100, but for 12.3 they're over 500.  That means in this scoring system WOD 12.3 would count about 5x as much as WOD 12.2.  That's enough for me to throw it out without testing it.

There are also other scoring systems we could use to normalize between time-domain, and rep-domain (AMRAP), and weight-domain scoring, but since the Open seems to always be AMRAPs, I'm just going to ignore that.

Simple comparison

Now, how do we compare scoring systems?  Well, after the Open is done, I'd check whether some reasonable fraction of the people who got into regionals would still get in under the new system, and whether the top people remained unchanged.  It doesn't matter if someone scores 59th or 61st; neither of them is going to make it past regionals, so we really only care about the upper 1/2 or so of the regional qualifiers.  But since the Open is still going, what I'm going to do is look at the rankings at the end of 12.3 (which is what I have data for; I can update this as I get more data) and measure how much movement there would be in the rankings under the other systems.

First, the leader in the Open scoring system is also the leader in Z-score, Quartile, and Decathlon style scoring.  So at least, if we want to pick the most fit athlete (as of WOD 12.3), the Open method is as good as the others listed.

In the top 10, from the open method, 8 show up in Z-score, 8 in Quartile, and 7 in Decathlon.
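The overlap counts above come down to a set intersection. A sketch in Python, assuming each method's result is a list of athlete names ordered best to worst (the function name and the names in the example are hypothetical):

```python
def top_n_overlap(ranking_a, ranking_b, n):
    """Count the athletes who appear in the top n of both rankings."""
    return len(set(ranking_a[:n]) & set(ranking_b[:n]))


# Made-up orderings from two scoring methods, best athlete first.
open_order = ["A", "B", "C", "D", "E"]
z_order = ["A", "C", "B", "F", "D"]
shared_top3 = top_n_overlap(open_order, z_order, 3)
shared_top4 = top_n_overlap(open_order, z_order, 4)
```

This only asks who is in the top n, not whether they hold the same place within it.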

And for the top 100, there's a similar 1.8x spread:

In both plots, the lines are colored so that the highest rank (lowest place #) in the Open method is dark (more blue or green) and the color moves towards white as you move down in ranking. So you can tell visually in the other scoring methods some of the lighter colored lines are getting pulled down in ranking, i.e., giving someone a better score than they got in the Open method.

What I do find interesting is that the other three methods are more similar to each other than to the Open method, at least for the top 100.

Here's the same data in a slightly different format.  Each axis shows the place an athlete in the top 100 (Open method) gets under one of the scoring methods.

Here's what it looks like when I remove the Open method from the plot.  The other three look very similar.

Change in score among the methods

Here are three other interesting plots.  They show the number of places an athlete would gain (or lose) if we switched from the Open method to another method.  Note that at either end, there's not much change: either you're good and we all know it, or you suck and we all know it.  In the middle, there's lots of jumping around because of how the different methods handle ties and near ties.
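The quantity plotted here can be sketched in a few lines of Python. The function names and totals below are made up; I'm assuming place is counted the same way as in the Open (number of athletes strictly better, plus 1), with low totals winning for the Open method and high totals winning for the others.

```python
def places(totals, high_wins):
    """Place = 1 + number of athletes with a strictly better total."""
    if high_wins:
        return [1 + sum(other > t for other in totals) for t in totals]
    return [1 + sum(other < t for other in totals) for t in totals]


def place_gains(open_totals, other_totals):
    """Places gained (positive) or lost (negative) on leaving the Open method."""
    open_places = places(open_totals, high_wins=False)
    other_places = places(other_totals, high_wins=True)
    return [old - new for old, new in zip(open_places, other_places)]


# Made-up totals: Open rank sums (low wins) and Z-score sums (high wins).
gains = place_gains([3, 3, 5, 8], [1.9, 2.4, -0.5, -3.8])
```

In this toy case the two athletes tied for 1st under the Open method get separated by the other method, so one of them drops a place while nobody else moves.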

Stability of the methods

One other thing we can look at is how stable the scoring system is.  That is, we would hope that if we did the exact same Open again, the same people would be in the same places.  If we did the Open again, not everyone would get the same score; some would improve, others would do more poorly.  To get an idea of how stable the scoring system is, let's randomly jiggle everyone's score by some uniformly distributed value between +10% and -10% of their original score.  Then we recalculate the placement and compare that to how they did in the first run of the Open.  Technically, we should do this a bunch of times and average it, but since I'm averaging over 30,000 competitors, we'll call it good.
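Here is a rough Python sketch of that jiggle test, not the exact code behind my numbers. All names are hypothetical, the fixed seed is only there to make the run repeatable, and I plug in a Decathlon-style total as a stand-in; any function that turns per-WOD reps into per-athlete totals would slot in the same way.

```python
import random


def places(totals, high_wins=True):
    """Place = 1 + number of athletes with a strictly better total."""
    if high_wins:
        return [1 + sum(other > t for other in totals) for t in totals]
    return [1 + sum(other < t for other in totals) for t in totals]


def jiggle(wods, frac, rng):
    """Perturb every score by a uniform random factor in [1 - frac, 1 + frac]."""
    return [[r * (1 + rng.uniform(-frac, frac)) for r in reps] for reps in wods]


def mean_abs_place_change(wods, total_fn, high_wins, frac=0.10, seed=0):
    """Average absolute change in place between the original and jiggled runs."""
    rng = random.Random(seed)
    before = places(total_fn(wods), high_wins)
    after = places(total_fn(jiggle(wods, frac, rng)), high_wins)
    return sum(abs(a - b) for a, b in zip(after, before)) / len(before)


def decathlon_totals(wods):
    """Stand-in scoring method: winner of each WOD gets 1,000, others proportional."""
    per_wod = [[1000 * r / max(reps) for r in reps] for reps in wods]
    return [sum(points) for points in zip(*per_wod)]


# Made-up reps for five athletes across two WODs.
wods = [
    [100, 90, 90, 60, 45],
    [300, 310, 250, 200, 180],
]
change = mean_abs_place_change(wods, decathlon_totals, high_wins=True)
```

Restricting the average to the top 100 would just mean averaging over the athletes whose original place is 100 or better.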

In the table below, we have the average (absolute) change in placement among all competitors and among the top 100.  An average change of 0 would mean a perfectly stable scoring system.  The number you see is the amount on average an athlete moved (up or down) on the second running of the Open.

            All competitors   Top 100
Z-score     0.60              0.038
Quartile    0.61              0.038
Decathlon   0.57              0.035

Again, we see that the three new methods are all more similar and more stable than the Open method.  But the Open method is surprisingly robust; more so than I thought it would be.

Here's the same data in visual form.  The x-axis is the athlete's rank in the Open using each of the scoring methods; the y-axis is their change in rank in the second simulated Open.  It's easy to see that the three new methods are more stable than the current scoring method.

In the end, I think the Open is doing a good job of finding the best athletes.  There seems to be little movement at the top (and at the bottom) where it counts. 

Though I tend to just feel better about methods like the Z-Score method or even the Decathlon method.  To me they handle both the extreme scores and ties (and near-ties) better.

In the end, it's the quality of judging that I think has a much bigger effect on the scores than anything else.