Summary for Debaters
TL;DR: I crunched the data from 337 standard BP tournaments over 2020-2024. Ranking teams by total speaks is more accurate than ranking teams by team points, with respect to empirical outround results. This holds across tournaments of different lengths, sizes, years, and regions. Simulation also shows that team points are substantially biased against strong teams and towards weak teams. After adjusting for this, there is no unaccounted bias in speaker points relative to team points against ESL/EFL, non-rep, or particular teams.
A summary is below. Full details are available at https://sites.google.com/view/acdebating/.
Across the last 7 WUDCs, the bottom teams on 18 points advanced through only a single outround in total, and the one team that did so would not have been bottom of the 18-point bracket had it not iron-personned a round. In contrast, the top teams on 17 points, who were ranked lower, advanced through a total of 11 elimination rounds, reaching Octofinals (x1), Quarterfinals (x3), and the Grand Final (x1). This suggests that ranking teams by total speaks is more accurate than ranking teams by team points in at least some situations. We wonder whether this is true more broadly.
Simulations (e.g. Rao 2020) show that ranking teams by total speaks then team points (SP) is significantly more accurate than ranking teams by team points then total speaks (SQ). There are legitimate concerns about how well simulations reflect reality, so we do empirical work. We take data from the Global Debating Spreadsheet and add other tournaments we have tabs for, giving 337 standard (not tapered, not round-robin) BP tournaments over 2020-2024. We calculate various metrics for ranking teams, including team points, total speaks, numbers of 1sts/2nds/3rds/4ths, draw strength (many variations), who-beat-whom records, Elo ratings, and more.
How do we assess the accuracy of ranking metrics? We look at decided outround comparisons (those where one team advanced and the other was eliminated) and calculate the proportion that the metrics predicted correctly. We can also directly compare two (lists of) ranking metrics by looking at outround comparisons where they disagree (put different teams ahead) and counting the number of times each is correct with respect to the actual result.
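The counting procedure described above can be sketched as follows. This is a hypothetical illustration, not the authors' code; the data layout and field names are assumptions.

```python
# Sketch of the accuracy measure: for each outround room, every
# (advancing, eliminated) pair is a decided comparison, and a metric is
# scored on the proportion of those pairs it orders correctly.

def decided_comparisons(outround):
    """All (advancing, eliminated) team pairs within one outround room."""
    advanced = [t for t in outround if t["advanced"]]
    eliminated = [t for t in outround if not t["advanced"]]
    return [(a, e) for a in advanced for e in eliminated]

def metric_accuracy(outrounds, metric):
    """Proportion of decided comparisons the metric predicts correctly.
    `metric` maps a team record to its ranking value (higher = better).
    Ties are skipped: the metric makes no prediction there."""
    correct = total = 0
    for room in outrounds:
        for adv, elim in decided_comparisons(room):
            a, e = metric(adv), metric(elim)
            if a == e:
                continue
            total += 1
            correct += a > e
    return correct / total

# Toy example: one quarterfinal room in which two teams advance.
room = [
    {"team": "A", "advanced": True,  "speaks": 1520, "points": 17},
    {"team": "B", "advanced": True,  "speaks": 1498, "points": 18},
    {"team": "C", "advanced": False, "speaks": 1505, "points": 18},
    {"team": "D", "advanced": False, "speaks": 1480, "points": 17},
]
print(metric_accuracy([room], lambda t: t["speaks"]))  # 0.75 (misorders B vs C)
print(metric_accuracy([room], lambda t: t["points"]))  # 0.5 (two ties skipped)
```

The same `metric_accuracy` call with two different metrics, restricted to comparisons where they disagree, gives the head-to-head comparison of two metrics described above.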
We find that total speaks is the most accurate single metric. Comparing total speaks and team points directly, we find that total speaks is more accurate at a highly significant level (p=0.000015, binomial test). We find that [team points, total speaks] is more accurate than all alternatives of the form [team points, alt]. We find that SP = [total speaks, team points] is more accurate than SQ = [team points, total speaks] across tournaments of different lengths, sizes, years, and regions.
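A two-sided exact binomial test of this kind can be sketched with the standard library alone (scipy.stats.binomtest does the same job). The counts below are purely illustrative, not the paper's.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Two-sided exact binomial p-value: total probability of outcomes at
    least as extreme (no more likely) than k successes under Binomial(n, p)."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(x for x in pmf if x <= pmf[k] + 1e-12)

# Illustrative numbers (NOT the paper's counts): suppose that of 100
# outround comparisons where total speaks and team points disagree,
# speaks picked the actual winner 70 times. Under the null hypothesis
# that each metric is equally likely to be right, this is very unlikely.
print(binomial_two_sided_p(70, 100))
```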
There are good analytical reasons to believe that ranking teams by total speaks is more accurate than ranking teams by team points. Team points suffer from substantial randomness: calls are noisy (and a difference of just one team point has a large effect on teams' rankings), rounds have unequal influence, and intra-bracket pairings are random. Speaker points convey more information, such as margins and the absolute/cardinal performance of the room as a whole.
There are concerns that speaker points, compared to team points, are biased against various groups of teams, so we investigate bias in team points and speaker points. First, we simulate tournaments (using the Barnes, Kehle, McKenny, and Lee model with minor modifications) with no bias in judging. We find that team points show substantial bias against stronger teams and towards weaker teams. For example, at tournaments like WUDC 2024, a team that averages 79.5 skill (per speaker per round) will, in expectation, undeservingly miss the break more than 40% of the time and rank more than 9 positions below what they deserve (or more than 23 positions if taking the quadratic mean). Tapering tournaments offers a small improvement while ranking by total speaks offers a much larger one.
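A heavily simplified simulation in this spirit can illustrate the mechanism. This is not the Barnes, Kehle, McKenny, and Lee model itself: the skill spread, noise level, and power-pairing rule below are all assumptions made for the sketch.

```python
import random

def simulate(n_teams=32, n_rounds=9, sigma=4.0, seed=0):
    """One toy tournament: each round, power-pair by team points into rooms
    of 4; a team's round performance is skill + Gaussian noise; award
    3/2/1/0 team points within the room and accumulate the raw performance
    as 'speaks'. All parameters are illustrative assumptions."""
    rng = random.Random(seed)
    skill = [75 + i * 10 / n_teams for i in range(n_teams)]
    points = [0] * n_teams
    speaks = [0.0] * n_teams
    for _ in range(n_rounds):
        order = sorted(range(n_teams), key=lambda t: -points[t])
        for r in range(0, n_teams, 4):
            room = order[r:r + 4]
            perf = {t: skill[t] + rng.gauss(0, sigma) for t in room}
            for pts, t in enumerate(sorted(room, key=lambda t: perf[t])):
                points[t] += pts  # 0, 1, 2, 3 from worst to best in room
                speaks[t] += perf[t]
    return skill, points, speaks

def mean_rank_error(metric_vals, skill):
    """Mean absolute gap between a team's deserved tab position (by skill)
    and its position under the metric."""
    n = len(skill)
    deserved = {t: i for i, t in enumerate(sorted(range(n), key=lambda t: -skill[t]))}
    actual = {t: i for i, t in enumerate(sorted(range(n), key=lambda t: -metric_vals[t]))}
    return sum(abs(deserved[t] - actual[t]) for t in range(n)) / n

# Average displacement over many simulated tournaments: team points coarsen
# the same noisy performances, so they rank teams less faithfully.
errs = [(mean_rank_error(p, s), mean_rank_error(sp, s))
        for s, p, sp in (simulate(seed=i) for i in range(50))]
pts_err = sum(e[0] for e in errs) / len(errs)
spk_err = sum(e[1] for e in errs) / len(errs)
print(f"mean rank error: points={pts_err:.2f}, speaks={spk_err:.2f}")
```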
We calculate apparent bias in speaker points relative to team points, then adjust for the above team point bias to get unaccounted bias. We find that the unaccounted bias in total speaks relative to team points experienced by ESL/EFL and (non-)rep teams is essentially zero in recent times, seldom more than one (total) speaker point over an entire tournament, for WUDC, EUDC, and all tournaments. We also find that no particular teams (those consisting of a specific pair of speakers or those with a specific speaker) experience statistically significant unaccounted bias. Additionally, we find that ranking teams by total speaks shows no bias against ESL/EFL and non-rep teams with respect to elimination round results.
These findings are consistent with the existing literature on bias in BP debating. While there may be general discrimination in BP debating, the literature that addresses bias in speaker points does not account for differences in baseline skill and does not compare that bias to the bias in team points. Only one study looks at apparent bias in total speaks relative to team points, and it does not account for team point bias.
There is also substantially less randomness in teams’ tab positions when ranked by SP = [total speaks, team points] than by SQ = [team points, total speaks].
We investigate a few more metrics, including hybrids of speaker points and team points, opponent-adjusted team points, retrospective tapers, and judge-adjusted speaker points. We present a method for adjusting judges’ speaker points for mean and spread that accounts for the teams they judged, and we find that it offers an improvement on (unadjusted) speaker points.
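A naive version of mean-and-spread judge adjustment can be sketched as follows. Unlike the method described above, this sketch does not account for the strength of the teams each judge saw, and all numbers are illustrative.

```python
from statistics import mean, pstdev

def adjust_judge_scores(scores_by_judge, target_mean, target_sd):
    """Rescale each judge's speaks to a common mean and standard deviation.
    (A real method would also correct for the quality of the teams each
    judge happened to see; this sketch does not.)"""
    adjusted = {}
    for judge, scores in scores_by_judge.items():
        m, sd = mean(scores), pstdev(scores)
        if sd == 0:  # judge gave identical scores: map them all to the target mean
            adjusted[judge] = [target_mean for _ in scores]
        else:
            adjusted[judge] = [target_mean + (s - m) * target_sd / sd for s in scores]
    return adjusted

# Toy example: a generous judge and a harsh judge, four speeches each.
raw = {"generous": [78, 80, 82, 84], "harsh": [70, 72, 74, 76]}
adj = adjust_judge_scores(raw, target_mean=75, target_sd=2)
print({j: [round(s, 2) for s in scores] for j, scores in adj.items()})
```

After adjustment both judges' scores share the same mean and spread, so a speaker's total no longer depends on whether they drew the generous or the harsh judge.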
We do not claim that speaker points are perfect. We do claim, supported by the evidence, that ranking teams by total speaks is more accurate than ranking teams by team points, and that doing so does not unduly advantage or disadvantage any group of teams.
Thus, we suggest ranking teams by total speaks (or judge-adjusted speaks) then team points.
Details and more are available at https://sites.google.com/view/acdebating/.