Updated June 2017

On this page, I provide information on the performance of the various versions of the RPI that the NCAA has used for Division I women's soccer over the last ten years and compare those versions to each other. I also provide information on the performance of the best of the modified versions of the RPI I have developed and compare that version to the NCAA's versions. Finally, I provide information on the performance of an Elo-based rating system I developed and of Kenneth Massey's system, and I compare those systems to my best version and to the NCAA's versions of the RPI.

This page builds on the preceding "RPI: ..." pages, so it will be helpful if readers first review those pages. The second of those pages, "RPI: Measuring the Correlation Between Teams' Ratings and Their Performance," is a particular prerequisite: it explains the "Correlator" I've developed for evaluating how well a system's ratings perform. The performance information here for each system comes from applying the Correlator to that system.

1. Problem Areas for Rating Systems.

Rating systems can have a number of problems. These include:
This can create two rating problems:
Resource: the "RPI:Non-Conference RPI" page. c. 2. How Do the NCAA's Variations of the RPI Perform for Division I Women's Soccer?The NCAA has used the Unadjusted RPI and four variations of the Adjusted RPI over the 10 years from 2007 through 2016. The basic structure of each variation is the same, with the differences being in the bonus/penalty regimes for moving from the Unadjusted RPI to the Adjusted RPI. A detailed description of the four variations is at the "RPI: Formula" page. The following table shows how each of the NCAA's variations performs for the metrics my Correlator uses for evaluating rating systems. For each variation, I applied that variation's formula to the game results data for each of the last 10 years, so the table shows how the variation would have performed had it been in effect for the full 10 year period: In the table, I've arranged the NCAA RPI variations in chronological order, starting with the 2009 ARPI and proceeding through the NCAA's successive versions of the Adjusted RPI. In general, this means I've arranged the variations from the most robust bonus and penalty regime to the least robust. I've put the Unadjusted RPI at the end, since with no bonuses or penalties it is the least robust regime. The " Overall % Correct" column is an overall accuracy measure for the ratings a system produces. It indicates the frequency with which two opponents' game results are consistent with their ratings as adjusted for home field advantage. Thus for the NCAA's RPI, game results are consistent with ratings from 72.6 to 72.8% of the time. Each 0.1% represents 3 games per year out of ~3,000, so the best performer at 72.8% gets 6 more game results out of 3,000 correct per year than the poorest performer at 72.6%.The " ARPI Top 60, % Correct" column is an accuracy measure for the ratings of the Top 60 teams. It looks only at games in which at least one Top 60 team was involved. This covers all potential NCAA Tournament at large selection bubble teams. Here, each 0.1% represents 1 game per year out of ~1,000. Thus for the NCAA's RPI, the best performing system at 78.0% gets 2 more game results out of 1,000 correct per year than the poorest performing system at 77.8%.The " Regions % Performance All, Spread" column is based on the performance percentages of teams when classified by their regional playing pools. A performance percentage of 100% means that the regions' teams, on average, perform in accord with their ratings. A percentage above 100% means that they perform better than their ratings say they should, in other words on average are underrated. A percentage below 100% means they perform more poorly than their ratings say they should, in other words on average are overrated. This "Spread" column measures the difference between the performance percentages of the best performing region and the poorest performing region. Thus it's one measure of the general fairness of the rating system. A Spread of 0% would represent perfection.The " Regions All, Over and Under Total" column also is based on the regions' performance percentages. It measures the cumulative extent to which the five regions' performance percentages differ from 100%. Thus it, too, is a measure of the general fairness of the rating system. 
Whereas the first two Regions columns are measures of general fairness, the "Regions % Performance Trend All, Spread" column measures the extent of a rating system's discrimination among regions based on the average strength of the regions' teams. It is derived from a regions performance percentage table in which the regions are arranged in order from the best to the poorest average rating for the regions' teams, coupled with a chart that shows how the regions' performance percentages change as their average ratings change. The chart includes a computer-generated straight trend line that shows the relationship between regions' average ratings and their performance percentages. The "Spread" is the difference between the trend line's performance percentage at the high average rating end of the chart and its performance percentage at the low average rating end. A Spread of 0% means there is no discrimination in relation to regions' average ratings. A Spread above 0% means the rating system, on average, discriminates against teams from stronger regions and in favor of teams from weaker regions. A Spread below 0% means the rating system discriminates in favor of teams from stronger regions and against teams from weaker regions. The above table shows that all of the NCAA's RPI versions discriminate against teams from stronger regions and in favor of teams from weaker regions.

The three conference columns match the three region columns, but are for conferences. The "Conferences % Performance All, Spread" column suggests that the Committee's changes over time have made the ratings less fair. The "Conferences All, Over and Under Total" column likewise suggests that the Committee's changes have made the ratings less fair. The "Conferences % Performance Trend All, Spread" column follows the same pattern as the two conference general fairness measures, showing that the Committee's changes have made the RPI more discriminatory against stronger conferences and in favor of weaker ones.

Altogether, the table shows the RPI's tendency to discriminate against stronger regions and conferences and in favor of weaker ones. It also shows that versions of the Adjusted RPI with more robust bonus and penalty adjustments tend to reduce the discrimination, especially as to conferences.

Over the years, I've experimented with even more robust bonus and penalty adjustments than the 2009 ARPI's. As it's turned out, the 2009 ARPI bonus and penalty regime appears to be about the best one can do, through the use of bonuses and penalties, at reducing the discrimination. Although I believe the Women's Soccer Committee in part created the 2009 bonus and penalty structure to give a 2 ranking position advance to teams winning away games against opponents ranked in the Top 40 and a 1 ranking position advance to teams tying away games against such opponents, I don't know what rationale the Committee had for the balance of the 2009 structure. I suspect, however, that the purpose of the total bonus and penalty regime was to moderate the discrimination problem as much as possible, since that's what it actually does. On the other hand, maybe it simply is a coincidence that the 2009 regime is the best one can do at using bonuses and penalties to reduce discrimination.
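As an illustration of how a bonus and penalty regime works mechanically, here is a sketch in Python. The tier boundaries and amounts are hypothetical placeholders, not the NCAA's actual 2009 values; only the structure -- rating adjustments keyed to results against ranked tiers of opponents, with away results rewarded most -- reflects the description above.

```python
# Hypothetical bonus/penalty step: URPI plus adjustments keyed to results
# against ranked tiers of opponents. Amounts are placeholders, not the
# NCAA's actual 2009 values.
AWAY_WIN_VS_TOP40 = 0.0032  # assumed bonus amount
AWAY_TIE_VS_TOP40 = 0.0016  # assumed bonus amount

def adjusted_rpi(urpi, results, rank):
    """urpi: team -> Unadjusted RPI; rank: team -> URPI rank;
    results: team -> list of (opponent, outcome, venue), outcome in 'WTL'."""
    arpi = dict(urpi)
    for team, games in results.items():
        for opp, outcome, venue in games:
            if venue == 'away' and rank[opp] <= 40:
                if outcome == 'W':
                    arpi[team] += AWAY_WIN_VS_TOP40
                elif outcome == 'T':
                    arpi[team] += AWAY_TIE_VS_TOP40
            # a full regime would add further bonus tiers and penalty tiers
            # (e.g., deductions for poor results against low-ranked opponents)
    return arpi
```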
In addition, the NCAA's RPI has a Strength of Schedule Measurement Problem. I've discussed this in detail at the "RPI: Strength of Schedule Problem" page. The following table shows the significance of the problem for the five NCAA RPI variations covered in the above table. It is based on the ratings of teams in each of the 10 years from 2007 through 2016, and its purpose is to show the differences between teams' RPI rankings and their rankings in terms of what they contribute to their opponents' strengths of schedule:

The "Average" row shows the average difference, over the 10 years, between teams' RPI rankings and their rankings as contributors to opponents' strengths of schedule. As the row shows, the average difference is roughly 30 ranking positions.

The "Median" row shows the median difference, which is in the range of 22 to 24 positions.

The "Largest" row shows the largest difference between a team's RPI rank and its rank as a contributor to opponents' strengths of schedule. As the row shows, there can be very big differences. As an example from the 2016 season, Georgia's ARPI rank (based on the 2015 ARPI formula) was 87, but its rank as a contributor to opponents' strengths of schedule was only 218, a 131 position difference. Howard, on the other hand, had an ARPI rank of 225 but a contributor to strength of schedule rank of 69, a 156 position difference in the opposite direction. What this means is that the NCAA's RPI treated a team playing Howard (SoS contributor rank of 69) as having played a much stronger opponent than if the team had played Georgia (SoS contributor rank of 218), notwithstanding that the RPI itself ranked Georgia much more highly than Howard. This is the kind of rating problem that causes coaches to attempt to "game" the RPI when scheduling non-conference opponents. It also is the kind of problem that generates skepticism about the RPI as a rating system.

The "% 5 or less" row shows the percentage of teams for which the difference between their RPI rank and their contributor to strength of schedule rank is 5 positions or fewer. The percentage is in the 15 to 17% range; conversely, for 83 to 85% of teams, the difference is greater than 5 ranking positions.
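The disconnect can be seen in the arithmetic of the formula itself. Here is a simplified sketch assuming the standard RPI element weighting (25% winning percentage, 50% opponents' winning percentage, 25% opponents' opponents' winning percentage); under that weighting, what a team feeds into an opponent's strength of schedule is roughly proportional to (2 x its WP + its OWP)/3, a very different mix than its own rating. The precise contributor calculation is described on the "RPI: Strength of Schedule Problem" page.

```python
# Sketch of the rank-versus-contribution disconnect, assuming the standard
# 0.25*WP + 0.50*OWP + 0.25*OOWP weighting. A team's WP feeds an opponent's
# Element 2 (weight 0.50) and its OWP feeds Element 3 (weight 0.25), so its
# SoS contribution is roughly proportional to (2*WP + OWP)/3.

def rpi(wp, owp, oowp):
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

def sos_contribution(wp, owp):
    return (2.0 * wp + owp) / 3.0

def rank_gaps(teams):
    """teams: dict of name -> (wp, owp, oowp); returns RPI rank minus
    SoS contributor rank for each team (e.g., Georgia-style gaps)."""
    by_rpi = sorted(teams, key=lambda t: -rpi(*teams[t]))
    by_sos = sorted(teams, key=lambda t: -sos_contribution(*teams[t][:2]))
    rpi_rank = {t: i + 1 for i, t in enumerate(by_rpi)}
    sos_rank = {t: i + 1 for i, t in enumerate(by_sos)}
    return {t: rpi_rank[t] - sos_rank[t] for t in teams}
```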
The above information is about the RPI. What about the NCAA's Non-Conference RPI? Here is a table comparable to the first one above, for the Unadjusted Non-Conference RPI and the four versions of the Adjusted Non-Conference RPI the NCAA has used over the last 10 years:

Looking at the entire table, I don't see a big difference among the various Non-Conference RPI variations, except that the Unadjusted NCRPI best handles the problem of discrimination among regions based on region strength and the 2009 ANCRPI has the most general fairness as to conferences. Compared to the earlier RPI table on this page, however, the NCRPI table shows that in overall accuracy, and in accuracy in games involving a Top 60 team, the NCRPI versions are much less accurate than the RPI versions. This is not surprising, since the NCRPI calculations use a much smaller database of games. What this suggests is that the Non-Conference RPI is not very reliable at comparing individual teams to each other. The table also shows that the Non-Conference RPI almost completely solves the RPI's discrimination problem as to strong versus weak conferences and does much better than the RPI in terms of conference general fairness. The NCRPI also does much better than the RPI in terms of region general fairness and discrimination as to strong versus weak regions, although it still has a discrimination problem as to regions. Thus there is a trade-off: a large loss in accuracy for a gain in fairly rating conferences and, to a lesser extent, regions. Comparing the Non-Conference RPI versions to each other, there's a good argument that the best of them is the Unadjusted NCRPI, because it best handles the problem of discrimination among regions.

When it comes to the disconnect between teams' RPI ranks and their contributor to strength of schedule ranks, the Non-Conference RPI has problems similar to the RPI's. The following table compares the Unadjusted NCRPI to the Unadjusted RPI:

The bottom line regarding the Non-Conference RPI, in my opinion, is that at best it may be an improvement over the regular RPI in rating conferences and regions in relation to each other. On the other hand, it is significantly poorer than the regular RPI at rating individual teams in relation to each other. And, if you go to the "RPI: Non-Conference RPI" page under the heading "How Much Difference Does the ANCRPI Make in the Ranking of Conferences?," you'll see that the differences between the NCRPI's and the RPI's rankings of conferences, particularly for the Top 10 conferences, are small. The Top 10 conferences are the ones that matter for rating system comparison purposes, since history shows that the Women's Soccer Committee draws virtually all of its seeds and at large selections from the Top 10 conferences.

3. Are There Other Variations of the RPI That Would Perform Better for Division I Women's Soccer?

I've experimented with more than 100 modified versions of the RPI to see if it's possible to address the above problems while retaining the RPI's basic skeletal structure. I've found a number of variations that perform better than the NCAA's current RPI. The best of these is the 5 Iteration RPI, whose performance in relation to the problems discussed above is far superior to the NCAA's current RPI. Here's how it works:

The 5 Iteration RPI uses a "multiple iteration" approach starting from the NCAA's Unadjusted RPI. In this approach, once I have generated the URPI using the NCAA's current formula, I recompute each team's rating using its opponents' ratings from the preceding round as the measure of its strength of schedule, and I repeat that calculation through five rounds, as sketched below.
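Here is a minimal sketch of the iteration loop. The element weights (25% on a team's own winning percentage, 75% on the opponent-strength component) and the data structures are simplifying assumptions for illustration, not the exact formula:

```python
# Sketch of the multiple-iteration idea: each pass replaces the strength of
# schedule elements with opponents' ratings from the prior pass. Weights are
# assumed values for illustration.
def iterate_rpi(urpi, schedule, wp, iterations=5, w_own=0.25, w_sos=0.75):
    """urpi: team -> starting Unadjusted RPI; schedule: team -> list of
    opponents played; wp: team -> winning percentage."""
    ratings = dict(urpi)
    for _ in range(iterations):
        new = {}
        for team, opps in schedule.items():
            sos = sum(ratings[o] for o in opps) / len(opps)
            new[team] = w_own * wp[team] + w_sos * sos
        ratings = new
    return ratings
```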
This approach produces the 5 Iteration URPI. I then apply bonuses and penalties, as does the NCAA, to produce the 5 Iteration Adjusted RPI, using teams' 5 Iteration URPI ranks as the basis for the bonus and penalty "tiers." When considering potential bonus and penalty structures, I looked at the four structures the NCAA has used for Division I women's soccer over the last 10 years. The structure that produced the best-performing 5 Iteration ARPI is the 2009 structure, which is the most robust of the structures the NCAA has used. Thus the best of the 5 Iteration RPI structures is the 5 Iteration ARPI, 2009 BPs.

How much better is the 5 Iteration ARPI, 2009 BPs, than the NCAA's RPI? The following table compares it with the NCAA's current 2015 ARPI and the NCAA's better performing 2009 ARPI:

This table shows that the 5 Iteration ARPI, 2009 BPs, is far superior to either the NCAA's current 2015 ARPI or the NCAA's 2009 ARPI. In general accuracy, the 5 Iteration version is very slightly more accurate overall than the NCAA's versions and very slightly less accurate in relation to games involving Top 60 teams. In general fairness to regions and conferences, however, the 5 Iteration version is far better than the NCAA's versions. And the 5 Iteration version completely eliminates discrimination among conferences in relation to conference strength and significantly reduces discrimination among regions in relation to region strength.

In addition, the following table compares these three systems in terms of the disconnect between teams' RPI ranks and their ranks as contributors to opponents' strengths of schedule:

As this table shows, unlike the NCAA 2009 and 2015 ARPIs, with their serious disconnect between teams' ARPI ranks and their ranks as contributors to opponents' strengths of schedule, the 5 Iteration ARPI has a minimal disconnect. Under the NCAA's ARPI versions, coaches aspiring to have their teams in the NCAA Tournament have significant incentives to attempt to use this disconnect to "game" the system; for the 5 Iteration version, I know from experience that the differences are too small to allow successful gaming.

Finally, which better matches the Women's Soccer Committee's actual decisions on NCAA Tournament at large selections and seeds: the NCAA's current version of the ARPI or the 5 Iteration version? To answer this question, I matched the ranks from the NCAA's 2015 ARPI and the 5 Iteration ARPI, 2009 BPs against the NCAA's decisions over the last 10 years. I then determined the average ranks of the teams the Committee gave at large selections; of the teams from among the 2015 ARPI's Top 60 to which the Committee denied at large selections; of the #1, #2, #3, and #4 seeds respectively; and of the entire group of 16 teams that received seeds, to see which rating system better matched the Committee's actual decisions. The following table shows the results:

Starting with the at large selections, the 2015 ARPI's average rank of the Committee's at large selections over the 10-year period is 32.60. For the 5 Iteration ARPI, 2009 BPs, it is 31.14. The 5 Iteration version's ranks for the selected teams thus were better, on average, by 1.46 ranking positions, which means the 5 Iteration version is more consistent with the Committee's at large selections than the 2015 ARPI.
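The comparison itself is simple arithmetic; here is a sketch, with hypothetical input names:

```python
# Consistency check: the lower the average rank of the teams the Committee
# actually selected, the better the rating system matches the Committee.
def avg_selection_rank(selected, rank):
    """selected: list of team names; rank: dict of team -> rank."""
    return sum(rank[t] for t in selected) / len(selected)

# Hypothetical usage, aggregating over the 10 years of selections:
# avg_selection_rank(at_large_teams, rank_2015_arpi)      # e.g., 32.60
# avg_selection_rank(at_large_teams, rank_5iter_2009bps)  # e.g., 31.14
```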
In terms of teams in the Top 60 that did not get at large selections, their average rank under the 2015 ARPI was 51.00, whereas under the 5 Iteration version it was 54.73, a difference of 3.73. Again, this means that the 5 Iteration version is more consistent with the Committee's decisions.

For seeding, there is not a big difference. The 5 Iteration version is more consistent with the Committee's decisions on the #1 and #4 seeds, as well as on the entire group of 16 seeds. The 2015 ARPI version is more consistent as to the #2 and #3 seeds. In both cases, however, the differences are not large.

The bottom line is that the Division I women's soccer rating system would be significantly better if the NCAA would replace the current rating system with the 5 Iteration ARPI, 2009 BPs.

I do have one other suggested RPI version, however, that is worth considering. It is the 5 Iteration ARPI, 2009 BPs, but with two tweaks to make the formula simpler:

1. As discussed on the "RPI: Element 2 Issues" page, there is an oddity in how the NCAA computes strength of schedule. The oddity results in Team A making different contributions to different opponents' strengths of schedule, even though all the opponents played the same Team A. There is a simple change to how the NCAA computes strength of schedule that would eliminate this oddity.

With these two tweaks to the 5 Iteration RPI for purposes of simplicity, I produce the 5 Iteration Pure All 4 ARPI, 2009 BPs. It compares to the 5 Iteration ARPI, 2009 BPs, as shown in the following two tables:

The above table shows that the more complex 5 Iteration ARPI is slightly better than the simpler 5 Iteration Pure All 4 ARPI version. Based on my experience, however, the differences indicated in these two tables are, for practical purposes, inconsequential. In particular, regarding the Women's Soccer Committee's NCAA Tournament at large selection and seeding decisions, if the Committee were using the 5 Iteration RPI, the differences shown by the two tables are not enough to change the decisions the Committee would make. That being the case, if I were choosing between these two rating systems, I would use the 5 Iteration Pure All 4 ARPI, 2009 BPs. It's simpler, and by shifting to the Pure approach it removes an oddity of, and a criticism of, the current NCAA RPI.

4. Could an Elo-Based System Perform Better for DI Women's Soccer?

I also have experimented with Elo-based systems to see whether they might perform better than the RPI-based systems. Use this link for a general introduction to Elo-like systems. Although the original Elo system was for chess rankings, Elo-like systems now are used for some other sports, including for FIFA's rankings of women's national soccer teams.

The best-performing Elo-based system I tested uses a "seeding" of teams at the beginning of the season: each team starts at its rating as of the end of the preceding season. The initial seeding then interacts, game by game, with the results of games played during the course of the season. Usually, the initial seeding's effect is minimized by the end of the season. For those interested in details, my best system uses a K factor of 70; does not consider home field advantage, because considering home field advantage produces poorer rating system performance; and does not consider goal differential, because the NCAA has a policy against using goal differential in its rating systems, although using it might improve the system's performance.
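For readers who want to see the mechanics, here is a sketch of the Elo update as just described: K = 70, no home field or goal differential terms, and each team seeded at its rating from the end of the preceding season. The 400-point logistic scale is the standard Elo convention and is an assumption here.

```python
# Elo update per the description above: K = 70, no home field or goal
# differential adjustments. The 400-point scale is the standard Elo
# convention, assumed here.
K = 70

def expected_score(r_team, r_opp):
    # Probability-like expected result on the standard logistic curve.
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_team) / 400.0))

def update(r_team, r_opp, result):
    """result: 1.0 for a win, 0.5 for a tie, 0.0 for a loss."""
    return r_team + K * (result - expected_score(r_team, r_opp))

# Season processing: start each team at its end-of-prior-season rating,
# then apply update() to both teams of each game, in date order.
```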
The following table shows the performance of my best-performing Elo-based system as compared to the NCAA's current 2015 ARPI and the two 5 Iteration systems:

As this table shows, the Elo-based system's correlations with game results are poorer than the other systems', both overall and for games involving a Top 60 team. On the other hand, the Elo-based system is much better than the other systems at rating regions fairly, nearly eliminating any problem as to regions. The 5 Iteration ARPI does better at rating conferences fairly, although the Elo-based system does much better than the NCAA's ARPI as to conferences. Because of the way Elo-based systems work, they do not have a team rank versus team contribution to strength of schedule rank problem, but this is not a significant problem for the 5 Iteration ARPI either.

Regarding correlations with game results, I believe the reason for the Elo-based system's problem is that teams do not play enough games during a season to overcome the effects of the initial seeding in all cases. This is not a problem for most teams, as teams' strength on average does not change a lot from year to year, but it is a problem for top tier teams with significantly below par performances, and for middle tier teams with significantly above par performances, this year as compared to last year. The NCAA historically has firmly opposed using systems that start with an initial seeding of teams, because of the problems just described. Thus unless that NCAA policy changes, an Elo-based system using an initial seeding of teams is not a viable candidate for use by the NCAA.

I also have tested an Elo-based system that uses a common starting seed for all teams, but it performs dismally, no doubt because with a common starting seed an Elo-based system takes in the vicinity of 28 games to get teams to their correct positions. Division I women's soccer teams play, on average, just under 20 games during the regular season (including conference tournaments).

5. How Does Kenneth Massey's System Perform in Comparison to the Others?

Massey provides the only other ratings of Division I women's soccer teams that I am aware of. You can find Massey's women's college ratings here: Kenneth Massey Ratings. Although his system is proprietary, so I do not know the details of how it works, I know it uses an initial seeding of teams (whose influence on the ratings gradually diminishes over the course of the season) and that it uses game scores (as distinguished from simple won-lost-tied results).

I have evaluated Massey's system using my Correlator, in the same way as the other systems, based on 10 years of ratings and data. Here is an extension of the preceding table, with Massey's results added at the bottom:

As the table shows, Massey's ratings correlate with overall game results as well as the best of the systems, and they are very close to the best as to games involving at least one Top 60 team. For regions, Massey performs remarkably well, both in general fairness and as to discrimination based on region strength. His system is far superior, as to regions, to any of the RPI-based systems. For conferences, Massey performs far better than the NCAA's best ARPI system, but not as well as the 5 Iteration ARPI systems. Thus as between Massey and the 5 Iteration systems, there is a trade-off between region and conference fairness.
I believe that Massey's system, because of its nature, does not have the problem, which the NCAA's RPI has, of teams' rating ranks differing from their ranks as contributors to opponents' strengths of schedule. Given all of this, in my opinion Massey's system is excellent and, if the NCAA were to seek a system to supplement an RPI-based system, would be a good supplemental system for the Women's Soccer Committee to use.

6. Conclusion.

The 5 Iteration ARPI, 2009 BPs, is far superior to any of the RPI versions the Women's Soccer Committee has used over the last 10 years. The Women's Soccer Committee should use it rather than the current 2015 ARPI version. At a minimum, the Committee should use it in parallel with the 2015 ARPI for a trial period. And if the NCAA is prepared to look outside the RPI for a system -- as it currently is discussing for basketball -- it should look to Massey's system as the system to use.