The Jumping Elo has been a strong-performing metric since we first launched it. As originally developed, it was the best single metric for tasks such as:
- Predicting how likely a horse is to have a round 1 clear or sub 4 score
- Rank correlation, where the ranking of incoming horses by the metric is compared against their final positions.
- Computing a field strength adjustment for other primary metrics (like Opposition Beaten Percentage, average clear rates, etc).
However, we have kept an eye on optimising the Jumping Elo even further and had a few ideas for where it might be improved. This has resulted in some tweaks to the metric going into 2022.
We have tweaked the Elo in two main areas:
- Firstly, we have changed our measurement of metric performance: it is now based on how well the difference in Elo ratings separates winners from losers in head-to-head matchups. This makes the Elo quite reactive for horses at the low end of the Elo scale (a strong result for a low-Elo horse has a relatively large impact), and more stable for horses at the high end of the scale (a single result changes a high Elo less).
- Secondly, we introduced a show level and Prize Group adjustment. This means the Elo algorithm places more importance on a horse’s performance at major championships, 4* and 5* Grand Prixs, Nations Cups and World Cups, and thereafter prioritises Prize Groups AA/HH, A/H and B. This mainly affects Elo performance at those top-level events, which is where we mostly use the Elo rating with performance teams.
Modelling jumping performance and how best to measure Elo performance
Classically, performance of an Elo metric is measured using a score called the Brier Score, which measures how well a metric predicts probabilities. For Elo, this means how well the ratings predict the probabilities of head-to-head matchups in jumping. For example, say Horse A is likely to beat Horse B 75% of the time; this probability can be calculated directly from the Elo ratings. A core assumption of the Elo algorithm is that horses’ abilities are logistically (approximately normally) distributed around their Elo ratings, and that we can compute the likelihood of winning a matchup by comparing the two horses’ distributions.
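As a concrete sketch, the standard Elo formulation converts a rating difference into a head-to-head win probability via a logistic curve. The 400-point scale and the ratings below are the classic chess convention, used purely for illustration; the Jumping Elo’s exact parameters may differ:

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that horse A beats horse B under the standard logistic
    Elo model. scale=400 is the classic chess convention, used here only
    as an illustration of the mechanics."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# On this scale, a rating gap of roughly 190 points corresponds to a
# ~75% chance that the higher-rated horse wins the head-to-head:
print(round(expected_score(1490, 1300), 2))  # → 0.75
```

Note the symmetry: the two horses’ win probabilities always sum to one, so the whole prediction is driven by the rating difference alone.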
However, in jumping, horse ability – proxied by scoring/performance on the day – is likely not perfectly described by a logistic distribution. Some of this comes directly from the scoring format of the sport; it’s not a continuous score, but rather mainly driven by a discrete number of poles knocked. There is also a relatively hard upper limit to performance in that a horse can’t score better than 0 penalties (at least until the jump off round).
One way of measuring the non-logistic distribution of horse ability is by looking at a horse’s actual performance on the day as measured by its position percentile (i.e. placing in the top 5%, 10% or 50%, etc.). We can take horses that have had between 15 and 20 runs, so that we only consider horses at a similar point in their careers. We can then rank them by their average position percentile to date, and look at their performance on the day relative to their average performance to date. If performance were well modelled by a logistic curve, we might expect to see something similar to a bell curve. However, as shown in the graph below, we see a much flatter distribution. Some of this is a limitation of measuring performance by position percentile, in that it’s bounded between 0 and 1, but some of it is due to jumping performance not being normally/logistically distributed.
Show Jumping Performance Histogram
Average Position Percentile vs Position on the Day
If jumping performance were well modelled by a logistic curve, we might expect to see something similar to a bell curve. However, as shown here, we see a much flatter distribution.
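The position percentile comparison above can be sketched in a few lines. The run history here is entirely hypothetical, invented only to show the calculation:

```python
def position_percentile(position: int, field_size: int) -> float:
    """Finishing position as a percentile of the field:
    1st of 50 -> 0.02 (top 2%), 20th of 40 -> 0.5."""
    return position / field_size

# Hypothetical run history for one horse: (position, field_size) pairs.
runs = [(3, 40), (10, 25), (1, 50), (18, 30), (5, 45)]
percentiles = [position_percentile(p, n) for p, n in runs]
average_to_date = sum(percentiles) / len(percentiles)

# Performance on the day relative to the horse's average to date;
# a negative value means a better-than-usual result.
today = position_percentile(2, 35)
relative = today - average_to_date
```

Pooling `relative` across many horses of similar experience is what produces the flat histogram shown above, rather than the bell curve a logistic model would suggest.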
For these reasons, it’s best not to rely solely on the Brier Score when developing the Elo. Since the model does not fully describe the performance distribution, our predicted head-to-head percentages will be somewhat off.
A Brier Score can be decomposed into three main pieces: Uncertainty, Reliability and Resolution. Uncertainty measures how uncertain the base problem is (we are measuring head-to-head matchups, which have a default probability of 50%). Reliability measures how well the predicted probabilities match the actual observed outcome frequencies: does the 75% prediction actually come true 75% of the time? Resolution measures how much the predicted probabilities differ from the average. Resolution closely mirrors another metric known as the ROC AUC (see graph below), which measures how well the Elo separates winners and losers of head-to-head matchups. This metric suits our use case better: we care more about ranking horses by their head-to-head Elo comparisons and less about predicting the precise probabilities of those matchups.
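To illustrate the distinction, here is a minimal sketch of both scores on hypothetical head-to-head predictions. Rescaling the predictions without changing their ranking alters the Brier Score but leaves the ROC AUC untouched, which is why AUC suits a ranking-focused use case:

```python
from itertools import product

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and binary
    outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def roc_auc(probs, outcomes):
    """Probability that a randomly chosen winner (outcome 1) received a
    higher prediction than a randomly chosen loser (outcome 0); ties
    count as half. 0.5 is chance level, higher is better."""
    pos = [p for p, o in zip(probs, outcomes) if o == 1]
    neg = [p for p, o in zip(probs, outcomes) if o == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a, b in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Two hypothetical prediction sets with identical ranking but different
# calibration: the AUC is the same, the Brier Score is not.
outcomes        = [1, 1, 0, 1, 0, 0]
well_calibrated = [0.8, 0.7, 0.4, 0.6, 0.3, 0.2]
overconfident   = [0.99, 0.95, 0.2, 0.9, 0.1, 0.05]
```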
Brier Resolution and ROC AUC vs K Factor
Brier Resolution and ROC AUC are very similar in their optimal K factor readings.
Optimal K factor region
By looking at ROC AUC scores, our optimal K factor moves into a different region (as illustrated by comparing the two graphs below). The K factor determines how much importance the Elo places on recent performances versus older ones: higher K factors weight recent performances more heavily relative to historical ones.
Brier Score vs K Factor
We aim to minimise Brier Scores, so K factors around 65 would be optimal according to the above graph.
ROC AUC vs K Factor
We aim to maximise ROC AUC Scores, so K factors around 100 would be optimal according to the above graph.
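For reference, the role the K factor plays in each rating update can be sketched with the standard Elo update rule. The K = 100 default and 400-point scale below are illustrative placeholders, not the production parameters:

```python
def update_elo(rating: float, opponent_rating: float, actual: float,
               k: float = 100.0, scale: float = 400.0) -> float:
    """One standard Elo update: move the rating by K times the gap
    between the actual result (1 win, 0.5 draw, 0 loss) and the
    expected score. A larger K means a single result moves the
    rating further, i.e. recent form is weighted more heavily."""
    expected = 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / scale))
    return rating + k * (actual - expected)

# An upset win moves a rating much further than an expected win,
# because the gap between actual and expected is larger.
print(update_elo(1300, 1500, 1.0))  # underdog wins: large gain
print(update_elo(1500, 1300, 1.0))  # favourite wins: small gain
```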
K Factor decreasing with incoming Elo
In the 2021 Elo formula, we had two different K factors: a larger one for horses early in their career (within their first 12 runs) and a smaller one for horses with 13 or more runs. When optimising the Elo, we moved away from a binary K factor based on experience. Instead, we introduced a K factor that scales with a horse’s Elo, gradually reducing as the horse’s Elo becomes stronger. This means that the stronger an Elo gets, the less it varies based on the outcome of a single result. This type of adjustment leads to better performance, especially when measured by ROC AUC. While it increases Elo inflation, making it slightly more cumbersome to compare Elos year to year, it does a better job of ranking horses by their head-to-head Elo comparisons.
ROC AUC Score Head to Head vs K Factor by Elo Slope
We can see that with larger negative slopes (the blue dots), the ROC score improves compared to using no slope, i.e. a single K factor (the yellow dots).
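An Elo-dependent K factor can be sketched as a clamped linear ramp. Every number here (base K, slope, anchor rating, clamps) is a hypothetical placeholder chosen for illustration, not a value from the production formula:

```python
def k_for_elo(elo: float, base_k: float = 100.0, slope: float = -0.05,
              anchor: float = 1500.0, k_min: float = 30.0,
              k_max: float = 150.0) -> float:
    """K factor that shrinks linearly as the Elo rises above an anchor
    rating (all numbers illustrative). A negative slope means stronger
    horses get a smaller K, so their ratings are more stable, while
    weaker horses get a larger, more reactive K."""
    k = base_k + slope * (elo - anchor)
    return max(k_min, min(k_max, k))
```

A steeper negative slope widens the gap between the K applied to low-Elo and high-Elo horses, which is the effect compared across the coloured dots in the graph above.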
K Factor adjustment based on show level
The second area where we adjusted the Elo was to put in a show level importance factor. This means we increase K factors for major competitions. The competition classing we use, with K factor percentage, is as follows:
- Major Championship: 136%
- 5* World Cup / Nations Cup / Grand Prix: 130%
- 4* World Cup / Nations Cup / Grand Prix: 124%
- AA or HH Prize Groups: 118%
- A or H Prize Groups: 112%
- B Prize Groups: 106%
- Other: 100%
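The adjustment above amounts to a simple multiplier lookup applied to the K factor. The class labels in this sketch are hypothetical keys invented for illustration; the percentages are the ones listed above:

```python
# Show level importance multipliers from the table above; the string
# keys are illustrative labels, not identifiers from any real system.
SHOW_LEVEL_K_MULTIPLIER = {
    "major_championship": 1.36,
    "5*_world_cup_nations_cup_grand_prix": 1.30,
    "4*_world_cup_nations_cup_grand_prix": 1.24,
    "prize_group_AA_HH": 1.18,
    "prize_group_A_H": 1.12,
    "prize_group_B": 1.06,
    "other": 1.00,
}

def adjusted_k(base_k: float, competition_class: str) -> float:
    """Scale the K factor by the show level importance multiplier;
    unrecognised classes fall back to the 100% default."""
    return base_k * SHOW_LEVEL_K_MULTIPLIER.get(competition_class, 1.00)
```

So a result at a major championship moves a horse’s Elo 36% further than the same result at an unclassed event.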
This has a relatively small effect when we look at ROC scores across all competitions (see the first graph below), but a larger effect when we look at ROC scores at major competitions (see the second graph below). We picked the 6% increase per competition class as an intermediate value that performed close to optimal at major championships and general 4* and 5* classes.
ROC AUC Score vs Show Level K Factor Adjustment - All Results
Adding in a show level K factor adjustment results in an increase in ROC AUC score measured across all results.
ROC AUC Score vs Show Level K Factor Adjustment - Major Results
Major championships; 5* World Cups, Nations Cups and Grand Prixs; and 4* World Cups, Nations Cups and Grand Prixs
The show level K factor adjustment has a larger impact on ROC AUC scores when measured solely at major competitions.
Sean Murray is the expert data scientist at EquiRatings. He ensures all of our work is grounded in quality research and implements our scientific methods, processes, algorithms and systems. Previously, he worked as a consultant with Deloitte.