Over the past years, we've heard many complaints about Bayesian ELO being used as the way to rate players on the ladders. Two of the most common complaints are:
1) You can win a game and lose rating points
2) You can get a very inflated rating
We'll get into these complaints after a bit of discussion. Let us start with the history of Bayesian ELO.
I: History of the Bayesian ELO rating system
Bayesian statistics has been around for a long time. Understanding Bayesian Statistics intrinsically through its formulas is not advised for the average reader on this forum. Instead, we'll give a very simplified core idea of Bayesian Statistics through an anecdote.
Let's say you're in your house. You've lost your car keys somewhere in the house, but you have no idea where. So you assume they could be anywhere in the house. At some point you find them; they were under the couch because your cat decided to go and play with them. Two days later, you have lost your keys yet again. This time though, you have a previous experience: they were under the couch last time. You should check there first, since the probability that they're there is higher! This is (in simplified form) what Bayesian Statistics tries to do: your previous observations influence your probability distribution now.
The idea of creating a rating system with those underlying principles is not new. In fact, the idea first sprang up in 1929. A German paper describes this: https://bit.ly/2ZScKyv
It would take a long time for a digital rating system with the principle of Bayes to be made though. It was around 2004 when Remi Coulom started developing an algorithm, and development seems to have stopped in 2006. This algorithm was developed to be an improvement over "EloStat", an algorithm that estimated ratings too. You can view the forum thread where he discusses the whole development process here. I'd recommend reading it, as it's quite a wonderful thread, but I'm a nerd: https://bit.ly/2H8PpSA
II: Bayesian ELO in round-robin tournaments
As Bayesian ELO was trying to be better than EloStat, tests were shown where it gives far better estimations in small round-robin style tournaments. This still holds true: Bayesian ELO is a great way to estimate ratings in smaller, non-continuous events like round-robin tournaments. A ladder like the Seasonal Ladder could definitely see the improvements that Bayesian ELO gives over other rating systems, particularly regular ELO and EloStat. So what did it do better than EloStat?
1) The algorithm uses an advantage-function. It assumes that one player has a small advantage over the other. For example: playing white in chess, playing first in Go, etc. This is where one of our main problems will be later on.
2) This one requires an example. Let's take two situations:
Situation one: player A plays against player B 1000 times. Player B has an average rating every time they play.
Situation two: player A plays against 1000 different players with an average rating
Should those two situations be equivalent for player A's rating?
EloStat and regular ELO said yes, Bayesian ELO said no. And that brings us to the next point.
III: The difference between Bayesian ELO and regular ELO
According to regular ELO, the two situations depicted above should result in the same rating for player A. Bayesian ELO works a bit differently. It tries to predict (and does so very well in small samples) what the relative strengths of players are. If we assume four players, A, B, C, and D, and say A beats B, B beats C, C beats D, and D beats A, it tries to tell you something about the relative skill of those players. In this case, the result is marvelous: all players will have the same rating, as they should. If we added some more games, the result would change and become a bit asymmetric, but we'll get into that. Regular ELO would actually have performed worse with this example. It will give the following ratings:
Player D: 1501
Player C: 1501
Player B: 1499
Player A: 1499
*Done with an average rating of 1500, a k-factor of 32 and advantage of 0*
This shows us the error in regular ELO: the order of games being completed matters. Although the difference looks small, and it is, regular ELO will give a slight advantage to players D and C, even though we know their relative skill should be the same when we only look at this sample. This is because the following statements are true for ELO:
1) The higher the difference between your rating and your opponent's rating, the higher the impact will be on both ratings when an upset happens. (The lower rated player winning)
2) The order of completed games plays a role
When player A beat player B, they had the same rating. Therefore the increase in player A's rating was 16: half of the k-value. The difference in rating was 0.
When player A later lost against player D, player A had a rating of 1516 and player D had a rating of 1484. The difference between the ratings is 32. The k-factor itself stays at 32, but player A is now the favorite, so a loss costs slightly more than half of k. Player A loses 17 points and is now worse off. Had the games been completed in the reverse order, player A would have been on top.
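For those who like to see the arithmetic, here is a minimal sketch of the two updates described above. It assumes the standard ELO expected-score curve with a 400-point scale and a k-factor of 32; the function names are made up for this example and are not taken from any ladder code.

```python
def expected_score(rating, opponent):
    """Expected score of `rating` against `opponent` on the standard ELO curve."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def elo_change(rating, opponent, score, k=32):
    """Points gained (positive) or lost (negative); score is 1 for a win, 0 for a loss."""
    return k * (score - expected_score(rating, opponent))


# A beats B while both are on 1500: exactly half of k.
gain_vs_b = elo_change(1500, 1500, score=1)

# A (1516) later loses to D (1484): a bit more than half of k.
loss_vs_d = elo_change(1516, 1484, score=0)

print(round(gain_vs_b, 2), round(loss_vs_d, 2))
```

The win is worth 16 points and the loss costs roughly 17.5, so the same pair of results leaves player A below 1500 simply because of the order they came in.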
Bayesian ELO tried to solve this problem for tournaments, and it did so very successfully.
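To illustrate why fitting all games at once removes the ordering problem, here is a rough sketch using a plain Bradley-Terry fit. This is a stand-in for illustration only, not Coulom's actual program: it has no prior, no advantage-function and no draws, and it assumes every player has at least one win and one loss. The function name and the 1500 center are assumptions for this example.

```python
import math


def bradley_terry(games, players, iterations=200):
    """Fit relative strengths from (winner, loser) pairs, ignoring game order."""
    strength = {p: 1.0 for p in players}
    wins = {p: 0 for p in players}
    for winner, _ in games:
        wins[winner] += 1
    for _ in range(iterations):
        updated = {}
        for p in players:
            # Sum 1 / (s_winner + s_loser) over every game p took part in.
            denom = sum(
                1.0 / (strength[w] + strength[l])
                for w, l in games
                if p in (w, l)
            )
            updated[p] = wins[p] / denom
        # Rescale so the geometric mean stays at 1 (only ratios matter).
        log_mean = sum(math.log(s) for s in updated.values()) / len(players)
        strength = {p: s / math.exp(log_mean) for p, s in updated.items()}
    # Map strengths onto an ELO-like scale centered on 1500.
    return {p: 1500 + 400 * math.log10(s) for p, s in strength.items()}


cycle = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
ratings = bradley_terry(cycle, list("ABCD"))
reordered = bradley_terry(list(reversed(cycle)), list("ABCD"))
print(ratings)
print(reordered)
```

Because the fit only looks at who beat whom, both orderings of the A-B-C-D cycle put every player on the same rating.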
IV: Problems with Bayesian ELO on continuous events
As we've seen, Bayesian ELO is excellent when it comes to predicting relative skill in small groups. In a tournament, one should advocate for this rating system (although better alternatives exist). However, we use this system on the ladders, and there are a few problems. Let us first dig into the complaints we stated earlier.
1) You can win a game and lose rating points
While true, the rating that you see displayed on the ladders is only one part of the equation. Bayesian ELO takes more things into account. One of them is how sure it is about your rating. Your displayed rating is the rating you have according to Bayesian ELO minus the probable inaccuracy it estimates that rating to have. This means that if you're a high rated player winning against a low rated player, the change in the probable inaccuracy might outweigh the change in the rating that the system thinks you have. Since this probable inaccuracy is deducted from your actual rating, it might lower your displayed rating. This does mean, however, that the system is more sure about you actually deserving your high rating.
2) You can get a very inflated rating
This has been the main complaint about the ladders here. The infamous ladder runs. As most of us know, you get ranked on the 1v1 ladder after 20 games. If you manage to win your first 20 games, you get a rating far higher than expected. The question is if this is a problem of the Bayesian ELO system or the way we rank players.
Let's see what happens to players A, B, C and D when player A wins 7 games against all of them. The other players do not complete any games. The result is as follows:
Player A: 1776
Player B: 1408
Player C: 1408
Player D: 1408
This result is kind of expected. At any rate, it doesn't show any problems. So where do the problems begin, then?
There are 364 players ranked on the 1v1 ladder at this moment. A new player (or any player) will never play them all. They'll play a small sample of these players, hopefully within their current rating range. They have to complete 20 games before receiving a rank. Now, remember how Bayesian ELO was designed for tournaments. This means that your rating will have to be somewhat volatile when you start out. If you win your first game in a tournament, it should put you against better opponents in said tournament; that's the easiest way to estimate your actual rating. The first games you play have a big impact on your rating. This is by design, so that Bayesian ELO can say something about relative skill, as well as estimate your skill in a fast way. A tournament has limited games after all.
A ladder, however, does not have a game limit. If you win 10 games consecutively, Bayesian ELO will greatly overestimate your rating to see if you're actually deserving of a high rank. It wants to test you against worthy opponents. However, you might keep winning. The system doesn't know it's going to have to rank you accurately after 20 games, so it keeps increasing your presumed rating to get you to play the other high-rated players. Keep in mind that this is by design, and when used in a tournament setting, this is exactly what it should do. It's like a binary search. It does unfortunately do a bad job when it needs to rate players on a continuous event like the 1v1 ladder. This leads us into the next section.
V: Solutions to the rating system on continuous events
The first thing that should come to mind is the usage of regular ELO. Your rating can't skyrocket or plummet in the way it does with Bayesian ELO, and you can rank players faster than after playing 20 games. There are other rating systems out there that serve continuous events very well. I'd highly recommend Glicko2, but I won't go into detail. Regular ELO would solve a lot of problems already. Some players run scripts with these rating systems, and so far they seem to produce far more accurate results (unfortunately with me losing my trophies, but hey).
The second solution would be to have a different waiting condition for people to be ranked. Losses tend to greatly counter the effect of the so-called ladder runs (it is a form of binary search after all). You could try to have people rated only after they've lost four games. I'd be willing to test out this parameter.
But scratch the ME part; I'd love to say something about it, but I have only 104 characters left.
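As a small illustration of the second solution, the two waiting conditions could be compared like this. It is only a sketch: the function names and the "W"/"L" result lists are made up for the example, and the thresholds of 20 games and four losses are the ones mentioned above.

```python
def ranked_by_games(results, min_games=20):
    """Current rule: ranked once enough games are completed."""
    return len(results) >= min_games


def ranked_by_losses(results, min_losses=4):
    """Suggested rule: ranked only once enough games have been lost."""
    return results.count("L") >= min_losses


ladder_run = ["W"] * 20                      # a perfect 20-0 start
mixed = ["W", "L", "W", "L", "L", "W", "L"]  # 7 games, 4 of them losses

print(ranked_by_games(ladder_run), ranked_by_losses(ladder_run))
print(ranked_by_games(mixed), ranked_by_losses(mixed))
```

Under the loss-based rule the 20-0 ladder run stays unranked until the system has seen some losses, while a player with a mixed record gets ranked sooner.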
Thanks for listening, and please:
Do it safely. Use a rating system that fits!
Edited 9/20/2020 09:41:32