
Bayesian ELO: But why is it an unfortunate choice?: 9/19/2020 16:56:06


Farah♦ 
Level 61
Over the past years, we've heard many complaints about Bayesian ELO being used as the way to rate players on the ladders. Two of the most common complaints are:

1) You can win a game and lose rating points
2) You can get a very inflated rating

We'll get into these complaints after a bit of discussion. Let us start with the history of Bayesian ELO.

I: History of the Bayesian ELO rating system
Bayesian statistics has been around for a long time. Working through its formulas is more than the average reader on this forum will want to do, so instead we'll convey the core idea of Bayesian statistics through an anecdote.
Let's say you're in your house. You've lost your car keys somewhere in the house, but you have no idea where. So you assume they could be anywhere in the house. At some point you find them; they were below the couch because your cat decided to go and play with them. Two days later, you have lost your keys yet again. This time though, you have a previous experience: they were below the couch last time. You should check there first, since the probability that they're there is higher! This is (in simplified version) what Bayesian Statistics tries to do: your previous observations influence your probability distribution now.
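For the programmatically inclined, the anecdote can be sketched as a tiny prior-to-posterior update (all the numbers below are made up purely for illustration):

```python
# A toy Bayesian update, mirroring the lost-keys anecdote.

def normalize(dist):
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}

# Prior: before any experience, the keys could be anywhere.
prior = normalize({"couch": 1, "kitchen": 1, "hallway": 1, "bedroom": 1})

# Likelihood: given last time's experience (cat + couch), how likely is
# each location this time? (Assumed values.)
likelihood = {"couch": 0.7, "kitchen": 0.1, "hallway": 0.1, "bedroom": 0.1}

# Bayes' rule: posterior is proportional to prior times likelihood.
posterior = normalize({k: prior[k] * likelihood[k] for k in prior})

print(posterior["couch"])  # the couch is now the most probable spot
```

The previous observation reshapes the probability distribution, which is the whole trick.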
The idea of creating a rating system with those underlying principles is not new. In fact, the idea first sprang up in 1929, as this German paper shows: https://bit.ly/2ZScKyv
It would take a long time for a digital rating system built on Bayes' principle to be made, though. Around 2004, Remi Coulom started developing an algorithm, and development seems to have stopped in 2006. The algorithm was designed as an improvement over "EloStat", another rating-estimation algorithm. You can view his forum thread, where he discusses the entire development process, here. I'd recommend reading it, as it's quite a wonderful thread, but I'm a nerd: https://bit.ly/2H8PpSA


II: Bayesian ELO in round-robin tournaments
Since Bayesian ELO set out to be better than EloStat, tests were published showing that it gives far better estimates in small round-robin style tournaments. This still holds true: Bayesian ELO is a great way to estimate ratings in smaller, non-continuous events like round-robin tournaments. A ladder like the Seasonal Ladder could definitely see the improvements that Bayesian ELO gives over other rating systems, particularly regular ELO and EloStat. So what did it do better than EloStat?

1) The algorithm uses an advantage function: it assumes that one player has a small advantage over the other. For example: playing white in chess, playing first in Go, etc. This is where one of our main problems will show up later on.

2) This one requires an example. Let's take two situations:

Situation one: player A plays against player B 1000 times. Player B has an average rating every time they play.
Situation two: player A plays against 1000 different players, each with an average rating.

Should those two situations be equivalent for player A's rating?

EloStat and regular ELO said yes, Bayesian ELO said no. And that brings us to the next point.


III: The difference between Bayesian ELO and regular ELO
According to regular ELO, the two situations depicted above should result in the same rating for player A. Bayesian ELO works a bit differently. It tries to predict (and does so extremely adequately in small samples) what the relative strengths of players are. If we assume four players, A, B, C, and D, and say A beats B, B beats C, C beats D, and D beats A, it tries to tell you something about the relative skill of those players. In this case, the result is marvelous: all players will have the same rating, as they should. If we added some more games, the result would change and become a bit asymmetric, but we'll get into that. Regular ELO actually performs worse on this example. It will give the following ratings:

Player D: 1501
Player C: 1501
Player B: 1499
Player A: 1499
*Done with an average rating of 1500, a k-factor of 32 and advantage of 0*


This shows us the error in regular ELO: the order in which games are completed matters. Although the difference looks small, and it is, regular ELO will give a slight advantage to players D and C, even though we know their relative skill should be the same when we only look at this sample. This is because the following statements are true for ELO:
1) The higher the difference between your rating and your opponent's rating, the higher the impact will be on both ratings when an upset happens. (The lower rated player winning)
2) The order of completed games plays a role
When player A beat player B, they had the same rating: the difference in rating was 0, so the increase in player A's rating was 16, half of the k-value.
When player A later lost against player D, player A had a rating of 1516 and player D had a rating of 1484. The difference between the ratings is 32, so the upset is a bigger surprise and the update exceeds half of the k-value: player A loses 17 points and is now worse off. Had the games been completed in the reverse order, player A would have been on top.
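To make the order-dependence concrete, here's a quick sketch replaying that four-player cycle with plain Elo (exact decimals depend on rounding choices, so they may differ slightly from the table above):

```python
# Replaying the four-player cycle with plain Elo (start 1500, K=32)
# to show that the order in which games finish changes the outcome.

def expected(a, b):
    # Standard Elo expected score for a player rated a against b.
    return 1 / (1 + 10 ** ((b - a) / 400))

def play(ratings, winner, loser, k=32):
    delta = k * (1 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {p: 1500.0 for p in "ABCD"}
for w, l in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]:
    play(ratings, w, l)

# Every player won once and lost once inside the cycle, yet the
# ratings are not all equal: the later upsets moved more points.
print({p: round(r, 1) for p, r in ratings.items()})
```

Reversing the game order in the list flips who ends up on top, which is exactly the flaw described above.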

Bayesian ELO tried to solve this problem for tournaments, and it did so very successfully.


IV: Problems with Bayesian ELO on continuous events
As we've seen, Bayesian ELO is excellent when it comes to predicting relative skill in small groups. In a tournament, one should advocate for this rating system (although better alternatives exist). However, we use this system on the ladders, and there are a few problems. Let us first dig into the complaints we stated earlier.

1) You can win a game and lose rating points

While true, the rating that you see displayed on the ladders is only one part of the equation. Bayesian ELO takes more things into account, one of them being how sure it is about your rating. Your displayed rating is the rating you have according to Bayesian ELO minus the probable inaccuracy it estimates that rating to have. This means that if you're a high-rated player winning against a low-rated player, the change in the probable inaccuracy might far exceed the change in the rating the system thinks you have. Since this probable inaccuracy is deducted from your actual rating, it might lower your displayed rating. It does mean, however, that the system is more sure about you actually deserving your high rating.
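As a sketch with purely hypothetical numbers (this is not BayesElo's actual math, just the arithmetic of a "displayed rating = estimate minus uncertainty penalty" scheme):

```python
# Hypothetical numbers showing how a displayed rating of the form
# estimate - penalty * uncertainty can drop even after a win.

def displayed(mu, sigma, penalty=3):
    # mu: the system's rating estimate; sigma: its uncertainty.
    return mu - penalty * sigma

before = displayed(mu=1900, sigma=80)
# A win over a much weaker player barely moves the estimate, but
# suppose the joint recalculation nudges the uncertainty upward:
after = displayed(mu=1902, sigma=82)

print(before, after)  # the displayed number went down despite the win
```

The win still raised the underlying estimate; only the conservative number shown on the ladder dropped.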

2) You can get a very inflated rating

This has been the main complaint about the ladders here. The infamous ladder runs. As most of us know, you get ranked on the 1v1 ladder after 20 games. If you manage to win your first 20 games, you get a rating far higher than expected. The question is whether this is a problem of the Bayesian ELO system or of the way we rank players.
Let's see what happens to players A, B, C and D when player A wins 7 games against all of them. The other players do not complete any games. The result is as follows:

Player A: 1776
Player B: 1408
Player C: 1408
Player D: 1408

This result is kind of expected. At any rate, it doesn't show any problems. So where do the problems begin, then?

There are 364 players ranked on the 1v1 ladder at this moment. A new player (or any player) will never play them all. They'll play a small sample of these players, hopefully within their current rating range. They have to complete 20 games before receiving a rank. Now, remember how Bayesian ELO was designed for tournaments. This means that your rating will have to be somewhat volatile when you start out. If you win your first game in a tournament, it should put you against better opponents in said tournament; that's the easiest way to estimate your actual rating. The first games you play have a big impact on your rating. This is by design, so that Bayesian ELO can say something about relative skill, as well as estimate your skill in a fast way. A tournament has limited games after all.
A ladder, however, does not have a game limit. If you win 10 games consecutively, Bayesian ELO will greatly overestimate your rating to see if you're actually deserving of a high rank. It wants to test you against worthy opponents. However, you might keep winning. The system doesn't know it's going to have to rank you accurately after 20 games, so it keeps increasing your presumed rating to get you to play the other high-rated players. Keep in mind that this is by design, and when used in a tournament setting, this is exactly what it should do. It's like a binary search. It does unfortunately do a bad job when it needs to rate players on a continuous event like the 1v1 ladder. This leads us into the next section.
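BayesElo's real computation is a joint maximum-likelihood fit over all games, but the "binary search" effect of an unbroken win streak can be illustrated with a made-up uncertainty-weighted Elo variant (all constants here are invented):

```python
# Not BayesElo's actual algorithm: an uncertainty-weighted Elo sketch
# showing how 20 straight wins against average opponents inflates a
# provisional rating when uncertainty drives the step size.

def expected(a, b):
    return 1 / (1 + 10 ** ((b - a) / 400))

mu, sigma = 1500.0, 350.0           # new player: very uncertain
for _ in range(20):                 # 20 straight wins vs ~1500 players
    k = sigma                       # bigger uncertainty -> bigger step
    mu += k * (1 - expected(mu, 1500))
    sigma = max(50.0, sigma * 0.9)  # each result shrinks uncertainty

print(round(mu))  # far above 1500, despite only beating 1500s
```

Because every result is a win, nothing ever pulls the estimate back down, and the system keeps probing higher to find worthy opponents.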


V: Solutions to the rating system on continuous events
The first thing that should come to mind is the usage of regular ELO. Your rating can't skyrocket or plummet the way it does with Bayesian ELO, and you can rank players faster than after 20 games. There are other rating systems out there that serve continuous events very well. I'd highly recommend Glicko2, but I won't go into detail. Regular ELO would solve a lot of problems already. Some players run scripts with these rating systems, and so far they seem to produce far more accurate results (unfortunately with me losing my trophies, but hey). The second solution would be to have a different waiting condition for people to be ranked. Losses tend to greatly counter the effect of the so-called ladder runs (a run is a form of binary search, after all). You could try to have people rated only after they've lost four games. I'd be willing to test out this parameter.


VI: Summary
https://www.warzone.com/LadderTeam?LadderTeamID=25697
But scratch the ME part; I'd love to say something about it, but I have only 104 characters left.

Thanks for listening, and please:
Do it safely. Use a rating system that fits!

Edited 9/20/2020 09:41:32
Bayesian ELO: But why is it an unfortunate choice?: 9/19/2020 16:59:26


JK_3 
Level 63
tl;dr but I guess if you spend the time to type it out, it's important enough to upvote (also, how does Bayesian influence normal ELO? What is wrong with normal ELO according to Fizzer?)
Bayesian ELO: But why is it an unfortunate choice?: 9/19/2020 17:04:20


(deleted) 
Level 62
Alternatively, let Math Wolf have control. The grandfather of statistics on the website.

Great read Farah, It's umbrellatastic.
Bayesian ELO: But why is it an unfortunate choice?: 9/19/2020 17:15:55


Beep Beep I'm A Jeep 
Level 64
Great read Farah, and thanks for putting in the work.

A game like Warzone has to have a well-functioning ladder. If a strategic game has a strong and active competitive scene, this will trickle down and attract more players to the game and/or keep them playing it. The 1v1 ladder is mandatory for that.
Bayesian ELO: But why is it an unfortunate choice?: 9/19/2020 17:17:56


Torsten 
Level 61
nice profile AI
Bayesian ELO: But why is it an unfortunate choice?: 9/19/2020 17:19:34


JK_3 
Level 63
You can view his forum thread where he discusses all the way through the development stage here. I'd recommend reading it, as it's quite a wonderful thread, but I'm a nerd: https://bit.ly/2H8PpSA
The requested topic does not exist.
Bayesian ELO: But why is it an unfortunate choice?: 9/19/2020 17:43:09


l4v.r0v 
Level 59
Afaict, EloStat is just Coulom's term for what you call Regular ELO.

Also btw ELO should not be capitalized. It doesn't stand for anything; it's just the "Elo rating system" named after Arpad Elo. But that's a nitpick. Thanks for this write-up!

1) The higher the difference between your rating and your opponent's rating, the higher the impact will be on both ratings
Misleading. This is only true for upsets; if the favorite wins, the impact gets lower as the rating difference gets higher.

The impact of a game is proportional to how surprising the result is.

I.e., say we have an 1800 play a 1200 and a 1600 play a 1400. The 1200 beating the 1800 would gain more than the 1400 beating the 1600, but the 1600 beating the 1400 would gain more than the 1800 beating the 1200.

Bigger surprise = bigger update.
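Spelled out with the standard Elo formulas (K=32), the four updates in that example come out roughly like this:

```python
# The four updates from the example above, with K=32 and the
# standard logistic expectation.

def expected(a, b):
    return 1 / (1 + 10 ** ((b - a) / 400))

def gain(winner, loser, k=32):
    # Points the winner gains (and the loser drops).
    return k * (1 - expected(winner, loser))

print(round(gain(1200, 1800), 1))  # huge upset, biggest update (~31)
print(round(gain(1400, 1600), 1))  # smaller upset (~24)
print(round(gain(1600, 1400), 1))  # expected win, small update (~8)
print(round(gain(1800, 1200), 1))  # near-certain win, tiny update (~1)
```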

Edited 9/19/2020 17:49:27
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 09:40:50


Farah♦ 
Level 61
Misleading. This is only true for upsets; if the favorite wins, the impact gets lower as the rating difference gets higher.

Correct, I forgot to mention that this true for upsets. I've edited the first post to reflect this.
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 10:56:31


Math Wolf 
Level 64
To second what knyte (and Farah) said:
Statistically, this is linked to the concept of leverage. Not all games have the same impact on your rating. The relative impacts of your "worst loss" and "best win" are higher.

That's not a problem in itself: true variation is information, and upsets happen every once in a while and are informative. Without them, your rating would get overestimated. Which is exactly what happens with "runs": it's a selective sample in which upsets or bad losses are avoided as much as possible, either by delaying the losses or by ending the run when such a loss occurs or is unavoidable.

To go back to the statistical side of things: the problem is not how BayesElo handles variation, it's how it handles potential bias. While any rating system will be subject to such bias, some are more conservative than others. BayesElo simply gives the theoretically unbiased estimate, not taking into account that rankings are more likely to be overestimated than underestimated due to human behaviour.
Regular Elo just updates slower, but has these same problems at its core (with as main difference that games don't expire).

Meanwhile, other systems like TrueSkill, Glicko or the so-called MW-Elo used for MDL are designed to be more conservative: TrueSkill and Glicko by applying a penalty based on how uncertain your rating is (more games typically means more certainty; for the specialists: minus 3 times the standard deviation for TrueSkill, minus 2 times the ratings deviation for Glicko); MW-Elo by applying a bonus based on the number of finished games and how recently the player finished them (for the specialists: exponential decay of a fixed bonus added whenever a game finishes).
My subjective opinion is that MW-Elo has an advantage for ladders over standard Glicko and TrueSkill because it penalises returning after a prolonged absence where the latter do not.

Not mentioned here, but of equal importance, is the matchmaking algorithm. Ideally, players should be matched to other players within their range of likely ratings: e.g. someone with a higher uncertainty should get opponents in a wider range than someone with a quite precise rating.
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 12:06:19


berdan131
Level 59
Maybe it's too much work to bother with, and Fizzer doesn't think the value added is enough to justify it.

Bayesian or normal, the general rule that better players will be rated higher is true. This will not change if we switch systems.

It's like converting from the USA measuring system to the standardised one. It could be done, but people are too lazy.

The whole change is to make it a little fairer. Does it matter that much?
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 12:11:33


berdan131
Level 59
Ladder runs are quite motivating in my opinion and pretty sick. Some players may like it. It may boost their ego and make them play more.

Ladder runs have this benefit which none of you seem to know about.
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 12:24:00


Farah♦ 
Level 61
Not mentioned here, but of equal importance, is the matchmaking algorithm. Ideally, players should be matched to other players withing their range of likely ratings: e.g. someone with a higher uncertainty should get opponents in a wider range than someone with a quite precise rating.

If only I had more time to talk about this. But yes, the matchmaking system plays a big role of course.

Bayesian or normal, the general rule that better players will be rated higher is true. This will not change if we switch systems.

I think you might've missed the entire point here. Rating inaccuracies defeat this rule; the more inaccuracy, the less this rule holds true. I think you've just been shown that Bayesian Elo tends to have more inaccuracy.

It's like converting from USA measuring system to standardised. It could be done but people too lazy.

It's not. Calculating a different rating system for all ladder games takes a bit more effort than a quick google search.


Ladder runs are quite motivating in my opinion and pretty sick. Some players may it. It may boost their ego and make them play more.

Ladder runs have this benefit which none of you seem to know about.

Depends on interpretation. When I rejoined the ladder, I got a rating of 2340 on a run. Didn't quite boost my ego, nor did it make me play more. Of course the trophy is nice and all, but it's still not fair. The people rated below me were all better in terms of skill. Therefore, I shouldn't have been rated higher than them.
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 15:14:19


berdan131
Level 59
What I meant is if the entire usa decided to change the measuring system. You got me wrong :D

1. Win game lose rating.
Hmm, is it that common? Usually you gain rating.
2. Inflated elo
This is not inherently bad. Depends how you look at it. To some people a perfectly fair system might be too boring. Unpredictable = exciting

----Better players will be rated higher.---- It's still true. The inaccuracy increases. However the general rule holds. Explanation: Some players can't break 1600 or 1800 rating no matter how hard they try. Some people are just bad and it's reflected in ranking.

Normal elo may be a little better. However, is this a priority to Fizzer? Does it change that much? It feels like some minor detail. I don't know why people make such a big deal out of it :d

Bayesian is not perfect but it's good enough in my opinion. To me, warzone is a ready product, like chess or "go", so people obsess over details :P
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 15:33:25


Beep Beep I'm A Jeep 
Level 64
berdan, you are saying this because you like to profit from bayeselo, which means that you stall losses.

may i remind you?

https://www.warzone.com/MultiPlayer?GameID=23454837

You are the problem, and I can't even blame you. You are just working with a broken system.
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 15:35:19


Roi Joleil
Level 60
[---
1. Win game lose rating.
Hmm, is it that common? Usually you gain rating.
---]

The fact you say "Usually" is already bad enough. If you win, you gain points; if you lose, you lose points. There should never be a case where you win and lose points, or vice versa. That's really stupid.

[---
2. Inflated elo
This is not inherently bad. Depends how you look at it. To some people a perfectly fair system might be too boring. Unpredictable = exciting
---]

What's not "inherently bad" about that? If someone gets something undeservedly... it's undeserved. Simple as that.
And the only people who don't care for a perfectly fair system are people who aren't even good, because they are the ones who wouldn't get a trophy with a fair system. aka You

[---
Bayesian is not perfect but it's good enough in my opinion. To me, warzone is a ready product, like chess or "go", so people obsess over details
---]

This is just an insult to chess and go at this point.

Edited 9/20/2020 15:35:44
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 16:03:01


Farah♦ 
Level 61
Let's keep the insults to a minimum. The thread is supposed to be an explanation, and discussion is encouraged. Questions like the ones Berdan posted are more than welcome, as the answers to them might explain a lot to people who don't have the in-depth knowledge about the topic. So let's go over them.

1. Win game lose rating.
Hmm, is it that common? Usually you gain rating.

Yes, 'usually' you gain rating by winning. It's explained in the first post that your rating is made up of more things than the number you see on the ladders. In fact, with BayesElo, you always gain rating when you win a game; several people might tell you otherwise, though, as the end product that you see on the ladder is your perceived rating minus a factor of probable inaccuracy. As detailed in the first post, sometimes the effect of a completed game is larger on the sum of factors that are deducted from your perceived rating, so the number on the ladder may go down. Keep in mind that this is not common. It does, however, provide a bad experience to players who have no idea about this concept and see this effect on their ratings (and I doubt there are more than 5 people on this site who dug into BayesElo; shoutout to Math Wolf). This doesn't happen a lot, though, so I'd say the fact that it can happen shouldn't be taken as a reason to change the system.

2. Inflated elo
This is not inherently bad. Depends how you look at it. To some people a perfectly fair system might be too boring. Unpredictable = exciting

While I'd agree that in some situation unpredictability could be exciting, that's not what we're working with here. The ladder uses a rating system to determine a player's skill relative to other players. I think it's fair to say that we should aim at making this system accurate, as its meaning decreases with inaccuracy. There is little point to rating people's skill with random parameters. Now, Bayesian Elo isn't unpredictable, nor random, but it surely is exceedingly inaccurate when it comes to rating players on a continuous ladder in the extreme cases of a lot of wins or losses.
I also don't see how a perfectly fair system might be boring. Would you be okay with your opponent getting +10 income because their name ends with the letter 'C'? Why would we try to make a system as fair as possible to rate people if we wanted to add randomness in the first place?

----Better players will be rated higher.---- It's still true. The inaccuracy increases. However the general rule holds. Explanation: Some players can't break 1600 or 1800 rating no matter how hard they try. Some people are just bad and it's reflected in ranking.

The general rule holds, but it should be broken as little as possible. Some players indeed will never break 1600 or 1800. And that's fine. When they quit the ladder for 5 months, come back and suddenly get a rating of 2200 because of a bad rating system, it's not.

Bayesian is not perfect but it's good enough in my opinion. To me, warzone is a ready product, like chess or "go", so people obsess over details :P

Bayesian is extremely good when it's used in the right circumstance. Note that this thread is not aiming to say something bad about Bayesian Elo, just to point out how the implementation on a continuous event is flawed.

I had a metaphor in mind when I was writing this response, but my cat jumped on me and I forgot. It involved a Mercedes and mattresses. Not sure what it was, but I wanted to point it out anyways. Don't buy those mattresses!
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 17:28:39


l4v.r0v 
Level 59
The Bayeselo executable also computes uncertainty (standard deviations, iirc) for its rating estimates. Perhaps the solution is as simple as using TrueSkill-like ranking logic and subtracting 3*sigma from its rating to generate rankings?

Alternatively one thing I personally like in terms of elegance is to compute the probabilities of you beating some standard players (say a 1500 with zero deviation, a 1200, 1800, etc.) and to convert those back into a single equivalent Elo rating. That process would account for uncertainty (a 2100 with 300 sigma would get hit much harder than a 2100 with 50 sigma, because the latter is more likely really a 2100 and therefore less likely to get upset by a 1500) without a hamfisted simple penalty that drags down early ratings so drastically and makes it hard to figure out where you stand.
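A rough sketch of that idea, with a single 1500-rated anchor and a simple numerical average over an assumed normal rating distribution (the grid size and constants are arbitrary):

```python
# Convert an uncertain rating (mu, sigma) into a single "equivalent"
# Elo by averaging the win probability against a fixed 1500 anchor
# over the rating's uncertainty, then inverting the Elo formula.
import math

def win_prob(r, anchor=1500):
    return 1 / (1 + 10 ** ((anchor - r) / 400))

def equivalent_elo(mu, sigma, anchor=1500, samples=2001):
    # Numerically average win_prob over N(mu, sigma^2) on a wide grid.
    lo, hi = mu - 5 * sigma, mu + 5 * sigma
    step = (hi - lo) / (samples - 1)
    num = den = 0.0
    for i in range(samples):
        r = lo + i * step
        w = math.exp(-((r - mu) ** 2) / (2 * sigma ** 2))
        num += w * win_prob(r, anchor)
        den += w
    p = num / den
    return anchor - 400 * math.log10(1 / p - 1)

# An uncertain 2100 gets dragged down harder than a precise 2100:
print(round(equivalent_elo(2100, 300)), round(equivalent_elo(2100, 50)))
```

The uncertainty penalty emerges naturally from the averaging rather than from a fixed multiple of sigma.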

Additionally, it seems MW-Elo's innovations can be readily applied on top of any rating system since last I checked they're just bonuses for activity.

Maybe we should develop a clear list of actual problems that have occurred on the ladder due to Bayeselo? That could be the TL;DR. Ladder runs, obviously, as well as unpredictable rating changes, tanking past opponents' ratings, stalling, and the ladder rule AI technically broke with Elo: they're all byproducts of using Bayeselo. Plus, Bayeselo in a way punishes you for having played a lot of games: a 1400 with 150 games who improves in skill will basically have to wait until expiry for their rating to catch up to their new skill level, and we know that 1400s can climb in skill drastically in practice, because just a few insights are enough for 1800+ skill. Bayeselo is bad at picking up change in skill, although Coulom's other system, WHR, tackles this one flaw.

Elo is simple, and elegant, and well-tested. And the predictability it provides ("if I win this game, my rating will go up by 12") makes it more enjoyable to play on the ladder and compete for your rank.

Edited 9/20/2020 17:37:07
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 17:48:45


berdan131
Level 59
I posted here mostly because I was surprised this bothers so many people all of a sudden. It's something I personally wouldn't give a second thought.

I'm not trying to argue which system is better :P

The core gameplay, the most important aspect of the game, hasn't changed in many years: the way bonuses and armies interact with one another. Because there isn't really much to change. It's so simple in its form.

You could remove all the fancy cards, mods, additions, updates and it wouldn't change much for me.

Because the gameplay is so nice in its pure form. The rest is details.
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 18:32:58


Math Wolf 
Level 64
Additionally, it seems MW-Elo's innovations can be readily applied on top of any rating system since last I checked they're just bonuses for activity.

Yes and no - BayesElo has expiration, which other systems don't. The activity bonus/penalty would not necessarily combine well with expiring games.

The Bayeselo executable also computes uncertainty (standard deviations, iirc) for its rating estimates. Perhaps the solution is as simple as using TrueSkill-like ranking logic and subtracting 3*sigma from its rating to generate rankings?

Same answer as before: yes, but again the expiration would cause weird fluctuations as well, e.g. if a loss against an opponent with many games (themselves having a very precise rating) expires, your variation could go up considerably meaning you could potentially lose points from a loss expiring.

Most systems work with updating ratings whenever a game finishes, while BayesElo works with a time-constrained set of games and calculates the ratings jointly. This expiration process is one of the reasons these types of relatively easy fixes aren't as obvious as they should be.

Edited 9/20/2020 18:33:13
Bayesian ELO: But why is it an unfortunate choice?: 9/20/2020 21:40:14


l4v.r0v 
Level 59
Most systems work with updating ratings whenever a game finishes, while BayesElo works with a time-constrained set of games and calculates the ratings jointly. This expiration process is one of the reasons these types of relatively easy fixes aren't as obvious as they should be.
Seems like, given Bayeselo's inflexibility and the complexity it creates, the simplest thing would be to just swap out Bayeselo for MW-Elo (tried and tested on the MTL, which afaict doesn't have issues as egregious as the 1v1 Ladder). Once that system is in place, any newfound issues (like ratings not changing fast enough early on) could be solved using trivial tweaks (like having the K-factor be higher for early games).

Won't need expiration: a 1500 with 1309 games played is, to Elo, exactly the same as a 1500 with 0 games played. Won't need the unexpired-games rule on the ladder: accounts don't have widespread indirect impacts on everyone's ratings (indeed, I see no reason why the rule can't be relaxed to just not having 2 accounts on the ladder at the same time). Won't have to worry about stalling, since it will have a much smaller impact. Won't have to worry about players like NanoMidget or Nauz hurting past opponents' ratings once they go on boot streaks (my other gripe with the ladder is that this somehow hasn't been fixed yet; there's no reason not to remove players from the ladder automatically after repeated boots!).

A lot of the options for ladder manipulation and abuse just go away if Warzone drops Bayeselo in favor of a more conventional rating system, even just plain old Elo. This really isn't a complicated topic when you compare the impacts of actual, recurring issues on the 1v1 ladder against the largely theoretical (or inapplicable) and comparatively minor problems that Bayeselo is supposed to address.

And the price for that is just probably a few net points' worth of inaccuracy due to the relatively minor quirks of Elo, quirks that sort themselves out fast enough in a continuous ladder.

Edited 9/20/2020 21:52:00