Bayesian ELO versus Regular ELO: 2/26/2019 21:54:38 
Farah♦
Level 60

Over the years there's been an ongoing discussion about the rating system for the ladders. Most of the ladders use 'Bayesian ELO'. There are a few problems with this rating system, though:
1) Bayesian ELO promotes runs. When the system is unsure of your rating, it will inflate or deflate it significantly. This encourages people to do 'runs': a small number of unexpired games in which you turn the inflating part of the system to your advantage. If you win 19 or 20 games in a row, the system can't rule out that you're a godlike player, so it tends to give you a really high rating. As an example, you may check my ladder run. I had 22 unexpired games with 22 wins and 0 losses: https://www.warzone.com/LadderGames?ID=0&&LadderTeamID=11900&Offset=44
This resulted in a rating of 2340: a totally undeserved and inflated rating, which I could take the trophy with. It may have been a good performance, but nowhere near deserving of the number one spot, as others had way better records against way better players.
2) Bayesian ELO tends to converge your rating to a fixed number as the number of games you have finished gets higher. The more games you complete, the more certain the system is that your skill is a set number. When you win or lose a game, your rating changes much less than when you started out on the ladder. This might be beneficial if everyone played a small sample of games, but the ladders are continuous events, not tournaments with a limited number of games. So Bayesian ELO stagnates when you've completed a lot of games, while your skill may not. If you keep getting better or worse, your rating should reflect that. But if your rating gets more static as you play more games, you are forced to make games stop counting, e.g. by letting them expire. That creates an incentive to play fewer games.
So how would the Regular ELO algorithm solve these problems? It's rather simple:
1) Regular ELO promotes the opposite of runs. It gives or takes a certain number of points, no matter how many games you have completed. The only factor it takes into account is the rating your opponent has at the very moment you play them. It takes a while to reach an accurate rating if you're far from average (either far below or far above), but even that can be adjusted by changing the infamous K-factor. If you've had a few losses and your rating is reduced by 'x' points, you should be able to regain those 'x' points by winning approximately the same number of games. The opposite is true for Bayesian ELO. If I may take myself as an example again: after my run I surrendered two games. This cost me 160 rating points. To get those points back, I'd have to win against a top three player at least four times (by my back-of-the-envelope calculations). I didn't win against any top three player in that whole run of 22 games, btw.
2) Let's take a look at a player like AI. His rating is 2158 and he has a score of 62 wins versus 18 losses, for a total of 80 unexpired games. Let's compare that to the player above him. Leopard has a rating of 2172 with 24 wins and 6 losses. So their win rates are about the same (AI: 78%, Leopard: 80%). The average opponent they faced has about the same rating (AI: 1881, Leopard: 1899). So AI has proven to be more consistent, but Leopard has the edge. So far, that could be tolerable. Now let's look at Rufus. He has 37 wins versus 4 losses, a win percentage of 90%. But his average opponent had a rating of 1730. What is his rating though? 2350. That's a whopping 192 points higher than AI. Keep in mind that this means Rufus would have a 75% chance of winning against AI, which I find hard to believe given these statistics.
The problem AI is having is that his rating is being limited by his number of games. Bayesian ELO is very sure of his rating, so his games influence his rating less. Meanwhile, Bayesian ELO isn't sure of Rufus' rating at all, so his games count way more. With a low number of games, a high win percentage against any skill level matters more than the strength of the opponents you face.
Bayesian ELO works well when there is a set number of games for every player. Think of a tournament. This is also what it was designed for. The regular ELO system works better for continuous events such as a ladder. If your rating is influenced by the number of games you have completed in a continuous event, it can't accurately reflect your skill over time.
I propose we use the regular ELO to rate people on the ladders. To see what the ladder would look like in terms of ranks and ratings, consult this document: https://www.dropbox.com/s/7pghcwwcmbq9ewe/ELORatings.ods?dl=0
Any thoughts are welcome.
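For anyone who wants to check the 75% figure, here is a minimal sketch of the standard Elo expected-score formula and update rule. The function names and the K-factor of 32 are assumptions picked for illustration; they are not necessarily what Warzone or the linked spreadsheet uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win probability of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Return the new rating after one game; score is 1.0 for a win, 0.0 for a loss.

    k=32 is an assumed K-factor, chosen only for illustration.
    """
    return rating + k * (score - expected_score(rating, opponent))


# Rufus (2350) vs AI (2158): a 192-point gap implies roughly a 75% win chance.
print(f"{expected_score(2350, 2158):.2f}")

# The size of the step depends only on the two current ratings,
# not on how many games either player has already completed.
print(elo_update(2158, 2350, 1.0) - 2158)
```

Because the step is always K times (result minus expected result), a player with 80 games on record moves exactly as much as a player with 20 games, given the same two ratings.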

Bayesian ELO versus Regular ELO: 2/27/2019 09:30:03 
Farah♦
Level 60

"Interesting concept I see thrown around a lot around here. How would you propose that previous ELOs be handled? I.e., if I look at your profile, Farah, will I see your Bayesian ELO, even after MWELO was implemented? Or would it all be converted?"
That's not a question for me to answer, but with regular ELO, you could simply convert everything. For MWELO it would be slightly harder but it's still very doable.
"Wouldn't it be possible to make a Bayesian ELO that, rather than expiring games with time, would instead expire the oldest unexpired games when you reached, say, 35?"
It's possible, but it's cheatable. What if I surrender my first 25 games and then start playing with poor matchups in which I have a high chance of winning? The ladder will keep thinking I'm bad, so it gives me bad matchups. After I win a lot of bad matchups, the losses start to expire and I have a really high win rate. Which is exactly what Bayesian ELO will reward.
Your statistics suck. You suck.

Bayesian ELO versus Regular ELO: 2/27/2019 11:27:59 
89thlap
Level 61

I am open to discussion, since so far I don't really see the improvement offered by your rating system. You brought up Rufus and AI in your comparison. Obviously Rufus' 2480 rating wasn't sustainable, but just from looking at their finished games against players rated 2100+ I don't think there should be any doubt about who should have the higher rating. However, the regular ELO method that you are promoting doesn't reflect that at all; it even implies the opposite. Is this all due to 41 games not being enough to give a confident rating estimate? Would you mind experimenting with the K-factor then? I understand if the rating method maybe isn't able to give accurate ratings after 41 games yet, but it certainly should be able to show the correct tendency about who the better player is, especially if there are as many opponents for direct comparison as there are between Rufus and AI.
Rufus
Won Lost
ricky87: 2117
Nutella: 2138
jimmy: 2126
Kurdistan49: 2207
Super Smoove: 2252
89thlap: 2184
Bewmaster: 2130
AlturoSensei: 2129
malakkan: 2310
AI: 2156
AI
Won Lost
ricky87: 2117 AlturoSensei: 2129
Nutella: 2138 Kurdistan49: 2207
ricky87: 2117 Super Smoove: 2252
Leopard: 2195 AlturoSensei: 2129
Bewmaster: 2130
Rufus: 2349
malakkan: 2310
Nutella: 2138 (expired)
89thlap: 2184 (expired)
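On the K-factor question, a rough way to experiment is to replay a game history with several K values and compare where the rating ends up. The sketch below uses a made-up ordering of a 37-4 record against 1730-rated opponents and an assumed starting rating of 1500; neither is taken from the real ladder data.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo win probability of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def replay(games, k, start=1500.0):
    """Replay (opponent_rating, score) pairs in order with a fixed K-factor.

    start=1500 is an assumed initial rating.
    """
    rating = start
    for opponent, score in games:
        rating += k * (score - expected_score(rating, opponent))
    return rating


# Hypothetical record: 37 wins and 4 losses against 1730-rated opponents,
# with every tenth game a loss (the ordering is invented for illustration).
games = [(1730.0, 0.0 if i % 10 == 9 else 1.0) for i in range(41)]

for k in (16, 32, 64):
    print(k, round(replay(games, k)))
```

A larger K gets a new player close to their level in fewer games, at the cost of bigger swings after every single result, which is the trade-off behind the "is 41 games enough" question.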
Edited 2/27/2019 11:55:28

Bayesian ELO versus Regular ELO: 2/27/2019 12:06:26 
AI
Level 63

@Farah:
Great work, thanks.
Whether Regular Elo is better than Bayesian Elo or not, you definitely show us once again that Bayesian Elo is NOT the rating system any Warzone ladder should use in the future. There are too many examples where ratings seem terribly inflated or deflated.
One could say that Bayesian Elo is just "good enough", but it's not. This game claims to deliver competitive gameplay, without pay-to-win and with little luck involved. As such, you need to have a well-functioning rating system.
Btw: I'm not saying all of this because I am not 1st and with Regular Elo I would be.
@89thlap:
I get your point. Your arguments make some sense, but then again they seem like cherry-picking.
I mean, come on, you include expired games? The point of expiration is that these games no longer represent my skill... Also, why matchups above 2100? Do these games represent skill more than games in other rating ranges? For example, why not compare matchups below 2000? I only counted this roughly, but it seems like we have pretty much the same win rate in these matchups.
Don't get me wrong, I'm not claiming that I should be rated higher than Rufus. But he is 192 points ahead and that's definitely way too much.

Bayesian ELO versus Regular ELO: 2/27/2019 12:07:48 
Rento
Level 60

BayesElo (current system) wasn't designed for situations where some players have 20 games completed, other ones have 80. It's as simple as that.
Having a 2300 rating with just 20 unexpired games is less impressive than with 80 games and everyone can see that. And that's a big problem, because in practice it discourages staying on the ladder if you care about your rating. By the same token it encourages runs and stalling. The ratings and ranks are why many people play the ladder in the first place (instead of exclusively playing casual games against clanmates, for example).
The more you play, the less your rating changes, the more boring it gets.
Regular Elo doesn't have this problem.
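To illustrate the "more games, smaller changes" effect, here is a deliberately simplified stand-in for a Bayesian rating: a single player's most probable rating given fixed opponent ratings and a Gaussian prior. The prior (mean 1500, standard deviation 350), the fixed opponent ratings and all function names are assumptions; the real BayesElo fits every player at once and uses a different prior, so treat this only as a sketch of the general behaviour.

```python
import math

C = math.log(10) / 400.0  # converts rating points to logistic units


def expected_score(rating, opponent):
    """Standard Elo win probability against one opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def map_rating(games, prior_mean=1500.0, prior_sd=350.0):
    """Most probable rating for one player given (opponent_rating, score) pairs.

    Opponent ratings are treated as fixed and a Gaussian prior is assumed;
    both are simplifications that real BayesElo does not make.
    """
    def slope(r):
        # Derivative of the log-posterior; strictly decreasing in r.
        fit = sum(C * (s - expected_score(r, opp)) for opp, s in games)
        return fit - (r - prior_mean) / prior_sd ** 2

    lo, hi = 0.0, 4000.0
    for _ in range(60):  # bisection on the sign of the slope
        mid = (lo + hi) / 2.0
        if slope(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def gain_from_one_win(wins, losses, opponent=1881.0):
    """How much the estimate moves when one extra win is added to a record."""
    games = [(opponent, 1.0)] * wins + [(opponent, 0.0)] * losses
    return map_rating(games + [(opponent, 1.0)]) - map_rating(games)


print(round(gain_from_one_win(16, 4), 1))       # 20 games on record
print(round(gain_from_one_win(64, 16), 1))      # 80 games on record
print(round(map_rating([(1800.0, 1.0)] * 22)))  # a 22-0 run
```

Under these assumptions the same extra win moves the 20-game record several times further than the 80-game record, and an unbeaten 22-game run lands far above the opponents it was played against.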

Bayesian ELO versus Regular ELO: 2/27/2019 16:59:57 
89thlap
Level 61

"Having 2300 rating with just 20 unexpired games is less impressive than with 80 games and everyone can see that. And that's a big problem, because in practice it discourages staying on the ladder if you care for your rating."
I generally agree, but malakkan has 69 unexpired games and is rated 2309. So what is the point here? Doesn't that show you can sustain a high rating and play a lot of games?
"By the same token it encourages runs and stalling."
The issue is not that the system encourages stalling; the issue is that there are players who put personal success or personal goals over fair play and decide to take advantage of this weakness within BayesELO. What's the issue with runs after all? I don't see runs as that big of an issue at all. If Farah, for instance, comes up with 22 wins in a row, I don't feel mad. It is a big achievement and might be rewarded with a trophy. Props to him! After all, there haven't been too many successful #1 ladder runs since I've been following the ladder. Just very recently we had someone who tried it and failed (even if he had tried to stall his losses). Most of the players doing runs are decently skilled and would have good chances of getting the trophy at some point anyway.
Additionally, if you look at sports, there have always been teams or athletes overperforming for a certain amount of time. Think of the Philadelphia Eagles 2018 (NFL), Leicester City 2016 (Premier League), Kaiserslautern 1998 (Bundesliga), Goran Ivanisevic 2001 (Wimbledon), etc. These surprises are an essential part of good and interesting competition. If you start emphasizing quantity over quality too much to get rid of these "flukes", you will end up with a very predictable and boring competition. If there were 100 games to be played in a season, the Patriots would most likely be the #1 team, and the same goes for other disciplines. Eventually it might not even be a competition of skill but of endurance.
For WL that means: the fact that players will have to play ~80 games (my guess from looking at Farah's results) to get a competitive and representative ladder rating will be highly off-putting, even for those who don't mind playing a lot of games.
"The more you play, the less your rating changes, the more boring it gets."
Agreed, that is actually an issue.
"I mean, come on, you include expired games? The point of expiration is, that these games no longer represent my skill..."
You don't need those to see the tendencies; I could / should have left them out. I know they don't count, which is why I indicated that they are expired. I thought it might still be interesting to see them since they offer further opportunities for direct comparison between you and Rufus.
"Also, why matchups above 2100? Do these games represent skill more than other rating areas?"
If you want to be rated 2300, shouldn't you prove that you are able to beat 2100+ players? Those are the players you are competing with directly and those are the players you should compare yourself with. I wouldn't consider myself the best chess player in the world if Carlsen, Caruana, So, Mamedyarov and Anand decided to boycott all events and I ended up becoming world champion by playing an opponent rated much lower than the other top players.
"Don't get me wrong, I'm not claiming that I should be higher rated than Rufus. But he is 192 points ahead and that's definitely way too much."
Is it? Isn't MDL using a regular ELO type of approach with certain adjustments? You are rated 215 points lower than Rufus on MDL, too. Don't get me wrong, you are a great player and I only followed up on the discussion since your two names had been mentioned before. But to me it seems Rufus is just that much better, and that should be reflected in the ladder rankings as long as both players have finished a reasonable number of games.
To make my point clear: I totally agree that the current rating system could be improved or even replaced. From looking at Farah's results I just wanted to express my doubts that the regular ELO system as proposed (without any further adjustments) is the perfect solution. As I mentioned before, I don't know much about the calculations themselves. Are we sure there is no way to give players an incentive to finish more games, or to adjust the current method so that stalling / ladder runs are discouraged and players with a lot of games are rewarded? Because I feel BayesELO's ability to compare players with different numbers of games is kind of essential for a ladder that is played competitively and casually by approximately 300 players right now.

Bayesian ELO versus Regular ELO: 2/27/2019 18:53:23 
Nick
Level 56

This is all important discussion to be having. I raise an issue and propose a challenge.
1) Is the goal of the ladder rating to give a small handful of players competing for the top spots the rating they "deserve" (already an incredibly subjective term), or to serve the entire set of 1v1 ladder players over the entire span of ratings? The 1v1 ladder ratings on the whole are actually highly accurate at predicting match outcomes. The approximate mean prediction error over a sample of more than 100,000 games is <2%. I posit that the real function of the ladder rating system is to predict game outcomes (so as to create matchups that are as even as possible) across the whole set of players. The 1v1 ladder rating system does this very well for the 1v1 ladder.
2) A challenge: Log-loss evaluation of the BayesElo predictions over that same sample of >100,000 games yields a mean log loss of 0.577 for the period when the 1v1 ladder used a game expiration length of 3 months, 0.613 for the period in which it has used a 5-month expiration window, and an overall mean log loss of 0.603 across the entire prediction history of the 1v1 ladder (interesting that 3 months was better than 5). For reference, lower log loss means more accurate prediction.
Warzone's 1v1 ladder rating system should serve all its players equally, not just its best. It is important to remember that ladders like the MDL generally attract far higher-skilled players on average than Warzone's native 1v1 ladder, and hence the rating systems that work best for the MDL may be different from those that are best for the native 1v1 ladder.
If anyone comes up with a rating system that can beat the above log loss metrics for 1v1 ladder games to a point of statistical significance, I will support the case for putting Fizzer's time and resources into a switch. Until then, the 1v1 ladder should continue to use BayesElo.
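For anyone who wants to take up the challenge, mean log loss can be computed along these lines. This is a minimal sketch that assumes natural logarithms and one predicted win probability per game; the function name and input layout are illustrative, not the code used to produce the figures above.

```python
import math


def mean_log_loss(predictions, outcomes):
    """Average negative log-likelihood of binary outcomes.

    predictions[i] is the predicted probability that player A wins game i;
    outcomes[i] is 1 if A actually won and 0 otherwise.
    """
    eps = 1e-15  # clamp so that a 0% or 100% prediction cannot produce log(0)
    total = 0.0
    for p, won in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if won else math.log(1.0 - p)
    return total / len(predictions)


# Baseline: always predicting 50% scores ln(2), about 0.693.
print(round(mean_log_loss([0.5, 0.5], [1, 0]), 3))
```

With natural logarithms, a system that always predicts 50% scores about 0.693, so that is the number any rating system has to beat before its predictions carry information.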
Edited 2/27/2019 18:56:52

Bayesian ELO versus Regular ELO: 2/27/2019 19:18:32 
Nick
Level 56

"I'd argue that in its current implementation, the ladder isn't representative of the lower half of the ladder either. A ton of people are well below what they should be rated due to the number of unexpired games."
@Cowboy 1) Could you provide actual (statistically significant) evidence that (on average) the current rating system isn't representative of the lower half? 2) I understand why this might be frustrating, but the data shows that the game was giving you what it thought to be the most even matchups possible, and that it was generally very accurate in its assessment of how even those matchups were.

Bayesian ELO versus Regular ELO: 2/27/2019 22:10:31 
knyte
Level 58

Edit to avoid a double post: What about implementing the Math Wolf activity bonus system, but only for players with ratings above a threshold (say 1900)? You could just put those players under stricter scrutiny and encourage them to be more active (i.e., influence their behavior through small incentives which are significant near the top, in such a way that the adverse effect of the bonus system on ratings is offset by the predictivity gains of having better data as a consequence of more activity), since it seems this thread is motivated by the idea that we should be especially interested in accurate ratings near the top (but as Nick points out, concerns about just the top shouldn't determine how we handle the entire ladder).
Also, the burden Nick imposes for data might not actually work out. The predictivity of Elo is sensitive to a lot of things (matchmaking, the player pool, even the time of year, which affects the player pool and activity, etc.), so we won't really be able to get comparable log-loss data where all the variables are controlled; the best we can do is semi-comparable environments, but that falls into the trap of requiring us to make hard-to-test assumptions about player behavior, which is what drives this issue in the first place. Even if we were to temporarily swap the 1v1 ladder to an Elo system, that would affect player behavior and consequently the log-loss prediction value of the output.
Seems like Bayesian Elo was designed for learning/accurate estimation, where it beats out Elo ("Elostat" as Coulom calls it). With the same data (i.e., the same player behavior), Bayeselo beats Elo by a mile, as demonstrated by Coulom's own analysis (especially the special cases that motivated Coulom's work). But the behavior and data won't be consistent between Bayeselo and Elo, due to the "runs" problem you point out. Bayeselo doesn't take into account people's behavior in a competitive ladder (and their incentive to maximize their own rating), so the core assumptions of the model get violated in practice. Chess seems to demonstrate that Elo holds up just fine (or at least well enough) under the stress of actual human competitive play. So +1 to your recommendation for plain Elo over Bayeselo.
@Nick: maybe there's something in https://bit.ly/mdlanalysis (there's some stuff in there for the 1v1 ladder). I did some rough calculations to revise ratings (this was years ago, but if I recall correctly it was somewhere between one step of gradient descent and LMS; just some rough logic of that sort without the actual rigor of either algorithm) and it does seem like there's more noise at the bottom than at the top. The revised ratings did have significantly more global predictive value for recent results than the top-level ratings did, but this might not be down to the rating algorithm itself so much as the nature and behavior of lower-tier ladder players. This math is all pretty hand-wavy though, so it certainly wouldn't survive any sort of actual statistical review. But it gets the point across, I think.
Also +1 to steering this conversation toward what the data supports rather than what it feels like from the player perspective. Neither Elo nor Bayeselo achieves anything close to perfect prediction, and matchmaking is constrained heavily by other factors like player availability (in fact, it seems like the biggest driver of quality matchmaking is probably just the size of the player pool). So personal observations could be highly inaccurate and unrepresentative of how the ladder actually performs. Would be nice to do a ladder analysis with actual rigor in the future.
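As a side note on the "one step of gradient descent" remark: the ordinary Elo update is itself one gradient-descent step on the log loss of a single game, with the learning rate fixed by the K-factor. The sketch below only illustrates that textbook identity; it is not the calculation from the analysis linked above, and K=32 is an assumed value.

```python
import math

C = math.log(10) / 400.0  # slope of the Elo logistic curve in natural-log units


def expected_score(rating_a, rating_b):
    """Standard Elo win probability of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def log_loss(rating_a, rating_b, score):
    """Log loss of the Elo prediction for one game (score is 1.0 or 0.0 for A)."""
    e = expected_score(rating_a, rating_b)
    return -(score * math.log(e) + (1.0 - score) * math.log(1.0 - e))


def gradient_step(rating_a, rating_b, score, learning_rate):
    """One gradient-descent step on the log loss with respect to rating_a."""
    grad = -C * (score - expected_score(rating_a, rating_b))
    return rating_a - learning_rate * grad


k = 32.0  # assumed K-factor
elo = 2000.0 + k * (1.0 - expected_score(2000.0, 2100.0))
gd = gradient_step(2000.0, 2100.0, 1.0, learning_rate=k / C)
print(abs(elo - gd) < 1e-9)
```

Seen this way, the K-factor is a learning rate: plain Elo keeps it constant, while a full refit like Bayeselo effectively shrinks the step as more games pile up.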
Edited 2/27/2019 22:52:33

Bayesian ELO versus Regular ELO: 2/28/2019 12:37:30 
Rento
Level 60

Another problem with BayesElo and current matchmaking is that it sometimes matches players with such a big rating difference that the better player will lose rating even after he wins. Yes, it happens rarely and almost exclusively to the top players, where it's harder to make a good match. But it's still a drawback that regular Elo doesn't have.
"1) Is the goal of the ladder rating to give a small handful of players competing for the top spots the rating they "deserve" (already an incredibly subjective term), or to serve the entire set of 1v1 ladder players over the entire span of ratings?"
The current 1v1 ladder is absolutely terrible at retaining top players. As a consequence the top 10 isn't a real top 10, because most top players don't care to participate in the current system. You can argue that the top 10 constitutes under 3% of the entire player pool. I'll argue that the top 10 is what's displayed first after you click the community tab and that people check it and care more about it than about the bottom 100. Finally, it's not like ELO would fix the top 10 while screwing up everything else. I think it would be more fun for everyone if a good win streak gave you a good rating boost even if you have 80 unexpired games.
Edited 2/28/2019 12:38:10

Bayesian ELO versus Regular ELO: 2/28/2019 22:16:35 
Rento
Level 60

It's possible to check. Unfortunately I can't find a recent example on the 1v1 ladder, although I found a couple of Malakkan's games where his rating changed by less than 1 point after he won, like against HalfMoon just today. So that's not great already. Is everything working well if you don't get even 1 full point after you win?
But let's take a look at the last season of the seasonal ladder. The upper one is how the season ended (these are raw 'bayeselo' ratings; add 1300 to every player to get what Warzone displays). The bottom one is what the ratings would look like if we cut the 89thlap vs T54321 game (89thlap's second-to-last game, which 89thlap won). All the other games and results remain unchanged. You can see that if that game had never happened, 89thlap's "bayeselo" rating would be 4 points higher.
Now, I'm not complaining about how seasonal works. Bayeselo actually works pretty well there (compared to alternatives). But I'm showing you that it's possible to lose points after winning a game under the Bayeselo system.

Bayesian ELO versus Regular ELO: 2/28/2019 22:48:05 
89thlap
Level 61

"Now, I'm not complaining about how seasonal works. Bayeselo actually works pretty well there (compared to alternatives). But I'm showing you that it's possible to lose points after winning a game under the Bayeselo system."
Due to proper matchmaking with small rating differences between players, it is super rare to lose points after winning a game on the 1v1 ladder. Actually it is close to impossible. I did the calculations once when I was rated 2300+ and found that even if I had been matched up with the lowest-rated player (~900) I would "only" lose 1 point. On the Seasonal, however, you will be paired up with pretty much anyone when you are behind on games. Also, rating differences can be much bigger due to the 65 extra points per game, which makes the scenario of losing points after winning a game much more likely. I am fine with BayesELO on the Seasonal though; I think you could improve matchmaking to reduce the issue of very unfavorable matchups.
Would you mind explaining why BayesELO is better on Seasonal than regular ELO? My understanding was that Seasonals would suit regular ELO, since all players will have the same game count at the end of the season. Or is the issue with people dropping out early / joining late and hence not finishing 20 games?

Bayesian ELO versus Regular ELO: 2/28/2019 23:32:06 
TBest
Level 60

"Would you mind explaining why BayesELO is better on Seasonal than regular ELO? My understanding was that Seasonals would suit regular ELO, since all players will have the same game count at the end of the season. Or is the issue with people dropping out early / joining late and hence not finishing 20 games?"
If you haven't read it already, the site for BayesElo is a quick and good read listing both pros and cons. One thing that has not been mentioned here already is that both ELO and BayesELO assume draws are possible (however, in WZ that is not the case ofc). Also, Bayes lets you give an advantage to first pick (this is set to 10 elo for WZ, iirc). https://www.remicoulom.fr/BayesianElo
"The 1v1 ladder ratings on the whole are actually highly accurate at predicting match outcomes. The approximate mean prediction error over a sample of more than 100,000 games is <2%."
That surprised me. Is this <2% true across all the rating 'groups'? For comparison, this is better than FIDE's chess ratings, whose predictions can be off by ~5%. https://en.chessbase.com/post/sonasoverallreviewofthefideratingsystem220813
(The article by Jeff Sonas is much more in depth; I recommend a read if you have time. It has some clever ways of analyzing a rating system's performance when ranking players, and it would be interesting to see how WZ's ratings would hold up to a similar analysis.)
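On the first-pick advantage mentioned above: one simple way to picture it is as a fixed number of rating points added to the player who holds the advantage before the usual win probability is computed. The sketch below ignores draws entirely and uses the 10-point value recalled (not confirmed) in this post; the function name is made up for illustration.

```python
def expected_score_with_advantage(rating_a, rating_b, advantage=10.0):
    """Win probability for A when A holds an assumed fixed advantage (e.g. first pick).

    The advantage is modelled as extra rating points for A, and draws are
    ignored; the 10-point default is the value recalled (not confirmed) in the post.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a - advantage) / 400.0))


# Two equally rated players, with and without the first-pick bonus.
print(round(expected_score_with_advantage(2000.0, 2000.0), 3))
print(round(expected_score_with_advantage(2000.0, 2000.0, advantage=0.0), 3))
```

At equal ratings a 10-point advantage moves the prediction from 50% to a little over 51%, so it is a small correction rather than a decisive one.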
Edited 2/28/2019 23:32:33

Bayesian ELO versus Regular ELO: 3/1/2019 10:42:04 
89thlap
Level 61

"If you haven't read it already, the site for BayesElo is a quick and good read listing both pros and cons. One thing that has not been mentioned here already is that both ELO and BayesELO assume draws are possible (however, in WZ that is not the case ofc). Also, Bayes lets you give an advantage to first pick (this is set to 10 elo for WZ, iirc)."
Thanks for sharing, but I have read this before. Unfortunately it doesn't really answer my question regarding the Seasonal. However, it does a good job of promoting Bayesian ELO, which kind of reinforces my view that we should adjust the current calculation method rather than replace it.

Bayesian ELO versus Regular ELO: 3/1/2019 17:34:31 
Venus Angelic
Level 57

"Another problem with BayesElo and current matchmaking is that it sometimes matches players with such a big rating difference that the better player will lose rating even after he wins."
Agreed with you guys that this is ridiculous and should not happen under any situation. :P
I'd also like to hear more of the arguments against changing the algorithm, because I'm sure Fizzer prefers this system for a valid reason. I definitely agree that the algorithm encourages runs, but there are already "ladder rules" in place to discourage players from starting alt runs. Even without them, I don't think it's easy to get 1st place on a run unless you get ridiculously skewed matchups and maybe a couple of lucky wins during the final matchups.

Bayesian ELO versus Regular ELO: 3/2/2019 01:38:50 
knyte
Level 58

Also, predictivity is only half the issue. There's a reason Overwatch, League, etc., have weird rating systems that probably aren't great at prediction. Game/incentive design is also important. The fact that MDL not only happened but succeeded is basically a major upset against the 1v1 Ladder itself.

Bayesian ELO versus Regular ELO: 3/8/2019 13:58:37 
malakkan
Level 63

MDL in general, and the MW rating system in particular, is brilliantly designed to keep you motivated, whether you want to play competitively (since the rating system and the '10-games condition' remove all those boring rants about stalling, runs and rating manipulation) or just play casually and get good matches on good templates (which QM fails to do most of the time). The 1v1 ladder certainly fails to keep good players engaged for a long time, unless they really love MME and don't care about seeing people with overinflated ratings temporarily getting above them.
There is, however, a design flaw in the current implementation which is annoying and would probably discourage even the most monomaniacal MME fan: the Boston run pattern. There are currently 2 players that I don't want to be matched with if I expect to have a fair rating. They are both good players who rely pretty heavily on gambles/smart predictions, and the outcome of a game against them is often unpredictable. That's fine, I like such opponents. The issue is that after a while they get bored of Warzone, stop logging in and start a nice boot streak, which will only stop after 50 (!) boots have taken them to a rating below 1000 in less than 2 months (both players generally have 5 ongoing games, of course). If you are unlucky enough to have lost against them while they were active, your rating will suffer badly from it due to BayesElo. That has been discussed dozens of times already, I guess, but just removing a player from the ladder after a few boots would solve the issue.
One of those players is obviously Boston, who has this interesting rating shape displayed above. The other one is that Dutch player hello / later / KakkieG / Ricky87 who keeps creating new accounts with the same pattern. And you guessed it right, I lost twice to him last month :(


