Bayesian ELO versus Regular ELO: 2/26/2019 21:54:38 
Farah♦
Level 61

Over the years there has been an ongoing discussion about the rating system for the ladders. Most of the ladders use 'Bayesian ELO'. There are a few problems with this rating system, though:

1) Bayesian ELO promotes runs. When the system is unsure of your rating, it will inflate or deflate it significantly. On one end, this encourages people to do 'runs': a small number of unexpired games in which you take advantage of the inflating part of the system. If you win 19 or 20 games in a row, the system isn't confident you're not a godlike player, so it tends to give you a really high rating. As an example, you may check my ladder run. I had 22 unexpired games with 22 wins and 0 losses: https://www.warzone.com/LadderGames?ID=0&&LadderTeamID=11900&Offset=44
This resulted in a rating of 2340. A totally undeserved and inflated rating, which I could take the trophy with. It may have been a good performance, but nowhere near deserving of the number one spot, as others had way better records against way better players.

2) Bayesian ELO tends to converge your rating to a fixed number as the number of games you have finished gets higher. The more games you complete, the more certain the system is that your skill is a set number. When you win or lose a game, your rating changes much less than when you started out playing on the ladder. This may of course be beneficial to the ladder if everyone played a small sample of games, but the truth is that the ladders are continuous events, not tournaments with a limited number of games. This means Bayesian ELO stagnates when you've completed a lot of games, while your skill may not. If you keep getting better or worse, your rating should reflect that. But if your rating gets more static as you play more games, you are forced to make games stop counting, e.g. by letting them expire. That creates an incentive to play fewer games.

So how would the Regular ELO algorithm solve these problems?
It's rather simple:

1) Regular ELO promotes the opposite of runs. It gives or takes a certain number of points from your rating, no matter how many games you have completed. The only factor it takes into account is the rating your opponent has at the very moment you play them. It takes a while to reach an accurate rating if you're far from average (either far below or far above), but even that can be adjusted by changing the infamous K-factor. If you've had a few losses and your rating is reduced by 'x' points, you should be able to retrieve those 'x' points by winning approximately the same number of games. The opposite is true for Bayesian ELO. If I may take myself as an example again: after my run I surrendered two games. This cost me 160 rating points. To get those points back, I'd have to win against a top three player at least four times (by my back-of-the-envelope calculations). I didn't win against any top three player in that whole run of 22 games, btw.

2) Let's take a look at a player like AI. His rating is 2158 and he has a score of 62 wins versus 18 losses. His total of unexpired games is 80. Let's compare that to the player above him. Leopard has a rating of 2172 with 24 wins and 6 losses. So their winrate is about the same (AI: 78%, Leopard: 80%). The average opponent they faced has about the same rating (AI: 1881, Leopard: 1899). So AI has proven to be more consistent, but Leopard has the edge. So far, that could be tolerable. Now let's look at Rufus. He has 37 wins versus 4 losses, a win percentage of 90%. But his average opponent was at a rating of 1730. What is his rating though? 2350. That's a whopping 192 points higher than AI. Keep in mind that this means Rufus would have a 75% chance of winning against AI, which I find hard to believe given these statistics. The problem AI is having is that his rating is being limited by his number of games.
Bayesian ELO is very sure of his rating, so his games influence his rating less. Bayesian ELO isn't sure of Rufus' rating at all, though, so his games count way more. Getting a high win percentage against whatever skill level is more important than the opponents you face, as long as you have a low number of games.

Bayesian ELO works well when there is a set number of games for every player. Think of a tournament. This is also what it was designed for. The regular ELO system works better for continuous events such as a ladder. If your rating is being influenced by the number of games you have completed in a continuous event, your rating can't reflect your skill over time accurately.

I propose we use the regular ELO to rate people on the ladders. To take a look at what the ladder would look like in terms of rank and ratings, consult this document: https://www.dropbox.com/s/7pghcwwcmbq9ewe/ELORatings.ods?dl=0
Any thoughts are welcome.
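For readers who want to check the numbers, here is a minimal sketch of the standard Elo formulas referred to above. The K-factor of 32 is an assumption for illustration only, not a value used by any Warzone ladder, and the function names are hypothetical.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score (logistic curve on a 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def elo_update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Return the new rating after one game (score: 1 = win, 0 = loss).

    k=32 is an assumed K-factor chosen for illustration.
    """
    return rating + k * (score - expected_score(rating, opponent))


# Ratings quoted in the post: Rufus 2350, AI 2158 (a 192-point gap).
p = expected_score(2350, 2158)
print(f"Expected score for the higher-rated player: {p:.3f}")

# One loss followed by one win against an equally rated opponent
# brings the rating back to roughly where it started.
start = 2000.0
after_loss = elo_update(start, 2000.0, 0.0)
after_win = elo_update(after_loss, 2000.0, 1.0)
print(start, after_loss, round(after_win, 2))
```

The first print reproduces the roughly 75% figure for a 192-point gap. The second shows the "lose x points, win them back in about the same number of games" property, which holds because the size of each change depends only on the two current ratings and K, not on how many games have been played.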

Bayesian ELO versus Regular ELO: 2/26/2019 21:58:44 
Farah♦
Level 61

"TrueSkill is even better to avoid runs.
Also, MWELO as used by MDL is obviously the best as it was designed for Warzone ladders. :)"
True and True. But regular ELO is way easier to implement quickly (if Fizzer doesn't get your script ;))

Bayesian ELO versus Regular ELO: 2/26/2019 23:08:19 
The Joey
Level 59

Interesting concept I see thrown around a lot around here. How would you propose that previous ELOs be handled? I.e., if I look at your profile, Farah, will I see your Bayesian ELO, even after MWELO was implemented? Or would it all be converted?

Bayesian ELO versus Regular ELO: 2/27/2019 09:30:03 
Farah♦
Level 61

"Interesting concept I see thrown around a lot around here. How would you propose that previous ELOs be handled? I.e., if I look at your profile, Farah, will I see your Bayesian ELO, even after MWELO was implemented? Or would it all be converted?"
That's not a question for me to answer, but with regular ELO, you could simply convert everything. For MWELO it would be slightly harder but it's still very doable.
"Wouldn't it be possible to make a Bayesian ELO in which games, rather than expiring with time, would instead expire oldest-first once you reached, say, 35 unexpired games?"
It's possible, but it's cheatable. What if I surrender my first 25 games and then start playing poor matchups in which I have a high chance of winning? The ladder will keep thinking I'm bad, so it gives me bad matchups. After I win a lot of bad matchups, the losses start to expire and I have a really high winrate. Which is exactly what Bayesian ELO will reward.
Your statistics suck. You suck.
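The exploit described above can be illustrated with a toy sliding window. This is only a sketch under assumed rules (a fixed window of the 35 most recent games, as suggested in the quoted question); it looks at the win rate alone and does not model BayesElo or matchmaking. The function name is hypothetical.

```python
from collections import deque


def window_win_rate(results, window=35):
    """Win rate over the most recent `window` results (1 = win, 0 = loss)."""
    recent = deque(results, maxlen=window)  # keeps only the last `window` games
    return sum(recent) / len(recent)


history = [0] * 25                 # 25 deliberate surrenders
print(window_win_rate(history))    # every counted game is a loss

history += [1] * 20                # 20 wins against weak matchups
print(window_win_rate(history))    # 15 losses and 20 wins still count

history += [1] * 15                # 15 more wins push out the last losses
print(window_win_rate(history))    # only wins remain in the window
```

Once the surrendered games fall out of the window, the record seen by the rating system is a perfect streak earned against weak opponents.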

Bayesian ELO versus Regular ELO: 2/27/2019 12:06:26 
Beep Beep I'm A Jeep
Level 64

@Farah:
Great work, thanks.
Whether Regular Elo is better than Bayesian Elo or not, you definitely show us once again that Bayesian Elo is NOT the rating system any Warzone ladder should use in the future. Too many examples occur where ratings seem terribly inflated / deflated.
One could say that Bayesian Elo is just "good enough", but it's not. This game claims to deliver competitive gameplay, without pay-to-win and with little luck involved. As such, you need to have a well-working rating system.
Btw: I'm not saying all of this because I am not 1st and with Regular Elo I would be.
@89thlap:
I get your point. Your arguments make sense somehow, but then again they seem like cherry-picking.
I mean, come on, you include expired games? The point of expiration is that these games no longer represent my skill... Also, why matchups above 2100? Do these games represent skill more than other rating areas? For example, why not compare matchups below 2000? I only counted this roughly, but it seems like we have pretty much the same winrate in these matchups.
Don't get me wrong, I'm not claiming that I should be higher rated than Rufus. But he is 192 points ahead and that's definitely way too much.

Bayesian ELO versus Regular ELO: 2/27/2019 16:59:57 
89thlap
Level 61

"Having 2300 rating with just 20 unexpired games is less impressive than with 80 games and everyone can see that. And that's a big problem, because in practice it discourages staying on the ladder if you care for your rating."
I generally agree, but malakkan has 69 unexpired games and is rated 2309. So what is the point here? Doesn't that show you can sustain a high rating and play a lot of games?
"By the same token it encourages runs and stalling."
The issue is not that the system encourages stalling; the issue is that there are players who put personal success or personal goals over fair play and decide to take advantage of this weakness within BayesELO.
What's the issue with runs after all? I don't see runs as that big of an issue at all. If Farah for instance comes up with 22 wins in a row I don't feel mad. It is a big achievement and might be rewarded with a trophy. Props to him! After all, there haven't been too many successful #1 ladder runs since I've been following the ladder. Just very recently we've had someone who tried it and failed (even if he would have tried to stall his losses). Most of the players that are doing runs are decently skilled and would have good chances of getting the trophy anyway at some point.
Additionally, if you look into sports there have always been teams or athletes overperforming for a certain amount of time. Think of the Philadelphia Eagles 2018 (NFL), Leicester City 2016 (Premier League), Kaiserslautern 1998 (Bundesliga), Goran Ivanisevic 2001 (Wimbledon), etc. These surprises are an essential and important criterion for good and interesting competition. If you start emphasizing quantity over quality too much to get rid of these "flukes", you will end up with a very predictable and boring competition. If there were 100 games to be played in a season, the Patriots would most likely be the #1 team; same for other disciplines. Eventually it might not even end up as a competition of skill but of endurance.
For WL that means: the fact that players will have to play ~80 games (my guess from looking at Farah's results) to get a competitive and representative ladder rating will be highly off-putting for them, even for those that don't mind playing a lot of games.
"The more you play, the less your rating changes, the more boring it gets."
Agree, that is actually an issue.
"I mean, come on, you include expired games? The point of expiration is that these games no longer represent my skill..."
You don't need those to see the tendencies; I could / should have left them out. I know they don't count, which is why I indicated that they are expired. I thought it might still be interesting to see those as well since they offer further opportunities for direct comparison between you and Rufus.
"Also, why matchups above 2100? Do these games represent skill more than other rating areas?"
If you want to be rated 2300, shouldn't you prove that you are able to beat 2100+ players? Those are the players you are competing with directly and those are the players you should compare yourself with. I wouldn't consider myself the best chess player in the world if Carlsen, Caruana, So, Mamedyarov and Anand decided to boycott all events so that I end up becoming world champion by playing another opponent who is rated much lower than the other top players.
"Don't get me wrong, I'm not claiming that I should be higher rated than Rufus. But he is 192 points ahead and that's definitely way too much."
Is it? Isn't MDL using a regular ELO type of approach with certain adjustments? You are rated 215 points lower than Rufus on MDL, too. Don't get me wrong, you are a great player and I only followed up on the discussion since your two names have been mentioned before. But to me it seems Rufus is just that much better and that should be reflected in the ladder rankings as long as both players have finished a reasonable number of games.
To make my point clear: I totally agree that the current rating system could be improved or even replaced. From looking at Farah's results I just wanted to express my doubts that the regular ELO system as proposed (without any further adjustments) is the perfect solution. As I mentioned before, I don't know much about the calculations themselves. Are we sure there is no way to give players an incentive to finish more games, or to adjust the current method so that stalling / ladder runs are discouraged and players with a lot of games are rewarded? Because I feel BayesELO's ability to compare players with different numbers of games is kind of essential for a ladder that is played competitively and casually by approximately 300 players right now.

Bayesian ELO versus Regular ELO: 2/27/2019 18:53:23 
Nick
Level 57

This is all important discussion to be having. I raise an issue and propose a challenge.
1) Is the goal of the ladder rating to give a small handful of players competing for the top spots the rating they "deserve" (already an incredibly subjective term), or to serve the entire set of 1v1 ladder players over the entire span of ratings? The 1v1 ladder ratings on the whole are actually highly accurate at predicting match outcomes. The approximate mean prediction error over a sample of more than 100,000 games is <2%. I posit that the real function of the ladder rating system is to predict game outcomes (so as to create matchups that are as even as possible) across the whole set of players. The 1v1 ladder rating system does this very well for the 1v1 ladder.
2) A challenge: Log loss evaluation of the BayesElo predictions over that same sample of >100,000 games yields a mean log loss of 0.577 for the period when the 1v1 ladder was still using a game expiration length of 3 months, 0.613 for the period in which the ladder has used a 5-month game expiration window, and an overall mean log loss of 0.603 across the entire prediction history of the 1v1 ladder (interesting that 3 months was better than 5). For reference, lower log loss means more accurate prediction.
Warzone's 1v1 ladder rating system should serve all its players equally, not just its best. It is important to remember that ladders like the MDL generally attract far higher-skilled players on average than Warzone's native 1v1 ladder, and hence the rating systems that work best for the MDL may be different from those which are best for the native 1v1 ladder.
If anyone comes up with a rating system that can beat the above log loss metrics for 1v1 ladder games to a point of statistical significance, I will support the case for putting Fizzer's time and resources into a switch. Until then, the 1v1 ladder should continue to use BayesElo.
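For anyone who wants to take up the challenge, here is a minimal sketch of how mean log loss is computed for win/loss predictions. The sample predictions and outcomes are made up for illustration and are not ladder data; the function name is hypothetical.

```python
import math


def mean_log_loss(predictions, outcomes, eps=1e-15):
    """Mean binary log loss.

    predictions: predicted win probabilities for player A in each game.
    outcomes:    1 if player A won that game, 0 otherwise.
    """
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(outcomes)


outcomes = [1, 0, 1, 1]  # made-up results

# Always predicting 50% gives ln(2), about 0.693, whatever the outcomes are.
coin_flip = mean_log_loss([0.5, 0.5, 0.5, 0.5], outcomes)
print(round(coin_flip, 3))

# Confident predictions that turn out right score lower (better).
sharper = mean_log_loss([0.8, 0.2, 0.7, 0.9], outcomes)
print(round(sharper, 3))
```

The coin-flip baseline of about 0.693 gives a sense of scale for the 0.577 to 0.613 figures quoted above. A competing rating system would need to be scored on the same games with the same function for the comparison to mean anything.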
Edited 2/27/2019 18:56:52

Bayesian ELO versus Regular ELO: 2/27/2019 19:18:32 
Nick
Level 57

"I'd argue that in its current implementation, the ladder isn't representative of the lower half of the ladder either.
A ton of people are well below what they should be rated due to the number of unexpired games."
@Cowboy 1) Could you provide actual (statistically significant) evidence that (on average) the current rating system isn't representative of the lower half? 2) I understand why this might be frustrating, but the data shows that the game was giving you what it thought to be the most even matchups possible, and that it was generally very accurate in its assessment of the equality of those matchups.

Bayesian ELO versus Regular ELO: 2/27/2019 22:10:31 
l4v.r0v
Level 59

Edit to avoid a double post: What about implementing the Math Wolf activity bonus system but only for players with ratings above a threshold (say 1900)? You could just put those players under stricter scrutiny and encourage them to be more active (i.e., influence their behavior through small incentives which are significant near the top, in such a way that the adverse effect of the bonus system on ratings will be offset by the predictivity gains of having better data as a consequence of more activity), since it seems this thread is motivated by the idea that we should be especially interested in accurate ratings near the top (but as Nick points out, concerns about just the top shouldn't motivate how we handle the entire ladder).
Also, the burden of proof Nick imposes for data might not actually work out. The predictivity of Elo is sensitive to a lot of things (matchmaking, the player pool, even the time of year, which affects the player pool and activity, etc.), so we won't really be able to get comparable log-loss data where all the variables are controlled; the best we can do is semi-comparable environments, but that falls into the trap of requiring us to make hard-to-test assumptions about player behavior, which is what drives this issue in the first place. Even if we were to temporarily swap the 1v1 ladder to an Elo system, that would affect player behavior and consequently the log-loss prediction value of the output.

Seems like Bayesian Elo was designed for learning/accurate estimation, where it beats out Elo ("Elostat" as Coulom calls it). With the same data (i.e., the same player behavior), Bayeselo beats Elo by a mile, as demonstrated by Coulom's own analysis (especially the special cases that motivated Coulom's work). But the behavior and data won't be consistent between Bayeselo and Elo, due to the "runs" problem you point out.
Bayeselo doesn't take into account people's behavior in a competitive ladder (and their incentive to maximize their own rating), so the core assumptions of the model get violated in practice. Chess seems to demonstrate that Elo holds up just fine (or at least well enough) under the stress of actual human competitive play. So +1 to your recommendation for plain Elo over Bayeselo.
@Nick: maybe there's something in https://bit.ly/mdlanalysis (there's some stuff in there for the 1v1 ladder). I did some rough calculations to revise ratings (this was years ago, but if I recall correctly it was somewhere between one step of gradient descent and LMS; just some rough logic of that sort without the actual rigor of either algorithm) and it does seem like there's more noise at the bottom than at the top. The revised ratings did have significantly more global predictive value for recent results than the top-level ratings did, but this might not be down to the rating algorithm itself so much as the nature and behavior of lower-tier ladder players. This math is all pretty hand-wavy though, so it certainly wouldn't survive any sort of actual statistical review. But it gets the point across, I think.
Also +1 to steering this conversation toward what the data supports rather than what it feels like from the player perspective. Neither Elo nor Bayeselo achieves anything close to perfect prediction, and matchmaking is constrained heavily by other factors like player availability (in fact, it seems like the biggest driver for quality matchmaking is probably just the size of the player pool). So personal observations could be highly inaccurate and unrepresentative of how the ladder actually performs. Would be nice to do a ladder analysis with actual rigor in the future.
Edited 2/27/2019 22:52:33

Bayesian ELO versus Regular ELO: 2/28/2019 12:37:30 
Rento
Level 61

Another problem with BayesElo and current matchmaking is that it sometimes matches players with such a big rating difference that the better player will lose rating even after he wins. Yes, it happens rarely and almost exclusively to the top players, where it's harder to make a good match. But it's still a drawback that regular Elo doesn't have.
"1) Is the goal of the ladder rating to give a small handful of players competing for the top spots the rating they "deserve" (already an incredibly subjective term), or to serve the entire set of 1v1 ladder players over the entire span of ratings?"
The current 1v1 ladder is absolutely terrible at retention of top players. As a consequence the top 10 isn't a real top 10, because most top players don't care to participate in the current system. You can argue that the top 10 constitutes under 3% of the entire player pool. I'll argue that the top 10 is what's being displayed first after you click the community tab, and that people check it and care more about it than about the bottom 100. Finally, it's not like ELO would fix the top 10 while screwing up everything else. I think it would be more fun for everyone if a good win streak gave you a good rating boost even if you have 80 unexpired games.
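The claim that regular Elo never takes points away from a winner follows directly from the update formula, as the sketch below shows. K=32 and the ratings are assumptions for illustration, and the function name is hypothetical.

```python
def elo_win_gain(rating: float, opponent: float, k: float = 32.0) -> float:
    """Points gained for a win under plain Elo (k is an assumed K-factor)."""
    expected = 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))
    return k * (1.0 - expected)  # expected < 1, so the gain is always > 0


# A 2200-rated player beating opponents who are further and further below.
for gap in (0, 200, 400, 800):
    print(gap, round(elo_win_gain(2200, 2200 - gap), 2))
```

The gain shrinks as the gap grows but stays positive, so a win can never lower the winner's rating. An implementation that rounds ratings to whole numbers could still show a gain of zero for very lopsided matchups.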
Edited 2/28/2019 12:38:10


