
Posts 1 - 15 of 15   
Is the new ladder rating system working?: 6/8/2023 13:20:17


Beep Beep I'm A Jeep 
Level 64
Disclaimer: Not claiming 100% correctness, but I’m happy to be corrected by you guys.

The ladder has been in place for a month now, and I thought I would do a quick recap of what we have learned so far.
First of all, I am coming from the perspective of someone who has been wanting a change from Bayeselo for a long time, and I was happy to hear that Fizzer was gonna change to TrueSkill.
All problems solved, I thought, but is it true? We do indeed have TrueSkill in place now, but in practice it has been tweaked to such an extent that the natural spirit of TrueSkill got lost, or at least changed.
There are good and bad changes here, and I wanna discuss them in this post.

Before I dive into it, let me say that I focus on 1v1, but of course this can be extended to the team ladders as well. Some effects are even a bit worse for the team ladders.

Not wanting to sound too negative at this point, in the end I’m going to propose quick and easy solutions, so stay tuned.


What is TrueSkill?

This is not the point of discussion here. For the background of how it works, I recommend these links, but I’m focusing on the practical aspects.
https://www.microsoft.com/en-us/research/wp-content/uploads/2007/01/NIPS2006_0688.pdf
https://trueskill.readthedocs.io/_/downloads/en/latest/pdf/


How is TrueSkill implemented in Warzone ladders?

For this, I basically had to work things out by trial and error, but anyway, here are the parameters: (I’m still not claiming to be 100% correct, since I do not have actual evidence that these are the parameters)

Rating = 3(μ -3σ)

μ_start = 183,33
σ_start = 50

The important invisible parameters are:
β = 500
τ = 0,1
draw_probability = 0

I’m gonna leave this link for people wanting to play around with it:
https://trueskill-calculator.vercel.app/
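To make the numbers above concrete, here is a minimal Python sketch of the displayed rating under both parameter sets. The function name `display_rating` and its `scale` argument are just labels for illustration, and the Warzone values are the reverse-engineered guesses from this post, not confirmed settings.

```python
def display_rating(mu, sigma, scale=1.0):
    """Conservative TrueSkill estimate (mu - 3*sigma), optionally stretched."""
    return scale * (mu - 3 * sigma)

# Assumed Warzone ladder values (reverse-engineered in this post, unconfirmed)
WZ_MU_START, WZ_SIGMA_START, WZ_SCALE = 550 / 3, 50.0, 3.0
# Standard TrueSkill defaults
TS_MU_START, TS_SIGMA_START = 25.0, 25.0 / 3

wz_start = display_rating(WZ_MU_START, WZ_SIGMA_START, WZ_SCALE)
# Limit case: same mu, but no uncertainty left (sigma = 0)
wz_settled = display_rating(WZ_MU_START, 0.0, WZ_SCALE)
ts_start = display_rating(TS_MU_START, TS_SIGMA_START)

print(round(wz_start, 2), round(wz_settled, 2), round(ts_start, 2))
```

Under these assumptions a fresh account is shown at 100, while the same μ with no uncertainty left would be shown at 550. Standard TrueSkill starts at a displayed rating of 0.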

Differences in the implementation to normal TrueSkill:

The standard TrueSkill implementation has these parameters:

Rating = μ -3σ

μ_start = 25
σ_start = 8,33

The important invisible parameters are:
β = 4,166 (0,5*σ)
τ = 0,0833 (0,01*σ)


As you can see, pretty much everything is changed. Now let’s discuss the effects of these changes.

Rating formula: 3(μ -3σ) vs. μ -3σ

I like this change! This basically just stretches the ratings but has no effect in how fast players overtake each other or anything like that. It really just makes the range of ratings a bit bigger, which I like.

μ_start: 183,33 vs. 25
σ_start: 50 vs. 8,333

Again a good change. This is just a design decision. Higher μ means the rating corridor is gonna be wider (similar to my previous point), and higher σ means more uncertainty about the rating. Higher uncertainty practically means that with a lower amount of finished games, you’re gonna be much lower rated.
You could say that a high σ is one of the best tools TrueSkill has at hand to eliminate the previous problem of the Bayeselo system, which was ladder runs with a low number of games.

β = 500 (10*σ) vs. 4,166 (0,5*σ)
τ = 0,1 (0,002*σ) vs. 0,0833 (0,01*σ)

These two have to be looked at together, since they influence each other too much.
To put it simply, β describes the difference in μ that players A and B should have when player A beats player B with about 76% probability.
Put simply, τ describes how volatile μ (and therefore the ratings) is.

Both factors influence μ and σ, but especially with Fizzer's implementation, it can be said that β has a much higher influence on σ and τ has a much higher influence on μ.

The relevant point here is not so much the absolute numbers, but more how they behave in relation to σ. As we can see, Fizzer implemented β with a crazy ratio to σ compared with what was initially intended (10*σ instead of 0,5*σ). For τ, the difference is only a factor of 5, but that still slows down rating updates significantly.
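The 76% figure can be checked with a small sketch of TrueSkill's win-probability model (no draws). The helper name `win_probability` is made up for this example, and β = 500 is the guessed Warzone value from above, not a confirmed one.

```python
import math

def win_probability(mu_diff, beta, sigma_a=0.0, sigma_b=0.0):
    """P(A beats B) without draws: Phi(mu_diff / sqrt(2*beta^2 + sigma_a^2 + sigma_b^2))."""
    denom = math.sqrt(2 * beta**2 + sigma_a**2 + sigma_b**2)
    # Standard normal CDF written with the error function
    return 0.5 * (1 + math.erf(mu_diff / denom / math.sqrt(2)))

# A mu gap of exactly beta gives the same probability in both configurations
p_warzone = win_probability(500, beta=500)         # assumed Warzone beta
p_standard = win_probability(25 / 6, beta=25 / 6)  # standard TrueSkill beta

# With the assumed 3x multiplier, 500 mu corresponds to 1500 displayed rating points
rating_gap_for_76_percent = 3 * 500

print(round(p_warzone, 3), round(p_standard, 3), rating_gap_for_76_percent)
```

Both probabilities come out at roughly 0.76 when the two players' σ are ignored, so what really changes between the configurations is how many rating points that one step is worth.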


Why did Fizzer change the values?

I think he had good reasons to do so, with the following intentions:

Fizzer wants people to start somewhere from where there is a huge range to climb up.
→ This is implemented quite nicely, with the parameters described above stretching the rating range and with a high σ_start. These two in combination give the impression that everyone starts at a rating of 100, when in fact everyone starts at a rating of 550! (3*μ)
That means the average player will slowly climb from 100 to 550 as σ declines, without actually playing better. Just playing average. Most people will not realize this and feel motivated. Genius!

Fizzer wants people to play a lot.
→ This is implemented effectively by the extremely high β and too low τ. σ is consistently coming down, but slowly and almost linearly for a long time. Playing a looooot is greatly encouraged by that because, as I described above, simply by playing a lot you will climb consistently for many games (potentially years).

Fizzer really wants people to play a loooooot.
→ Making the same point again, but it should be emphasized how big the effects of this are. You have to play approximately 700 games to get σ down to where its effects are not so relevant anymore. In comparison, the normal TrueSkill implementation with its β and τ configuration takes about 12 games. Assigning a reasonable rating fast is one of the strengths of TrueSkill, but you can see that this strength was intentionally removed.
700 games vs 12!
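The slow σ decay can be reproduced with a rough sketch of the 1v1 TrueSkill update without draws, using the update equations from the linked paper. Everything about the setup is an assumption for illustration: the Warzone parameters are the guesses from this post, the helper names are made up, and every opponent is a fresh account with alternating wins and losses, which is not how real ladder pairings work.

```python
import math

def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def update_1v1(winner, loser, beta, tau):
    """One TrueSkill update for a 1v1 game without draws.

    winner and loser are (mu, sigma) tuples; returns the updated pair.
    """
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    # Dynamics step: tau inflates each variance a little before every game
    var_w, var_l = sig_w**2 + tau**2, sig_l**2 + tau**2
    c2 = 2 * beta**2 + var_w + var_l
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)   # correction applied to the means
    w = v * (v + t)         # correction applied to the variances
    new_w = (mu_w + var_w / c * v, math.sqrt(var_w * (1 - var_w / c2 * w)))
    new_l = (mu_l - var_l / c * v, math.sqrt(var_l * (1 - var_l / c2 * w)))
    return new_w, new_l

def sigma_after(games, mu0, sigma0, beta, tau):
    """Play `games` games against fresh opponents, alternating win and loss."""
    player = (mu0, sigma0)
    for i in range(games):
        opponent = (mu0, sigma0)  # assumption: every opponent is a new account
        if i % 2 == 0:
            player, _ = update_1v1(player, opponent, beta, tau)
        else:
            _, player = update_1v1(opponent, player, beta, tau)
    return player[1]

warzone_sigma = sigma_after(50, 550 / 3, 50.0, beta=500.0, tau=0.1)
standard_sigma = sigma_after(50, 25.0, 25 / 3, beta=25 / 6, tau=25 / 300)
print(f"Warzone guess: sigma 50 -> {warzone_sigma:.1f} after 50 games")
print(f"Standard:      sigma {25 / 3:.2f} -> {standard_sigma:.2f} after 50 games")
```

In this toy setup the guessed Warzone settings take only a few points off σ in 50 games, while the standard settings cut σ to a fraction of its starting value over the same number of games.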

Fizzer wants climbing up to last a long time, even for the top players.
→ Effectively implemented by the low τ but, more importantly for this one, by the high β. 500 for a 76% win probability is nuts, especially since it is multiplied by 3 in the final formula.
To put this into some perspective, in Elo rating, it is 200 rating difference for this same probability. (https://www.walkofmind.com/programming/chess/elo.htm)
To put it into perspective with an example, the top players on MTL reached a rating of 2300 with a mean starting rating of 1500, an 800-point difference. For this system, as an easy (not 100% accurate) reference, you could say that the 2300 MTL player would reach a rating of about 6550 with the average player sitting at a rating of 550.
But of course reaching that will take a long time (not because of skill issues though).
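The arithmetic behind the 6550 figure, as a small sketch. The Elo expected-score formula is the standard one; β = 500 and the multiplier of 3 are the guessed Warzone values, and the whole extrapolation is a rough reference, not a prediction.

```python
def elo_expected_score(rating_diff):
    """Standard Elo expected score for the higher-rated player."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

p_200 = elo_expected_score(200)   # the "76%" step in Elo

mtl_gap = 2300 - 1500             # top MTL player vs. mean starting rating
steps = mtl_gap / 200             # number of 76% steps between them
ladder_gap = steps * 500 * 3      # assumed beta = 500 and rating multiplier = 3
top_rating = 550 + ladder_gap     # 550 = displayed rating of a settled average player

print(round(p_200, 3), steps, top_rating)
```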

The advantages and disadvantages of this system

I’m going to keep my points short and concise, without deeper explanations at this point. All my conclusions follow from my thoughts and explanations above.

The advantages of this system:

1. Once stabilized with lower σ, it is going to be extremely skill-based.
2. It motivates (especially average) players over a longer time period, because they climb slow and steady.
3. Ladder runs are no longer possible.

The practical problems and disadvantages of this:

1. Until σ is stabilized, it is going to be pretty activity-based.
2. The adjustment of σ is so slow that people potentially won’t feel the effects of climbing steadily, and therefore won’t get the motivation from it.
3. Ratings below 0 are possible and also too likely with such a slowly adjusting σ.
4. People don’t want to play 700 games before getting a stable rating, and this is also not in the spirit of a rating system.
5. New joiners in a year or two will have to finish hundreds of games before even attempting to play for top ranks.
6. It will take years to establish meaningful top ratings.
7. The top ranks in the future are potentially too skill-based (not volatile enough - and yes, I’m actually saying that)


Final thoughts and possible improvements

I am still happy with the change to TrueSkill. As you can see in the advantage list, many of the previous problems have been successfully solved!
The concrete configuration caused some negative effects too. But that’s expected, nothing’s perfect in the first iteration, and it can easily be finetuned.

My proposal:

Make just the following four parameter changes, in order to create a balance between the desired effects (steady climbing, playing a lot, etc.) and an even more enjoyable ladder.

Rating = 2(μ -3σ)
μ_start = 283,33
β = 100
τ = 10

That, imo, creates a nicer balance. What exactly the values should be can be a whole different discussion, and I’d also be happy to participate in that, but I hope I could make clear that they should somehow be changed in this direction. Oh, and if someone pays me for it, I'm gonna create a pretty slide deck for this, since this is a shitty text. I hate texts.

Edited 6/8/2023 13:22:21
Is the new ladder rating system working?: 6/8/2023 13:40:34


alexclusive 
Level 65
That's such a great explanation, thank you for the effort
Is the new ladder rating system working?: 6/8/2023 13:47:32


Johnny Silverhand 
Level 58
I agree that the climb is too slow, and net gain from games exists when it shouldn't IMO.

Let's take the rank 1 player on 2v2 for example, he's 10-2.

He was rated 150 at 9-1
He's rated 151 at 10-2



I don't think rank 1 should elevate in rating from a scenario like that, a win and a loss vs low-ish rated teams should be a net loss for anyone who's highly rated.

Edited 6/8/2023 13:49:03
Is the new ladder rating system working?: 6/8/2023 13:59:14

3.141592653589793238462643383279502884197169399375
Level 60
we know fizzy wants to encourage participation, but has he overdone it?
(that's what I got from the post)
Is the new ladder rating system working?: 6/8/2023 14:47:38


FiveSmith 
Level 60
Interesting, yet a bit difficult to read. I kindly ask for some explanations for common folks like me.

@Beep Beep I'm A Jeep

You have to play approximately 700 games to get σ down to where its effects are not so relevant anymore.

How is this value of "700 games" calculated?

...you could say that the 2300 MTL player would reach a rating of about 6550 with the average player sitting at 550 rating.

Is that really true? Could you please give more details on how that was calculated?
The counterargument to that claim is that Warzone has TrueSkill in QM and CW, and we don't see any 10x difference between top and mean ratings there. We see a 2-3x difference.

My proposal:...
Rating = 2(μ -3σ)
μ_start = 283,33
β = 100
τ = 10

How do these proposed values convert to the "games till rating stabilizes" parameter?

Edited 6/8/2023 14:58:42
Is the new ladder rating system working?: 6/8/2023 15:26:09


TheGreatLeon
Level 61
Let’s start with the obvious: this is a million times better than BayesianElo. Thank you Fizzer for making this change.

I’m not going to touch on the math and specific variables but I do think you outline an interesting trade-off between what is ‘optimal’ in terms of determining the relative ranking of player + generating good matchups and what is ‘optimal’ in terms of encouraging players to join and play + play more.

Qualitatively, right now, I think that balance is being hit well. The pairings and rankings feel good. I’m consistently getting good games against opponents I consider peers. I haven’t had any repeat opponents. The players at the top of the rankings are very good players. In the old ladder, playing against “rank 20” was meaningless - they might be good or bad - whereas now I trust that “rank 20” is a very good but not phenomenal player.

That is probably aided by the fact that activity has a minimal impact at the moment with everyone playing roughly the same number of games. As long as people stay active, it feels like this system is going to work very well.

If there are actually players with negative ratings (or in danger of dropping below 0), I propose we just add 1000 to everyone’s rating. For anyone unfamiliar with the math, there are zero downsides to this approach: it’s purely psychological and won’t change anything in terms of pairings and rankings, but it will avoid this problem.

I wouldn’t alter the system further at the moment. At least at my middling rating, it is working quite well in my opinion.
Is the new ladder rating system working?: 6/8/2023 15:46:50


καλλιστηι 
Level 62
Do I understand it correctly, that if a new player with 100 rating defeats a player with 200 rating, he gets more rating than a 100 rated player who has been on a ladder for a while against the same opponent? (since they both get the same change in μ, but the first player also gets a decrease in σ)?
Is the new ladder rating system working?: 6/8/2023 16:21:28

Fizzer 
Level 64

Warzone Creator
Great analysis. I am considering changes that you suggested here. It was tweaked to draw out the rating increase speed, with the intent of tweaking after it's live. The logic was that it's better to start out making ratings grow too slowly then speed it up in an update than it is to accidentally go the other way.
Is the new ladder rating system working?: 6/8/2023 16:28:11


Beep Beep I'm A Jeep 
Level 64
How is this value of "700 games" calculated?


Not really calculated, as it also depends on the opponent you're playing. The σ change is slightly different depending on the opponent's μ and σ.
But you can estimate it roughly. For example, we know from the current observations that 50 games bring it down by about 3.5 points.
Solve 50-3,5x = 0 --> x ≈ 14. So play 14*50 = 700 games. But that's only a rough estimate, and the decrease should actually slow down, so it's probably much more than 700. But it doesn't matter if it is 500 or 1000 games, the point remains the same.
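The rough estimate above as a small sketch, together with a variant where σ shrinks by a constant fraction per 50 games instead of a constant amount. The second model and the target σ of 10 are assumptions chosen only to illustrate that a slowing decrease pushes the number above 700.

```python
import math

sigma_start = 50.0     # assumed starting sigma
drop_per_block = 3.5   # observed sigma decrease ...
block_size = 50        # ... per this many games

# Linear model: sigma keeps losing 3.5 points per 50 games until it hits 0
games_linear = sigma_start / drop_per_block * block_size

# Geometric model (assumption): sigma keeps the same *relative* drop per block
ratio = 1 - drop_per_block / sigma_start
target_sigma = 10.0    # hypothetical threshold for "not so relevant anymore"
games_geometric = math.log(target_sigma / sigma_start) / math.log(ratio) * block_size

print(round(games_linear), round(games_geometric))
```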

Is that really true? Could you please give more details on how that was calculated?
The counterargument to that claim is that Warzone has TrueSkill in QM and CW, and we don't see any 10x difference between top and mean ratings there. We see a 2-3x difference.


QM is most certainly not TrueSkill and idk what CW ratings even are. Why do you think those are TrueSkill?
The other question: Calculating rating for the average player is very easy. It's just the mean rating without uncertainty. So the average player plays many average games, so the system is certain they're an average player. So it's just Rating = 3(μ -3σ) = 3*183,33 = 550
For the top player, I just took Fizzer's configuration of β, which is that a player with a 76% win probability over another player should be 500 points ahead of them. Then I looked at MTL, which has an Elo configuration. In Elo, this point difference is set at 200. Then I saw that in MTL, the best players are 800 points better than the average player, which is 4x the 200-point difference. Applying this to the 1v1 ladder, and assuming we have the same top player and average player here, this 4x would mean 4x500. So the best player is 2000 μ ahead. But in the final formula, this is also multiplied by 3! Which means 6000 points ahead.
There are problems with this calculation, since you can never truly know how it will behave in reality - for example in Elo also, being 800 points ahead of someone doesn't mean you'd beat them 99.9%, even though the rating indicates that. Those probabilities apply to peers, not across huge rating ranges. But all this doesn't really matter, it is just to show the enormous amount of rating difference that we're gonna see and how long players will have to climb until they eventually stabilize. And for this point, it does not matter if it is 6550 or 4000.

How do these proposed values convert to the "games till rating stabilizes" parameter?


Approximately 80 games until stabilized, but yeah, the sweet spot has to be found.

Do I understand it correctly, that if a new player with 100 rating defeats a player with 200 rating, he gets more rating than a 100 rated player who has been on a ladder for a while against the same opponent? (since they both get the same change in μ, but the first player also gets a decrease in σ)?


It's not so simple, actually, because they don't have the same μ and therefore gain different amounts of μ from the win. Player A has 33,3 μ and player B has 183,33 μ.
But generally speaking and depending on the concrete ratings, both scenarios are possible, either player A or player B getting more final rating from this win.

Edited 6/8/2023 17:02:43
Is the new ladder rating system working?: 6/8/2023 16:30:31


Beep Beep I'm A Jeep 
Level 64
Great analysis. I am considering changes that you suggested here. It was tweaked to draw out the rating increase speed, with the intent of tweaking after it's live. The logic was that it's better to start out making ratings grow too slowly then speed it up in an update than it is to accidentally go the other way.


Thanks Fizzer. I agree with you, it can never be perfect in the beginning.
If you are actually considering changes, I will also think more carefully about the concrete parameters again. Happy to get in touch with you about it :)
Is the new ladder rating system working?: 6/8/2023 17:59:57


FiveSmith 
Level 60
@Beep Beep I'm A Jeep#1
Thanks for the explanations

QM is most certainly not TrueSkill and idk what CW ratings even are. Why do you think those are TrueSkill?

For the QuickMatch ratings:
- I have seen mentions here and there on forums/discords that it uses TrueSkill (for example here: https://www.warzone.com/Forum/302775-quickmatch-questions)
- The TrueSkill was officially stated to be used in the Real-Time Ladder, which, from what I heard, was later converted to QuickMatch https://www.warzone.com/blog/index.php/2014/03/website-update-2-5-real-time-ladder/
- The QM ratings show "TrueSkill-like" behaviour, i.e.:
-- They update instantly after a game
-- There is known to be a ceiling at 500-550, below which new ratings tend to grow in the TrueSkill manner

The CW ratings are the ratings shown on the clans' pages.
Python's Clan War Rating currently is 714.3
We know that it is TrueSkill: I read it on Discord, Fizzer admitted that during the AMA.

Edited 6/8/2023 18:00:02
Is the new ladder rating system working?: 6/8/2023 19:41:44


Rento 
Level 61
RT Ladder was indeed pure TrueSkill.
CW probably is TrueSkill as well but I've never looked into it closely. I don't think that anyone has, to my knowledge.

QM ratings are somewhat based on TrueSkill but with a bunch of additional parameters added, so it really doesn't fit into the TrueSkill definition anymore. For example, new players gain a flat 10 points for each win, and it stays that way until 400 rating or something. And most importantly, there are the limits of a minimum of +1 point for each game won and at most -10 for a loss, no matter how low rated your opponent is. What that means is that if you can maintain over 91% win rate (max 1 loss for every 10 wins), you can grow your rating indefinitely, until you get bored. Which is exactly what happened with Rene and his 3000 global rating. It's not that he can't go higher anymore, or that no one's skilled enough to surpass him. Simply no one else can be arsed to play that many games.
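The break-even point of those limits, as a quick sketch. The +1 and -10 values are the ones quoted above, and the constant and function names are made up for this example.

```python
MIN_GAIN_PER_WIN = 1     # floor quoted above: every win gives at least +1
MAX_LOSS_PER_GAME = 10   # cap quoted above: a loss costs at most 10

def worst_case_net(wins, losses):
    """Net rating change if every win pays the minimum and every loss the maximum."""
    return wins * MIN_GAIN_PER_WIN - losses * MAX_LOSS_PER_GAME

# Win rate above which even the worst case keeps growing
break_even_win_rate = MAX_LOSS_PER_GAME / (MAX_LOSS_PER_GAME + MIN_GAIN_PER_WIN)

print(worst_case_net(10, 1), worst_case_net(11, 1), round(break_even_win_rate, 3))
```

Ten wins exactly cancel one loss in the worst case, so anything better than 10 wins per loss (a win rate just under 91%) keeps growing.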
Is the new ladder rating system working?: 6/9/2023 06:35:40


Derfellios
Level 61
Great post! I agree with the suggested changes but would like to add that the formula for the rating Beep gave was slightly incorrect. The implemented formula is

Rating = max(0, 3(μ -3σ))

which is slightly different. In essence, the change implies that if one has a rating of 0, it is impossible to distinguish ratings of 0 and -100, and so one can win multiple games and not gain any rating. Already, multiple players have reached 0 rating and some have won games without gaining rating (actually the end-of-game message says you "lost" 0 rating). This is very demotivating.
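A small sketch of why the clamp hides progress. The μ and σ values below are hypothetical, picked only so that the unclamped rating is negative before and after a win.

```python
def raw_rating(mu, sigma):
    return 3 * (mu - 3 * sigma)

def shown_rating(mu, sigma):
    """Rating as displayed on the ladder, with the lower bound of 0."""
    return max(0, raw_rating(mu, sigma))

before_win = (110.0, 48.0)   # hypothetical player, raw rating around -102
after_win = (113.0, 47.8)    # hypothetical values after a win, raw rating around -91

print(raw_rating(*before_win), raw_rating(*after_win))
print(shown_rating(*before_win), shown_rating(*after_win))
```

The raw rating improves by about 11 points, but the shown rating stays at 0 both times.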

Of course this problem will only affect a small portion of the player base, but as the solutions are straightforward, I do not understand why this system is in place. It can be changed by removing the lower border of 0 of the rating, or a higher starting value than 100 is needed such that there is more room below the starting value.

Beep quickly raised this issue in a single line, but I don't think his solution will fix this problem.
Is the new ladder rating system working?: 6/9/2023 06:44:45


Orcinus orca
Level 60
Yeah, the QM scenario where a certain skill level allows you to pump your rating indefinitely is certainly unacceptable on the ladders.

The 700 games to stability is something that, if accurate, needs to be fixed. If I assume an average of 5 turns per game (very conservative, IMO), 5 games at a time and 1 day per turn (the default boot time), then an individual can complete 1 game/day, and it would take nearly two years to reach 700 games. Obviously this is a very rough estimate of game completion rate, but checking the top of the 1v1 ladder:

1. Beep Beep Jeep: 53 games/30 days = 1.8/day
2. Rufus: 64 games/30 days = 2.1/day
3. SerFen: 57 games/30 days = 1.9/day
4. Kryzy: 91 games/30 days = 3.0/day
5. Gunslinger: 52 games/30 days = 1.7/day

And these are the top 5, the truly dedicated skilled players. If I check the neighborhood of 50 on the ladder:

50. Matt431: 19 games/30 days = 0.63/day
51. Adreso: 24 games/21 days = 1.1/day
52. Pablito: 25 games/30 days = 0.83/day
53. Schubei: 32 games/30 days = 1.1/day
54. Tear-Z: 33 games/30 days = 1.1/day

Based on this I think the 1/day assumption for pace of play is a decent rough estimate at least on the 1v1 ladder. Team ladders obviously much slower.
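The same pace-of-play arithmetic as a sketch, using the game counts and day counts listed above. The 700-game target is Beep's rough estimate, and averaging the individual rates is just one simple way to summarize them.

```python
# (games finished, days on ladder) as listed above
top5 = [(53, 30), (64, 30), (57, 30), (91, 30), (52, 30)]
around_rank_50 = [(19, 30), (24, 21), (25, 30), (32, 30), (33, 30)]

def mean_rate(sample):
    """Average of the individual games-per-day rates."""
    return sum(games / days for games, days in sample) / len(sample)

def years_to_reach(target_games, games_per_day):
    return target_games / games_per_day / 365

top_rate = mean_rate(top5)
mid_rate = mean_rate(around_rank_50)
for label, rate in [("top 5", top_rate), ("around rank 50", mid_rate), ("1 game/day", 1.0)]:
    print(f"{label}: {rate:.2f} games/day -> {years_to_reach(700, rate):.1f} years to 700 games")
```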

I understand we want to encourage persistent play on the ladder (I think the rewards and permanent elo are good enough incentives for that), but rating should converge to skill level fairly quickly. The one good thing about the old system is for say a 2000 rated player it only took them about 7 wins to reach the point of playing opponents of comparable skill level.

Edited 6/9/2023 06:45:16
Is the new ladder rating system working?: 6/9/2023 12:23:29


Beep Beep I'm A Jeep 
Level 64
For the QuickMatch ratings:
- I have seen mentions here and there on forums/discords that it uses TrueSkill (for example here:


Okay well, it might be based on TrueSkill somehow, but as Rento explained, it behaves nothing like the spirit of TrueSkill.

The CW ratings is the ratings, that are shown on the clans' pages.
Python's Clan War Rating currently is 714.3
We know that it is TrueSkill: read in on discord, Fizzer admitted that during the AMA.


I see. It might be TrueSkill, but let me make 2 remarks. First, as you can see, TrueSkill has a few parameters, and you can tweak them and create entirely different results. Secondly, a clan will never even come close to the same domination over other clans as an individual player could over other players. It is also a "number of participants" game, and there are simply more individuals than clans. For your question, that means a 3x difference between top and mid-ranked clans makes sense.

Rating = max(0, 3(μ -3σ))
In essence, the change implies that if one has a rating of 0, it is impossible to distinguish ratings of 0 and -100 and so one can win multiple games and not gain any rating.
Beep quickly raised this issue in a single line, but I don't think his solution will fix this problem.


You're right, thanks for this addition.
My solution addresses this by putting the starting rating to 400 and adjusting faster, but it might not be sufficient. I think the improved version here is to change the formula in my solution to
Rating = 2(μ -3σ) + 600
in order to create a nice clean start at 1000 and give more room downwards. This has no drawbacks at all, except the psychological effect of "falling a lot". I would still argue that this is better than having a rating of 0, which suggests you are literally playing so badly that it cannot get any worse.

Edited 6/9/2023 12:25:26