Rating Systems (1): Elo and its limitations

14 Aug 2020 1919h

Introduction

This is the start of a new series on rating systems, a vastly underrated (pun intended) area of statistics and data science. Rating systems have been a part of my life (and probably yours), from my early days in chess, through games with matchmaking like CS:GO and Valorant, to now wondering whether contract bridge should have one too.

Historical Context

Chess has been around for centuries, and people have spent a great deal of that time arguing over who the best player is. It is perhaps slightly surprising, then, that the first modern rating systems only appeared around or after the end of World War 2. The earliest systems (Ingo and Harkness) were quite simple: they took the average rating of a player's opponents and adjusted it for the results.

It is worth noting that the English Chess Federation (ECF) still operates a rating system (since 1958) built on somewhat similar ideas: for each game, a player's grade is the opponent's grade plus 50 for a win, minus 50 for a loss, or simply the opponent's grade for a draw. The ECF system also caps the grade difference between the two players at 40 points (even if the actual difference is larger), and the player's grade is the average over all games played during a time period.
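The calculation described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this post, not the ECF's official procedure. The function names, the result labels and the choice to apply the 40-point cap by clamping the opponent's grade towards the player's own grade are all assumptions made for the example.

```python
def ecf_game_score(own_grade, opp_grade, result):
    """Grade earned from a single game (illustrative sketch)."""
    # Assumption: the 40-point cap is applied by treating the opponent
    # as being at most 40 points above or below the player's own grade.
    capped_opp = max(own_grade - 40, min(own_grade + 40, opp_grade))
    adjustment = {"win": 50, "draw": 0, "loss": -50}[result]
    return capped_opp + adjustment


def ecf_grade(own_grade, games):
    """Average the per-game scores over all games in the period.

    `games` is a list of (opponent_grade, result) pairs.
    """
    scores = [ecf_game_score(own_grade, opp, res) for opp, res in games]
    return sum(scores) / len(scores)


# A player graded 150 beats a 160, loses to a 100 (capped to 110)
# and draws with another 150.
new_grade = ecf_grade(150, [(160, "win"), (100, "loss"), (150, "draw")])
```

In this example the per-game scores are 210, 60 and 150. The loss to a much weaker opponent is cushioned by the cap, because the opponent is treated as a 110 rather than a 100.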

ECF System

The ECF system, while limited, actually contains some fundamental ideas:

One big criticism is that the ECF rating system is a “lagging” indicator, especially for junior players, who improve faster than the system can catch up.

Elo System

The rating system takes its name from its inventor, Arpad Elo, a Hungarian-American professor of physics who was also a chess player.

Expected Score

The main idea of the rating system is that of the expected score, which is the sum of the probability of winning plus half the probability of drawing (in chess, a score of 1 is given for a win and 0.5 for a draw). Elo suggested scaling the ratings such that a difference of 200 points would give an expected score of 0.75 for the stronger player. The average rating of 1500 was chosen by the US Chess Federation and this is used as the initial rating. Note that no distinction is made between wins and draws in this expected value; a draw is simply half a win.

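The expected score is simply a logistic function of the rating difference. For example, the expected scores of Player A and Player B are given below, where $R_A$ and $R_B$ are their respective ratings:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}}$$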

Note that the two equations are symmetrical.
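To make the expected score concrete, here is a minimal Python sketch. It assumes the conventional base-10 logistic form with a 400-point scale. The function name and the `scale` parameter are my own choices for illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of player A against player B (logistic form)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


e_equal = expected_score(1500, 1500)     # equal ratings
e_stronger = expected_score(1700, 1500)  # 200 points stronger
e_weaker = expected_score(1500, 1700)    # 200 points weaker
```

Equal ratings give exactly 0.5, and the two expected scores in any pairing always sum to 1. A 200-point advantage gives roughly 0.76 under this formula, close to the 0.75 target mentioned above.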

The rating is then updated by adding to the old rating the difference between the actual and expected scores, multiplied by a K-factor: $R'_A = R_A + K(S_A - E_A)$, where $S_A$ is the actual score.
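The update step can be sketched as follows. This is a minimal illustration, not any federation's implementation. The function names are hypothetical, and `k=20` is just an example value, not a recommendation.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of player A against player B (logistic form)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_rating(rating, opp_rating, actual, k=20.0):
    """New rating after one game; `actual` is 1, 0.5 or 0."""
    expected = expected_score(rating, opp_rating)
    return rating + k * (actual - expected)


# A 1500-rated player beats a 1700-rated player.
underdog_new = update_rating(1500, 1700, actual=1.0)
favourite_new = update_rating(1700, 1500, actual=0.0)
```

Because the two expected scores sum to 1 and both players use the same K here, the points gained by one player equal the points lost by the other. The underdog gains about 15 points for the upset, while a win by the favourite would have moved each rating by only about 5.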

Elo originally set K=10, which is now considered too low, i.e. too insensitive and too slow to track actual performance. FIDE tiers K into three different levels:

Mathematical Issues

Practical Issues