Analysis of Top 100 Chess Players

Introduction

In this notebook I will be analysing the top 100 chess players over a number of years. The dataset includes data from July 2000 to June 2017 and has information on name, rank, country, number of games played and birth year, among others. There is no overall aim for the analysis. This is my first project after completing a course in Python and Pandas and I want to practise some of the features explained in the course.

Data Cleaning

# Import the libraries I will use in the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read in the chess players data
chess_data = pd.read_csv("fide_historical.csv")
chess_data.head()

	ranking_date	rank	name	title	country	rating	games	birth_year
0	27-07-00	1	Kasparov, Garry	g	RUS	2849	35	1963
1	27-07-00	2	Kramnik, Vladimir	g	RUS	2770	23	1975
2	27-07-00	3	Anand, Viswanathan	g	IND	2762	23	1969
3	27-07-00	4	Morozevich, Alexander	g	RUS	2756	28	1977
4	27-07-00	5	Adams, Michael	g	ENG	2755	38	1971

chess_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11511 entries, 0 to 11510
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   ranking_date  11511 non-null  object
 1   rank          11511 non-null  int64 
 2   name          11511 non-null  object
 3   title         11511 non-null  object
 4   country       11511 non-null  object
 5   rating        11511 non-null  int64 
 6   games         11511 non-null  int64 
 7   birth_year    11511 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 719.6+ KB

Straight away I can see that the ranking_date column is stored as text (object), so it should be converted to datetime format to make it easier to work with.

chess_data["ranking_date"] = pd.to_datetime(chess_data["ranking_date"], format="%d-%m-%y")
chess_data.head()

	ranking_date	rank	name	title	country	rating	games	birth_year
0	2000-07-27	1	Kasparov, Garry	g	RUS	2849	35	1963
1	2000-07-27	2	Kramnik, Vladimir	g	RUS	2770	23	1975
2	2000-07-27	3	Anand, Viswanathan	g	IND	2762	23	1969
3	2000-07-27	4	Morozevich, Alexander	g	RUS	2756	28	1977
4	2000-07-27	5	Adams, Michael	g	ENG	2755	38	1971
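As a minimal sketch of why the explicit format string matters, the snippet below parses a few day-month-year strings with a two-digit year. Only the first string comes from the dataset; the other two are made up for illustration.

```python
import pandas as pd

# Sample strings in the same day-month-year layout as the ranking_date column.
# Only "27-07-00" appears in the data shown above; the others are invented.
raw_dates = pd.Series(["27-07-00", "05-10-00", "01-01-01"])

# An explicit format removes any ambiguity between day-first and month-first
parsed = pd.to_datetime(raw_dates, format="%d-%m-%y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Without the format argument, a string such as "05-10-00" could be read as either 5 October or 10 May, so stating the layout is the safer choice.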

chess_data.size

chess_data["title"].value_counts()

title
g     11485
wg       14
f         9
m         3
Name: count, dtype: int64

The dataframe holds roughly 92,000 values (11,511 rows across 8 columns), and as expected the vast majority of entries are grandmasters. With so few entries in the other three categories, analysing the players' titles would not be useful, so I will remove the column from the dataframe.

chess_data.drop("title", axis=1, inplace=True)
chess_data.head()

	ranking_date	rank	name	country	rating	games	birth_year
0	2000-07-27	1	Kasparov, Garry	RUS	2849	35	1963
1	2000-07-27	2	Kramnik, Vladimir	RUS	2770	23	1975
2	2000-07-27	3	Anand, Viswanathan	IND	2762	23	1969
3	2000-07-27	4	Morozevich, Alexander	RUS	2756	28	1977
4	2000-07-27	5	Adams, Michael	ENG	2755	38	1971
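To put a number on how lopsided the title column was before it was dropped, here is a small sketch that turns the counts from the value_counts() output into percentages. The counts are copied from that output; the variable names are my own.

```python
import pandas as pd

# Title counts copied from the value_counts() output earlier in the notebook
title_counts = pd.Series({"g": 11485, "wg": 14, "f": 9, "m": 3})

# Share of all rows held by each title, as a percentage
title_share = (title_counts / title_counts.sum() * 100).round(2)
print(title_share)
```

On the full dataframe, chess_data["title"].value_counts(normalize=True) should give the same shares as fractions, provided it is run before the column is dropped.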

Analysis of the First Month

To make the initial analysis simpler, I want to take just the data for the first month.

# Only take the first month of data
first_month = chess_data[chess_data["ranking_date"] == "2000-07-27"].copy()

There are a few things I want to try out with this data. First, I want to see how many players there are from each country. Then I want to plot the ratings and see how they vary across the top 100 players. Finally, I want to see how the number of games played and age relate to the rank of the players.

# Find the distribution of players among countries
first_month["country"].value_counts()

country
RUS    25
CHN     6
ISR     5
ENG     5
GER     5
USA     5
UKR     5
NED     4
ARM     4
FRA     4
HUN     3
GEO     2
BUL     2
CRO     2
SUI     2
UZB     2
MDA     2
BIH     2
CZE     2
SLO     1
IND     1
BLR     1
POL     1
SWE     1
ESP     1
BRA     1
EST     1
DEN     1
BEL     1
SVK     1
PER     1
KAZ     1
Name: count, dtype: int64

There are lots of countries with just 1 or 2 players in the top 100, but Russia takes up a whole quarter of the rankings. This makes sense because the Soviet Union ran state-sponsored chess programmes, and during the Cold War chess was used to demonstrate intellectual superiority. The strong chess culture in Russia seems to have continued into 2000, the year being analysed.

# Plot the ratings of the top 100 players
first_month.plot(x="rank", y="rating", xlabel="Rank", ylabel="Rating", title="Rank Against Rating")

<Axes: title={'center': 'Rank Against Rating'}, xlabel='Rank', ylabel='Rating'>

We can see from the graph that the top 100 players range from around 2600 to around 2850. The curve flattens towards the right, which suggests there are probably many players rated around 2550-2600 who narrowly miss the cutoff for the top 100.
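One rough way to check the flattening without relying on the eye is to compare the rating gaps between neighbouring ranks at the top and at the bottom of the list. The sketch below uses ten ratings: the first five are from the head of the table, while the last five are invented purely to illustrate the idea.

```python
import pandas as pd

# First five ratings are from the table above; the last five are made up
ratings = pd.Series(
    [2849, 2770, 2762, 2756, 2755, 2750, 2748, 2747, 2746, 2746],
    index=range(1, 11),
    name="rating",
)

# Rating drop from each rank to the next (larger means a bigger step down)
gaps = -ratings.diff().dropna()

# Average step size over the first four and the last four steps
top_gap = gaps.iloc[:4].mean()
bottom_gap = gaps.iloc[-4:].mean()
print(top_gap, bottom_gap)
```

If the bottom gap is much smaller than the top gap on the real data, that would support the idea that many players sit just below the cutoff.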

# Plot the number of games played against the rank
first_month.plot(x="rank", 
                 y="games",
                 xlabel="Rank",
                 ylabel="Games Played",
                 title="Number of Games Played by Rank")

<Axes: title={'center': 'Number of Games Played by Rank'}, xlabel='Rank', ylabel='Games Played'>

There has been a lot of controversy over the method used to calculate ratings in chess, including whether the number of games played affects the rating a player is likely to have. For example, a highly rated player who plays lots of games might lose just one and drop many rating points as a result, whereas someone who plays fewer games has less opportunity to make a mistake and lose. However, this graph seems to show that, at least among the top rated players, the number of games does not affect the rank a player can expect to have. One interesting feature of the graph is that some players have not played any games at all.

first_month["games"].value_counts().get(0)

There are 6 players who did not play any games in this period. Despite not playing, their rating remained high enough to keep them in the top 100.
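The claim that games played and rank are unrelated was read off the graph. A rank correlation would put a number on it, and the sketch below shows one way to compute it, using only the five rows visible in the table above. Five rows are far too few to conclude anything, so this only illustrates the calculation.

```python
import pandas as pd

# First five rows of the first-month data, copied from the table shown earlier
sample = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "games": [35, 23, 23, 28, 38],
})

# Spearman correlation is the Pearson correlation of the ranked values;
# ranking by hand keeps the sketch to plain pandas
rho = sample["rank"].rank().corr(sample["games"].rank())
print(round(rho, 3))
```

A value near 0 on the full first_month dataframe would back up the impression from the graph, while a value near 1 or -1 would contradict it.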

# Create a new column with the age of the player
first_month["age"] = 2000 - first_month["birth_year"]

first_month.head()

	ranking_date	rank	name	country	rating	games	birth_year	age
0	2000-07-27	1	Kasparov, Garry	RUS	2849	35	1963	37
1	2000-07-27	2	Kramnik, Vladimir	RUS	2770	23	1975	25
2	2000-07-27	3	Anand, Viswanathan	IND	2762	23	1969	31
3	2000-07-27	4	Morozevich, Alexander	RUS	2756	28	1977	23
4	2000-07-27	5	Adams, Michael	ENG	2755	38	1971	29

# Plot the age against the rank
first_month.plot(x="rank", 
                 y="age",
                 xlabel="Rank",
                 ylabel="Age",
                 title="Age of Players by Rank")

<Axes: title={'center': 'Age of Players by Rank'}, xlabel='Rank', ylabel='Age'>

Again, the age of the player does not seem to correlate with their ranking. From the graph we can see that most of the top 100 chess players are aged between 20 and 50, but there is one extreme value around 70 years.

# Find the maximum age
first_month["age"].max()

# Find the minimum age
first_month["age"].min()
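The maximum and minimum on their own do not say who the players are. A possible way to find out, sketched here on the five rows shown in the table above, is to use idxmax and idxmin to pull out the whole row.

```python
import pandas as pd

# First five rows of the first-month data, copied from the table shown earlier
sample = pd.DataFrame({
    "name": ["Kasparov, Garry", "Kramnik, Vladimir", "Anand, Viswanathan",
             "Morozevich, Alexander", "Adams, Michael"],
    "birth_year": [1963, 1975, 1969, 1977, 1971],
})
sample["age"] = 2000 - sample["birth_year"]

# idxmax/idxmin return the index label of the extreme value,
# so .loc can pull out the full row for that player
oldest = sample.loc[sample["age"].idxmax()]
youngest = sample.loc[sample["age"].idxmin()]
print(oldest["name"], oldest["age"])
print(youngest["name"], youngest["age"])
```

The same two lines applied to first_month should identify the player behind the extreme value of around 70 years.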

Analysis of the Whole Dataset

Now I want to try some analysis on all of the data. I want to do the following:

See how the highest rating has changed over time
Plot the average age of the top 100 players over time
Draw a bar chart of the total number of players from each country

# Find the highest rating player for each month
highest_rating = chess_data[chess_data["rank"] == 1]

#highest_rating.plot(x="ranking_date",
                    #y="rating",
                    #xlabel="Date",
                    #ylabel="Rating")

(highest_rating
 .groupby("ranking_date")
 .agg({"rating": "mean"})
 .plot(xlabel="Year", ylabel="Maximum Rating", title="Maximum Rating Over the Years")
)

<Axes: title={'center': 'Maximum Rating Over the Years'}, xlabel='Year', ylabel='Maximum Rating'>

There is a clear trend in the maximum rating over time. From 2000 to 2007, the maximum rating decreased by almost 70 points. It then increased by about 100 points to around 2880 in 2014, before decreasing again for the rest of the period covered by the data. The algorithm for calculating ratings is fairly complicated, but the minimum in 2007 suggests that competition was fiercest at this time and there was no clear "best" player who could easily beat everyone else.

chess_data["rating"].max()

# Look up the row with the highest rating instead of hard-coding the value
chess_data[chess_data["rating"] == chess_data["rating"].max()]

	ranking_date	rank	name	country	rating	games	birth_year
7675	2014-05-27	1	Carlsen, Magnus	NOR	2882	1	1990

The highest rating in the dataset is 2882, held by Magnus Carlsen in May 2014, which supports this theory. Around 2014, Carlsen was by far the best player in the world and had no trouble beating almost everyone.

chess_data["age"] = chess_data["ranking_date"].dt.year - chess_data["birth_year"]
(chess_data
 .groupby("ranking_date")
 .agg({"age": "mean"})
 .plot(xlabel="Year", ylabel="Age", title="Average Age of Players Over Time")
)

<Axes: title={'center': 'Average Age of Players Over Time'}, xlabel='Year', ylabel='Age'>

This shows an interesting pattern in how the ages of the top 100 players have changed over time. Although the average remains fairly stable, ranging from just under 29 to almost 34, there is a clear trend: it decreased from 2000 to 2011 and has been increasing since. This might be because a new generation of players entered the top 100 around 2011 and has not yet been replaced. It's worth analysing this in more detail.

# Make a list of all the people in the top 100 on 2011-01-27
query = (chess_data["ranking_date"] == "2011-01-27") & (chess_data["rank"] < 101)
names_2011 = chess_data[query]["name"]

# For every ranking date from 2011 onwards, count how many of the players from the 2011-01-27 top 100 are still listed
remaining_players = (chess_data
                        .query("name in @names_2011")
                        .groupby("ranking_date")
                        .nunique()
                        .query("ranking_date.dt.year > 2010")["name"]
                    )

remaining_players.plot(ylim=(0, 115), xlabel="Year", ylabel="Remaining players", title="Players Remaining from 2011 Over the Years")

<Axes: title={'center': 'Players Remaining from 2011 Over the Years'}, xlabel='Year', ylabel='Remaining players'>

This graph is consistent with the hypothesis that the average age of the top 100 has been rising because the generation that was in the top 100 in 2011 has largely stayed there and grown older. The number of players remaining from 2011 stays relatively high, but it is gradually decreasing, which could mean that the average age will start to fall again as another generation of players replaces the current one.
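The counting step behind this graph can be hard to follow in one chained expression, so here is a minimal sketch of the same idea on a tiny made-up table. The names and dates are hypothetical, and isin is used in place of query purely for readability.

```python
import pandas as pd

# Made-up rankings: three dates, three players per date (names are hypothetical)
rankings = pd.DataFrame({
    "ranking_date": pd.to_datetime(
        ["2011-01-27"] * 3 + ["2012-01-27"] * 3 + ["2013-01-27"] * 3
    ),
    "name": ["A", "B", "C", "A", "B", "D", "A", "D", "E"],
})

# Players present on the reference date
reference = rankings.loc[rankings["ranking_date"] == "2011-01-27", "name"]

# For each date, count how many distinct reference players are still listed
remaining = (rankings[rankings["name"].isin(reference)]
             .groupby("ranking_date")["name"]
             .nunique())
print(remaining.tolist())
```

In this toy example the count falls as players C and then B drop out, which is the same kind of decline the graph shows for the real 2011 cohort.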

chess_data["year"] = chess_data["ranking_date"].dt.year

# Count occurrences for each country per year
country_counts = chess_data.groupby(["year", "country"]).size().reset_index(name="count")

# Calculate the total count for each year
year_totals = country_counts.groupby("year")["count"].sum().reset_index(name="total")

# Merge the total count back into the data
country_counts = pd.merge(country_counts, year_totals, on='year')

# Normalise counts by dividing by the total count for each year
country_counts["normalised_count"] = country_counts["count"] / country_counts['total']

# Rank countries by normalised count within each year
country_counts["rank"] = country_counts.groupby("year")["normalised_count"].rank(ascending=False, method="dense")

# Filter for top 3 countries per year
top_countries = country_counts[country_counts["rank"] <= 3]

# Create a pivot table for normalised counts
normalised_pivot = top_countries.pivot(index="year", columns="country", values="normalised_count").fillna(0)

# Ensure only the top 3 countries for each year are included
normalised_pivot = normalised_pivot.apply(lambda row: row.nlargest(3), axis=1).fillna(0)

plot = normalised_pivot.plot.bar(figsize=(10, 6), stacked=True, xlabel="Year", ylabel="Proportion of Total Players")
plot.legend(title="Country", bbox_to_anchor=(1.05, 1), loc="upper left")

<matplotlib.legend.Legend at 0x20cb35493d0>

To be honest I have no idea how to order the bars for each year, so it's hard to compare how countries have changed over time. This is something I'll have to look at later when I have more experience. However, the graph does show that the top 3 countries every year made up a similar proportion of the total top 100 players, around 35 to 40 percent. It's also clear that Russia has remained dominant throughout the whole period, with other notable countries including Ukraine and China.
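One possible approach to the ordering problem, which I have not tried on the real data, is to sort the pivot table's columns by their total share before plotting. The stacking order then follows overall size and stays the same for every year. The proportions below are made up for illustration.

```python
import pandas as pd

# Made-up proportions for three countries over three years (not real results)
pivot = pd.DataFrame(
    {"CHN": [0.05, 0.06, 0.08], "RUS": [0.25, 0.24, 0.22], "UKR": [0.07, 0.09, 0.08]},
    index=pd.Index([2000, 2001, 2002], name="year"),
)

# Order the columns by their total share so that the largest country is always
# stacked first and every country keeps the same position in each bar
column_order = pivot.sum().sort_values(ascending=False).index
ordered = pivot[column_order]

# ordered.plot.bar(stacked=True) would then draw the bars in this fixed order
print(list(ordered.columns))
```

This keeps one consistent order across all years rather than re-sorting each bar individually, which should make it easier to follow a single country over time.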

Evaluation

I was able to successfully use the skills learnt in the Python and Pandas course to carry out simple analysis of chess players, and provide some data visualisation. In my analysis, I was able to find the following key points:

Russia is very dominant, taking up a quarter of the top 100 in the first year
There are likely lots of people bordering the cutoff rating for the top 100
Despite the debate over how ratings are calculated, the number of games played does not seem to affect rank among the top 100; there is also no clear pattern between age and rank
The maximum rating was lowest around 2007, suggesting competition was highest at this point
The average age of the top 100 players decreased from 2000 to 2011, and then started increasing again; further analysis supported the idea that this was due to a new generation of chess players who have not been replaced yet

My next goal is to continue improving my fundamentals in data science with Python. I will find a course that teaches more specific statistical analysis, since the analysis in this project was very basic, with just averages and plotting of data. At some point, I will also need to revisit the bar chart and find a way to improve its readability.
