In this notebook I will be analysing the top 100 chess players over a number of years. The dataset includes data from July 2000 to June 2017 and has information on name, rank, country, number of games played and birth year, among others. There is no overall aim for the analysis. This is my first project after completing a course in Python and Pandas and I want to practise some of the features explained in the course.
# Import the libraries I will use in the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read in the chess players data
chess_data = pd.read_csv("fide_historical.csv")
chess_data.head()
| | ranking_date | rank | name | title | country | rating | games | birth_year |
|---|---|---|---|---|---|---|---|---|
| 0 | 27-07-00 | 1 | Kasparov, Garry | g | RUS | 2849 | 35 | 1963 |
| 1 | 27-07-00 | 2 | Kramnik, Vladimir | g | RUS | 2770 | 23 | 1975 |
| 2 | 27-07-00 | 3 | Anand, Viswanathan | g | IND | 2762 | 23 | 1969 |
| 3 | 27-07-00 | 4 | Morozevich, Alexander | g | RUS | 2756 | 28 | 1977 |
| 4 | 27-07-00 | 5 | Adams, Michael | g | ENG | 2755 | 38 | 1971 |
chess_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11511 entries, 0 to 11510
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   ranking_date  11511 non-null  object
 1   rank          11511 non-null  int64
 2   name          11511 non-null  object
 3   title         11511 non-null  object
 4   country       11511 non-null  object
 5   rating        11511 non-null  int64
 6   games         11511 non-null  int64
 7   birth_year    11511 non-null  int64
dtypes: int64(4), object(4)
memory usage: 719.6+ KB
Straight away I can see that the ranking_date column should be converted to datetime format to make it easier to work with.
chess_data["ranking_date"] = pd.to_datetime(chess_data["ranking_date"], format="%d-%m-%y")
chess_data.head()
| | ranking_date | rank | name | title | country | rating | games | birth_year |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000-07-27 | 1 | Kasparov, Garry | g | RUS | 2849 | 35 | 1963 |
| 1 | 2000-07-27 | 2 | Kramnik, Vladimir | g | RUS | 2770 | 23 | 1975 |
| 2 | 2000-07-27 | 3 | Anand, Viswanathan | g | IND | 2762 | 23 | 1969 |
| 3 | 2000-07-27 | 4 | Morozevich, Alexander | g | RUS | 2756 | 28 | 1977 |
| 4 | 2000-07-27 | 5 | Adams, Michael | g | ENG | 2755 | 38 | 1971 |
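As a small illustration of why the explicit `format` argument matters, the sketch below parses two date strings in the same day-month-year layout as the `ranking_date` column. The names `raw` and `parsed` are made up for this example, and the second date is invented. Spelling out `%d-%m-%y` removes any ambiguity about which part is the day, the month, or the two-digit year.

```python
import pandas as pd

# Two strings in day-month-year layout; the second one is an invented example
raw = pd.Series(["27-07-00", "05-10-00"])

# %d = day, %m = month, %y = two-digit year
parsed = pd.to_datetime(raw, format="%d-%m-%y")

# Once parsed, the .dt accessor exposes the date parts as numbers
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```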
chess_data.size
92088
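Note that `size` counts individual cells rather than rows: here 11,511 rows times 8 columns gives 92,088. The sketch below shows the difference between `size`, `shape` and `len` on a tiny stand-in frame; the name `toy` is made up for this example and the values are a few copied from the table above.

```python
import pandas as pd

# Tiny stand-in frame: 3 rows, 2 columns
toy = pd.DataFrame({"rank": [1, 2, 3], "rating": [2849, 2770, 2762]})

n_rows, n_cols = toy.shape
print(toy.shape)  # (rows, columns)
print(toy.size)   # rows * columns
print(len(toy))   # rows only
```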
chess_data["title"].value_counts()
title
g     11485
wg       14
f         9
m         3
Name: count, dtype: int64
There are around 92,000 individual values (11,511 rows across 8 columns), and as expected the vast majority of entries are grandmasters. With so few entries in the other three categories, analysing the players' titles would not be useful, so I will remove the column from the dataframe.
chess_data.drop("title", axis=1, inplace=True)
chess_data.head()
| | ranking_date | rank | name | country | rating | games | birth_year |
|---|---|---|---|---|---|---|---|
| 0 | 2000-07-27 | 1 | Kasparov, Garry | RUS | 2849 | 35 | 1963 |
| 1 | 2000-07-27 | 2 | Kramnik, Vladimir | RUS | 2770 | 23 | 1975 |
| 2 | 2000-07-27 | 3 | Anand, Viswanathan | IND | 2762 | 23 | 1969 |
| 3 | 2000-07-27 | 4 | Morozevich, Alexander | RUS | 2756 | 28 | 1977 |
| 4 | 2000-07-27 | 5 | Adams, Michael | ENG | 2755 | 38 | 1971 |
To make the initial analysis simpler, I want to take just the data for the first month.
# Only take the first month of data
first_month = chess_data[chess_data["ranking_date"] == "2000-07-27"].copy()
There are a few things I want to try out with this data. First, I want to see how many players there are from each country. Then I want to plot the ratings and see how they vary across the top 100 players. Finally, I want to see how the number of games played and age relate to the rank of the players.
# Find the distribution of players among countries
first_month["country"].value_counts()
country
RUS    25
CHN     6
ISR     5
ENG     5
GER     5
USA     5
UKR     5
NED     4
ARM     4
FRA     4
HUN     3
GEO     2
BUL     2
CRO     2
SUI     2
UZB     2
MDA     2
BIH     2
CZE     2
SLO     1
IND     1
BLR     1
POL     1
SWE     1
ESP     1
BRA     1
EST     1
DEN     1
BEL     1
SVK     1
PER     1
KAZ     1
Name: count, dtype: int64
There are lots of countries with just 1 or 2 players in the top 100, but Russia takes up a whole quarter of the rankings. This makes sense because during the Soviet era, state-sponsored programmes promoted chess as a way to show intellectual superiority. The strong culture of chess in Russia must have continued even to 2000, the year that is being analysed.
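To read each country's share directly rather than working it out from the counts, `value_counts` accepts `normalize=True`. The sketch below uses an invented list of eight country codes, not the real data, and the names `countries` and `shares` are made up for this example.

```python
import pandas as pd

# Invented country column for 8 players (not the real data)
countries = pd.Series(
    ["RUS", "RUS", "CHN", "ENG", "RUS", "CHN", "USA", "RUS"], name="country"
)

# normalize=True returns proportions that sum to 1 instead of raw counts
shares = countries.value_counts(normalize=True)
print(shares)
```

On the notebook's data the equivalent call would be `first_month["country"].value_counts(normalize=True)`.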
# Plot the ratings of the top 100 players against their rank
first_month.plot(x="rank", y="rating", xlabel="Rank", ylabel="Rating", title="Rank Against Rating")
<Axes: title={'center': 'Rank Against Rating'}, xlabel='Rank', ylabel='Rating'>
We can see from the graph that the ratings of the top 100 players range from 2600 to around 2850. The curve looks as though it levels off towards the right, which suggests there are probably lots of players rated around 2550-2600 who barely miss the cutoff for the top 100.
# Plot the number of games played against the rank
first_month.plot(x="rank",
y="games",
xlabel="Rank",
ylabel="Games Played",
title="Number of Games Played by Rank")
<Axes: title={'center': 'Number of Games Played by Rank'}, xlabel='Rank', ylabel='Games Played'>
There has been lots of controversy over the method used to calculate chess ratings, including whether the number of games played affects the rating a player is likely to have. For example, a highly rated player who plays lots of games might lose just one and drop a lot of rating points as a result, whereas someone who plays fewer games has less opportunity to make a mistake and lose. However, this graph seems to show that, at least among the top-rated players, the number of games does not affect the rank a player can expect to have. One interesting feature of the graph is that some players have not played any games at all.
first_month["games"].value_counts().get(0)
6
There are 6 players who did not play any games in this period. Despite not playing, their ratings remained high enough to keep them in the top 100.
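Judging a relationship from a line plot alone is difficult, so one way to put a number on it is a correlation coefficient. This is a minimal sketch on an invented six-row table rather than the real data; the names `toy`, `r` and `n_inactive` are made up for this example. `Series.corr` computes the Pearson coefficient by default, where values near 0 indicate little linear relationship.

```python
import pandas as pd

# Invented mini table (not the real data)
toy = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5, 6],
    "games": [35, 0, 23, 10, 38, 5],
})

# Pearson correlation coefficient, always between -1 and 1
r = toy["rank"].corr(toy["games"])
print(round(r, 2))

# Counting inactive players with a boolean mask
n_inactive = (toy["games"] == 0).sum()
print(n_inactive)
```

On the notebook's data the same idea would be `first_month["rank"].corr(first_month["games"])`.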
# Create a new column with the age of the player
first_month["age"] = 2000 - first_month["birth_year"]
first_month.head()
| | ranking_date | rank | name | country | rating | games | birth_year | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000-07-27 | 1 | Kasparov, Garry | RUS | 2849 | 35 | 1963 | 37 |
| 1 | 2000-07-27 | 2 | Kramnik, Vladimir | RUS | 2770 | 23 | 1975 | 25 |
| 2 | 2000-07-27 | 3 | Anand, Viswanathan | IND | 2762 | 23 | 1969 | 31 |
| 3 | 2000-07-27 | 4 | Morozevich, Alexander | RUS | 2756 | 28 | 1977 | 23 |
| 4 | 2000-07-27 | 5 | Adams, Michael | ENG | 2755 | 38 | 1971 | 29 |
# Plot the age against the rank
first_month.plot(x="rank",
y="age",
xlabel="Rank",
ylabel="Age",
title="Age of Players by Rank")
<Axes: title={'center': 'Age of Players by Rank'}, xlabel='Rank', ylabel='Age'>
Again, the age of the player does not seem to correlate with their ranking. From the graph we can see that most of the top 100 chess players are aged between 20 and 50, but there is one extreme value around 70 years.
# Find the maximum age
first_month["age"].max()
69
# Find the minimum age
first_month["age"].min()
17
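The `max` and `min` calls give the extreme ages but not who they belong to. A possible way to look up the full rows is `idxmax` and `idxmin`, sketched below on invented players; the names, birth years and the variables `toy`, `oldest` and `youngest` are all hypothetical.

```python
import pandas as pd

# Invented players (names and birth years are not from the dataset)
toy = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "birth_year": [1931, 1975, 1983],
})
toy["age"] = 2000 - toy["birth_year"]

# idxmax/idxmin return the index label of the extreme value,
# which .loc then uses to fetch the whole row
oldest = toy.loc[toy["age"].idxmax()]
youngest = toy.loc[toy["age"].idxmin()]
print(oldest["name"], oldest["age"])
print(youngest["name"], youngest["age"])
```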
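Now I want to try some analysis on all of the data. I want to look at the highest rating over time, the average age of the players over time, and the countries with the most players each year.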
# Select the top-ranked player on each ranking date
highest_rating = chess_data[chess_data["rank"] == 1]
# Plot the top rating over time (the mean covers any date with a tie for first place)
(highest_rating
    .groupby("ranking_date")
    .agg({"rating": "mean"})
    .plot(xlabel="Year", ylabel="Maximum Rating", title="Maximum Rating Over the Years")
)
<Axes: title={'center': 'Maximum Rating Over the Years'}, xlabel='Year', ylabel='Maximum Rating'>
There is a clear trend in the maximum rating over time. From 2000 to 2007, the maximum rating decreased by almost 70 points. It then increased by about 100 points to around 2880 in 2014, before decreasing again for the rest of the period covered by the data. The algorithm for calculating ratings is pretty complicated, but the minimum in 2007 suggests that competition was fiercest at this time and there was no clear "best" player who could easily beat everyone else.
chess_data["rating"].max()
2882
chess_data[chess_data["rating"] == 2882]
| | ranking_date | rank | name | country | rating | games | birth_year |
|---|---|---|---|---|---|---|---|
| 7675 | 2014-05-27 | 1 | Carlsen, Magnus | NOR | 2882 | 1 | 1990 |
The highest rating in the dataset is 2882, held by Magnus Carlsen, which supports this theory. Around 2014, Magnus Carlsen was by far the best player in the world and had no trouble beating almost everyone.
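An alternative to filtering on `rank == 1` is to ask for the maximum rating on each date directly, and to use `idxmax` to see who held it. This is only a sketch on invented rows; the dates, names, ratings and the variables `toy`, `top_rating` and `leaders` are hypothetical.

```python
import pandas as pd

# Invented excerpt with two ranking dates (not the real data)
toy = pd.DataFrame({
    "ranking_date": pd.to_datetime(
        ["2014-04-27", "2014-04-27", "2014-05-27", "2014-05-27"]
    ),
    "name": ["Player A", "Player B", "Player A", "Player B"],
    "rating": [2800, 2790, 2795, 2810],
})

# Highest rating on each date
top_rating = toy.groupby("ranking_date")["rating"].max()

# Index label of the top-rated row on each date, used to fetch those rows
leader_rows = toy.groupby("ranking_date")["rating"].idxmax()
leaders = toy.loc[leader_rows, ["ranking_date", "name", "rating"]]

print(top_rating)
print(leaders)
```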
chess_data["age"] = chess_data["ranking_date"].dt.year - chess_data["birth_year"]
(chess_data
.groupby("ranking_date")
.agg({"age": "mean"})
.plot(xlabel="Year", ylabel="Age", title="Average Age of Players Over Time")
)
<Axes: title={'center': 'Average Age of Players Over Time'}, xlabel='Year', ylabel='Age'>
This shows an interesting pattern in how the ages of the top 100 players have changed over time. Although the average remains fairly stable, ranging from just under 29 to almost 34, there is a clear trend: it decreased from 2000 to 2011 and has been increasing since. This might be because a new generation of chess players arrived around 2011 and has not yet been replaced. It's worth analysing this in more detail.
# Make a list of all the people in the top 100 on 2011-01-27
query = (chess_data["ranking_date"] == "2011-01-27") & (chess_data["rank"] < 101)
names_2011 = chess_data[query]["name"]
# For every ranking date from 2011 onwards, count how many of the 2011 top 100 are still listed
remaining_players = (chess_data
    .query("name in @names_2011")
    .groupby("ranking_date")
    .nunique()
    .query("ranking_date.dt.year > 2010")["name"]
)
remaining_players.plot(ylim=(0, 115), xlabel="Year", ylabel="Remaining players", title="Players Remaining from 2011 Over the Years")
<Axes: title={'center': 'Players Remaining from 2011 Over the Years'}, xlabel='Year', ylabel='Remaining players'>
This graph supports the hypothesis that the average age of the top 100 has been increasing because the generation of players present around 2011 has largely stayed in the rankings and grown older. The number of players remaining from 2011 stays relatively high, but it is gradually decreasing, which could suggest that the average age will start to fall again as another generation of players replaces the current one.
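The cohort count above can also be written with `isin` and a column-level `nunique`, which some readers may find easier to follow than the chained `query` calls. The sketch below uses invented single-letter names over three dates; `toy`, `cohort` and `remaining` are hypothetical names for this example.

```python
import pandas as pd

# Invented listings: three ranking dates, three players per date
toy = pd.DataFrame({
    "ranking_date": pd.to_datetime(
        ["2011-01-27"] * 3 + ["2012-01-27"] * 3 + ["2013-01-27"] * 3
    ),
    "name": ["A", "B", "C", "A", "B", "D", "A", "D", "E"],
})

# Names listed on the reference date
cohort = toy.loc[toy["ranking_date"] == "2011-01-27", "name"]

# For each date, count how many distinct cohort members are still listed
remaining = (
    toy[toy["name"].isin(cohort)]
    .groupby("ranking_date")["name"]
    .nunique()
)
print(remaining)
```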
chess_data["year"] = chess_data["ranking_date"].dt.year
# Count occurrences for each country per year
country_counts = chess_data.groupby(["year", "country"]).size().reset_index(name="count")
# Calculate the total count for each year
year_totals = country_counts.groupby("year")["count"].sum().reset_index(name="total")
# Merge the total count back into the data
country_counts = pd.merge(country_counts, year_totals, on='year')
# Normalise counts by dividing by the total count for each year
country_counts["normalised_count"] = country_counts["count"] / country_counts['total']
# Rank countries by normalised count within each year
country_counts["rank"] = country_counts.groupby("year")["normalised_count"].rank(ascending=False, method="dense")
# Filter for top 3 countries per year
top_countries = country_counts[country_counts["rank"] <= 3]
# Create a pivot table for normalised counts
normalised_pivot = top_countries.pivot(index="year", columns="country", values="normalised_count").fillna(0)
# Ensure only the top 3 countries for each year are included
normalised_pivot = normalised_pivot.apply(lambda row: row.nlargest(3), axis=1).fillna(0)
plot = normalised_pivot.plot.bar(figsize=(10, 6), stacked=True, xlabel="Year", ylabel="Proportion of Total Players")
plot.legend(title="Country", bbox_to_anchor=(1.05, 1), loc="upper left")
<matplotlib.legend.Legend at 0x20cb35493d0>
To be honest I have no idea how to order the bars for each year, so it's hard to compare how countries have changed over time. This is something I'll have to look at later when I have more experience. However, the graph does show that the top 3 countries every year made up a similar proportion of the total top 100 players, around 35 to 40 percent. It's also clear that Russia has remained dominant throughout the whole period, with other notable countries including Ukraine and China.
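One possible way to make the years easier to compare, offered as a sketch rather than a fix for the chart above, is to keep the same set of countries in every year so that the columns (and therefore the bar segments) always appear in the same order. The example uses `pd.crosstab` with `normalize="index"` on invented listings; the data and the names `toy`, `shares`, `top3` and `fixed` are hypothetical.

```python
import pandas as pd

# Invented listings for two years (not the real data)
toy = pd.DataFrame({
    "year": [2000] * 5 + [2001] * 5,
    "country": ["RUS", "RUS", "UKR", "CHN", "ENG",
                "RUS", "RUS", "UKR", "CHN", "USA"],
})

# Share of each country's listings within each year (each row sums to 1)
shares = pd.crosstab(toy["year"], toy["country"], normalize="index")

# Pick the three countries with the largest average share across all years
top3 = shares.mean().nlargest(3).index

# Same three columns for every year, so each year is directly comparable
fixed = shares[top3]
print(fixed)
```

Calling `fixed.plot.bar(stacked=True)` or `fixed.plot()` on the result would then draw the same countries in the same order for every year. The trade-off is that a country that was in the top 3 for only a year or two may no longer appear.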
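I was able to successfully use the skills learnt in the Python and Pandas course to carry out simple analysis of chess players and produce some data visualisation. The key points from my analysis were that Russia held the largest share of top 100 places throughout the period, that neither the number of games played nor age showed a clear relationship with rank, that the top rating dipped around 2007 before peaking at 2882 in 2014, and that the average age of the top 100 fell until around 2011 and then rose again.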
My next goal is to continue improving my fundamentals in data science with Python. I will find a course that teaches more specific statistical analysis, since the analysis in this project was very basic, with just averages and plotting of data. At some point, I will also need to revisit the bar chart and find a way to improve its readability.