Parsing World Population Prospects (WPP) XLSX data in Python
The United Nations provides the Word Population Prospects (WPP) dataset on geographic and age distribution of mankind as downloadable XLSX files.
Reading these files in Python is rather easy. First we have to find out how many rows to skip. For the 2019 WPP dataset this value is 16
since row 17
contains all the column headers. The number of rows to skip might be different depending on the dataset. We’re using WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx
in this example.
We can use Pandas read_excel()
function to import the dataset in Python:
import pandas as pd
df = pd.read_excel("WPP2019_INT_F03_1_POPULATION_BY_AGE_ANNUAL_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])
This will take a few seconds until the large dataset has been processed. Now we can check if skiprows=16
is the correct value. It is correct if pandas did recognize the column names correctly:
>>> df.columns
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
'Country code', 'Type', 'Parent code', 'Reference date (as of 1 July)',
'0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
'40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
'80-84', '85-89', '90-94', '95-99', '100+'],
dtype='object')
Now let’s filter for a country:
russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']
This will show us the population data for multiple years in 5-year intervals from 1950
to 2020
. Now let’s filter for the most recent year:
russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
This will show us a single dataset:
Index 3255
Variant Estimates
Region, subregion, country or area * Russian Federation
Notes NaN
Country code 643
Type Country/Area
Parent code 923
Reference date (as of 1 July) 2020
0-4 9271.69
5-9 9350.92
10-14 8174.26
15-19 7081.77
20-24 6614.7
25-29 8993.09
30-34 12543.8
35-39 11924.7
40-44 10604.6
45-49 9770.68
50-54 8479.65
55-59 10418
60-64 10073.6
65-69 8427.75
70-74 5390.38
75-79 3159.34
80-84 3485.78
85-89 1389.64
90-94 668.338
95-99 102.243
100+ 9.407
Name: 3254, dtype: object
How can we plot that data? First, we need to select all the columns that contain age data. We’ll do this by manually inserting the name of the first such column (0-4
) into the following code and assuming that there are no columns after the last age column:
>>> df.columns[df.columns.get_loc("0-4"):]
Index(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
'40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
'80-84', '85-89', '90-94', '95-99', '100+'],
dtype='object')
Now let’s select those columns from the russia
dataset:
most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
age_columns = df.columns[df.columns.get_loc("0-4"):]
russian_age_data = most_recent_russia[age_columns]
Let’s have a look at the dataset:
>>> russian_age_data
0-4 9271.69
5-9 9350.92
10-14 8174.26
15-19 7081.77
20-24 6614.7
25-29 8993.09
30-34 12543.8
35-39 11924.7
40-44 10604.6
45-49 9770.68
50-54 8479.65
55-59 10418
60-64 10073.6
65-69 8427.75
70-74 5390.38
75-79 3159.34
80-84 3485.78
85-89 1389.64
90-94 668.338
95-99 102.243
100+ 9.407
That looks useable, note however that the values are in thousands, i.e. we have to multiply the values by 1000
to obtain the actual estimates of the population. Let’s plot it:
from matplotlib import pyplot as plt
plt.style.use("ggplot")
plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
plt.plot(russian_age_data.index, russian_age_data.as_matrix() / 1000., lw=3)
The finished plot will look like this:
Here’s our finished script:
#!/usr/bin/env python3
import pandas as pd
df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16)
# Filter only russia
russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']
# Filter only most recent estimate (1 row)
most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
# Retain only value columns
age_columns = df.columns[df.columns.get_loc("0-4"):]
russian_age_data = most_recent_russia[age_columns]
# Plot!
from matplotlib import pyplot as plt
plt.style.use("ggplot")
plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
plt.plot(russian_age_data.index, russian_age_data.as_matrix() / 1000., lw=3)
# Export as SVG
plt.savefig("russian-demographics.svg")