The United Nations provides the Word Population Prospects (WPP) dataset on geographic and age distribution of mankind as downloadable XLSX files.
Reading these files in Python is rather easy. First we have to find out how many rows to skip. For the 2019 WPP dataset this value is 16
since row 17
contains all the column headers. The number of rows to skip might be different depending on the dataset. We’re using WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx
in this example.
We can use Pandas read_excel()
function to import the dataset in Python:
import pandas as pd df = pd.read_excel("WPP2019_INT_F03_1_POPULATION_BY_AGE_ANNUAL_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])
This will take a few seconds until the large dataset has been processed. Now we can check if skiprows=16
is the correct value. It is correct if pandas did recognize the column names correctly:
>>> df.columns Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes', 'Country code', 'Type', 'Parent code', 'Reference date (as of 1 July)', '0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90-94', '95-99', '100+'], dtype='object')
Now let’s filter for a country:
russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']
This will show us the population data for multiple years in 5-year intervals from 1950
to 2020
. Now let’s filter for the most recent year:
russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
This will show us a single dataset:
Index 3255 Variant Estimates Region, subregion, country or area * Russian Federation Notes NaN Country code 643 Type Country/Area Parent code 923 Reference date (as of 1 July) 2020 0-4 9271.69 5-9 9350.92 10-14 8174.26 15-19 7081.77 20-24 6614.7 25-29 8993.09 30-34 12543.8 35-39 11924.7 40-44 10604.6 45-49 9770.68 50-54 8479.65 55-59 10418 60-64 10073.6 65-69 8427.75 70-74 5390.38 75-79 3159.34 80-84 3485.78 85-89 1389.64 90-94 668.338 95-99 102.243 100+ 9.407 Name: 3254, dtype: object
How can we plot that data? First, we need to select all the columns that contain age data. We’ll do this by manually inserting the name of the first such column (0-4
) into the following code and assuming that there are no columns after the last age column:
>>> df.columns[df.columns.get_loc("0-4"):] Index(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90-94', '95-99', '100+'], dtype='object')
Now let’s select those columns from the russia
dataset:
most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()] age_columns = df.columns[df.columns.get_loc("0-4"):] russian_age_data = most_recent_russia[age_columns]
Let’s have a look at the dataset:
>>> russian_age_data 0-4 9271.69 5-9 9350.92 10-14 8174.26 15-19 7081.77 20-24 6614.7 25-29 8993.09 30-34 12543.8 35-39 11924.7 40-44 10604.6 45-49 9770.68 50-54 8479.65 55-59 10418 60-64 10073.6 65-69 8427.75 70-74 5390.38 75-79 3159.34 80-84 3485.78 85-89 1389.64 90-94 668.338 95-99 102.243 100+ 9.407
That looks useable, note however that the values are in thousands, i.e. we have to multiply the values by 1000
to obtain the actual estimates of the population. Let’s plot it:
from matplotlib import pyplot as plt plt.style.use("ggplot") plt.title("Age composition of the Russian population (2020)") plt.ylabel("People in age group [Millions]") plt.xlabel("Age group") plt.gcf().set_size_inches(15,5) # Data is given in thousands => divide by 1000 to obtain millions plt.plot(russian_age_data.index, russian_age_data.as_matrix() / 1000., lw=3)
The finished plot will look like this:
Here’s our finished script:
#!/usr/bin/env python3 import pandas as pd df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16) # Filter only russia russia = df[df["Region, subregion, country or area *"] == 'Russian Federation'] # Filter only most recent estimate (1 row) most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()] # Retain only value columns age_columns = df.columns[df.columns.get_loc("0-4"):] russian_age_data = most_recent_russia[age_columns] # Plot! from matplotlib import pyplot as plt plt.style.use("ggplot") plt.title("Age composition of the Russian population (2020)") plt.ylabel("People in age group [Millions]") plt.xlabel("Age group") plt.gcf().set_size_inches(15,5) # Data is given in thousands => divide by 1000 to obtain millions plt.plot(russian_age_data.index, russian_age_data.as_matrix() / 1000., lw=3) # Export as SVG plt.savefig("russian-demographics.svg")