Parsing World Population Prospects (WPP) XLSX data in Python

The United Nations provides the World Population Prospects (WPP) dataset on the geographic and age distribution of mankind as downloadable XLSX files.

Reading these files in Python is fairly easy. First we need to find out how many rows to skip. For the 2019 WPP dataset this value is 16, since row 17 contains all the column headers. The number of rows to skip may differ between datasets. We use WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx in this example.

We can use the pandas function read_excel() to import the dataset into Python:

read_wpp_excel.py

import pandas as pd

df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])

This will take a few seconds while the large dataset is processed. Now we can check whether skiprows=16 is the right value. It is correct if pandas has recognized the column names properly:

df_columns_output.txt

>>> df.columns
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
     'Country code', 'Type', 'Parent code', 'Reference date (as of 1 July)',
     '0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
     '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
     '80-84', '85-89', '90-94', '95-99', '100+'],
    dtype='object')
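
Instead of counting rows by hand, the header row could also be located programmatically. The following is only a minimal sketch under assumptions: find_header_row is a hypothetical helper (not part of pandas), it expects a sheet that was read without a header, and it assumes that the header row is the first row containing the cell "Variant". The small table is made up to stand in for the top of a WPP sheet.

```python
import pandas as pd


def find_header_row(raw: pd.DataFrame, marker: str = "Variant") -> int:
    """Return the 0-based position of the first row containing `marker`.

    `raw` is expected to be a sheet read without a header (header=None),
    so that the title rows and the real header row are all plain data rows.
    """
    for position, (_, row) in enumerate(raw.iterrows()):
        if (row == marker).any():
            return position
    raise ValueError(f"No row contains {marker!r}")


# Tiny made-up stand-in for the top of a WPP sheet (2 title rows, then the header)
raw = pd.DataFrame([
    ["United Nations", None, None],
    ["Population Division", None, None],
    ["Index", "Variant", "Country code"],
    [1, "Estimates", 900],
])

# The 0-based position of the header row equals the number of rows to skip
skiprows = find_header_row(raw)
print(skiprows)  # 2
```

With the real file, raw could be obtained with something like pd.read_excel(filename, header=None, nrows=30), and the returned value passed as skiprows in a second read_excel() call.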

Now we filter for a country:

filter_russia.py

russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']
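
As a variation, the same rows can be selected through the numeric Country code column, which does not depend on the exact spelling of the name. This is a sketch on a made-up miniature table: only the column names and the code 643 for the Russian Federation come from the data shown in this post, everything else (including "Exampleland") is invented for illustration.

```python
import pandas as pd

NAME_COL = "Region, subregion, country or area *"

# Made-up miniature of the WPP table; only the column names and the
# Russian Federation's code 643 are taken from the data shown in this post.
df = pd.DataFrame({
    NAME_COL: ["WORLD", "Russian Federation", "Russian Federation", "Exampleland"],
    "Country code": [900, 643, 643, 1],
    "Type": ["World", "Country/Area", "Country/Area", "Country/Area"],
    "Reference date (as of 1 July)": [2020, 2015, 2020, 2020],
})

by_name = df[df[NAME_COL] == "Russian Federation"]
by_code = df[df["Country code"] == 643]

# Both filters select the same rows
print(by_name.equals(by_code))  # True

# List which names can be selected at all: all rows typed as countries
countries = df.loc[df["Type"] == "Country/Area", NAME_COL].unique()
print(sorted(countries))  # ['Exampleland', 'Russian Federation']
```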

The russia DataFrame now contains the population data for multiple years in 5-year intervals from 1950 to 2020. Next we filter for the most recent year:

most_recent_russia.py

russia.loc[russia["Reference date (as of 1 July)"].idxmax()]

This gives us a single record:

most_recent_russia_output.txt

Index                                                 3255
Variant                                          Estimates
Region, subregion, country or area *    Russian Federation
Notes                                                  NaN
Country code                                           643
Type                                          Country/Area
Parent code                                            923
Reference date (as of 1 July)                         2020
0-4                                                9271.69
5-9                                                9350.92
10-14                                              8174.26
15-19                                              7081.77
20-24                                               6614.7
25-29                                              8993.09
30-34                                              12543.8
35-39                                              11924.7
40-44                                              10604.6
45-49                                              9770.68
50-54                                              8479.65
55-59                                                10418
60-64                                              10073.6
65-69                                              8427.75
70-74                                              5390.38
75-79                                              3159.34
80-84                                              3485.78
85-89                                              1389.64
90-94                                              668.338
95-99                                              102.243
100+                                                 9.407
Name: 3254, dtype: object
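
To see why this works: idxmax() returns the index label of the row with the largest reference date, and .loc[...] with a single label returns that row as a Series named after the label. That is why the output above ends with Name: 3254. The following sketch reproduces this on made-up rows; only the column names and the label 3254 are taken from the output above.

```python
import pandas as pd

# Made-up rows; the labels 3252..3254 mimic the original DataFrame index
rows = pd.DataFrame(
    {"Reference date (as of 1 July)": [2010, 2015, 2020], "0-4": [1.0, 2.0, 3.0]},
    index=[3252, 3253, 3254],
)

# Label of the row with the largest reference date
label = rows["Reference date (as of 1 July)"].idxmax()
print(label)  # 3254

# Selecting a single label yields a Series whose name is that label
record = rows.loc[label]
print(type(record).__name__)  # Series
print(record.name)  # 3254
```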

How can we plot this data? First we need to select all columns that contain age data. We do this by manually inserting the name of the first such column (0-4) into the following code and assuming that there are no further columns after the last age column:

age_columns_index.py

>>> df.columns[df.columns.get_loc("0-4"):]
Index(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
     '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
     '80-84', '85-89', '90-94', '95-99', '100+'],
    dtype='object')
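
If we do not want to rely on the assumption that nothing follows the last age column, the age columns could instead be picked by the shape of their labels. This is a sketch, not the approach used in the rest of this post: the regular expression is our own, the list of age groups is shortened, and the trailing column is hypothetical and only added to show that it gets excluded.

```python
import re

import pandas as pd

columns = pd.Index([
    "Index", "Variant", "Region, subregion, country or area *", "Notes",
    "Country code", "Type", "Parent code", "Reference date (as of 1 July)",
    "0-4", "5-9", "10-14", "95-99", "100+",  # shortened list of age groups
    "Hypothetical trailing column",  # not in the real file; added for illustration
])

# Matches labels such as "0-4" or "95-99", and open-ended ones such as "100+"
AGE_LABEL = re.compile(r"\d+-\d+|\d+\+")

age_columns = [c for c in columns if AGE_LABEL.fullmatch(c)]
print(age_columns)  # ['0-4', '5-9', '10-14', '95-99', '100+']
```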

Now we select these columns from the most recent record of the russia dataset:

prepare_russian_age_data.py

most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
age_columns = df.columns[df.columns.get_loc("0-4"):]

russian_age_data = most_recent_russia[age_columns]

Let's take a look at the dataset:

russian_age_data_output.txt

>>> russian_age_data
0-4      9271.69
5-9      9350.92
10-14    8174.26
15-19    7081.77
20-24     6614.7
25-29    8993.09
30-34    12543.8
35-39    11924.7
40-44    10604.6
45-49    9770.68
50-54    8479.65
55-59      10418
60-64    10073.6
65-69    8427.75
70-74    5390.38
75-79    3159.34
80-84    3485.78
85-89    1389.64
90-94    668.338
95-99    102.243
100+       9.407

This looks usable, but note that the values are given in thousands, i.e. we need to multiply them by 1000 to obtain the actual population estimates. Let's plot it:

plot_russian_age_data.py

from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to get millions
# (Series.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
plt.plot(russian_age_data.index, russian_age_data.to_numpy(dtype=float) / 1000., lw=3)
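
The two unit conversions are easy to mix up: multiplying by 1000 gives head counts, while dividing by 1000 gives the millions used on the y axis. A small sketch using three of the values printed earlier (the variable names are ours):

```python
import pandas as pd

# Three of the values printed earlier, in thousands of people
in_thousands = pd.Series({"0-4": 9271.69, "5-9": 9350.92, "100+": 9.407})

people = in_thousands * 1_000       # absolute head counts
in_millions = in_thousands / 1_000  # unit used on the plot's y axis

print(int(round(people["100+"])))           # 9407
print(float(round(in_millions["0-4"], 2)))  # 9.27
```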

The finished plot looks like this:

[Plot: Age composition of the Russian population (2020)]

Here is our finished script:

russian_demographics_plot.py

#!/usr/bin/env python3
import pandas as pd
df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])
# Keep only the rows for Russia
russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']

# Keep only the most recent estimate (1 row)
most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
# Keep only the value columns
age_columns = df.columns[df.columns.get_loc("0-4"):]
russian_age_data = most_recent_russia[age_columns]

# Plot!
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to get millions
# (Series.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
plt.plot(russian_age_data.index, russian_age_data.to_numpy(dtype=float) / 1000., lw=3)

# Export as SVG
plt.savefig("russian-demographics.svg")
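
To reuse the selection steps for other countries, they could be wrapped in a function. The following is a sketch under the same assumptions as the script ("0-4" is the first age column and nothing follows the last one). latest_age_distribution is a hypothetical helper name, and the miniature table with "Exampleland" and "Otherland" is made up so the example runs without the XLSX file.

```python
import pandas as pd

NAME_COL = "Region, subregion, country or area *"
YEAR_COL = "Reference date (as of 1 July)"


def latest_age_distribution(df: pd.DataFrame, country: str) -> pd.Series:
    """Return the age columns of the most recent row for `country`, in thousands.

    Like the script above, this assumes "0-4" is the first age column and
    that no other columns follow the last age column.
    """
    rows = df[df[NAME_COL] == country]
    if rows.empty:
        raise KeyError(f"No rows found for {country!r}")
    latest = rows.loc[rows[YEAR_COL].idxmax()]
    age_columns = df.columns[df.columns.get_loc("0-4"):]
    return latest[age_columns].astype(float)


# Made-up miniature table with the same column layout as the WPP sheet
df = pd.DataFrame({
    NAME_COL: ["Exampleland", "Exampleland", "Otherland"],
    YEAR_COL: [2015, 2020, 2020],
    "0-4": [10.0, 11.0, 5.0],
    "5-9": [9.0, 10.5, 4.0],
})

print(latest_age_distribution(df, "Exampleland").to_dict())
# {'0-4': 11.0, '5-9': 10.5}
```

With the real DataFrame, latest_age_distribution(df, "Russian Federation") should return the same values as russian_age_data in the script, converted to floats.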
