How to read IDF diabetes statistics in Python using Pandas
The International Diabetes Federation (IDF) provides a data portal with various statistics related to diabetes.
In this post we’ll show how to read the Diabetes estimates (20-79 y) / People with diabetes, in 1,000s data export in CSV format using pandas.
First, download IDF (people-with-diabetes--in-1-000s).csv from the data page.
Now we can parse the CSV file:
import pandas as pd

# Download at https://www.diabetesatlas.org/data/en/indicators/1/
df = pd.read_csv("IDF (people-with-diabetes--in-1-000s).csv")

# The year columns contain strings like "12,345.67" (or "-" for missing values),
# which pandas does not parse as numbers by default.
# Convert them to floats and multiply by the thousands factor.
for column in df.columns:
    try:
        int(column)  # Raises ValueError if the column name is not a year number
        df[column] = df[column].apply(lambda s: None if s == "-" else float(s.replace(",", "")) * 1000)
    except ValueError:
        pass  # Not a year column, leave it unchanged
As you can see in the postprocessing step, the number of diabetes patients is given in 1000s in the CSV, so we multiply the values by 1000 to obtain the actual numbers.
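As an alternative to the loop above, pandas can do most of the parsing while reading the file, using the thousands and na_values parameters of read_csv. The following is only a minimal sketch: it uses a small in-memory sample with made-up rows and values instead of the real export, and it assumes that the year columns are exactly the columns whose names consist only of digits.

```python
import io

import pandas as pd

# Made-up sample that mimics the layout of the IDF export (not real data)
sample_csv = '''Country/Territory,Type,2011,2019
"Region A",Region,"1,234.5","2,000.0"
"Country B",Country,-,"12.5"
'''

# thousands="," lets pandas parse "1,234.5" as 1234.5,
# na_values=["-"] turns the "-" placeholder into NaN
sample = pd.read_csv(io.StringIO(sample_csv), thousands=",", na_values=["-"])

# Assumption: year columns are the ones whose name consists only of digits
year_columns = [c for c in sample.columns if c.isdigit()]

# The values are given in 1000s, so scale them to actual numbers
sample[year_columns] = sample[year_columns] * 1000
```

With the real file, you would pass the filename instead of the io.StringIO object. Missing values end up as NaN here, so they are skipped by most pandas aggregations.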
If you want to modify the data columns (i.e. the columns named after a year), you can use this simple template:
for column in df.columns:
    try:
        int(column)  # Will raise ValueError if column is not a year number
        # Whatever you do here will only be applied to year columns
        df[column] = df[column] * 0.75  # Example of how to modify a column
        # But note that if your code raises a ValueError, it will be silently ignored!
    except ValueError:
        pass
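If you prefer not to run your own code inside a try block, you can collect the year columns first and then modify them all at once. This is just a sketch: year_columns is a hypothetical helper (not part of pandas), the demo values are made up, and str.isdigit() is slightly stricter than int() since it rejects names with whitespace or a sign.

```python
import pandas as pd


def year_columns(frame):
    """Return the column names that look like a year number, e.g. "2019"."""
    return [c for c in frame.columns if str(c).isdigit()]


# Made-up demo data with the same column layout as the parsed CSV
demo = pd.DataFrame({
    "Country/Territory": ["Region A", "Region B"],
    "2011": [1000.0, 2000.0],
    "2019": [1500.0, 2500.0],
})

years = year_columns(demo)
# Modify all year columns in one step; exceptions are no longer swallowed
demo[years] = demo[years] * 0.75
```

Since there is no except clause, any error in your modification code shows up immediately instead of leaving a column silently unchanged.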
Let’s plot some data:
from matplotlib import pyplot as plt

regions = df[df["Type"] == "Region"]  # Only regions, not individual countries

plt.style.use("ggplot")
plt.gcf().set_size_inches(20, 4)
plt.ylabel("Diabetes patients [millions]")
plt.xlabel("Region")
plt.title("Diabetes patients in 2019 by region")
plt.bar(regions["Country/Territory"], regions["2019"] / 1e6)
Note that if you use a more recent dataset than the version I’m using, the 2019 column might not exist in your CSV file. Choose an appropriate column in that case.
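One way to choose the column automatically is to fall back to the most recent year column if the preferred one is missing. The sketch below assumes that year columns are named with digits only. pick_year_column is a hypothetical helper, and the demo frame contains made-up values.

```python
import pandas as pd


def pick_year_column(frame, preferred="2019"):
    """Return the preferred year column if present, else the most recent one."""
    years = sorted((c for c in frame.columns if str(c).isdigit()), key=int)
    if not years:
        raise ValueError("No year columns found")
    return preferred if preferred in years else years[-1]


# Made-up demo frame without a 2019 column
demo_recent = pd.DataFrame({
    "Country/Territory": ["Region A"],
    "2011": [1000.0],
    "2021": [1800.0],
})

year = pick_year_column(demo_recent)  # Falls back to "2021" here
```

You could then use the year variable instead of the hardcoded "2019" in the plot title and in the plt.bar() call.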