pandas

What fraction of the year has passed until a given Timestamp in pandas?

To compute what fraction of the year has passed since the start of the year, use this function:

import pandas as pd

def fraction_of_year_passed(date):
    """Compute what fraction of the current year has already passed up to the given date"""
    start_of_year = pd.Timestamp(date.year, 1, 1)
    start_of_next_year = pd.Timestamp(date.year + 1, 1, 1)
    # Compute seconds in entire year and seconds since start of year
    entire_year_seconds = (start_of_next_year - start_of_year).total_seconds()
    seconds_since_start_of_year = (date - start_of_year).total_seconds()
    return seconds_since_start_of_year / entire_year_seconds

Usage example:

print(fraction_of_year_passed(pd.Timestamp("2020-03-01"))) # prints 0.16393442622950818

Detailed explanation:

First, we define the start of the calendar year that date belongs to, and the start of the following calendar year:

start_of_year = pd.Timestamp(date.year, 1, 1)
start_of_next_year = pd.Timestamp(date.year + 1, 1, 1)

Now we compute the number of seconds in the entire year and the number of seconds passed between the start of the year and date:

entire_year_seconds = (start_of_next_year - start_of_year).total_seconds()
seconds_since_start_of_year = (date - start_of_year).total_seconds()

The rest is simple: Just divide seconds_since_start_of_year by entire_year_seconds to obtain the fraction of the year that has passed until date.
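As a quick sanity check, the following sketch restates the function (reading the year from the date argument) and evaluates it at two points where the result is easy to verify by hand. 2021 is not a leap year, so noon on the 2nd of July is exactly 182.5 of 365 days after the start of the year:

```python
import pandas as pd

def fraction_of_year_passed(date):
    """Fraction of date's calendar year that has elapsed at date"""
    start_of_year = pd.Timestamp(date.year, 1, 1)
    start_of_next_year = pd.Timestamp(date.year + 1, 1, 1)
    entire_year_seconds = (start_of_next_year - start_of_year).total_seconds()
    seconds_since_start_of_year = (date - start_of_year).total_seconds()
    return seconds_since_start_of_year / entire_year_seconds

start = fraction_of_year_passed(pd.Timestamp("2021-01-01"))
middle = fraction_of_year_passed(pd.Timestamp("2021-07-02 12:00"))
print(start)   # prints 0.0
print(middle)  # prints 0.5
```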

Posted by Uli Köhler in pandas, Python

How to compute number of days in a year in Pandas

In our previous post we showed how to use the pendulum library to compute the number of days in a given year.

This post shows how to achieve the same using pandas:

import pandas as pd
def number_of_days_in_year(year):
    start = pd.Timestamp(year, 1, 1)
    end = pd.Timestamp(year + 1, 1, 1)
    return (end - start).days

Usage example:

print(number_of_days_in_year(2020)) # Prints 366
print(number_of_days_in_year(2021)) # Prints 365

Explanation:

First, we define the start date to be the first day (1st of January) of the year we’re interested in:

start = pd.Timestamp(year, 1, 1)

Now we generate the end date, which is the 1st of January of the following year:

end = pd.Timestamp(year + 1, 1, 1)

The rest is simple: Just compute the difference (end - start) and ask pandas to give us the number of days:

(end - start).days
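If you only need to distinguish 365 from 366 days, a possible alternative is the is_leap_year property of a Timestamp. The function name below is made up for this sketch:

```python
import pandas as pd

def number_of_days_in_year_via_leap_flag(year):
    """Hypothetical variant: derive the day count from the leap year flag"""
    return 366 if pd.Timestamp(year, 1, 1).is_leap_year else 365

print(number_of_days_in_year_via_leap_flag(2020)) # Prints 366
print(number_of_days_in_year_via_leap_flag(2021)) # Prints 365
```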

 

Posted by Uli Köhler in pandas, Python

How to generate range of dates in pandas

In this example, we’ll create a list of pandas Timestamp objects that represent 100 consecutive days, starting at a fixed date:

start_date = pd.Timestamp("2020-03-01")

Generating the 100 consecutive days is easy:

all_days = [start_date + pd.Timedelta(d, "days") for d in range(100)]

Note that range(100) will generate all numbers from 0 up to and including 99. Hence, [pd.Timedelta(d, "days") for d in range(100)] will generate a list of Timedeltas that represent 0 days, 1 day, 2 days, …, 99 days.

Full example:

import pandas as pd

start_date = pd.Timestamp("2020-03-01")
all_days = [start_date + pd.Timedelta(d, "days") for d in range(100)]

print(all_days)
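As an alternative sketch, pandas can generate the same days using pd.date_range(). Note that this returns a DatetimeIndex instead of a list; wrap it in list() if you need a plain list of Timestamp objects:

```python
import pandas as pd

start_date = pd.Timestamp("2020-03-01")
# 100 consecutive days, one entry per calendar day
all_days = pd.date_range(start=start_date, periods=100, freq="D")

print(len(all_days))  # Prints 100
print(all_days[-1])   # Prints 2020-06-08 00:00:00
```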

 

Posted by Uli Köhler in pandas, Python

Split pandas DataFrame every time a Series is True

In our previous post we explored how to Split pandas DataFrame every time a column is True.

This slightly modified function also works if the given Series is not a column in the DataFrame:

def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True. Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

Full example

We’ll use the ZeroCrossing column we built in our previous post on How to detect value change in pandas string column/series which itself builds on our post on How to create pandas time series DataFrame example dataset. Based on that example we add the modified utility function shown above:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.iloc[0, df.columns.get_loc("ZeroCrossing")] = False

def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True. Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

# Print result
split_frames = list(split_dataframe_by_series(df, df["ZeroCrossing"]))
print(f"Split DataFrame into {len(split_frames)} separate frames by zero-crossing")

Note that converting the result of split_dataframe_by_series() into a list might not be necessary depending on your application. If possible, I recommend iterating over the DataFrames directly using a for loop, e.g.:

for df_section in split_dataframe_by_series(df, df["ZeroCrossing"]):
    pass # TODO: Your code goes here!
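A different technique for the same task is to group by the cumulative sum of the boolean Series. This is only a sketch with a made-up function name and a small synthetic dataset, and it behaves differently from the function above: every row ends up in exactly one frame, and each row where the Series is True starts a new frame.

```python
import pandas as pd

def split_dataframe_by_cumsum(df, series):
    """Hypothetical alternative: split df into non-overlapping frames"""
    # cumsum() increases by one at every True value, so all rows up to
    # (but not including) the next True value share one group number
    for _, part in df.groupby(series.cumsum()):
        yield part

# Small synthetic dataset (assumption, not the example CSV)
index = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame({"Value": range(6)}, index=index)
flags = pd.Series([False, False, True, False, True, False], index=index)

parts = list(split_dataframe_by_cumsum(df, flags))
print([len(part) for part in parts])  # Prints [2, 2, 2]
```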

 

Posted by Uli Köhler in pandas, Python

Split pandas DataFrame every time a column is True

TL;DR

If the Series you want to use to split is a column in the DataFrame, continue reading this post. Else, read Split pandas DataFrame every time a Series is True.

Use this utility function:

def split_dataframe_by_column(df, column):
    """
    Split a DataFrame where a column is True. Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[df[column]].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

# Usage example:
list(split_dataframe_by_column(df, "ZeroCrossing"))

Note that one or more of those dataframes might be empty.

Full example:

We’ll use the ZeroCrossing column we built in our previous post on How to detect value change in pandas string column/series which itself builds on our post on How to create pandas time series DataFrame example dataset. Based on that example we add the utility function shown above:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.iloc[0, df.columns.get_loc("ZeroCrossing")] = False

def split_dataframe_by_column(df, column):
    """Split a DataFrame where a column is True. Yields a number of dataframes"""
    previous_index = df.index[0]

    for split_point in df[df[column]].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

# Print result
split_frames = list(split_dataframe_by_column(df, "ZeroCrossing"))
print(f"Split DataFrame into {len(split_frames)} separate frames by zero-crossing")
# This prints "Split DataFrame into 20 separate frames by zero-crossing"
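To see how the frames are cut, here is a sketch on a small synthetic dataset. It is a variant of the function above, not the original: it slices with .loc to make the label-based slicing explicit and uses a None sentinel instead of catching UnboundLocalError. Since label-based slices include both end points, the row at each split point appears in both neighbouring frames:

```python
import pandas as pd

def split_dataframe_by_column(df, column):
    """Variant of the utility function using explicit .loc slicing"""
    previous_index = df.index[0]
    split_point = None
    for split_point in df[df[column]].index:
        # .loc slices by label and includes both end points
        yield df.loc[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset (only if there was a split point)
    if split_point is not None:
        yield df.loc[split_point:]

# Small synthetic dataset (assumption, not the example CSV)
index = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame({
    "Value": range(6),
    "ZeroCrossing": [False, False, True, False, True, False],
}, index=index)

frames = list(split_dataframe_by_column(df, "ZeroCrossing"))
print([len(frame) for frame in frames])  # Prints [3, 3, 2]
```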

 

Posted by Uli Köhler in pandas, Python

Get index where column is True in pandas

TL;DR

Simply use

df[df["ZeroCrossing"]].index

Full example:

We’ll use the ZeroCrossing column we built in our previous post on How to detect value change in pandas string column/series which itself builds on our post on How to create pandas time series DataFrame example dataset. Based on that example, we only modify the last line:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.iloc[0, df.columns.get_loc("ZeroCrossing")] = False

# Print result
print(df[df["ZeroCrossing"]].index)

This prints

DatetimeIndex(['2020-05-25 20:05:10.040874', '2020-05-25 20:05:10.090874',
               '2020-05-25 20:05:10.140874', '2020-05-25 20:05:10.190874',
               '2020-05-25 20:05:10.240874', '2020-05-25 20:05:10.290874',
               '2020-05-25 20:05:10.340874', '2020-05-25 20:05:10.390874',
               '2020-05-25 20:05:10.440774', '2020-05-25 20:05:10.490874',
               '2020-05-25 20:05:10.540874', '2020-05-25 20:05:10.590874',
               '2020-05-25 20:05:10.640774', '2020-05-25 20:05:10.690874',
               '2020-05-25 20:05:10.740874', '2020-05-25 20:05:10.790874',
               '2020-05-25 20:05:10.840874', '2020-05-25 20:05:10.890774',
               '2020-05-25 20:05:10.940874'],
              dtype='datetime64[ns]', name='Timestamp', freq=None)
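If you don't need the filtered DataFrame itself, you can also apply the boolean mask to the index directly. A minimal sketch using a small synthetic DataFrame instead of the example dataset:

```python
import pandas as pd

# Small synthetic dataset (assumption, not the example CSV)
index = pd.date_range("2020-01-01", periods=4, freq="D")
df = pd.DataFrame({"ZeroCrossing": [False, True, False, True]}, index=index)

via_filter = df[df["ZeroCrossing"]].index
# Apply the boolean mask to the index without filtering the DataFrame
via_mask = df.index[df["ZeroCrossing"].to_numpy()]

print(via_mask.equals(via_filter))  # Prints True
```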
Posted by Uli Köhler in pandas, Python

How to convert pandas Timedelta to seconds

TL;DR

Use

my_timedelta / np.timedelta64(1, 's')

Full example

import pandas as pd
import numpy as np
import time

# Create timedelta
t1 = pd.Timestamp("now")
time.sleep(3)
t2 = pd.Timestamp("now")
my_timedelta = t2 - t1

# Convert timedelta to seconds
my_timedelta_in_seconds = my_timedelta / np.timedelta64(1, 's')
print(my_timedelta_in_seconds) # prints 3.00154
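Other ways to obtain the same number are Timedelta.total_seconds() and dividing by a one-second pandas Timedelta. The following sketch uses a fixed Timedelta so the results are reproducible, and also shows the equivalent for a whole Series of Timedeltas:

```python
import numpy as np
import pandas as pd

my_timedelta = pd.Timedelta(milliseconds=3500)

print(my_timedelta / np.timedelta64(1, "s"))   # prints 3.5
print(my_timedelta.total_seconds())            # prints 3.5
print(my_timedelta / pd.Timedelta(seconds=1))  # prints 3.5

# For a Series of Timedeltas, use the .dt accessor
durations = pd.Series([my_timedelta, 2 * my_timedelta])
seconds = durations.dt.total_seconds()
print(seconds.tolist())  # prints [3.5, 7.0]
```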

 

Posted by Uli Köhler in pandas, Python

How to detect value change in pandas string column/series

TL;DR

In order to get a series that is True every time the input string column changes, use

my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]

The first value of this Series will always be True since the value is considered to be NaN before the start of the series (due to the behaviour of shift()). In order to force the first value to be False, use

my_column_changes.iloc[0] = False

In order to get the rows in the dataframe where the column changes, use

df[my_column_changes]

or use this one-liner:

df[df["MyStringColumn"].shift() != df["MyStringColumn"]]

In order to assign this value to a new column in the DataFrame, use e.g.

df["MyStringColumnChanges"] = df["MyStringColumn"].shift() != df["MyStringColumn"]

Full example:

First we load our example from our previous post on How to create pandas time series DataFrame example dataset:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

Now we create a new column that contains "Positive" if the sine wave value in the "Sine" column is positive or zero, or "Negative" if that value is negative:

df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})

Now we create the ZeroCrossing column using the method shown above:

# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]

… and set the first entry to False since we don’t consider the start of the series to be a zero crossing:

df.iloc[0, df.columns.get_loc("ZeroCrossing")] = False

Now we can use

df[df["ZeroCrossing"]]

to show the rows in the DataFrame where the zero crossing happened:

                                    Sine   Cosine SinePositive  ZeroCrossing
Timestamp                                                                   
2020-05-25 20:05:10.040874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.090874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.140874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.190874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.240874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.290874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.340874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.390874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.440774 -2.450532e-15 -1.00000     Negative          True
2020-05-25 20:05:10.490874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.540874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.590874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.640774 -1.960673e-15 -1.00000     Negative          True
2020-05-25 20:05:10.690874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.740874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.790874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.840874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.890774  4.901063e-15  1.00000     Positive          True
2020-05-25 20:05:10.940874 -6.283144e-03 -0.99998     Negative          True

 

Full example code:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.iloc[0, df.columns.get_loc("ZeroCrossing")] = False

# Print result
print(df[df["ZeroCrossing"]])
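If you want to try the method without downloading the dataset, here is a minimal sketch on a hand-written Series. The values are made up, and dtype=object is set explicitly so that the missing value introduced by shift() is a plain NaN:

```python
import pandas as pd

states = pd.Series(
    ["Positive", "Positive", "Negative", "Negative", "Positive"],
    dtype=object)

# True wherever the value differs from the previous value
changes = states.shift() != states
print(changes.tolist())  # Prints [True, False, True, False, True]

# Don't treat the start of the series as a change
changes.iloc[0] = False
print(changes.tolist())  # Prints [False, False, True, False, True]
```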

 

Posted by Uli Köhler in pandas, Python

How to create pandas time series DataFrame example dataset

TL;DR: Use our pre-built example dataset like this:

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

How to build your own time series example dataset

In our previous post Easily generate sine/cosine waveform data in Python using UliEngineering we showed how to generate sine and cosine waves using UliEngineering.

In this post, we show how to create a pandas DataFrame containing sine and cosine data to be used as a sample time series dataset.

First, we generate the sine and cosine wave data:

import pandas as pd
import numpy as np
from UliEngineering.SignalProcessing.Simulation import sine_wave, cosine_wave

# Configure the properties of the sine wave here
frequency = 10.0 # 10 Hz sine / cosine wave
samplerate = 10000 # 10 kHz
nseconds = 1 # Generate 1 second of data

sine = sine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
cosine = cosine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
nsamples = len(sine) # How many values we have in the data arrays

After that, we define the timestamp where the dataset starts:

start_timestamp = pd.Timestamp('now')

Now we can create a list of Timestamp objects representing the points in time where the signal has been sampled:

# Create timestamps by offsetting
timedelta = pd.Timedelta(1/samplerate, 'seconds')
timestamps =  [start_timestamp + i * timedelta for i in range(nsamples)]

Now we’re ready to create the DataFrame object:

df = pd.DataFrame(index=timestamps, data={
    "Sine": sine,
    "Cosine": cosine
})
df.index.name = 'Timestamp'

Now we can use df.plot() to plot the dataset:

# Use nice plotting style
from matplotlib import pyplot as plt
plt.style.use("ggplot")
# Plot dataset
df.plot()
# Make figure larger
plt.gcf().set_size_inches(10, 5)

Additionally we can export the dataset as CSV using

df.to_csv("/ram/timeseries-example.csv")

This example file is also available online at https://techoverflow.net/datasets/timeseries-example.csv

Full example:

#!/usr/bin/env python3
import pandas as pd
import numpy as np
from UliEngineering.SignalProcessing.Simulation import sine_wave, cosine_wave
# Configure the properties of the sine wave here
frequency = 10.0 # 10 Hz sine / cosine wave
samplerate = 10000 # 10 kHz
nseconds = 1 # Generate 1 second of data
sine = sine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
cosine = cosine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
nsamples = len(sine) # How many values we have in the data arrays

start_timestamp = pd.Timestamp('now')

# Create timestamps by offsetting
timedelta = pd.Timedelta(1/samplerate, 'seconds')
timestamps =  [start_timestamp + i * timedelta for i in range(nsamples)]

df = pd.DataFrame(index=timestamps, data={
    "Sine": sine,
    "Cosine": cosine
})
df.index.name = 'Timestamp'

df.to_csv("timeseries-example.csv")
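If you don't have UliEngineering installed, a similar dataset can be sketched with plain NumPy. This assumes that the waves are simply sin(2πft) and cos(2πft) with unit amplitude and zero phase, which might differ in detail from what sine_wave() and cosine_wave() generate. The fixed start timestamp is chosen arbitrarily to make the result reproducible:

```python
import numpy as np
import pandas as pd

frequency = 10.0 # 10 Hz sine / cosine wave
samplerate = 10000 # 10 kHz
nseconds = 1 # Generate 1 second of data
nsamples = int(samplerate * nseconds)

# Sample times in seconds, relative to the start of the dataset
t = np.arange(nsamples) / samplerate
sine = np.sin(2 * np.pi * frequency * t)
cosine = np.cos(2 * np.pi * frequency * t)

# Arbitrary fixed start timestamp (assumption for reproducibility)
start_timestamp = pd.Timestamp("2020-05-25 20:05:10")
sample_interval = pd.Timedelta(seconds=1) / samplerate
timestamps = pd.date_range(start=start_timestamp, periods=nsamples,
                           freq=sample_interval)

df = pd.DataFrame({"Sine": sine, "Cosine": cosine}, index=timestamps)
df.index.name = "Timestamp"
print(df.shape)  # Prints (10000, 2)
```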

 

Posted by Uli Köhler in pandas, Python

How to get last 10 minutes of a pandas DataFrame

In our previous post we showed how to subtract 5 minutes from a pandas Timestamp. Subtracting 10 minutes works the same way:

pd.Timestamp('now') - pd.Timedelta(10, 'minutes')

We can also use this knowledge in order to get the last 10 minutes of a pandas DataFrame. In our example, we assume that df["Timestamp"] contains the timestamp. First, we get the last timestamp in the dataset using

# Use this if the timestamp is the index of the DataFrame
last_ts = df.index[-1]

or

# ... or use this if the timestamp is in a column
last_ts = df["Timestamp"].iloc[-1]

Next, we define the first timestamp that shall be considered by subtracting 10 minutes from last_ts:

first_ts = last_ts - pd.Timedelta(10, 'minutes')

Now we can filter the DataFrame using

# Use this if the Timestamp is in a column
filtered_df = df[df["Timestamp"] >= first_ts]

or

# Use this if the Timestamp is the index of the DataFrame
filtered_df = df[df.index >= first_ts]

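Since we filter instead of slicing, the original row order is maintained. Note, however, that taking last_ts from the last row assumes that the last row holds the most recent timestamp. If the DataFrame is not sorted, use df.index.max() or df["Timestamp"].max() instead.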

Full example:

This example loads our pre-built time series example dataset from our previous post How to create pandas time series DataFrame example dataset. The code loads that dataset (which is 1 second long) and takes the last 0.5 seconds from it.

import pandas as pd

# Load example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Use this if the timestamp is the index of the DataFrame
last_ts = df.index[-1]

first_ts = last_ts - pd.Timedelta(0.5, 'seconds')

filtered_df = df[df.index >= first_ts]

# Plot the result
filtered_df.plot()
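The same steps can be tried without downloading anything. This sketch uses a synthetic DataFrame with one row per minute (made-up data) and determines the last timestamp using max(), so it does not depend on the row order:

```python
import pandas as pd

# Synthetic dataset: 31 rows from 00:00 to 00:30, one per minute
index = pd.date_range("2020-01-01 00:00", periods=31, freq="min")
df = pd.DataFrame({"Value": range(31)}, index=index)
df.index.name = "Timestamp"

last_ts = df.index.max()
first_ts = last_ts - pd.Timedelta(10, "minutes")

filtered_df = df[df.index >= first_ts]
print(len(filtered_df))  # Prints 11 (00:20 up to and including 00:30)
```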

 

Posted by Uli Köhler in pandas, Python

How to subtract 5 minutes from pandas Timestamp

In our previous post we showed how to create a pandas Timestamp representing the current point in time:

pd.Timestamp('now')

You can subtract 5 minutes from that timestamp by using Timedelta(5, 'minutes'):

pd.Timestamp('now') - pd.Timedelta(5, 'minutes')
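Since 'now' changes on every run, here is a small sketch with a fixed Timestamp that makes the result easy to check:

```python
import pandas as pd

ts = pd.Timestamp("2020-05-25 19:02:31")
earlier = ts - pd.Timedelta(5, "minutes")

print(earlier)  # Prints 2020-05-25 18:57:31
```

pd.Timedelta(minutes=5) is an equivalent way of writing the same Timedelta.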
Posted by Uli Köhler in pandas, Python

How to create pandas ‘now’ Timestamp

In order to create a pandas Timestamp representing the current point in time, use

pd.Timestamp('now')

This will create a timezone-naive Timestamp representing the current local time.

Full example:

import pandas as pd

now = pd.Timestamp('now')

print(now) # Prints e.g. 2020-05-25 19:02:31.051836
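If you need a timezone-aware Timestamp instead, one option is Timestamp.now() with its tz parameter. A minimal sketch, using UTC as an example timezone:

```python
import pandas as pd

naive_now = pd.Timestamp("now")
utc_now = pd.Timestamp.now(tz="UTC")

print(naive_now.tz)  # Prints None, i.e. no timezone information attached
print(utc_now.tz)    # Prints the attached timezone (UTC)
```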

 

Posted by Uli Köhler in pandas, Python

How to get last row of Pandas DataFrame

Use .iloc[-1] to get the last row (all columns) of a pandas DataFrame, for example:

my_dataframe.iloc[-1]
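Note that .iloc[-1] returns the row as a Series. If you prefer a DataFrame containing only the last row, pass a list of positions instead. A small sketch using made-up data:

```python
import pandas as pd

my_dataframe = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

last_row = my_dataframe.iloc[-1]          # Series
last_row_frame = my_dataframe.iloc[[-1]]  # DataFrame with a single row

print(last_row["B"])         # Prints 6.0
print(last_row_frame.shape)  # Prints (1, 2)
```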

 

Posted by Uli Köhler in pandas, Python

How to read IDF diabetes statistics in Python using Pandas

The International Diabetes Federation provides a Data portal with various statistics related to diabetes.

In this post we’ll show how to read the Diabetes estimates (20-79 y) / People with diabetes, in 1,000s data export in CSV format using pandas.

First download IDF (people-with-diabetes--in-1-000s).csv from the data page.

Now we can parse the CSV file:

import pandas as pd

# Download at https://www.diabetesatlas.org/data/en/indicators/1/
df = pd.read_csv("IDF (people-with-diabetes--in-1-000s).csv")
# Parse year columns to obtain floats and multiply by thousands factor. Pandas fails to parse values like "12,345.67"
for column in df.columns:
    try:
        int(column)
        df[column] = df[column].apply(lambda s: None if s == "-" else float(s.replace(",", "")) * 1000)
    except:
        pass

As you can see in the postprocessing step, the number of diabetes patients is given in 1000s in the CSV, so we multiply the values by 1000 to obtain the actual numbers.
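As a possible alternative to the manual string replacement, read_csv() has a thousands parameter and an na_values parameter. The following sketch uses a tiny made-up CSV, not the real IDF export, so the country names and values are purely illustrative and the real file may need additional handling:

```python
import io
import pandas as pd

# Made-up data mimicking the number format, NOT the real IDF file
csv_text = '''Country/Territory,Type,2019,2030
Examplestan,Country,"1,234.5","2,000.0"
Samplonia,Country,-,"10.0"
'''

# Treat "," as thousands separator and "-" as missing value
df = pd.read_csv(io.StringIO(csv_text), thousands=",", na_values=["-"])

# Only columns whose name is a year number contain data values
year_columns = [column for column in df.columns if column.isdigit()]
df[year_columns] = df[year_columns] * 1000

print(df["2030"].tolist())  # Prints [2000000.0, 10000.0]
```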

If you want to modify the data columns (i.e. the columns referring to year), you can use this simple template:

for column in df.columns:
    try:
        int(column) # Will raise ValueError() if column is not a year number
        # Whatever you do here will only be applied to year columns
        df[column] = df[column] * 0.75 # Example on how to modify a column
        # But note that if your code raises an Exception, it will be ignored!
    except:
        pass

Let’s plot some data:

regions = df[df["Type"] == "Region"] # Only regions, not individual countries

from matplotlib import pyplot as plt
plt.style.use("ggplot")
plt.gcf().set_size_inches(20,4)
plt.ylabel("Diabetes patients [millions]")
plt.xlabel("Region")
plt.title("Diabetes patients in 2019 by region")
plt.bar(regions["Country/Territory"], regions["2019"] / 1e6)

Note that if you use a more recent dataset than the version I’m using, the 2019 column might not exist in your CSV file. Choose an appropriate column in that case.

Posted by Uli Köhler in Bioinformatics, pandas, Python

Parsing World Population Prospects (WPP) XLSX data in Python

The United Nations provides the World Population Prospects (WPP) dataset on geographic and age distribution of mankind as downloadable XLSX files.

Reading these files in Python is rather easy. First we have to find out how many rows to skip. For the 2019 WPP dataset this value is 16 since row 17 contains all the column headers. The number of rows to skip might be different depending on the dataset. We’re using WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx in this example.

We can use the pandas read_excel() function to import the dataset in Python:

import pandas as pd

df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])

This will take a few seconds until the large dataset has been processed. Now we can check if skiprows=16 is the correct value. It is correct if pandas did recognize the column names correctly:

>>> df.columns
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
       'Country code', 'Type', 'Parent code', 'Reference date (as of 1 July)',
       '0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
       '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
       '80-84', '85-89', '90-94', '95-99', '100+'],
      dtype='object')

Now let’s filter for a country:

russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']

This will show us the population data for multiple years in 5-year intervals from 1950 to 2020. Now let’s filter for the most recent year:

russia.loc[russia["Reference date (as of 1 July)"].idxmax()]

This will show us a single row of the dataset:

Index                                                 3255
Variant                                          Estimates
Region, subregion, country or area *    Russian Federation
Notes                                                  NaN
Country code                                           643
Type                                          Country/Area
Parent code                                            923
Reference date (as of 1 July)                         2020
0-4                                                9271.69
5-9                                                9350.92
10-14                                              8174.26
15-19                                              7081.77
20-24                                               6614.7
25-29                                              8993.09
30-34                                              12543.8
35-39                                              11924.7
40-44                                              10604.6
45-49                                              9770.68
50-54                                              8479.65
55-59                                                10418
60-64                                              10073.6
65-69                                              8427.75
70-74                                              5390.38
75-79                                              3159.34
80-84                                              3485.78
85-89                                              1389.64
90-94                                              668.338
95-99                                              102.243
100+                                                 9.407
Name: 3254, dtype: object
​

How can we plot that data? First, we need to select all the columns that contain age data. We’ll do this by manually inserting the name of the first such column (0-4) into the following code and assuming that there are no columns after the last age column:

>>> df.columns[df.columns.get_loc("0-4"):]
Index(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
       '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
       '80-84', '85-89', '90-94', '95-99', '100+'],
      dtype='object')

Now let’s select those columns from the russia dataset:

most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
age_columns = df.columns[df.columns.get_loc("0-4"):]

russian_age_data = most_recent_russia[age_columns]
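The selection logic can be tried on a tiny synthetic DataFrame without downloading the XLSX file. The country names and numbers in this sketch are made up; only the column names follow the WPP layout shown above:

```python
import pandas as pd

# Made-up data using the WPP column names (only two age columns)
df = pd.DataFrame({
    "Region, subregion, country or area *": ["Exampleland", "Exampleland", "Otherland"],
    "Reference date (as of 1 July)": [2015, 2020, 2020],
    "0-4": [10.0, 11.0, 5.0],
    "5-9": [9.0, 10.5, 4.0],
})

country = df[df["Region, subregion, country or area *"] == "Exampleland"]
# idxmax() returns the index label of the row with the largest year
most_recent = country.loc[country["Reference date (as of 1 July)"].idxmax()]
# All columns from "0-4" to the last column
age_columns = df.columns[df.columns.get_loc("0-4"):]
age_data = most_recent[age_columns].astype(float)

print(age_data.tolist())  # Prints [11.0, 10.5]
```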

Let’s have a look at the dataset:

>>> russian_age_data
0-4      9271.69
5-9      9350.92
10-14    8174.26
15-19    7081.77
20-24     6614.7
25-29    8993.09
30-34    12543.8
35-39    11924.7
40-44    10604.6
45-49    9770.68
50-54    8479.65
55-59      10418
60-64    10073.6
65-69    8427.75
70-74    5390.38
75-79    3159.34
80-84    3485.78
85-89    1389.64
90-94    668.338
95-99    102.243
100+       9.407

That looks usable. Note, however, that the values are in thousands, i.e. we have to multiply them by 1000 to obtain the actual population estimates. Let’s plot it:

from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
plt.plot(russian_age_data.index, russian_age_data.to_numpy(dtype=float) / 1000., lw=3)

The finished plot will look like this:

Here’s our finished script:

#!/usr/bin/env python3
import pandas as pd
df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16)
# Filter only russia
russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']

# Filter only most recent estimate (1 row)
most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
# Retain only value columns
age_columns = df.columns[df.columns.get_loc("0-4"):]
russian_age_data = most_recent_russia[age_columns]

# Plot!
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
plt.plot(russian_age_data.index, russian_age_data.to_numpy(dtype=float) / 1000., lw=3)

# Export as SVG
plt.savefig("russian-demographics.svg")

 

 

Posted by Uli Köhler in Bioinformatics, Data science, pandas, Python

How to get milliseconds from pandas.Timedelta object

If you have a pandas.Timedelta object, you can use Timedelta.total_seconds() to get the seconds as a floating-point number with millisecond resolution and then multiply by one thousand (1e3, the number of milliseconds in one second) to obtain the number of milliseconds in the Timedelta:

timedelta.total_seconds() * 1e3

In case you want an integer, use

int(round(timedelta.total_seconds() * 1e3))

Note that using round() is required here to avoid errors due to floating point precision.

Alternatively, use these function definitions:

def milliseconds_from_timedelta(timedelta):
    """Compute the milliseconds in a timedelta as floating-point number"""
    return timedelta.total_seconds() * 1e3

def milliseconds_from_timedelta_integer(timedelta):
    """Compute the milliseconds in a timedelta as integer number"""
    return int(round(timedelta.total_seconds() * 1e3))

# Usage example:
ms = milliseconds_from_timedelta(timedelta)
print(ms) # Prints 2000.752
ms = milliseconds_from_timedelta_integer(timedelta)
print(ms) # Prints 2001
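Another option is to divide the Timedelta by a one-millisecond Timedelta. In this sketch, true division (/) yields a floating-point number, while floor division (//) yields an integer that is rounded down instead of rounded to the nearest value:

```python
import pandas as pd

timedelta = pd.Timedelta("0 days 00:00:02.000752")
one_ms = pd.Timedelta(milliseconds=1)

ms_float = timedelta / one_ms   # Floating-point number of milliseconds
ms_floor = timedelta // one_ms  # Integer, rounded down

print(ms_floor)         # Prints 2000
print(round(ms_float))  # Prints 2001
```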

 

Posted by Uli Köhler in pandas, Python

How to get microseconds from pandas.Timedelta object

If you have a pandas.Timedelta object, you can use Timedelta.total_seconds() to get the seconds as a floating-point number with microsecond resolution and then multiply by one million (1e6, the number of microseconds in one second) to obtain the number of microseconds in the Timedelta:

timedelta.total_seconds() * 1e6

In case you want an integer, use

int(round(timedelta.total_seconds() * 1e6))

Note that using round() is required here to avoid errors due to floating point precision.

Alternatively, use these function definitions:

def microseconds_from_timedelta(timedelta):
    """Compute the microseconds in a timedelta as floating-point number"""
    return timedelta.total_seconds() * 1e6

def microseconds_from_timedelta_integer(timedelta):
    """Compute the microseconds in a timedelta as integer number"""
    return int(round(timedelta.total_seconds() * 1e6))

# Usage example:
us = microseconds_from_timedelta(timedelta)
print(us) # Prints 2000751.9999999998

us = microseconds_from_timedelta_integer(timedelta)
print(us) # Prints 2000752
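If you have a whole Series of Timedeltas instead of a single object, one way to convert all of them at once is to divide the Series by a one-microsecond Timedelta. A small sketch with made-up values:

```python
import pandas as pd

durations = pd.Series([pd.Timedelta(microseconds=5), pd.Timedelta(seconds=1)])
# Element-wise division yields a Series of floating-point numbers
microseconds = durations / pd.Timedelta(microseconds=1)

print(microseconds.tolist())  # Prints [5.0, 1000000.0]
```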
Posted by Uli Köhler in pandas, Python

How to get nanoseconds from pandas.Timedelta object

If you have a pandas.Timedelta object, you can use Timedelta.total_seconds() to get the seconds as a floating-point number with nanosecond resolution and then multiply by one billion (1e9, the number of nanoseconds in one second) to obtain the number of nanoseconds in the Timedelta:

timedelta.total_seconds() * 1e9

In case you want an integer, use

int(round(timedelta.total_seconds() * 1e9))

Note that using round() is required here to avoid errors due to floating point precision.

Alternatively, use these function definitions:

def nanoseconds_from_timedelta(timedelta):
    """Compute the nanoseconds in a timedelta as floating-point number"""
    return timedelta.total_seconds() * 1e9

def nanoseconds_from_timedelta_integer(timedelta):
    """Compute the nanoseconds in a timedelta as integer number"""
    return int(round(timedelta.total_seconds() * 1e9))

# Usage example:
ns = nanoseconds_from_timedelta(timedelta)
print(ns) # Prints 2000751999.9999998

ns = nanoseconds_from_timedelta_integer(timedelta)
print(ns) # Prints 2000752000
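For long durations, a floating-point number cannot represent every single nanosecond any more. If you need the exact integer, one option is floor division by a one-nanosecond Timedelta, which avoids floating-point arithmetic. The following sketch uses one year plus a single nanosecond as an example:

```python
import pandas as pd

timedelta = pd.Timedelta(days=365) + pd.Timedelta(1, "ns")

# Integer arithmetic only, hence exact
exact_ns = timedelta // pd.Timedelta(1, "ns")
# Floating-point route, loses the last nanosecond for this duration
via_float = int(round(timedelta.total_seconds() * 1e9))

print(exact_ns)               # Prints 31536000000000001
print(via_float == exact_ns)  # Prints False
```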

 

Posted by Uli Köhler in pandas, Python

How to create pandas.Timedelta object from two timestamps

If you have two pandas.Timestamp objects, you can simply subtract them using the minus operator (-) in order to obtain a pandas.Timedelta object:

import pandas as pd
import time

# Create two timestamps
ts1 = pd.Timestamp.now()
time.sleep(2)
ts2 = pd.Timestamp.now()

# The difference of these timestamps is a pandas.Timedelta object.
timedelta = ts2 - ts1
print(timedelta) # Prints '0 days 00:00:02.000752'
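Since the example above depends on the current time, here is a reproducible sketch with two fixed timestamps (chosen arbitrarily) that yields the same Timedelta:

```python
import pandas as pd

ts1 = pd.Timestamp("2020-05-25 19:00:00")
ts2 = pd.Timestamp("2020-05-25 19:00:02.000752")

timedelta = ts2 - ts1
print(type(timedelta).__name__)  # Prints Timedelta
print(timedelta)                 # Prints 0 days 00:00:02.000752
```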

 

Posted by Uli Köhler in pandas, Python