pandas

How to create pandas time series DataFrame example dataset

TL;DR: Use our pre-built example dataset like this:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

How to build your own time series example dataset

In our previous post Easily generate sine/cosine waveform data in Python using UliEngineering we showed how to generate sine and cosine waves using UliEngineering.

In this post, we show how to create a pandas DataFrame containing sine and cosine data to be used as a sample time series dataset.

First, we generate the sine and cosine wave data:

import pandas as pd
import numpy as np
from UliEngineering.SignalProcessing.Simulation import sine_wave, cosine_wave

# Configure the properties of the sine wave here
frequency = 10.0 # 10 Hz sine / cosine wave
samplerate = 10000 # 10 kHz
nseconds = 1 # Generate 1 second of data

sine = sine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
cosine = cosine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
nsamples = len(sine) # How many values we have in the data arrays
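If UliEngineering is not installed, comparable arrays can be sketched with NumPy alone. This is only an approximation under the assumption of unit amplitude and zero phase offset; it is not claimed to reproduce the exact output of sine_wave() and cosine_wave().

```python
import numpy as np

frequency = 10.0    # 10 Hz sine / cosine wave
samplerate = 10000  # 10 kHz
nseconds = 1        # Generate 1 second of data

# Sample times in seconds: 0, 1/samplerate, 2/samplerate, ...
t = np.arange(int(samplerate * nseconds)) / samplerate
# Assumption: amplitude 1.0 and no phase offset
sine = np.sin(2 * np.pi * frequency * t)
cosine = np.cos(2 * np.pi * frequency * t)
nsamples = len(sine)
```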

After that, we define the timestamp where the dataset starts:

start_timestamp = pd.Timestamp('now')

Now we can create a list of Timestamp objects representing the points in time where the signal has been sampled:

# Create timestamps by offsetting
timedelta = pd.Timedelta(1/samplerate, 'seconds')
timestamps =  [start_timestamp + i * timedelta for i in range(nsamples)]
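As an alternative to the list comprehension, pandas can generate evenly spaced timestamps directly using pd.date_range(). The following sketch uses a fixed start timestamp (chosen arbitrarily for reproducibility) instead of 'now':

```python
import pandas as pd

samplerate = 10000  # 10 kHz
nsamples = 10000    # 1 second of data

# Fixed start timestamp, an arbitrary value used for illustration only
start_timestamp = pd.Timestamp("2020-05-25 19:00:00")

# Time between two samples (100 microseconds at 10 kHz)
sample_interval = pd.Timedelta(seconds=1) / samplerate

# DatetimeIndex with nsamples evenly spaced entries
timestamps = pd.date_range(start=start_timestamp, periods=nsamples, freq=sample_interval)
```

The result is a DatetimeIndex, which can be passed as index= to pd.DataFrame() in the same way as the list of Timestamp objects.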

Now we’re ready to create the DataFrame object:

df = pd.DataFrame(index=timestamps, data={
    "Sine": sine,
    "Cosine": cosine
})
df.index.name = 'Timestamp'

Now we can use df.plot() to plot the dataset:

# Use nice plotting style
from matplotlib import pyplot as plt
plt.style.use("ggplot")
# Plot dataset
df.plot()
# Make figure larger
plt.gcf().set_size_inches(10, 5)

Additionally we can export the dataset as CSV using

df.to_csv("/ram/timeseries-example.csv")

This example file is also available online at https://techoverflow.net/datasets/timeseries-example.csv
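To check that an exported CSV can be read back with the timestamps as index, here is a minimal round-trip sketch. It writes to an in-memory buffer instead of a file and uses a tiny hand-made DataFrame, both of which are assumptions made to keep the example self-contained:

```python
import io
import pandas as pd

# Tiny stand-in dataset with the same layout as the example dataset
index = pd.date_range("2020-05-25 19:00:00", periods=3, freq="100us", name="Timestamp")
df = pd.DataFrame({"Sine": [0.0, 0.5, 1.0], "Cosine": [1.0, 0.5, 0.0]}, index=index)

# Export to CSV (in memory) and read it back
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Timestamp"], index_col="Timestamp")
```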

Full example:

#!/usr/bin/env python3
import pandas as pd
import numpy as np
from UliEngineering.SignalProcessing.Simulation import sine_wave, cosine_wave
# Configure the properties of the sine wave here
frequency = 10.0 # 10 Hz sine / cosine wave
samplerate = 10000 # 10 kHz
nseconds = 1 # Generate 1 second of data
sine = sine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
cosine = cosine_wave(frequency=frequency, samplerate=samplerate, length=nseconds)
nsamples = len(sine) # How many values we have in the data arrays

start_timestamp = pd.Timestamp('now')

# Create timestamps by offsetting
timedelta = pd.Timedelta(1/samplerate, 'seconds')
timestamps =  [start_timestamp + i * timedelta for i in range(nsamples)]

df = pd.DataFrame(index=timestamps, data={
    "Sine": sine,
    "Cosine": cosine
})
df.index.name = 'Timestamp'

df.to_csv("timeseries-example.csv")

 

Posted by Uli Köhler in pandas, Python

How to get last 10 minutes of a pandas DataFrame

In our previous post we showed how to subtract 5 minutes from a pandas Timestamp. Subtracting 10 minutes works the same way:

pd.Timestamp('now') - pd.Timedelta(10, 'minutes')

We can use the same approach to get the last 10 minutes of a pandas DataFrame. In our example, we assume that the timestamps are stored either in the index of the DataFrame or in df["Timestamp"]. First, we get the last timestamp in the dataset using

# Use this if the timestamp is the index of the DataFrame
last_ts = df.index[-1]

or

# ... or use this if the timestamp is in a column
last_ts = df["Timestamp"].iloc[-1]

Next, we define the first timestamp that shall be considered by subtracting 10 minutes from last_ts:

first_ts = last_ts - pd.Timedelta(10, 'minutes')

Now we can filter the DataFrame using

# Use this if the Timestamp is in a column
filtered_df = df[df["Timestamp"] >= first_ts]

or

# Use this if the Timestamp is the index of the DataFrame
filtered_df = df[df.index >= first_ts]

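Since we filter using a boolean mask, the original row order is maintained. Note that index[-1] and iloc[-1] only return the most recent timestamp if the rows are in chronological order; for unsorted data, use df.index.max() or df["Timestamp"].max() instead.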
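The steps above can be tried without downloading anything. The following sketch builds a small synthetic DataFrame with one row per minute (the column name "Value" and the start time are made up for illustration) and keeps the last 10 minutes:

```python
import pandas as pd

# Synthetic dataset: 31 rows, one per minute, from 12:00 to 12:30
index = pd.date_range("2020-05-25 12:00:00", periods=31, freq="min", name="Timestamp")
df = pd.DataFrame({"Value": range(31)}, index=index)

# max() also works if the rows are not in chronological order
last_ts = df.index.max()
first_ts = last_ts - pd.Timedelta(10, "minutes")

# Boolean mask: keeps every row at or after first_ts
filtered_df = df[df.index >= first_ts]

# For a sorted DatetimeIndex, label-based slicing selects the same rows
sliced_df = df.loc[first_ts:]
```

Both variants include the row at exactly first_ts, so the result contains 11 rows (12:20 to 12:30 inclusive).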

Full example:

This example loads our pre-built time series example dataset from our previous post How to create pandas time series DataFrame example dataset. The code loads that dataset (which is 1 second long) and takes the last 0.5 seconds from it.

import pandas as pd

# Load example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Use this if the timestamp is the index of the DataFrame
last_ts = df.index[-1]

first_ts = last_ts - pd.Timedelta(0.5, 'seconds')

filtered_df = df[df.index >= first_ts]

# Plot the result
filtered_df.plot()

 

Posted by Uli Köhler in pandas, Python

How to subtract 5 minutes from pandas Timestamp

In our previous post we showed how to create a pandas Timestamp representing the current point in time:

pd.Timestamp('now')

You can subtract 5 minutes from that timestamp by using Timedelta(5, 'minutes'):

pd.Timestamp('now') - pd.Timedelta(5, 'minutes')
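Because 'now' changes on every run, the following sketch uses a fixed timestamp (an arbitrary value chosen for illustration) so the result is reproducible. It also shows the keyword form pd.Timedelta(minutes=5), which yields the same Timedelta:

```python
import pandas as pd

ts = pd.Timestamp("2020-05-25 19:02:31")

# Positional form: value and unit
earlier = ts - pd.Timedelta(5, "minutes")
# Keyword form
same = ts - pd.Timedelta(minutes=5)

print(earlier)  # Prints 2020-05-25 18:57:31
```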
Posted by Uli Köhler in pandas, Python

How to create pandas ‘now’ Timestamp

In order to create a pandas Timestamp representing the current point in time, use

pd.Timestamp('now')

This will create a timezone-naive Timestamp containing the current local time.

Full example:

import pandas as pd

now = pd.Timestamp('now')

print(now) # Prints e.g. 2020-05-25 19:02:31.051836
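If you need a timezone-aware timestamp instead, pd.Timestamp.now() accepts a tz argument. A short sketch, using UTC as an example timezone:

```python
import pandas as pd

# No timezone information attached
naive_now = pd.Timestamp.now()
# Timezone-aware timestamp in UTC
utc_now = pd.Timestamp.now(tz="UTC")

print(naive_now.tz)  # Prints None
print(utc_now.tz)    # Prints UTC
```

Note that naive and timezone-aware timestamps cannot be compared or subtracted directly, so it is usually best to pick one kind and use it consistently.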

 

Posted by Uli Köhler in pandas, Python

How to get last row of Pandas DataFrame

Use .iloc[-1] to get the last row (all columns) of a pandas DataFrame, for example:

my_dataframe.iloc[-1]
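A small sketch with a made-up DataFrame shows what you get back: .iloc[-1] returns the row as a Series, whereas .iloc[[-1]] (with a list) returns a one-row DataFrame.

```python
import pandas as pd

# Made-up example data
my_dataframe = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

last_row = my_dataframe.iloc[-1]       # Series with one entry per column
last_row_df = my_dataframe.iloc[[-1]]  # DataFrame containing only the last row

print(last_row["A"])  # Prints 3
```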

 

Posted by Uli Köhler in pandas, Python

How to get last element of Pandas Series

Use .iloc[-1] to get the last element of a pandas Series, for example:

my_dataframe['MyColumn'].iloc[-1]
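.iloc[-1] is position-based, so it works regardless of the index labels. The following sketch uses a made-up Series with integer labels, a case where plain series[-1] would look for the label -1 instead of the last position:

```python
import pandas as pd

# Made-up Series whose integer labels do not start at 0
series = pd.Series([10, 20, 30], index=[7, 8, 9])

last = series.iloc[-1]
print(last)  # Prints 30
```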

 

Posted by Uli Köhler in pandas, Python

How to read IDF diabetes statistics in Python using Pandas

The International Diabetes Federation provides a Data portal with various statistics related to diabetes.

In this post we’ll show how to read the Diabetes estimates (20-79 y) / People with diabetes, in 1,000s data export in CSV format using pandas.

First download IDF (people-with-diabetes--in-1-000s).csv from the data page.

Now we can parse the CSV file:

import pandas as pd

# Download at https://www.diabetesatlas.org/data/en/indicators/1/
df = pd.read_csv("IDF (people-with-diabetes--in-1-000s).csv")
# Parse year columns to obtain floats and multiply by thousands factor. Pandas fails to parse values like "12,345.67"
for column in df.columns:
    try:
        int(column)
        df[column] = df[column].apply(lambda s: None if s == "-" else float(s.replace(",", "")) * 1000)
    except:
        pass

As you can see in the postprocessing step, the numbers of diabetes patients are given in 1000s in the CSV, so we multiply them by 1000 to obtain the actual numbers.
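As a possible alternative to the manual string replacement, pd.read_csv() has a thousands parameter for the thousands separator and a na_values parameter for placeholder strings such as "-". The sketch below uses a small in-memory CSV with made-up region names and values (not real IDF figures) that mimics the layout described above; whether it handles every quirk of the real export has not been verified here.

```python
import io
import pandas as pd

# Made-up sample data mimicking the layout of the IDF export
csv_text = '''Country/Territory,Type,2019,2030
Region A,Region,"59,300.5","66,000.0"
Country B,Country,-,"1,234.5"
'''

# thousands="," strips the separator, na_values turns "-" into NaN
df = pd.read_csv(io.StringIO(csv_text), thousands=",", na_values=["-"])

# Year columns are those whose name consists only of digits
year_columns = [column for column in df.columns if column.isdigit()]
# Values are given in 1000s => multiply to obtain actual numbers
df[year_columns] = df[year_columns] * 1000
```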

If you want to modify the data columns (i.e. the columns referring to year), you can use this simple template:

for column in df.columns:
    try:
        int(column) # Will raise ValueError() if column is not a year number
        # Whatever you do here will only be applied to year columns
        df[column] = df[column] * 0.75 # Example on how to modify a column
        # But note that if your code raises an Exception, it will be ignored!
    except:
        pass
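If you would rather not have exceptions silently ignored, one option is to determine the year columns first and modify them in a separate step. A sketch with made-up data:

```python
import pandas as pd

# Made-up example data with two year columns
df = pd.DataFrame({
    "Country/Territory": ["Region A", "Region B"],
    "Type": ["Region", "Region"],
    "2019": [1000.0, 2000.0],
    "2030": [1500.0, 2500.0],
})

# Select only columns whose name is a year number
year_columns = [column for column in df.columns if str(column).isdigit()]

# Any exception raised here is no longer swallowed
df[year_columns] = df[year_columns] * 0.75
```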

Let’s plot some data:

regions = df[df["Type"] == "Region"] # Only regions, not individual countries

from matplotlib import pyplot as plt
plt.style.use("ggplot")
plt.gcf().set_size_inches(20,4)
plt.ylabel("Diabetes patients [millions]")
plt.xlabel("Region")
plt.title("Diabetes patients in 2019 by region")
plt.bar(regions["Country/Territory"], regions["2019"] / 1e6)

Note that if you use a more recent dataset than the version I’m using, the 2019 column might not exist in your CSV file. Choose an appropriate column in that case.

Posted by Uli Köhler in Bioinformatics, pandas, Python

Parsing World Population Prospects (WPP) XLSX data in Python

The United Nations provides the World Population Prospects (WPP) dataset on geographic and age distribution of mankind as downloadable XLSX files.

Reading these files in Python is rather easy. First we have to find out how many rows to skip. For the 2019 WPP dataset this value is 16 since row 17 contains all the column headers. The number of rows to skip might be different depending on the dataset. We’re using WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx in this example.

We can use the pandas read_excel() function to import the dataset in Python:

import pandas as pd

df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])

This will take a few seconds until the large dataset has been processed. Now we can check if skiprows=16 is the correct value. It is correct if pandas did recognize the column names correctly:

>>> df.columns
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
       'Country code', 'Type', 'Parent code', 'Reference date (as of 1 July)',
       '0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
       '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
       '80-84', '85-89', '90-94', '95-99', '100+'],
      dtype='object')

Now let’s filter for a country:

russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']

This will show us the population data for multiple years in 5-year intervals from 1950 to 2020. Now let’s filter for the most recent year:

russia.loc[russia["Reference date (as of 1 July)"].idxmax()]

This will show us a single row of the dataset:

Index                                                 3255
Variant                                          Estimates
Region, subregion, country or area *    Russian Federation
Notes                                                  NaN
Country code                                           643
Type                                          Country/Area
Parent code                                            923
Reference date (as of 1 July)                         2020
0-4                                                9271.69
5-9                                                9350.92
10-14                                              8174.26
15-19                                              7081.77
20-24                                               6614.7
25-29                                              8993.09
30-34                                              12543.8
35-39                                              11924.7
40-44                                              10604.6
45-49                                              9770.68
50-54                                              8479.65
55-59                                                10418
60-64                                              10073.6
65-69                                              8427.75
70-74                                              5390.38
75-79                                              3159.34
80-84                                              3485.78
85-89                                              1389.64
90-94                                              668.338
95-99                                              102.243
100+                                                 9.407
Name: 3254, dtype: object
​

How can we plot that data? First, we need to select all the columns that contain age data. We’ll do this by manually inserting the name of the first such column (0-4) into the following code and assuming that there are no columns after the last age column:

>>> df.columns[df.columns.get_loc("0-4"):]
Index(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
       '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
       '80-84', '85-89', '90-94', '95-99', '100+'],
      dtype='object')

Now let’s select those columns from the russia dataset:

most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
age_columns = df.columns[df.columns.get_loc("0-4"):]

russian_age_data = most_recent_russia[age_columns]
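The same selection steps can be tried on a tiny stand-in DataFrame without downloading the XLSX file. The country names and numbers below are made up; only the column names follow the WPP layout shown above:

```python
import pandas as pd

# Made-up miniature version of the WPP table
df = pd.DataFrame({
    "Region, subregion, country or area *": ["Exampleland", "Exampleland", "Otherland"],
    "Reference date (as of 1 July)": [2015, 2020, 2020],
    "0-4": [10.0, 11.0, 50.0],
    "5-9": [12.0, 13.0, 60.0],
})

# Filter for one country
country = df[df["Region, subregion, country or area *"] == "Exampleland"]
# Select the row with the most recent reference date
most_recent = country.loc[country["Reference date (as of 1 July)"].idxmax()]
# All columns from "0-4" to the end are age columns
age_columns = df.columns[df.columns.get_loc("0-4"):]
# The row is a mixed-type Series, so convert the age values to float
age_data = most_recent[age_columns].astype(float)
```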

Let’s have a look at the dataset:

>>> russian_age_data
0-4      9271.69
5-9      9350.92
10-14    8174.26
15-19    7081.77
20-24     6614.7
25-29    8993.09
30-34    12543.8
35-39    11924.7
40-44    10604.6
45-49    9770.68
50-54    8479.65
55-59      10418
60-64    10073.6
65-69    8427.75
70-74    5390.38
75-79    3159.34
80-84    3485.78
85-89    1389.64
90-94    668.338
95-99    102.243
100+       9.407

That looks usable. Note, however, that the values are in thousands, i.e. we have to multiply the values by 1000 to obtain the actual population estimates. Let’s plot it:

from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
# Note: as_matrix() has been removed from recent pandas versions, use to_numpy() instead
plt.plot(russian_age_data.index, russian_age_data.to_numpy(dtype=float) / 1000., lw=3)

The finished plot shows the age groups on the x-axis and the number of people in millions on the y-axis.

Here’s our finished script:

#!/usr/bin/env python3
import pandas as pd
df = pd.read_excel("WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])
# Filter only russia
russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']

# Filter only most recent estimate (1 row)
most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
# Retain only value columns
age_columns = df.columns[df.columns.get_loc("0-4"):]
russian_age_data = most_recent_russia[age_columns]

# Plot!
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
# Note: as_matrix() has been removed from recent pandas versions, use to_numpy() instead
plt.plot(russian_age_data.index, russian_age_data.to_numpy(dtype=float) / 1000., lw=3)

# Export as SVG
plt.savefig("russian-demographics.svg")

 

 

Posted by Uli Köhler in Bioinformatics, Data science, pandas, Python

How to get milliseconds from pandas.Timedelta object

If you have a pandas.Timedelta object, you can use Timedelta.total_seconds() to get the seconds as a floating-point number with millisecond resolution and then multiply by one thousand (1e3, the number of milliseconds in one second) to obtain the number of milliseconds in the Timedelta:

timedelta.total_seconds() * 1e3

In case you want an integer, use

int(round(timedelta.total_seconds() * 1e3))

Note that using round() is required here to avoid errors due to floating point precision.

or use this function definition:

def milliseconds_from_timedelta(timedelta):
    """Compute the milliseconds in a timedelta as floating-point number"""
    return timedelta.total_seconds() * 1e3

def milliseconds_from_timedelta_integer(timedelta):
    """Compute the milliseconds in a timedelta as integer number"""
    return int(round(timedelta.total_seconds() * 1e3))

# Usage example:
ms = milliseconds_from_timedelta(timedelta)
print(ms) # Prints 2000.752
ms = milliseconds_from_timedelta_integer(timedelta)
print(ms) # Prints 2001
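Another way to express the same conversion is to divide the Timedelta by a Timedelta of one millisecond. This sketch uses a fixed example value; note that floor division (//) rounds down, unlike the round() approach above:

```python
import pandas as pd

# Fixed example value, matching the usage example above
timedelta = pd.Timedelta("0 days 00:00:02.000752")
one_ms = pd.Timedelta(1, "ms")

ms_float = timedelta / one_ms   # Float number of milliseconds
ms_floor = timedelta // one_ms  # Whole milliseconds, rounded down

print(ms_floor)  # Prints 2000
```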

 

Posted by Uli Köhler in pandas, Python

How to get microseconds from pandas.Timedelta object

If you have a pandas.Timedelta object, you can use Timedelta.total_seconds() to get the seconds as a floating-point number with microsecond resolution and then multiply by one million (1e6, the number of microseconds in one second) to obtain the number of microseconds in the Timedelta:

timedelta.total_seconds() * 1e6

In case you want an integer, use

int(round(timedelta.total_seconds() * 1e6))

Note that using round() is required here to avoid errors due to floating point precision.

or use this function definition:

def microseconds_from_timedelta(timedelta):
    """Compute the microseconds in a timedelta as floating-point number"""
    return timedelta.total_seconds() * 1e6

def microseconds_from_timedelta_integer(timedelta):
    """Compute the microseconds in a timedelta as integer number"""
    return int(round(timedelta.total_seconds() * 1e6))

# Usage example:
us = microseconds_from_timedelta(timedelta)
print(us) # Prints 2000751.9999999998

us = microseconds_from_timedelta_integer(timedelta)
print(us) # Prints 2000752
Posted by Uli Köhler in pandas, Python

How to get nanoseconds from pandas.Timedelta object

If you have a pandas.Timedelta object, you can use Timedelta.total_seconds() to get the seconds as a floating-point number with nanosecond resolution and then multiply by one billion (1e9, the number of nanoseconds in one second) to obtain the number of nanoseconds in the Timedelta:

timedelta.total_seconds() * 1e9

In case you want an integer, use

int(round(timedelta.total_seconds() * 1e9))

Note that using round() is required here to avoid errors due to floating point precision.

or use this function definition:

def nanoseconds_from_timedelta(timedelta):
    """Compute the nanoseconds in a timedelta as floating-point number"""
    return timedelta.total_seconds() * 1e9

def nanoseconds_from_timedelta_integer(timedelta):
    """Compute the nanoseconds in a timedelta as integer number"""
    return int(round(timedelta.total_seconds() * 1e9))

# Usage example:
ns = nanoseconds_from_timedelta(timedelta)
print(ns) # Prints 2000751999.9999998

ns = nanoseconds_from_timedelta_integer(timedelta)
print(ns) # Prints 2000752000
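Since floating-point arithmetic may not preserve every single nanosecond, it can be worth knowing that a Timedelta also exposes its duration as an integer number of nanoseconds via the .value attribute. A sketch with a fixed example value that includes a nanosecond component:

```python
import pandas as pd

# Fixed example value with a nanosecond component
timedelta = pd.Timedelta("0 days 00:00:02.000752001")

# Integer number of nanoseconds, no floating-point conversion involved
ns = timedelta.value
# Equivalent: floor division by a Timedelta of one nanosecond
ns_div = timedelta // pd.Timedelta(1, "ns")

print(ns)  # Prints 2000752001
```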

 

Posted by Uli Köhler in pandas, Python

How to create pandas.Timedelta object from two timestamps

If you have two pandas.Timestamp objects, you can simply subtract them using the minus operator (-) in order to obtain a pandas.Timedelta object:

import pandas as pd
import time

# Create two timestamps
ts1 = pd.Timestamp.now()
time.sleep(2)
ts2 = pd.Timestamp.now()

# The difference of these timestamps is a pandas.Timedelta object.
timedelta = ts2 - ts1
print(timedelta) # Prints '0 days 00:00:02.000752'
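For a reproducible variant that does not depend on the current time or on time.sleep(), the same subtraction can be sketched with two fixed timestamps (arbitrary example values):

```python
import pandas as pd

# Two fixed example timestamps
ts1 = pd.Timestamp("2019-10-27 16:11:43.998993")
ts2 = pd.Timestamp("2019-10-27 16:11:45.999745")

# The difference is a pandas.Timedelta object
timedelta = ts2 - ts1
print(timedelta)  # Prints 0 days 00:00:02.000752
```

Subtracting in the opposite order (ts1 - ts2) yields a negative Timedelta.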

 

Posted by Uli Köhler in pandas, Python

How to get current timestamp in Pandas

Use

import pandas as pd

current_timestamp = pd.Timestamp.now()

pd.Timestamp.now() will return the current timestamp as a pandas.Timestamp object.

Example:

>>> import pandas as pd
>>> pd.Timestamp.now()
Timestamp('2019-10-27 16:11:43.998993')
Posted by Uli Köhler in pandas, Python

Center star approximation: Identifying the center string in Python

Problem:

You need to calculate the center star approximation for a given set of sequences. Instead of calculating the sequence distances and center string by hand, you want the computer to do the hard work.


Posted by Uli Köhler in Bioinformatics, pandas, Python