# Data science

## Create a numpy array of one datetime64 per day and mark the start of the month

startdate = np.datetime64('2022-01-01T00:00:00.000000')
# 10000 days. This array contains 0 (Day 0), 1 (Day 1), etc
day_offsets = np.arange(10000, dtype=np.int64)
usec_per_day = int(1e6) * 86400 # 86.4k sec per day, 1e6 microseconds per second
# Compute microseconds offset from first day
usec_offsets = day_offsets * usec_per_day
# Compute timestamps
timestamps = startdate + usec_offsets

# Mark start of month in boolean array
is_start_of_month = np.zeros_like(timestamps, dtype=bool)
for index, timestamp in np.ndenumerate(timestamps):
is_start_of_month[index[0]] = timestamp.astype(datetime).day == 1

Using this method, timestamps will be the start of each day

array(['2022-01-01T00:00:00.000000', '2022-01-02T00:00:00.000000',
'2022-01-03T00:00:00.000000', ..., '2101-12-30T00:00:00.000000',
'2101-12-31T00:00:00.000000', '2102-01-01T00:00:00.000000'],
dtype='datetime64[us]')

Posted by Uli Köhler in Data science, Python

## How to make video from Matplotlib plots

First, you need to ensure that your plots are saved all with the same name pattern containing the frame number (starting at 1) which must be padded with zeros! For example, your plot should be named myplot_000001.png or myplot_0123. This can be done using, for example, using

fig.savefig(f'myplots/myplot_{timestep:06d}.png')

06d pads the number (in the timestep variable) up to 6 digits in total.

Now, use ffmpeg like this to create the video:

ffmpeg -f image2 -framerate 25 -i myplots/myplot_%06d.png -vcodec libx264 -crf 22 video.mp4

Posted by Uli Köhler in Data science, Python

## How to replace pandas values by NaN by threshold

When processing pandas datasets, often you need to remove values above or below a given threshold from a dataset. One way to “remove” values from a dataset is to replace them by NaN (not a number) values which are typically treated as “missing” values.

For example: In order to replace values of the xcolumn by NaNwhere the x column is< 0.75 in a DataFrame df, use this snippet:

import numpy as np

df["x"][df["x"] < -0.75] = np.nan

For example, we can run this on the TechOverflow pandas time series example dataset. The original dataset has two columns: Sine and Cosine and looks like this:

After running

df["Sine"][df["Sine"] < -0.75] = np.nan

you can see that all Sine values below 0.75 have been omitted from the plot, but all the values from the Cosine column are left unchanged:

Complete example code:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use("ggplot")

# Load pre-built time series example dataset
df.set_index("Timestamp", inplace=True)

# Plot original code
df.plot()
plt.savefig("TimeSeries-Original.svg")

and this is the code to plot the filtered dataset:

df[df < -0.75] = np.nan
df.plot()
plt.savefig("TimeSeries-NaN.svg")

Posted by Uli Köhler in Data science, pandas, Python

## AWS / Wasabi S3 policy to allow a single user complete access to a single bucket

Use this S3 policy in your bucket configuration to allow a single user (identified by its ARN) complete access to a single bucket:

{
"Id": "MyBucketPolicy",
"Statement": [
{
"Sid": "AllAccess",
"Action": "s3:*",
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::my-bucket-name",
"arn:aws:s3:::my-bucket-name/*"
],
"Principal": {"AWS":["arn:aws:iam::100000012345:user/MyUser"]}
}
]
}

Replace both instances of arn:aws:s3:::my-bucket-name by your bucket ARN and replace arn:aws:iam::100000012345:user/MyUser by your user’s ARN. You can also add multiple user ARNs to "Principal": {"AWS": ... }.

Posted by Uli Köhler in Data science

## How to fix cfgrib TypeError: open_dataset() got an unexpected keyword argument ‘filter_by_keys’

### Problem:

You want to open a GRB2 file using the cfgrib library using code like

import xarray as xr
ds = xr.open_dataset('myfile.grb2', engine='cfgrib', filter_by_keys={'typeOfLevel': 'atmosphere'})

But you see an exception like

TypeError: open_dataset() got an unexpected keyword argument 'filter_by_keys'

### Solution:

You can’t use filter_by_keys=… as argument to xr.open_dataset() directly. You need to use backend_kwargs like this:

ds = xr.open_dataset('myfile.grb2', engine='cfgrib', backend_kwargs={'filter_by_keys': {'typeOfLevel': 'atmosphere'}})

Posted by Uli Köhler in Data science, Python

## Parsing World Population Prospects (WPP) XLSX data in Python

The United Nations provides the Word Population Prospects (WPP) dataset on geographic and age distribution of mankind as downloadable XLSX files.

Reading these files in Python is rather easy. First we have to find out how many rows to skip. For the 2019 WPP dataset this value is 16 since row 17 contains all the column headers. The number of rows to skip might be different depending on the dataset. We’re using WPP2019_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.xlsx in this example.

We can use Pandas read_excel() function to import the dataset in Python:

import pandas as pd

df = pd.read_excel("WPP2019_INT_F03_1_POPULATION_BY_AGE_ANNUAL_BOTH_SEXES.xlsx", skiprows=16, na_values=["..."])

This will take a few seconds until the large dataset has been processed. Now we can check if skiprows=16 is the correct value. It is correct if pandas did recognize the column names correctly:

>>> df.columns
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
'Country code', 'Type', 'Parent code', 'Reference date (as of 1 July)',
'0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
'40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
'80-84', '85-89', '90-94', '95-99', '100+'],
dtype='object')

Now let’s filter for a country:

russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']

This will show us the population data for multiple years in 5-year intervals from 1950 to 2020. Now let’s filter for the most recent year:

russia.loc[russia["Reference date (as of 1 July)"].idxmax()]

This will show us a single dataset:

Index                                                 3255
Variant                                          Estimates
Region, subregion, country or area *    Russian Federation
Notes                                                  NaN
Country code                                           643
Type                                          Country/Area
Parent code                                            923
Reference date (as of 1 July)                         2020
0-4                                                9271.69
5-9                                                9350.92
10-14                                              8174.26
15-19                                              7081.77
20-24                                               6614.7
25-29                                              8993.09
30-34                                              12543.8
35-39                                              11924.7
40-44                                              10604.6
45-49                                              9770.68
50-54                                              8479.65
55-59                                                10418
60-64                                              10073.6
65-69                                              8427.75
70-74                                              5390.38
75-79                                              3159.34
80-84                                              3485.78
85-89                                              1389.64
90-94                                              668.338
95-99                                              102.243
100+                                                 9.407
Name: 3254, dtype: object
​

How can we plot that data? First, we need to select all the columns that contain age data. We’ll do this by manually inserting the name of the first such column (0-4) into the following code and assuming that there are no columns after the last age column:

>>> df.columns[df.columns.get_loc("0-4"):]
Index(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
'40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
'80-84', '85-89', '90-94', '95-99', '100+'],
dtype='object')

Now let’s select those columns from the russia dataset:

most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
age_columns = df.columns[df.columns.get_loc("0-4"):]

russian_age_data = most_recent_russia[age_columns]

Let’s have a look at the dataset:

>>> russian_age_data
0-4      9271.69
5-9      9350.92
10-14    8174.26
15-19    7081.77
20-24     6614.7
25-29    8993.09
30-34    12543.8
35-39    11924.7
40-44    10604.6
45-49    9770.68
50-54    8479.65
55-59      10418
60-64    10073.6
65-69    8427.75
70-74    5390.38
75-79    3159.34
80-84    3485.78
85-89    1389.64
90-94    668.338
95-99    102.243
100+       9.407

That looks useable, note however that the values are in thousands, i.e. we have to multiply the values by 1000 to obtain the actual estimates of the population. Let’s plot it:

from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
plt.plot(russian_age_data.index, russian_age_data.as_matrix() / 1000., lw=3)

The finished plot will look like this:

Here’s our finished script:

#!/usr/bin/env python3
import pandas as pd
# Filter only russia
russia = df[df["Region, subregion, country or area *"] == 'Russian Federation']

# Filter only most recent estimate (1 row)
most_recent_russia = russia.loc[russia["Reference date (as of 1 July)"].idxmax()]
# Retain only value columns
age_columns = df.columns[df.columns.get_loc("0-4"):]
russian_age_data = most_recent_russia[age_columns]

# Plot!
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.title("Age composition of the Russian population (2020)")
plt.ylabel("People in age group [Millions]")
plt.xlabel("Age group")
plt.gcf().set_size_inches(15,5)
# Data is given in thousands => divide by 1000 to obtain millions
plt.plot(russian_age_data.index, russian_age_data.as_matrix() / 1000., lw=3)

# Export as SVG
plt.savefig("russian-demographics.svg")

Posted by Uli Köhler in Bioinformatics, Data science, pandas, Python

## FFT spectrum frequency resolution calculator

Calculate the frequency resolution of a FFT spectrum using this online calculator – formula included.

TechOverflow calculators:
You can enter values with SI suffixes like 12.2m (equivalent to 0.012) or 14k (14000) or 32u (0.000032).
The results are calculated while you type and shown directly below the calculator, so there is no need to press return or click on a Calculate button. Just make sure that all inputs are green by entering valid values.

points

Hz

$$f_{resolution} = \frac{f_{samplerate}}{N_{FFT}}$$

Posted by Uli Köhler in Calculators, Data science

## Easily compute & visualize FFTs in Python using UliEngineering

UliEngineering is a mixed data analytics library in Python – one of the utilities it provides is an easy-to-use package to compute FFTs. In contrast to other package, this library is oriented towards practical usecases and allows you to do the FFT in only one line of code! No knowledge of Math required.

### How to install UliEngineering

UliEngineering is a Python 3 only library. Install using pip:

sudo pip3 install -U UliEngineering

### Getting started

First, we’ll generate some test data. See this previous post for more details on how to generate sinusoid test data:

from UliEngineering.SignalProcessing.Simulation import *

# Generate test data: 100 Hz + 400 Hz tone
data = sine_wave(frequency=100.0, samplerate=1000, amplitude=1.) \
+ sine_wave(frequency=400.0, samplerate=1000, amplitude=0.5)

The test data consists of a 100 Hz sine wave plus a 400 Hz sine wave (with half the amplitude). That signal is sampled with a sample rate of 1000 Hz.

Now we can compute & visualize the FFT using matplotlib:

# Compute FFT. Use the same samplerate
# NOTE: Window defaults to "blackman"!
from UliEngineering.SignalProcessing.FFT import compute_fft
fft = compute_fft(data, samplerate=1e3)

# Plot
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.gcf().set_size_inches(10, 5) # Use (20, 10) to get a larger plot
plt.plot(fft.frequencies, fft.amplitudes)
plt.xlabel("Frequency")
plt.ylabel("Amplitude")

compute_fft(data, samplerate=1e3) returns a tuple (fftx, ffty). It performs an FFT the size of the input (i.e. since data is a length-1000 array, the FFT will be of size 1000).

fft.frequencies is an array of frequencies (in Hz), corresponding to the values in fft.amplitudes. You can also use fft.angles to get relative angles in degrees, but that is not being covered in this blogpost.

As you can see in the plot shown above, the maximum frequency that can be detected is always half the samplerate, i.e. for our samplerate of $f_s = 1000\,\text{Hz}$ it is $500\,\text{Hz}$. See this more math-centric FFT explanation if you want to know more details.

Internally, compute_fft() performs this computation:

$$2 \cdot \frac{\text{abs}\left(\text{FFT}(\text{data} \cdot \text{Window})\right)}{\text{len(data)}}$$
• $2$ is a correction factor that takes into account that we throw away the latter half of the raw FFT result (since we’re doing a real FFT)
• $\frac{1}{\text{len(data)}}$ normalizes the FFT results so they are indepedent of the data length (i.e. if you pass a longer sample of the same sine wave you will still get the same result
• $\text{abs}\left(\cdots\right)$ Converts the complex phase-aware result of the FFT to a spectrum which is easier to read & visualize.
• Window (which defaults to blackman) is the window which is applied to the data to alleviate some mathematical effects at the beginning and the end of the dataset. See wikipedia on window functions for more details. UliEngineering currently offers this list of window functions:
• blackman
• bartlett
• hamming
• hanning
• kaiser (Parameter is fixed to 2.0)
• none

### Selecting frequency ranges

Using the UliEngineering API, selecting a frequency range of the FFT is trivially easy: Just use fft[lowfreq:highfreq]. You can use fft[lowfreq:] to select everything starting from lowfreq or use fft[:highfreq] to select everything up to highfreq.

from UliEngineering.SignalProcessing.FFT import compute_fft
fft = compute_fft(data, samplerate=1e3)

# Select frequency range: 50 to 200 Hz
fft = fft[50.0:200.0]

# Plot
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.gcf().set_size_inches(10, 5) # Use (20, 10) to get a larger plot
plt.plot(fft.frequencies, fft.amplitudes)
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.savefig("/ram/fft-frequency-range.svg")

### Extract amplitude & angle at a certain frequency

By using [frequency], i.e. getitem operator with a single value, you get an FFTPoint() object containing the frequency, amplitude and relative angle for a given frequency. The library automatically selects the closest FFT bucket, so even if your FFT does not have a bucket for that specific frequency, you will get sensible results.

from UliEngineering.SignalProcessing.FFT import compute_fft
fft = compute_fft(data, samplerate=1e3)

# Show value at a certain frequency
print(fft[30]) # FFTPoint(frequency=30.0, value=2.91e-08, angle=0.0)
print(fft[100]) # FFTPoint(frequency=100.0, value=0.419, angle=0.0)

### Short FFTs for long data

Using compute_fft(), if we have an extremely long data array, this means we’ll compute an extremely long FFT. In many cases, this is not desirable and you want to compute a fixed-size FFT (most often a power-of-two FFT, e.g. 1024, 2048, 4096 etc).

Let’s generate some long test data and assume we want to compute a size-1024 FFT on it

from UliEngineering.SignalProcessing.FFT import *
fftx, ffty = simple_serial_fft_reduce(data, samplerate=1e3, fftsize=1024)

Plotting the data like above yields

which looks almost exactly like our compute_fft() plot before – just as we expected.

simple_serial_fft_reduce() takes care of all the magic and normalization for us, including partitioning the data into overlapping chunks, adding the FFTs and properly normalizing the result.

The naming convention is significant here:

• simple_...._reduce means this is a variant with sensible defaults (simple) for using a reduction function (default: sum) on multiple FFTs.
• serial means the individual FFTs

### Parallelizing the FFTs

If you have a huge dataset, you can use simple_parallel_fft_reduce() identically tosimple_serial_fft_reduce():

from UliEngineering.SignalProcessing.FFT import *
fftx, ffty = simple_parallel_fft_reduce(data, samplerate=1e3, fftsize=1024)

However in most cases you want to initialize the executor manually so you can re-use it later:

from UliEngineering.SignalProcessing.FFT import *
fftx, ffty = simple_parallel_fft_reduce(data, samplerate=1e3, fftsize=1024, executor=executor)

We can use a ThreadPoolExecutor() since scipy.fftpack (which UliEngineering uses to do the hard math) unlocks the Python GIL.

Note that due to the need to do a lot of housekeeping tasks, simple_parallel_fft_reduce() is much slower than simple_serial_fft_reduce() if you have a dataset so small that parallelization is not effective. My initial recommendation is to consider using the parallel variant if the total execution time of the serial variant is larger than $0.5\,s$

Posted by Uli Köhler in Data science, Mathematics, Python

## Easily generate square/triangle/sawtooth/inverse sawtooth waveform data in Python using UliEngineering

In a previous post, I’ve detailed how to generate sine/cosine wave data with frequency, amplitude, offset, phase shift and time offset using only a single line of code.

This post extends this approach by showing how to generate square wave, triangle wave, sawtooth wave and inverse sawtooth wave data – still in only one line of code.

All of the parameters, including frequency, amplitude, offset, phase shift and time offset apply for those functions as well – see the previous post for details and examples for those parameters.

We are using the UliEngineering library, more specifically the UliEngineering.SignalProcessing.Simulation package:

### How to install UliEngineering

UliEngineering is a Python 3 only library. Install using pip:

sudo pip3 install -U UliEngineering

### Square wave

from UliEngineering.SignalProcessing.Simulation import square_wave

data = square_wave(frequency=10.0, samplerate=10e3)

### Triangle wave

from UliEngineering.SignalProcessing.Simulation import triangle_wave

data = triangle_wave(frequency=10.0, samplerate=10e3)

### Sawtooth wave

from UliEngineering.SignalProcessing.Simulation import sawtooth

data = sawtooth(frequency=10.0, samplerate=10e3)

### Inverse sawtooth wave

from UliEngineering.SignalProcessing.Simulation import inverse_sawtooth

data = inverse_sawtooth(frequency=10.0, samplerate=10e3)

### Plotting code

This code was used to generate the plots for this post in Jupyter:

%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use("ggplot")

from UliEngineering.SignalProcessing.Simulation import square_wave

data = square_wave(frequency=10.0, samplerate=10e3)

# set_size_inches(20, 10) to make it even larger!
plt.gcf().set_size_inches(10, 5)
plt.plot(data, label="original")
plt.savefig("/dev/shm/square-wave.svg")
Posted by Uli Köhler in Data science, Mathematics, Python

## Easily generate sine/cosine waveform data in Python using UliEngineering

In order to generate sinusoid test data in Python you can use the UliEngineering library which provides an easy-to-use functions in UliEngineering.SignalProcessing.Simulation:

### How to install UliEngineering

UliEngineering is a Python 3 only library. Install using pip:

sudo pip3 install -U UliEngineering

### Basic example

from UliEngineering.SignalProcessing.Simulation import sine_wave
# Default: Generates 1 second of data with amplitude = 1.0 (swing from -1.0 ... 1.0)
sine = sine_wave(frequency=10.0, samplerate=10e3)
cosine = cosine_wave(frequency=10.0, samplerate=10e3)

### Amplitude & offset

Use amplitude=0.5 to specify that the sine wave should swing between -0.5 and 0.5.

Use offset=2.0 to specify that the sine wave should be centered vertically around 2.0:

from UliEngineering.SignalProcessing.Simulation import sine_wave

data = sine_wave(frequency=10.0, samplerate=10e3, amplitude=0.5, offset=2.0)

### Phaseshift example

You can specify the phaseshift to use – to use a 180° phaseshift, just use phaseshift=180.0

from UliEngineering.SignalProcessing.Simulation import sine_wave

original = sine_wave(frequency=10.0, samplerate=10e3)
shifted = sine_wave(frequency=10.0, samplerate=10e3, phaseshift=180.)

### Time delay example

While this is functionally equivalent to phase offset, it is often convenient to specify the time delay in seconds instead of the phase shift in degrees:

from UliEngineering.SignalProcessing.Simulation import sine_wave

original = sine_wave(frequency=10.0, samplerate=10e3)
# Shifted signal is delayed 0.01s = 10 milliseconds compared to original
shifted = sine_wave(frequency=10.0, samplerate=10e3, timedelay=0.01)

Note that in the time domain, the signals appear to be shifted backwards when you use a positive timedelay value. This is in accordance with the delay naming, implying that the signal is delayed by that amount.

You can specify both phase shift and time delay, meaning that both will be applied (the offset is added)

### Plotting

If you want to debug your signals visually, this is the code that was used inside Jupyter to generate the plots shown above:

%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use("ggplot")

# Generate data
from UliEngineering.SignalProcessing.Simulation import sine_wave

original = sine_wave(frequency=10.0, samplerate=10e3)
shifted = sine_wave(frequency=10.0, samplerate=10e3, timedelay=0.01)

# set_size_inches(20, 10) to make it even larger!
plt.gcf().set_size_inches(10, 5)
plt.plot(original, label="original")
plt.plot(shifted, label="shifted")
plt.savefig("/dev/shm/timedelay.svg")
plt.legend(loc=1) # Top right
Posted by Uli Köhler in Data science, Mathematics, Python

## Easy zero crossing detection in Python using UliEngineering

In order to perform zero crossing detection in NumPy arrays you can use the UliEngineering library which provides an easy-to-use zero_crossings function:

### How to install UliEngineering

UliEngineering is a Python 3 only library. Install using pip:

sudo pip3 install -U UliEngineering

### Usage example:

from UliEngineering.SignalProcessing.Utils import zero_crossings
# To generate test data
from UliEngineering.SignalProcessing.Simulation import sine_wave

# Generate test data: 10 Hz 10ksps, 1 second (= 1D 10000 values NumPy array)
data = sine_wave(frequency=10.0, samplerate=10e3)

# Prints: [0, 500, 1000, 1500, ...]
print(zero_crossings(data))

Thanks to Jim Brissom on StackOverflow for the original solution!

Posted by Uli Köhler in Data science, Mathematics, Python