When processing datasets in pandas, you often need to remove values above or below a given threshold. One way to “remove” values is to replace them with NaN (not a number), which pandas typically treats as a “missing” value.
For example, to replace values of the x column with NaN wherever the x column is < -0.75 in a DataFrame df, use this snippet:
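import numpy as np
# Use .loc for the assignment: chained indexing such as df["x"][mask] = ...
# may operate on a copy and leave df unchanged
df.loc[df["x"] < -0.75, "x"] = np.nan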
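If you would rather not modify the DataFrame in place, `Series.mask()` might be a useful alternative. The following is a minimal sketch on a small made-up DataFrame (the values are hypothetical and not taken from the example dataset); `mask()` returns a new Series in which every value matching the condition is replaced by NaN.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in this post
df = pd.DataFrame({"x": [-1.0, -0.8, -0.5, 0.0, 0.9]})

# mask() replaces values where the condition is True (with NaN by default)
# and returns a new Series, so df itself is left untouched
masked = df["x"].mask(df["x"] < -0.75)

print(masked)
```

You can assign the result back with `df["x"] = masked` if you do want to overwrite the original column.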
For example, we can run this on the TechOverflow pandas time series example dataset. The original dataset has two columns, Sine and Cosine, and looks like this:
After running
df.loc[df["Sine"] < -0.75, "Sine"] = np.nan
you can see that all Sine values below -0.75 have been omitted from the plot, but all the values from the Cosine column are left unchanged:
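The same effect can also be checked numerically instead of visually. The sketch below uses a synthetic stand-in for the example dataset (an assumption: only the column names Sine and Cosine are taken from the post, the values are generated locally) and counts the NaN values per column after filtering.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the example dataset (assumption: same column names)
t = np.linspace(0, 2 * np.pi, 200)
df = pd.DataFrame({"Sine": np.sin(t), "Cosine": np.cos(t)})

# Keep a copy of the Cosine column to compare against later
cosine_before = df["Cosine"].copy()

# Replace only the Sine values below the threshold
df.loc[df["Sine"] < -0.75, "Sine"] = np.nan

# Count NaN values per column
nan_counts = df.isna().sum()
print(nan_counts)

# min() skips NaN by default, so this is the smallest remaining value
remaining_min = df["Sine"].min()
print(remaining_min)
```

Only the Sine column should report a non-zero NaN count, and the smallest remaining Sine value should not be below the threshold.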
Complete example code:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use("ggplot")

# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Plot original dataset
df.plot()
plt.savefig("TimeSeries-Original.svg")
and this is the code to plot the filtered dataset:
# Replace only the Sine values below the threshold; Cosine is left unchanged
df.loc[df["Sine"] < -0.75, "Sine"] = np.nan
df.plot()
plt.savefig("TimeSeries-NaN.svg")
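To remove values both above and below a threshold in one step, one possible approach is to combine `Series.between()` with `Series.where()`. This is a sketch only: the helper name `nan_outside` and the sample values are hypothetical and not part of the example above.

```python
import pandas as pd

def nan_outside(series, lower, upper):
    """Return a copy of series with values outside [lower, upper] set to NaN."""
    # between() is inclusive on both ends by default;
    # where() keeps values where the condition is True and puts NaN elsewhere
    return series.where(series.between(lower, upper))

# Hypothetical sample values
s = pd.Series([-1.0, -0.5, 0.0, 0.5, 1.0])
result = nan_outside(s, -0.75, 0.75)
print(result)
```

For the dataset above, this could be applied as `df["Sine"] = nan_outside(df["Sine"], -0.75, 0.75)`.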