How to detect value change in pandas string column/series
TL;DR
In order to get a series that is True
every time the input string column changes, use
my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]
The first value of this Series
will always be True
since the value is considered to be NaN
before the start of the series (due to the behaviour of shift()
). In order to force the first value to be False
, use
my_column_changes.iloc[0] = False
In order to get the rows in the dataframe where the column changes, use
df[my_column_changes]
or use this one-liner:
df[df["MyStringColumn"].shift() != df["MyStringColumn"]]
In order to assign this value to a new column in the DataFrame
, use e.g.
df["MyStringColumnChanges"] = df["MyStringColumn"].shift() != df["MyStringColumn"]
Full example:
First we load our example from our previous post on How to create pandas time series DataFrame example dataset:
import pandas as pd
# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)
Now we create a new column that contains Positive
if the sine wave value in the "Sine"
column is positive or "Negative"
if that value is negative:
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
Now we create the ZeroCrossing
column using the method shown above:
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
… and set the first entry to False
since we don’t consider the start of the series to be a zero crossing:
df["ZeroCrossing"].iloc[0] = False
Now we can use
df[df["ZeroCrossing"]]
to show the rows in the DataFrame
where the zero crossing happened:
Sine Cosine SinePositive ZeroCrossing
Timestamp
2020-05-25 20:05:10.040874 -6.283144e-03 -0.99998 Negative True
2020-05-25 20:05:10.090874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.140874 -6.283144e-03 -0.99998 Negative True
2020-05-25 20:05:10.190874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.240874 -6.283144e-03 -0.99998 Negative True
2020-05-25 20:05:10.290874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.340874 -6.283144e-03 -0.99998 Negative True
2020-05-25 20:05:10.390874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.440774 -2.450532e-15 -1.00000 Negative True
2020-05-25 20:05:10.490874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.540874 -6.283144e-03 -0.99998 Negative True
2020-05-25 20:05:10.590874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.640774 -1.960673e-15 -1.00000 Negative True
2020-05-25 20:05:10.690874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.740874 -6.283144e-03 -0.99998 Negative True
2020-05-25 20:05:10.790874 6.283144e-03 0.99998 Positive True
2020-05-25 20:05:10.840874 -6.283144e-03 -0.99998 Negative True
2020-05-25 20:05:10.890774 4.901063e-15 1.00000 Positive True
2020-05-25 20:05:10.940874 -6.283144e-03 -0.99998 Negative True
Full example code:
import pandas as pd
# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)
# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df["ZeroCrossing"].iloc[0] = False
# Print result
print(df[df["ZeroCrossing"]])