How to detect value change in pandas string column/series

TL;DR

In order to get a series that is True every time the input string column changes, use

my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]

The first value of this Series will always be True since the value is considered to be NaN before the start of the series (due to the behaviour of shift()). In order to force the first value to be False, use

my_column_changes.iloc[0] = False

In order to get the rows in the dataframe where the column changes, use

df[my_column_changes]

or use this one-liner:

df[df["MyStringColumn"].shift() != df["MyStringColumn"]]

In order to assign this value to a new column in the DataFrame, use e.g.

df["MyStringColumnChanges"] = df["MyStringColumn"].shift() != df["MyStringColumn"]

Full example:

First we load our example from our previous post on How to create pandas time series DataFrame example dataset:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

Now we create a new column that contains Positive if the sine wave value in the "Sine" column is positive or "Negative" if that value is negative:

df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})

Now we create the ZeroCrossing column using the method shown above:

# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]

… and set the first entry to False since we don’t consider the start of the series to be a zero crossing:

df["ZeroCrossing"].iloc[0] = False

Now we can use

df[df["ZeroCrossing"]]

to show the rows in the DataFrame where the zero crossing happened:

                                    Sine   Cosine SinePositive  ZeroCrossing
Timestamp                                                                   
2020-05-25 20:05:10.040874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.090874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.140874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.190874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.240874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.290874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.340874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.390874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.440774 -2.450532e-15 -1.00000     Negative          True
2020-05-25 20:05:10.490874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.540874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.590874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.640774 -1.960673e-15 -1.00000     Negative          True
2020-05-25 20:05:10.690874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.740874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.790874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.840874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.890774  4.901063e-15  1.00000     Positive          True
2020-05-25 20:05:10.940874 -6.283144e-03 -0.99998     Negative          True

 

Full example code:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df["ZeroCrossing"].iloc[0] = False

# Print result
print(df[df["ZeroCrossing"]])