Split pandas DataFrame every time a Series is True
In our previous post we explored how to Split pandas DataFrame every time a column is True.
This slightly modified function also works if the given Series is not a column in the DataFrame:
def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True. Yields a number of dataframes
    """
    previous_index = df.index[0]
    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore
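One detail worth knowing: the function above slices by index label, and pandas label slices include both endpoints. With a label-based index such as a DatetimeIndex, the row at each split point therefore appears at the end of one frame and again at the start of the next. If you need non-overlapping frames, the following is a minimal sketch of a positional variant. The name split_dataframe_by_mask and the use of NumPy are my own assumptions and not part of the original function. The sketch assumes the mask has the same length and row order as the DataFrame and ignores the mask's own index. Unlike the function above, it yields the whole DataFrame once if the mask contains no True value.

```python
import numpy as np
import pandas as pd

def split_dataframe_by_mask(df, mask):
    """
    Hypothetical positional variant: split df before every row where mask is True.
    Assumes mask has the same length and row order as df (its index is ignored).
    """
    if len(mask) != len(df):
        raise ValueError("mask must have the same length as df")
    # Integer positions of all True entries
    split_positions = np.flatnonzero(np.asarray(mask, dtype=bool))
    start = 0
    for position in split_positions:
        # iloc slices exclude the end position, so frames do not overlap
        yield df.iloc[start:position]
        start = position
    # Yield remainder of dataset (the whole DataFrame if there was no split point)
    yield df.iloc[start:]

# Small example: the mask is a standalone Series, not a column of df
df = pd.DataFrame({"Value": range(6)}, index=list("abcdef"))
mask = pd.Series([False, False, True, False, True, False])
frames = list(split_dataframe_by_mask(df, mask))
print([list(frame.index) for frame in frames])
# Prints [['a', 'b'], ['c', 'd'], ['e', 'f']]
```

Each frame here starts at a row where the mask is True (except the first one, which starts at the first row), so every row ends up in exactly one frame.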
Full example
We’ll use the ZeroCrossing column we built in our previous post on How to detect value change in pandas string column/series, which itself builds on our post on How to create pandas time series DataFrame example dataset. Based on that example, we add the modified utility function shown above:
import pandas as pd
# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)
# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False (a single .iloc call avoids chained assignment)
df.iloc[0, df.columns.get_loc("ZeroCrossing")] = False
def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True. Yields a number of dataframes
    """
    previous_index = df.index[0]
    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore
# Print result
split_frames = list(split_dataframe_by_series(df, df["ZeroCrossing"]))
print(f"Split DataFrame into {len(split_frames)} separate frames by zero-crossing")
Note that converting the result of split_dataframe_by_series() into a list might not be necessary depending on your application. If possible, I recommend iterating over the data frames directly using a for loop, e.g.:
for df_section in split_dataframe_by_series(df, df["ZeroCrossing"]):
    pass # TODO: Your code goes here!
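As an illustration of what such a loop body might look like, here is a minimal sketch that records the start, row count and mean of each section. It uses a small synthetic DataFrame as a stand-in for the example dataset (the dates and values are made up for illustration) and keeps the boolean Series outside the DataFrame. The column names of the summary table are my own choice.

```python
import pandas as pd

def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True. Yields a number of dataframes
    """
    previous_index = df.index[0]
    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

# Synthetic stand-in for the time series example dataset (values are made up)
index = pd.date_range("2021-01-01", periods=8, freq="D")
df = pd.DataFrame({"Sine": [1.0, 2.0, -1.0, -2.0, 3.0, 4.0, -3.0, -4.0]}, index=index)

# Boolean Series that is NOT a column of df: True where the sign changes
positive = df["Sine"] >= 0
zero_crossing = positive.shift() != positive
zero_crossing.iloc[0] = False

# Iterate the sections directly instead of building a list of DataFrames
rows = []
for df_section in split_dataframe_by_series(df, zero_crossing):
    # With a DatetimeIndex, label slicing is inclusive, so neighbouring
    # sections share the row at the split point
    rows.append((df_section.index[0], len(df_section), df_section["Sine"].mean()))

summary = pd.DataFrame(rows, columns=["Start", "Rows", "MeanSine"])
print(summary)
```

Only one section is held by the loop at a time, and the small summary table is all that is kept afterwards.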