In our previous post we explored how to Split pandas DataFrame every time a column is True.
This slightly modified function also works if the given Series is not a column in the DataFrame:
def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True.
    Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass  # There is no split point => Ignore
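To illustrate passing a Series that is not a column, here is a minimal sketch on a toy dataset. The names `toy_df` and `split_here` and all values are made up for illustration and do not appear in the original example; the function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def split_dataframe_by_series(df, series):
    """Split a DataFrame where the given boolean series is True."""
    previous_index = df.index[0]
    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass  # There is no split point => Ignore

# Hypothetical toy dataset with a DatetimeIndex (illustration only)
index = pd.date_range("2021-01-01", periods=6, freq="D")
toy_df = pd.DataFrame({"Value": [1, 2, 3, 4, 5, 6]}, index=index)

# Standalone boolean Series sharing the index; it is NOT a column of toy_df
split_here = pd.Series([False, False, True, False, True, False], index=index)

sections = list(split_dataframe_by_series(toy_df, split_here))
for section in sections:
    print(section["Value"].tolist())
```

In this sketch the Series has to be boolean and share the DataFrame's index, since it is used as a row mask. Label-based slicing on a DatetimeIndex includes both endpoints, so each row marked True should appear as the last row of one frame and the first row of the next.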
Full example
We’ll use the ZeroCrossing column we built in our previous post on How to detect value change in pandas string column/series, which itself builds on our post on How to create pandas time series DataFrame example dataset. Based on that example, we add the modified utility function shown above:
import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False (single positional assignment instead of
# chained indexing, which may end up modifying a temporary copy)
df.iloc[0, df.columns.get_loc("ZeroCrossing")] = False

def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True.
    Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass  # There is no split point => Ignore

# Print result
split_frames = list(split_dataframe_by_series(df, df["ZeroCrossing"]))
print(f"Split DataFrame into {len(split_frames)} separate frames by zero-crossing")
Note that converting the result of split_dataframe_by_series() into a list might not be necessary, depending on your application. The function is a generator, so each frame is only produced when it is requested. If possible, I recommend iterating over the data frames directly using a for loop, e.g.:
for df_section in split_dataframe_by_series(df, df["ZeroCrossing"]):
    pass  # TODO: Your code goes here!
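As a rough idea of what the loop body might look like, the following sketch collects a small summary for each section. It uses a synthetic sine wave rather than the downloaded dataset so it runs offline; `demo_df`, `zero_crossing`, `summaries` and the sample values are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

def split_dataframe_by_series(df, series):
    """Split a DataFrame where the given boolean series is True."""
    previous_index = df.index[0]
    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass  # There is no split point => Ignore

# Synthetic stand-in for the example dataset (hypothetical values)
index = pd.date_range("2021-01-01", periods=40, freq="s")
demo_df = pd.DataFrame({"Sine": np.sin(np.linspace(0.1, 12.0, 40))}, index=index)

# Standalone boolean Series marking sign changes of the sine wave
positive = demo_df["Sine"] >= 0
zero_crossing = positive.shift() != positive
zero_crossing.iloc[0] = False  # The first sample has no predecessor

# Process each section as it is yielded, without building a list of frames
summaries = []
for df_section in split_dataframe_by_series(demo_df, zero_crossing):
    summaries.append({
        "start": df_section.index[0],
        "end": df_section.index[-1],
        "rows": len(df_section),
        "mean": df_section["Sine"].mean(),
    })

print(pd.DataFrame(summaries))
```

Only one section is handled at a time here, which is the main reason to prefer the loop over `list(...)` when you only need a per-section result.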