pandas

Jul

2021

How to add pandas pd.Timestamp

You can’t directly add Pandas pd.Timestamp instances:

t1 = pd.Timestamp('now')
t2 = pd.Timestamp('now')

t1 + t2
# TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

But you can convert them to a numpy timestamp using their asm8 attribute, convert that timestamp to an integer, add it and convert it back:

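import numpy as np
import pandas as pd

t1 = pd.Timestamp('now')
t2 = pd.Timestamp('now')

# asm8 is the numpy datetime64 view; this assumes nanosecond resolution
tsum = (t1.asm8.astype(np.int64) + t2.asm8.astype(np.int64))
tsum_timestamp = pd.Timestamp(tsum.astype('<M8[ns]'))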
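A variant of the same idea can be sketched with Timestamp.value, which holds the integer number of nanoseconds since the Unix epoch. The fixed dates below are assumptions chosen for illustration, so that the result is easy to check by hand:

```python
import pandas as pd

# Illustrative fixed timestamps (assumed for this sketch)
t1 = pd.Timestamp("1970-01-02")  # 1 day after the epoch
t2 = pd.Timestamp("1970-01-03")  # 2 days after the epoch

# .value is the integer number of nanoseconds since 1970-01-01
tsum = t1.value + t2.value

# Convert the integer sum back to a Timestamp
tsum_timestamp = pd.Timestamp(tsum, unit="ns")
print(tsum_timestamp)
```

The sum corresponds to 3 days after the epoch, i.e. 1970-01-04. Keep in mind that the sum of two timestamps is rarely meaningful on its own; it is mostly useful as an intermediate step, for example when averaging.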

Jul

2021

How to get average or mean between two pandas pd.Timestamp objects?

In order to compute the mean value between two pd.Timestamp instances, subtract them to obtain a pd.Timedelta, divide that Timedelta by 2 and then add the result to the first (smaller) timestamp:

t1 = pd.Timestamp('now')
t2 = pd.Timestamp('now')

mean_timestamp = t1 + ((t2 - t1) / 2)
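Since two calls to pd.Timestamp('now') are almost identical, the effect is easier to see with fixed dates. This is a minimal sketch with assumed example dates; it also shows that a pandas Series of timestamps has a mean() method, which may be more convenient for more than two values:

```python
import pandas as pd

# Illustrative fixed timestamps (assumed for this sketch)
t1 = pd.Timestamp("2021-07-01 00:00:00")
t2 = pd.Timestamp("2021-07-03 00:00:00")

# Midpoint via Timedelta arithmetic
mean_timestamp = t1 + ((t2 - t1) / 2)

# Alternative for an arbitrary number of timestamps
series_mean = pd.Series([t1, t2]).mean()

print(mean_timestamp)
print(series_mean)
```

Both approaches should yield 2021-07-02 00:00:00 for these inputs.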

May

2021

How to suppress legend in pandas .plot()

In order to suppress the legend when using pandas .plot(), use

legend=False

as an argument to .plot(), for example:

df.plot(legend=False)
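To check that the legend is really gone, you can inspect the matplotlib Axes object that .plot() returns. This is a minimal sketch with a made-up DataFrame; the Agg backend is selected only so that the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window required
import pandas as pd

# Small made-up dataset (assumed for this sketch)
df = pd.DataFrame({"Sine": [0.0, 1.0, 0.0], "Cosine": [1.0, 0.0, -1.0]})

# .plot() returns the matplotlib Axes it has drawn on
ax = df.plot(legend=False)

# get_legend() returns None if the Axes has no legend
print(ax.get_legend())
```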

Apr

2021

How to fix pandas to_sql() AttributeError: ‘DataFrame’ object has no attribute ‘cursor’

Problem:

You are trying to save your DataFrame in an SQL database using pandas to_sql(), but you see an exception like

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-3788db1a4131> in <module>
      7 db = sqlalchemy.create_engine('sqlite:///timeseries.db')
      8 
----> 9 df.to_sql('timeseries', df)

~/miniconda3/lib/python3.8/site-packages/pandas/core/generic.py in to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
   2603         from pandas.io import sql
   2604 
-> 2605         sql.to_sql(
   2606             self,
   2607             name,

~/miniconda3/lib/python3.8/site-packages/pandas/io/sql.py in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
    587         )
    588 
--> 589     pandas_sql.to_sql(
    590         frame,
    591         name,

~/miniconda3/lib/python3.8/site-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method)
   1825             dtype=dtype,
   1826         )
-> 1827         table.create()
   1828         table.insert(chunksize, method)
   1829 

~/miniconda3/lib/python3.8/site-packages/pandas/io/sql.py in create(self)
    719 
    720     def create(self):
--> 721         if self.exists():
    722             if self.if_exists == "fail":
    723                 raise ValueError(f"Table '{self.name}' already exists.")

~/miniconda3/lib/python3.8/site-packages/pandas/io/sql.py in exists(self)
    706 
    707     def exists(self):
--> 708         return self.pd_sql.has_table(self.name, self.schema)
    709 
    710     def sql_schema(self):

~/miniconda3/lib/python3.8/site-packages/pandas/io/sql.py in has_table(self, name, schema)
   1836         query = f"SELECT name FROM sqlite_master WHERE type='table' AND name={wld};"
   1837 
-> 1838         return len(self.execute(query, [name]).fetchall()) > 0
   1839 
   1840     def get_table(self, table_name, schema=None):

~/miniconda3/lib/python3.8/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1677             cur = self.con
   1678         else:
-> 1679             cur = self.con.cursor()
   1680         try:
   1681             cur.execute(*args, **kwargs)

~/miniconda3/lib/python3.8/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5137             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5138                 return self[name]
-> 5139             return object.__getattribute__(self, name)
   5140 
   5141     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'cursor'

Solution:

You’re calling to_sql() with the wrong arguments! The second argument needs to be the database connection (e.g. an sqlalchemy engine)! You’re probably calling it like this:

df.to_sql('timeseries', df)

but the second argument needs to be db (or whatever your database connection object is named), not df!

Full working example for `to_sql()`

import pandas as pd
# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///timeseries.db')

df.to_sql('timeseries', db, if_exists="replace")

Apr

2021

How to export Pandas dataset to SQLite database

In our previous post we showed how to connect to an SQLite database using sqlalchemy.

In this blogpost, we’ll show how you can export a pandas DataFrame – for example, our time series example dataset – to the SQLite database.

First, we’ll load the example data frame:

import pandas as pd
# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

Now we can open the SQLite database as shown in our previous post

import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///timeseries.db')

and export the DataFrame to the database:

df.to_sql('timeseries', db, if_exists="replace")

I always recommend using if_exists="replace" (i.e. if the table already exists, replace it) for a quicker development process.

The database looks like this when viewed in an SQLite viewer like HeidiSQL:

Complete code example

import pandas as pd
# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///timeseries.db')

df.to_sql('timeseries', db, if_exists="replace")
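To verify the export, you can read the table back into a DataFrame. The following sketch is self-contained and makes a few assumptions that differ from the example above: it uses a small made-up DataFrame instead of the example dataset, and an in-memory database opened with Python's built-in sqlite3 module (which pandas also accepts as a connection for SQLite) instead of an sqlalchemy engine:

```python
import sqlite3
import pandas as pd

# Small made-up dataset with a named index (assumed for this sketch)
df = pd.DataFrame({"Sine": [0.0, 1.0, 0.0], "Cosine": [1.0, 0.0, -1.0]})
df.index.name = "Sample"

# In-memory SQLite database, nothing is written to disk
con = sqlite3.connect(":memory:")

# Export: the index is written as a regular column named "Sample"
df.to_sql("timeseries", con, if_exists="replace")

# Read the table back and restore the index column
restored = pd.read_sql("SELECT * FROM timeseries", con, index_col="Sample")
con.close()

print(restored)
```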

Apr

2021

How to iterate column names including the index column in pandas

Just want to iterate over your DataFrame’s column names without the index column? See How to iterate column names in pandas!

In order to iterate over all column names including the index column name, use

for column_name in [df.index.name] + list(df.columns):
    print(column_name)

For example, we can print the name of all columns including the index column of the TechOverflow time series example dataset:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

for column_name in [df.index.name] + list(df.columns):
    print(column_name)

which will print

Timestamp
Sine
Cosine
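If you prefer an example that works offline, here is a sketch using a small made-up DataFrame with the same column names (the values are assumptions). It also shows reset_index() as an alternative way to obtain the same list of names:

```python
import pandas as pd

# Made-up stand-in for the example dataset (values are assumptions)
df = pd.DataFrame(
    {"Sine": [0.0, 1.0], "Cosine": [1.0, 0.0]},
    index=pd.DatetimeIndex(["2020-05-25", "2020-05-26"], name="Timestamp"),
)

# Index name first, then the regular columns
names = [df.index.name] + list(df.columns)

# Alternative: reset_index() turns the index into a regular column
names_via_reset = list(df.reset_index().columns)

print(names)
print(names_via_reset)
```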

Apr

2021

How to get the name of the index column in pandas

In order to get the name of the index column for a pandas DataFrame, use

df.index.name

For example, we can print the name of the index column of the TechOverflow time series example dataset:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

print(df.index.name)

which will print

Timestamp
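Note that a DataFrame with a default index has no index name, so df.index.name is None until you set an index. A minimal offline sketch with made-up data:

```python
import pandas as pd

# Made-up data (assumed for this sketch)
df = pd.DataFrame({"Timestamp": ["2020-05-25", "2020-05-26"], "Sine": [0.0, 1.0]})

# The default index has no name
name_before = df.index.name

# After set_index(), the index carries the name of the former column
df = df.set_index("Timestamp")
name_after = df.index.name

print(name_before, name_after)
```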

Apr

2021

How to iterate column names in pandas

In order to iterate column names in pandas, use

for column in df.columns:
    print(column)

For example, for the TechOverflow pandas time series example dataset, df.columns will be

Index(['Sine', 'Cosine'], dtype='object')

so iterating over the columns using

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Iterate over columns
for column in df.columns:
    print(column)

will print

Sine
Cosine

Note that df.columns will not show the index column!
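If you need the column values in addition to the names, DataFrame.items() yields both at once. This is a small sketch on made-up data; the dictionary of maxima is just an illustration of what you could do inside the loop:

```python
import pandas as pd

# Made-up data (assumed for this sketch)
df = pd.DataFrame({"Sine": [0.0, 1.0, 0.0], "Cosine": [1.0, 0.0, -1.0]})

# items() yields (column name, column Series) pairs
column_maxima = {}
for column_name, column_values in df.items():
    column_maxima[column_name] = column_values.max()

print(column_maxima)
```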

Apr

2021

How to replace pandas values by NaN by threshold

When processing pandas datasets, often you need to remove values above or below a given threshold from a dataset. One way to “remove” values from a dataset is to replace them by NaN (not a number) values which are typically treated as “missing” values.

For example: In order to replace values of the x column by NaN where the x column is < -0.75 in a DataFrame df, use this snippet:

import numpy as np

df.loc[df["x"] < -0.75, "x"] = np.nan

For example, we can run this on the TechOverflow pandas time series example dataset. The original dataset has two columns: Sine and Cosine and looks like this:

After running

df.loc[df["Sine"] < -0.75, "Sine"] = np.nan

you can see that all Sine values below -0.75 have been omitted from the plot, but all the values from the Cosine column are left unchanged:

Complete example code:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use("ggplot")

# Load pre-built time series example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Plot original data
df.plot()
plt.savefig("TimeSeries-Original.svg")

and this is the code to plot the filtered dataset:

df.loc[df["Sine"] < -0.75, "Sine"] = np.nan
df.plot()
plt.savefig("TimeSeries-NaN.svg")
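To see the effect without plotting, here is a sketch on a tiny made-up DataFrame. It shows the .loc assignment and, as an alternative, Series.mask(), which replaces values by NaN wherever the condition is True:

```python
import numpy as np
import pandas as pd

# Made-up data (assumed for this sketch)
df = pd.DataFrame({"Sine": [-1.0, -0.5, 0.5], "Cosine": [-0.9, 0.0, 0.9]})

# Variant 1: assign NaN via .loc (only the Sine column is touched)
masked = df.copy()
masked.loc[masked["Sine"] < -0.75, "Sine"] = np.nan

# Variant 2: Series.mask() returns a new Series with NaN where the condition holds
masked_alt = df.copy()
masked_alt["Sine"] = df["Sine"].mask(df["Sine"] < -0.75)

print(masked)
```

Only the first Sine value (-1.0) is below the threshold, so only that value becomes NaN. The Cosine value -0.9 stays untouched since the condition only looks at the Sine column.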

Apr

2021

How to replace string in column names in Pandas DataFrame

Use this snippet in order to replace a string in column names for a pandas DataFrame:

new_df = df.rename(columns=lambda s: s.replace("A", "B")) # df will not be modified !

You can also modify the column names in-place (i.e. modify the original DataFrame):

df.rename(columns=lambda s: s.replace("A", "B"), inplace=True)

For example, if you have the columns ["ColumnA", "X", "Y"] before running .rename(), the result will have ["ColumnB", "X", "Y"] (the "A" has been replaced by "B").
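The example just described can be sketched as follows, using a made-up one-row DataFrame:

```python
import pandas as pd

# Made-up DataFrame with the column names from the example
df = pd.DataFrame([[1, 2, 3]], columns=["ColumnA", "X", "Y"])

# rename() without inplace=True returns a new DataFrame
new_df = df.rename(columns=lambda s: s.replace("A", "B"))

print(list(df.columns))      # original is unchanged
print(list(new_df.columns))  # "A" replaced by "B"
```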

Mar

2021

How to auto-fit Pandas pd.to_excel() XLSX column width

If you export XLSX data using df.to_excel(), the column widths in the spreadsheet are left as default and are not adjusted automatically:

import pandas as pd

# Load example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Export dataset to XLSX
df.to_excel("example.xlsx")

Solution

You can use UliPlot's auto_adjust_xlsx_column_width in order to automatically adjust the column width.

pip install UliPlot

Then use it like this in order to export the XLSX:

import pandas as pd
from UliPlot.XLSX import auto_adjust_xlsx_column_width

# Load example dataset
df = pd.read_csv("https://datasets.techoverflow.net/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Export dataset to XLSX
with pd.ExcelWriter("example.xlsx") as writer:
    df.to_excel(writer, sheet_name="MySheet")
    auto_adjust_xlsx_column_width(df, writer, sheet_name="MySheet", margin=0)

Note that the algorithm currently tends to oversize the columns a bit, but in most cases, every type of column will fit.

Dec

2020

How to load sample dataset in Pandas

This code loads an example dataset in Pandas but requires an internet connection:

import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

This is what the iris dataset looks like:

For a more detailed answer with other options on how to import example datasets, see this StackOverflow post.

Jun

2020

What fraction of the year has passed until a given Timestamp in pandas?

To compute what fraction of the year has passed since the start of the year, use this function:

import pandas as pd

def fraction_of_year_passed(date):
    """Compute what fraction of the current year has already passed up to the given date"""
    start_of_year = pd.Timestamp(date.year, 1, 1)
    start_of_next_year = pd.Timestamp(date.year + 1, 1, 1)
    # Compute seconds in entire year and seconds since start of year
    entire_year_seconds = (start_of_next_year - start_of_year).total_seconds()
    seconds_since_start_of_year = (date - start_of_year).total_seconds()
    return seconds_since_start_of_year / entire_year_seconds

Usage example:

print(fraction_of_year_passed(pd.Timestamp("2020-03-01"))) # prints 0.16393442622950818

Detailed explanation:

First, we define the start of the calendar year that date belongs to, and the start of the calendar year after that:

start_of_year = pd.Timestamp(date.year, 1, 1)
start_of_next_year = pd.Timestamp(date.year + 1, 1, 1)

Now we compute the number of seconds in the entire year and the number of seconds passed between the start of the year and date:

entire_year_seconds = (start_of_next_year - start_of_year).total_seconds()
seconds_since_start_of_year = (date - start_of_year).total_seconds()

The rest is simple: Just divide seconds_since_start_of_year / entire_year_seconds to obtain what fraction of the year has passed until date.
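As a sanity check, here is a self-contained sketch of the function together with two inputs that can be verified by hand. The second test date is an assumption chosen for illustration: in the non-leap year 2021, noon on July 2nd is exactly 182.5 of 365 days into the year.

```python
import pandas as pd

def fraction_of_year_passed(date):
    """Compute what fraction of date's calendar year has passed up to date"""
    start_of_year = pd.Timestamp(date.year, 1, 1)
    start_of_next_year = pd.Timestamp(date.year + 1, 1, 1)
    # Seconds in the entire year and seconds since the start of the year
    entire_year_seconds = (start_of_next_year - start_of_year).total_seconds()
    seconds_since_start_of_year = (date - start_of_year).total_seconds()
    return seconds_since_start_of_year / entire_year_seconds

# 2020 is a leap year: 60 of 366 days have passed on March 1st
leap_year_fraction = fraction_of_year_passed(pd.Timestamp("2020-03-01"))

# 2021 is not a leap year: 182.5 of 365 days have passed at noon on July 2nd
mid_year_fraction = fraction_of_year_passed(pd.Timestamp("2021-07-02 12:00:00"))

print(leap_year_fraction, mid_year_fraction)
```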

Jun

2020

How to compute number of days in a year in Pandas

In our previous post we showed how to use the pendulum library in order to compute the number of days in a given year.

This post shows how to achieve the same using pandas:

import pandas as pd
def number_of_days_in_year(year):
    start = pd.Timestamp(year, 1, 1)
    end = pd.Timestamp(year + 1, 1, 1)
    return (end - start).days

Usage example:

print(number_of_days_in_year(2020)) # Prints 366
print(number_of_days_in_year(2021)) # Prints 365

Explanation:

First, we define the start date to be the first day (1st of January) of the year we’re interested in:

start = pd.Timestamp(year, 1, 1)

Now we generate the end date, which is the 1st of January of the following year:

end = pd.Timestamp(year + 1, 1, 1)

The rest is simple: Just compute the difference (end - start) and ask pandas to give us the number of days:

(end - start).days
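An alternative that avoids the subtraction is to ask pandas whether the year is a leap year, using the is_leap_year attribute of pd.Timestamp. The function name below is hypothetical and only used for this sketch:

```python
import pandas as pd

def number_of_days_in_year_via_leap(year):
    """Hypothetical helper: 366 days for leap years, 365 otherwise"""
    return 366 if pd.Timestamp(year, 1, 1).is_leap_year else 365

days_2020 = number_of_days_in_year_via_leap(2020)
days_2021 = number_of_days_in_year_via_leap(2021)
days_1900 = number_of_days_in_year_via_leap(1900)  # divisible by 100 but not by 400
days_2000 = number_of_days_in_year_via_leap(2000)  # divisible by 400

print(days_2020, days_2021, days_1900, days_2000)
```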

Jun

2020

How to generate range of dates in pandas

In this example, we’ll create a list of pandas Timestamp objects that represent 100 consecutive days, starting at a fixed date:

start_date = pd.Timestamp("2020-03-01")

Generating the 100 consecutive days is easy:

all_days = [start_date + pd.Timedelta(d, "days") for d in range(100)]

Note that range(100) will generate all numbers from 0 up to and including 99. Hence, [pd.Timedelta(d, "days") for d in range(100)] will generate a list of Timedeltas that represent 0 days, 1 day, 2 days, …, 99 days.

Full example:

import pandas as pd

start_date = pd.Timestamp("2020-03-01")
all_days = [start_date + pd.Timedelta(d, "days") for d in range(100)]

print(all_days)
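pandas also provides pd.date_range() for this purpose. The following sketch generates the same 100 days as a DatetimeIndex instead of a list; whether you prefer a list or a DatetimeIndex depends on your application:

```python
import pandas as pd

start_date = pd.Timestamp("2020-03-01")

# 100 consecutive days, starting at start_date, with daily frequency
all_days = pd.date_range(start=start_date, periods=100, freq="D")

print(all_days[0])   # first day
print(all_days[-1])  # last day, 99 days after the first
```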

May

2020

Split pandas DataFrame every time a Series is True

In our previous post we explored how to Split pandas DataFrame every time a column is True.

This slightly modified function also works if the given Series is not a column in the DataFrame:

def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True. Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

Full example

We’ll use the ZeroCrossing column we built in our previous post on How to detect value change in pandas string column/series which itself builds on our post on How to create pandas time series DataFrame example dataset. Based on that example we add the modified utility function shown above:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.loc[df.index[0], "ZeroCrossing"] = False

def split_dataframe_by_series(df, series):
    """
    Split a DataFrame where the given series is True. Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[series].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

# Print result
split_frames = list(split_dataframe_by_series(df, df["ZeroCrossing"]))
print(f"Split DataFrame into {len(split_frames)} separate frames by zero-crossing")

Note that converting the result of split_dataframe_by_series() into a list might not be necessary depending on your application. If possible, I recommend directly iterating the data frames using a for loop, e.g.:

for df_section in split_dataframe_by_series(df, df["ZeroCrossing"]):
    pass # TODO: Your code goes here!
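To get a feeling for what the generator yields, here is a sketch on a tiny made-up dataset. It restates the function using .loc and a None check instead of catching UnboundLocalError, which is an assumption of this sketch and not the code from the post. Note that label-based slices include both endpoints, so the row at each split point appears at the end of one frame and at the start of the next:

```python
import pandas as pd

def split_dataframe_by_series(df, series):
    """Restated variant: split df at every index where series is True"""
    previous_index = df.index[0]
    split_point = None
    for split_point in df[series].index:
        # Label-based slicing with .loc includes both endpoints
        yield df.loc[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset (only if there was at least one split point)
    if split_point is not None:
        yield df.loc[split_point:]

# Made-up dataset: 6 days, split points on day 3 and day 5
index = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame({"Value": [0, 1, 2, 3, 4, 5]}, index=index)
flags = pd.Series([False, False, True, False, True, False], index=index)

frame_lengths = [len(frame) for frame in split_dataframe_by_series(df, flags)]
print(frame_lengths)
```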

May

2020

Split pandas DataFrame every time a column is True

TL;DR

If the Series you want to use to split is a column in the DataFrame, continue reading this post. Else, read Split pandas DataFrame every time a Series is True.

Use this utility function:

def split_dataframe_by_column(df, column):
    """
    Split a DataFrame where a column is True. Yields a number of dataframes
    """
    previous_index = df.index[0]

    for split_point in df[df[column]].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

# Usage example:
list(split_dataframe_by_column(df, "ZeroCrossing"))

Note that one or more of those dataframes might be empty.

Full example:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.loc[df.index[0], "ZeroCrossing"] = False

def split_dataframe_by_column(df, column):
    """Split a DataFrame where a column is True. Yields a number of dataframes"""
    previous_index = df.index[0]

    for split_point in df[df[column]].index:
        yield df[previous_index:split_point]
        previous_index = split_point
    # Yield remainder of dataset
    try:
        yield df[split_point:]
    except UnboundLocalError:
        pass # There is no split point => Ignore

# Print result
split_frames = list(split_dataframe_by_column(df, "ZeroCrossing"))
print(f"Split DataFrame into {len(split_frames)} separate frames by zero-crossing")
# This prints "Split DataFrame into 20 separate frames by zero-crossing"
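If you need segments that do not overlap, one alternative (not the method from this post) is to number the segments using cumsum() on the boolean column and then group by that number. The following is a minimal sketch on made-up data:

```python
import pandas as pd

# Made-up dataset: 6 days, ZeroCrossing is True on day 3 and day 5
index = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame({
    "Value": [0, 1, 2, 3, 4, 5],
    "ZeroCrossing": [False, False, True, False, True, False],
}, index=index)

# cumsum() increases by one at every True value => one ID per segment
segment_ids = df["ZeroCrossing"].cumsum()

# Every group is one segment; each row belongs to exactly one segment
segments = [segment for _, segment in df.groupby(segment_ids)]

print([len(segment) for segment in segments])
```

In this variant, every row where the column is True starts a new segment.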

May

2020

Get index where column is True in pandas

TL;DR

Simply use

df[df["ZeroCrossing"]].index

Full example:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.loc[df.index[0], "ZeroCrossing"] = False

# Print result
print(df[df["ZeroCrossing"]].index)

This prints

DatetimeIndex(['2020-05-25 20:05:10.040874', '2020-05-25 20:05:10.090874',
               '2020-05-25 20:05:10.140874', '2020-05-25 20:05:10.190874',
               '2020-05-25 20:05:10.240874', '2020-05-25 20:05:10.290874',
               '2020-05-25 20:05:10.340874', '2020-05-25 20:05:10.390874',
               '2020-05-25 20:05:10.440774', '2020-05-25 20:05:10.490874',
               '2020-05-25 20:05:10.540874', '2020-05-25 20:05:10.590874',
               '2020-05-25 20:05:10.640774', '2020-05-25 20:05:10.690874',
               '2020-05-25 20:05:10.740874', '2020-05-25 20:05:10.790874',
               '2020-05-25 20:05:10.840874', '2020-05-25 20:05:10.890774',
               '2020-05-25 20:05:10.940874'],
              dtype='datetime64[ns]', name='Timestamp', freq=None)
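The same approach works on any boolean column. Here is an offline sketch on made-up data that also shows an equivalent variant which indexes df.index directly using the boolean values:

```python
import pandas as pd

# Made-up dataset (assumed for this sketch)
df = pd.DataFrame(
    {"ZeroCrossing": [False, True, False, True]},
    index=pd.date_range("2020-05-25", periods=4, freq="D"),
)

# Variant 1: filter the rows, then take the index
via_filter = df[df["ZeroCrossing"]].index

# Variant 2: index the index directly with the boolean array
via_index = df.index[df["ZeroCrossing"].to_numpy()]

print(via_filter)
```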

May

2020

How to convert pandas Timedelta to seconds

TL;DR

Use

my_timedelta / np.timedelta64(1, 's')

Full example

import pandas as pd
import numpy as np
import time

# Create timedelta
t1 = pd.Timestamp("now")
time.sleep(3)
t2 = pd.Timestamp("now")
my_timedelta = t2 - t1

# Convert timedelta to seconds
my_timedelta_in_seconds = my_timedelta / np.timedelta64(1, 's')
print(my_timedelta_in_seconds) # prints approximately 3.0, e.g. 3.00154
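pd.Timedelta also has a total_seconds() method that yields the same number. This sketch uses a fixed, made-up Timedelta so that the result does not depend on timing:

```python
import numpy as np
import pandas as pd

# Fixed Timedelta of 1 minute and 30.5 seconds (assumed for this sketch)
my_timedelta = pd.Timedelta(minutes=1, seconds=30.5)

# Variant 1: divide by a one-second numpy timedelta
via_division = my_timedelta / np.timedelta64(1, "s")

# Variant 2: use the total_seconds() method
via_method = my_timedelta.total_seconds()

print(via_division, via_method)
```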

May

2020

How to detect value change in pandas string column/series

TL;DR

In order to get a series that is True every time the input string column changes, use

my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]

The first value of this Series will always be True since the value is considered to be NaN before the start of the series (due to the behaviour of shift()). In order to force the first value to be False, use

my_column_changes.iloc[0] = False

In order to get the rows in the dataframe where the column changes, use

df[my_column_changes]

or use this one-liner:

df[df["MyStringColumn"].shift() != df["MyStringColumn"]]

In order to assign this value to a new column in the DataFrame, use e.g.

df["MyStringColumnChanges"] = df["MyStringColumn"].shift() != df["MyStringColumn"]

Full example:

First we load our example from our previous post on How to create pandas time series DataFrame example dataset:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

Now we create a new column that contains "Positive" if the sine wave value in the "Sine" column is positive (or zero) or "Negative" if that value is negative:

df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})

Now we create the ZeroCrossing column using the method shown above:

# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]

… and set the first entry to False since we don’t consider the start of the series to be a zero crossing:

df.loc[df.index[0], "ZeroCrossing"] = False

Now we can use

df[df["ZeroCrossing"]]

to show the rows in the DataFrame where the zero crossing happened:

                                    Sine   Cosine SinePositive  ZeroCrossing
Timestamp                                                                   
2020-05-25 20:05:10.040874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.090874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.140874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.190874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.240874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.290874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.340874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.390874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.440774 -2.450532e-15 -1.00000     Negative          True
2020-05-25 20:05:10.490874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.540874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.590874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.640774 -1.960673e-15 -1.00000     Negative          True
2020-05-25 20:05:10.690874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.740874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.790874  6.283144e-03  0.99998     Positive          True
2020-05-25 20:05:10.840874 -6.283144e-03 -0.99998     Negative          True
2020-05-25 20:05:10.890774  4.901063e-15  1.00000     Positive          True
2020-05-25 20:05:10.940874 -6.283144e-03 -0.99998     Negative          True

Full example code:

import pandas as pd

# Load pre-built time series example dataset
df = pd.read_csv("https://techoverflow.net/datasets/timeseries-example.csv", parse_dates=["Timestamp"])
df.set_index("Timestamp", inplace=True)

# Create a new column containing "Positive" or "Negative"
df["SinePositive"] = (df["Sine"] >= 0).map({True: "Positive", False: "Negative"})
# Create "change" column (boolean)
df["ZeroCrossing"] = df["SinePositive"].shift() != df["SinePositive"]
# Set first entry to False
df.loc[df.index[0], "ZeroCrossing"] = False

# Print result
print(df[df["ZeroCrossing"]])
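If you want to try the shift() comparison without downloading the dataset, here is a sketch on a small made-up string Series:

```python
import pandas as pd

# Made-up string Series (assumed for this sketch)
s = pd.Series(["Negative", "Negative", "Positive", "Positive", "Negative"])

# True wherever the value differs from the previous value
changes = s.shift() != s

# The first value is always True since there is no previous value
first_value_before_fix = bool(changes.iloc[0])
changes.iloc[0] = False

# Positions where the value changed
change_positions = s[changes].index.tolist()

print(changes.tolist())
print(change_positions)
```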
