Pandas Dataframe to Database
From Dataframe to Database Table
Published: December 2020
There are different database connectors, but this post is mainly about using SQLAlchemy. You probably already know that pandas offers .to_sql() on dataframes and that it is extremely user friendly (especially when you add some database configuration and define column datatypes, this method can be very fast).
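For reference, a plain .to_sql() call looks roughly like the following; the connection URL, table name, and dtype mapping are made-up placeholders.
import pandas
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///example.db")  # placeholder connection
dataframe = pandas.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# chunksize and an explicit dtype mapping are the usual knobs for speed and consistency
dataframe.to_sql(
    "some_table",
    con=engine,
    if_exists="append",
    index=False,
    chunksize=10_000,
    dtype={"name": sqlalchemy.types.String(50)},
)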
This post will provide a (potentially) quicker route to add data to your database with SQLAlchemy.
Disclaimer: For most databases, you still need the classic insert into {} ... statement to push data; SQLAlchemy is not a replacement for SQL.
There are two common problems when someone tries to create the insert statement via f-strings (both illustrated below):
- Formatting dates to the specification of the database
- Handling nulls. If your dataframe has NaN or NaT values, definitely change them to None!
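Here is a quick sketch of what goes wrong with naive f-string interpolation; the table and column names are invented for the example.
import pandas

row = pandas.Series({"created_at": pandas.NaT, "amount": float("nan")})

# The nulls come out as the literal text 'NaT' and a bare nan, which the database
# will either reject or store as garbage. A real timestamp would be interpolated
# in pandas' default string format, which may not match what the database expects.
bad_sql = (
    "insert into some_table (created_at, amount) "
    f"values ('{row['created_at']}', {row['amount']})"
)
print(bad_sql)
# insert into some_table (created_at, amount) values ('NaT', nan)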
The Idea
- Reflect your table
- Make sure your dataframe has None for every null value
- Turn your dataframe into a list of dictionaries (this is what is used to bind variables/parameters)
- Execute the insert statement
If you have too large a dataframe (say 100K rows or more), I suggest chunking the list (see the partitioned loop at the end of the snippet).
The Snippet
import time
import pandas
import sqlalchemy
# assume you have a sqlalchemy engine and we want to push data to table `some_table`
engine: sqlalchemy.engine.base.Engine # define your engine - connector
dataframe: pandas.DataFrame # your data you want to push
# 1. Reflect - just one way (some db won't allow you to reflect properly)
table = sqlalchemy.Table(
    'some_table',
    sqlalchemy.MetaData(engine),  # you can also provide a schema argument to MetaData
    autoload=True
)
# 2. Handle Nulls
dataframe_ = dataframe.where(pandas.notnull(dataframe), None)
# 3. Turn the dataframe into a list of records to bind as parameters
records = dataframe_.to_dict(orient='records')
# 4. Push - execute statement
result = engine.execute(table.insert(), records)
# how many were inserted
print(f"Successfully inserted {result.rowcount} records")
result.close()
"""
if you partitioned, you could use a for-loop
for group in toolz.partition_all(30000, records):
try:
_ = engine.execute(table.insert(), group)
except sqlalchemy.exc.OperationalError as e:
# usually this happens when minor connection problems occur
# or you're adding too many records too quick
# something that works is to provide some cooldown time
time.sleep(5) # maybe 5 seconds
# then try to insert again
try:
...
except ... as e_:
# at this point might as well just raise the error
# but somethings would have been inserted already :(
raise e_
"""
Make sure your dataframe has the same column names as your table, and that the casing matches (lower, upper, camelCase, etc.).
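One way to check and enforce this, reusing the reflected table and dataframe_ from the snippet above (the guard itself is just a suggestion):
# column names exactly as the database knows them, taken from the reflected table
expected = list(table.columns.keys())

missing = set(expected) - set(dataframe_.columns)
if missing:
    raise ValueError(f"dataframe is missing columns: {missing}")

# reorder (and drop any extra columns) so the dataframe matches the table exactly
dataframe_ = dataframe_[expected]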
Drawbacks:
- This probably only works for some databases.
- When you execute insert statements like this, you cannot really undo them. SQLAlchemy offers Session and Transaction objects that allow you to revert actions (a rough sketch follows below). This post is for when you are absolutely sure you want to push the data and would have done a bulk insert anyway. https://www.pythonsheets.com/notes/python-sqlalchemy.html#object-relational-add-data
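For completeness, the transactional route can be as small as this (reusing the engine, table, and records from above): everything inside the with-block is committed together and rolled back automatically if an exception escapes.
# one transaction: commits on success, rolls back if anything raises
with engine.begin() as connection:
    connection.execute(table.insert(), records)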
My personal preference (if possible) is to use objects (declarative models) to declare your database metadata.
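For example, the same table could be declared up front with SQLAlchemy's declarative system instead of being reflected; the column names and types here are invented.
import sqlalchemy
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class SomeTable(Base):
    __tablename__ = 'some_table'

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    created_at = sqlalchemy.Column(sqlalchemy.DateTime)
    amount = sqlalchemy.Column(sqlalchemy.Float)

# the Table object used earlier is then simply
table = SomeTable.__table__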