How we make database schema migrations safe and robust at Defacto

September 4, 2025 · 5 min read

Defacto provides fast, transparent, and short-term financing for SMBs. Our clients can go from creating an account to receiving funds in their bank account in just a few minutes. This speed and reliability are core to our product.

We're a relatively small engineering team compared to the business complexity we face. A high-quality data model is essential for maintaining our development velocity, which means we frequently adapt our database schema.

While schema changes are simple on a database with little traffic, they become challenging as soon as you're handling complex, concurrent operations. Pathological cases can lead to complete outages—a critical business impact for a Fintech company.

Over time, we've developed a set of tools and practices that have completely eliminated outages related to DB schema migrations. Here’s how we did it.

Our database migrations technical stack

Our DB migrations technical stack is fairly simple:

  • a single shared PostgreSQL database
  • Python for the back-end code
  • SQLAlchemy for data model definition and performing DB queries
  • Alembic for generating, writing, and running migrations

Why database schema changes can be complex

Two main issues make database migrations a high-stakes operation in a production environment.

1. Exclusive locking

Many schema change operations, at least in PostgreSQL, acquire an exclusive lock on the entities they modify. This lock prevents other operations from reading from or writing to those entities until the migration is complete. The schema change query waits for its turn to acquire the lock, and all subsequent queries on that table wait behind it.

This would be fine if all SQL queries were fast and all schema change operations were instantaneous. Unfortunately, that's not the reality. A single long-running query can prevent the migration from acquiring a lock, causing the migration to wait.

While it waits, every new application query on that table gets queued. This cascading effect can quickly exhaust all of the application’s processing capacity, resulting in a complete outage.
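
To make this failure mode concrete, here is a minimal sketch (not our production code) that reproduces the queueing behaviour. It assumes SQLAlchemy with a PostgreSQL driver, a scratch database reachable at the placeholder DSN, and an existing table called my_table.

from sqlalchemy import create_engine, text

DSN = "postgresql://localhost/scratch"  # placeholder connection string
engine = create_engine(DSN)

# Connection 1: an application transaction that reads the table and stays open.
app_conn = engine.connect()
app_tx = app_conn.begin()
app_conn.execute(text("SELECT count(*) FROM my_table"))  # holds ACCESS SHARE until commit/rollback

# Connection 2: the migration. ALTER TABLE needs an ACCESS EXCLUSIVE lock,
# which conflicts with the open transaction above, so it waits, and every new
# query on my_table now queues behind it. A lock_timeout makes it fail fast
# instead of hanging.
mig_conn = engine.connect()
mig_conn.execute(text("SET lock_timeout = '1s'"))
try:
    mig_conn.execute(text("ALTER TABLE my_table ADD COLUMN tmp integer"))
except Exception as exc:  # the driver raises a "lock not available" error
    print("migration could not acquire the lock:", exc)
finally:
    app_tx.rollback()
    app_conn.close()
    mig_conn.close()

The blocked ALTER TABLE fails quickly with a lock-timeout error rather than hanging, which is exactly the behaviour the lock timeout described in section 4 below relies on.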

When deploying schema changes multiple times a week, this risk becomes a significant business liability.

2. Concurrent code versions

Most modern deployment workflows involve a rolling update, where for a period of time, different versions of your application code are running simultaneously against the same database. This creates a race condition: for example, old code might try to read or write to a column that a new migration has just dropped.

This can result in a flood of application errors during the deployment window, damaging your service level agreements (SLAs) and user trust.

Defacto’s approach to zero-downtime DB schema migrations

To mitigate both the locking and the concurrent code version problems, our approach rests on four key pillars:

  1. Adopting the expand -> migrate -> contract migration pattern.
  2. Setting up a DB migration linter to avoid common anti-patterns.
  3. Providing clear guidelines to engineers throughout the development and deployment process.
  4. Using PostgreSQL features to avoid being impacted by long-running locks.

1. Adopting the Expand -> Migrate -> Contract pattern

Intuitively, developers tend to group all the changes required to reach the desired end state of a data model change into a single migration. It feels natural, and it usually works when traffic and workloads are low, which builds the expectation that it will keep working just fine every time, forever.

When traffic and workloads increase, it can still work fine for simple migrations, like adding a new column to a table. However, it starts to break down for any complex migration. Splitting migrations into separate smaller steps goes a long way in making them both a lot more robust and simpler to manage.

We split every potentially breaking migration into three distinct, separately deployed phases.

  1. Expand: In the first deployment, we only add new structures. This could be a new column or a new table. The old application code is completely unaware of these changes and continues to function as before. The new code being deployed doesn't use the new structures yet.
  2. Migrate: In the second phase, we deploy application code that utilizes the new structures. This typically involves a multi-step deployment. For a data model change that renames a column, the steps would be (see the sketch after this list):
    • Deploy code that writes to both the old and new columns.
    • Run a data backfill to populate the new column with data derived from the old one.
    • Deploy code that switches reads to the new column and stops writing to the old column.
  3. Contract: Once the new application code is fully rolled out, stable, and exclusively using the new schema, we deploy a final, separate migration to remove the old structures (e.g., drop the old column).
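
As a minimal sketch of what the first two phases can look like as Alembic migrations (the table and column names here are hypothetical, not our real schema):

from alembic import op
import sqlalchemy as sa

# Expand: add the new column as nullable, so there is no table rewrite,
# no NOT NULL validation, and old code keeps working untouched.
def upgrade():
    op.add_column("my_table", sa.Column("new_name", sa.String(), nullable=True))

def downgrade():
    op.drop_column("my_table", "new_name")

Once the dual-write code is deployed, the backfill can be its own migration (for large tables you would typically batch this):

def upgrade():
    # Backfill rows written before the dual-write code was deployed.
    op.execute("UPDATE my_table SET new_name = old_name WHERE new_name IS NULL")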

At every stage, every running version of the application code is compatible with the current database schema.

Preparing for contraction with SQLAlchemy

A subtle but crucial part of the "contract" phase is ensuring the application has stopped using the old column before you drop it. With an ORM like SQLAlchemy, your models might still try to load a column from the database even if it's not explicitly used in your business logic. To prevent this, we temporarily configure the ORM to stop loading the column that is about to be removed.

For example, to stop loading an old_column_to_drop before removing it, we can mark it as deferred:


from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, deferred

Base = declarative_base()

class MyModel(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    new_column = Column(String)

    # Defer loading of the column we are about to drop.
    # The ORM will no longer try to SELECT it in standard queries.
    old_column_to_drop = deferred(Column(String))

To enforce this practice, we use a dedicated pre-commit hook that checks for columns being removed by a DB migration and errors if they are not marked as deferred in the ORM model.

2. Setting up a DB migration linter

To catch dangerous patterns before they ever reach production, we integrated Squawk as a pre-commit hook. Squawk is a linter for PostgreSQL schema migrations that automatically flags risky operations in migration files, such as:

  • Adding a column with a DEFAULT value (which can lock the table).
  • Adding a column with a NOT NULL constraint from the start.
  • Adding a new index without the CONCURRENTLY keyword.

This linter serves as an automated first line of defense, enforcing our safety standards and teaching engineers about potential pitfalls.
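
For instance, the safe way to introduce a NOT NULL column is to stage it. A sketch of what that can look like with Alembic (the table and column names are hypothetical):

from alembic import op
import sqlalchemy as sa

# First migration: add the column as nullable; no long-held lock required.
def upgrade():
    op.add_column("my_table", sa.Column("status", sa.String(), nullable=True))

# A later migration, once the data has been backfilled: add the constraint.
# SET NOT NULL takes an exclusive lock and scans the table to validate,
# so it is kept in its own small migration.
def later_upgrade():
    op.alter_column("my_table", "status", nullable=False)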

3. Providing clear engineering guidelines

Automation can't solve everything. Critical thinking is still required. To support our developers, we've put a few lightweight processes in place.

  • GitHub PR Checklist: Our pull request template includes a manual checklist for any PR containing a database migration. It prompts the author to confirm they've followed best practices, considered rollback scenarios, and used the expand/contract pattern if necessary.
  • Explicit Confirmation: An automated GitHub check requires a db-migration-ok label to be added to any DB-migration-related PR. This acts as a deliberate sign-off, forcing the developer to pause and double-check their work against the checklist.
  • Emergency Playbook: We have a central Notion page, accessible via a memorable golink, with instructions for handling a stuck migration. It includes psql queries to identify which transaction is blocking the migration and safe procedures for resolving the issue.
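
The heart of that playbook is the standard PostgreSQL trick of asking pg_stat_activity which sessions are blocked and by whom. A query along these lines does the job; here it is wrapped in SQLAlchemy for illustration (the DSN is a placeholder):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/mydb")  # placeholder DSN

BLOCKED_QUERIES = text("""
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           state,
           query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0
""")

with engine.connect() as conn:
    for row in conn.execute(BLOCKED_QUERIES):
        print(f"{row.pid} is blocked by {row.blocked_by}: {row.query}")

From there, pg_cancel_backend or pg_terminate_backend on the blocking pid is usually enough to unstick the migration.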

4. Using PostgreSQL features to mitigate locks

Finally, we leverage PostgreSQL's own features to build a system that is more resilient to issues related to DB migrations.

Setting a Lock Timeout

Instead of letting a migration wait indefinitely for a lock (and causing an outage), we configure a timeout. A failed migration is always better than an outage. We apply this timeout directly within our Alembic migrations.


# In your db/env.py
from alembic import context
from sqlalchemy import create_engine

def run_migrations_online():
    # ...
    # lock_timeout is in milliseconds: fail after 2 seconds of waiting
    # instead of queueing every other query on the table behind us.
    engine = create_engine(
        context.config.get_main_option("sqlalchemy.url"),
        connect_args={
            "options": "-c lock_timeout=2000"
        },
    )

    with engine.connect() as connection:
        context.configure(connection=connection)
        with context.begin_transaction():
            context.run_migrations()

Avoiding long-held locks

The root cause of a blocked migration is often a long-running transaction from the application itself. We're working to enforce an idle_in_transaction_session_timeout across our application, which automatically terminates sessions that are sitting idle within a transaction. This drastically reduces the window of opportunity for a stray transaction to block a critical deployment.
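
The setting can be applied per role, per database, or per connection. A sketch of the per-connection variant with SQLAlchemy, in the same spirit as the lock_timeout configuration above (the 30-second value is purely illustrative):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://localhost/mydb",  # placeholder DSN
    connect_args={
        # Terminate sessions that sit idle inside a transaction for more
        # than 30 seconds (the value is in milliseconds).
        "options": "-c idle_in_transaction_session_timeout=30000"
    },
)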

Making migrations retryable

The goal of using lock timeouts is to make migrations fail early instead of locking the DB for a long time and causing an outage.

This means we often have to retry some migrations. When a migration fails, it is retried automatically, up to 100 times. If it still fails after that, a developer needs to rerun the CI process that runs it, ideally after investigating and resolving the underlying problem.
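
A retry wrapper of this kind is simple to sketch (the command and back-off value here are illustrative, not our actual pipeline):

import subprocess
import time

MAX_ATTEMPTS = 100

for attempt in range(1, MAX_ATTEMPTS + 1):
    # With a lock_timeout in place, a blocked migration exits non-zero
    # quickly instead of hanging, so retrying is cheap.
    result = subprocess.run(["alembic", "upgrade", "head"])
    if result.returncode == 0:
        break
    time.sleep(5)  # brief back-off before the next attempt
else:
    raise SystemExit(f"migration still failing after {MAX_ATTEMPTS} attempts")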

To be able to retry migrations safely, they need to be atomic, or at least idempotent. We make most of our migrations atomic by running all steps of a given migration in its own transaction, and we leverage PostgreSQL’s transactional DDL.

Migrations that can’t be atomic (e.g. adding indices concurrently) are made idempotent by using IF NOT EXISTS clauses.
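
For example, one way to write an idempotent, non-atomic index creation in Alembic (the index and table names are hypothetical):

from alembic import op

def upgrade():
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction, so this
    # migration is not atomic; IF NOT EXISTS makes it safe to retry.
    with op.get_context().autocommit_block():
        op.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_my_table_new_column "
            "ON my_table (new_column)"
        )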

The DB migration linter described above checks that every migration is either atomic or idempotent, and raises a pre-commit hook error when that's not the case.

Results and lessons learned

The results have been clear: we have completely eliminated outages caused by database schema migrations. The worst-case scenario we face today is a migration that fails due to a lock timeout and needs to be retried.

However, the transition wasn't without its challenges:

  • Complexity is still a factor:
    • Writing multi-stage migrations requires more planning.
    • Squawk's warnings can be complex and difficult for engineers less familiar with PostgreSQL internals to decipher.
  • Concurrent index creation is confusing: Adding an index concurrently (CREATE INDEX CONCURRENTLY) cannot be run inside a transaction and can take an unpredictable amount of time. This doesn't fit neatly into a standard migration file. Our current process—running the command manually and then adding the index with an IF NOT EXISTS clause in a later migration—feels convoluted to many.

Next steps

We're continuing to improve our process with a focus on three areas:

  1. Leveraging more PostgreSQL features: We are fixing the remaining parts of our application that prevent us from using an aggressive idle_in_transaction_session_timeout globally. This will further reduce the need for manual intervention on failing migrations.
  2. Improving the developer experience: We plan to build more automation and tooling to make safe migrations the easiest path. This includes better linters and tooling to streamline concurrent index creation and the expand/contract pattern.
  3. Exploring different paradigms: We also plan to explore entirely new paradigms like pgroll, which uses a virtual schema approach to offer native zero-downtime migrations. If you have any production experience with it, we’d be more than happy to hear about it.

Julien Gilli & Gabriel Juchault
