Database Migration Strategies

Intermediate 10 min read

What you'll learn

✓Why naive migrations break production
✓The expand and contract pattern
✓How dual writes and backfills work
✓When to use online schema change tools

Prerequisites

•Comfort with SQL and deployments

Database migrations are where careful engineering pays its biggest dividends and where shortcuts cause the loudest outages. This post walks through the strategies that hold up when the table you are changing has a hundred million rows and your service cannot go down.

What and why

A database migration is any change to schema or data in a running system. The simplest version is a developer running ALTER TABLE in production. The hardest version is reshaping a primary key on a busy table across a multi-region cluster while traffic continues. Between those extremes lies a small set of well understood strategies.

The reason to study them is that the obvious approach (deploy new code and a schema change together) only works at small scale. At any meaningful size, schema changes take minutes or hours, and the application and database evolve out of step. The strategies below exist to make that gap safe.

Mental model

Think of a migration as having three actors: the old code, the new code, and the database. At any moment, all three must agree on what the data looks like. If you change the database before the code is ready, old replicas of the code break. If you change the code before the database is ready, the new code crashes. The whole craft of migrations is keeping all three in a compatible state at every step.

Once you internalize that, the patterns name themselves. Expand the schema so it works with both versions of the code. Roll out the new code. Then contract the schema by removing the parts only the old code used.

Architecture

The expand and contract pattern is the workhorse for non trivial migrations.

Step 1 (expand): add new column / table, keep old in place
old code -> reads/writes old
new code -> reads/writes both, prefers new

Step 2 (backfill): copy data from old to new in batches
background job -> reads old -> writes new

Step 3 (dual write switch): new code writes both, reads new
verify drift, fix anomalies

Step 4 (contract): remove old column / table once safe
only new code is running -> drop old

Expand and contract migration flow

In step one, the new column or table is added without touching the old one. Both old and new code can run side by side. In step two, a background job backfills the new structure from the old in small batches to avoid lock pressure. In step three, the new code becomes the source of truth, with optional dual writes to keep the old path correct as a safety net. In step four, once you are confident nothing reads the old column, you drop it.

For schema changes on huge tables, online schema change tools (gh-ost, pt-online-schema-change, native online DDL) build the new table in the background and swap it in, avoiding long locks.

Trade-offs

The first trade-off is speed versus safety. A single ALTER TABLE is fast to write and dangerous to run. Expand and contract is slow to roll out but observable at every step. For anything user facing, prefer slow and observable.

The second is dual writes versus single source of truth. Dual writes give you a fallback if the new path is buggy, but they also double write load and introduce drift. Most teams dual write for a defined window (days, not months) and then commit to the new path.

The third is whether to backfill online or offline. Online backfills run alongside traffic and take longer but never require downtime. Offline backfills are faster but require a window where writes stop or are queued. Offline is sometimes worth it for tables where consistency during the backfill would be hard to guarantee.

Practical tips

Make every migration step independently reversible. If you cannot revert the new code while keeping the database changes, you have not actually decoupled the steps. The test is: can I deploy the previous version of the service right now without breaking anything.

Add monitoring before the migration, not during. You want dashboards for the old path, the new path, and the drift between them. A migration without observability is a guess.

Treat backfills as production workloads. Rate limit them, run them on read replicas where possible, and pause them if primary load spikes. A backfill that takes a week and never causes an alert is better than one that finishes in a day and pages everyone twice.

Wrap-up

Safe database migrations are mostly about preserving compatibility between old code, new code, and the database at every step. Expand and contract, online schema changes, and disciplined backfills are the tools. The hard part is patience: shipping a four step migration over two weeks instead of one risky deploy is a culture choice as much as a technical one, and it is almost always the right one.