Database
Dec 05, 2025 · 2 min read

Zero-Downtime Database Migration: Our Playbook

Migrating a 2TB database is easy if you can take 8 hours of downtime. Migrating it with zero downtime is an art form—here’s the expand/contract playbook.


Migrating a 2TB database is easy if you can take 8 hours of downtime. Migrating it with zero downtime (or <1 minute) is an art form.

Whether you are moving from On-Prem Oracle to RDS Postgres, or just upgrading Postgres versions, the pattern is the same. We call it the Expand-Contract Pattern.


Phase 1: The Setup (Replication)

You cannot "move" data instantly. You must replicate it.

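  1. Snapshot: Take a consistent dump of the Source DB and record the change-log position it corresponds to (the LSN in Postgres, the SCN in Oracle).
  2. Restore: Load it into the Target DB.
  3. CDC (Change Data Capture): Replay every change made after that recorded position, so nothing committed during the dump and restore is lost or applied twice.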
    • Tools: AWS DMS (Database Migration Service), Debezium, or native logical replication.
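A toy in-memory model of the three steps, offered as a sketch rather than a description of how DMS or Debezium work internally. The detail that makes it correct is that the snapshot records the log position it is consistent with, and CDC replays only the changes after that position. `Change`, `SourceDB`, and `catch_up` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Change:
    lsn: int                      # position in the Source's change log
    op: str                       # "upsert" or "delete"
    key: int
    value: Optional[str] = None


def apply_change(rows: Dict[int, str], change: Change) -> None:
    if change.op == "upsert":
        rows[change.key] = change.value
    elif change.op == "delete":
        rows.pop(change.key, None)
    else:
        raise ValueError(f"unknown op {change.op!r}")


@dataclass
class SourceDB:
    rows: Dict[int, str] = field(default_factory=dict)
    log: List[Change] = field(default_factory=list)

    def write(self, op: str, key: int, value: Optional[str] = None) -> None:
        change = Change(lsn=len(self.log) + 1, op=op, key=key, value=value)
        self.log.append(change)
        apply_change(self.rows, change)

    def snapshot(self) -> Tuple[Dict[int, str], int]:
        # Copy the rows AND record the log position they are consistent with.
        return dict(self.rows), len(self.log)


def catch_up(target_rows: Dict[int, str], log: List[Change], from_lsn: int) -> int:
    """CDC step: replay every change after from_lsn; return the last LSN applied."""
    last = from_lsn
    for change in log:
        if change.lsn > from_lsn:
            apply_change(target_rows, change)
            last = change.lsn
    return last


source = SourceDB()
source.write("upsert", 1, "alice")
source.write("upsert", 2, "bob")

target_rows, snapshot_lsn = source.snapshot()   # 1. Snapshot, 2. Restore

source.write("upsert", 3, "carol")              # writes keep arriving meanwhile
source.write("delete", 1)

applied = catch_up(target_rows, source.log, snapshot_lsn)  # 3. CDC
print(target_rows == source.rows, applied)       # True 4
```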

Checkpoint: The Target DB is now a "follower" of the Source DB.


Phase 2: Dual Writes (The Code Change)

This is the most critical phase. You need to update your application to know about both databases.

Step A: Update code to write to Source AND Target.

  • NOTE: This is dangerous. If the write to Target fails, should the request fail? Usually not: log the failure asynchronously and reconcile it later.
  • Better Approach: Let the CDC tool handle the replication. Application stays ignorant.
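If you do go the dual-write route, a minimal sketch of the "don't fail the request on a Target error" rule might look like this. `dual_write` and `repair_queue` are hypothetical: the list stands in for whatever asynchronous queue or reconciliation job you would actually use, and the two write functions stand in for your real database calls.

```python
import logging
from typing import Callable, List

log = logging.getLogger("dual_write")


def dual_write(write_source: Callable[[dict], None],
               write_target: Callable[[dict], None],
               record: dict,
               repair_queue: List[dict]) -> None:
    # Source is still the system of record: if this raises, the request fails.
    write_source(record)
    try:
        write_target(record)
    except Exception as exc:
        # A Target failure must not fail the user's request. Log it and queue the
        # record so a background job can reconcile it later.
        log.warning("target write failed for id=%r: %s", record.get("id"), exc)
        repair_queue.append(record)


source, target, repair_queue = [], [], []


def flaky_target(record: dict) -> None:
    raise ConnectionError("target unavailable")


dual_write(source.append, target.append, {"id": 1}, repair_queue)
dual_write(source.append, flaky_target, {"id": 2}, repair_queue)
print(len(source), len(target), repair_queue)  # 2 1 [{'id': 2}]
```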

Step B (Safer): Stay in the "Read-Source, Write-Source" state. The app works normally while the Target DB catches up via replication.


Phase 3: The "Dark Read" (Validation)

Before you switch over, you must verify the data integrity.

Enable Dark Reads (or Shadow Reads):

  1. App reads from Source (returns to user).
  2. App asynchronously reads from Target.
  3. Compare the results.
  4. If they match: Log "Success". If different: Log "Data Mismatch".

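Fix any discrepancies found here. Do not proceed until you have a 100% match rate. Expect some transient mismatches caused by replication lag; re-check those rows before treating them as real bugs.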
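The four dark-read steps can be sketched with a thread pool, assuming the Source and Target read functions return directly comparable values (in practice you would normalize types, ordering, and timestamps before comparing). `DarkReader` is a hypothetical helper, not a library class.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

log = logging.getLogger("dark_reads")


class DarkReader:
    """Serve reads from Source; shadow-read Target off the request path and compare."""

    def __init__(self, read_source, read_target, max_workers: int = 4):
        self._read_source = read_source
        self._read_target = read_target
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = Lock()
        self.matches = 0
        self.mismatches = 0

    def read(self, key):
        result = self._read_source(key)                # 1. user gets Source's answer
        self._pool.submit(self._shadow, key, result)   # 2. async read from Target
        return result

    def _shadow(self, key, expected) -> None:
        try:
            same = self._read_target(key) == expected  # 3. compare
        except Exception:
            log.exception("dark read failed for key %r", key)
            same = False
        with self._lock:                               # 4. record the outcome
            if same:
                self.matches += 1
            else:
                self.mismatches += 1
        if not same:
            log.warning("Data Mismatch for key %r", key)

    def match_rate(self) -> float:
        with self._lock:
            total = self.matches + self.mismatches
            return self.matches / total if total else 0.0

    def close(self) -> None:
        self._pool.shutdown(wait=True)                 # wait for pending shadow reads


source = {1: "alice", 2: "bob"}
target = {1: "alice", 2: "bobby"}                      # simulated replication bug
reader = DarkReader(source.get, target.get)
print(reader.read(1), reader.read(2))                  # alice bob
reader.close()
print(reader.matches, reader.mismatches, reader.match_rate())  # 1 1 0.5
```

Because the shadow read runs in the background, a slow or failing Target adds no latency to the user's request.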


Phase 4: The Cutover (The Pivot)

The scary moment.

Strategy 1: The Maintenance Window (Safety First)

  1. Put App in "Maintenance Mode" (Read-Only).
  2. Wait for CDC lag to hit 0 (should take seconds).
  3. Update App config to point to Target.
  4. Restart App.
  5. Downtime: ~1-2 minutes.
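The maintenance-window steps can be sketched as a small orchestration function. The three callables are hypothetical hooks into your own tooling (toggling read-only mode, reading CDC lag from your replication tool, redeploying the app config), and the 120-second timeout is an arbitrary example value. The `finally` block keeps the app from getting stuck in read-only mode if the lag never drains.

```python
import time
from typing import Callable


def maintenance_cutover(set_read_only: Callable[[bool], None],
                        cdc_lag_seconds: Callable[[], float],
                        point_app_at_target: Callable[[], None],
                        timeout_s: float = 120.0,
                        poll_s: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> float:
    """Run the maintenance-window cutover; return how long the app was read-only."""
    started = clock()
    set_read_only(True)                              # 1. maintenance mode
    try:
        while cdc_lag_seconds() > 0:                 # 2. wait for CDC lag to hit 0
            if clock() - started > timeout_s:
                raise TimeoutError("CDC lag never reached 0; staying on Source")
            sleep(poll_s)
        point_app_at_target()                        # 3-4. new config + restart
    finally:
        # Reopen writes: on Target if the switch happened, on Source if we bailed out.
        set_read_only(False)
    return clock() - started


lags = iter([3.0, 1.0, 0.0])                         # fake lag readings
events = []
maintenance_cutover(
    set_read_only=lambda on: events.append(f"read_only={on}"),
    cdc_lag_seconds=lambda: next(lags),
    point_app_at_target=lambda: events.append("switched"),
    sleep=lambda seconds: None,                      # don't actually wait in the demo
)
print(events)  # ['read_only=True', 'switched', 'read_only=False']
```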

Strategy 2: The Zero-Downtime Swap

  1. Sequence Handling: Offset the primary-key sequences on Target (e.g., start them at +1 billion) so IDs issued on Target never collide with IDs issued on Source if you have to fall back.
  2. Flip the Switch: Deploy a config change (Feature Flag) that switches writes to Target.
  3. Reverse Replication: Immediately start replicating from Target back to Source (in case you need to rollback).
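A toy model of why the sequence offset matters for fallback. Forward replication is left out to keep it short, and `Store`, `WriteRouter`, and the `writes_to_target` attribute (standing in for a real feature flag) are hypothetical. Target hands out IDs above the 1-billion offset, so after Target-issued rows are replicated back to Source and you roll back, Source's own sequence keeps going without colliding.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

SEQUENCE_OFFSET = 1_000_000_000  # the "+1 billion" offset for Target's sequences


@dataclass
class Store:
    name: str
    first_id: int
    rows: Dict[int, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self._ids = itertools.count(self.first_id)   # stand-in for a DB sequence

    def insert(self, value: str, row_id: Optional[int] = None) -> int:
        row_id = next(self._ids) if row_id is None else row_id
        if row_id in self.rows:
            raise ValueError(f"duplicate key {row_id} in {self.name}")
        self.rows[row_id] = value
        return row_id


class WriteRouter:
    def __init__(self, source: Store, target: Store) -> None:
        self.source, self.target = source, target
        self.writes_to_target = False                   # the feature flag
        self.reverse_log: List[Tuple[int, str]] = []    # Target -> Source change feed

    def insert(self, value: str) -> int:
        if not self.writes_to_target:
            return self.source.insert(value)
        row_id = self.target.insert(value)
        self.reverse_log.append((row_id, value))
        return row_id

    def replicate_back(self) -> None:
        # Reverse replication: copy Target-issued rows (with their IDs) to Source.
        for row_id, value in self.reverse_log:
            self.source.insert(value, row_id=row_id)
        self.reverse_log.clear()


source = Store("source", first_id=1)
target = Store("target", first_id=SEQUENCE_OFFSET + 1)
router = WriteRouter(source, target)

router.insert("a")
router.insert("b")                      # ids 1 and 2, written to Source
router.writes_to_target = True          # flip the switch
new_id = router.insert("c")             # id 1000000001, written to Target
router.replicate_back()                 # Source stays current for a rollback
router.writes_to_target = False         # roll back
print(new_id, router.insert("d"))       # 1000000001 3
```

Set Target's `first_id` to 1 instead and `replicate_back()` raises a duplicate-key error, which is exactly the collision the offset prevents.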

Common Pitfalls

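  1. Sequences/Auto-Increment: Replication copies rows but often not sequence state (Postgres native logical replication has historically not replicated sequence values). If you switch to Target and it tries to insert ID 100 when Source already had 100, the insert fails with a duplicate-key error. Always sync sequences as the last step before cutover.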
  2. Triggers: Triggers on the Target DB might fire during replication, causing double-execution of logic (e.g., sending two emails). Disable triggers on Target until cutover.
  3. Latency: The Target DB might be cold (empty cache). Expect a performance dip for the first 10 minutes.
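For a PostgreSQL Target, the first two pitfalls can be handled with a handful of statements, generated here by a small, hypothetical Python helper that assumes one auto-increment column per table. `setval`, `pg_get_serial_sequence`, and `ALTER TABLE ... DISABLE TRIGGER USER` are standard PostgreSQL. PostgreSQL's own logical replication already skips ordinary triggers on the subscriber by default, so the trigger step matters mainly for tools that write to Target through a normal session.

```python
from typing import Dict, List, Tuple


def sequence_sync_sql(table: str, column: str) -> str:
    # is_called = false makes the next nextval() return exactly MAX(column) + 1.
    return (
        f"SELECT setval(pg_get_serial_sequence('{table}', '{column}'), "
        f"COALESCE((SELECT MAX({column}) FROM {table}), 0) + 1, false);"
    )


def cutover_sql(tables: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (before_replication, at_cutover) statements to run on the Target DB.

    `tables` maps table name -> auto-increment column.
    """
    # Pitfall 2: USER leaves the internal FK constraint triggers enabled.
    before = [f"ALTER TABLE {t} DISABLE TRIGGER USER;" for t in tables]
    # Pitfall 1: sync sequences last (after CDC lag is 0), then re-enable triggers.
    at_cutover = [sequence_sync_sql(t, col) for t, col in tables.items()]
    at_cutover += [f"ALTER TABLE {t} ENABLE TRIGGER USER;" for t in tables]
    return before, at_cutover


before, at_cutover = cutover_sql({"orders": "id"})
print(before[0])      # ALTER TABLE orders DISABLE TRIGGER USER;
print(at_cutover[0])
```

Identifiers are interpolated as-is, so for anything beyond trusted, lower-case table names use your driver's quoting helper (for example `psycopg2.sql.Identifier`).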

The Golden Rule

"If you can't roll back, don't migrate." Always plan the reverse-replication path. If the new DB performs poorly, you must be able to switch back to the old one without data loss.

Tagged with

Database · Migration · High Availability

CloudOpsPro Team

Cloud Infrastructure & DevOps
