Database migrations: they’re the silent assassins of many a production outage. We’ve all been there: a seemingly innocuous change grinds your application to a halt. At MisuJob, where we process 1M+ job listings to power our AI-powered job matching across Europe, smooth database operations are critical. That’s why we’ve honed our migration strategies to minimize downtime and prevent locking issues. Here’s what we’ve learned.
Understanding the Locking Problem
Database migrations, especially those involving schema changes, can acquire locks on tables. These locks prevent other operations from reading or writing to the table, leading to application slowdowns or even complete outages. The duration of these locks depends on the size of the table, the complexity of the migration, and the database system itself. A poorly planned migration on a large table can easily lock your database for minutes, or even hours.
For example, on many databases (including PostgreSQL before version 11), adding a column with a NOT NULL constraint and a default value to a large table requires rewriting the entire table. During this rewrite, an exclusive table lock is held, blocking all other operations. Similarly, a plain CREATE INDEX blocks writes to the table for the duration of the build, which can be a significant amount of time if the table is actively being written to.
Consider the scenario where we need to add a new column is_active to our job_listings table, used extensively by our platform across Europe. A naive approach might look like this:
ALTER TABLE job_listings ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
While this seems simple, the NOT NULL DEFAULT TRUE part can be the killer. On PostgreSQL versions before 11 (and on some other databases), the server rewrites every row to materialize the default, holding an exclusive lock for the duration. Even on newer versions, where a constant default is a metadata-only change, any ALTER TABLE briefly takes an ACCESS EXCLUSIVE lock; if it queues behind a long-running query, every other query queues behind it.
Strategies for Minimizing Lock Duration
Fortunately, there are several strategies we use to minimize the duration of locks during migrations and keep our services running smoothly for job seekers across the DACH region, UK, Nordics, and beyond.
1. Online Schema Changes
Many modern database systems offer online schema change capabilities, allowing you to perform certain schema modifications with minimal downtime. These techniques typically involve creating a shadow table, applying the changes to the shadow table, and then atomically renaming the shadow table to replace the original table. This avoids long-lasting exclusive locks.
For example, in PostgreSQL, you can use tools like pg_repack or pg_online_schema_change to perform online schema changes. These tools handle the complexity of creating shadow tables, copying data, and switching tables with minimal locking.
# Example using pg_repack
pg_repack --dbname=misu_jobs --table=job_listings
This command will repack the job_listings table, minimizing downtime and reducing table bloat. Remember to test these tools thoroughly in a staging environment before running them in production.
2. Decompose Large Migrations
Instead of performing large, complex migrations in a single step, break them down into smaller, more manageable steps. This reduces the duration of each individual lock and allows other operations to interleave with the migration process.
Let’s revisit the is_active column example. Instead of adding the column with a NOT NULL DEFAULT TRUE constraint in one go, we can break it down into three steps:
1. Add the column without the NOT NULL constraint or default value.
2. Update the rows in batches to set the is_active value.
3. Add the NOT NULL constraint.
Here’s how this looks in SQL:
-- Step 1: Add the column without constraints
ALTER TABLE job_listings ADD COLUMN is_active BOOLEAN;
-- Step 2: Update rows in batches. PostgreSQL's UPDATE has no LIMIT clause,
-- so target each batch through a subquery:
UPDATE job_listings
SET is_active = TRUE
WHERE id IN (
  SELECT id FROM job_listings WHERE is_active IS NULL LIMIT 10000
);
-- Repeat this until no rows are updated. Consider driving it from a background job or script.
-- Step 3: Add the NOT NULL constraint
ALTER TABLE job_listings ALTER COLUMN is_active SET NOT NULL;
By updating the rows in batches, we avoid locking the entire table for a prolonged period: each UPDATE only locks the rows it touches, and committing between batches releases those locks. The batch size (10000 in this example) should be tuned based on your database server’s resources and workload.
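To make the batching concrete, here is a runnable sketch of the loop that drives Step 2. It uses an in-memory SQLite database as a stand-in for the real job_listings table so the example is self-contained; against PostgreSQL you would run the same loop through psycopg2 with the subquery-based UPDATE shown above:

```python
import sqlite3

# Stand-in: an in-memory SQLite database plays the role of job_listings
# so the batching loop is runnable end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_listings (id INTEGER PRIMARY KEY, is_active BOOLEAN)")
conn.executemany(
    "INSERT INTO job_listings (id, is_active) VALUES (?, NULL)",
    [(i,) for i in range(1, 25_001)],
)
conn.commit()

BATCH_SIZE = 10_000

def backfill_is_active(conn) -> int:
    """Set is_active = TRUE in batches; returns the number of non-empty batches."""
    batches = 0
    while True:
        cur = conn.execute(
            """
            UPDATE job_listings SET is_active = 1
            WHERE id IN (
                SELECT id FROM job_listings WHERE is_active IS NULL LIMIT ?
            )
            """,
            (BATCH_SIZE,),
        )
        conn.commit()  # commit each batch so row locks are released promptly
        if cur.rowcount == 0:
            break
        batches += 1
    return batches

print(backfill_is_active(conn))  # 25,000 rows in batches of 10,000 -> 3 batches
```

Committing after each batch keeps every transaction short, which is the whole point: long transactions hold their locks until commit and, on PostgreSQL, also prevent VACUUM from reclaiming dead rows.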
3. Use Concurrent Index Builds
Creating indexes can be a major source of locking: a plain CREATE INDEX blocks all writes to the table for the duration of the build. Fortunately, many database systems support concurrent index builds, which allow both reads and writes to continue. In PostgreSQL, you can use the CREATE INDEX CONCURRENTLY command.
CREATE INDEX CONCURRENTLY idx_job_title ON job_listings (job_title);
This command builds the index in the background without blocking reads or writes. However, concurrent builds take longer than regular index builds, require more total work, and you can only run one CREATE INDEX CONCURRENTLY at a time on a given table. If the build fails or is cancelled, it leaves behind an INVALID index that must be dropped before retrying.
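When a concurrent build does fail, the leftover INVALID index still adds write overhead until it is dropped. Here is a minimal sketch of that cleanup decision, assuming you have already queried index names and their indisvalid flags from PostgreSQL's pg_index catalog (the catalog query itself is omitted, and the index name below is hypothetical):

```python
def invalid_index_drops(indexes: list[tuple[str, bool]]) -> list[str]:
    """Given (index_name, is_valid) pairs from pg_index, return DROP
    statements for indexes left INVALID by a failed concurrent build."""
    return [
        # DROP INDEX CONCURRENTLY avoids an exclusive lock during the drop
        f"DROP INDEX CONCURRENTLY IF EXISTS {name};"
        for name, is_valid in indexes
        if not is_valid
    ]

drops = invalid_index_drops([
    ("idx_job_title", True),
    ("idx_job_location", False),  # hypothetical index left INVALID by a failed build
])
print(drops)
```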
4. Optimize Queries Before Migrating
Sometimes, the need for a migration arises from poorly performing queries. Before adding a new index or modifying a table schema, it’s crucial to analyze and optimize your queries. Often, a well-tuned query can eliminate the need for a costly migration altogether.
Use your database’s query analyzer (e.g., EXPLAIN in PostgreSQL and MySQL, EXPLAIN PLAN in Oracle) to identify performance bottlenecks. Look for full table scans, missing indexes, and inefficient join operations.
For example, consider this query:
SELECT * FROM job_listings WHERE job_title LIKE '%engineer%' AND location = 'Berlin';
If you see a full table scan on the job_listings table, adding an index might seem like the obvious solution. But a leading-wildcard pattern like '%engineer%' cannot use a standard B-tree index at all, so indexing job_title would not help this query. Rewriting it to use full-text search (or a trigram index such as PostgreSQL's pg_trgm), combined with a plain index on location, will usually yield far better results than reaching for a new B-tree index.
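The effect is easy to demonstrate. The sketch below uses an in-memory SQLite database purely as a stand-in (PostgreSQL's EXPLAIN output differs, but the planner behavior is analogous): even with job_title indexed, the leading-wildcard LIKE forces a full scan, while an equality predicate on an indexed column is served by an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_listings (id INTEGER PRIMARY KEY, job_title TEXT, location TEXT)")
conn.execute("CREATE INDEX idx_job_title ON job_listings (job_title)")
conn.execute("CREATE INDEX idx_location ON job_listings (location)")

def plan(sql: str) -> str:
    """Return the query plan as one string; the last column of each
    EXPLAIN QUERY PLAN row is the human-readable detail."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

# Leading wildcard: the planner cannot use idx_job_title, so it scans.
like_plan = plan("SELECT * FROM job_listings WHERE job_title LIKE '%engineer%'")
# Equality on an indexed column: the planner searches idx_location instead.
eq_plan = plan("SELECT * FROM job_listings WHERE location = 'Berlin'")
print(like_plan)
print(eq_plan)
```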
5. Data Partitioning
For very large tables, consider data partitioning. Partitioning divides a table into smaller, more manageable pieces, allowing you to perform migrations on individual partitions without affecting the entire table.
For example, you might partition the job_listings table by date, region, or industry. This allows you to add a new column to a single partition at a time, minimizing the impact on other partitions.
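With PostgreSQL's declarative partitioning, monthly range partitions are plain DDL. Here is a hedged sketch of a generator for that DDL; the posted_at partition key is an assumed column name, and the parent table would be declared once with PARTITION BY RANGE (posted_at):

```python
from datetime import date

def monthly_partition_ddl(table: str, year: int, month: int) -> str:
    """Generate DDL for one monthly range partition, using the half-open
    interval [first of month, first of next month) PostgreSQL expects."""
    start = date(year, month, 1)
    nxt = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (
        f"CREATE TABLE {table}_{start:%Y_%m} PARTITION OF {table} "
        f"FOR VALUES FROM ('{start}') TO ('{nxt}');"
    )

# Parent, declared once (posted_at is an assumed column):
#   CREATE TABLE job_listings (...) PARTITION BY RANGE (posted_at);
print(monthly_partition_ddl("job_listings", 2024, 1))
# CREATE TABLE job_listings_2024_01 PARTITION OF job_listings FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```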
6. Background Migrations
When possible, move data migration logic out of the main request/response cycle and into background jobs. Tools like Celery (Python) or Sidekiq (Ruby) can be used to process large datasets in the background, minimizing the impact on user experience. This works well when you need to update existing data to conform to the new schema.
For example, if you’re changing the data type of a column, you can use a background job to iterate through the existing rows and convert the data to the new type. This can be done gradually, without locking the table for an extended period.
7. Canary Deployments
Before rolling out a migration to your entire production environment, deploy it to a small subset of your users (a canary deployment). This allows you to monitor the impact of the migration on performance and identify any potential issues before they affect a large number of users. This is especially crucial for schema changes impacting frequently accessed tables.
Monitoring and Observability
Regardless of the migration strategy you choose, it’s crucial to monitor the performance of your database during and after the migration. Monitor key metrics such as CPU utilization, memory usage, disk I/O, and lock wait times. Use database monitoring tools to identify any performance bottlenecks or locking issues.
We also use automated alerts to notify us of any anomalies, such as a sudden increase in lock wait times or a decrease in query performance. This allows us to quickly identify and address any issues before they escalate into a major outage.
Real-World Examples & Data
Let’s consider a real-world scenario: We needed to migrate a column salary from TEXT to NUMERIC in our job_listings table. The table contains millions of rows and is heavily used across Europe, impacting salary insights for job seekers in France, Spain, the Netherlands, and elsewhere. A naive ALTER TABLE statement would have locked the table for an unacceptably long time.
We opted for a multi-stage approach:
1. Add a new column: salary_numeric NUMERIC.
2. Background migration: a background process read batches of rows, parsed the text salary column, converted it to a numeric value, and wrote it to salary_numeric.
3. Update application code: we updated our application to read from salary_numeric instead of salary.
4. Drop the old column: ALTER TABLE job_listings DROP COLUMN salary;
5. Rename the new column: ALTER TABLE job_listings RENAME COLUMN salary_numeric TO salary;
This approach allowed us to migrate the data without any significant downtime.
Here’s a simplified example of the background job in Python using Celery:
from celery import Celery
import psycopg2  # assuming PostgreSQL

celery_app = Celery('migrate_salary', broker='redis://localhost:6379/0')

@celery_app.task
def migrate_salary_batch(batch_id, start_id, end_id):
    conn = psycopg2.connect(database="misu_jobs", user="your_user",
                            password="your_password", host="localhost", port="5432")
    cur = conn.cursor()
    try:
        # Parameterized query: never interpolate values into SQL with f-strings
        cur.execute("""
            UPDATE job_listings
            SET salary_numeric = CASE
                WHEN salary IS NOT NULL AND salary != ''
                    THEN CAST(TRIM(salary, '€') AS NUMERIC)
                ELSE NULL
            END
            WHERE id BETWEEN %s AND %s;
        """, (start_id, end_id))
        conn.commit()
        print(f"Batch {batch_id} completed successfully.")
    except Exception as e:
        conn.rollback()
        print(f"Error processing batch {batch_id}: {e}")
    finally:
        cur.close()
        conn.close()

# Example of how to enqueue tasks in batches
batch_size = 1000
total_records = 100000  # replace with the actual row count
num_batches = -(-total_records // batch_size)  # ceiling division, so a final partial batch isn't dropped
for i in range(num_batches):
    start_id = i * batch_size + 1
    end_id = (i + 1) * batch_size
    migrate_salary_batch.delay(i + 1, start_id, end_id)
This approach ensures that even if parsing fails for some salary values, it doesn’t halt the entire migration process. Each batch is independent, and errors are logged for later investigation.
Impact on Salary Data Integrity
While these migrations are underway, it’s essential to be transparent about potential temporary data inconsistencies. For example, during the salary column migration from TEXT to NUMERIC, we communicated to our users that salary ranges might be temporarily unavailable or inaccurate for a small subset of job listings.
This transparency is crucial for maintaining user trust. We also implemented robust data validation checks to ensure that the converted numeric salary values are within reasonable ranges and consistent with industry standards.
To illustrate the importance of accurate salary data, consider the following table showcasing average software engineer salaries across various European cities:
| City | Average Salary (€/year) |
|---|---|
| Berlin | 75,000 |
| Amsterdam | 82,000 |
| London | 95,000 |
| Paris | 70,000 |
| Stockholm | 88,000 |
Inaccurate salary data, even temporarily, could mislead job seekers and impact their career decisions. By prioritizing data integrity and transparency during migrations, we ensure that our users have access to reliable and up-to-date information.
Key Takeaways
- Plan Ahead: Carefully analyze the impact of your migrations on production performance before execution.
- Decompose: Break down large migrations into smaller, manageable steps to minimize lock duration.
- Use Online Schema Changes: Leverage online schema change capabilities to perform schema modifications with minimal downtime.
- Monitor and Alert: Implement robust monitoring and alerting to detect and respond to performance issues.
- Communicate: Be transparent with your users about potential temporary data inconsistencies during migrations.
- Test Thoroughly: Always test migrations in a staging environment before deploying them to production.
- Background Jobs: Offload data updates to background jobs whenever possible.
By following these strategies, we’ve significantly reduced the risk of production outages caused by database migrations, ensuring that MisuJob continues to provide a seamless experience for job seekers across Europe.

