DevOps Story: How We Fixed 92 Blocked Migrations in One Day

Date: November 03, 2025

Journey: From Legacy System to Updated Version

Duration: 730 Days Without Updates

Prologue: The Wake-Up Call

It was early morning when the first alert came through. 3:47 AM. The notification buzzed on our phones. Then another. And another. Red warning lights flashing on the monitoring dashboard, each one telling us something was wrong.

The old server was showing signs of age. Memory leaks were slowing things down. Response times were getting longer. Hardware warnings appeared more frequently. We had to move everything to a new server. We had to restore from a backup.

The good news was we had backups. We had been backing up our database to S3 every single day for the past two years. Automated backups ran every night at 1 AM. We had hundreds of backup files sitting in our S3 bucket, ready to use.
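The dump files encoded a UTC timestamp from the 1 AM run. The post never shows the backup job itself, so this is only a sketch of the naming scheme in Python:

```python
from datetime import datetime, timezone

def backup_filename(now: datetime) -> str:
    """Name a nightly dump, e.g. backup_2023_10_29T01_00_02.sql.gz."""
    return now.strftime("backup_%Y_%m_%dT%H_%M_%S.sql.gz")

# The 1 AM cron run, a couple of seconds after the hour:
name = backup_filename(datetime(2023, 10, 29, 1, 0, 2, tzinfo=timezone.utc))
```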

The bad news was we had never updated the application code. Not once in two years.

The application was open source. The developers released new versions constantly. Version 1.0 came out in 2023. Then 1.1, 1.2, 1.3, all the way to version 1.12.3. Each version had bug fixes, security patches, new features, and database migrations.

But we never updated our code. We kept running the old version. The old server ran the same code for two years. Same application. Same database schema. Frozen in time while the world moved forward.

Now we were restoring one of those daily backups—from October 29, 2023, exactly two years ago—to a brand new server. But this time, we had downloaded the latest code. Version 1.12.3. The newest version with all the latest features.

We thought it would work. We thought the backup would be compatible with the new code.

We were wrong.

The Restoration

The new server was clean and ready. We set it up with the latest version of the application code. Version 1.12.3. All the newest features. All the latest security patches. Everything modern and up to date.

Then we downloaded the backup from S3. We had backups every day, so we chose one from two years ago to test the restoration process:

docker compose -f production.yml run --rm awscli download backup_2023_10_29T01_00_02.sql.gz

The command executed. The file downloaded from S3. Progress bars showed the transfer was successful. The backup file was there, ready to restore.

We restored it to the database. The restoration process completed without errors. The database was populated. All the old data was there. Users, documents, teams—everything from two years ago was restored.

We started the application. The containers came online. The services started. Everything looked good.

We checked the logs. No errors. Everything green. The system was running.

We thought we were done. We thought everything was working.

Then someone tried to log in.

The Login Failure

The first user tried to sign in. They entered their email and password. They clicked the login button. Nothing happened.

They tried again. Same result. The page just reloaded, showing an error message: “Authentication failed.”

Another user tried. Same thing. Couldn’t log in.

Another one. Same problem.

We tried our own test accounts. None of them worked. The login page loaded fine. The form was there. But when we clicked the login button, nothing happened. No user could sign in.

We checked the application logs. That’s when we saw the problem.

Error messages filled the screen. Red text scrolling down, one error after another:

Column "Session.ipAddress" does not exist
Error: Cannot write to column that doesn't exist

The application was trying to save the user’s IP address when they logged in. But the column didn’t exist in the database.

The new application code—version 1.12.3—expected the Session table to have an ipAddress column. This column was supposed to store the IP address of each user’s login session. It was added in a migration that ran months ago.

But our database was from two years ago. The old database didn’t have this column. The old version of the application didn’t track IP addresses.

Every time someone tried to log in, the application tried to write the IP address to the database. But the column didn’t exist, so the database rejected the query. The session never saved. The login failed.
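The failure mode is easy to reproduce in miniature, with Python's built-in SQLite standing in for the app's Postgres (the table and column names follow the post; the rest is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The two-year-old schema: a Session table with no ipAddress column.
conn.execute('CREATE TABLE "Session" (id INTEGER PRIMARY KEY, userId INTEGER)')

# Version 1.12.3 tries to record the client IP on every login:
try:
    conn.execute(
        'INSERT INTO "Session" (userId, ipAddress) VALUES (?, ?)',
        (1, "203.0.113.7"),
    )
    login_failed = False
except sqlite3.OperationalError:
    # "table Session has no column named ipAddress" -- the write is
    # rejected, the session never saves, and the login fails.
    login_failed = True
```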

“Why can’t they log in?” someone asked.

“The new code expects columns that don’t exist in the old database,” I replied. “The application code has been updated, but the database schema hasn’t. They don’t match anymore.”

The Blocking Migration

We needed to run the migrations to update the database schema. The application code had all the migration files included. We just needed to apply them to our database.

So we ran the migration command:

npx prisma migrate deploy

But it failed immediately. The error message was clear:

Migration "20240408142543_add_recipient_document_delete" failed.
Cannot proceed with other migrations until this one is resolved.

One of the migrations had failed. And because it failed, Prisma wouldn’t let us run any other migrations. It was a safety feature—Prisma prevented us from applying new migrations when an old one was in a failed state.

The failed migration was from April 2024. It was supposed to:

  1. Add a documentDeletedAt column to the Recipient table
  2. Update existing recipients to mark deleted documents
  3. Delete old PENDING documents that were soft-deleted

The migration had tried to delete documents, but foreign key constraints prevented it. The documents had relationships with other tables. You couldn’t delete them without breaking those relationships. So the migration failed halfway through.

And because it failed, every migration after it was blocked. We couldn’t apply any new migrations until we fixed this one.
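Prisma tracks every migration in the `_prisma_migrations` table (the same table we would later edit by hand). A migration that started but never finished or rolled back is exactly what halts `prisma migrate deploy`. A SQLite stand-in shows the stuck state:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A simplified _prisma_migrations table (the real one has more columns).
conn.execute("""
CREATE TABLE _prisma_migrations (
    migration_name TEXT,
    started_at     TEXT,
    finished_at    TEXT,
    rolled_back_at TEXT
)""")
# The April 2024 migration: started, never finished, never rolled back.
conn.execute(
    "INSERT INTO _prisma_migrations VALUES (?, ?, NULL, NULL)",
    ("20240408142543_add_recipient_document_delete", "2024-04-08 14:25:43"),
)

# Anything in this state blocks all later migrations:
blocked = conn.execute("""
    SELECT migration_name FROM _prisma_migrations
    WHERE started_at IS NOT NULL
      AND finished_at IS NULL
      AND rolled_back_at IS NULL
""").fetchall()
```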

We were stuck. The database couldn’t be updated. The application couldn’t run properly. Users couldn’t log in.

We had to fix the failed migration first.

The Email Exposure

While we were investigating the login problem, we noticed something else. Something that made us pause.

When users tried to access the application, their email addresses appeared in the browser’s URL bar. Clear as day. Visible to anyone who looked at the screen.

For example, if a user named john.doe@company.com tried to log in, the URL would show:

/auth?email=john.doe@company.com
This was a security problem. User emails shouldn’t be in URLs. They shouldn’t be visible in the browser address bar. They shouldn’t be logged in server access logs for everyone to see.

We traced the problem back. The old application code had an authentication flow that used query parameters to pass user emails. It was a quick way to handle authentication, but it wasn’t secure.

The new code had fixed this. The new version used secure tokens instead of emails. But when we restored the old database, something caused the application to fall back to the old authentication method. The old database structure triggered old code paths.

So even though we were running the new code, the old database made it behave like the old code. User emails were exposed in URLs.

Now we had two problems:

  1. Users couldn’t log in (missing database columns)
  2. When they could log in, their emails were exposed (old authentication flow)

We needed to fix both. But we couldn’t run migrations until we fixed the failed one first.

Fixing the Failed Migration

We had to resolve the failed migration before we could do anything else. We couldn’t just skip it. We had to mark it as completed or rolled back, then manually apply the safe parts.

First, we marked it as rolled back in Prisma’s tracking system:

npx prisma migrate resolve --rolled-back 20240408142543_add_recipient_document_delete

This told Prisma that we were handling this migration ourselves. It cleared the path for other migrations to run.

But we couldn’t just mark it and move on. The migration had tried to do important things. The documentDeletedAt column needed to be added. Recipients needed to be updated. We had to manually apply the safe parts of the migration.

We connected directly to the database. We ran SQL commands one by one, carefully:

-- Add the column
ALTER TABLE "Recipient" ADD COLUMN IF NOT EXISTS "documentDeletedAt" TIMESTAMP(3);

The column was added. The database accepted the change. Good.

Next, we updated the recipients to mark which documents were deleted:

UPDATE "Recipient"
SET "documentDeletedAt" = "Document"."deletedAt"
FROM "Document"
WHERE "Recipient"."documentId" = "Document"."id"
  AND "Document"."deletedAt" IS NOT NULL
  AND "Recipient"."documentDeletedAt" IS NULL;

This updated all recipients whose documents had been deleted. We skipped the DELETE operation that had caused the migration to fail. We didn’t need to delete old documents right now—we just needed the column and the data updates.
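Before running statements like these against production, it helps to rehearse them on throwaway data. The same backfill, written as a correlated subquery so it also runs on SQLite (which only gained `UPDATE ... FROM` in newer versions), with illustrative sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Document"  (id INTEGER PRIMARY KEY, deletedAt TEXT);
CREATE TABLE "Recipient" (id INTEGER PRIMARY KEY, documentId INTEGER,
                          documentDeletedAt TEXT);
INSERT INTO "Document"  VALUES (1, NULL), (2, '2024-01-15 09:30:00');
INSERT INTO "Recipient" VALUES (10, 1, NULL), (11, 2, NULL);
""")

# Copy Document.deletedAt onto each recipient of a deleted document;
# recipients of live documents are left untouched.
conn.execute("""
UPDATE "Recipient"
SET documentDeletedAt = (SELECT d.deletedAt FROM "Document" d
                         WHERE d.id = "Recipient".documentId)
WHERE documentDeletedAt IS NULL
  AND (SELECT d.deletedAt FROM "Document" d
       WHERE d.id = "Recipient".documentId) IS NOT NULL
""")

rows = conn.execute(
    'SELECT id, documentDeletedAt FROM "Recipient" ORDER BY id'
).fetchall()
```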

Finally, we updated Prisma’s migration tracking table to mark this migration as complete:

UPDATE _prisma_migrations
SET
  started_at = COALESCE(started_at, CURRENT_TIMESTAMP),
  finished_at = CURRENT_TIMESTAMP,
  rolled_back_at = NULL
WHERE migration_name = '20240408142543_add_recipient_document_delete';

The failed migration was resolved. The path forward was clear. Now we could run the other migrations.

Running All Migrations

With the failed migration resolved, we ran the migration deployment command again:

npx prisma migrate deploy

This time it worked. The migrations began to apply, one after another.

Each migration appeared on the screen:

  • Migration 1: Creating new table…
  • Migration 2: Adding new column…
  • Migration 3: Creating index…
  • Migration 4: Adding foreign key…

On and on they went. Migration after migration. Each one updating the database schema, bringing it closer to what the new application code expected.

Ninety-two migrations in total. Each one applying changes:

  • Creating 12 new tables
  • Adding 47 missing columns
  • Creating 23 new indexes
  • Establishing 8 foreign key relationships

One important migration was 20250217120859_add_session_fields. This one added the ipAddress column to the Session table—the column that was causing the login failures.

Another important one was 20250522054050_add_organisations. This one added the entire organization management system. New tables for organizations, teams, members, settings. Everything needed for the multi-organization features.

The migrations ran for what felt like hours. Each one completing successfully. The database schema transforming from the old version to the new one.

Finally, the last migration completed. All 92 migrations were applied. The database schema was now up to date. It matched what version 1.12.3 expected.

Testing the Login

With all migrations applied, we restarted the application. The services came back online. The containers restarted. Everything initialized.

We tried to log in again. We entered our test credentials. We clicked the login button. We waited.

It worked.

The login succeeded. The session was created. The IP address was saved. We were logged in.

We tested with other accounts. They all worked. Users could sign in again. The authentication system was functional.

But our celebration was short-lived. When users logged in and tried to access their documents, they found nothing.

The Missing Documents

Users could log in now. That was good. But when they looked for their documents, they found empty lists.

“I had twenty documents here,” one user said. “Where are they?”

Another user: “My signed documents are gone. They were here before.”

A third user: “I can’t see any of my work. Everything is missing.”

We checked the database. The documents were definitely there. We ran queries. The data existed. The records were intact. Users had documents. Documents had recipients. Everything was connected properly.

But the application couldn’t find them. The queries were returning empty results.

We examined the application code. The document listing queries filtered results using the documentDeletedAt column. The query looked something like:

SELECT * FROM "Document" d
JOIN "Recipient" r ON r."documentId" = d.id
WHERE r."documentDeletedAt" IS NULL
  AND d."deletedAt" IS NULL

This query only returned documents where documentDeletedAt was NULL—meaning the document wasn’t deleted. But when we checked the database, we found that for many recipients, the documentDeletedAt column wasn’t set to NULL. It wasn’t set to anything. It was just empty.

When we added the column during migration, we only updated recipients whose documents were deleted. We set their documentDeletedAt to the document’s deletedAt timestamp. But we didn’t explicitly set it to NULL for recipients whose documents were still active.

In principle, a column with no assigned value simply reads as NULL. But after hand-editing the schema mid-recovery, we didn't want to rely on assumptions about how the application and its ORM would treat those rows.

We needed to explicitly set the column to NULL for all active recipients. We ran another update query:

UPDATE "Recipient"
SET "documentDeletedAt" = NULL
WHERE "documentDeletedAt" IS NULL
AND EXISTS (
  SELECT 1 FROM "Document" d
  WHERE d.id = "Recipient"."documentId"
  AND d."deletedAt" IS NULL);

This ensured that all recipients with active documents had their documentDeletedAt set to NULL. The queries would now work properly.
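The visibility rule is easy to sanity-check the same way: a document appears only when both its own deletedAt and the recipient's documentDeletedAt are NULL. Here SQLite stands in for Postgres and the sample data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Document"  (id INTEGER PRIMARY KEY, title TEXT, deletedAt TEXT);
CREATE TABLE "Recipient" (id INTEGER PRIMARY KEY, documentId INTEGER,
                          documentDeletedAt TEXT);
INSERT INTO "Document"  VALUES (1, 'Active NDA', NULL),
                               (2, 'Old contract', '2024-01-15');
INSERT INTO "Recipient" VALUES (10, 1, NULL), (11, 2, '2024-01-15');
""")

# The application's listing filter: both deletion markers must be NULL.
visible = conn.execute("""
    SELECT d.title FROM "Document" d
    JOIN "Recipient" r ON r.documentId = d.id
    WHERE r.documentDeletedAt IS NULL
      AND d.deletedAt IS NULL
""").fetchall()
```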

We tested again. The documents began to appear. Users could see their work again. The document lists were complete.

Fixing the Email Exposure

With login working and documents visible, we turned our attention back to the email exposure problem. User emails shouldn’t be in URLs. It was a security flaw.

The problem was in the authentication routing code. The old code path used query parameters:

// Old code (insecure):
redirect(`/auth?email=${user.email}`);

The new code should use secure tokens instead:

// New code (secure):
redirect(`/auth?token=${sessionToken}`);

But the old database structure was causing the application to fall back to the old method. We needed to fix the authentication routing to always use tokens, never emails.

We updated the code. We changed the authentication flow. We made sure emails were never passed in URLs. We switched everything to use secure session tokens.
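In outline, the fix looks like this, sketched in Python for illustration (the application itself is JavaScript; the function name and token length here are ours, not the application's):

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs

def auth_redirect_url() -> str:
    """Redirect with an opaque random token; never the email."""
    session_token = secrets.token_urlsafe(32)  # ~43 random URL-safe chars
    return "/auth?" + urlencode({"token": session_token})

url = auth_redirect_url()
params = parse_qs(urlparse(url).query)
# The address bar (and server access logs) now carry no PII:
assert "email" not in params
assert "token" in params
```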

We deployed the fix. We tested the login flow. The email disappeared from the URL. Instead of ?email=john.doe@company.com, it now showed ?token=abc123xyz..., an opaque token that exposed nothing about the user's identity.

The security flaw was fixed. User emails were no longer exposed.

The Certificate Problem

While we were working through all these issues, we discovered another problem. One that had been hiding in the code the whole time.

The Dockerfile had a syntax error. The certificate generation code had an extra RUN statement with wrong indentation. It looked like this:

# Broken code:
RUN mkdir -p /app/certs

RUN cd /app/certs && \
    #

“Regular updates aren’t optional—they’re essential. A system that stays frozen in time is a system that will eventually break.”
