When Upgrades Expose Hidden Problems: The Traefik Deployment Journey

The Discovery – An Ancient Version in a Modern Stack

It started with a routine check of our infrastructure: a review of our Traefik configuration.

A quick check revealed we were running Traefik 2.8 – a version released back in 2022, three years earlier. The latest 2.x release at the time was 2.11.x, so we were significantly behind.

This raised a red flag. In a modern DevOps environment, running software that’s 3 years old means missing:

  • Critical security patches
  • Important bug fixes
  • New features and improvements
  • Performance optimizations

We decided to upgrade to Traefik 2.11.30, but that’s when we uncovered something even more concerning.

The Stale Image Mystery

After updating our Dockerfile from traefik:2.8 to traefik:2.11.30 and triggering a deployment, we checked if Traefik had actually updated.
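
The change itself was a one-line bump of the base image in our custom Traefik Dockerfile. A minimal sketch of that file follows – the COPY line and config path are assumptions for illustration, not our exact contents:

FROM traefik:2.11.30
# Bake the static configuration into our custom image (path shown is illustrative)
COPY traefik.yml /etc/traefik/traefik.yml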

To our surprise, Traefik was still running the old image from days ago. All other services showed the latest commit hash, but Traefik was stuck in the past.

We checked the running Traefik version – still showing 2.8. Our upgrade didn’t take effect. The container wasn’t using the new image at all.

This wasn’t just a cosmetic issue. Our reverse proxy, the gateway to all our services, was:

  • Running outdated code (2.8 instead of 2.11.30)
  • Not updating with deployments
  • Missing configuration changes
  • Missing security patches and bug fixes

The rest of the infrastructure was modernizing, but Traefik was frozen in time – both in version and in deployment.

The Investigation

Our first hypothesis: maybe the deployment script just wasn’t pulling the latest image. But we checked the deployment logs and saw docker compose pull was running successfully. So why wasn’t Traefik updating?

We also noticed that even after updating the Dockerfile to use Traefik 2.11.30, the running container was still on 2.8. This suggested the new image wasn’t being built or deployed at all.

We dove into the CI/CD configuration and discovered the root cause: the pipeline had build jobs for every other service, but none for Traefik.

Without a build job, no new Traefik image was being created. The registry still had the old image from days ago, so even when docker compose pull ran, it was pulling the same stale image.

The First Attempt

We thought the solution was straightforward: add a build job for Traefik, similar to the others. We created a build job that:

  • Builds the Traefik Docker image
  • Pushes it to the registry
  • Ensures it completes before deployment runs
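
As a rough sketch, the job looked something like the following. This uses GitLab-CI-style syntax with hypothetical job names, build context, and registry variables – it is illustrative, not our actual pipeline:

stages: [build, deploy]

build-traefik:
  stage: build
  script:
    - docker build -t "$CI_REGISTRY_IMAGE/traefik:$CI_COMMIT_SHORT_SHA" ./traefik
    - docker push "$CI_REGISTRY_IMAGE/traefik:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  needs: ["build-traefik"]    # deployment waits until the Traefik image has been pushed
  script:
    - ./deploy.sh             # pulls images and recreates services on the host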

But this alone wasn’t enough. We needed to ensure the deployment script actually updated Traefik during deployment.

The Rollout Dilemma

Our other services used docker rollout for zero-downtime deployments. We tried adding:

sudo docker rollout -f production.yml -w 120 traefik

But then we realized: this won’t work for Traefik.

Traefik binds to host ports:

  • 0.0.0.0:80:80
  • 0.0.0.0:443:443
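
In Compose terms, the relevant part of the Traefik service looks roughly like this (the image reference and tag variable are placeholders, not our real values):

services:
  traefik:
    image: registry.example.com/traefik:${TAG}
    ports:
      # only one container on the host can hold each of these bindings at a time
      - "0.0.0.0:80:80"
      - "0.0.0.0:443:443"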

When docker rollout tries to do a rolling update, it:

  1. Scales up the new container
  2. Waits for it to be healthy
  3. Scales down the old container

But step 1 fails immediately because both containers would try to bind to the same host ports. You can’t have two processes listening on 0.0.0.0:80 simultaneously. The new container would fail to start with a “port already in use” error.

The Tradeoff

We faced a fundamental constraint: zero-downtime deployments for Traefik require a different architecture.

Options for true zero-downtime:

  1. Multiple Traefik instances with a load balancer – Run Traefik on different ports, put a load balancer (HAProxy, nginx, cloud LB) in front, update instances one at a time
  2. Traefik only on a container network – Don’t publish host ports at all; an external load balancer routes traffic to the Traefik instances

Both add complexity and infrastructure overhead. For most deployments, a few seconds of downtime during updates is acceptable.

The Solution

We settled on a pragmatic approach:

  1. Build the image – Add a build job in CI/CD to create and push the new Traefik image
  2. Force recreate during deployment:
    • Pull the latest images
    • Use docker compose up -d --force-recreate --no-deps traefik to force Traefik to use the new image

The --force-recreate flag is crucial. Docker Compose doesn’t always detect that the image behind a service has changed, especially when the compose file itself looks the same. Without this flag, Compose may conclude that the service definition hasn’t changed and skip the update.

The --no-deps flag ensures we only recreate Traefik, not its dependencies.
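
In the deployment script, the Traefik step boils down to two commands. The sketch below assumes the same production.yml compose file (and sudo) used with the rollout command earlier:

# Pull the freshly built image for the traefik service
sudo docker compose -f production.yml pull traefik
# Recreate only the Traefik container, even if Compose thinks nothing has changed
sudo docker compose -f production.yml up -d --force-recreate --no-deps traefik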

Why This Matters

Initially, we wanted to upgrade from Traefik 2.8 to 2.11.30 to get:

  • Security patches from 3 years of updates
  • Bug fixes and stability improvements
  • New features and performance optimizations
  • Better compatibility with modern infrastructure

But we discovered an even bigger problem: Traefik wasn’t updating at all, regardless of version.

At first glance, you might ask: “What’s the use if there’s still downtime?”

The answer: The problem wasn’t downtime. The problem was that Traefik wasn’t updating at all.

Without this fix:

  • Configuration changes in traefik.yml never deployed
  • Security patches never applied
  • Traefik was running potentially vulnerable code
  • Services drifted out of sync (different commit tags)
  • Bug fixes in our custom Traefik setup never went live

With this fix:

  • Traefik updates on every deployment
  • It stays synchronized with other services (same commit tag)
  • Configuration changes are deployed
  • Security patches are applied
  • Brief downtime (seconds) during updates is acceptable for most use cases

The Verification

After implementing the solution and deploying, we verified everything was working:

1. Checked the Docker image tag – confirmed it matched the latest deployment

2. Verified the Traefik version inside the container – confirmed: Traefik 2.11.30 (the version we wanted to upgrade to!)

3. Checked all services were in sync – all services now showed the same commit tag, confirming they were deployed together
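
Checks along these lines are enough to confirm all three points (the exact commands are illustrative; the service name and compose file follow the examples above):

# 1. Which image (and tag) is the running Traefik container using?
sudo docker compose -f production.yml images traefik
# 2. Which version does the binary inside the container report? (expect 2.11.30)
sudo docker compose -f production.yml exec traefik traefik version
# 3. Are all services on the same commit tag?
sudo docker compose -f production.yml images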

Success! We achieved both goals:

  1. Upgraded from Traefik 2.8 (3 years old) to 2.11.30
  2. Fixed the deployment so Traefik updates automatically with every deployment

The deployment logs confirmed the container recreation process was working correctly.

Lessons Learned

  1. Version audits matter – Discovering we were running 3-year-old software led us to find a bigger problem
  2. Missing build jobs can silently leave services on old images, even when you think you’re upgrading
  3. Upgrades require deployment – Updating the Dockerfile isn’t enough if the service isn’t being rebuilt and redeployed
  4. Host port bindings prevent true zero-downtime rollouts for single-instance services
  5. docker compose up -d doesn’t always pick up image changes on its own – use --force-recreate when needed
  6. Brief downtime is acceptable when it’s intentional and infrequent, especially for a security-critical upgrade
  7. Service synchronization matters – all services should be on the same commit tag
  8. One problem can hide another – The upgrade attempt revealed the underlying deployment issue

The Architecture Question

For future consideration: If zero-downtime for Traefik becomes critical, we would need to:

  1. Deploy multiple Traefik instances behind a load balancer
  2. Update instances one at a time
  3. Use health checks to verify readiness
  4. Route traffic away from instances being updated
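
A rough Compose-level sketch of that setup, with hypothetical service names, ports, and image reference:

services:
  # Two Traefik instances on non-conflicting host ports. An external load balancer
  # (HAProxy, nginx, or a cloud LB) listens on 80/443, health-checks both,
  # and drains traffic from whichever instance is being recreated.
  traefik-a:
    image: registry.example.com/traefik:${TAG}
    ports:
      - "8081:80"
      - "8443:443"
  traefik-b:
    image: registry.example.com/traefik:${TAG}
    ports:
      - "8082:80"
      - "8444:443"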

But for now, the simple force-recreate approach keeps Traefik updated and synchronized with the rest of our infrastructure.

Conclusion

What started as a simple version check (“We’re running Traefik 2.8 from 3 years ago – let’s upgrade!”) led us down a rabbit hole that revealed a fundamental deployment gap.

The upgrade to 2.11.30 was the catalyst that exposed the real issue: Traefik wasn’t being built or deployed at all. Fixing that allowed us to:

  1. Successfully upgrade from 2.8 to 2.11.30 (getting 3 years of security patches)
  2. Establish a working deployment pipeline for Traefik
  3. Ensure Traefik stays synchronized with the rest of our infrastructure

The solution wasn’t glamorous – no zero-downtime magic – but it solved both problems:

  • The immediate need (upgrade from 2.8)
  • The underlying issue (automated deployments)

Sometimes the best solution is the simple one that actually works. And sometimes, trying to upgrade old software reveals infrastructure problems you didn’t know you had.

Ajay Kumar Yegireddi is a DevSecOps Engineer and System Administrator with a passion for sharing real-world DevSecOps projects and tasks. His site, Mr. Cloud Book, provides hands-on tutorials and practical insights to help others master DevSecOps tools and workflows. The content is designed to bridge the gap between development, security, and operations, making complex concepts easy to understand for both beginners and professionals.
