When Upgrades Expose Hidden Problems: The Traefik Deployment Journey

The Discovery – An Ancient Version in a Modern Stack

It started with a routine check of our infrastructure: a review of our Traefik configuration.

A quick check revealed we were running Traefik 2.8 – a version released back in 2022, three years earlier. The latest 2.x release at the time was 2.11.x, so we were significantly behind.

This raised a red flag. In a modern DevOps environment, running software that’s 3 years old means missing:

  • Critical security patches
  • Important bug fixes
  • New features and improvements
  • Performance optimizations

We decided to upgrade to Traefik 2.11.30, but that’s when we uncovered something even more concerning.

The Stale Image Mystery

After updating our Dockerfile from traefik:2.8 to traefik:2.11.30 and triggering a deployment, we checked if Traefik had actually updated.
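
The change itself was a one-line bump of the base image in our custom Traefik Dockerfile. A minimal sketch of that file follows – the COPY line and config path are assumptions for illustration, not our exact contents:

FROM traefik:2.11.30
# Bake the static configuration into our custom image (path shown is illustrative)
COPY traefik.yml /etc/traefik/traefik.yml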

To our surprise, Traefik was still running the old image from days ago. All other services showed the latest commit hash, but Traefik was stuck in the past.

We checked the running Traefik version – still showing 2.8. Our upgrade didn’t take effect. The container wasn’t using the new image at all.

This wasn’t just a cosmetic issue. Our reverse proxy, the gateway to all our services, was:

  • Running outdated code (2.8 instead of 2.11.30)
  • Not updating with deployments
  • Missing configuration changes
  • Missing security patches and bug fixes

The rest of the infrastructure was modernizing, but Traefik was frozen in time – both in version and in deployment.

The Investigation

Our first hypothesis: maybe the deployment script just wasn’t pulling the latest image. But we checked the deployment logs and saw docker compose pull was running successfully. So why wasn’t Traefik updating?

We also noticed that even after updating the Dockerfile to use Traefik 2.11.30, the running container was still on 2.8. This suggested the new image wasn’t being built or deployed at all.

We dove into the CI/CD configuration and discovered the root cause: the pipeline had build jobs for every other service, but none for Traefik.

Without a build job, no new Traefik image was being created. The registry still had the old image from days ago, so even when docker compose pull ran, it was pulling the same stale image.

The First Attempt

We thought the solution was straightforward: add a build job for Traefik, similar to the others. We created a build job that:

  • Builds the Traefik Docker image
  • Pushes it to the registry
  • Ensures it completes before deployment runs
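
As a rough sketch, the job looked something like the following. This uses GitLab-CI-style syntax with hypothetical job names, build context, and registry variables – it is illustrative, not our actual pipeline:

stages: [build, deploy]

build-traefik:
  stage: build
  script:
    - docker build -t "$CI_REGISTRY_IMAGE/traefik:$CI_COMMIT_SHORT_SHA" ./traefik
    - docker push "$CI_REGISTRY_IMAGE/traefik:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  needs: ["build-traefik"]    # deployment waits until the Traefik image has been pushed
  script:
    - ./deploy.sh             # pulls images and recreates services on the host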

But this alone wasn’t enough. We needed to ensure the deployment script actually updated Traefik during deployment.

The Rollout Dilemma

Our other services used docker rollout for zero-downtime deployments. We tried adding:

sudo docker rollout -f production.yml -w 120 traefik

But then we realized: this won’t work for Traefik.

Traefik binds to host ports:

  • 0.0.0.0:80:80
  • 0.0.0.0:443:443
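
In Compose terms, the relevant part of the Traefik service looks roughly like this (the image reference and tag variable are placeholders, not our real values):

services:
  traefik:
    image: registry.example.com/traefik:${TAG}
    ports:
      # only one container on the host can hold each of these bindings at a time
      - "0.0.0.0:80:80"
      - "0.0.0.0:443:443"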

When docker rollout tries to do a rolling update, it:

  1. Scales up the new container
  2. Waits for it to be healthy
  3. Scales down the old container

But step 1 fails immediately because both containers would try to bind to the same host ports. You can’t have two processes listening on 0.0.0.0:80 simultaneously. The new container would fail to start with a “port already in use” error.

The Tradeoff

We faced a fundamental constraint: zero-downtime deployments for Traefik require a different architecture.

Options for true zero-downtime:

  1. Multiple Traefik instances with a load balancer – Run Traefik on different ports, put a load balancer (HAProxy, nginx, cloud LB) in front, update instances one at a time
  2. Traefik only on a container network – Don’t publish host ports at all; an external load balancer routes traffic to the Traefik instances

Both add complexity and infrastructure overhead. For most deployments, a few seconds of downtime during updates is acceptable.

The Solution

We settled on a pragmatic approach:

  1. Build the image – Add a build job in CI/CD to create and push the new Traefik image
  2. Force recreate during deployment:
    • Pull the latest images
    • Use docker compose up -d --force-recreate --no-deps traefik to force Traefik to use the new image

The --force-recreate flag is crucial. Docker Compose doesn’t always detect that the image behind a service has changed, especially when the compose file itself looks the same. Without this flag, Compose may conclude that the service definition hasn’t changed and skip the update.

The --no-deps flag ensures we only recreate Traefik, not its dependencies.
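
In the deployment script, the Traefik step boils down to two commands. The sketch below assumes the same production.yml compose file (and sudo) used with the rollout command earlier:

# Pull the freshly built image for the traefik service
sudo docker compose -f production.yml pull traefik
# Recreate only the Traefik container, even if Compose thinks nothing has changed
sudo docker compose -f production.yml up -d --force-recreate --no-deps traefik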

Why This Matters

Initially, we wanted to upgrade from Traefik 2.8 to 2.11.30 to get:

  • Security patches from 3 years of updates
  • Bug fixes and stability improvements
  • New features and performance optimizations
  • Better compatibility with modern infrastructure

But we discovered an even bigger problem: Traefik wasn’t updating at all, regardless of version.

At first glance, you might ask: “What’s the use if there’s still downtime?”

The answer: The problem wasn’t downtime. The problem was that Traefik wasn’t updating at all.

Without this fix:

  • Configuration changes in traefik.yml never deployed
  • Security patches never applied
  • Traefik was running potentially vulnerable code
  • Services drifted out of sync (different commit tags)
  • Bug fixes in our custom Traefik setup never went live

With this fix:

  • Traefik updates on every deployment
  • It stays synchronized with other services (same commit tag)
  • Configuration changes are deployed
  • Security patches are applied
  • Brief downtime (seconds) during updates is acceptable for most use cases

The Verification

After implementing the solution and deploying, we verified everything was working:

1. Checked the Docker image tag – confirmed it matched the latest deployment

2. Verified the Traefik version inside the container – confirmed: Traefik 2.11.30 (the version we wanted to upgrade to!)

3. Checked all services were in sync – all services now showed the same commit tag, confirming they were deployed together
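
Checks along these lines are enough to confirm all three points (the exact commands are illustrative; the service name and compose file follow the examples above):

# 1. Which image (and tag) is the running Traefik container using?
sudo docker compose -f production.yml images traefik
# 2. Which version does the binary inside the container report? (expect 2.11.30)
sudo docker compose -f production.yml exec traefik traefik version
# 3. Are all services on the same commit tag?
sudo docker compose -f production.yml images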

Success! We achieved both goals:

  1. Upgraded from Traefik 2.8 (3 years old) to 2.11.30
  2. Fixed the deployment so Traefik updates automatically with every deployment

The deployment logs confirmed the container recreation process was working correctly.

Lessons Learned

  1. Version audits matter – Discovering we were running 3-year-old software led us to find a bigger problem
  2. Missing build jobs can silently leave services on old images, even when you think you’re upgrading
  3. Upgrades require deployment – Updating the Dockerfile isn’t enough if the service isn’t being rebuilt and redeployed
  4. Host port bindings prevent true zero-downtime rollouts for single-instance services
  5. docker compose up -d doesn’t always pick up image changes on its own – use --force-recreate when needed
  6. Brief downtime is acceptable when it’s intentional and infrequent, especially for a security-critical upgrade
  7. Service synchronization matters – all services should be on the same commit tag
  8. One problem can hide another – The upgrade attempt revealed the underlying deployment issue

The Architecture Question

For future consideration: If zero-downtime for Traefik becomes critical, we would need to:

  1. Deploy multiple Traefik instances behind a load balancer
  2. Update instances one at a time
  3. Use health checks to verify readiness
  4. Route traffic away from instances being updated
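
A rough Compose-level sketch of that setup, with hypothetical service names, ports, and image reference:

services:
  # Two Traefik instances on non-conflicting host ports. An external load balancer
  # (HAProxy, nginx, or a cloud LB) listens on 80/443, health-checks both,
  # and drains traffic from whichever instance is being recreated.
  traefik-a:
    image: registry.example.com/traefik:${TAG}
    ports:
      - "8081:80"
      - "8443:443"
  traefik-b:
    image: registry.example.com/traefik:${TAG}
    ports:
      - "8082:80"
      - "8444:443"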

But for now, the simple force-recreate approach keeps Traefik updated and synchronized with the rest of our infrastructure.

Conclusion

What started as a simple version check (“We’re running Traefik 2.8 from 3 years ago – let’s upgrade!”) led us down a rabbit hole that revealed a fundamental deployment gap.

The upgrade to 2.11.30 was the catalyst that exposed the real issue: Traefik wasn’t being built or deployed at all. Fixing that allowed us to:

  1. Successfully upgrade from 2.8 to 2.11.30 (getting 3 years of security patches)
  2. Establish a working deployment pipeline for Traefik
  3. Ensure Traefik stays synchronized with the rest of our infrastructure

The solution wasn’t glamorous – no zero-downtime magic – but it solved both problems:

  • The immediate need (upgrade from 2.8)
  • The underlying issue (automated deployments)

Sometimes the best solution is the simple one that actually works. And sometimes, trying to upgrade old software reveals infrastructure problems you didn’t know you had.

Ajay Kumar Yegireddi is a DevSecOps Engineer and System Administrator with a passion for sharing real-world DevSecOps projects and tasks. His site, Mr. Cloud Book, provides hands-on tutorials and practical insights to help others master DevSecOps tools and workflows. The content is designed to bridge the gap between development, security, and operations, making complex concepts easy to understand for both beginners and professionals.
