fix: keep canary serving until roll completes to avoid downtime (#4745) #4948

Draft
kylemclaren wants to merge 1 commit into master from fix/issue-4745

kylemclaren commented Jun 27, 2026

Contributor

Fixes #4745.

The canary deploy strategy created a throwaway canary machine, health-checked it, then destroyed it before rolling the real machines. For single-machine apps this left a window with zero serving instances, which caused downtime.

The canary is now registered in DNS (so it serves traffic) and is destroyed only after the rolling update completes, keeping n+1 healthy instances throughout the roll.
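
To make the availability argument concrete, here is a minimal sketch of the serving count during a roll. It is a toy model, not flyctl code: `minServing` is a hypothetical helper, and it assumes machines are rolled one at a time and that a machine being updated in place serves no traffic while it is updated.

```go
package main

import "fmt"

// minServing returns the lowest number of serving instances seen while
// rolling n machines one at a time. canaryServes says whether a registered
// canary stays up for the whole roll. (Hypothetical model, not flyctl code.)
func minServing(n int, canaryServes bool) int {
	extra := 0
	if canaryServes {
		extra = 1
	}
	lowest := n + extra
	for i := 0; i < n; i++ {
		// Assumption: the machine being updated serves nothing meanwhile.
		serving := n - 1 + extra
		if serving < lowest {
			lowest = serving
		}
	}
	return lowest
}

func main() {
	fmt.Println(minServing(1, false)) // canary destroyed before the roll
	fmt.Println(minServing(1, true))  // canary kept serving through the roll
}
```

Under these assumptions the single-machine case drops to zero serving instances when the canary is gone before the roll, and never drops below one when the canary is kept.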

Testing: added `TestDeployCanaryKeepsInstanceDuringRoll`, which records launch/destroy ordering and asserts that the canary is destroyed after the real machine is rolled. It fails on master (canary destroyed first) and passes with the fix. `go test ./internal/command/deploy/...` is green.
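
The ordering check described above can be sketched roughly as follows. This is not the actual test: `recorder`, `deploy`, and the event strings are hypothetical stand-ins for the real machine fakes, and the deferred call only illustrates why the destroy lands after the roll.

```go
package main

import "fmt"

// recorder is a hypothetical fake that logs lifecycle events in call order.
type recorder struct{ events []string }

func (r *recorder) add(e string) { r.events = append(r.events, e) }

// indexOf returns the position of e in the log, or -1 if it never happened.
func (r *recorder) indexOf(e string) int {
	for i, got := range r.events {
		if got == e {
			return i
		}
	}
	return -1
}

// deploy mimics the fixed flow: the canary is destroyed by a defer, so the
// destroy runs only after every existing machine has been updated.
func deploy(r *recorder, existing []string) {
	r.add("launch canary")
	defer r.add("destroy canary")
	for _, m := range existing {
		r.add("update " + m)
	}
}

func main() {
	r := &recorder{}
	deploy(r, []string{"machine-1"}) // the n=1 case
	rolled := r.indexOf("update machine-1")
	destroyed := r.indexOf("destroy canary")
	fmt.Println(rolled >= 0 && destroyed > rolled)
}
```

Comparing indices in a recorded log keeps the assertion independent of timing, so it should fail deterministically whenever the destroy is recorded before the update.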


Live-verified against a real org (single-machine app, HTTP-probed during deploy; cleaned up afterward):

  • Latest flyctl (master): during a --strategy canary deploy, live probing saw a multi-second window with no healthy responses (a 6s code=000 timeout in one run; 4.5s slow-200 stalls in another) as the canary was destroyed before the real machine rolled.
  • This branch: the same probe during a canary deploy saw 0 failed responses across ~220 probes; the registered canary keeps serving through the roll.

🤖 Generated with Claude Code

The canary strategy destroyed its canary machine before rolling the real
machines (deployCanaryMachines ran, and destroyed the canary in a deferred
closure, strictly before updateExistingMachines). The canary was also a
non-serving smoke-test machine (DNS SkipRegistration), so it never preserved
availability. For single-machine apps (n=1) this left a window with no serving
instance while the one real machine was updated in place, causing downtime.

Restructure the canary phase to preserve availability:
- deployCanaryMachines now returns the canary LeasableMachines instead of
  destroying them, and registers them in DNS so they serve traffic.
- deployMachinesApp destroys the canaries only after updateExistingMachines
  completes (via defer), so n+1 healthy serving instances exist throughout the
  roll. Canaries are also cleaned up if canary creation or the roll fails.

Adds TestDeployCanaryKeepsInstanceDuringRoll, which reproduces the n=1 case and
asserts the canary is destroyed after the existing machine is rolled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kylemclaren marked this pull request as draft June 30, 2026 14:00

Development

Successfully merging this pull request may close these issues.

Canary deployment broken