Progressive delivery: canary + analysis

On Ultron Infra, an app’s API isn’t deployed all at once. Argo Rollouts replaces the Deployment with a Rollout that ships each new version as a canary: it brings up new pods alongside the stable ones, shifts traffic weight in steps, and lets a metric analysis decide whether to continue or abort. A bad release rolls back automatically before it can take down traffic.

flowchart TD
  New[new image] --> S25[setWeight 25 + pause]
  S25 --> S50[setWeight 50 + pause]
  S50 --> S75[setWeight 75 + pause]
  S75 --> Promote[promote to stable]
  Analysis{{success-rate >= 0.95?}}
  S25 -.background gate.-> Analysis
  Analysis -->|pass| Promote
  Analysis -->|fail x3| Abort[abort -> roll back to stable]

To trigger a canary, bump the image tag (or the redeploy annotation) in workloads/<app>/rollout.yaml and push; Argo CD syncs the change. The steps shift traffic gradually:

strategy:
  canary:
    analysis:
      templates:
        - templateName: <app>-success-rate
      startingStep: 1 # analyze once the canary takes traffic
    steps:
      - setWeight: 25
      - pause: { duration: 60 }
      - setWeight: 50
      - pause: { duration: 60 }
      - setWeight: 75
      - pause: { duration: 60 }
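For orientation, here is a minimal sketch of where that strategy block and the fields you bump might sit in workloads/<app>/rollout.yaml. The resource name, labels, registry, tag, port, replica count, and annotation key are all placeholders, not the platform's actual values.

```yaml
# Sketch only: every name and value below is an assumption unless noted.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: <app>-api                 # hypothetical name
  namespace: <app>
spec:
  replicas: 4                     # assumption
  selector:
    matchLabels:
      app: <app>-api              # hypothetical label
  template:
    metadata:
      labels:
        app: <app>-api
      annotations:
        # Hypothetical key: the real redeploy annotation is not named here.
        # Changing any pod-template field, including this value, starts a canary.
        example.com/redeploy: "1"
    spec:
      containers:
        - name: api
          # Bumping this tag is the usual way to start a canary.
          image: registry.example.com/<app>-api:1.4.2
          ports:
            - name: http
              containerPort: 8080
  strategy:
    canary: {}                    # replace with the canary block shown above
```

Only changes under spec.template create a new canary; edits elsewhere in the manifest (for example to the steps themselves) do not roll new pods.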

The AnalysisTemplate runs as a background check during the canary. Every 30 seconds it queries the in-cluster Prometheus for the API’s success rate (non-5xx responses over total responses, on a 1-minute rate window) and requires it to stay ≥ 0.95 (95%):

metrics:
  - name: success-rate
    interval: 30s
    failureLimit: 2 # tolerate up to 2 sub-threshold reads; the 3rd aborts
    successCondition: "result[0] >= 0.95"
    provider:
      prometheus:
        address: http://prometheus-operated.monitoring:9090
        query: |
          sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
          /
          sum(rate(http_requests_total{namespace="<app>"}[1m]))
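The fragment above is only the metrics list. As a sketch of the complete resource it might live in, the wrapper below shows how the template name ties back to the Rollout's templateName. The namespace is an assumption; an AnalysisTemplate is namespaced and has to sit in the same namespace as the Rollout that references it.

```yaml
# Sketch of the full AnalysisTemplate; metadata.namespace is an assumption.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <app>-success-rate        # must match templateName in the Rollout
  namespace: <app>
spec:
  metrics:
    - name: success-rate
      interval: 30s               # one measurement every 30 seconds
      failureLimit: 2             # failures tolerated before the analysis fails
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="<app>"}[1m]))
```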

If the metric passes, the rollout walks through its weights and auto-promotes to stable. If it reads below 0.95 more than twice (failureLimit: 2 tolerates two failed measurements, so the third one fails the analysis), the rollout aborts and falls back to the previous version, with no human in the loop.
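To make the threshold concrete, here is a worked example with made-up traffic numbers (illustrative only, not measurements from this platform). Suppose the 1-minute rates come out at 200 requests/s in total, of which 12 requests/s return a 5xx code:

```latex
\[
\text{success rate}
  = \frac{\text{non-5xx rate}}{\text{total rate}}
  = \frac{200 - 12}{200}
  = 0.94 < 0.95
\]
```

That measurement counts as a failure. With 8 failing requests/s instead, the ratio is 192/200 = 0.96 and the measurement passes.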

http_requests_total comes from the API’s /metrics endpoint, scraped via a ServiceMonitor. Prometheus is the same instance used for dashboards and alerts (kube-prometheus-stack).
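A ServiceMonitor for that scrape could look roughly like the sketch below. The resource name, the Service label, the port name, the scrape interval, and the release label are assumptions; the release label in particular has to match whatever selector the kube-prometheus-stack Prometheus is configured with.

```yaml
# Sketch only: names, labels, port, and interval are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <app>-api                     # hypothetical name
  namespace: <app>
  labels:
    release: kube-prometheus-stack    # assumed to match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: <app>-api                  # hypothetical label on the API's Service
  endpoints:
    - port: http                      # name of the Service port (assumption)
      path: /metrics
      interval: 15s                   # assumption
```

The analysis query filters on the namespace and code labels, so the scraped series need both: the Prometheus Operator normally attaches namespace to ServiceMonitor targets, while code has to be emitted by the API itself.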

Because the canary gates on both new-pod readiness and live success rate, it doubles as a guardrail for risky config changes. A bad backend repoint fails /readyz on the new pods, the canary refuses to promote, and the old pods keep serving. See Operators and Secrets for what those pods depend on.
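The readiness half of that gate depends on the pod template declaring a probe against /readyz. A minimal sketch of such a probe inside the Rollout's pod template follows; the container name, image, port, and timing values are assumptions.

```yaml
# Excerpt of spec.template.spec in the Rollout; values are assumptions.
containers:
  - name: api
    image: registry.example.com/<app>-api:1.4.2
    readinessProbe:
      httpGet:
        path: /readyz             # endpoint named in the text above
        port: 8080                # assumption
      periodSeconds: 5            # assumption
      failureThreshold: 3         # assumption
```

New pods whose probe keeps failing never become ready, so they receive no traffic while the stable pods carry on serving.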