Progressive delivery: canary + analysis
On Ultron Infra, an app’s API is never deployed all at once. Argo Rollouts replaces the
Deployment with a Rollout that ships each new version as a canary: it brings up new
pods alongside the stable ones, shifts traffic weight in steps, and lets a metric analysis
decide whether to keep going or bail. A bad release rolls back automatically before it can
take down traffic.
The canary flow
flowchart TD
New[new image] --> S25[setWeight 25 + pause]
S25 --> S50[setWeight 50 + pause]
S50 --> S75[setWeight 75 + pause]
S75 --> Promote[promote to stable]
Analysis{{success-rate >= 0.95?}}
S25 -.background gate.-> Analysis
Analysis -->|pass| Promote
Analysis -->|failures exceed failureLimit| Abort[abort -> roll back to stable]
To trigger a canary you bump the image tag (or the redeploy annotation) in
workloads/<app>/rollout.yaml and push; Argo CD syncs it. The steps shift traffic
gradually:
strategy:
  canary:
    analysis:
      templates:
        - templateName: <app>-success-rate
      startingStep: 1 # analyze once the canary takes traffic
    steps:
      - setWeight: 25
      - pause: { duration: 60 }
      - setWeight: 50
      - pause: { duration: 60 }
      - setWeight: 75
      - pause: { duration: 60 }
The analysis gate
The AnalysisTemplate runs as a background check during the canary. It queries the in-cluster Prometheus for the API’s success rate (non-5xx responses over total responses) and requires it to stay ≥ 95%:
metrics:
  - name: success-rate
    interval: 30s
    failureLimit: 2 # tolerate 2 sub-threshold reads; a 3rd aborts
    successCondition: "result[0] >= 0.95"
    provider:
      prometheus:
        address: http://prometheus-operated.monitoring:9090
        query: |
          sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
          /
          sum(rate(http_requests_total{namespace="<app>"}[1m]))
If the metric passes, the rollout walks through its weights and auto-promotes to
stable. If it reads below 0.95 more often than failureLimit allows (with failureLimit: 2,
Argo Rollouts tolerates two failed measurements and fails the run on the third), the
rollout aborts and falls back to the previous version, with no human in the loop.
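For context, the metric definition sits inside an AnalysisTemplate resource whose name has to match the templateName the Rollout references. The sketch below shows one plausible shape for that wrapper; the namespace value is an assumption (a namespaced AnalysisTemplate is looked up in the Rollout's own namespace), and the metric fields simply restate the ones described in this section.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <app>-success-rate   # must equal templateName under canary.analysis in the Rollout
  namespace: <app>           # assumption: same namespace as the Rollout
spec:
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 2
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="<app>"}[1m]))
```

Because no count is set, a background analysis like this keeps measuring every interval until the rollout finishes or aborts.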
http_requests_total comes from the API’s /metrics endpoint, scraped via a
ServiceMonitor. Prometheus is the same instance used for dashboards and alerts
(kube-prometheus-stack).
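A ServiceMonitor for this setup might look like the sketch below. The resource name, label values, port name, and scrape interval are all assumptions for illustration and are not taken from the platform's manifests. Many kube-prometheus-stack installs only pick up ServiceMonitors carrying a label that matches the Helm release, so the release label here is a guess to check against your install.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <app>-api                      # hypothetical name
  namespace: <app>
  labels:
    release: kube-prometheus-stack     # assumption: matches Prometheus' serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: <app>    # assumption: label set on the API's Service
  endpoints:
    - port: http                       # assumption: named port on the Service
      path: /metrics
      interval: 30s                    # assumption: scrape interval
```

Since the analysis query uses a [1m] rate window, the scrape interval needs to be comfortably under a minute so that each window contains at least two samples.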
Why this is also a safe config cutover
Because the canary gates on new-pod readiness and live success rate, it doubles as a
guardrail for risky config changes: a bad backend repoint fails new-pod /readyz, the
canary refuses to promote, and the old pods keep serving. See Operators and Secrets
for what those pods depend on.
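The readiness half of this guardrail relies on the pod template using /readyz as its readiness probe. Below is a minimal sketch of the relevant part of workloads/<app>/rollout.yaml, assuming a container named api on port 8080 and probe timings chosen only for illustration; required fields such as selector and strategy are omitted for brevity.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: <app>-api                  # hypothetical name
spec:
  template:
    spec:
      containers:
        - name: api                # hypothetical container name
          image: <registry>/<app>:<tag>   # bumping this tag starts a new canary
          ports:
            - containerPort: 8080  # assumption
          readinessProbe:
            httpGet:
              path: /readyz        # expected to fail when a backend repoint is bad
              port: 8080
            periodSeconds: 5       # assumption
            failureThreshold: 3    # assumption
```

For the cutover to be safe, /readyz has to actually exercise the backends being repointed; a probe that only reports process liveness would let a bad config through to the success-rate gate.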