Progressive delivery: canary + analysis

On Ultron Infra, an app’s API is never deployed all at once. Argo Rollouts replaces the Deployment with a Rollout that ships each new version as a canary: it brings up new pods alongside the stable ones, shifts traffic weight in steps, and lets a metric analysis decide whether to continue or abort. A bad release rolls back automatically before it can take down traffic.

The canary flow

flowchart TD
  New[new image] --> S25[setWeight 25 + pause]
  S25 --> S50[setWeight 50 + pause]
  S50 --> S75[setWeight 75 + pause]
  S75 --> Promote[promote to stable]
  Analysis{{success-rate >= 0.95?}}
  S25 -.background gate.-> Analysis
  Analysis -->|pass| Promote
  Analysis -->|3rd failed read| Abort[abort -> roll back to stable]

To trigger a canary, bump the image tag (or the redeploy annotation) in workloads/<app>/rollout.yaml and push; Argo CD syncs the change. The steps then shift traffic gradually:

strategy:
  canary:
    analysis:
      templates:
        - templateName: <app>-success-rate
      startingStep: 1          # 0-based step index: start analysis at the first pause, once the canary takes traffic
    steps:
      - setWeight: 25
      - pause: { duration: 60 }
      - setWeight: 50
      - pause: { duration: 60 }
      - setWeight: 75
      - pause: { duration: 60 }
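For orientation, the strategy block above lives inside a Rollout manifest. The following is a minimal sketch of what workloads/<app>/rollout.yaml might look like, not the actual file: the names, labels, image, replica count, and port are all placeholders invented for illustration.

```yaml
# Sketch only: every name, label, image, and number here is a placeholder.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-api              # hypothetical
  namespace: example-app         # hypothetical; stands in for <app>
spec:
  replicas: 4                    # assumed
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          # Changing this tag changes the pod template, which starts a canary.
          image: registry.example.com/example-api:1.4.2
          ports:
            - name: http
              containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 25
        - pause: { duration: 60 }   # a bare integer is read as seconds
        # ...remaining steps as listed above
```

If the Rollout has no trafficRouting section, Argo Rollouts approximates each weight by the ratio of canary to stable pods, so with the four replicas assumed here a 25% step corresponds to one canary pod.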

The analysis gate

The AnalysisTemplate runs as a background check during the canary. It queries the in-cluster Prometheus for the API’s success rate — non-5xx responses over total — and requires it to stay ≥ 95%:

metrics:
  - name: success-rate
    interval: 30s
    failureLimit: 2                       # tolerate 2 sub-threshold reads; the 3rd aborts
    successCondition: "result[0] >= 0.95"
    provider:
      prometheus:
        address: http://prometheus-operated.monitoring:9090
        query: |
          sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
          /
          sum(rate(http_requests_total{namespace="<app>"}[1m]))

If the metric keeps passing, the rollout walks through its weights and auto-promotes to stable. If it reads below 0.95 more than twice (failureLimit: 2 tolerates two failed measurements, so the third one fails the analysis), the rollout aborts and falls back to the previous version, with no human in the loop.
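The metrics fragment above is the body of an AnalysisTemplate resource. As a sketch of how the whole object could be wired up, assuming a hypothetical app called example-app (the resource name and namespace are placeholders; the metric settings are copied from the fragment):

```yaml
# Sketch only: metadata values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: example-app-success-rate   # must match templateName in the Rollout
  namespace: example-app           # an AnalysisTemplate is namespaced; keep it with the Rollout
spec:
  metrics:
    - name: success-rate
      interval: 30s                # one measurement every 30 seconds
      failureLimit: 2              # maximum failed measurements tolerated
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="example-app",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="example-app"}[1m]))
```

Argo Rollouts marks a metric as failed once its count of failed measurements exceeds failureLimit, so with these settings the earliest possible abort comes on the third consecutive bad measurement, roughly a minute into the analysis at a 30s interval.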

http_requests_total comes from the API’s /metrics endpoint, scraped via a ServiceMonitor. Prometheus is the same instance used for dashboards and alerts (kube-prometheus-stack).
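A ServiceMonitor for that scrape could look roughly like the sketch below. This is an assumption about shape only: the names, the Service label, the port name, and the release label are hypothetical, and the label that the Prometheus instance actually selects on depends on how kube-prometheus-stack was installed.

```yaml
# Sketch only: names, labels, and the port name are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-api
  namespace: example-app
  labels:
    release: kube-prometheus-stack   # assumed; must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-api               # labels on the Service, not on the pods
  endpoints:
    - port: http                     # a named port on the Service
      path: /metrics
      interval: 30s
```

Targets discovered this way carry a namespace label, which is what the analysis query filters on.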

Why this is also a safe config cutover

Because the canary gates on both new-pod readiness and live success rate, it doubles as a guardrail for risky config changes. A bad backend repoint fails /readyz on the new pods, so the canary never reaches promotion and the old pods keep serving. See the Operators and Secrets those pods depend on.
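The readiness half of that guardrail comes from an ordinary Kubernetes readiness probe on the pod template. A minimal sketch, assuming the API serves /readyz on a container port named http; the container name, image, and probe timings are placeholders:

```yaml
# Fragment of the Rollout's spec.template.spec; all values are placeholders.
containers:
  - name: api
    image: registry.example.com/example-api:1.4.2
    ports:
      - name: http
        containerPort: 8080
    readinessProbe:
      httpGet:
        path: /readyz        # should fail when a required backend is unreachable
        port: http
      periodSeconds: 5       # assumed
      failureThreshold: 3    # assumed
```

For the guardrail to work, /readyz has to actually exercise the dependency being repointed; a probe that only confirms the process is running would let a bad backend config pass.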