Operate rollouts

On Ultron Infra, every app ships via an Argo Rollouts canary gated on a Prometheus success-rate metric. Day-2, you mostly watch it. Occasionally you promote or abort by hand. In the commands below, replace <app> with the name of the workload you’re operating; the rollout is named after the app, and the commands use the same name for the namespace.

kubectl argo rollouts get rollout <app> -n <app> --watch

This prints the live step, traffic weight, ReplicaSet revisions, pod readiness, and the background AnalysisRun status. Leave it running through a deploy.
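For scripts and CI jobs, where an interactive watch is not practical, the plugin's `status` subcommand can block until the rollout settles. The sketch below wraps it in a helper; the name `wait_for_rollout` and the 10-minute default are assumptions for illustration, not part of the platform.

```shell
# wait_for_rollout APP [TIMEOUT]
# Hypothetical helper: blocks until the rollout reports Healthy. It should
# return non-zero if the rollout ends up Degraded or the timeout elapses.
wait_for_rollout() {
  app="$1"
  timeout="${2:-10m}"   # assumed default; tune to your deploy duration
  kubectl argo rollouts status "$app" -n "$app" --timeout "$timeout"
}
```

A call such as `wait_for_rollout <app> 15m` can then gate a later pipeline step on the canary finishing.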

Other useful reads:

# Just the analysis runs (the metric gate)
kubectl get analysisrun -n <app>
# Pod-level detail when something's stuck
kubectl get pods -n <app> -l app=<app>

To promote or abort by hand:

# Resume a rollout paused at a canary step (advances to the next step)
kubectl argo rollouts promote <app> -n <app>
# Force full promotion, ignoring remaining steps/analysis
kubectl argo rollouts promote <app> -n <app> --full
# Abort: roll traffic back to the stable ReplicaSet
kubectl argo rollouts abort <app> -n <app>
# After fixing the underlying issue, restart the canary
kubectl argo rollouts retry rollout <app> -n <app>

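The metric gate aborts automatically on failure. Each measurement must satisfy successCondition: result[0] >= 0.95, and with failureLimit: 2 the analysis fails once more than two measurements miss that bar. You rarely need a manual abort unless you’re cutting a deploy short.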
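To make the failureLimit arithmetic concrete, here is a toy model of the gate in shell. It is not Argo Rollouts code: `gate_verdict` is a hypothetical function, and it assumes a measurement counts as failed whenever it is below the threshold and that the gate fails once failures exceed the limit.

```shell
# gate_verdict LIMIT THRESHOLD VALUE...
# Toy model (assumption, not Argo's implementation): a measurement fails
# when it is below THRESHOLD; the gate fails once failures exceed LIMIT.
gate_verdict() {
  limit="$1"; threshold="$2"; shift 2
  failed=0
  for value in "$@"; do
    # awk does the floating-point comparison that sh's test cannot.
    if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v < t) }'; then
      failed=$((failed + 1))
    fi
    if [ "$failed" -gt "$limit" ]; then
      echo "Failed"
      return 0
    fi
  done
  echo "Passing"
}

gate_verdict 2 0.95 0.99 0.90 0.91        # two misses: still within the limit
gate_verdict 2 0.95 0.90 0.91 0.99 0.80   # third miss: the gate fails
```

Under these assumptions, two bad measurements are tolerated and the third one trips the abort.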

The canary doubles as a safe way to repoint config (a new DB, a new Keycloak, changed env). The mechanism is the readiness probe:

flowchart LR
  Bad[bad repoint] --> New[new pods start]
  New -->|/readyz fails| NR[never become Ready]
  NR --> Hold[canary won't promote]
  Hold --> Stable[stable pods keep serving]

A bad repoint makes the new pods fail their /readyz probe, so they never become Ready and the canary won’t promote, while the old (stable) pods keep serving live traffic. This is what makes an auth-provider cutover safe: if the new config is wrong, no user ever sees it.
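When a repoint appears stuck, one way to confirm that readiness is what is holding it is to list each pod with the status of its Ready condition. This is a sketch: the helper name `ready_report` is hypothetical, and it assumes the pods carry the `app=<app>` label used in the pod listing earlier.

```shell
# ready_report APP
# Hypothetical helper: prints "<pod name>\t<Ready status>" for each pod.
# Pods failing /readyz should show False; stable pods should show True.
ready_report() {
  app="$1"
  kubectl get pods -n "$app" -l "app=$app" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}
```

Running `ready_report <app>` during a bad repoint should show the new pods stuck at False while the stable pods remain True.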