Operate rollouts

On Ultron Infra, every app ships via an Argo Rollouts canary gated on a Prometheus success-rate metric. In day-2 operations you mostly watch it, and occasionally promote or abort by hand. Replace <app> with the name of the workload you're operating; the rollout is named after the app, and the commands below assume its namespace is too.

Watch a rollout

kubectl argo rollouts get rollout <app> -n <app> --watch

This prints the live step, traffic weight, ReplicaSet revisions, pod readiness, and the background AnalysisRun status. Leave it running through a deploy.
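For a deploy script rather than a terminal, the same plugin has a `status` subcommand that blocks until the rollout settles. The sketch below wraps it in a hypothetical helper, `wait_for_rollout`; the helper name and the 10m default timeout are assumptions, and it relies on `status` exiting non-zero when the rollout ends up Degraded or the timeout expires.

```shell
# Sketch: block until <app>'s rollout is healthy, for use in a deploy script.
# wait_for_rollout is a hypothetical helper; the 10m default is an assumption.
# Args: app name, optional timeout (Go duration such as 10m).
wait_for_rollout() {
  # Assumes the namespace is named after the app, as elsewhere on this page.
  if kubectl argo rollouts status "$1" -n "$1" --timeout "${2:-10m}"; then
    echo "rollout $1: healthy"
  else
    echo "rollout $1: degraded or timed out" >&2
    return 1
  fi
}

# Usage: wait_for_rollout <app> 15m
```

A rollout paused at a manual step keeps this waiting until it is promoted or the timeout hits, so pick a timeout longer than the full canary schedule.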

Promote or abort

# Advance past the current pause step; later steps and analysis still run
kubectl argo rollouts promote <app> -n <app>

# Force full promotion, ignoring remaining steps/analysis
kubectl argo rollouts promote <app> -n <app> --full

# Abort: roll traffic back to the stable ReplicaSet
kubectl argo rollouts abort <app> -n <app>

# After fixing the underlying issue, restart the canary
kubectl argo rollouts retry rollout <app> -n <app>

The metric gate aborts automatically on failure: with successCondition: result[0] >= 0.95 and failureLimit: 2, the analysis fails once more than two measurements fall below a 95% success rate. You rarely need a manual abort unless you're cutting a deploy short.
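To make the gate's arithmetic concrete, here is a small simulation of how a verdict follows from a series of measurements. `gate_verdict` is a hypothetical helper, not part of Argo Rollouts, and it assumes the run fails as soon as the number of failed measurements exceeds failureLimit; it ignores interval, count, and inconclusive results.

```shell
# Sketch of how the gate's verdict follows from its measurements.
# gate_verdict is a hypothetical helper, not part of Argo Rollouts.
# Args: threshold, failureLimit, then one success-rate measurement per interval.
gate_verdict() {
  local threshold limit failed rate
  threshold="$1"
  limit="$2"
  shift 2
  failed=0
  for rate in "$@"; do
    # awk does the floating-point comparison; a measurement fails when rate < threshold
    if awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r < t) }'; then
      failed=$((failed + 1))
    fi
    # Assumption: the run fails once failures exceed failureLimit
    if [ "$failed" -gt "$limit" ]; then
      echo "Failed after $failed failed measurements"
      return 1
    fi
  done
  echo "Successful with $failed failed measurements"
}

# Usage: gate_verdict 0.95 2 0.99 0.90 0.97 0.93
```

With the values above, two dips below 0.95 are tolerated and the third triggers the abort.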

The safe config-cutover trick

The canary doubles as a safe way to repoint config (a new DB, a new Keycloak, changed env). The mechanism is the readiness probe:

flowchart LR
  Bad[bad repoint] --> New[new pods start]
  New -->|/readyz fails| NR[never become Ready]
  NR --> Hold[canary won't promote]
  Hold --> Stable[stable pods keep serving]

A bad repoint makes the new pods fail their /readyz probe, so they never become Ready, so the canary won’t promote — and the old (stable) pods keep serving live traffic. This is exactly how an auth-provider cutover is made safe: if the new config had been wrong, no users would have seen it.

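If a rollout sits at step 1 forever with the new ReplicaSet at 0 Ready, the new pods are failing readiness, not the metric gate. Check the pod's /readyz and its env/secret wiring. The app=<app> selector matches the stable pods as well, so look at the pods from the newest ReplicaSet: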

kubectl describe pod -n <app> -l app=<app>
kubectl logs -n <app> -l app=<app> --tail=50
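To narrow the output to the pods that are actually stuck, a jsonpath query can list only those whose Ready condition is not True. This is a minimal sketch: `not_ready_pods` is a hypothetical helper, and it reuses the app=<app> label and app-named namespace from the commands above.

```shell
# Sketch: print the names of <app>'s pods whose Ready condition is not "True".
# not_ready_pods is a hypothetical helper; the label and namespace follow the
# conventions used by the commands on this page.
not_ready_pods() {
  kubectl get pods -n "$1" -l "app=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    # Pods with no Ready condition yet have an empty second field and are listed too
    awk -F '\t' 'NF && $2 != "True" { print $1 }'
}

# Usage: not_ready_pods <app>
```

Each name it prints can then be passed to kubectl describe pod or kubectl logs individually.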