Canary & metrics
The point of an Argo Rollouts Rollout is the
metric-gated canary: new pods take a slice of traffic,
Prometheus watches their success rate, and Rollouts
auto-aborts if it dips. That needs three pieces — the app exposing /metrics, a
ServiceMonitor to scrape it, and an AnalysisTemplate
the Rollout consults.
    flowchart LR
      App[app pods<br/>/metrics] --> SM[ServiceMonitor]
      SM --> Prom[Prometheus]
      Prom --> AT[AnalysisTemplate<br/>success-rate query]
      AT --> Roll[Rollout canary]
      Roll -->|>= 95%| Promote[promote]
      Roll -->|< 95% x3| Abort[abort + rollback]

Expose /metrics
The app must serve Prometheus metrics at /metrics on its HTTP port, including a
counter the query can read — e.g. http_requests_total{code="..."}. The success-rate
gate below assumes that metric. No /metrics, no gate.
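
The ServiceMonitor in the next section selects a Service by its app label and scrapes the port named http, so the Service in front of the app has to carry both. A minimal sketch, assuming the pods are labeled app: <app> and the container serves /metrics on 8080 (the port numbers are hypothetical; only the port name and label are taken from this page):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: <app>
  namespace: <app>
  labels:
    app: <app>            # the ServiceMonitor's selector.matchLabels targets this label
spec:
  selector:
    app: <app>            # assumption: the Rollout's pods carry this label
  ports:
    - name: http          # the ServiceMonitor's endpoints[].port refers to this name
      port: 80            # hypothetical Service port
      targetPort: 8080    # hypothetical container port serving /metrics
```

The ServiceMonitor matches on the port name rather than the number, so renaming the port silently stops the scrape.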
ServiceMonitor
Tells the kube-prometheus-stack Prometheus to scrape the Service. The
release: kube-prom-stack label is required — that Prometheus’s
serviceMonitorSelector only matches monitors carrying it. The relabeling copies the
Rollouts pod-template-hash onto each sample so queries can be scoped to canary pods
later.
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: <app>
      namespace: <app>
      labels:
        release: kube-prom-stack   # REQUIRED: how that Prometheus selects this monitor
    spec:
      selector:
        matchLabels:
          app: <app>               # matches the Service label
      endpoints:
        - port: http
          path: /metrics
          interval: 15s
          relabelings:
            - sourceLabels: [__meta_kubernetes_pod_label_rollouts_pod_template_hash]
              targetLabel: rollouts_pod_template_hash

AnalysisTemplate
The gate. Success rate = non-5xx requests / total requests, queried from Prometheus.
successCondition must hold on each read; failureLimit: 2 tolerates two sub-threshold reads, and the third aborts the canary.
The query is namespace-wide here — tighten it to canary-only with
rollouts_pod_template_hash once you need to.
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: <app>-success-rate
      namespace: <app>
    spec:
      metrics:
        - name: success-rate
          interval: 30s
          failureLimit: 2            # tolerate 2 sub-threshold reads; the 3rd aborts
          successCondition: "result[0] >= 0.95"
          provider:
            prometheus:
              address: http://prometheus-operated.monitoring:9090
              query: |
                sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
                /
                sum(rate(http_requests_total{namespace="<app>"}[1m]))

Wire it into the Rollout
Add an analysis block to the Rollout’s strategy.canary (from
GitOps & deploy). It runs as a background analysis
alongside the traffic steps and aborts the canary the moment the template fails.
    strategy:
      canary:
        analysis:                  # background metric gate; aborts the canary if it fails
          templates:
            - templateName: <app>-success-rate
          startingStep: 1          # begin analyzing once the canary takes traffic
        steps:
          - setWeight: 25
          - pause: { duration: 60 }
          - setWeight: 50
          - pause: { duration: 60 }
          - setWeight: 75
          - pause: { duration: 60 }

startingStep: 1 holds analysis until the canary actually has traffic (after the
first setWeight), so the first samples reflect real canary requests.
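
The namespace-wide query in the AnalysisTemplate can be narrowed to canary pods using the rollouts_pod_template_hash label that the ServiceMonitor relabeling adds. The following is a sketch, not the configuration this setup ships with: the template name <app>-canary-success-rate and the argument name canary-hash are illustrative. The Rollout passes the newest ReplicaSet's hash to the template through podTemplateHashValue: Latest.

```yaml
# Variant of the AnalysisTemplate that takes the canary's pod-template hash as an argument.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <app>-canary-success-rate    # illustrative name
  namespace: <app>
spec:
  args:
    - name: canary-hash              # illustrative argument name, supplied by the Rollout
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 2
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="<app>",rollouts_pod_template_hash="{{args.canary-hash}}",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="<app>",rollouts_pod_template_hash="{{args.canary-hash}}"}[1m]))
---
# Fragment of the Rollout spec: same analysis block, now passing the hash.
strategy:
  canary:
    analysis:
      templates:
        - templateName: <app>-canary-success-rate
      startingStep: 1
      args:
        - name: canary-hash
          valueFrom:
            podTemplateHashValue: Latest   # hash of the new (canary) ReplicaSet
```

Scoped this way, healthy stable pods can no longer mask errors coming from the canary. The trade-off is that a canary receiving little or no traffic gives the ratio no meaningful value, so it is worth checking how the gate behaves at low request rates before relying on it.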
Triggering a canary
A canary runs whenever the Rollout’s pod template changes. Two ways:
- New image: bump the image tag to the latest sha-<short> from CI. This is the normal deploy.
- Config-only redeploy: bump the redeploy annotation on the pod template when no image change is involved (e.g. a config cutover):

      template:
        metadata:
          annotations:
            <domain>/redeploy: "config-cutover"   # change the value to force a new canary

Because a bad config cutover fails new-pod readiness, the canary won’t promote and old pods keep serving — the canary doubles as a safe config-change mechanism.
Secrets — the out-of-band secret pattern and the GHCR
public-package vs imagePullSecret choice.