Canary & metrics

The point of an Argo Rollouts Rollout is the metric-gated canary: new pods take a slice of traffic, Prometheus watches their success rate, and Rollouts auto-aborts if it dips. That needs three pieces — the app exposing /metrics, a ServiceMonitor to scrape it, and an AnalysisTemplate the Rollout consults.

flowchart LR
  App[app pods<br/>/metrics] --> SM[ServiceMonitor]
  SM --> Prom[Prometheus]
  Prom --> AT[AnalysisTemplate<br/>success-rate query]
  AT --> Roll[Rollout canary]
  Roll -->|>= 95%| Promote[promote]
  Roll -->|< 95% x3| Abort[abort + rollback]

Expose /metrics

The app must serve Prometheus metrics at /metrics on its HTTP port, including a counter the query can read — e.g. http_requests_total{code="..."}. The success-rate gate below assumes that metric. No /metrics, no gate.

ServiceMonitor

Tells the kube-prometheus-stack Prometheus to scrape the Service. The release: kube-prom-stack label is required: that Prometheus's serviceMonitorSelector only matches monitors carrying it. The relabeling copies each pod's rollouts-pod-template-hash label (set by Argo Rollouts) onto every sample as rollouts_pod_template_hash, so queries can be scoped to canary pods later.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <app>
  namespace: <app>
  labels:
    release: kube-prom-stack       # REQUIRED: how that Prometheus selects this monitor
spec:
  selector:
    matchLabels:
      app: <app>                   # matches the Service label
  endpoints:
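    - port: http                   # name of the Service port, not its number
      path: /metrics
      interval: 15s
      relabelings:
        # copy the pod's rollouts-pod-template-hash label onto every scraped sample
        - sourceLabels: [__meta_kubernetes_pod_label_rollouts_pod_template_hash]
          targetLabel: rollouts_pod_template_hash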
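
For reference, a Service that this ServiceMonitor would select might look like the sketch below. The port numbers and the pod selector are assumptions for illustration. What matters is that the Service carries the app: <app> label and names its port http, since the monitor matches on both.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: <app>
  namespace: <app>
  labels:
    app: <app>            # the ServiceMonitor's selector.matchLabels matches this
spec:
  selector:
    app: <app>            # assumption: pods are labeled app: <app>
  ports:
    - name: http          # the ServiceMonitor's endpoints[].port refers to this name
      port: 80            # assumption: Service port
      targetPort: 8080    # assumption: container port serving the app and /metrics
```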

AnalysisTemplate

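The gate. Success rate = non-5xx requests / total requests, queried from Prometheus. A measurement passes only when successCondition holds. failureLimit: 2 tolerates two sub-threshold reads and aborts on the third, because Rollouts fails a metric once its failure count exceeds the limit; set failureLimit: 1 to abort on the second. The query is namespace-wide here; tighten it to canary-only with rollouts_pod_template_hash once you need to.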

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <app>-success-rate
  namespace: <app>
spec:
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 2          # tolerate 2 sub-threshold reads; the 3rd aborts
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="<app>"}[1m]))
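
A canary-only variant could look like the following sketch. It assumes the ServiceMonitor relabeling above has put rollouts_pod_template_hash on the samples. It also introduces a template argument named canary-hash (a hypothetical name) that the Rollout has to supply, for example with valueFrom.podTemplateHashValue: Latest. The template name is illustrative too.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <app>-canary-success-rate   # hypothetical name
  namespace: <app>
spec:
  args:
    - name: canary-hash             # hypothetical arg; the Rollout must supply it
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 2
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          # same ratio as the namespace-wide query, restricted to canary pods
          query: |
            sum(rate(http_requests_total{namespace="<app>",rollouts_pod_template_hash="{{args.canary-hash}}",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="<app>",rollouts_pod_template_hash="{{args.canary-hash}}"}[1m]))
```

Scoping to the canary keeps its errors from being diluted by stable-pod traffic at low weights. The trade-off is that each window is computed from fewer requests.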

Wire it into the Rollout

Add an analysis block to the Rollout’s strategy.canary (from GitOps & deploy). It runs as a background analysis alongside the traffic steps and aborts the canary as soon as the analysis run fails, that is, once the metric exceeds its failureLimit.

  strategy:
    canary:
      analysis:                         # background metric gate; aborts the canary if it fails
        templates:
          - templateName: <app>-success-rate
        startingStep: 1                 # begin analyzing once the canary takes traffic
      steps:
        - setWeight: 25
        - pause: { duration: 60 }
        - setWeight: 50
        - pause: { duration: 60 }
        - setWeight: 75
        - pause: { duration: 60 }

startingStep is a zero-based index into steps, so startingStep: 1 holds analysis until the first setWeight has completed and the canary actually has traffic. The first samples then reflect real canary requests.
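
If the template takes the canary's pod-template hash as an argument (a canary-scoped query would need it), the analysis block can pass it in. This is a sketch: the argument name canary-hash is an assumption and must match whatever the template declares under spec.args, and it presumes <app>-success-rate has been extended to declare it.

```yaml
  strategy:
    canary:
      analysis:
        templates:
          - templateName: <app>-success-rate
        startingStep: 1
        args:
          - name: canary-hash               # assumption: must match an arg declared by the template
            valueFrom:
              podTemplateHashValue: Latest  # resolves to the hash of the new (canary) ReplicaSet
```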

Triggering a canary

A canary runs whenever the Rollout’s pod template changes. Two ways:

New image — bump the image tag to the latest sha-<short> from CI. This is the normal deploy.

Config-only redeploy — bump the redeploy annotation on the pod template when no image change is involved (e.g. a config cutover):

template:
  metadata:
    annotations:
      <domain>/redeploy: "config-cutover"   # change the value to force a new canary

If a bad config cutover keeps the new pods from becoming ready, the canary never promotes and the old pods keep serving, so the canary doubles as a safe config-change mechanism.
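
For comparison, the normal image deploy touches only the container image in the same pod template. A sketch, assuming the image lives on GHCR under a placeholder <owner> and the container is named <app>:

```yaml
template:
  spec:
    containers:
      - name: <app>                                # assumption: container name
        image: ghcr.io/<owner>/<app>:sha-<short>   # bump to the new sha-<short> tag from CI
```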

Next: Secrets, covering the out-of-band secret pattern and the GHCR public-package vs imagePullSecret choice.