Canary & metrics

The point of an Argo Rollouts Rollout is the metric-gated canary: new pods take a slice of traffic, Prometheus watches their success rate, and Rollouts auto-aborts if it dips. That needs three pieces — the app exposing /metrics, a ServiceMonitor to scrape it, and an AnalysisTemplate the Rollout consults.

flowchart LR
  App[app pods<br/>/metrics] --> SM[ServiceMonitor]
  SM --> Prom[Prometheus]
  Prom --> AT[AnalysisTemplate<br/>success-rate query]
  AT --> Roll[Rollout canary]
  Roll -->|>= 95%| Promote[promote]
  Roll -->|< 95% x3| Abort[abort + rollback]

The app must serve Prometheus metrics at /metrics on its HTTP port, including a counter the query can read — e.g. http_requests_total{code="..."}. The success-rate gate below assumes that metric. No /metrics, no gate.

The ServiceMonitor tells the kube-prometheus-stack Prometheus to scrape the Service. The release: kube-prom-stack label is required: that Prometheus’s serviceMonitorSelector only matches monitors carrying it. The relabeling copies the Rollouts pod-template-hash label onto each sample as rollouts_pod_template_hash, so queries can be scoped to canary pods later.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <app>
  namespace: <app>
  labels:
    release: kube-prom-stack # REQUIRED: how that Prometheus selects this monitor
spec:
  selector:
    matchLabels:
      app: <app> # matches the Service label
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_rollouts_pod_template_hash]
          targetLabel: rollouts_pod_template_hash
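
For the selector and port: http in the ServiceMonitor to resolve, the Service needs a matching app label and a port named http. The following is a minimal sketch of such a Service; the port numbers and the pod selector are assumptions, not values taken from this setup.

```yaml
# Hypothetical Service for the ServiceMonitor to select.
apiVersion: v1
kind: Service
metadata:
  name: <app>
  namespace: <app>
  labels:
    app: <app>          # what the ServiceMonitor's matchLabels selects on
spec:
  selector:
    app: <app>          # assumed pod label; use whatever the Rollout's pod template carries
  ports:
    - name: http        # the ServiceMonitor's endpoints[].port refers to this name, not the number
      port: 80          # assumed
      targetPort: 8080  # assumed container port that serves /metrics
```

If the port is unnamed or named differently, the ServiceMonitor matches the Service but finds no endpoint to scrape.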

The AnalysisTemplate is the gate. Success rate = non-5xx requests / total requests, queried from Prometheus. Each measurement must satisfy successCondition; failureLimit: 2 tolerates two sub-threshold reads and aborts on the third. The query is namespace-wide here, so it mixes stable and canary traffic; tighten it to canary-only with rollouts_pod_template_hash once you need to.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <app>-success-rate
  namespace: <app>
spec:
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 2 # tolerate 2 sub-threshold reads; the 3rd aborts
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="<app>",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="<app>"}[1m]))

Add an analysis block to the Rollout’s strategy.canary (from GitOps & deploy). It runs as a background analysis alongside the traffic steps and aborts the canary the moment the template fails.

strategy:
  canary:
    analysis: # background metric gate; aborts the canary if it fails
      templates:
        - templateName: <app>-success-rate
      startingStep: 1 # begin analyzing once the canary takes traffic
    steps:
      - setWeight: 25
      - pause: { duration: 60 }
      - setWeight: 50
      - pause: { duration: 60 }
      - setWeight: 75
      - pause: { duration: 60 }

Steps are zero-indexed, so startingStep: 1 holds analysis until the first pause, after setWeight: 25 has given the canary real traffic. The first samples therefore reflect real canary requests.
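
The canary-only tightening mentioned for the AnalysisTemplate could look like the sketch below. It assumes the ServiceMonitor relabeling is in place, so samples carry rollouts_pod_template_hash. The argument name canary-hash is made up for illustration; args, valueFrom.podTemplateHashValue and the {{args.<name>}} substitution are standard Argo Rollouts features.

```yaml
# Sketch: scope the success-rate query to canary pods only.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <app>-success-rate
  namespace: <app>
spec:
  args:
    - name: canary-hash # hypothetical argument name
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 2
      successCondition: "result[0] >= 0.95"
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="<app>",rollouts_pod_template_hash="{{args.canary-hash}}",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="<app>",rollouts_pod_template_hash="{{args.canary-hash}}"}[1m]))
---
# Rollout fragment: pass the new ReplicaSet's hash into the template.
strategy:
  canary:
    analysis:
      templates:
        - templateName: <app>-success-rate
      startingStep: 1
      args:
        - name: canary-hash
          valueFrom:
            podTemplateHashValue: Latest # hash of the canary (new) ReplicaSet
```

At low canary weights the canary may serve too few requests for the ratio to be meaningful, or the query may return nothing at all, so a canary-only gate may need a longer rate window than the namespace-wide one.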

A canary runs whenever the Rollout’s pod template changes. Two ways:

  • New image — bump the image tag to the latest sha-<short> from CI. This is the normal deploy.

  • Config-only redeploy — bump the redeploy annotation on the pod template when no image change is involved (e.g. a config cutover):

    template:
      metadata:
        annotations:
          <domain>/redeploy: "config-cutover" # change the value to force a new canary

If a bad config cutover makes the new pods fail readiness, the canary won’t promote and the old pods keep serving, so the canary doubles as a safe config-change mechanism.
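
That protection depends on the pod template having a readiness probe that actually exercises the new config. Here is a sketch, assuming a hypothetical /healthz endpoint that fails when the config cannot be loaded; the path, port name and timings are illustrative, not taken from this setup.

```yaml
# Pod template fragment (hypothetical values throughout).
template:
  spec:
    containers:
      - name: <app>
        readinessProbe:
          httpGet:
            path: /healthz # assumed endpoint; should fail on a bad config
            port: http     # assumed named container port
          periodSeconds: 5
          failureThreshold: 3
```

A probe that only checks that the process is listening would let a misconfigured pod become ready, leaving the success-rate gate as the only line of defense.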

Secrets — the out-of-band secret pattern and the GHCR public-package vs imagePullSecret choice.