# Progressive Canary Releases with Argo Rollouts Analysis and Linkerd Metrics
## Introduction
Has this, or something like it, ever happened to you?
- Your team is about to deploy a breaking change to production.
- You deploy to acceptance and everything looks great.
- You deploy the change to production.
- Some very important piece of software that no one knew about, but that depended on yours, starts failing.
- You panic and scramble to find the fastest way to roll back.
- After an hour or so, you roll back.
- You tell yourself: “I will never let this happen again!”
- …
Canary releases are a powerful deployment strategy that allows you to gradually roll out new versions of your application while minimizing risk. By combining Argo Rollouts, Linkerd, and Prometheus metrics, you can automate canary deployments that are not only incremental but also validated by real-time metrics analysis.
In this article, I’ll focus on Argo Rollouts Analysis, which enables automated decision-making during a rollout based on metrics from Prometheus (collected from Linkerd). I’ll walk through a practical example of setting up a canary release that progresses step-by-step and automatically rolls back if key performance metrics degrade.
## Prerequisites
Note: I already have a repository that contains everything you need to try out this demo. Please have a look at it if you just want to get things working on a test cluster.
Demo Repo: https://github.com/mirrajabi/argo-rollouts-linkerd-prometheus
Before diving in, ensure you have the following set up:
- Linkerd (or your favourite service mesh) installed in your cluster for traffic routing and metrics collection.
- Prometheus configured to scrape Linkerd's metrics.
- Argo Rollouts installed in the cluster, along with the kubectl argo rollouts plugin.
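If you're starting from an empty cluster, the setup looks roughly like the following. This is a sketch, not a definitive install guide: it assumes the Linkerd CLI and its linkerd-smi plugin are on your PATH, and it uses the community Prometheus Helm chart (which creates the `prometheus-server` service referenced later in the analysis template); check each project's docs for current instructions.

```bash
# Linkerd control plane plus the SMI extension (provides TrafficSplit support)
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd smi install | kubectl apply -f -

# Prometheus (it must also be configured to scrape the Linkerd proxies;
# the demo repo linked above contains a working configuration)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus -n monitoring --create-namespace

# Argo Rollouts controller and the kubectl plugin
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
brew install argoproj/tap/kubectl-argo-rollouts
```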
## Step 1: Define an Argo Rollout with Canary Strategy
Instead of a standard Kubernetes Deployment, we use an Argo Rollout resource with a canary strategy. First, create a namespace for the application and annotate it for Linkerd injection:
```bash
kubectl create namespace apps
kubectl annotate namespace apps linkerd.io/inject=enabled
```
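You can quickly verify that the annotation is in place; once pods are created in this namespace, each should get a `linkerd-proxy` container injected:

```bash
# Should print "enabled"
kubectl get namespace apps -o jsonpath='{.metadata.annotations.linkerd\.io/inject}'

# After pods exist: every pod should list a linkerd-proxy container
kubectl get pods -n apps -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
```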
Create a Rollout resource in the `apps` namespace:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
  namespace: apps
spec:
  replicas: 1
  strategy:
    canary:
      canaryService: "myapp-canary"
      stableService: "myapp-stable"
      trafficRouting:
        smi:
          rootService: myapp
      analysis:
        templates:
          - templateName: success-rate-analysis
        args:
          - name: service-name
            value: myapp-canary
          - name: from
            # The name of the service that depends on this application
            value: dependant-api-service
      steps:
        - setWeight: 10
        - pause: {duration: 30s}
        - setWeight: 40
        - pause: {duration: 30s}
        - setWeight: 70
        - pause: {duration: 30s}
        - setWeight: 90
        - pause: {duration: 30s}
        - setWeight: 100
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/mccutchen/go-httpbin:2.17
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          resources:
            requests:
              memory: 256Mi
              cpu: 100m
```
This rollout:
- Runs the analysis from the `success-rate-analysis` template side-by-side with the rollout to validate success rates before proceeding.
- Uses SMI to create TrafficSplit resources for the canary and stable services (a sketch of the generated resource follows this list).
- Passes the `service-name` and `from` arguments to the analysis template.
- Sends 10% of traffic to the new version.
- Repeats the process at 40%, 70%, and 90%, then moves to 100%.
- Rolls back to the previous version if the analysis fails.
- Continues to the next step if the analysis succeeds, finally promoting the new version to stable.
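For illustration, during the 10% step the controller maintains a TrafficSplit that looks roughly like this. This is a sketch: Argo Rollouts targets the `split.smi-spec.io/v1alpha1` schema by default and manages this resource itself, so you don't apply it yourself.

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: myapp          # created and updated by the rollout controller
  namespace: apps
spec:
  service: myapp       # the rootService that clients call
  backends:
    - service: myapp-stable
      weight: 90
    - service: myapp-canary
      weight: 10       # follows the current setWeight step
```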
Note: I'm using a simple HTTP service to demonstrate error responses. Replace the image with your own application.
## Step 2: Define the Canary and Stable Services
To route traffic to the canary and stable versions, we need two additional services, `myapp-canary` and `myapp-stable`, next to the root `myapp` service. Linkerd uses them to split traffic according to the SMI TrafficSplit.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: apps
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/service_stage: "default"
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app.kubernetes.io/name: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: "myapp-canary"
  namespace: apps
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/service_stage: "canary"
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app.kubernetes.io/name: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: "myapp-stable"
  namespace: apps
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/service_stage: "stable"
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app.kubernetes.io/name: myapp
```
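A quick sanity check after applying the manifests (the `linkerd viz stat` line assumes the viz extension is installed):

```bash
# All three services should select the same pods until a rollout is in progress
kubectl get svc,endpoints -n apps

# Per-deployment success rate and latency as observed by the Linkerd proxies
linkerd viz stat deploy -n apps
```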
## Step 3: Configure Argo Rollouts Analysis
The key to automated decision-making is Argo Rollouts Analysis. We define an `AnalysisTemplate` that queries Prometheus for Linkerd's success-rate metrics.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-analysis
  namespace: apps
spec:
  args:
    - name: from
    - name: service-name
  metrics:
    - name: success-rate
      interval: 15s
      successCondition: len(result) == 0 || result[0] == 0
      failureCondition: len(result) > 0 && result[0] > 0
      failureLimit: 0
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local:80
          timeout: 20
          query: sum by(service) (increase(response_total{deployment="{{args.from}}", dst_service="{{args.service-name}}", direction="outbound", classification="failure", status_code="500"}[30s]))
```
This template:
- Queries Prometheus for failed (HTTP 500) responses recorded by Linkerd for the canary deployment.
- Fails if there are ANY 500 errors from the `{{args.from}}` workload to the `{{args.service-name}}` service.
- Fails the analysis (and triggers a rollback) the first time the condition is violated (`failureLimit: 0`).
Note: Don't lean too heavily on the PromQL query in this example; adjust it to your own needs. You don't even need Linkerd's metrics specifically: any usable metric can drive the success evaluation of your rollout.
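Before trusting a query as an automated gate, it's worth running it by hand against the Prometheus HTTP API, substituting concrete values for the template arguments. A sketch, assuming the same `prometheus-server` service as in the template:

```bash
kubectl -n monitoring port-forward svc/prometheus-server 9090:80 &

# An empty result vector satisfies successCondition;
# any series with a positive value trips failureCondition.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by(service) (increase(response_total{deployment="dependant-api-service", dst_service="myapp-canary", direction="outbound", classification="failure", status_code="500"}[30s]))'
```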
## Step 4: Triggering the Rollout
Once the above resources have been applied and have stabilized, update the rollout's image to trigger a new canary release:
```bash
# Trigger a rollout and start calling endpoints that return error responses
kubectl argo rollouts set image myapp myapp=ghcr.io/mccutchen/go-httpbin:2.17.1 -n apps

# Monitor progress
kubectl argo rollouts get rollout myapp -n apps --watch
```
If the success conditions are not met, the rollout will automatically roll back to the previous version.
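For the analysis to have anything to measure, the canary needs to receive failing traffic from the `{{args.from}}` workload while the rollout is in progress. A sketch, assuming a meshed `dependant-api-service` deployment exists and has `curl` available (add `-c <container>` if needed); go-httpbin's `/status/500` endpoint always responds with HTTP 500:

```bash
# A share of these requests is routed to the canary; Linkerd records them
# as failures against myapp-canary, which trips the analysis and aborts
# the rollout.
kubectl -n apps exec deploy/dependant-api-service -- \
  sh -c 'while true; do curl -s -o /dev/null http://myapp/status/500; sleep 1; done'
```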
## Conclusion
By combining Argo Rollouts, Linkerd (or any other service mesh), and Prometheus/Mimir, you can automate safe, metrics-driven canary releases. The rollout progresses only if key performance indicators (like success rate) remain healthy, reducing the risk of bad deployments.
This approach ensures that:
- ✅ There's no need to panic when your colleagues decide to deploy on a Friday afternoon.
- ✅ Releases are gradual and automated.
- ✅ Rollbacks happen before a large percentage of users is impacted.
- ✅ Decisions are data-driven, not guesswork.
## Demo
See a full example of all of this in action in this repository: https://github.com/mirrajabi/argo-rollouts-linkerd-prometheus