# Progressive Canary Releases with Argo Rollouts Analysis and Linkerd Metrics
## Introduction
Has this, or something like it, ever happened to you?
- Your team is about to deploy a breaking change to production.
- You deploy to acceptance and everything looks great.
- You deploy the change to production.
- Some very important piece of software that no one knew about, but that depended on yours, starts failing.
- You panic and scramble to find the fastest way to roll back.
- After an hour or so, you roll back.
- You tell yourself: “I will never let this happen again!”
- …
Canary releases are a powerful deployment strategy that allows you to gradually roll out new versions of your application while minimizing risk. By combining Argo Rollouts, Linkerd, and Prometheus metrics, you can automate canary deployments that are not only incremental but also validated by real-time metrics analysis.
In this article, I’ll focus on Argo Rollouts Analysis, which enables automated decision-making during a rollout based on metrics from Prometheus (collected from Linkerd). I’ll walk through a practical example of setting up a canary release that progresses step-by-step and automatically rolls back if key performance metrics degrade.
## Prerequisites
Note: I already have a repository that contains everything you need to try out this demo. Please have a look at it if you just want to get things working on a test cluster.
Demo Repo: https://github.com/mirrajabi/argo-rollouts-linkerd-prometheus
Before diving in, ensure you have the following set up:
- Linkerd (or your favourite service mesh) installed in your cluster for traffic routing and metrics collection.
- Prometheus configured to scrape Linkerd's metrics.
- Argo Rollouts installed in the cluster, along with the kubectl argo rollouts plugin.
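If you're starting from an empty cluster, the setup looks roughly like the following. This is a sketch, not a definitive install guide: it assumes the Linkerd CLI and its linkerd-smi plugin are on your PATH, and it uses the community Prometheus Helm chart (which creates the `prometheus-server` service referenced later in the analysis template); check each project's docs for current instructions.

```bash
# Linkerd control plane plus the SMI extension (provides TrafficSplit support)
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd smi install | kubectl apply -f -

# Prometheus (it must also be configured to scrape the Linkerd proxies;
# the demo repo linked above contains a working configuration)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus -n monitoring --create-namespace

# Argo Rollouts controller and the kubectl plugin
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
brew install argoproj/tap/kubectl-argo-rollouts
```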
## Step 1: Define an Argo Rollout with Canary Strategy
Instead of a standard Kubernetes Deployment, we use an Argo Rollout resource with a canary strategy. First, create a namespace for the application and annotate it for Linkerd injection:
```bash
kubectl create namespace apps
kubectl annotate namespace apps linkerd.io/inject=enabled
```
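You can quickly verify that the annotation is in place; once pods are created in this namespace, each should get a `linkerd-proxy` container injected:

```bash
# Should print "enabled"
kubectl get namespace apps -o jsonpath='{.metadata.annotations.linkerd\.io/inject}'

# After pods exist: every pod should list a linkerd-proxy container
kubectl get pods -n apps -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
```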
Create a Rollout resource in the `apps` namespace:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
  namespace: apps
spec:
  replicas: 1
  strategy:
    canary:
      canaryService: "myapp-canary"
      stableService: "myapp-stable"
      trafficRouting:
        smi:
          rootService: myapp
      analysis:
        templates:
          - templateName: success-rate-analysis
        args:
          - name: service-name
            value: myapp-canary
          - name: from
            # The name of the service that depends on this application
            value: dependant-api-service
      steps:
        - setWeight: 10
        - pause: {duration: 30s}
        - setWeight: 40
        - pause: {duration: 30s}
        - setWeight: 70
        - pause: {duration: 30s}
        - setWeight: 90
        - pause: {duration: 30s}
        - setWeight: 100
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/mccutchen/go-httpbin:2.17
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          resources:
            requests:
              memory: 256Mi
              cpu: 100m
```
This rollout:
- Runs the analysis from the `success-rate-analysis` template side-by-side with the rollout to validate success rates before proceeding.
- Uses SMI to create TrafficSplit resources for the canary and stable services (a sketch of the generated resource follows this list).
- Passes the `service-name` and `from` arguments to the analysis template.
- Sends 10% of traffic to the new version.
- Repeats the process at 40%, 70%, and 90%, then moves to 100%.
- Rolls back to the previous version if the analysis fails.
- Continues to the next step if the analysis succeeds, finally promoting the new version to stable.
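For illustration, during the 10% step the controller maintains a TrafficSplit that looks roughly like this. This is a sketch: Argo Rollouts targets the `split.smi-spec.io/v1alpha1` schema by default and manages this resource itself, so you don't apply it yourself.

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: myapp          # created and updated by the rollout controller
  namespace: apps
spec:
  service: myapp       # the rootService that clients call
  backends:
    - service: myapp-stable
      weight: 90
    - service: myapp-canary
      weight: 10       # follows the current setWeight step
```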
Note: I'm using a simple HTTP service to demonstrate error responses. Replace the image with your own application.
## Step 2: Define the Canary and Stable Services
To route traffic to the canary and stable versions, we need two additional services, `myapp-canary` and `myapp-stable`, next to the root `myapp` service. Linkerd uses them to split traffic according to the SMI TrafficSplit.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: apps
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/service_stage: "default"
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app.kubernetes.io/name: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: "myapp-canary"
  namespace: apps
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/service_stage: "canary"
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app.kubernetes.io/name: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: "myapp-stable"
  namespace: apps
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/service_stage: "stable"
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app.kubernetes.io/name: myapp
```
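A quick sanity check after applying the manifests (the `linkerd viz stat` line assumes the viz extension is installed):

```bash
# All three services should select the same pods until a rollout is in progress
kubectl get svc,endpoints -n apps

# Per-deployment success rate and latency as observed by the Linkerd proxies
linkerd viz stat deploy -n apps
```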
## Step 3: Configure Argo Rollouts Analysis
The key to automated decision-making is Argo Rollouts Analysis. We define an `AnalysisTemplate` that queries Prometheus for Linkerd's success-rate metrics.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-analysis
  namespace: apps
spec:
  args:
    - name: from
    - name: service-name
  metrics:
    - name: success-rate
      interval: 15s
      successCondition: len(result) == 0 || result[0] == 0
      failureCondition: len(result) > 0 && result[0] > 0
      failureLimit: 0
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local:80
          timeout: 20
          query: sum by(service) (increase(response_total{deployment="{{args.from}}", dst_service="{{args.service-name}}", direction="outbound", classification="failure", status_code="500"}[30s]))
```
This template:
- Queries Prometheus for failed (HTTP 500) responses recorded by Linkerd for the canary deployment.
- Fails if there are ANY 500 errors from the `{{args.from}}` workload to the `{{args.service-name}}` service.
- Fails the analysis (and triggers a rollback) the first time the condition is violated (`failureLimit: 0`).
Note: Don't lean too heavily on the PromQL query in this example; adjust it to your own needs. You don't even need Linkerd's metrics specifically: any usable metric can drive the success evaluation of your rollout.
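Before trusting a query as an automated gate, it's worth running it by hand against the Prometheus HTTP API, substituting concrete values for the template arguments. A sketch, assuming the same `prometheus-server` service as in the template:

```bash
kubectl -n monitoring port-forward svc/prometheus-server 9090:80 &

# An empty result vector satisfies successCondition;
# any series with a positive value trips failureCondition.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by(service) (increase(response_total{deployment="dependant-api-service", dst_service="myapp-canary", direction="outbound", classification="failure", status_code="500"}[30s]))'
```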
## Step 4: Triggering the Rollout
Once the above resources have been applied and have stabilized, update the rollout's image to trigger a new canary release:
```bash
# Trigger a rollout and start calling endpoints that return error responses
kubectl argo rollouts set image myapp myapp=ghcr.io/mccutchen/go-httpbin:2.17.1 -n apps

# Monitor progress
kubectl argo rollouts get rollout myapp -n apps --watch
```
If the success conditions are not met, the rollout will automatically roll back to the previous version.
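For the analysis to have anything to measure, the canary needs to receive failing traffic from the `{{args.from}}` workload while the rollout is in progress. A sketch, assuming a meshed `dependant-api-service` deployment exists and has `curl` available (add `-c <container>` if needed); go-httpbin's `/status/500` endpoint always responds with HTTP 500:

```bash
# A share of these requests is routed to the canary; Linkerd records them
# as failures against myapp-canary, which trips the analysis and aborts
# the rollout.
kubectl -n apps exec deploy/dependant-api-service -- \
  sh -c 'while true; do curl -s -o /dev/null http://myapp/status/500; sleep 1; done'
```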
## Conclusion
By combining Argo Rollouts, Linkerd (or any other service mesh), and Prometheus/Mimir, you can automate safe, metrics-driven canary releases. The rollout progresses only if key performance indicators (like success rate) remain healthy, reducing the risk of bad deployments.
This approach ensures that:
- ✅ There's no need to panic when your colleagues decide to deploy on a Friday afternoon.
- ✅ Releases are gradual and automated.
- ✅ Rollbacks happen before a large percentage of users is impacted.
- ✅ Decisions are data-driven, not guesswork.
## Demo
See a full example of all of this in action in this repository: https://github.com/mirrajabi/argo-rollouts-linkerd-prometheus