poc: integrate KeptnMetrics into Flagger analysis #3371

Draft
wants to merge 3 commits into
base: main
32 changes: 32 additions & 0 deletions examples/flagger/README.md
@@ -0,0 +1,32 @@
# PoC: Integration with Flagger

This example shows an integration of Keptn Metrics
into a Flagger Canary.
It uses the Prometheus endpoint provided
by Keptn (i.e. the metrics-operator), which serves the values of all `KeptnMetrics`.

This way, we are able to use a Flagger `MetricTemplate` of type `prometheus`,
Suggested change
This way, we are able to use a Flagger `MetricTemplate` of type `prometheus`,
This enables us to use a Flagger `MetricTemplate` of type `prometheus`,

which retrieves the value from a Prometheus instance that has access to the `KeptnMetrics`.
Suggested change
which retrieves the value from a Prometheus instance that has access to the `KeptnMetrics`.
which retrieves the value from a Prometheus instance that has access to the `KeptnMetrics` resource.
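
A minimal sketch of what such a `MetricTemplate` of type `prometheus` might look like. The Prometheus address and the series name in the query are assumptions for illustration: the address depends on where Prometheus runs in the cluster, and the series name should be checked against the `/metrics` output of the metrics-operator.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: response-time
  namespace: keptn-system
spec:
  provider:
    type: prometheus
    # Assumption: a Prometheus instance that scrapes the Keptn metrics-operator endpoint
    address: http://prometheus.monitoring:9090
  # Assumption: placeholder series name for the `response-time` KeptnMetric
  query: |
    response_time
```

The Canary then refers to this template through `templateRef` in its `analysis.metrics` list.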


The example is based on the [Istio Canary Deployments tutorial](https://docs.flagger.app/tutorials/istio-progressive-delivery)
provided in the Flagger docs.

The difference to the tutorial is that instead of using the `request-duration` duration
Suggested change
The difference to the tutorial is that instead of using the `request-duration` duration
The difference from the tutorial is that, instead of using the `request-duration` duration

provided by Istio via Prometheus, we are referring to a `KeptnMetric` called `response-time`.
In this case, the Flagger metrics provider is still `prometheus`.

An interesting idea would be to contribute to Flagger by adding
a `keptn` metrics provider to their [provider implementations](https://github.com/fluxcd/flagger/tree/main/pkg/metrics/providers).
This would also make it possible to use Keptn `Analyses` in Flagger, which might be a
valuable addition that benefits both projects.

In terms of observability, we do get the OpenTelemetry traces generated by Keptn out of the box
if the relevant annotations are present in the deployment managed by Flagger.
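
As a rough sketch of what such annotations might look like: Keptn identifies a workload either through its own `keptn.sh/*` annotations or through the recommended Kubernetes labels. The fragment below shows pod template metadata for the `podinfo` deployment; the version value is an assumption taken from the image tag used in this example.

```yaml
# Fragment of a Deployment spec, not a complete manifest.
template:
  metadata:
    annotations:
      # Assumption: values chosen for illustration
      keptn.sh/workload: podinfo
      keptn.sh/version: "6.0.0"
    labels:
      app: podinfo
```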

The addition of pre-/post-deployment tasks using Keptn is also possible.
However, Flagger provides a similar concept via [Webhooks](https://docs.flagger.app/usage/webhooks),
which are naturally more tailored to Flagger: they also allow intermediate checks after the
pods for the canary deployment have been started, e.g. to decide whether more traffic should be sent to the canary.
Keptn does not provide this, as it operates on the pre- and post-deployment phases of the deployment
and is not aware of Flagger's canary increments.
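
For comparison, a Keptn pre-deployment task is attached to the workload through an annotation that names a `KeptnTaskDefinition`. The sketch below is illustrative only: the task name, image, and command are hypothetical, and the `lifecycle.keptn.sh/v1beta1` API version is an assumption that should be checked against the installed Keptn release.

```yaml
# Hypothetical task definition; name, image and command are placeholders.
apiVersion: lifecycle.keptn.sh/v1beta1
kind: KeptnTaskDefinition
metadata:
  name: pre-deploy-check
  namespace: test
spec:
  container:
    name: check
    image: busybox:1.36
    command: ["sh", "-c", "echo running pre-deployment check"]
---
# Fragment of the Deployment's pod template that references the task.
template:
  metadata:
    annotations:
      keptn.sh/pre-deployment-tasks: pre-deploy-check
```

Such a task runs once before the pods are scheduled, whereas a Flagger webhook can run at every analysis step of the canary.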

29 changes: 29 additions & 0 deletions examples/flagger/assets/analysisdefinition.yaml
@@ -0,0 +1,29 @@
apiVersion: metrics.keptn.sh/v1beta1
kind: AnalysisDefinition
metadata:
  name: response-time-analysis
  namespace: simple-go
spec:
  objectives:
    - analysisValueTemplateRef:
        name: response-time-p95
      keyObjective: false
      target:
        failure:
          greaterThan:
            fixedValue: 30M
      weight: 1
  totalScore:
    passPercentage: 100
    warningPercentage: 75
---
apiVersion: metrics.keptn.sh/v1beta1
kind: AnalysisValueTemplate
metadata:
  name: response-time-p95
  namespace: simple-go
spec:
  provider:
    name: my-provider
  query: histogram_quantile(0.95, sum by(le) (rate(http_server_request_latency_seconds_bucket{job='{{.workload}}'}[1m])))
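
To evaluate this definition, Keptn needs an `Analysis` resource that references it. A minimal sketch, assuming the `v1beta1` API; the resource name is hypothetical, and the timeframe and the `workload` argument mirror the values used in the `MetricTemplate` query of this PoC:

```yaml
apiVersion: metrics.keptn.sh/v1beta1
kind: Analysis
metadata:
  # Hypothetical name, chosen for illustration
  name: response-time-analysis-run
  namespace: simple-go
spec:
  timeframe:
    # Evaluate the most recent minute
    recent: 1m
  args:
    # Substituted for {{.workload}} in the AnalysisValueTemplate query
    workload: simple-go-service
  analysisDefinition:
    name: response-time-analysis
    namespace: simple-go
```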

69 changes: 69 additions & 0 deletions examples/flagger/assets/canary.yaml
@@ -0,0 +1,69 @@
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # service port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
      - name: response-time
        templateRef:
          name: response-time
          namespace: keptn-system
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          min: 1.0
        interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
87 changes: 87 additions & 0 deletions examples/flagger/assets/deployment.yaml
@@ -0,0 +1,87 @@
apiVersion: v1
kind: Namespace
metadata:
  name: test
  annotations:
    keptn.sh/lifecycle-toolkit: enabled
  labels:
    istio-injection: enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
  namespace: test
  labels:
    app: podinfo
spec:
  minReadySeconds: 5
  revisionHistoryLimit: 5
  progressDeadlineSeconds: 60
  strategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9797"
      labels:
        app: podinfo
        app.kubernetes.io/name: podinfo
    spec:
      containers:
        - name: podinfod
          image: ghcr.io/stefanprodan/podinfo:6.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 9898
              protocol: TCP
            - name: http-metrics
              containerPort: 9797
              protocol: TCP
            - name: grpc
              containerPort: 9999
              protocol: TCP
          command:
            - ./podinfo
            - --port=9898
            - --port-metrics=9797
            - --grpc-port=9999
            - --grpc-service-name=podinfo
            - --level=info
            - --random-delay=false
            - --random-error=false
          env:
            - name: PODINFO_UI_COLOR
              value: "#34577c"
          livenessProbe:
            exec:
              command:
                - podcli
                - check
                - http
                - localhost:9898/healthz
            initialDelaySeconds: 5
            timeoutSeconds: 5
          readinessProbe:
            exec:
              command:
                - podcli
                - check
                - http
                - localhost:9898/readyz
            initialDelaySeconds: 5
            timeoutSeconds: 5
          resources:
            limits:
              cpu: 2000m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 64Mi
21 changes: 21 additions & 0 deletions examples/flagger/assets/hpa.yaml
@@ -0,0 +1,21 @@
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
  namespace: test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # scale up if usage is above
          # 99% of the requested CPU (100m)
          averageUtilization: 99
15 changes: 15 additions & 0 deletions examples/flagger/assets/ingress-gateway.yaml
@@ -0,0 +1,15 @@
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: public-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
10 changes: 10 additions & 0 deletions examples/flagger/assets/keptnmetric.yaml
@@ -0,0 +1,10 @@
apiVersion: metrics.keptn.sh/v1beta1
kind: KeptnMetric
metadata:
  name: response-time
  namespace: keptn-system
spec:
  provider:
    name: my-prometheus-provider
  query: "histogram_quantile(0.8, sum by(le) (rate(http_server_request_latency_seconds_bucket{status_code='200', job='simple-go-backend'}[5m])))"
  fetchIntervalSeconds: 10
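
The `KeptnMetric` above refers to a provider named `my-prometheus-provider`, which is not part of this change set. A minimal sketch of a matching `KeptnMetricsProvider`, assuming a Prometheus server reachable inside the cluster; the `targetServer` URL is a placeholder and must be adapted to the actual installation:

```yaml
apiVersion: metrics.keptn.sh/v1beta1
kind: KeptnMetricsProvider
metadata:
  name: my-prometheus-provider
  namespace: keptn-system
spec:
  type: prometheus
  # Assumption: placeholder address of the Prometheus server
  targetServer: "http://prometheus.monitoring.svc.cluster.local:9090"
```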
11 changes: 11 additions & 0 deletions examples/flagger/assets/metric-template.yaml
@@ -0,0 +1,11 @@
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: response-time
  namespace: keptn-system
spec:
  provider:
    type: keptn
    address: ""
  query: |
    analysis/simple-go/my-analysis-definition/1m/workload=simple-go-service