forked from keephq/keep
feat(api): alert evaluation engine (keephq#3138)
Showing 18 changed files with 1,721 additions and 118 deletions.
@@ -0,0 +1,67 @@
---
title: "VictoriaMetrics Multi Alert Example"
---

This example demonstrates a simple CPU usage multi-alert based on a metric:

```yaml
workflow:
  # Unique identifier for this workflow
  id: query-victoriametrics-multi
  # Display name shown in the UI
  name: victoriametrics-multi-alert-example
  # Brief description of what this workflow does
  description: victoriametrics
  triggers:
    # This workflow can be triggered manually from the UI
    - type: manual
  steps:
    # Query VictoriaMetrics for CPU metrics
    - name: victoriametrics-step
      provider:
        # Use the VictoriaMetrics provider configuration
        config: "{{ providers.vm }}"
        type: victoriametrics
        with:
          # Query that returns the sum of CPU usage for each job
          # Example response:
          # [
          #   {'metric': {'job': 'victoriametrics'}, 'value': [1737808021, '0.022633333333333307']},
          #   {'metric': {'job': 'vmagent'}, 'value': [1737808021, '0.009299999999999998']}
          # ]
          query: sum(rate(process_cpu_seconds_total)) by (job)
          queryType: query

  actions:
    # Create an alert in Keep based on the query results
    - name: create-alert
      provider:
        type: keep
        with:
          # Only create alert if CPU usage is above threshold
          if: "{{ value.1 }} > 0.01 "
          # Alert must persist for 1 minute
          for: 1m
          # Use job label to create unique fingerprint for each alert
          fingerprint_fields:
            - labels.job
          alert:
            # Alert name includes the specific job
            name: "High CPU Usage on {{ metric.job }}"
            description: "CPU usage is high on the VM (created from VM metric)"
            # Set severity based on CPU usage thresholds:
            # > 0.9 = critical
            # > 0.7 = warning
            # else = info
            severity: '{{ value.1 }} > 0.9 ? "critical" : {{ value.1 }} > 0.7 ? "warning" : "info"'
            labels:
              # Job label is required for alert fingerprinting
              job: "{{ metric.job }}"
              # Additional context labels
              environment: production
              app: myapp
              service: api
              team: devops
              owner: alice

```
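
To make the per-result behaviour concrete, here is a minimal Python sketch of how the `if` condition could be applied to each row of the example response quoted in the workflow comments. It is an illustration only, not Keep's implementation: `alerts_for` and the shape of the returned dictionaries are hypothetical names chosen for this sketch.

```python
# Hypothetical sketch: mirrors the workflow's `if` and `name` fields in plain Python.

# Rows copied from the "Example response" comment in the workflow.
ROWS = [
    {"metric": {"job": "victoriametrics"}, "value": [1737808021, "0.022633333333333307"]},
    {"metric": {"job": "vmagent"}, "value": [1737808021, "0.009299999999999998"]},
]

THRESHOLD = 0.01  # same number as in `if: "{{ value.1 }} > 0.01 "`


def alerts_for(rows, threshold=THRESHOLD):
    """Return one alert dict per row whose CPU value exceeds the threshold."""
    alerts = []
    for row in rows:
        # value.1 is the second element of `value`; the response carries it as a string.
        cpu = float(row["value"][1])
        if cpu > threshold:
            job = row["metric"]["job"]
            alerts.append(
                {
                    "name": f"High CPU Usage on {job}",
                    "labels": {"job": job},
                    "cpu": cpu,
                }
            )
    return alerts


if __name__ == "__main__":
    for alert in alerts_for(ROWS):
        print(alert["name"], alert["cpu"])
```

With the sample rows, only the `victoriametrics` job exceeds 0.01, so a single alert is produced; `vmagent` stays below the threshold.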
@@ -0,0 +1,53 @@
---
title: "VictoriaMetrics Single Alert Example"
---

This example demonstrates a simple CPU usage alert based on a metric:

```yaml
# This workflow queries VictoriaMetrics metrics and creates alerts based on CPU usage
workflow:
  # Unique identifier for this workflow
  id: query-victoriametrics
  # Display name shown in the UI
  name: victoriametrics-alert-example
  # Brief description of what this workflow does
  description: Monitors CPU usage metrics from VictoriaMetrics and creates alerts when thresholds are exceeded

  # Define how the workflow is triggered
  triggers:
    - type: manual # Can be triggered manually from the UI

  # Steps to execute in order
  steps:
    - name: victoriametrics-step
      provider:
        # Use VictoriaMetrics provider config defined in providers.vm
        config: "{{ providers.vm }}"
        type: victoriametrics
        with:
          # Query average CPU usage rate
          query: avg(rate(process_cpu_seconds_total))
          queryType: query

  # Actions to take based on the query results
  actions:
    - name: create-alert
      provider:
        type: keep
        with:
          # Create alert if CPU usage exceeds threshold
          if: "{{ value.1 }} > 0.0040"
          alert:
            name: "High CPU Usage"
            description: "[Single] CPU usage is high on the VM (created from VM metric)"
            # Set severity based on CPU usage thresholds
            severity: '{{ value.1 }} > 0.9 ? "critical" : {{ value.1 }} > 0.7 ? "warning" : "info"'
            # Alert labels for filtering and routing
            labels:
              environment: production
              app: myapp
              service: api
              team: devops
              owner: alice
```
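
The `severity` field in the workflow is a nested ternary. Read as ordinary code, it behaves like the following Python sketch; `severity_for` is a hypothetical helper written for this note, not a function from Keep.

```python
# Hypothetical helper: the nested ternary from the `severity` field, written as if/else.

def severity_for(cpu: float) -> str:
    """Map a CPU usage value to a severity using the thresholds from the workflow."""
    if cpu > 0.9:
        return "critical"
    if cpu > 0.7:
        return "warning"
    return "info"


if __name__ == "__main__":
    for sample in (0.95, 0.8, 0.0045):
        print(sample, severity_for(sample))
```

Because the `if` threshold is 0.0040, any value between 0.0040 and 0.7 still creates an alert, with `info` severity.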
@@ -0,0 +1,52 @@
---
title: "Overview"
---

The Keep Alert Evaluation Engine is a flexible system that enables you to create alerts based on any data source and define evaluation rules. Unlike traditional monitoring solutions that are tied to specific metrics, Keep's engine allows you to combine data from multiple sources and apply complex logic to determine when and how alerts should be triggered.

## Core Features

### Generic Data Source Support
- Query any data source (databases, APIs, metrics systems)
- Combine multiple data sources in a single alert rule
- Apply custom transformations to the data

### Flexible Alert Evaluation
- Define custom conditions using templated expressions
- Support for complex boolean logic and mathematical operations
- State management for alert transitions (pending->firing->resolved)
- Deduplication and alert instance tracking

### Customizable Alert Definition
- Full control over alert metadata (name, description, severity)
- Dynamic labels based on evaluation context
- Template support for all alert fields
- Custom fingerprinting for alert grouping

## Core Components

### Alert States
- **Pending**: Initial state when alert condition is met (relevant only if `for` supplied)
- **Firing**: Active alert that has met its duration condition
- **Resolved**: Alert that is no longer active
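
The three states can be read as a small state machine. The Python sketch below is one possible reading, written under stated assumptions rather than taken from Keep's code: `step` is a hypothetical function, a pending alert whose condition clears is simply dropped (returned as `None`), and keep-firing behaviour is not modelled.

```python
# Hypothetical sketch of pending -> firing -> resolved; not Keep's implementation.
from datetime import datetime, timedelta
from typing import Optional, Tuple


def step(
    status: Optional[str],
    pending_since: Optional[datetime],
    condition_met: bool,
    now: datetime,
    for_duration: Optional[timedelta] = None,
) -> Tuple[Optional[str], Optional[datetime]]:
    """Return the next (status, pending_since) for one evaluation of the rule."""
    if not condition_met:
        # Assumption: a firing alert resolves, a pending one is dropped.
        return ("resolved" if status == "firing" else None), None
    if status == "firing":
        return "firing", pending_since
    if for_duration is None:
        # Without `for`, the pending state is skipped entirely.
        return "firing", None
    # Start the clock on the first matching evaluation, keep it afterwards.
    started = pending_since if status == "pending" and pending_since else now
    if now - started >= for_duration:
        return "firing", started
    return "pending", started


if __name__ == "__main__":
    t0 = datetime(2025, 1, 25, 12, 0, 0)
    one_minute = timedelta(minutes=1)
    print(step(None, None, True, t0, one_minute))
    print(step("pending", t0, True, t0 + one_minute, one_minute))
    print(step("firing", t0, False, t0 + 2 * one_minute, one_minute))
```

With `for: 1m`, the first matching evaluation yields `pending`, a matching evaluation one minute later yields `firing`, and the first non-matching evaluation after that yields `resolved`.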

### Alert Rule Components
1. **Data Collection**: Query steps to gather data from any source
2. **Condition (`if`)**: Expression that determines when to create/update an alert
3. **Duration (`for`)**: Optional time period the condition must be true before firing
4. **Alert Definition**: Complete control over how the alert looks and behaves:
   - Name and description
   - Severity levels
   - Labels for routing
   - Custom fields and annotations

### State Management
- **Fingerprinting**: Unique identifier for alert deduplication and state tracking
- **Keep-Firing**: Control how long alerts remain active
- **State Transitions**: Rules for how alerts move between states
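
As a rough illustration of fingerprinting, the sketch below derives a stable identifier from a list of dotted field paths such as `labels.job`. Everything here is an assumption made for the example: the `fingerprint` and `get_path` helpers, the `field=value` joining format, and the use of SHA-256 are not taken from Keep.

```python
# Hypothetical sketch: derive a stable identifier from selected alert fields.
import hashlib
from typing import Any, Dict, List


def get_path(obj: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as "labels.job" through nested dicts."""
    current: Any = obj
    for part in dotted.split("."):
        current = current[part]
    return current


def fingerprint(alert: Dict[str, Any], fields: List[str]) -> str:
    """Hash only the chosen fields, so other fields can change freely."""
    parts = [f"{field}={get_path(alert, field)}" for field in fields]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    info = {"severity": "info", "labels": {"job": "vmagent"}}
    critical = {"severity": "critical", "labels": {"job": "vmagent"}}
    # Same job, different severity: the identifier is unchanged.
    print(fingerprint(info, ["labels.job"]) == fingerprint(critical, ["labels.job"]))
```

Hashing only the listed fields is what lets two evaluations of the same job map to the same alert instance even when severity or description differ.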

## Examples
The following examples demonstrate different ways to use the alert evaluation engine:

- [Single Metric Alert](/alertevaluation/examples/victoriametricssingle) - Basic example showing metrics-based alerting
- [Multiple Metrics Alert](/alertevaluation/examples/victoriametricsmulti) - Advanced example with multiple alert instances
@@ -1,39 +1,45 @@
+# This workflow queries VictoriaMetrics metrics and creates alerts based on CPU usage
 workflow:
+  # Unique identifier for this workflow
   id: query-victoriametrics
+  # Display name shown in the UI
   name: victoriametrics-alert-example
-  description: victoriametrics
+  # Brief description of what this workflow does
+  description: Monitors CPU usage metrics from VictoriaMetrics and creates alerts when thresholds are exceeded
+
+  # Define how the workflow is triggered
   triggers:
-    - type: manual
+    - type: manual # Can be triggered manually from the UI
+
+  # Steps to execute in order
   steps:
     - name: victoriametrics-step
       provider:
+        # Use VictoriaMetrics provider config defined in providers.vm
         config: "{{ providers.vm }}"
         type: victoriametrics
         with:
+          # Query average CPU usage rate
           query: avg(rate(process_cpu_seconds_total))
           queryType: query
 
+  # Actions to take based on the query results
   actions:
     - name: create-alert
-      # only create an alert if the CPU usage is greater than 0.005
-      if: "{{ steps.victoriametrics-step.results.data.result.0.value.1 }} > 0.001 "
       provider:
         type: keep
-        # create an alert with the following details
         with:
-          name: "High CPU Usage"
-          description: "CPU usage is high on the VM (created from VM metric)"
-          severity: '{{ steps.victoriametrics-step.results.data.result.0.value.1 }} > 0.9 ? "critical" : {{ steps.victoriametrics-step.results.data.result.0.value.1 }} > 0.7 ? "warning" : "info"'
-          labels:
-            environment: production
-            app: myapp
-            service: api
-            team: devops
-            owner: alice
-          # optional: customize the fingerprint based on these fields
-          fingerprint_fields:
-            - environment
-            - app
-            - service
-            - team
-            - owner
+          # Create alert if CPU usage exceeds threshold
+          if: "{{ value.1 }} > 0.0040"
+          alert:
+            name: "High CPU Usage"
+            description: "[Single] CPU usage is high on the VM (created from VM metric)"
+            # Set severity based on CPU usage thresholds
+            severity: '{{ value.1 }} > 0.9 ? "critical" : {{ value.1 }} > 0.7 ? "warning" : "info"'
+            # Alert labels for filtering and routing
+            labels:
+              environment: production
+              app: myapp
+              service: api
+              team: devops
+              owner: alice
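
The most visible change in this hunk is that expressions drop the full step path (`steps.victoriametrics-step.results.data.result.0.value.1`) in favour of the short `value.1`. The Python sketch below shows how both paths can point at the same number, assuming the short form is resolved against a single result row. The `resolve` helper and the sample value `"0.0045"` are made up for illustration and are not part of Keep.

```python
# Hypothetical sketch: dotted-path lookup over nested dicts and lists.
from typing import Any


def resolve(path: str, context: Any) -> Any:
    """Walk `path` through `context`, treating numeric parts as list indexes."""
    current = context
    for part in path.split("."):
        if isinstance(current, (list, tuple)):
            current = current[int(part)]
        else:
            current = current[part]
    return current


# One result row in the shape returned by the query step (sample value is made up).
ROW = {"metric": {}, "value": [1737808021, "0.0045"]}

# The same row nested under the step name, as the old expression addressed it.
FULL_CONTEXT = {
    "steps": {"victoriametrics-step": {"results": {"data": {"result": [ROW]}}}}
}

if __name__ == "__main__":
    old_path = "steps.victoriametrics-step.results.data.result.0.value.1"
    print(resolve(old_path, FULL_CONTEXT))
    print(resolve("value.1", ROW))
```

Both lookups return the same string, which is why the shorter expression can replace the longer one once evaluation happens per result.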
@@ -0,0 +1,58 @@
workflow:
  # Unique identifier for this workflow
  id: query-victoriametrics-multi
  # Display name shown in the UI
  name: victoriametrics-multi-alert-example
  # Brief description of what this workflow does
  description: victoriametrics
  triggers:
    # This workflow can be triggered manually from the UI
    - type: manual
  steps:
    # Query VictoriaMetrics for CPU metrics
    - name: victoriametrics-step
      provider:
        # Use the VictoriaMetrics provider configuration
        config: "{{ providers.vm }}"
        type: victoriametrics
        with:
          # Query that returns the sum of CPU usage for each job
          # Example response:
          # [
          #   {'metric': {'job': 'victoriametrics'}, 'value': [1737808021, '0.022633333333333307']},
          #   {'metric': {'job': 'vmagent'}, 'value': [1737808021, '0.009299999999999998']}
          # ]
          query: sum(rate(process_cpu_seconds_total)) by (job)
          queryType: query

  actions:
    # Create an alert in Keep based on the query results
    - name: create-alert
      provider:
        type: keep
        with:
          # Only create alert if CPU usage is above threshold
          if: "{{ value.1 }} > 0.01 "
          # Alert must persist for 1 minute
          for: 1m
          # Use job label to create unique fingerprint for each alert
          fingerprint_fields:
            - labels.job
          alert:
            # Alert name includes the specific job
            name: "High CPU Usage on {{ metric.job }}"
            description: "CPU usage is high on the VM (created from VM metric)"
            # Set severity based on CPU usage thresholds:
            # > 0.9 = critical
            # > 0.7 = warning
            # else = info
            severity: '{{ value.1 }} > 0.9 ? "critical" : {{ value.1 }} > 0.7 ? "warning" : "info"'
            labels:
              # Job label is required for alert fingerprinting
              job: "{{ metric.job }}"
              # Additional context labels
              environment: production
              app: myapp
              service: api
              team: devops
              owner: alice