Airflow allows users to define SLAs at DAG & task levels to track instances where processes are running longer than usual. However, making sense of the data is a challenge.
The `airflow-sla-miss-report` DAG consolidates the data from the metadata tables and provides meaningful insights to ensure SLAs are met when set.
The DAG utilizes three (3) timeframes (default: `short`: 1d, `medium`: 3d, `long`: 7d) to calculate the following KPIs:
The following details are broken down on a daily basis for the provided long timeframe (e.g. 7 days):
- SLA Miss %: percentage of tasks that missed their SLAs out of total task runs
- Top Violator (%): task that violated its SLA the most as a percentage of its total runs
- Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day
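As a rough illustration of the daily KPIs above, the following pandas sketch computes them from a small, made-up table of task runs. The column names (`task`, `run_date`, `duration_s`, `sla_s`) and the sample rows are assumptions for the example, not the DAG's actual schema or implementation.

```python
import pandas as pd

# Hypothetical task-run records; the real DAG derives these from the metadata tables.
runs = pd.DataFrame(
    {
        "task": ["extract", "extract", "load", "load", "load", "extract"],
        "run_date": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-01",
             "2024-01-01", "2024-01-02", "2024-01-02"]
        ),
        "duration_s": [50, 130, 40, 45, 90, 60],
        "sla_s": [120, 120, 60, 60, 60, 120],
    }
)

# A run misses its SLA when it takes longer than the SLA allows.
runs["sla_missed"] = runs["duration_s"] > runs["sla_s"]

# SLA Miss %: missed runs out of total runs, per day.
daily_miss_pct = runs.groupby("run_date")["sla_missed"].mean().mul(100)

# Miss count and miss percentage for every (day, task) pair.
per_task = (
    runs.groupby(["run_date", "task"])["sla_missed"]
    .agg(misses="sum", total="count")
    .reset_index()
)
per_task["miss_pct"] = per_task["misses"] / per_task["total"] * 100

# Top violators per day; on a tie, idxmax keeps the first task encountered.
top_absolute = per_task.loc[
    per_task.groupby("run_date")["misses"].idxmax()
].set_index("run_date")["task"]
top_percent = per_task.loc[
    per_task.groupby("run_date")["miss_pct"].idxmax()
].set_index("run_date")["task"]

print(daily_miss_pct)
print(top_absolute)
print(top_percent)
```

With this sample data, one of four runs misses on the first day and one of two on the second.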
The following details are broken down on an hourly basis for the provided short timeframe (e.g. 1 day):
- SLA Miss %: percentage of tasks that missed their SLAs out of total task runs
- Top Violator (%): task that violated its SLA the most as a percentage of its total runs
- Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day
- Longest Running Task: task that took the longest time to execute within the hour window
- Average Task Queue Time (s): average time tasks spent in the `queued` state; can be used to detect scheduling bottlenecks
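The two hourly-only KPIs (longest running task and average queue time) can be sketched along the same lines. This example assumes each task instance carries `queued_dttm`, `start_date` and `end_date` timestamps; the rows are invented, and treating queue time as `start_date - queued_dttm` is an assumption about how the metric is defined.

```python
import pandas as pd

# Hypothetical task-instance rows with queue, start and end timestamps.
tis = pd.DataFrame(
    {
        "task": ["extract", "load", "extract", "load"],
        "queued_dttm": pd.to_datetime(
            ["2024-01-01 10:00:00", "2024-01-01 10:05:00",
             "2024-01-01 11:00:00", "2024-01-01 11:10:00"]
        ),
        "start_date": pd.to_datetime(
            ["2024-01-01 10:00:10", "2024-01-01 10:05:30",
             "2024-01-01 11:00:05", "2024-01-01 11:10:15"]
        ),
        "end_date": pd.to_datetime(
            ["2024-01-01 10:02:10", "2024-01-01 10:06:00",
             "2024-01-01 11:00:35", "2024-01-01 11:12:15"]
        ),
    }
)

# Seconds spent waiting in the queue and seconds spent executing.
tis["queue_s"] = (tis["start_date"] - tis["queued_dttm"]).dt.total_seconds()
tis["duration_s"] = (tis["end_date"] - tis["start_date"]).dt.total_seconds()

# Bucket each task instance into the hour in which it started.
tis["hour"] = tis["start_date"].dt.floor("60min")

# Average Task Queue Time (s) per hour.
avg_queue_s = tis.groupby("hour")["queue_s"].mean()

# Longest Running Task per hour.
longest_task = tis.loc[
    tis.groupby("hour")["duration_s"].idxmax()
].set_index("hour")["task"]

print(avg_queue_s)
print(longest_task)
```

A rising average queue time with stable execution times would point at the scheduler or executor capacity rather than at the tasks themselves.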
The following details are broken down on a task level for all timeframes:
- Current SLA (s): current defined SLA for the task
- Short, Medium, Long Timeframe SLA miss % (avg execution time): % of tasks that missed their SLAs & their avg execution times over the respective timeframes
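The task-level breakdown across the three timeframes can be sketched as below. The helper `timeframe_summary`, the `TIMEFRAMES` mapping and the column names are hypothetical and only illustrate the idea of filtering the same run history by three look-back windows; the DAG's own code may be organized differently.

```python
import pandas as pd

# Look-back windows in days, using the documented defaults.
TIMEFRAMES = {"short": 1, "medium": 3, "long": 7}


def timeframe_summary(runs, now):
    """Return per-task SLA miss % and average duration for each timeframe."""
    frames = []
    for name, days in TIMEFRAMES.items():
        # Keep only runs that finished inside this look-back window.
        recent = runs[runs["end_date"] > now - pd.Timedelta(days=days)]
        grouped = recent.groupby("task")
        frames.append(
            pd.DataFrame(
                {
                    f"{name}_miss_pct": grouped["sla_missed"].mean() * 100,
                    f"{name}_avg_s": grouped["duration_s"].mean(),
                }
            )
        )
    # One row per task, one pair of columns per timeframe.
    return pd.concat(frames, axis=1)


# Hypothetical run history for a single task with a 60-second SLA.
runs = pd.DataFrame(
    {
        "task": ["load", "load", "load"],
        "end_date": pd.to_datetime(
            ["2024-01-07 12:00", "2024-01-06 12:00", "2024-01-02 12:00"]
        ),
        "duration_s": [90.0, 30.0, 120.0],
        "sla_s": [60, 60, 60],
    }
)
runs["sla_missed"] = runs["duration_s"] > runs["sla_s"]

summary = timeframe_summary(runs, now=pd.Timestamp("2024-01-08 00:00"))
print(summary)
```

Comparing the short, medium and long columns side by side shows whether a task's SLA misses are a recent regression or a long-standing pattern.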
The process reads data from the Airflow metadata database and calculates SLA misses based on the SLAs defined at the DAG/task level. The following metadata tables are utilized:
- `SerializedDag`: retrieve defined DAG & task SLAs
- `DagRuns`: details about each DAG run
- `TaskInstances`: details about each task instance in a DAG run
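To make the role of the three tables concrete, here is a minimal sketch that joins stand-in DataFrames for them and flags SLA misses. The frames, their columns and their contents are assumptions for illustration; the real tables have many more columns, and the DAG may query and join them differently.

```python
import pandas as pd

# Stand-in for SLAs extracted from SerializedDag (one row per task with an SLA).
sla_defs = pd.DataFrame({"dag_id": ["etl"], "task_id": ["load"], "sla_s": [60]})

# Stand-in for DagRuns (one row per DAG run).
dag_runs = pd.DataFrame(
    {
        "dag_id": ["etl", "etl"],
        "run_id": ["r1", "r2"],
        "run_type": ["scheduled", "scheduled"],
    }
)

# Stand-in for TaskInstances (one row per task per DAG run).
task_instances = pd.DataFrame(
    {
        "dag_id": ["etl", "etl", "etl", "etl"],
        "task_id": ["extract", "load", "extract", "load"],
        "run_id": ["r1", "r1", "r2", "r2"],
        "duration": [20.0, 45.0, 25.0, 75.0],
    }
)

# Attach run details to each task instance, then keep only tasks with an SLA
# (the inner join drops "extract", which has no SLA defined).
joined = task_instances.merge(dag_runs, on=["dag_id", "run_id"]).merge(
    sla_defs, on=["dag_id", "task_id"]
)

# Flag the task instances that ran longer than their SLA.
joined["sla_missed"] = joined["duration"] > joined["sla_s"]
print(joined[["dag_id", "task_id", "run_id", "duration", "sla_s", "sla_missed"]])
```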
- Python: 3.7 and above
- Pip packages: `pandas`
- Airflow: v2.3 and above
- Airflow metadata tables: `DagRuns`, `TaskInstances`, `SerializedDag`
- SMTP details in `airflow.cfg` for sending emails
- Log in to the machine running Airflow
- Navigate to the `dags` directory
- Copy the `airflow-sla-miss-report.py` file to the `dags` directory. Here's a fast way:
wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/sla-miss-report/airflow-sla-miss-report.py
- Update the global variables in the DAG with the desired values:
  - `EMAIL_ADDRESSES` (optional): list of recipient email addresses to send the SLA report to
  - `SHORT_TIMEFRAME_IN_DAYS`: duration in days of the short timeframe to calculate SLA metrics (default: 1)
  - `MEDIUM_TIMEFRAME_IN_DAYS`: duration in days of the medium timeframe to calculate SLA metrics (default: 3)
  - `LONG_TIMEFRAME_IN_DAYS`: duration in days of the long timeframe to calculate SLA metrics (default: 7)
- Enable the DAG in the Airflow Webserver
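
For reference, the global variables from the configuration step might look like the sketch below. The variable names and defaults are the ones listed above; the email address is a placeholder, and the `window_start` helper is hypothetical, added only to show how a timeframe in days translates into a cutoff timestamp.

```python
from datetime import datetime, timedelta

# Recipients of the SLA report (optional); placeholder address for illustration.
EMAIL_ADDRESSES = ["data-team@example.com"]

# Timeframe lengths in days (documented defaults).
SHORT_TIMEFRAME_IN_DAYS = 1
MEDIUM_TIMEFRAME_IN_DAYS = 3
LONG_TIMEFRAME_IN_DAYS = 7


def window_start(now, days):
    """Hypothetical helper: earliest timestamp covered by a timeframe."""
    return now - timedelta(days=days)


# The timeframes are most useful when they nest: short <= medium <= long.
assert SHORT_TIMEFRAME_IN_DAYS <= MEDIUM_TIMEFRAME_IN_DAYS <= LONG_TIMEFRAME_IN_DAYS

now = datetime(2024, 1, 8)
long_start = window_start(now, LONG_TIMEFRAME_IN_DAYS)
print(long_start)  # 2024-01-01 00:00:00
```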