monitor failing jobs #2740
Labels
area/general
Related to whole service, not a specific part/integration.
complexity/single-task
Regular task, should be done within days.
kind/internal
Doesn't affect users directly, may be e.g. infrastructure, DB related.
A failed job is not always marked as a failed Celery task.
To detect such situations automatically, we should create new variables (successful builds/tests, failed builds/tests), collect them, and send them to the pushgateway (as we already do for the queued and started builds/tests). We should then raise an alert when the share of failures stays near 100% over a broad time frame (10 minutes?), or something similar.
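The alert condition could be sketched roughly as follows. This is only an illustration of the "near 100% failures over a time frame" check, not the service's actual code: the class `FailureRateMonitor`, the 0.95 threshold, and the minimum sample count are all assumptions, and only the 10-minute window comes from the proposal above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailureRateMonitor:
    """Hypothetical helper: tracks job outcomes in a sliding time window."""

    window_seconds: float = 600.0  # 10 minutes, as floated in the issue
    threshold: float = 0.95        # assumption: what "near 100%" means
    min_samples: int = 5           # assumption: do not alert on one or two jobs
    # (timestamp, failed) pairs; assumed to be recorded in time order
    _events: deque = field(default_factory=deque, init=False, repr=False)

    def record(self, timestamp: float, failed: bool) -> None:
        """Store the outcome of one finished build/test job."""
        self._events.append((timestamp, failed))

    def _prune(self, now: float) -> None:
        """Drop outcomes that are older than the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self, now: float) -> Optional[float]:
        """Return the share of failed jobs in the window, or None if empty."""
        self._prune(now)
        if not self._events:
            return None
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_alert(self, now: float) -> bool:
        """True when enough jobs ran and nearly all of them failed."""
        rate = self.failure_rate(now)
        return (
            rate is not None
            and len(self._events) >= self.min_samples
            and rate >= self.threshold
        )
```

In practice the same condition would more likely be written as an alerting rule over the counters sent to the pushgateway; the Python version only makes the intended logic explicit, including the guard against alerting when very few jobs have run.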