monitor failing jobs #2740

majamassarini · 2025-03-05T09:20:45Z

A failed job is not always marked as a failed Celery task.

To detect such situations automatically, we should create new variables (successful builds/tests, failed builds/tests), then collect and send them to the pushgateway (as we do for the queued and started builds/tests). We should then raise an alert when the share of failures stays near 100% over a broad time frame (10 minutes?). Or something similar.
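A minimal sketch of the alert condition described above, assuming job outcomes are recorded as (timestamp, succeeded) pairs in time order. The `FailureWindow` class, the 10-minute window, the 0.95 threshold, and the minimum job count are all hypothetical choices for illustration, not existing Packit code; in practice the check would more likely live in an alerting rule evaluated over the pushed metrics.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureWindow:
    """Track job outcomes in a sliding time window (hypothetical helper)."""

    def __init__(self, window=timedelta(minutes=10), threshold=0.95, min_jobs=5):
        self.window = window
        self.threshold = threshold  # assumed meaning of "near 100%"
        self.min_jobs = min_jobs    # avoid alerting on one or two failures
        self._outcomes = deque()    # (finished_at, succeeded) pairs, oldest first

    def record(self, finished_at, succeeded):
        # Outcomes are expected to arrive in chronological order.
        self._outcomes.append((finished_at, succeeded))

    def failure_ratio(self, now):
        # Drop outcomes that have fallen out of the window.
        while self._outcomes and now - self._outcomes[0][0] > self.window:
            self._outcomes.popleft()
        if not self._outcomes:
            return 0.0
        failed = sum(1 for _, ok in self._outcomes if not ok)
        return failed / len(self._outcomes)

    def should_alert(self, now):
        # Alert only when enough jobs finished and nearly all of them failed.
        ratio = self.failure_ratio(now)
        return len(self._outcomes) >= self.min_jobs and ratio >= self.threshold
```

The minimum job count is there so that one or two failures in a quiet period do not trigger an alert on their own.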

lbarcziova · 2025-03-05T13:25:43Z

> A failed job is not always marked as a failed Celery task.

I think this is correct behaviour, because the task successfully finishes.

> To detect such situations automatically, we should create new variables (successful builds/tests, failed builds/tests), then collect and send them to the pushgateway (as we do for the queued and started builds/tests).

We could maybe just adjust the existing metrics, e.g. `copr_builds_finished`/`test_runs_finished`, to have status labels.
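To illustrate what a status label would add, here is a small sketch using only the standard library. The status values (`success`, `failure`) and the `render_metric` helper are assumptions for illustration; the real service would presumably set the label through its Prometheus client rather than formatting lines by hand.

```python
from collections import Counter


def render_metric(name, counts_by_status):
    """Format per-status counts as Prometheus text-exposition sample lines."""
    return [
        f'{name}{{status="{status}"}} {count}'
        for status, count in sorted(counts_by_status.items())
    ]


# Hypothetical outcomes of finished Copr builds.
finished = Counter(["success", "failure", "failure", "failure"])
lines = render_metric("copr_builds_finished", finished)
# lines == ['copr_builds_finished{status="failure"} 3',
#           'copr_builds_finished{status="success"} 1']
```

With one series per status, a failure ratio can be computed on the monitoring side from the same metric rather than from separate variables.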

usercont-release-bot added this to Packit Kanban Board Mar 5, 2025

github-project-automation bot moved this to new in Packit Kanban Board Mar 5, 2025

nforro added the labels complexity/single-task (Regular task, should be done within days.), area/general (Related to whole service, not a specific part/integration.), and kind/internal (Doesn't affect users directly, may be e.g. infrastructure, DB related.) on Mar 6, 2025

nforro moved this from new to backlog in Packit Kanban Board Mar 6, 2025
