@gmierz gmierz commented Oct 9, 2025

This set of patches adds a new alert management system for telemetry alerts. The commits below split the system into logical chunks, with newer commits building on previous ones.

Some generic base and utility classes are added directly to the auto_perf_sheriffing folder. These are not specific to telemetry alerting and could be used in other performance sheriffing automation.

The concrete classes for telemetry alert management are found in the treeherder/perf/auto_perf_sheriffing folder. These are then integrated into the telemetry detection code in Sherlock through the TelemetryAlertManager and run from TelemetryAlertManager.manage_alerts.

The manage_alerts method is defined generically in the AlertManager class and runs in phases. It starts by updating the DB with any changes made to telemetry bugs in Bugzilla; at the moment, only their resolutions are synced. Next, bugs are filed for alerts on any probe that requests one (by setting the monitor.alert field to True in its probe definition). Once bugs are filed, these bugs and any existing bugs are modified as needed. Currently, this only updates the see_also field to link together all bugs filed for the same detection range, in other words, all bugs that belong to the same PerformanceTelemetryAlertSummary. At the end of this "bug handling" phase, emails are sent for any alerts that request them (an alert produces either a bug or an email, never both, to reduce spam). Finally, either the bug modifications or the emails can fail. A "housekeeping" stage handles this by retrying the failed alerts on a daily basis.
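As a rough illustration of the phase ordering described above, here is a minimal, self-contained sketch. It is not the code from this PR: the `Alert` dataclass, the `InMemoryAlertManager` subclass, and every method name other than `manage_alerts` are hypothetical stand-ins, and the real classes talk to the DB and Bugzilla rather than to in-memory dictionaries.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical stand-in for a telemetry alert (not the real model)."""
    probe: str
    summary_id: int   # stands in for the PerformanceTelemetryAlertSummary
    file_bug: bool    # mirrors monitor.alert being True in the probe definition
    send_email: bool  # the alert asks for an email instead of a bug
    bug_id: Optional[int] = None


class AlertManager(ABC):
    """Generic skeleton; only the order of the phases follows the description."""

    def manage_alerts(self, alerts):
        self.update_alerts(alerts)            # sync bug resolutions into the DB
        bug_alerts = [a for a in alerts if a.file_bug]
        self.file_bugs(bug_alerts)            # file bugs where requested
        self.modify_bugs(bug_alerts)          # e.g. link bugs via see_also
        # An alert gets either a bug or an email, never both.
        self.email_alerts([a for a in alerts if a.send_email and not a.file_bug])
        self.house_keeping(alerts)            # retry earlier failures

    @abstractmethod
    def update_alerts(self, alerts): ...

    @abstractmethod
    def file_bugs(self, alerts): ...

    @abstractmethod
    def modify_bugs(self, alerts): ...

    @abstractmethod
    def email_alerts(self, alerts): ...

    @abstractmethod
    def house_keeping(self, alerts): ...


class InMemoryAlertManager(AlertManager):
    """Toy implementation that records what each phase would do."""

    def __init__(self):
        self.next_bug_id = 1000
        self.see_also = {}
        self.emails = []
        self.log = []

    def update_alerts(self, alerts):
        self.log.append("update_alerts")

    def file_bugs(self, alerts):
        self.log.append("file_bugs")
        for alert in alerts:
            if alert.bug_id is None:
                alert.bug_id = self.next_bug_id
                self.next_bug_id += 1

    def modify_bugs(self, alerts):
        self.log.append("modify_bugs")
        # Group bugs by summary so bugs from one detection range reference each other.
        by_summary = defaultdict(list)
        for alert in alerts:
            by_summary[alert.summary_id].append(alert.bug_id)
        for bug_ids in by_summary.values():
            for bug_id in bug_ids:
                self.see_also[bug_id] = [b for b in bug_ids if b != bug_id]

    def email_alerts(self, alerts):
        self.log.append("email_alerts")
        self.emails.extend(alert.probe for alert in alerts)

    def house_keeping(self, alerts):
        self.log.append("house_keeping")
```

Calling `InMemoryAlertManager().manage_alerts(alerts)` on a list of `Alert` objects runs the phases in order. Keeping the ordering in the base class and the side effects in the subclass is what would let other sheriffing automation reuse the same flow.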

For treeherder-admins, the relevant changes are in the first commit, which adds a new env field to capture the BUG_COMMENTER_API_KEY when it is set locally. This is needed for testing the bug modification part of the management system.
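For local testing, the idea of picking up the key from the environment can be sketched as follows. This is only an assumption about the shape of the change: it uses plain `os.environ` rather than whatever settings mechanism the first commit actually uses, and the helper names `get_bug_commenter_api_key` and `can_modify_bugs` are hypothetical.

```python
import os


def get_bug_commenter_api_key(environ=os.environ):
    """Return the locally set key, or None when it is unset or empty."""
    key = environ.get("BUG_COMMENTER_API_KEY")
    return key or None


def can_modify_bugs(environ=os.environ):
    """Bug modifications can only be exercised locally when a key is available."""
    return get_bug_commenter_api_key(environ) is not None
```

Passing the environment as a parameter keeps the helpers easy to test without changing the real process environment.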

@gmierz gmierz force-pushed the telemetry-alert-manager-comp branch from 44fab00 to 0e26ca1 Compare October 9, 2025 12:51
@gmierz gmierz requested a review from Andrej1198 October 9, 2025 15:43
gmierz commented Oct 17, 2025

Here's a sample bug filed by this system: https://bugzilla.mozilla.org/show_bug.cgi?id=1993145

@gmierz gmierz force-pushed the telemetry-alert-manager-comp branch from 94d89a2 to 4f7ecc0 Compare October 24, 2025 16:08
@gmierz gmierz requested a review from Andrej1198 October 24, 2025 16:10
@Andrej1198 Andrej1198 left a comment

Thanks for addressing my concerns, LGTM

@gmierz gmierz force-pushed the telemetry-alert-manager-comp branch from 4f7ecc0 to 6f2e463 Compare October 29, 2025 14:54
gmierz commented Oct 29, 2025

Another DB migration landed before this one, so I had to regenerate the migrations in this PR.

@gmierz gmierz merged commit 5bcfc9c into mozilla:master Oct 29, 2025
6 checks passed
misspran pushed a commit to misspran/treeherder that referenced this pull request Nov 3, 2025
…ozilla#9015)
