
Alerta No Send Resolve #4251

Open

vinfry opened this issue Feb 12, 2025 · 21 comments

@vinfry

vinfry commented Feb 12, 2025

Good day,

I have started to face weird issues after upgrading Alertmanager to the latest version, 0.28.0.

I have a lot of alert rules and several different receivers; the one with the problem is alerta. The "resolve" message is not sent to Alerta, so the alert stays in Alerta until I delete it. I could not track down why this happens, and the strangest part is that it happens at random times.
Here is the receiver config in /etc/alertmanager/alertmanager.yml:

  - name: 'alerta'
    webhook_configs:
      - url: 'http://$alerta_host:8080/webhooks/prometheus?api-key=$KEY'
        send_resolved: true
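
For reference, this receiver is selected by a route block in the same file; a minimal sketch of that shape (receiver names and timing values here are placeholders, not my exact settings):

  route:
    receiver: 'default'
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: 'alerta'
        continue: true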

Debug logs from a case where sending succeeded for the same alert ID:

Feb 08 20:21:49 $host alertmanager[1325]: time=2025-02-08T18:21:49.027Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=receiver integration=email[1] $redacted attempts=1 duration=49.21519ms alerts=[$alertname[d560a81][active]]
Feb 08 20:21:49 $host alertmanager[1325]: time=2025-02-08T18:21:49.027Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=receiver integration=email[0] $redacted attempts=1 duration=49.184398ms alerts=[$alertname[d560a81][active]]
Feb 08 20:21:49 $host alertmanager[1325]: time=2025-02-08T18:21:49.051Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=alerta integration=webhook[0] $redacted attempts=1 duration=72.965561ms alerts=[$alertname[d560a81][active]]

some debug logs...

Feb 08 20:26:49 $host alertmanager[1325]: time=2025-02-08T18:26:49.006Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=receiver integration=email[0] $redacted attempts=1 duration=21.456558ms alerts=[$alertname[d560a81][resolved]]
Feb 08 20:26:49 $host alertmanager[1325]: time=2025-02-08T18:26:49.017Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=receiver integration=email[1] $redacted attempts=1 duration=32.341313ms alerts=[$alertname[d560a81][resolved]]
Feb 08 20:26:49 $host alertmanager[1325]: time=2025-02-08T18:26:49.160Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=alerta integration=webhook[0] $redacted attempts=1 duration=175.729277ms alerts=[$alertname[d560a81][resolved]]

In the next case, the "resolve" was never sent:

Feb 10 15:58:19 $host alertmanager[1325]: time=2025-02-10T13:58:19.005Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[0] $redacted attempts=1 duration=15.78611ms alerts=[$alertname[d560a81][active]]
Feb 10 15:58:19 $host alertmanager[1325]: time=2025-02-10T13:58:19.007Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[1] $redacted attempts=1 duration=17.254232ms alerts=[$alertname[d560a81][active]]
Feb 10 15:58:19 $host alertmanager[1325]: time=2025-02-10T13:58:19.038Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=alerta integration=webhook[0] $redacted attempts=1 duration=48.615119ms alerts=[$alertname[d560a81][active]]

some debug logs...

Feb 10 16:03:19 $host alertmanager[1325]: time=2025-02-10T14:03:19.009Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[0] $redacted attempts=1 duration=18.54338ms alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:19 $host alertmanager[1325]: time=2025-02-10T14:03:19.010Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[1] $redacted attempts=1 duration=20.019582ms alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:49 $host alertmanager[1325]: time=2025-02-10T14:03:49.098Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]

I have debugged a few cases and found nothing helpful: no errors, and nothing in TCP traces about rejected connections or similar.
In Alertmanager itself, those alerts disappear without any problems.
If you need any additional info, please let me know.

@grobinson-grafana
Contributor

It looks like you have a group_interval of 5s, is that correct? Based on the logs, I am guessing that the webhook took longer than 5 seconds and was canceled 5 seconds later. You should be able to find an error message like notify retry canceled or context deadline exceeded.

@vinfry
Author

vinfry commented Feb 13, 2025

It looks like you have a group_interval of 5s, is that correct? Based on the logs, I am guessing that the webhook took longer than 5 seconds and was canceled 5 seconds later. You should be able to find an error message like notify retry canceled or context deadline exceeded.

Here it is:
group_interval: 5m
and I see nothing in the debug logs like "notify retry canceled" or "context deadline exceeded".

@grobinson-grafana
Contributor

Apologies, 5s was a mistake, I meant 5m 😀.

To be able to help further I think we would need to see more logs for the alert d560a81 and for receiver=alerta.

@vinfry
Author

vinfry commented Feb 14, 2025

Feb 10 15:57:48 $host alertmanager[1325]: time=2025-02-10T13:57:48.988Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][active]
Feb 10 15:57:49 $host alertmanager[1325]: time=2025-02-10T13:57:49.018Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][active]
Feb 10 15:58:18 $host alertmanager[1325]: time=2025-02-10T13:58:18.989Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][active]]
Feb 10 15:58:18 $host alertmanager[1325]: time=2025-02-10T13:58:18.989Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][active]]
Feb 10 15:58:19 $host alertmanager[1325]: time=2025-02-10T13:58:19.005Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$email integration=email[0] aggrGroup="$alert attempts=1 duration=15.78611ms alerts=[$alertname[d560a81][active]]
Feb 10 15:58:19 $host alertmanager[1325]: time=2025-02-10T13:58:19.007Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$email integration=email[1] aggrGroup="$alert attempts=1 duration=17.254232ms alerts=[$alertname[d560a81][active]]
Feb 10 15:58:19 $host alertmanager[1325]: time=2025-02-10T13:58:19.038Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=nm-alerta integration=webhook[0] aggrGroup="$alert attempts=1 duration=48.615119ms alerts=[$alertname[d560a81][active]]
Feb 10 15:59:19 $host alertmanager[1325]: time=2025-02-10T13:59:19.117Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][active]
Feb 10 15:59:19 $host alertmanager[1325]: time=2025-02-10T13:59:19.141Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][active]
Feb 10 16:00:49 $host alertmanager[1325]: time=2025-02-10T14:00:49.196Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][active]
Feb 10 16:00:49 $host alertmanager[1325]: time=2025-02-10T14:00:49.196Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][active]
Feb 10 16:02:18 $host alertmanager[1325]: time=2025-02-10T14:02:18.932Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:02:18 $host alertmanager[1325]: time=2025-02-10T14:02:18.945Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:03:18 $host alertmanager[1325]: time=2025-02-10T14:03:18.990Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:18 $host alertmanager[1325]: time=2025-02-10T14:03:18.990Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:19 $host alertmanager[1325]: time=2025-02-10T14:03:19.009Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$email integration=email[0] aggrGroup="$alert attempts=1 duration=18.54338ms alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:19 $host alertmanager[1325]: time=2025-02-10T14:03:19.010Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$email integration=email[1] aggrGroup="$alert attempts=1 duration=20.019582ms alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:49 $host alertmanager[1325]: time=2025-02-10T14:03:49.098Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:03:49 $host alertmanager[1325]: time=2025-02-10T14:03:49.098Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:49 $host alertmanager[1325]: time=2025-02-10T14:03:49.098Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:03:49 $host alertmanager[1325]: time=2025-02-10T14:03:49.101Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:05:19 $host alertmanager[1325]: time=2025-02-10T14:05:19.125Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:05:19 $host alertmanager[1325]: time=2025-02-10T14:05:19.138Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:06:49 $host alertmanager[1325]: time=2025-02-10T14:06:49.145Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:06:49 $host alertmanager[1325]: time=2025-02-10T14:06:49.232Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:08:18 $host alertmanager[1325]: time=2025-02-10T14:08:18.939Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:08:18 $host alertmanager[1325]: time=2025-02-10T14:08:18.950Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:08:49 $host alertmanager[1325]: time=2025-02-10T14:08:49.099Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:08:49 $host alertmanager[1325]: time=2025-02-10T14:08:49.101Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:09:49 $host alertmanager[1325]: time=2025-02-10T14:09:49.033Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:09:49 $host alertmanager[1325]: time=2025-02-10T14:09:49.037Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:09:49 $host alertmanager[1325]: time=2025-02-10T14:09:49.037Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:09:49 $host alertmanager[1325]: time=2025-02-10T14:09:49.063Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:11:19 $host alertmanager[1325]: time=2025-02-10T14:11:19.027Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:11:19 $host alertmanager[1325]: time=2025-02-10T14:11:19.050Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:12:49 $host alertmanager[1325]: time=2025-02-10T14:12:49.225Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:12:49 $host alertmanager[1325]: time=2025-02-10T14:12:49.225Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:14:19 $host alertmanager[1325]: time=2025-02-10T14:14:19.145Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:14:19 $host alertmanager[1325]: time=2025-02-10T14:14:19.204Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:14:49 $host alertmanager[1325]: time=2025-02-10T14:14:49.037Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:14:49 $host alertmanager[1325]: time=2025-02-10T14:14:49.038Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:15:48 $host alertmanager[1325]: time=2025-02-10T14:15:48.961Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:15:48 $host alertmanager[1325]: time=2025-02-10T14:15:48.961Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:15:48 $host alertmanager[1325]: time=2025-02-10T14:15:48.961Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:15:48 $host alertmanager[1325]: time=2025-02-10T14:15:48.977Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:17:18 $host alertmanager[1325]: time=2025-02-10T14:17:18.954Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:17:18 $host alertmanager[1325]: time=2025-02-10T14:17:18.959Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alertname[d560a81][resolved]
Feb 10 16:20:48 $host alertmanager[1325]: time=2025-02-10T14:20:48.962Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]
Feb 10 16:20:48 $host alertmanager[1325]: time=2025-02-10T14:20:48.962Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][resolved]]

Full log for the d560a81 alert ID, extracted using grep.

@grobinson-grafana
Contributor

There are two groups here, but both have been redacted to $alert. I assume these are different routes? Can you make them unique so I can see which notification is for which group? Just $group1 and $group2 would be sufficient.

Feb 10 15:58:18 $host alertmanager[1325]: time=2025-02-10T13:58:18.989Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][active]]
Feb 10 15:58:18 $host alertmanager[1325]: time=2025-02-10T13:58:18.989Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup="$alert alerts=[$alertname[d560a81][active]]

@grobinson-grafana
Contributor

Here is a test I did, and this is what I would have expected to see if the webhook timed out:

time=2025-02-14T12:47:57.041Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fff2c2][active]]
time=2025-02-14T12:52:57.040Z level=ERROR source=dispatch.go:360 msg="Notify for alerts failed" component=dispatcher num_alerts=1 err="test1/webhook[0]: notify retry canceled after 2 attempts: context deadline exceeded"
time=2025-02-14T12:52:57.040Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fff2c2][active]]

It's odd that you did not have a similar error, as I would have expected either a Notify success or Notify for alerts failed message.

@vinfry
Author

vinfry commented Feb 17, 2025

I have just searched the Alertmanager log for 'level=ERROR' but found nothing.
systemd unit settings:

ExecStart=/usr/local/bin/alertmanager \
  --config.file="/etc/alertmanager/alertmanager.yml" \
  --storage.path="/var/lib/alertmanager" \
  --cluster.listen-address="0.0.0.0:9094" \
  --cluster.peer="$instance2:9094" \
  --log.level=debug

Note the cluster.peer setting; I checked that instance's logs as well.
Do I need to set something else to see level=ERROR messages in addition to "debug"?
Also, here is a newer example:

Feb 17 13:27:09 $host alertmanager[1325]: time=2025-02-17T11:27:09.769Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][active]
Feb 17 13:27:09 $host alertmanager[1325]: time=2025-02-17T11:27:09.777Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][active]
Feb 17 13:27:39 $host alertmanager[1325]: time=2025-02-17T11:27:39.770Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][active]]
Feb 17 13:27:39 $host alertmanager[1325]: time=2025-02-17T11:27:39.770Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][active]]
Feb 17 13:27:39 $host alertmanager[1325]: time=2025-02-17T11:27:39.795Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[1] $alert attempts=1 duration=24.61405ms alerts=[$alert_name[4c9d39e][active]]
Feb 17 13:27:39 $host alertmanager[1325]: time=2025-02-17T11:27:39.797Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[0] $alert attempts=1 duration=26.48321ms alerts=[$alert_name[4c9d39e][active]]
Feb 17 13:27:39 $host alertmanager[1325]: time=2025-02-17T11:27:39.799Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:27:39 $host alertmanager[1325]: time=2025-02-17T11:27:39.826Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=alerta integration=webhook[0] $alert attempts=1 duration=55.562792ms alerts=[$alert_name[4c9d39e][active]]
Feb 17 13:27:39 $host alertmanager[1325]: time=2025-02-17T11:27:39.835Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:29:09 $host alertmanager[1325]: time=2025-02-17T11:29:09.758Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:29:09 $host alertmanager[1325]: time=2025-02-17T11:29:09.758Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:30:39 $host alertmanager[1325]: time=2025-02-17T11:30:39.776Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:30:39 $host alertmanager[1325]: time=2025-02-17T11:30:39.778Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:32:09 $host alertmanager[1325]: time=2025-02-17T11:32:09.774Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:32:09 $host alertmanager[1325]: time=2025-02-17T11:32:09.793Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:32:39 $host alertmanager[1325]: time=2025-02-17T11:32:39.771Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:32:39 $host alertmanager[1325]: time=2025-02-17T11:32:39.771Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:32:39 $host alertmanager[1325]: time=2025-02-17T11:32:39.794Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[1] $alert attempts=1 duration=22.808079ms alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:32:39 $host alertmanager[1325]: time=2025-02-17T11:32:39.813Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=$receiver integration=email[0] $alert attempts=1 duration=41.195493ms alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:33:39 $host alertmanager[1325]: time=2025-02-17T11:33:39.834Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:33:39 $host alertmanager[1325]: time=2025-02-17T11:33:39.834Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:33:39 $host alertmanager[1325]: time=2025-02-17T11:33:39.834Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:33:39 $host alertmanager[1325]: time=2025-02-17T11:33:39.835Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:35:09 $host alertmanager[1325]: time=2025-02-17T11:35:09.754Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:35:09 $host alertmanager[1325]: time=2025-02-17T11:35:09.762Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:36:39 $host alertmanager[1325]: time=2025-02-17T11:36:39.771Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:36:39 $host alertmanager[1325]: time=2025-02-17T11:36:39.780Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:38:09 $host alertmanager[1325]: time=2025-02-17T11:38:09.749Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:38:09 $host alertmanager[1325]: time=2025-02-17T11:38:09.749Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:38:39 $host alertmanager[1325]: time=2025-02-17T11:38:39.835Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:38:39 $host alertmanager[1325]: time=2025-02-17T11:38:39.838Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:39:39 $host alertmanager[1325]: time=2025-02-17T11:39:39.810Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:39:39 $host alertmanager[1325]: time=2025-02-17T11:39:39.810Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:39:39 $host alertmanager[1325]: time=2025-02-17T11:39:39.810Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:39:39 $host alertmanager[1325]: time=2025-02-17T11:39:39.849Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:41:09 $host alertmanager[1325]: time=2025-02-17T11:41:09.820Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:41:09 $host alertmanager[1325]: time=2025-02-17T11:41:09.885Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:42:39 $host alertmanager[1325]: time=2025-02-17T11:42:39.776Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:42:39 $host alertmanager[1325]: time=2025-02-17T11:42:39.787Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=$alert_name[4c9d39e][resolved]
Feb 17 13:44:39 $host alertmanager[1325]: time=2025-02-17T11:44:39.816Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]
Feb 17 13:44:39 $host alertmanager[1325]: time=2025-02-17T11:44:39.816Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher $alert alerts=[$alert_name[4c9d39e][resolved]]

As for the receivers, I adjusted the redaction a bit :)
Just in case, here is the receiver config:

- name: "$receiver"
  email_configs:
  - to: '$email1'
    send_resolved: true
  - to: '$email2'
    send_resolved: true

@grobinson-grafana
Contributor

The obfuscation of information makes it very difficult to help. I asked if you could at least make the redacted Group Key unique so I can differentiate between the two groups, but you haven't done this, so I can't tell which group (I believe there are two) is for which msg=flushing, or which notifications belong to it.

Also, the receiver configuration you shared doesn't even have an entry for your Alerta webhook?
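
For reference, I would expect both receivers to appear together under receivers, roughly like this (assembled from the two snippets already shared in this thread, keeping your redactions):

  receivers:
    - name: "$receiver"
      email_configs:
        - to: '$email1'
          send_resolved: true
        - to: '$email2'
          send_resolved: true
    - name: 'alerta'
      webhook_configs:
        - url: 'http://$alerta_host:8080/webhooks/prometheus?api-key=$KEY'
          send_resolved: true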

@vinfry
Author

vinfry commented Feb 17, 2025

Sorry for that. The group is the same, though; the only difference is the email integration: this group sends notifications to two different addresses, integration=email[0] and integration=email[1].
I shared the Alerta webhook config in the first post; here it is again:

- name: 'alerta'
  webhook_configs:
    - url: 'http://$alerta_host:8080/webhooks/prometheus?api-key=$KEY'
      send_resolved: true

@grobinson-grafana
Contributor

I am wondering if the two routes somehow have the same Group Key, but I cannot tell from the logs since it has been redacted.

@vinfry
Author

vinfry commented Feb 18, 2025

here you go:

Feb 18 11:13:39 $host alertmanager[1325]: time=2025-02-18T09:13:39.784Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=mw integration=email[0] aggrGroup="{}/{email_mw=~\"^(?:true)$\"}:{alertname=\"HighResponseTime\", email_mw=\"true\", environment=\"dev\", instance=\"ms-app\", method=\"reg\", service=\"mw\", severity=\"warning\", source=\"promxy\"}" attempts=1 duration=38.282067ms alerts=[HighResponseTime[4c9d39e][active]]
Feb 18 11:13:39 $host alertmanager[1325]: time=2025-02-18T09:13:39.793Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=mw integration=email[1] aggrGroup="{}/{email_mw=~\"^(?:true)$\"}:{alertname=\"HighResponseTime\", email_mw=\"true\", environment=\"dev\", instance=\"ms-app\", method=\"reg\", service=\"mw\", severity=\"warning\", source=\"promxy\"}" attempts=1 duration=47.575877ms alerts=[HighResponseTime[4c9d39e][active]]
Feb 18 11:13:39 $host alertmanager[1325]: time=2025-02-18T09:13:39.812Z level=DEBUG source=notify.go:876 msg="Notify success" component=dispatcher receiver=alerta integration=webhook[0] aggrGroup="{}/{}:{alertname=\"HighResponseTime\", email_mw=\"true\", environment=\"dev\", instance=\"ms-app\", method=\"reg\", service=\"mw\", severity=\"warning\", source=\"promxy\"}" attempts=1 duration=65.9101ms alerts=[HighResponseTime[4c9d39e][active]]

I left as much info as I could.

@grobinson-grafana
Contributor

Thanks for this, I can see that the emails and webhook are for separate alert groups. Can you also confirm that this problem does not occur on Alertmanager 0.27 (you are running 0.28)?
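
From the two group keys above, my guess at the routing tree is roughly this (purely an illustration reconstructed from the keys, not your actual config):

  route:
    receiver: 'default'            # root receiver, placeholder name
    group_by: ['...']              # or an explicit label list; your keys show grouping on many labels
    routes:
      - receiver: 'alerta'         # no matchers -> group key {}/{}
        continue: true
      - receiver: 'mw'             # regex matcher -> group key {}/{email_mw=~"^(?:true)$"}
        matchers:
          - email_mw =~ "true"

Both routes see the same label set, so the two aggregation groups contain the same alert but are flushed and notified independently.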

@vinfry
Author

vinfry commented Feb 19, 2025

I can confirm that this issue does not affect alertmanager 0.23.0; I haven't tried 0.27 yet.
Do you think I should try 0.27?

@grobinson-grafana
Contributor

Yes please. 0.23 was released almost 4 years ago, and quite a lot will have changed since then. In the meantime, I might give you a patch to test to see if we can figure out what's happening here.

@vinfry
Author

vinfry commented Feb 20, 2025

I have downgraded Alertmanager to 0.27; the result is the same:

Feb 20 10:49:31 $host alertmanager[3948997]: ts=2025-02-20T08:49:31.600Z caller=notify.go:860 level=debug component=dispatcher receiver=$receiver integration=email[0] aggrGroup="{}/{email_mw=~\"^(?:true)$\"}:{alertname=\"$alert\", email_mw=\"true\", environment=\"$env\", instance=\"$instance\", severity=\"warning\", source=\"promxy\"}" alerts=[$alert[594f0b5][active]] msg="Notify success" attempts=1 duration=30.791608ms
Feb 20 10:49:33 $host alertmanager[3948997]: ts=2025-02-20T08:49:33.036Z caller=notify.go:860 level=debug component=dispatcher receiver=alerta integration=webhook[0] aggrGroup="{}/{}:{alertname=\"$alert\", email_mw=\"true\", environment=\"$env\", instance=\"$instance\", severity=\"warning\", source=\"promxy\"}" alerts=[$alert[594f0b5][active]] msg="Notify success" attempts=1 duration=1.466376604s
Feb 20 10:54:31 $host alertmanager[3948997]: ts=2025-02-20T08:54:31.593Z caller=notify.go:860 level=debug component=dispatcher receiver=$receiver integration=email[0] aggrGroup="{}/{email_mw=~\"^(?:true)$\"}:{alertname=\"$alert\", email_mw=\"true\", environment=\"$env\", instance=\"$instance\", severity=\"warning\", source=\"promxy\"}" alerts=[$alert[594f0b5][resolved]] msg="Notify success" attempts=1 duration=22.725544ms
alertmanager --version
alertmanager, version 0.27.0 (branch: HEAD, revision: 0aa3c2aad14cff039931923ab16b26b7481783b5)
  build user:       root@22cd11f671e9
  build date:       20240228-11:51:20
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

This is a different alert, and it has only two receivers.

@grobinson-grafana
Contributor

So whatever is happening must have been introduced a long time ago (0.27 was released February '24).

@grobinson-grafana
Contributor

Just to confirm, are you running a single Alertmanager instance, or are you running Alertmanager in high availability?

@grobinson-grafana
Contributor

grobinson-grafana commented Feb 20, 2025

Here is a patch. Do you think you can add it to the v0.28.0 tag, build Alertmanager, and repeat the tests? It should tell us what is happening to those resolved alerts.

diff --git a/notify/notify.go b/notify/notify.go
index 8fa85c0a..c8e0f68b 100644
--- a/notify/notify.go
+++ b/notify/notify.go
@@ -452,6 +452,8 @@ type RoutingStage map[string]Stage

 // Exec implements the Stage interface.
 func (rs RoutingStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing routing stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	receiver, ok := ReceiverName(ctx)
 	if !ok {
 		return ctx, nil, errors.New("receiver missing")
@@ -470,6 +472,8 @@ type MultiStage []Stage

 // Exec implements the Stage interface.
 func (ms MultiStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing multi stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	var err error
 	for _, s := range ms {
 		if len(alerts) == 0 {
@@ -490,6 +494,8 @@ type FanoutStage []Stage
 // Exec attempts to execute all stages concurrently and discards the results.
 // It returns its input alerts and a types.MultiError if one or more stages fail.
 func (fs FanoutStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing fanout stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	var (
 		wg sync.WaitGroup
 		me types.MultiError
@@ -551,6 +557,8 @@ func NewMuteStage(m types.Muter, metrics *Metrics) *MuteStage {

 // Exec implements the Stage interface.
 func (n *MuteStage) Exec(ctx context.Context, logger *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	logger.Debug("Executing mute stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	var (
 		filtered []*types.Alert
 		muted    []*types.Alert
@@ -596,7 +604,8 @@ func NewWaitStage(wait func() time.Duration) *WaitStage {
 }

 // Exec implements the Stage interface.
-func (ws *WaitStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+func (ws *WaitStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing wait stage", "alerts", fmt.Sprintf("%v", alerts))
 	select {
 	case <-time.After(ws.wait()):
 	case <-ctx.Done():
@@ -697,7 +706,8 @@ func (n *DedupStage) needsUpdate(entry *nflogpb.Entry, firing, resolved map[uint
 }

 // Exec implements the Stage interface.
-func (n *DedupStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+func (n *DedupStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing dedup stage", "alerts", fmt.Sprintf("%v", alerts))
 	gkey, ok := GroupKey(ctx)
 	if !ok {
 		return ctx, nil, errors.New("group key missing")
@@ -745,6 +755,7 @@ func (n *DedupStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.
 	if n.needsUpdate(entry, firingSet, resolvedSet, repeatInterval) {
 		return ctx, alerts, nil
 	}
+	l.Debug("Notifications will not be sent for alerts, no changes", "alerts", fmt.Sprintf("%v", alerts))
 	return ctx, nil, nil
 }

@@ -774,6 +785,8 @@ func NewRetryStage(i Integration, groupName string, metrics *Metrics) *RetryStag
 }

 func (r RetryStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing retry stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	r.metrics.numNotifications.WithLabelValues(r.labelValues...).Inc()
 	ctx, alerts, err := r.exec(ctx, l, alerts...)

@@ -914,6 +927,8 @@ func NewSetNotifiesStage(l NotificationLog, recv *nflogpb.Receiver) *SetNotifies

 // Exec implements the Stage interface.
 func (n SetNotifiesStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing set notifies stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	gkey, ok := GroupKey(ctx)
 	if !ok {
 		return ctx, nil, errors.New("group key missing")
@@ -953,6 +968,8 @@ func NewTimeMuteStage(muter types.TimeMuter, marker types.GroupMarker, metrics *
 // Exec implements the stage interface for TimeMuteStage.
 // TimeMuteStage is responsible for muting alerts whose route is not in an active time.
 func (tms TimeMuteStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing time mute stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	routeID, ok := RouteID(ctx)
 	if !ok {
 		return ctx, nil, errors.New("route ID missing")
@@ -1003,6 +1020,8 @@ func NewTimeActiveStage(muter types.TimeMuter, marker types.GroupMarker, metrics
 // Exec implements the stage interface for TimeActiveStage.
 // TimeActiveStage is responsible for muting alerts whose route is not in an active time.
 func (tas TimeActiveStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+	l.Debug("Executing time active stage", "alerts", fmt.Sprintf("%v", alerts))
+
 	routeID, ok := RouteID(ctx)
 	if !ok {
 		return ctx, nil, errors.New("route ID missing")

@vinfry
Author

vinfry commented Feb 24, 2025

Just to confirm, are you running a single Alertmanager instance, or are you running Alertmanager in high availability?

Yes, we run HA; the second server acts as a "standby", but I always check its logs as well.

Here is a patch. Do you think you can add it to the v0.28.0 tag, build Alertmanager, and repeat the tests? It should tell us what is happening to those resolved alerts.


Could you build a new tag with those changes included? Unfortunately, due to our company policy, we cannot build from source.

BTW, I have gone back to the old 0.23 tag in our environment and it works without any problems: I see no stuck alerts in Alerta, and all "resolve" messages are sent without any issues.

@grobinson-grafana
Contributor

I can't, I'm afraid; I'm not an official maintainer. I also tried reproducing the issue on 0.28 but was unsuccessful.

@grobinson-grafana
Contributor

If you cannot build from source, it might be worth stepping through each release (0.23, 0.24, etc.) until you find the release where this stops working. We can then look at the diff between those two releases and see if there is something wrong.
