Alerta No Send Resolve #4251
It looks like you have a |
Here is: |
Apologies. To be able to help further, I think we would need to see more logs for the alert
Full log for id d560a81, extracted using grep:
There are two groups here, but both have been redacted to
Here is a test I did, and this is what I would have expected to see if the webhook timed out:
It's odd that you did not have a similar error, as I would have expected either a |
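(A webhook timeout of the kind being discussed can be provoked deliberately by pointing a test webhook_config at an endpoint that replies more slowly than the notification timeout. The sketch below is only an illustration under that assumption; the /webhook path, port 9095, and 30-second delay are arbitrary, and this is not the test that was run above.)

```go
// slowhook: a deliberately slow webhook endpoint, useful for provoking a
// notification timeout in a test Alertmanager. Port and delay are arbitrary.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body) // read the notification payload
		log.Printf("received %d bytes, stalling before replying", len(body))
		time.Sleep(30 * time.Second) // longer than the webhook notification timeout
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9095", nil))
}
```

With a test Alertmanager pointed at such an endpoint, the notification attempt should fail with a timeout-style error in the log, which is the sort of message one would expect to see if the Alerta webhook itself were timing out.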
I have just searched the Alertmanager log for 'level=ERROR' but found nothing.
As for the receivers, I made some modifications to the redaction :)
The obfuscation of information makes it very difficult to help. I asked if you could at least make the redacted Group Keys unique so I can differentiate between the two groups, but you haven't done this, so I can't tell which group (I believe there are two) is which. Also, the receiver configuration you shared doesn't even have an entry for your Alerta webhook?
Sorry for this. However, the group is the same; the only difference is the email, as this group has 2 different emails to send notifications to: name: 'alerta'
I am wondering if the two routes have the same Group Key somehow, but I cannot tell from the logs as it has been redacted.
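(For background: a notification group's key is derived from both the route and its group labels, so two different routes normally end up with different Group Keys even if they group on the same labels. The snippet below is only a rough illustration of that idea, not Alertmanager's actual key format.)

```go
// Illustrative only: why two routes with identical group labels still get
// different keys when the route differs.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey joins a route identifier with the sorted group labels into one key.
func groupKey(routeID string, groupLabels map[string]string) string {
	names := make([]string, 0, len(groupLabels))
	for name := range groupLabels {
		names = append(names, name)
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", name, groupLabels[name]))
	}
	return routeID + ":{" + strings.Join(pairs, ",") + "}"
}

func main() {
	labels := map[string]string{"alertname": "HostDown", "instance": "db01"}

	// Same group labels, different routes -> different keys.
	fmt.Println(groupKey(`{}/{receiver="alerta"}`, labels))
	fmt.Println(groupKey(`{}/{receiver="email"}`, labels))
}
```

Under that assumption, identical redacted keys in the logs would point to a single group rather than two separate routes.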
Here you go:
I left as much info as I could.
Thanks for this; I can see that the emails and webhook are for separate alert groups. Can you also confirm that this problem does not occur on Alertmanager 0.27 (you are running 0.28)?
I can confirm that this issue does not affect alertmanager-0.23.0; I haven't tried it on 0.27 yet.
Yes, please. 0.23 was released almost 4 years ago, and quite a lot will have changed since then. In the meantime, I might give you a patch to test to see if we can figure out what's happening here.
I have downgraded Alertmanager to 0.27; the result is the same:
This is a different alert and has only 2 receivers.
So whatever is happening must have been introduced a long time ago ( |
Just to confirm, are you running a single Alertmanager instance, or are you running Alertmanager in high availability?
Here is a patch. Do you think you can add it to the
```diff
diff --git a/notify/notify.go b/notify/notify.go
index 8fa85c0a..c8e0f68b 100644
--- a/notify/notify.go
+++ b/notify/notify.go
@@ -452,6 +452,8 @@ type RoutingStage map[string]Stage
// Exec implements the Stage interface.
func (rs RoutingStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing routing stage", "alerts", fmt.Sprintf("%v", alerts))
+
receiver, ok := ReceiverName(ctx)
if !ok {
return ctx, nil, errors.New("receiver missing")
@@ -470,6 +472,8 @@ type MultiStage []Stage
// Exec implements the Stage interface.
func (ms MultiStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing multi stage", "alerts", fmt.Sprintf("%v", alerts))
+
var err error
for _, s := range ms {
if len(alerts) == 0 {
@@ -490,6 +494,8 @@ type FanoutStage []Stage
// Exec attempts to execute all stages concurrently and discards the results.
// It returns its input alerts and a types.MultiError if one or more stages fail.
func (fs FanoutStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing fanout stage", "alerts", fmt.Sprintf("%v", alerts))
+
var (
wg sync.WaitGroup
me types.MultiError
@@ -551,6 +557,8 @@ func NewMuteStage(m types.Muter, metrics *Metrics) *MuteStage {
// Exec implements the Stage interface.
func (n *MuteStage) Exec(ctx context.Context, logger *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ logger.Debug("Executing mute stage", "alerts", fmt.Sprintf("%v", alerts))
+
var (
filtered []*types.Alert
muted []*types.Alert
@@ -596,7 +604,8 @@ func NewWaitStage(wait func() time.Duration) *WaitStage {
}
// Exec implements the Stage interface.
-func (ws *WaitStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+func (ws *WaitStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing wait stage", "alerts", fmt.Sprintf("%v", alerts))
select {
case <-time.After(ws.wait()):
case <-ctx.Done():
@@ -697,7 +706,8 @@ func (n *DedupStage) needsUpdate(entry *nflogpb.Entry, firing, resolved map[uint
}
// Exec implements the Stage interface.
-func (n *DedupStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+func (n *DedupStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing dedup stage", "alerts", fmt.Sprintf("%v", alerts))
gkey, ok := GroupKey(ctx)
if !ok {
return ctx, nil, errors.New("group key missing")
@@ -745,6 +755,7 @@ func (n *DedupStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.
if n.needsUpdate(entry, firingSet, resolvedSet, repeatInterval) {
return ctx, alerts, nil
}
+ l.Debug("Notifications will not be sent for alerts, no changes", "alerts", fmt.Sprintf("%v", alerts))
return ctx, nil, nil
}
@@ -774,6 +785,8 @@ func NewRetryStage(i Integration, groupName string, metrics *Metrics) *RetryStag
}
func (r RetryStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing retry stage", "alerts", fmt.Sprintf("%v", alerts))
+
r.metrics.numNotifications.WithLabelValues(r.labelValues...).Inc()
ctx, alerts, err := r.exec(ctx, l, alerts...)
@@ -914,6 +927,8 @@ func NewSetNotifiesStage(l NotificationLog, recv *nflogpb.Receiver) *SetNotifies
// Exec implements the Stage interface.
func (n SetNotifiesStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing set notifies stage", "alerts", fmt.Sprintf("%v", alerts))
+
gkey, ok := GroupKey(ctx)
if !ok {
return ctx, nil, errors.New("group key missing")
@@ -953,6 +968,8 @@ func NewTimeMuteStage(muter types.TimeMuter, marker types.GroupMarker, metrics *
// Exec implements the stage interface for TimeMuteStage.
// TimeMuteStage is responsible for muting alerts whose route is not in an active time.
func (tms TimeMuteStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing time mute stage", "alerts", fmt.Sprintf("%v", alerts))
+
routeID, ok := RouteID(ctx)
if !ok {
return ctx, nil, errors.New("route ID missing")
@@ -1003,6 +1020,8 @@ func NewTimeActiveStage(muter types.TimeMuter, marker types.GroupMarker, metrics
// Exec implements the stage interface for TimeActiveStage.
// TimeActiveStage is responsible for muting alerts whose route is not in an active time.
func (tas TimeActiveStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
+ l.Debug("Executing time active stage", "alerts", fmt.Sprintf("%v", alerts))
+
routeID, ok := RouteID(ctx)
if !ok {
return ctx, nil, errors.New("route ID missing")
```
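The patch above only adds Debug log lines at the start of each notification pipeline stage, plus one where DedupStage decides not to send, so it changes no behaviour; its purpose is to show at which stage the resolved alerts are dropped. For readers unfamiliar with that pipeline, here is a self-contained, simplified sketch of the same pattern; the Alert type, the dedup rule, and the stage set are stand-ins, not Alertmanager's real implementation:

```go
// Simplified sketch of a staged notification pipeline with the kind of
// debug logging the patch adds. Each stage may drop alerts; the log lines
// show where they disappear.
package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
)

// Alert is a simplified stand-in for types.Alert.
type Alert struct {
	Name     string
	Resolved bool
}

// Stage mirrors the shape of the notify.Stage interface the patch logs from.
type Stage interface {
	Exec(ctx context.Context, l *slog.Logger, alerts ...*Alert) (context.Context, []*Alert, error)
}

// MultiStage runs stages in sequence and stops once a stage drops all alerts.
type MultiStage []Stage

func (ms MultiStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*Alert) (context.Context, []*Alert, error) {
	l.Debug("Executing multi stage", "alerts", fmt.Sprintf("%v", alerts))
	var err error
	for _, s := range ms {
		if len(alerts) == 0 {
			return ctx, nil, nil
		}
		if ctx, alerts, err = s.Exec(ctx, l, alerts...); err != nil {
			return ctx, nil, err
		}
	}
	return ctx, alerts, nil
}

// dedupStage drops the whole batch when nothing changed since the last
// notification -- the decision the extra Debug line in the patch makes visible.
type dedupStage struct {
	lastResolved map[string]bool // last notified state per alert name
}

func (d *dedupStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*Alert) (context.Context, []*Alert, error) {
	l.Debug("Executing dedup stage", "alerts", fmt.Sprintf("%v", alerts))
	changed := false
	for _, a := range alerts {
		last, seen := d.lastResolved[a.Name]
		if !seen || last != a.Resolved {
			changed = true
			d.lastResolved[a.Name] = a.Resolved
		}
	}
	if !changed {
		l.Debug("Notifications will not be sent for alerts, no changes", "alerts", fmt.Sprintf("%v", alerts))
		return ctx, nil, nil
	}
	return ctx, alerts, nil
}

// sendStage stands in for the final retry/integration stage.
type sendStage struct{}

func (sendStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*Alert) (context.Context, []*Alert, error) {
	l.Debug("Executing retry stage", "alerts", fmt.Sprintf("%v", alerts))
	for _, a := range alerts {
		fmt.Printf("notify: %s resolved=%v\n", a.Name, a.Resolved)
	}
	return ctx, alerts, nil
}

func main() {
	l := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
	pipeline := MultiStage{&dedupStage{lastResolved: map[string]bool{}}, sendStage{}}

	firing := &Alert{Name: "HostDown"}
	resolved := &Alert{Name: "HostDown", Resolved: true}

	pipeline.Exec(context.Background(), l, firing)   // first firing notification is sent
	pipeline.Exec(context.Background(), l, firing)   // unchanged: dedup drops it, "no changes" is logged
	pipeline.Exec(context.Background(), l, resolved) // state changed: the resolve notification is sent
}
```

Running the sketch sends the firing and resolve notifications once each, while the unchanged second evaluation is dropped by the dedup stage with the same "no changes" debug message the patch adds.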
Yes, we have HA; the second server acts as "standby", but I always check its logs as well.
Could you build a new tag with those changes included? Unfortunately, due to our company policy, we cannot build from source. BTW, I have used the old 0.23 tag in our environment and it works without any problems: I see no stuck alerts in Alerta, and all "resolve" messages are being sent without any issues.
I can't, I'm afraid, as I'm not an official maintainer. I tried reproducing the issue on 0.28 but was unsuccessful too.
If you cannot build from source, it might be worth stepping through each release ( |
Good day,
I have started to face weird issues after upgrading Alertmanager to the latest version, 0.28.0.
I have a lot of alert rules and different receivers; the one with the problem is Alerta. The "resolve" message is not sent to Alerta, and the alert stays in Alerta until I delete it. I could not track down why this is happening, and the strangest thing is that it happens at random times.
Here is the receiver config in /etc/alertmanager/alertmanager.yml:
webhook_configs:
send_resolved: true
Debug logs of a successful send of the same alert id:
The next one failed to send "resolve":
I have debugged a few cases and found nothing that could help. No errors, nothing in TCP traces about rejected connections or similar.
As for Alertmanager itself, those alerts disappear without any problems.
If you need some additional info please let me know.