Skip to content

Commit

Permalink
feat(alertmanager): integrate with ruler (#7222)
Browse files Browse the repository at this point in the history
### Summary

Integrate the new implementations of the alertmanager along with changes to the ruler. This change can be broadly categoried into 3 parts:

#### Frontend
- The earlier `/api/v1/alerts` api was double encoding the response in json and sending it to the frontend. This PR fixes the json response object. 

For instance, we have gone from the response `{
    "status": "success",
    "data": "{\"status\":\"success\",\"data\":[{\"labels\":{\"alertname\":\"[platform][consumer] consumer is above 100% memory utilization\",\"bu\":\"platform\",\"......
}` to the response `{"status":"success","data":[{"labels":{"alertname":"[Metrics] Pod CP......`

- `msteams` has been changed to `msteamsv2` wherever applicable

#### Ruler
The following changes have been done in the ruler component:

- Removal of the old alertmanager and notifier
- The RuleDB methods `Create`, `Edit` and `Delete` have been made transactional
- Introduction of a new `testPrepareNotifyFunc` for sending test notifications
- Integration with the new alertmanager

#### Alertmanager
Although a huge chunk of the alertmanagers have been merged in previous PRs (the list can be found at SigNoz/platform-pod#404), this PR takes care of changes needed in order to incorporate it with the ruler

- Addition of ruleId based matching
- Support for marshalling the global configuration directly from the upstream alertmanager
- Addition of orgId to the legacy alertmanager
- Support for always adding defaults to both routes and receivers while creating them
- Migration to create the required alertmanager tables
- Migration for msteams to msteamsv2 has been added. We will start using msteamv2 config for the new alertmanager and keep using msteams for the old one.

#### Related Issues / PR's

Closes SigNoz/platform-pod#404
Closes SigNoz/platform-pod#176
  • Loading branch information
grandwizard28 authored Mar 9, 2025
1 parent 8abba26 commit 1f33928
Show file tree
Hide file tree
Showing 56 changed files with 1,849 additions and 1,589 deletions.
58 changes: 53 additions & 5 deletions conf/example.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -70,26 +70,74 @@ sqlstore:
##################### APIServer #####################
apiserver:
timeout:
# Default request timeout.
default: 60s
# Maximum request timeout.
max: 600s
# List of routes to exclude from request timeout.
excluded_routes:
- /api/v1/logs/tail
- /api/v3/logs/livetail
logging:
# List of routes to exclude from request responselogging.
excluded_routes:
- /api/v1/health


##################### TelemetryStore #####################
telemetrystore:
# specifies the telemetrystore provider to use.
# Specifies the telemetrystore provider to use.
provider: clickhouse
clickhouse:
# The DSN to use for ClickHouse.
dsn: http://localhost:9000
# Maximum number of idle connections in the connection pool.
max_idle_conns: 50
# Maximum number of open connections to the database.
max_open_conns: 100
# Maximum time to wait for a connection to be established.
dial_timeout: 5s
dial_timeout: 5s
clickhouse:
# The DSN to use for ClickHouse.
dsn: http://localhost:9000

##################### Alertmanager #####################
alertmanager:
# Specifies the alertmanager provider to use.
provider: legacy
legacy:
# The API URL (with prefix) of the legacy Alertmanager instance.
api_url: http://localhost:9093/api
signoz:
# The poll interval for periodically syncing the alertmanager with the config in the store.
poll_interval: 1m
# The URL under which Alertmanager is externally reachable (for example, if Alertmanager is served via a reverse proxy). Used for generating relative and absolute links back to Alertmanager itself.
external_url: http://localhost:9093
# The global configuration for the alertmanager. All the exahustive fields can be found in the upstream: https://github.com/prometheus/alertmanager/blob/efa05feffd644ba4accb526e98a8c6545d26a783/config/config.go#L833
global:
# ResolveTimeout is the time after which an alert is declared resolved if it has not been updated.
resolve_timeout: 5m
route:
# GroupByStr is the list of labels to group alerts by.
group_by:
- alertname
# GroupInterval is the interval at which alerts are grouped.
group_interval: 1m
# GroupWait is the time to wait before sending alerts to receivers.
group_wait: 1m
# RepeatInterval is the interval at which alerts are repeated.
repeat_interval: 1h
alerts:
# Interval between garbage collection of alerts.
gc_interval: 30m
silences:
# Maximum number of silences, including expired silences. If negative or zero, no limit is set.
max: 0
# Maximum size of the silences in bytes. If negative or zero, no limit is set.
max_size_bytes: 0
# Interval between garbage collection and snapshotting of the silences. The snapshot will be stored in the state store.
maintenance_interval: 15m
# Retention of the silences.
retention: 120h
nflog:
# Interval between garbage collection and snapshotting of the notification logs. The snapshot will be stored in the state store.
maintenance_interval: 15m
# Retention of the notification logs.
retention: 120h
6 changes: 5 additions & 1 deletion ee/query-service/app/api/api.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ import (
"go.signoz.io/signoz/ee/query-service/interfaces"
"go.signoz.io/signoz/ee/query-service/license"
"go.signoz.io/signoz/ee/query-service/usage"
"go.signoz.io/signoz/pkg/alertmanager"
baseapp "go.signoz.io/signoz/pkg/query-service/app"
"go.signoz.io/signoz/pkg/query-service/app/cloudintegrations"
"go.signoz.io/signoz/pkg/query-service/app/integrations"
Expand All @@ -20,6 +21,7 @@ import (
basemodel "go.signoz.io/signoz/pkg/query-service/model"
rules "go.signoz.io/signoz/pkg/query-service/rules"
"go.signoz.io/signoz/pkg/query-service/version"
"go.signoz.io/signoz/pkg/signoz"
"go.signoz.io/signoz/pkg/types/authtypes"
)

Expand Down Expand Up @@ -51,7 +53,7 @@ type APIHandler struct {
}

// NewAPIHandler returns an APIHandler
func NewAPIHandler(opts APIHandlerOptions) (*APIHandler, error) {
func NewAPIHandler(opts APIHandlerOptions, signoz *signoz.SigNoz) (*APIHandler, error) {

baseHandler, err := baseapp.NewAPIHandler(baseapp.APIHandlerOpts{
Reader: opts.DataConnector,
Expand All @@ -67,6 +69,8 @@ func NewAPIHandler(opts APIHandlerOptions) (*APIHandler, error) {
FluxInterval: opts.FluxInterval,
UseLogsNewSchema: opts.UseLogsNewSchema,
UseTraceNewSchema: opts.UseTraceNewSchema,
AlertmanagerAPI: alertmanager.NewAPI(signoz.Alertmanager),
Signoz: signoz,
})

if err != nil {
Expand Down
2 changes: 1 addition & 1 deletion ee/query-service/app/api/auth.go
Original file line number Diff line number Diff line change
Expand Up @@ -134,7 +134,7 @@ func (ah *APIHandler) registerUser(w http.ResponseWriter, r *http.Request) {
return
}

_, registerError := baseauth.Register(ctx, req)
_, registerError := baseauth.Register(ctx, req, ah.Signoz.Alertmanager)
if !registerError.IsNil() {
RespondError(w, apierr, nil)
return
Expand Down
49 changes: 23 additions & 26 deletions ee/query-service/app/server.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,10 @@ import (
"go.signoz.io/signoz/ee/query-service/integrations/gateway"
"go.signoz.io/signoz/ee/query-service/interfaces"
"go.signoz.io/signoz/ee/query-service/rules"
"go.signoz.io/signoz/pkg/alertmanager"
"go.signoz.io/signoz/pkg/http/middleware"
"go.signoz.io/signoz/pkg/signoz"
"go.signoz.io/signoz/pkg/sqlstore"
"go.signoz.io/signoz/pkg/types"
"go.signoz.io/signoz/pkg/types/authtypes"
"go.signoz.io/signoz/pkg/web"
Expand All @@ -45,7 +47,6 @@ import (
"go.signoz.io/signoz/pkg/query-service/cache"
baseconst "go.signoz.io/signoz/pkg/query-service/constants"
"go.signoz.io/signoz/pkg/query-service/healthcheck"
basealm "go.signoz.io/signoz/pkg/query-service/integrations/alertManager"
baseint "go.signoz.io/signoz/pkg/query-service/interfaces"
basemodel "go.signoz.io/signoz/pkg/query-service/model"
pqle "go.signoz.io/signoz/pkg/query-service/pqlEngine"
Expand Down Expand Up @@ -176,8 +177,8 @@ func NewServer(serverOptions *ServerOptions) (*Server, error) {
}

<-readerReady
rm, err := makeRulesManager(serverOptions.PromConfigPath,
baseconst.GetAlertManagerApiPrefix(),
rm, err := makeRulesManager(
serverOptions.PromConfigPath,
serverOptions.RuleRepoURL,
serverOptions.SigNoz.SQLStore.SQLxDB(),
reader,
Expand All @@ -186,6 +187,8 @@ func NewServer(serverOptions *ServerOptions) (*Server, error) {
lm,
serverOptions.UseLogsNewSchema,
serverOptions.UseTraceNewSchema,
serverOptions.SigNoz.Alertmanager,
serverOptions.SigNoz.SQLStore,
)

if err != nil {
Expand Down Expand Up @@ -268,7 +271,7 @@ func NewServer(serverOptions *ServerOptions) (*Server, error) {
JWT: serverOptions.Jwt,
}

apiHandler, err := api.NewAPIHandler(apiOpts)
apiHandler, err := api.NewAPIHandler(apiOpts, serverOptions.SigNoz)
if err != nil {
return nil, err
}
Expand Down Expand Up @@ -530,47 +533,41 @@ func (s *Server) Stop() error {

func makeRulesManager(
promConfigPath,
alertManagerURL string,
ruleRepoURL string,
db *sqlx.DB,
ch baseint.Reader,
cache cache.Cache,
disableRules bool,
fm baseint.FeatureLookup,
useLogsNewSchema bool,
useTraceNewSchema bool) (*baserules.Manager, error) {

useTraceNewSchema bool,
alertmanager alertmanager.Alertmanager,
sqlstore sqlstore.SQLStore,
) (*baserules.Manager, error) {
// create engine
pqle, err := pqle.FromConfigPath(promConfigPath)
if err != nil {
return nil, fmt.Errorf("failed to create pql engine : %v", err)
}

// notifier opts
notifierOpts := basealm.NotifierOptions{
QueueCapacity: 10000,
Timeout: 1 * time.Second,
AlertManagerURLs: []string{alertManagerURL},
}

// create manager opts
managerOpts := &baserules.ManagerOptions{
NotifierOpts: notifierOpts,
PqlEngine: pqle,
RepoURL: ruleRepoURL,
DBConn: db,
Context: context.Background(),
Logger: zap.L(),
DisableRules: disableRules,
FeatureFlags: fm,
Reader: ch,
Cache: cache,
EvalDelay: baseconst.GetEvalDelay(),

PqlEngine: pqle,
RepoURL: ruleRepoURL,
DBConn: db,
Context: context.Background(),
Logger: zap.L(),
DisableRules: disableRules,
FeatureFlags: fm,
Reader: ch,
Cache: cache,
EvalDelay: baseconst.GetEvalDelay(),
PrepareTaskFunc: rules.PrepareTaskFunc,
UseLogsNewSchema: useLogsNewSchema,
UseTraceNewSchema: useTraceNewSchema,
PrepareTestRuleFunc: rules.TestNotification,
Alertmanager: alertmanager,
SQLStore: sqlstore,
}

// create Manager
Expand Down
33 changes: 21 additions & 12 deletions ee/query-service/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@ import (
"os"
"os/signal"
"strconv"
"syscall"
"time"

"go.opentelemetry.io/otel/sdk/resource"
Expand Down Expand Up @@ -150,7 +149,14 @@ func main() {
zap.L().Fatal("Failed to create config", zap.Error(err))
}

signoz, err := signoz.New(context.Background(), config, signoz.NewProviderConfig())
signoz, err := signoz.New(
context.Background(),
config,
signoz.NewCacheProviderFactories(),
signoz.NewWebProviderFactories(),
signoz.NewSQLStoreProviderFactories(),
signoz.NewTelemetryStoreProviderFactories(),
)
if err != nil {
zap.L().Fatal("Failed to create signoz struct", zap.Error(err))
}
Expand Down Expand Up @@ -198,16 +204,19 @@ func main() {
zap.L().Fatal("Failed to initialize auth cache", zap.Error(err))
}

signalsChannel := make(chan os.Signal, 1)
signal.Notify(signalsChannel, os.Interrupt, syscall.SIGTERM)
signoz.Start(context.Background())

for {
select {
case status := <-server.HealthCheckStatus():
zap.L().Info("Received HealthCheck status: ", zap.Int("status", int(status)))
case <-signalsChannel:
zap.L().Fatal("Received OS Interrupt Signal ... ")
server.Stop()
}
if err := signoz.Wait(context.Background()); err != nil {
zap.L().Fatal("Failed to start signoz", zap.Error(err))
}

err = server.Stop()
if err != nil {
zap.L().Fatal("Failed to stop server", zap.Error(err))
}

err = signoz.Stop(context.Background())
if err != nil {
zap.L().Fatal("Failed to stop signoz", zap.Error(err))
}
}
6 changes: 6 additions & 0 deletions ee/query-service/rules/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ func PrepareTaskFunc(opts baserules.PrepareTaskOptions) (baserules.Task, error)
opts.UseLogsNewSchema,
opts.UseTraceNewSchema,
baserules.WithEvalDelay(opts.ManagerOpts.EvalDelay),
baserules.WithSQLStore(opts.SQLStore),
)

if err != nil {
Expand All @@ -48,6 +49,7 @@ func PrepareTaskFunc(opts baserules.PrepareTaskOptions) (baserules.Task, error)
opts.Logger,
opts.Reader,
opts.ManagerOpts.PqlEngine,
baserules.WithSQLStore(opts.SQLStore),
)

if err != nil {
Expand All @@ -68,6 +70,7 @@ func PrepareTaskFunc(opts baserules.PrepareTaskOptions) (baserules.Task, error)
opts.Reader,
opts.Cache,
baserules.WithEvalDelay(opts.ManagerOpts.EvalDelay),
baserules.WithSQLStore(opts.SQLStore),
)
if err != nil {
return task, err
Expand Down Expand Up @@ -126,6 +129,7 @@ func TestNotification(opts baserules.PrepareTestRuleOptions) (int, *basemodel.Ap
opts.UseTraceNewSchema,
baserules.WithSendAlways(),
baserules.WithSendUnmatched(),
baserules.WithSQLStore(opts.SQLStore),
)

if err != nil {
Expand All @@ -144,6 +148,7 @@ func TestNotification(opts baserules.PrepareTestRuleOptions) (int, *basemodel.Ap
opts.ManagerOpts.PqlEngine,
baserules.WithSendAlways(),
baserules.WithSendUnmatched(),
baserules.WithSQLStore(opts.SQLStore),
)

if err != nil {
Expand All @@ -160,6 +165,7 @@ func TestNotification(opts baserules.PrepareTestRuleOptions) (int, *basemodel.Ap
opts.Cache,
baserules.WithSendAlways(),
baserules.WithSendUnmatched(),
baserules.WithSQLStore(opts.SQLStore),
)
if err != nil {
zap.L().Error("failed to prepare a new anomaly rule for test", zap.String("name", rule.Name()), zap.Error(err))
Expand Down
4 changes: 1 addition & 3 deletions frontend/src/api/alerts/getTriggered.ts
Original file line number Diff line number Diff line change
Expand Up @@ -13,13 +13,11 @@ const getTriggered = async (

const response = await axios.get(`/alerts?${queryParams}`);

const amData = JSON.parse(response.data.data);

return {
statusCode: 200,
error: null,
message: response.data.status,
payload: amData.data,
payload: response.data.data,
};
} catch (error) {
return ErrorResponseHandler(error as AxiosError);
Expand Down
2 changes: 1 addition & 1 deletion frontend/src/api/channels/createMsTeams.ts
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ const create = async (
try {
const response = await axios.post('/channels', {
name: props.name,
msteams_configs: [
msteamsv2_configs: [
{
send_resolved: props.send_resolved,
webhook_url: props.webhook_url,
Expand Down
2 changes: 1 addition & 1 deletion frontend/src/api/channels/editMsTeams.ts
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ const editMsTeams = async (
try {
const response = await axios.put(`/channels/${props.id}`, {
name: props.name,
msteams_configs: [
msteamsv2_configs: [
{
send_resolved: props.send_resolved,
webhook_url: props.webhook_url,
Expand Down
2 changes: 1 addition & 1 deletion frontend/src/api/channels/testMsTeams.ts
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ const testMsTeams = async (
try {
const response = await axios.post('/testChannel', {
name: props.name,
msteams_configs: [
msteamsv2_configs: [
{
send_resolved: true,
webhook_url: props.webhook_url,
Expand Down
4 changes: 2 additions & 2 deletions frontend/src/pages/ChannelsEdit/index.tsx
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,8 @@ function ChannelsEdit(): JSX.Element {
};
}

if (value && 'msteams_configs' in value) {
const msteamsConfig = value.msteams_configs[0];
if (value && 'msteamsv2_configs' in value) {
const msteamsConfig = value.msteamsv2_configs[0];
channel = msteamsConfig;
return {
type: ChannelType.MsTeams,
Expand Down
Loading

0 comments on commit 1f33928

Please sign in to comment.