Merge pull request #32 from kubernetes-monitoring/cronjobs
Basic cron job alerts.
tomwilkie authored Jun 18, 2018
2 parents 480d927 + 8d4e1b5 commit 5dd0f7c
Showing 2 changed files with 56 additions and 0 deletions.
39 changes: 39 additions & 0 deletions alerts/apps_alerts.libsonnet
@@ -133,6 +133,45 @@
},
'for': '10m',
},
{
alert: 'KubeCronJobRunning',
expr: |||
time() - kube_cronjob_next_schedule_time{%(kubeStateMetricsSelector)s} > 3600
||| % $._config,
'for': '1h',
labels: {
severity: 'warning',
},
annotations: {
message: 'CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.',
},
},
{
alert: 'KubeJobCompletion',
expr: |||
kube_job_spec_completions{%(kubeStateMetricsSelector)s} - kube_job_status_succeeded{%(kubeStateMetricsSelector)s} > 0
||| % $._config,
'for': '1h',
labels: {
severity: 'warning',
},
annotations: {
message: 'Job {{ $labels.namespace }}/{{ $labels.job }} is taking more than 1h to complete.',
},
},
{
alert: 'KubeJobFailed',
expr: |||
kube_job_status_failed{%(kubeStateMetricsSelector)s} > 0
||| % $._config,
'for': '1h',
labels: {
severity: 'warning',
},
annotations: {
message: 'Job {{ $labels.namespace }}/{{ $labels.job }} failed to complete.',
},
},
],
},
],
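Each `expr` in this file is a text block formatted with `% $._config`, which fills the `%(kubeStateMetricsSelector)s` placeholder before the rule reaches Prometheus. Jsonnet's `%` operator follows Python's string-formatting rules, so the substitution can be sketched in Python. The selector value `job="kube-state-metrics"` and the `render` helper are assumptions for illustration only; the real value comes from the repository's `_config`, which this diff does not show.

```python
# Sketch of the selector substitution performed by `||| % $._config`.
# ASSUMPTION: the selector value below is illustrative, not taken from the diff.
config = {"kubeStateMetricsSelector": 'job="kube-state-metrics"'}

# Expression templates copied from the three alerts added in this commit.
templates = {
    "KubeCronJobRunning": (
        "time() - kube_cronjob_next_schedule_time"
        "{%(kubeStateMetricsSelector)s} > 3600"
    ),
    "KubeJobCompletion": (
        "kube_job_spec_completions{%(kubeStateMetricsSelector)s}"
        " - kube_job_status_succeeded{%(kubeStateMetricsSelector)s} > 0"
    ),
    "KubeJobFailed": "kube_job_status_failed{%(kubeStateMetricsSelector)s} > 0",
}


def render(template: str, cfg: dict) -> str:
    """Hypothetical helper: fill %(name)s placeholders from cfg."""
    return template % cfg


# Render every template with the same config mapping.
rendered = {name: render(tpl, config) for name, tpl in templates.items()}

if __name__ == "__main__":
    for name, expr in rendered.items():
        print(f"{name}: {expr}")
```

Keeping the selector in `_config` means a deployment that scrapes kube-state-metrics under a different job label only has to change one value.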
17 changes: 17 additions & 0 deletions runbook.md
@@ -49,9 +49,26 @@ This page collects this repositories alerts and begins the process of describing
##### Alert Name: "KubeDaemonSetNotScheduled"
+ *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`
+ *Severity*: warning

##### Alert Name: "KubeDaemonSetMisScheduled"
+ *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`
+ *Severity*: warning

##### Alert Name: "KubeCronJobRunning"
+ *Message*: `CronJob {{ $labels.namespaces }}/{{ $labels.cronjob }} is taking more than 1h to complete.`
+ *Severity*: warning
+ *Action*: Check the cronjob using `kubectl decribe cronjob <cronjob>` and look at the pod logs using `kubectl logs <pod>` for further information.

##### Alert Name: "KubeJobCompletion"
+ *Message*: `Job {{ $labels.namespace }}/{{ $labels.job }} is taking more than 1h to complete.`
+ *Severity*: warning
+ *Action*: Check the job using `kubectl describe job <job>` and look at the pod logs using `kubectl logs <pod>` for further information.

##### Alert Name: "KubeJobFailed"
+ *Message*: `Job {{ $labels.namespace }}/{{ $labels.job }} failed to complete.`
+ *Severity*: warning
+ *Action*: Check the job using `kubectl describe job <job>` and look at the pod logs using `kubectl logs <pod>` for further information.

### Group Name: "kubernetes-resources"
##### Alert Name: "KubeCPUOvercommit"
+ *Message*: `Overcommited CPU resource requests on Pods, cannot tolerate node failure.`
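The *Action* entries for KubeJobCompletion and KubeJobFailed in this hunk amount to two steps: describe the Job, then read the logs of its pods. A minimal Python sketch that assembles those `kubectl` invocations follows. The helper name `triage_commands` is hypothetical, and the `-n` namespace flag and the `job-name` label selector are additions not present in the runbook text; the selector assumes the pods carry the `job-name` label that the Job controller normally sets.

```python
import shlex


def triage_commands(namespace: str, job: str) -> list[list[str]]:
    """Hypothetical helper: build the kubectl calls the runbook suggests.

    ASSUMPTION: pods are found via the `job-name=<job>` label rather than
    by looking up a pod name by hand, as the runbook's `<pod>` implies.
    """
    return [
        # Step 1: inspect the Job's status, conditions, and events.
        ["kubectl", "describe", "job", job, "-n", namespace],
        # Step 2: read logs from the pods that belong to the Job.
        ["kubectl", "logs", "-n", namespace, "-l", f"job-name={job}"],
    ]


if __name__ == "__main__":
    # "batch" and "nightly-report" are placeholder values.
    for argv in triage_commands("batch", "nightly-report"):
        print(shlex.join(argv))
```

Returning argument lists rather than a single string lets a caller pass each command to `subprocess.run` without shell quoting concerns.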
