Skip to content

Commit

Permalink
Merge pull request #878 from Sentinel-One/dohnto-runbooks1
Browse files Browse the repository at this point in the history
Fill external runbooks urls
  • Loading branch information
povilasv authored Oct 30, 2023
2 parents e6b3318 + c3f0d70 commit d9a7f3a
Showing 1 changed file with 80 additions and 7 deletions.
87 changes: 80 additions & 7 deletions runbook.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ This page collects this repositories alerts and begins the process of describing
##### Alert Name: "KubeAPIDown"
+ *Message*: `KubeAPI has disappeared from Prometheus target discovery.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapidown/)
##### Alert Name: "KubeControllerManagerDown"
+ *Message*: `KubeControllerManager has disappeared from Prometheus target discovery.`
+ *Severity*: critical
Expand All @@ -24,140 +25,212 @@ This page collects this repositories alerts and begins the process of describing
##### Alert Name: KubeletDown
+ *Message*: `Kubelet has disappeared from Prometheus target discovery.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletdown/)
##### Alert Name: KubeProxyDown
+ *Message*: `KubeProxy has disappeared from Prometheus target discovery`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeproxydown/)

### Group Name: kubernetes-apps
##### Alert Name: KubePodCrashLooping
+ *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} / second`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping/)
##### Alert Name: "KubePodNotReady"
+ *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} is not ready.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodnotready/)
##### Alert Name: "KubeDeploymentGenerationMismatch"
+ *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} generation mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentgenerationmismatch/)
##### Alert Name: "KubeDeploymentReplicasMismatch"
+ *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentreplicasmismatch/)
##### Alert Name: "KubeDeploymentRolloutStuck"
+ *Message*: `Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment }} is not progressing`
+ *Severity*: warning
##### Alert Name: "KubeStatefulSetReplicasMismatch"
+ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} replica mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetreplicasmismatch/)
##### Alert Name: "KubeStatefulSetGenerationMismatch"
+ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} generation mismatch`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetgenerationmismatch/)
##### Alert Name: "KubeDaemonSetRolloutStuck"
+ *Message*: `Only {{$value | humanizePercentage }} of desired pods scheduled and ready for daemon set {{$labels.namespace}}/{{$labels.daemonset}}`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetrolloutstuck/)
##### Alert Name: "KubeContainerWaiting"
+ *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting/)
##### Alert Name: "KubeDaemonSetNotScheduled"
+ *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`
+ *Severity*: warning

+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled/)
##### Alert Name: "KubeStatefulSetUpdateNotRolledOut"
+ *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout/)
##### Alert Name: "KubeHpaReplicasMismatch"
+ *Message*: `'HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpareplicasmismatch/)
##### Alert Name: "KubeHpaMaxedOut"
+ *Message*: `HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout/)
##### Alert Name: "KubeDaemonSetMisScheduled"
+ *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`
+ *Severity*: warning

+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled/)
##### Alert Name: "KubeJobNotCompleted"
+ *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "%(kubeJobTimeoutDuration)s" | humanizeDuration }} to complete.`
+ *Severity*: warning
+ *Action*: Check the job using `kubectl describe job <job>` and look at the pod logs using `kubectl logs <pod>` for further information.

##### Alert Name: "KubeJobFailed"
+ *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.`
+ *Severity*: warning
+ *Action*: Check the job using `kubectl describe job <job>` and look at the pod logs using `kubectl logs <pod>` for further information.
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed/)

### Group Name: "kubernetes-resources"
##### Alert Name: "KubeCPUOvercommit"
+ *Message*: `Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.`
+ *Severity*: warning
##### Alert Name: "KubeMemOvercommit"
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit/)
##### Alert Name: "KubeMemoryOvercommit"
+ *Message*: `Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.`
+ *Severity*: warning
##### Alert Name: "KubeCPUQuotaOvercommit"
+ *Message*: `Cluster has overcommitted CPU resource requests for Namespaces.`
+ *Severity*: warning
##### Alert Name: "KubeMemQuotaOvercommit"
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit/)
##### Alert Name: "KubeMemoryQuotaOvercommit"
+ *Message*: `Cluster has overcommitted memory resource requests for Namespaces.`
+ *Severity*: warning
##### Alert Name: "KubeQuotaAlmostFull"
+ *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaalmostfull/)
##### Alert Name: "KubeQuotaFullyUsed"
+ *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotafullyused/)
##### Alert Name: "KubeQuotaExceeded"
+ *Message*: `{{ $value | humanizePercentage }} usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded/)
##### Alert Name: "CPUThrottlingHigh"
+ *Message*: `Processes experience elevated CPU throttling.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/cputhrottlinghigh/)

### Group Name: "kubernetes-storage"
##### Alert Name: "KubePersistentVolumeFillingUp"
+ *Message*: `The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | humanizePercentage }} free.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
##### Alert Name: "KubePersistentVolumeFillingUp"
+ *Message*: `Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup/)
##### Alert Name: "KubePersistentVolumeInodesFillingUp"
+ *Message*: `PersistentVolume is filling up.`
##### Alert Name: "KubePersistentVolumeErrors"
+ *Message*: `PersistentVolume is having issues with provisioning.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeerrors/)

### Group Name: "kubernetes-system"
##### Alert Name: "KubeNodeNotReady"
+ *Message*: `{{ $labels.node }} has been unready for more than 15 minutes."`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready/)
##### Alert Name: "KubeNodeUnreachable"
+ *Message*: `{{ $labels.node }} is unreachable and some workloads may be rescheduled.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeunreachable/)
##### Alert Name: "KubeletTooManyPods"
+ *Message*: `Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.`
+ *Severity*: info
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods/)
##### Alert Name: "KubeNodeReadinessFlapping"
+ *Message*: `The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping/)
##### Alert Name: "KubeletPlegDurationHigh"
+ *Message*: `The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh/)
##### Alert Name: "KubeletPodStartUpLatencyHigh"
+ *Message*: `Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh/)
##### Alert Name: "KubeletClientCertificateExpiration"
+ *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
##### Alert Name: "KubeletClientCertificateExpiration"
+ *Message*: `Client certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration/)
##### Alert Name: "KubeletServerCertificateExpiration"
+ *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 7 days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
##### Alert Name: "KubeletServerCertificateExpiration"
+ *Message*: `Server certificate for Kubelet on node {{ $labels.node }} expires in 1 day.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration/)
##### Alert Name: "KubeletClientCertificateRenewalErrors"
+ *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 15 minutes).`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificaterenewalerrors/)
##### Alert Name: "KubeletServerCertificateRenewalErrors"
+ *Message*: `Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificaterenewalerrors/)
##### Alert Name: "KubeVersionMismatch"
+ *Message*: `There are {{ $value }} different versions of Kubernetes components running.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeversionmismatch/)
##### Alert Name: "KubeClientErrors"
+ *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors/)
##### Alert Name: "KubeClientCertificateExpiration"
+ *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 7 days.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
##### Alert Name: "KubeClientCertificateExpiration"
+ *Message*: `A client certificate used to authenticate to the apiserver is expiring in less than 1 day.`
+ *Severity*: critical
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration/)
##### Alert Name: "KubeAPITerminatedRequests"
+ *Message*: `The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.`
+ *Severity*: warning
+ *Action*: Use the `apiserver_flowcontrol_rejected_requests_total` metric to determine which flow schema is throttling the traffic to the API Server. The flow schema also provides information on the affected resources and subjects.
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapiterminatedrequests/)
##### Alert Name: "KubeAggregatedAPIErrors"
+ *Message*: `Kubernetes aggregated API has reported errors.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapierrors/)
##### Alert Name: "KubeAggregatedAPIDown"
+ *Message*: `Kubernetes aggregated API is down.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapidown/)

### Group Name: "kube-apiserver-slos"
##### Alert Name: "KubeAPIErrorBudgetBurn"
+ *Message*: `The API server is burning too much error budget.`
+ *Severity*: warning
+ *Runbook*: [Link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn/)

## Other Kubernetes Runbooks and troubleshooting
+ [Troubleshoot Clusters ](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
+ [Cloud.gov Kubernetes Runbook ](https://landing.app.cloud.gov/docs/ops/runbook/troubleshooting-kubernetes/)
+ [Troubleshoot Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
+ [Cloud.gov Kubernetes Runbook](https://landing.app.cloud.gov/docs/ops/runbook/troubleshooting-kubernetes/)
+ [Recover a Broken Cluster](https://codefresh.io/Kubernetes-Tutorial/recover-broken-kubernetes-cluster/)

0 comments on commit d9a7f3a

Please sign in to comment.