Merge pull request prometheus-operator#22 from nvtkaszpir/runbook-kubernetes-3
paulfantom authored Feb 13, 2023
2 parents bb608b4 + 1616bee commit ff3bf6b
Showing 10 changed files with 266 additions and 9 deletions.
27 changes: 27 additions & 0 deletions content/runbooks/kubernetes/KubeJobFailed.md
@@ -0,0 +1,27 @@
---
title: Kube Job Failed
weight: 20
---

# KubeJobFailed

## Meaning

Job failed to complete.

## Impact

Processing of a scheduled task has failed.

## Diagnosis

- Check the job via `kubectl -n $NAMESPACE describe jobs $JOB`.
- Check pod events via `kubectl -n $NAMESPACE describe pod $POD_FROM_JOB`.
- Check pod logs via `kubectl -n $NAMESPACE logs $POD_FROM_JOB`
  (see the sketch below for locating the pod).
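
A minimal sketch for locating the pods created by the failed job, assuming the
job controller's default `job-name` label is present (`$NAMESPACE` and `$JOB`
are placeholders):

```shell
# List the pods created by the job, including failed ones
kubectl -n "$NAMESPACE" get pods -l job-name="$JOB"

# Fetch logs from the first matching pod of the job
POD_FROM_JOB=$(kubectl -n "$NAMESPACE" get pods -l job-name="$JOB" \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n "$NAMESPACE" logs "$POD_FROM_JOB" --all-containers
```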

## Mitigation

- See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
- See [Job patterns](https://kubernetes.io/docs/tasks/job/)
- Redesign the job so that it is idempotent (it can be re-run many times and
  will always produce the same output even if the input differs); an idempotent
  job can then simply be re-run, as sketched below.
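
If the failing Job is created by a CronJob and the underlying problem has been
fixed, the run can be retried manually. A hedged sketch, assuming the parent
CronJob is named `$CRONJOB` (adjust names to your environment):

```shell
# Create a one-off job from the CronJob template to retry the failed run
kubectl -n "$NAMESPACE" create job --from=cronjob/"$CRONJOB" "${CRONJOB}-manual-retry"

# Watch its progress
kubectl -n "$NAMESPACE" get jobs -w
```
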
48 changes: 48 additions & 0 deletions content/runbooks/kubernetes/KubeMemoryOvercommit.md
@@ -0,0 +1,48 @@
---
title: Kube Memory Overcommit
weight: 20
aliases:
- /kubememovercommit/
---

# KubeMemoryOvercommit

## Meaning

Cluster has overcommitted Memory resource requests for Pods
and cannot tolerate node failure.

<details>
<summary>Full context</summary>

The total of memory resource requests for Pods exceeds the capacity the cluster
would retain after losing a node, so in case of node failure some Pods will not
fit on the remaining nodes.

</details>

## Impact

The cluster cannot tolerate node failure. In the event of a node failure,
some Pods will be in `Pending` state.

## Diagnosis

- Check if memory resource requests are adjusted to actual application usage
  (see the sketch below)
- Check if some nodes are available and not cordoned
- Check if cluster-autoscaler has issues adding new nodes
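
A quick sketch for comparing requested memory with node capacity, assuming
cluster-wide read access:

```shell
# Per-node memory requests and limits as a percentage of allocatable capacity
kubectl describe nodes | grep -A 8 "Allocated resources"

# Cordoned nodes show up with a SchedulingDisabled status
kubectl get nodes
```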

## Mitigation

- Add more nodes to the cluster; it is usually better to have more, smaller
  nodes than a few bigger ones.

- Add node pools with different instance types to avoid the problems that come
  with relying on a single instance type in the cloud.

- Use pod priorities to prevent important services from losing performance;
  see [pod priority and preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/)
  and the sketch after this list.

- Fine tune settings for special pods used with [cluster-autoscaler](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-cluster-autoscaler-work-with-pod-priority-and-preemption)

- Prepare performance tests for the expected workload and plan cluster capacity
  accordingly.
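
A minimal sketch of creating a priority class for important workloads, as
referenced in the pod priority item above (the name and value are illustrative):

```shell
# Pods that set priorityClassName to this class are scheduled first and can
# preempt lower-priority Pods when capacity runs short
kubectl create priorityclass business-critical \
  --value=1000000 \
  --description="Workloads that must keep running during capacity shortages"
```

Pods then opt in by setting `priorityClassName: business-critical` in their spec.
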
27 changes: 27 additions & 0 deletions content/runbooks/kubernetes/KubeletClientCertificateExpiration.md
@@ -0,0 +1,27 @@
---
title: Kubelet Client Certificate Expiration
weight: 20
---

# KubeletClientCertificateExpiration

## Meaning

The client certificate for the Kubelet on a node expires soon or has already expired.

## Impact

The node will not be usable within the cluster.

## Diagnosis

Check when the certificate was issued and when it expires.
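
A hedged sketch, assuming SSH access to the node and the default kubelet PKI
directory (`/var/lib/kubelet/pki`); paths may differ between distributions:

```shell
# Print the validity dates of the kubelet client certificate
openssl x509 -noout -dates \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem
```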

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
28 changes: 28 additions & 0 deletions content/runbooks/kubernetes/KubeletClientCertificateRenewalErrors.md
@@ -0,0 +1,28 @@
---
title: Kubelet Client Certificate Renewal Errors
weight: 20
---

# KubeletClientCertificateRenewalErrors

## Meaning

Kubelet on node has failed to renew its client certificate
(XX errors in the last 15 minutes)

## Impact

The node will not be usable within the cluster.

## Diagnosis

Check when the certificate was issued and when it expires.
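
A hedged sketch for finding the renewal errors, assuming SSH access to the node
and a systemd-managed kubelet:

```shell
# Look for certificate rotation errors in the kubelet logs
journalctl -u kubelet.service --since "1 hour ago" | grep -i certificate

# Check for CertificateSigningRequests stuck waiting for approval
kubectl get csr
```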

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
11 changes: 9 additions & 2 deletions content/runbooks/kubernetes/KubeletDown.md
@@ -1,3 +1,8 @@
---
title: Kubelet Down
weight: 20
---

# KubeletDown

## Meaning
@@ -18,7 +23,7 @@ debugging tools are likely not functional, e.g. `kubectl exec` and `kubectl logs
Check the status of nodes and for recent events on `Node` objects, or for recent
events in general:

```shell
$ kubectl get nodes
$ kubectl describe node $NODE_NAME
$ kubectl get events --field-selector 'involvedObject.kind=Node'
$ kubectl get events
```

If you have SSH access to the nodes, access the logs for the Kubelet directly:

```shell
$ journalctl -b -f -u kubelet.service
```

## Mitigation

The mitigation depends on what is causing the Kubelets to become
unresponsive. Check for widespread networking issues, or node-level
configuration issues.

See [Kubernetes Docs - kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
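
If you have SSH access to an affected node, a minimal sketch for checking and
restarting the kubelet (assuming a systemd-managed kubelet):

```shell
# Inspect the service state and the most recent log lines
systemctl status kubelet.service
journalctl -u kubelet.service -n 100 --no-pager

# Restart the kubelet and confirm the node becomes Ready again
sudo systemctl restart kubelet.service
kubectl get nodes -w
```
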
23 changes: 23 additions & 0 deletions content/runbooks/kubernetes/KubeletPlegDurationHigh.md
@@ -0,0 +1,23 @@
---
title: Kubelet Pod Lifecycle Event Generator Duration High
weight: 20
---

# KubeletPlegDurationHigh

## Meaning

The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of
XX seconds on node.

## Impact

TODO

## Diagnosis

TODO

## Mitigation

TODO
23 changes: 23 additions & 0 deletions content/runbooks/kubernetes/KubeletPodStartUpLatencyHigh.md
@@ -0,0 +1,23 @@
---
title: Kubelet Pod Start Up Latency High
weight: 20
---

# KubeletPodStartUpLatencyHigh

## Meaning

Kubelet Pod startup 99th percentile latency is XX seconds on node.

## Impact

Slow pod starts.

## Diagnosis

This is usually caused by exhausted IOPS for the node storage.
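
A hedged sketch for confirming storage saturation, assuming SSH access to the
node and the `sysstat` package installed there:

```shell
# Check node conditions reported by the kubelet (for example DiskPressure)
kubectl describe node "$NODE_NAME" | grep -A 10 "Conditions:"

# On the node: watch per-device utilization; %util near 100 indicates exhausted IOPS
iostat -dx 5
```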

## Mitigation

[Cordon and drain the node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) and delete it.
If the issue persists, look into the node logs.
27 changes: 27 additions & 0 deletions content/runbooks/kubernetes/KubeletServerCertificateExpiration.md
@@ -0,0 +1,27 @@
---
title: Kubelet Server Certificate Expiration
weight: 20
---

# KubeletServerCertificateExpiration

## Meaning

The server certificate for the Kubelet on a node expires soon or has already expired.

## Impact

**Critical**: The cluster will be in an inoperable state.

## Diagnosis

Check when the certificate was issued and when it expires.
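
A hedged sketch for inspecting the serving certificate remotely, assuming
network access to the kubelet port (10250 by default); `$NODE_IP` is a
placeholder:

```shell
# Print the validity dates of the certificate served on the kubelet HTTPS port
echo | openssl s_client -connect "$NODE_IP":10250 2>/dev/null \
  | openssl x509 -noout -dates
```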

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
28 changes: 28 additions & 0 deletions content/runbooks/kubernetes/KubeletServerCertificateRenewalErrors.md
@@ -0,0 +1,28 @@
---
title: Kubelet Server Certificate Renewal Errors
weight: 20
---

# KubeletServerCertificateRenewalErrors

## Meaning

Kubelet on node has failed to renew its server certificate
(XX errors in the last 5 minutes)

## Impact

**Critical**: The cluster will be in an inoperable state.

## Diagnosis

Check when the certificate was issued and when it expires.
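
When kubelet serving certificates are issued through the cluster
(`serverTLSBootstrap: true`), renewals can get stuck waiting for CSR approval.
A hedged sketch:

```shell
# Serving certificate renewals create CSRs that may need manual approval
kubectl get csr | grep -i pending

# Approve a specific request after verifying it (the name is illustrative)
kubectl certificate approve csr-example
```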

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
33 changes: 26 additions & 7 deletions content/runbooks/kubernetes/KubeletTooManyPods.md
@@ -1,27 +1,46 @@
---
title: Kubelet Too Many Pods
weight: 20
---

# KubeletTooManyPods

## Meaning

The alert fires when a specific node is running >95% of its capacity of pods
(110 by default).

<details>
<summary>Full context</summary>

Kubelets have a configuration that limits how many Pods they can run.
The default value of this is 110 Pods per Kubelet, but it is configurable
(and this alert takes that configuration into account with the
`kube_node_status_capacity_pods` metric).

</details>

## Impact

Running many pods (more than 110) on a single node places a strain on the
Container Runtime Interface (CRI), Container Network Interface (CNI),
and the operating system itself. Approaching that limit may affect performance
and availability of that node.

## Diagnosis

Check the number of pods on a given node by running:

```shell
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node>
```

## Mitigation

Since Kubernetes only officially supports [110 pods per node](https://kubernetes.io/docs/setup/best-practices/cluster-large/),
you should preferably move pods onto other nodes or expand your cluster with more worker nodes.

If you're certain the node can handle more pods, you can raise the max pods
per node limit by changing `maxPods` in your [KubeletConfiguration](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/)
(for kubeadm-based clusters) or changing the setting in your cloud provider's
dashboard (if supported).
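
A minimal sketch for checking pod capacity and current usage per node, useful
before and after changing `maxPods` (read-only commands; `$NODE_NAME` is a
placeholder):

```shell
# Pod capacity reported by each kubelet
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS_CAPACITY:.status.capacity.pods

# Number of pods currently scheduled on a given node
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE_NAME" \
  --no-headers | wc -l
```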
