Merge pull request prometheus-operator#20 from nvtkaszpir/runbook-kubernetes-5
paulfantom authored Feb 13, 2023
2 parents 8095161 + 0fac85a commit bb608b4
Showing 7 changed files with 265 additions and 1 deletion.
25 changes: 25 additions & 0 deletions content/runbooks/kubernetes/KubeQuotaExceeded.md
@@ -0,0 +1,25 @@
---
title: Kube Quota Exceeded
weight: 20
---

# KubeQuotaExceeded

## Meaning

The cluster has reached the allowed hard quota limits for the given namespace.

## Impact

New resources cannot be created in the namespace.

## Diagnosis

- Check resource usage for the namespace over the given time span.
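A quick way to compare used resources against the hard limits (`$NAMESPACE` is a placeholder for the affected namespace):

```shell
# List all quotas in the namespace with used vs. hard limits.
kubectl -n "$NAMESPACE" describe resourcequota

# Machine-readable variant: compare status.used with status.hard.
kubectl -n "$NAMESPACE" get resourcequota -o yaml
```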

## Mitigation

- Review the existing quota for the given namespace and adjust it accordingly.
- Review the resources governed by the quota and fine-tune them.
- Continue with standard capacity planning procedures.
- See [Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
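A minimal sketch of raising a quota ad hoc; `$QUOTA`, the resource name, and the value are placeholders to adapt to your capacity plan. If quotas are managed declaratively (e.g. via GitOps), change the manifest instead:

```shell
# Example only: raise the hard pod limit of quota "$QUOTA" to 20.
kubectl -n "$NAMESPACE" patch resourcequota "$QUOTA" \
  --type=merge -p '{"spec":{"hard":{"pods":"20"}}}'
```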
25 changes: 25 additions & 0 deletions content/runbooks/kubernetes/KubeQuotaFullyUsed.md
@@ -0,0 +1,25 @@
---
title: Kube Quota Fully Used
weight: 20
---

# KubeQuotaFullyUsed

## Meaning

The cluster has reached the allowed quota limits for the given namespace.

## Impact

New app installations may not be possible.

## Diagnosis

- Check resource usage for the namespace over the given time span.
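To see per-resource usage against the hard limits at a glance (`$NAMESPACE` is a placeholder; the `custom-columns` output prints the raw maps):

```shell
# HARD and USED columns show each quota's limits and current consumption.
kubectl -n "$NAMESPACE" get resourcequota \
  -o custom-columns='NAME:.metadata.name,HARD:.status.hard,USED:.status.used'
```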

## Mitigation

- Review the existing quota for the given namespace and adjust it accordingly.
- Review the resources governed by the quota and fine-tune them.
- Continue with standard capacity planning procedures.
- See [Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
21 changes: 20 additions & 1 deletion content/runbooks/kubernetes/KubeSchedulerDown.md
@@ -1,3 +1,22 @@
---
title: Kube Scheduler Down
weight: 20
---

# KubeSchedulerDown

## Meaning

Kube Scheduler has disappeared from Prometheus target discovery.

## Impact

This is a critical alert. The cluster may be partially or fully non-functional.

## Diagnosis

To be added.
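Until this section is fleshed out, a few generic checks, assuming a self-hosted scheduler running in `kube-system` (labels and paths vary by distribution; the `component=kube-scheduler` label and the manifest path are kubeadm defaults):

```shell
# Is the scheduler pod present and healthy?
kubectl -n kube-system get pods -l component=kube-scheduler -o wide

# Recent scheduler logs, including the previous container if it restarted.
kubectl -n kube-system logs -l component=kube-scheduler --tail=100
kubectl -n kube-system logs -l component=kube-scheduler --previous --tail=100

# On kubeadm clusters the scheduler is a static pod; check its manifest
# on the control plane node.
ls /etc/kubernetes/manifests/kube-scheduler.yaml
```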

## Mitigation

See old CoreOS docs in [Web Archive](http://web.archive.org/web/20201026205154/https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html)
54 changes: 54 additions & 0 deletions content/runbooks/kubernetes/KubeStatefulSetGenerationMismatch.md
@@ -0,0 +1,54 @@
---
title: Kube StatefulSet Generation Mismatch
weight: 20
---

# KubeStatefulSetGenerationMismatch

## Meaning

StatefulSet generation mismatch due to possible roll-back.

## Impact

Service degradation or unavailability.

## Diagnosis

See [Kubernetes Docs - Failed Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment),
which also applies to StatefulSets to some extent.

- Check the rollout history: `kubectl -n $NAMESPACE rollout history statefulset $NAME`
- Check the rollout status and whether the rollout has been paused.
- Check the statefulset status via `kubectl -n $NAMESPACE describe statefulset $NAME`.
- Check how many replicas are declared.
- Check whether new pods are crashing.
- Look for issues with PersistentVolumes attached to the StatefulSet.
- Check the status of the pods that belong to the statefulset.
- Check pod template parameters such as:
  - pod priority - maybe the pod was preempted by higher-priority pods
  - resources - maybe it requests an unavailable resource, such as a GPU,
    but there is a limited number of nodes with GPUs
  - affinity rules - maybe affinities combined with too few matching nodes
    make it impossible to schedule pods
  - pod termination grace period - if too long, pods may stay in the
    terminating state for a long time
- Check if the Horizontal Pod Autoscaler (HPA) is misbehaving due to untested
  values (resource requests).
- Check if cluster-autoscaler is able to create new nodes - see its logs or
  the cluster-autoscaler status configmap.
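The generation checks above can be sketched as follows (`$NAMESPACE` and `$NAME` are placeholders):

```shell
# Rollout history and current status of the statefulset.
kubectl -n "$NAMESPACE" rollout history statefulset "$NAME"
kubectl -n "$NAMESPACE" rollout status statefulset "$NAME" --timeout=30s

# Compare desired vs. observed generation; a persistent mismatch means
# the controller has not processed the latest spec.
kubectl -n "$NAMESPACE" get statefulset "$NAME" \
  -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'
```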

## Mitigation

Statefulsets are quite specific and often run special scripts on pod termination.
Check whether special commands are executed, such as data migration, which may significantly slow down the rollout.

In case of scale-out, adding new nodes usually solves the issue.

Otherwise the statefulset definition probably needs to be fixed.

In rare cases, roll back to the previous version - see [Kubernetes Docs - Rolling Back](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates)

In extremely rare situations it may be better to delete the problematic pods.
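A sketch of the last-resort mitigations (`$NAMESPACE`, `$NAME`, and `$POD` are placeholders; verify the target revision in the rollout history before undoing):

```shell
# Roll back the statefulset to its previous revision.
kubectl -n "$NAMESPACE" rollout undo statefulset "$NAME"

# Last resort: delete a stuck pod so the controller recreates it.
kubectl -n "$NAMESPACE" delete pod "$POD"
```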

See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
56 changes: 56 additions & 0 deletions content/runbooks/kubernetes/KubeStatefulSetReplicasMismatch.md
@@ -0,0 +1,56 @@
---
title: Kube StatefulSet Replicas Mismatch
weight: 20
---

# KubeStatefulSetReplicasMismatch

## Meaning

StatefulSet has not matched the expected number of replicas.

<details>
<summary>Full context</summary>

The Kubernetes StatefulSet resource does not have the number of replicas that
was declared to be in operation.
For example, a statefulset is expected to have 3 replicas, but it has fewer
than that for a noticeable period of time.

On rare occasions there may be more replicas than there should be, and the
system did not clean them up.
</details>

## Impact

Service degradation or unavailability.

## Diagnosis

- Check the statefulset via `kubectl -n $NAMESPACE describe statefulset $NAME`.
- Check how many replicas are declared.
- Check the status of the pods that belong to the statefulset.
- Check pod template parameters such as:
  - pod priority - maybe the pod was preempted by higher-priority pods
  - resources - maybe it requests an unavailable resource, such as a GPU,
    but there is a limited number of nodes with GPUs
  - affinity rules - maybe affinities combined with too few matching nodes
    make it impossible to schedule pods
  - pod termination grace period - if too long, pods may stay in the
    terminating state for a long time
- Check if there are issues with attaching disks to the statefulset - for
  example the disk is in Zone A, but the pod is scheduled in Zone B.
- Check if the Horizontal Pod Autoscaler (HPA) is misbehaving due to untested
  values (resource requests).
- Check if cluster-autoscaler is able to create new nodes - see its logs or
  the cluster-autoscaler status configmap.
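The checks above can be sketched as follows; `$NAMESPACE` and `$NAME` are placeholders, and the `app` label selector is an assumption to adapt to your manifests:

```shell
# Desired vs. ready replicas at a glance.
kubectl -n "$NAMESPACE" get statefulset "$NAME"

# Pods of the statefulset and their state.
kubectl -n "$NAMESPACE" get pods -l app="$NAME" -o wide

# PVC binding problems and scheduling failures show up as events.
kubectl -n "$NAMESPACE" get pvc
kubectl -n "$NAMESPACE" get events --sort-by=.lastTimestamp | tail -n 20
```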

## Mitigation

Depending on the conditions, adding new nodes usually solves the issue.

Set proper affinity rules to schedule pods in the same zone as their volumes
to avoid attachment issues.

See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
42 changes: 42 additions & 0 deletions content/runbooks/kubernetes/KubeStatefulSetUpdateNotRolledOut.md
@@ -0,0 +1,42 @@
---
title: Kube StatefulSet Update Not Rolled Out
weight: 20
---

# KubeStatefulSetUpdateNotRolledOut

## Meaning

StatefulSet update has not been rolled out.

## Impact

Service degradation or unavailability.

## Diagnosis

- Check the statefulset via `kubectl -n $NAMESPACE describe statefulset $NAME`.
- Check whether the statefulset update was paused manually (see status).
- Check how many replicas are declared.
- Check the status of the pods that belong to the statefulset.
- Check pod template parameters such as:
  - pod priority - maybe the pod was preempted by higher-priority pods
  - resources - maybe it requests an unavailable resource, such as a GPU,
    but there is a limited number of nodes with GPUs
  - affinity rules - maybe affinities combined with too few matching nodes
    make it impossible to schedule pods
  - pod termination grace period - if too long, pods may stay in the
    terminating state for a long time
- Check if there are issues with attaching disks to the statefulset - for
  example the disk is in Zone A, but the pod is scheduled in Zone B.
- Check if the Horizontal Pod Autoscaler (HPA) is misbehaving due to untested
  values (resource requests).
- Check if cluster-autoscaler is able to create new nodes - see its logs or
  the cluster-autoscaler status configmap.
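A sketch for checking rollout progress directly from the statefulset status (`$NAMESPACE` and `$NAME` are placeholders):

```shell
# If these two revisions differ, the update has not finished rolling out.
kubectl -n "$NAMESPACE" get statefulset "$NAME" \
  -o jsonpath='{.status.currentRevision} {.status.updateRevision}{"\n"}'

# A partitioned RollingUpdate strategy intentionally holds back pods
# with ordinals below the partition value - check for it here.
kubectl -n "$NAMESPACE" get statefulset "$NAME" \
  -o jsonpath='{.spec.updateStrategy}{"\n"}'
```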

## Mitigation

TODO

See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
43 changes: 43 additions & 0 deletions content/runbooks/kubernetes/KubeVersionMismatch.md
@@ -0,0 +1,43 @@
---
title: Kube Version Mismatch
weight: 20
---

# KubeVersionMismatch

## Meaning

Different semantic versions of Kubernetes components are running.
This usually happens during the kubernetes cluster upgrade process.

<details>
<summary>Full context</summary>

Kubernetes control plane nodes and worker nodes use different versions.
This usually happens when a kubernetes cluster is upgraded between minor or
major versions.

</details>

## Impact

Incompatible API versions between kubernetes components may cause a very
broad range of issues, affecting anything from single containers, through app
stability, up to the stability of the whole cluster.

## Diagnosis

- Check existing kubernetes versions via `kubectl get nodes` and inspect the
  VERSION column.
- Check if there is an ongoing kubernetes upgrade - especially with managed
  services in the cloud.
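The version checks above boil down to two commands:

```shell
# Kubelet version per node - mismatches show up in the VERSION column.
kubectl get nodes -o wide

# Client and API server versions.
kubectl version
```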

## Mitigation

- Drain affected nodes, then upgrade or replace them with newer ones,
  see [Safely drain node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/)

- Ensure proper control plane and node pool versions are set when creating
  clusters.
- Enable automatic cluster updates for the control plane and node pools.
- Set proper maintenance windows for the clusters.
