Merge pull request prometheus-operator#20 from nvtkaszpir/runbook-kubernetes-5
paulfantom authored Feb 13, 2023
2 parents 8095161 + 0fac85a commit bb608b4
Showing 7 changed files with 265 additions and 1 deletion.
25 changes: 25 additions & 0 deletions content/runbooks/kubernetes/KubeQuotaExceeded.md
@@ -0,0 +1,25 @@
---
title: Kube Quota Exceeded
weight: 20
---

# KubeQuotaExceeded

## Meaning

The cluster has reached the allowed hard quota limits for the given namespace.

## Impact

New resources cannot be created in the namespace.

## Diagnosis

- Check resource usage for the namespace over the given time span.
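A quick way to compare used resources against the hard limits (`$NAMESPACE` is a placeholder for the affected namespace):

```shell
# List all quotas in the namespace with used vs. hard limits.
kubectl -n "$NAMESPACE" describe resourcequota

# Machine-readable variant: compare status.used with status.hard.
kubectl -n "$NAMESPACE" get resourcequota -o yaml
```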

## Mitigation

- Review the existing quota for the given namespace and adjust it accordingly.
- Review the resources governed by the quota and fine-tune them.
- Continue with standard capacity planning procedures.
- See [Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
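A minimal sketch of raising a quota ad hoc; `$QUOTA`, the resource name, and the value are placeholders to adapt to your capacity plan. If quotas are managed declaratively (e.g. via GitOps), change the manifest instead:

```shell
# Example only: raise the hard pod limit of quota "$QUOTA" to 20.
kubectl -n "$NAMESPACE" patch resourcequota "$QUOTA" \
  --type=merge -p '{"spec":{"hard":{"pods":"20"}}}'
```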
25 changes: 25 additions & 0 deletions content/runbooks/kubernetes/KubeQuotaFullyUsed.md
@@ -0,0 +1,25 @@
---
title: Kube Quota Fully Used
weight: 20
---

# KubeQuotaFullyUsed

## Meaning

The cluster has reached the allowed quota limits for the given namespace.

## Impact

New app installations may not be possible.

## Diagnosis

- Check resource usage for the namespace over the given time span.
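To see per-resource usage against the hard limits at a glance (`$NAMESPACE` is a placeholder; the `custom-columns` output prints the raw maps):

```shell
# HARD and USED columns show each quota's limits and current consumption.
kubectl -n "$NAMESPACE" get resourcequota \
  -o custom-columns='NAME:.metadata.name,HARD:.status.hard,USED:.status.used'
```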

## Mitigation

- Review the existing quota for the given namespace and adjust it accordingly.
- Review the resources governed by the quota and fine-tune them.
- Continue with standard capacity planning procedures.
- See [Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
21 changes: 20 additions & 1 deletion content/runbooks/kubernetes/KubeSchedulerDown.md
@@ -1,3 +1,22 @@
---
title: Kube Scheduler Down
weight: 20
---

# KubeSchedulerDown

## Meaning

Kube Scheduler has disappeared from Prometheus target discovery.

## Impact

This is a critical alert. The cluster may be partially or fully non-functional.

## Diagnosis

To be added.
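Until this section is fleshed out, a few generic checks, assuming a self-hosted scheduler running in `kube-system` (labels and paths vary by distribution; the `component=kube-scheduler` label and the manifest path are kubeadm defaults):

```shell
# Is the scheduler pod present and healthy?
kubectl -n kube-system get pods -l component=kube-scheduler -o wide

# Recent scheduler logs, including the previous container if it restarted.
kubectl -n kube-system logs -l component=kube-scheduler --tail=100
kubectl -n kube-system logs -l component=kube-scheduler --previous --tail=100

# On kubeadm clusters the scheduler is a static pod; check its manifest
# on the control plane node.
ls /etc/kubernetes/manifests/kube-scheduler.yaml
```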

## Mitigation

See old CoreOS docs in [Web Archive](http://web.archive.org/web/20201026205154/https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html)
54 changes: 54 additions & 0 deletions content/runbooks/kubernetes/KubeStatefulSetGenerationMismatch.md
@@ -0,0 +1,54 @@
---
title: Kube StatefulSet Generation Mismatch
weight: 20
---

# KubeStatefulSetGenerationMismatch

## Meaning

StatefulSet generation mismatch due to possible roll-back.

## Impact

Service degradation or unavailability.

## Diagnosis

See [Kubernetes Docs - Failed Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment),
which also applies to StatefulSets to some extent.

- Check the rollout history: `kubectl -n $NAMESPACE rollout history statefulset $NAME`
- Check the rollout status and whether the rollout has been paused.
- Check the statefulset status via `kubectl -n $NAMESPACE describe statefulset $NAME`.
- Check how many replicas are declared.
- Check whether new pods are crashing.
- Look for issues with PersistentVolumes attached to the StatefulSet.
- Check the status of the pods that belong to the statefulset.
- Check pod template parameters such as:
  - pod priority - maybe the pod was preempted by higher-priority pods
  - resources - maybe it requests an unavailable resource, such as a GPU,
    but there is a limited number of nodes with GPUs
  - affinity rules - maybe affinities combined with too few matching nodes
    make it impossible to schedule pods
  - pod termination grace period - if too long, pods may stay in the
    terminating state for a long time
- Check if the Horizontal Pod Autoscaler (HPA) is misbehaving due to untested
  values (resource requests).
- Check if cluster-autoscaler is able to create new nodes - see its logs or
  the cluster-autoscaler status configmap.
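The generation checks above can be sketched as follows (`$NAMESPACE` and `$NAME` are placeholders):

```shell
# Rollout history and current status of the statefulset.
kubectl -n "$NAMESPACE" rollout history statefulset "$NAME"
kubectl -n "$NAMESPACE" rollout status statefulset "$NAME" --timeout=30s

# Compare desired vs. observed generation; a persistent mismatch means
# the controller has not processed the latest spec.
kubectl -n "$NAMESPACE" get statefulset "$NAME" \
  -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'
```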

## Mitigation

Statefulsets are quite specific and often run special scripts on pod termination.
Check whether special commands are executed, such as data migration, which may significantly slow down the rollout.

In case of scale-out, adding new nodes usually solves the issue.

Otherwise the statefulset definition probably needs to be fixed.

In rare cases, roll back to the previous version - see [Kubernetes Docs - Rolling Back](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates)

In extremely rare situations it may be better to delete the problematic pods.
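A sketch of the last-resort mitigations (`$NAMESPACE`, `$NAME`, and `$POD` are placeholders; verify the target revision in the rollout history before undoing):

```shell
# Roll back the statefulset to its previous revision.
kubectl -n "$NAMESPACE" rollout undo statefulset "$NAME"

# Last resort: delete a stuck pod so the controller recreates it.
kubectl -n "$NAMESPACE" delete pod "$POD"
```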

See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
56 changes: 56 additions & 0 deletions content/runbooks/kubernetes/KubeStatefulSetReplicasMismatch.md
@@ -0,0 +1,56 @@
---
title: Kube StatefulSet Replicas Mismatch
weight: 20
---

# KubeStatefulSetReplicasMismatch

## Meaning

StatefulSet has not matched the expected number of replicas.

<details>
<summary>Full context</summary>

The Kubernetes StatefulSet resource does not have the number of replicas that
was declared to be in operation.
For example, a statefulset is expected to have 3 replicas, but it has fewer
than that for a noticeable period of time.

On rare occasions there may be more replicas than there should be, and the
system did not clean them up.
</details>

## Impact

Service degradation or unavailability.

## Diagnosis

- Check the statefulset via `kubectl -n $NAMESPACE describe statefulset $NAME`.
- Check how many replicas are declared.
- Check the status of the pods that belong to the statefulset.
- Check pod template parameters such as:
  - pod priority - maybe the pod was preempted by higher-priority pods
  - resources - maybe it requests an unavailable resource, such as a GPU,
    but there is a limited number of nodes with GPUs
  - affinity rules - maybe affinities combined with too few matching nodes
    make it impossible to schedule pods
  - pod termination grace period - if too long, pods may stay in the
    terminating state for a long time
- Check if there are issues with attaching disks to the statefulset - for
  example the disk is in Zone A, but the pod is scheduled in Zone B.
- Check if the Horizontal Pod Autoscaler (HPA) is misbehaving due to untested
  values (resource requests).
- Check if cluster-autoscaler is able to create new nodes - see its logs or
  the cluster-autoscaler status configmap.
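The checks above can be sketched as follows; `$NAMESPACE` and `$NAME` are placeholders, and the `app` label selector is an assumption to adapt to your manifests:

```shell
# Desired vs. ready replicas at a glance.
kubectl -n "$NAMESPACE" get statefulset "$NAME"

# Pods of the statefulset and their state.
kubectl -n "$NAMESPACE" get pods -l app="$NAME" -o wide

# PVC binding problems and scheduling failures show up as events.
kubectl -n "$NAMESPACE" get pvc
kubectl -n "$NAMESPACE" get events --sort-by=.lastTimestamp | tail -n 20
```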

## Mitigation

Depending on the conditions, adding new nodes usually solves the issue.

Set proper affinity rules to schedule pods in the same zone as their volumes
to avoid attachment issues.

See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
42 changes: 42 additions & 0 deletions content/runbooks/kubernetes/KubeStatefulSetUpdateNotRolledOut.md
@@ -0,0 +1,42 @@
---
title: Kube StatefulSet Update Not Rolled Out
weight: 20
---

# KubeStatefulSetUpdateNotRolledOut

## Meaning

StatefulSet update has not been rolled out.

## Impact

Service degradation or unavailability.

## Diagnosis

- Check the statefulset via `kubectl -n $NAMESPACE describe statefulset $NAME`.
- Check whether the statefulset update was paused manually (see status).
- Check how many replicas are declared.
- Check the status of the pods that belong to the statefulset.
- Check pod template parameters such as:
  - pod priority - maybe the pod was preempted by higher-priority pods
  - resources - maybe it requests an unavailable resource, such as a GPU,
    but there is a limited number of nodes with GPUs
  - affinity rules - maybe affinities combined with too few matching nodes
    make it impossible to schedule pods
  - pod termination grace period - if too long, pods may stay in the
    terminating state for a long time
- Check if there are issues with attaching disks to the statefulset - for
  example the disk is in Zone A, but the pod is scheduled in Zone B.
- Check if the Horizontal Pod Autoscaler (HPA) is misbehaving due to untested
  values (resource requests).
- Check if cluster-autoscaler is able to create new nodes - see its logs or
  the cluster-autoscaler status configmap.
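A sketch for checking rollout progress directly from the statefulset status (`$NAMESPACE` and `$NAME` are placeholders):

```shell
# If these two revisions differ, the update has not finished rolling out.
kubectl -n "$NAMESPACE" get statefulset "$NAME" \
  -o jsonpath='{.status.currentRevision} {.status.updateRevision}{"\n"}'

# A partitioned RollingUpdate strategy intentionally holds back pods
# with ordinals below the partition value - check for it here.
kubectl -n "$NAMESPACE" get statefulset "$NAME" \
  -o jsonpath='{.spec.updateStrategy}{"\n"}'
```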

## Mitigation

TODO

See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
43 changes: 43 additions & 0 deletions content/runbooks/kubernetes/KubeVersionMismatch.md
@@ -0,0 +1,43 @@
---
title: Kube Version Mismatch
weight: 20
---

# KubeVersionMismatch

## Meaning

Different semantic versions of Kubernetes components are running.
This usually happens during the kubernetes cluster upgrade process.

<details>
<summary>Full context</summary>

Kubernetes control plane nodes and worker nodes use different versions.
This usually happens when a kubernetes cluster is upgraded between minor or
major versions.

</details>

## Impact

Incompatible API versions between kubernetes components may cause a very
broad range of issues, affecting anything from single containers, through app
stability, up to the stability of the whole cluster.

## Diagnosis

- Check existing kubernetes versions via `kubectl get nodes` and inspect the
  VERSION column.
- Check if there is an ongoing kubernetes upgrade - especially with managed
  services in the cloud.
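The version checks above boil down to two commands:

```shell
# Kubelet version per node - mismatches show up in the VERSION column.
kubectl get nodes -o wide

# Client and API server versions.
kubectl version
```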

## Mitigation

- Drain affected nodes, then upgrade or replace them with newer ones,
  see [Safely drain node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/)

- Ensure proper control plane and node pool versions are set when creating
  clusters.
- Enable automatic cluster updates for the control plane and node pools.
- Set proper maintenance windows for the clusters.
