forked from prometheus-operator/runbooks
Merge pull request prometheus-operator#20 from nvtkaszpir/runbook-kubernetes-5
Showing 7 changed files with 265 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
--- | ||
title: Kube Quota Exceeded | ||
weight: 20 | ||
--- | ||
|
||
# KubeQuotaExceeded | ||
|
||
## Meaning | ||
|
||
The cluster has reached the allowed hard limits for the given namespace. | ||
|
||
## Impact | ||
|
||
Inability to create resources in Kubernetes. | ||
|
||
## Diagnosis | ||
|
||
- Check resource usage for the namespace over the given time span. | ||
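
As a rough illustration of the usage check, the sketch below computes how much of each hard limit is consumed. It assumes the quota was exported with `kubectl -n $NAMESPACE get resourcequota $NAME -o json` and reads the `status.hard` and `status.used` fields. The function names, the sample values, and the simplified quantity parser are assumptions made for this example and are not part of the runbook.

```python
from decimal import Decimal

# Multipliers for common Kubernetes quantity suffixes (simplified parser,
# an assumption for this sketch; it does not cover every valid quantity).
SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"), "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
}


def parse_quantity(text):
    """Convert a quantity string such as "500m" or "2Gi" to a Decimal."""
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Decimal(text)


def quota_usage(quota):
    """Return {resource: used / hard} for every non-zero hard limit."""
    status = quota.get("status", {})
    used = status.get("used", {})
    ratios = {}
    for resource, limit in status.get("hard", {}).items():
        hard_value = parse_quantity(limit)
        if hard_value == 0:
            continue  # a zero limit cannot be expressed as a ratio
        used_value = parse_quantity(used.get(resource, "0"))
        ratios[resource] = float(used_value / hard_value)
    return ratios


# Hypothetical sample shaped like a single ResourceQuota object.
SAMPLE_QUOTA = {
    "status": {
        "hard": {"pods": "10", "requests.cpu": "4", "requests.memory": "8Gi"},
        "used": {"pods": "10", "requests.cpu": "3500m", "requests.memory": "2Gi"},
    }
}
```

Calling `quota_usage(SAMPLE_QUOTA)` gives a ratio per resource. A value of 1.0 or above marks a limit that is already exhausted, so new objects of that kind would likely be rejected.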
|
||
## Mitigation | ||
|
||
- Review the existing quota for the given namespace and adjust it accordingly. | ||
- Review the resources counted against the quota and fine-tune them. | ||
- Continue with standard capacity planning procedures. | ||
- See [Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
--- | ||
title: Kube Quota Fully Used | ||
weight: 20 | ||
--- | ||
|
||
# KubeQuotaFullyUsed | ||
|
||
## Meaning | ||
|
||
The cluster has reached the allowed limits for the given namespace. | ||
|
||
## Impact | ||
|
||
New app installations may not be possible. | ||
|
||
## Diagnosis | ||
|
||
- Check resource usage for the namespace over the given time span. | ||
|
||
## Mitigation | ||
|
||
- Review the existing quota for the given namespace and adjust it accordingly. | ||
- Review the resources counted against the quota and fine-tune them. | ||
- Continue with standard capacity planning procedures. | ||
- See [Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,22 @@ | ||
--- | ||
title: Kube Scheduler Down | ||
weight: 20 | ||
--- | ||
|
||
# KubeSchedulerDown | ||
|
||
Runbook available at https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html#recovering-a-scheduler | ||
## Meaning | ||
|
||
Kube Scheduler has disappeared from Prometheus target discovery. | ||
|
||
## Impact | ||
|
||
This is a critical alert. The cluster may be partially or fully non-functional. | ||
|
||
## Diagnosis | ||
|
||
To be added. | ||
|
||
## Mitigation | ||
|
||
See old CoreOS docs in [Web Archive](http://web.archive.org/web/20201026205154/https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html) |
54 changes: 54 additions & 0 deletions
content/runbooks/kubernetes/KubeStatefulSetGenerationMismatch.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
--- | ||
title: Kube StatefulSet Generation Mismatch | ||
weight: 20 | ||
--- | ||
|
||
# KubeStatefulSetGenerationMismatch | ||
|
||
## Meaning | ||
|
||
StatefulSet generation mismatch due to possible roll-back. | ||
|
||
## Impact | ||
|
||
Service degradation or unavailability. | ||
|
||
## Diagnosis | ||
|
||
See [Kubernetes Docs - Failed Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment), | ||
which can also be applied to StatefulSets to some extent. | ||
|
||
- Check the rollout history via `kubectl -n $NAMESPACE rollout history statefulset $NAME`. | ||
- Check the rollout status and whether the rollout is paused. | ||
- Check the StatefulSet status via `kubectl -n $NAMESPACE describe statefulset $NAME`. | ||
- Check how many replicas are declared. | ||
- Investigate whether new pods are crashing. | ||
- Look for issues with PersistentVolumes attached to the StatefulSet. | ||
- Check the status of the pods which belong to the StatefulSet. | ||
- Check pod template parameters such as: | ||
  - pod priority - maybe it was evicted by other, more important pods | ||
  - resources - maybe it tries to use an unavailable resource, such as a GPU, | ||
    but there is a limited number of nodes with GPUs | ||
  - affinity rules - maybe due to affinities and not enough nodes it is | ||
    not possible to schedule pods | ||
  - pod termination grace period - if it is too long, pods may stay in the | ||
    Terminating state for too long | ||
- Check whether the Horizontal Pod Autoscaler (HPA) was triggered by untested | ||
  values (resource requests). | ||
- Check whether cluster-autoscaler is able to create new nodes - see its logs or | ||
  the cluster-autoscaler status ConfigMap. | ||
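
To make the generation check concrete, the following sketch compares `metadata.generation` with `status.observedGeneration` on a StatefulSet object, for example one exported with `kubectl -n $NAMESPACE get statefulset $NAME -o json`. It assumes the alert is based on these two fields differing. The function name and the sample object are hypothetical.

```python
def generation_status(statefulset):
    """Compare the declared generation with the one the controller observed."""
    generation = statefulset["metadata"]["generation"]
    # observedGeneration is absent until the controller has processed the object.
    observed = statefulset.get("status", {}).get("observedGeneration")
    return {
        "generation": generation,
        "observed_generation": observed,
        "mismatch": observed != generation,
    }


# Hypothetical sample: the spec was changed (generation 5), but the
# controller has only processed generation 4 so far.
SAMPLE_STATEFULSET = {
    "metadata": {"name": "db", "generation": 5},
    "status": {"observedGeneration": 4},
}
```

A mismatch that persists suggests the controller is not reconciling the latest spec, so the controller logs and the events shown by `kubectl describe` are a reasonable next place to look.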
|
||
## Mitigation | ||
|
||
StatefulSets are quite specific and usually run special scripts on pod termination. | ||
Check whether special commands are executed, such as data migration, which may significantly slow down progress. | ||
|
||
When scaling out, adding new nodes usually solves the issue. | ||
|
||
Otherwise, the StatefulSet definition probably needs to be fixed. | ||
|
||
In rare cases roll back to the previous version - see [Kubernetes Docs - Rolling Back](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates) | ||
|
||
In extremely rare situations it may be better to delete problematic pods. | ||
|
||
See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods) |
56 changes: 56 additions & 0 deletions
content/runbooks/kubernetes/KubeStatefulSetReplicasMismatch.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
--- | ||
title: Kube StatefulSet Replicas Mismatch | ||
weight: 20 | ||
--- | ||
|
||
# KubeStatefulSetReplicasMismatch | ||
|
||
## Meaning | ||
|
||
StatefulSet has not matched the expected number of replicas. | ||
|
||
<details> | ||
<summary>Full context</summary> | ||
|
||
The Kubernetes StatefulSet resource does not have the number of replicas that | ||
were declared to be in operation. | ||
For example, a StatefulSet is expected to have 3 replicas, but it has fewer than | ||
that for a noticeable period of time. | ||
|
||
On rare occasions there may be more replicas than there should be, and the | ||
system did not clean them up. | ||
</details> | ||
|
||
## Impact | ||
|
||
Service degradation or unavailability. | ||
|
||
## Diagnosis | ||
|
||
- Check the StatefulSet via `kubectl -n $NAMESPACE describe statefulset $NAME`. | ||
- Check how many replicas are declared. | ||
- Check the status of the pods which belong to the | ||
  StatefulSet. | ||
- Check pod template parameters such as: | ||
  - pod priority - maybe it was evicted by other, more important pods | ||
  - resources - maybe it tries to use an unavailable resource, such as a GPU, but | ||
    there is a limited number of nodes with GPUs | ||
  - affinity rules - maybe due to affinities and not enough nodes it is | ||
    not possible to schedule pods | ||
  - pod termination grace period - if it is too long, pods may stay in the | ||
    Terminating state for too long | ||
- Check whether there are issues with attaching disks to the StatefulSet - for example | ||
  the disk is in Zone A, but the pod is scheduled in Zone B. | ||
- Check whether the Horizontal Pod Autoscaler (HPA) was triggered by untested | ||
  values (resource requests). | ||
- Check whether cluster-autoscaler is able to create new nodes - see its logs or | ||
  the cluster-autoscaler status ConfigMap. | ||
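
The replica check above can be sketched as a small helper that compares the declared replica count with the ready count. It assumes a StatefulSet object exported with `kubectl -n $NAMESPACE get statefulset $NAME -o json`. The helper name and the sample object are hypothetical, and the exact fields used by the alert rule may differ.

```python
def replica_summary(statefulset):
    """Summarise declared versus ready replicas of one StatefulSet."""
    spec = statefulset.get("spec", {})
    status = statefulset.get("status", {})
    declared = spec.get("replicas", 1)      # the API defaults replicas to 1
    ready = status.get("readyReplicas", 0)  # the field is omitted when zero
    return {
        "declared": declared,
        "ready": ready,
        # Positive: replicas are missing. Zero: the counts match.
        "difference": declared - ready,
    }


# Hypothetical sample: three replicas declared, only one ready.
SAMPLE_STATEFULSET = {
    "metadata": {"name": "db"},
    "spec": {"replicas": 3},
    "status": {"replicas": 3, "readyReplicas": 1},
}
```

A non-zero `difference` that does not shrink over time is the cue to inspect the individual pods, their events, and their volumes.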
|
||
## Mitigation | ||
|
||
Depending on the conditions, adding new nodes usually solves the issue. | ||
|
||
Set proper affinity rules to schedule pods in the same zone to avoid issues | ||
with volumes. | ||
|
||
See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods) |
42 changes: 42 additions & 0 deletions
content/runbooks/kubernetes/KubeStatefulSetUpdateNotRolledOut.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
--- | ||
title: Kube StatefulSet Update Not RolledOut | ||
weight: 20 | ||
--- | ||
|
||
# KubeStatefulSetUpdateNotRolledOut | ||
|
||
## Meaning | ||
|
||
StatefulSet update has not been rolled out. | ||
|
||
## Impact | ||
|
||
Service degradation or unavailability. | ||
|
||
## Diagnosis | ||
|
||
- Check the StatefulSet via `kubectl -n $NAMESPACE describe statefulset $NAME`. | ||
- Check whether the StatefulSet update was paused manually (see status). | ||
- Check how many replicas are declared. | ||
- Check the status of the pods which belong to the | ||
  StatefulSet. | ||
- Check pod template parameters such as: | ||
  - pod priority - maybe it was evicted by other, more important pods | ||
  - resources - maybe it tries to use an unavailable resource, such as a GPU, but | ||
    there is a limited number of nodes with GPUs | ||
  - affinity rules - maybe due to affinities and not enough nodes it is | ||
    not possible to schedule pods | ||
  - pod termination grace period - if it is too long, pods may stay in the | ||
    Terminating state for too long | ||
- Check whether there are issues with attaching disks to the StatefulSet - for example | ||
  the disk is in Zone A, but the pod is scheduled in Zone B. | ||
- Check whether the Horizontal Pod Autoscaler (HPA) was triggered by untested | ||
  values (resource requests). | ||
- Check whether cluster-autoscaler is able to create new nodes - see its logs or | ||
  the cluster-autoscaler status ConfigMap. | ||
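
One way to see whether an update is still pending is to compare the revisions reported in the StatefulSet status. The sketch below is a simplified check, assuming an object exported with `kubectl -n $NAMESPACE get statefulset $NAME -o json`. It also reports the update strategy, because an `OnDelete` strategy or a non-zero `partition` holds pods back from being updated automatically. The function name and sample values are hypothetical.

```python
def rollout_summary(statefulset):
    """Report revision and update-strategy details for one StatefulSet."""
    spec = statefulset.get("spec", {})
    status = statefulset.get("status", {})
    strategy = spec.get("updateStrategy", {})
    current = status.get("currentRevision")
    update = status.get("updateRevision")
    return {
        "current_revision": current,
        "update_revision": update,
        "updated_replicas": status.get("updatedReplicas", 0),
        "strategy": strategy.get("type", "RollingUpdate"),
        # Pods with an ordinal below the partition are not updated.
        "partition": strategy.get("rollingUpdate", {}).get("partition", 0),
        # Simplification: treat equal revisions as a finished rollout.
        "rolled_out": current is not None and current == update,
    }


# Hypothetical sample: the new revision exists, but only one pod runs it.
SAMPLE_STATEFULSET = {
    "spec": {"replicas": 3, "updateStrategy": {"type": "RollingUpdate"}},
    "status": {
        "currentRevision": "db-7d9c5b6f4",
        "updateRevision": "db-5c8f7d9b2",
        "updatedReplicas": 1,
    },
}
```

If `rolled_out` stays false while `updated_replicas` does not grow, the pod that is next in line for the update is usually the one to inspect.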
|
||
## Mitigation | ||
|
||
TODO | ||
|
||
See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
--- | ||
title: Kube Version Mismatch | ||
weight: 20 | ||
--- | ||
|
||
# KubeVersionMismatch | ||
|
||
## Meaning | ||
|
||
Different semantic versions of Kubernetes components are running. | ||
This usually happens during the Kubernetes cluster upgrade process. | ||
|
||
<details> | ||
<summary>Full context</summary> | ||
|
||
Kubernetes control plane nodes or worker nodes use different versions. | ||
This usually happens when the Kubernetes cluster is upgraded between minor or | ||
major versions. | ||
|
||
</details> | ||
|
||
## Impact | ||
|
||
Incompatible API versions between Kubernetes components may cause a very | ||
broad range of issues, from affecting single containers, through application | ||
stability, to the stability of the whole cluster. | ||
|
||
## Diagnosis | ||
|
||
- Check the existing Kubernetes versions via `kubectl get nodes` and see the | ||
  VERSION column. | ||
- Check whether a Kubernetes upgrade is in progress - especially in managed services | ||
  in the cloud. | ||
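
On larger clusters it can help to group nodes by version instead of reading the VERSION column by eye. The sketch below assumes the node list was exported with `kubectl get nodes -o json` and reads `status.nodeInfo.kubeletVersion`, which is the value shown in that column. The function name, node names, and versions are hypothetical.

```python
from collections import defaultdict


def nodes_by_kubelet_version(node_list):
    """Group node names by the kubelet version each node reports."""
    versions = defaultdict(list)
    for node in node_list.get("items", []):
        version = node["status"]["nodeInfo"]["kubeletVersion"]
        versions[version].append(node["metadata"]["name"])
    return {version: sorted(names) for version, names in versions.items()}


# Hypothetical sample shaped like a NodeList with three nodes.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "worker-b"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.27.3"}}},
        {"metadata": {"name": "worker-a"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.27.3"}}},
        {"metadata": {"name": "worker-c"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.26.6"}}},
    ]
}
```

More than one key in the result means the nodes run mixed versions. The smaller groups are usually the nodes that still need to be upgraded or replaced.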
|
||
## Mitigation | ||
|
||
- Drain affected nodes, then upgrade or replace them with newer ones, | ||
  see [Safely drain node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) | ||
|
||
- Make sure to set the proper control plane and node pool versions when | ||
  creating clusters. | ||
- Ensure automatic cluster updates are enabled for the control plane and node pools. | ||
- Set proper maintenance windows for the clusters. |