forked from prometheus-operator/runbooks
Commit
Merge pull request prometheus-operator#22 from nvtkaszpir/runbook-kub…
…ernetes-3
Showing 10 changed files with 266 additions and 9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
--- | ||
title: Kube Job Failed | ||
weight: 20 | ||
--- | ||
|
||
# KubeJobFailed | ||
|
||
## Meaning | ||
|
||
The Job failed to complete. | ||
|
||
## Impact | ||
|
||
A scheduled task failed to run to completion. | ||
|
||
## Diagnosis | ||
|
||
- Check job via `kubectl -n $NAMESPACE describe jobs $JOB`. | ||
- Check pod events via `kubectl -n $NAMESPACE describe pod $POD_FROM_JOB`. | ||
- Check pod logs via `kubectl -n $NAMESPACE logs $POD_FROM_JOB`. | ||
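
The checks above can be gathered into one small helper. This is only a sketch: the function name `diagnose_job` is hypothetical, and it assumes the Job's pods carry the `job-name` label that the Job controller normally sets.

```shell
# diagnose_job NAMESPACE JOB
# Hypothetical helper (not part of kubectl) that runs the diagnosis steps in order.
diagnose_job() {
  namespace="$1"
  job="$2"
  # Job status, conditions and recent events.
  kubectl -n "$namespace" describe job "$job"
  # Pods created by the Job; assumes they carry the job-name=<job> label.
  kubectl -n "$namespace" get pods --selector "job-name=$job"
  # Last 100 log lines from every container of those pods.
  kubectl -n "$namespace" logs --selector "job-name=$job" --all-containers --tail=100
}
```

Usage would look like `diagnose_job "$NAMESPACE" "$JOB"`.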
|
||
## Mitigation | ||
|
||
- See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods) | ||
- See [Job patterns](https://kubernetes.io/docs/tasks/job/) | ||
- Redesign the job so that it is idempotent (it can be re-run many times and | ||
always produces the same result for the same input). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
--- | ||
title: Kube Memory Overcommit | ||
weight: 20 | ||
aliases: | ||
- /kubememovercommit/ | ||
--- | ||
|
||
# KubeMemoryOvercommit | ||
|
||
## Meaning | ||
|
||
Cluster has overcommitted Memory resource requests for Pods | ||
and cannot tolerate node failure. | ||
|
||
<details> | ||
<summary>Full context</summary> | ||
|
||
The sum of Memory requests for pods exceeds the capacity the cluster would have left after losing a node. | ||
In case of a node failure, some pods will not fit on the remaining nodes. | ||
|
||
</details> | ||
|
||
## Impact | ||
|
||
The cluster cannot tolerate node failure. In the event of a node failure, | ||
some Pods will be in `Pending` state. | ||
|
||
## Diagnosis | ||
|
||
- Check if Memory resource requests are adjusted to the app usage | ||
- Check if some nodes are available and not cordoned | ||
- Check if cluster-autoscaler has issues with adding new nodes | ||
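
A minimal sketch of how these checks might be run with `kubectl`. The function names are hypothetical, and the `cluster-autoscaler-status` ConfigMap in `kube-system` is an assumption about a default cluster-autoscaler deployment; adjust it to your setup.

```shell
# Hypothetical helpers for the diagnosis steps above.

# Nodes that are cordoned (marked unschedulable).
list_cordoned_nodes() {
  kubectl get nodes --field-selector spec.unschedulable=true
}

# Requested versus allocatable resources, as reported per node.
show_node_allocations() {
  kubectl describe nodes | grep -A 8 "Allocated resources"
}

# Status written by cluster-autoscaler (assumed default name and namespace).
show_autoscaler_status() {
  kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
}
```

Comparing the requests shown by `show_node_allocations` with actual usage (for example from your monitoring) indicates whether requests are set too high.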
|
||
## Mitigation | ||
|
||
- Add more nodes to the cluster. It is usually better to have many smaller | ||
nodes than a few bigger ones. | ||
|
||
- Add node pools with different instance types to avoid problems that arise | ||
from relying on a single instance type in the cloud. | ||
|
||
- Use pod priorities to keep important services from losing performance, | ||
see [pod priority and preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) | ||
|
||
- Fine tune settings for special pods used with [cluster-autoscaler](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-cluster-autoscaler-work-with-pod-priority-and-preemption) | ||
|
||
- Prepare performance tests for the expected workload, plan cluster capacity | ||
accordingly. |
27 changes: 27 additions & 0 deletions
content/runbooks/kubernetes/KubeletClientCertificateExpiration.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
--- | ||
title: Kubelet Client Certificate Expiration | ||
weight: 20 | ||
--- | ||
|
||
# KubeletClientCertificateExpiration | ||
|
||
## Meaning | ||
|
||
The client certificate for the Kubelet on a node expires soon or has already expired. | ||
|
||
## Impact | ||
|
||
The node will no longer be usable within the cluster. | ||
|
||
## Diagnosis | ||
|
||
Check when the certificate was issued and when it expires. | ||
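
One way to read the issue and expiry dates is `openssl x509`, run on the affected node. This is a sketch: the function names are hypothetical, and the certificate path in the comment is the kubeadm default, which is an assumption about your cluster.

```shell
# cert_dates FILE
# Print the notBefore / notAfter dates of a PEM certificate.
cert_dates() {
  openssl x509 -in "$1" -noout -dates
}

# cert_valid_for FILE SECONDS
# Exit status 0 if the certificate is still valid SECONDS from now.
cert_valid_for() {
  openssl x509 -in "$1" -noout -checkend "$2"
}

# Example (path is an assumption, kubeadm default):
#   cert_dates /var/lib/kubelet/pki/kubelet-client-current.pem
```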
|
||
## Mitigation | ||
|
||
Update certificates on the cluster control plane nodes and the worker nodes. | ||
Refer to the documentation of the tool used to create the cluster. | ||
|
||
Another option is to delete the node if only one node is affected. | ||
|
||
In extreme situations, recreate the cluster. |
28 changes: 28 additions & 0 deletions
content/runbooks/kubernetes/KubeletClientCertificateRenewalErrors.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
--- | ||
title: Kubelet Client Certificate Renewal Errors | ||
weight: 20 | ||
--- | ||
|
||
# KubeletClientCertificateRenewalErrors | ||
|
||
## Meaning | ||
|
||
Kubelet on node has failed to renew its client certificate | ||
(XX errors in the last 15 minutes) | ||
|
||
## Impact | ||
|
||
The node will no longer be usable within the cluster. | ||
|
||
## Diagnosis | ||
|
||
Check when the certificate was issued and when it expires. | ||
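
Renewal goes through certificate signing requests, so it can also help to look at those and at the Kubelet's own log. A sketch, assuming the Kubelet runs as a systemd unit named `kubelet`; the function names are hypothetical.

```shell
# Certificate signing requests, oldest first, with their approval condition.
list_csrs() {
  kubectl get csr --sort-by=.metadata.creationTimestamp
}

# Certificate-related Kubelet log lines from the last hour.
# Run on the affected node; assumes a systemd unit called "kubelet".
kubelet_cert_log() {
  journalctl -u kubelet --since "1 hour ago" --no-pager | grep -i "certificate"
}
```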
|
||
## Mitigation | ||
|
||
Update certificates on the cluster control plane nodes and the worker nodes. | ||
Refer to the documentation of the tool used to create the cluster. | ||
|
||
Another option is to delete the node if only one node is affected. | ||
|
||
In extreme situations, recreate the cluster. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
--- | ||
title: Kubelet Pod Lifecycle Event Generator Duration High | ||
weight: 20 | ||
--- | ||
|
||
# KubeletPlegDurationHigh | ||
|
||
## Meaning | ||
|
||
The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of | ||
XX seconds on node. | ||
|
||
## Impact | ||
|
||
TODO | ||
|
||
## Diagnosis | ||
|
||
TODO | ||
|
||
## Mitigation | ||
|
||
TODO |
23 changes: 23 additions & 0 deletions
content/runbooks/kubernetes/KubeletPodStartUpLatencyHigh.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
--- | ||
title: Kubelet Pod Start Up Latency High | ||
weight: 20 | ||
--- | ||
|
||
# KubeletPodStartUpLatencyHigh | ||
|
||
## Meaning | ||
|
||
Kubelet Pod startup 99th percentile latency is XX seconds on node. | ||
|
||
## Impact | ||
|
||
Slow pod starts. | ||
|
||
## Diagnosis | ||
|
||
Usually caused by exhausted IOPS on the node's storage. | ||
|
||
## Mitigation | ||
|
||
[Cordon and drain the node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) and delete it. | ||
If the issue persists, look into the node logs. |
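
The cordon and drain step could be sketched as follows. The function name `drain_node` is hypothetical, and `--delete-emptydir-data` assumes a reasonably recent `kubectl` (older releases used `--delete-local-data`).

```shell
# drain_node NODE
# Hypothetical helper: stop scheduling onto the node, then evict its pods.
drain_node() {
  node="$1"
  # Mark the node unschedulable.
  kubectl cordon "$node" &&
    # Evict pods; DaemonSet pods are skipped, emptyDir data is discarded.
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Deleting the node afterwards is left as a deliberate manual step:
#   kubectl delete node <node>
```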
27 changes: 27 additions & 0 deletions
content/runbooks/kubernetes/KubeletServerCertificateExpiration.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
--- | ||
title: Kubelet Server Certificate Expiration | ||
weight: 20 | ||
--- | ||
|
||
# KubeletServerCertificateExpiration | ||
|
||
## Meaning | ||
|
||
The server certificate for the Kubelet on a node expires soon or has already expired. | ||
|
||
## Impact | ||
|
||
**Critical** - The cluster will be in an inoperable state. | ||
|
||
## Diagnosis | ||
|
||
Check when the certificate was issued and when it expires. | ||
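
Because this is a serving certificate, its dates can also be read over the network from the Kubelet's HTTPS port. A sketch, assuming the Kubelet listens on its default port 10250 and that the port is reachable from where you run the command; the function name is hypothetical.

```shell
# kubelet_server_cert_dates NODE_ADDRESS
# Fetch the certificate presented by the Kubelet and print its validity dates.
kubelet_server_cert_dates() {
  node_address="$1"
  # Port 10250 is the Kubelet default (an assumption about this cluster).
  openssl s_client -connect "$node_address:10250" </dev/null 2>/dev/null |
    openssl x509 -noout -dates
}
```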
|
||
## Mitigation | ||
|
||
Update certificates on the cluster control plane nodes and the worker nodes. | ||
Refer to the documentation of the tool used to create the cluster. | ||
|
||
Another option is to delete the node if only one node is affected. | ||
|
||
In extreme situations, recreate the cluster. |
28 changes: 28 additions & 0 deletions
content/runbooks/kubernetes/KubeletServerCertificateRenewalErrors.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
--- | ||
title: Kubelet Server Certificate Renewal Errors | ||
weight: 20 | ||
--- | ||
|
||
# KubeletServerCertificateRenewalErrors | ||
|
||
## Meaning | ||
|
||
Kubelet on node has failed to renew its server certificate | ||
(XX errors in the last 5 minutes) | ||
|
||
## Impact | ||
|
||
**Critical** - The cluster will be in an inoperable state. | ||
|
||
## Diagnosis | ||
|
||
Check when the certificate was issued and when it expires. | ||
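
One possible cause worth ruling out is a serving-certificate signing request that was never approved. The sketch below assumes the Kubelet requests its server certificate through the `kubernetes.io/kubelet-serving` signer; the function names are hypothetical.

```shell
# Signing requests for Kubelet serving certificates and their condition.
list_kubelet_serving_csrs() {
  kubectl get csr --field-selector spec.signerName=kubernetes.io/kubelet-serving
}

# approve_csr NAME
# Approve a request after verifying that it is legitimate.
approve_csr() {
  kubectl certificate approve "$1"
}
```

Only approve a request once you have confirmed that it comes from the expected node.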
|
||
## Mitigation | ||
|
||
Update certificates on the cluster control plane nodes and the worker nodes. | ||
Refer to the documentation of the tool used to create the cluster. | ||
|
||
Another option is to delete the node if only one node is affected. | ||
|
||
In extreme situations, recreate the cluster. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,27 +1,46 @@ | ||
--- | ||
title: Kubelet Too Many Pods | ||
weight: 20 | ||
--- | ||
|
||
# KubeletTooManyPods | ||
|
||
## Meaning | ||
|
||
The alert fires when a specific node is running >95% of its capacity of pods | ||
(110 by default). | ||
|
||
<details> | ||
<summary>Full context</summary> | ||
|
||
Kubelets have a configuration that limits how many Pods they can run. | ||
The default value of this is 110 Pods per Kubelet, but it is configurable | ||
(and this alert takes that configuration into account with the | ||
`kube_node_status_capacity_pods` metric). | ||
|
||
</details> | ||
|
||
## Impact | ||
|
||
Running many pods (more than 110) on a single node places a strain on the | ||
Container Runtime Interface (CRI), Container Network Interface (CNI), | ||
and the operating system itself. Approaching that limit may affect performance | ||
and availability of that node. | ||
|
||
## Diagnosis | ||
|
||
Check the number of pods on a given node by running: | ||
|
||
```shell | ||
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> | ||
``` | ||
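
To compare the pod count against the node's limit directly, something like the following may help. It is a sketch with hypothetical function names; `.status.capacity.pods` is the node field that reports the configured pod capacity.

```shell
# count_pods_on_node NODE
# Number of pods currently bound to the node, across all namespaces.
count_pods_on_node() {
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1" | wc -l
}

# node_pod_capacity NODE
# Pod capacity the node reports (110 unless maxPods was changed).
node_pod_capacity() {
  kubectl get node "$1" -o jsonpath='{.status.capacity.pods}'
}
```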
|
||
## Mitigation | ||
|
||
Since Kubernetes only officially supports [110 pods per node](https://kubernetes.io/docs/setup/best-practices/cluster-large/), | ||
you should preferably move pods onto other nodes or expand your cluster with more worker nodes. | ||
|
||
If you're certain the node can handle more pods, you can raise the max pods | ||
per node limit by changing `maxPods` in your [KubeletConfiguration](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/) | ||
(for kubeadm-based clusters) or changing the setting in your cloud provider's | ||
dashboard (if supported). |