Merge pull request prometheus-operator#22 from nvtkaszpir/runbook-kubernetes-3
paulfantom authored Feb 13, 2023
2 parents bb608b4 + 1616bee commit ff3bf6b
Showing 10 changed files with 266 additions and 9 deletions.
27 changes: 27 additions & 0 deletions content/runbooks/kubernetes/KubeJobFailed.md
@@ -0,0 +1,27 @@
---
title: Kube Job Failed
weight: 20
---

# KubeJobFailed

## Meaning

Job failed to complete.

## Impact

Processing of a scheduled task has failed.

## Diagnosis

- Check the job via `kubectl -n $NAMESPACE describe jobs $JOB`.
- Check pod events via `kubectl -n $NAMESPACE describe pod $POD_FROM_JOB`.
- Check pod logs via `kubectl -n $NAMESPACE logs $POD_FROM_JOB`
  (see the sketch below for locating the pod).
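
A minimal sketch for locating the pods created by the failed job, assuming the
job controller's default `job-name` label is present (`$NAMESPACE` and `$JOB`
are placeholders):

```shell
# List the pods created by the job, including failed ones
kubectl -n "$NAMESPACE" get pods -l job-name="$JOB"

# Fetch logs from the first matching pod of the job
POD_FROM_JOB=$(kubectl -n "$NAMESPACE" get pods -l job-name="$JOB" \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n "$NAMESPACE" logs "$POD_FROM_JOB" --all-containers
```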

## Mitigation

- See [Debugging Pods](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods)
- See [Job patterns](https://kubernetes.io/docs/tasks/job/)
- Redesign the job so that it is idempotent (it can be re-run many times and
  will always produce the same output even if the input differs); an idempotent
  job can then simply be re-run, as sketched below.
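
If the failing Job is created by a CronJob and the underlying problem has been
fixed, the run can be retried manually. A hedged sketch, assuming the parent
CronJob is named `$CRONJOB` (adjust names to your environment):

```shell
# Create a one-off job from the CronJob template to retry the failed run
kubectl -n "$NAMESPACE" create job --from=cronjob/"$CRONJOB" "${CRONJOB}-manual-retry"

# Watch its progress
kubectl -n "$NAMESPACE" get jobs -w
```
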
48 changes: 48 additions & 0 deletions content/runbooks/kubernetes/KubeMemoryOvercommit.md
@@ -0,0 +1,48 @@
---
title: Kube Memory Overcommit
weight: 20
aliases:
- /kubememovercommit/
---

# KubeMemoryOvercommit

## Meaning

Cluster has overcommitted Memory resource requests for Pods
and cannot tolerate node failure.

<details>
<summary>Full context</summary>

The total of memory resource requests for Pods exceeds the capacity the cluster
would retain after losing a node, so in case of node failure some Pods will not
fit on the remaining nodes.

</details>

## Impact

The cluster cannot tolerate node failure. In the event of a node failure,
some Pods will be in `Pending` state.

## Diagnosis

- Check if memory resource requests are adjusted to actual application usage
  (see the sketch below)
- Check if some nodes are available and not cordoned
- Check if cluster-autoscaler has issues adding new nodes
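
A quick sketch for comparing requested memory with node capacity, assuming
cluster-wide read access:

```shell
# Per-node memory requests and limits as a percentage of allocatable capacity
kubectl describe nodes | grep -A 8 "Allocated resources"

# Cordoned nodes show up with a SchedulingDisabled status
kubectl get nodes
```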

## Mitigation

- Add more nodes to the cluster; it is usually better to have more, smaller
  nodes than a few bigger ones.

- Add node pools with different instance types to avoid the problems that come
  with relying on a single instance type in the cloud.

- Use pod priorities to prevent important services from losing performance;
  see [pod priority and preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/)
  and the sketch after this list.

- Fine tune settings for special pods used with [cluster-autoscaler](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-cluster-autoscaler-work-with-pod-priority-and-preemption)

- Prepare performance tests for the expected workload and plan cluster capacity
  accordingly.
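
A minimal sketch of creating a priority class for important workloads, as
referenced in the pod priority item above (the name and value are illustrative):

```shell
# Pods that set priorityClassName to this class are scheduled first and can
# preempt lower-priority Pods when capacity runs short
kubectl create priorityclass business-critical \
  --value=1000000 \
  --description="Workloads that must keep running during capacity shortages"
```

Pods then opt in by setting `priorityClassName: business-critical` in their spec.
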
27 changes: 27 additions & 0 deletions content/runbooks/kubernetes/KubeletClientCertificateExpiration.md
@@ -0,0 +1,27 @@
---
title: Kubelet Client Certificate Expiration
weight: 20
---

# KubeletClientCertificateExpiration

## Meaning

The client certificate for the Kubelet on a node expires soon or has already expired.

## Impact

The node will not be usable within the cluster.

## Diagnosis

Check when the certificate was issued and when it expires.
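
A hedged sketch, assuming SSH access to the node and the default kubelet PKI
directory (`/var/lib/kubelet/pki`); paths may differ between distributions:

```shell
# Print the validity dates of the kubelet client certificate
openssl x509 -noout -dates \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem
```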

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
28 changes: 28 additions & 0 deletions content/runbooks/kubernetes/KubeletClientCertificateRenewalErrors.md
@@ -0,0 +1,28 @@
---
title: Kubelet Client Certificate Renewal Errors
weight: 20
---

# KubeletClientCertificateRenewalErrors

## Meaning

Kubelet on node has failed to renew its client certificate
(XX errors in the last 15 minutes)

## Impact

The node will not be usable within the cluster.

## Diagnosis

Check when the certificate was issued and when it expires.
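
A hedged sketch for finding the renewal errors, assuming SSH access to the node
and a systemd-managed kubelet:

```shell
# Look for certificate rotation errors in the kubelet logs
journalctl -u kubelet.service --since "1 hour ago" | grep -i certificate

# Check for CertificateSigningRequests stuck waiting for approval
kubectl get csr
```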

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
11 changes: 9 additions & 2 deletions content/runbooks/kubernetes/KubeletDown.md
@@ -1,3 +1,8 @@
---
title: Kubelet Down
weight: 20
---

# KubeletDown

## Meaning
@@ -18,7 +23,7 @@ debugging tools are likely not functional, e.g. `kubectl exec` and `kubectl logs
Check the status of nodes and for recent events on `Node` objects, or for recent
events in general:

```shell
$ kubectl get nodes
$ kubectl describe node $NODE_NAME
$ kubectl get events --field-selector 'involvedObject.kind=Node'
$ kubectl get events
```

If you have SSH access to the nodes, access the logs for the Kubelet directly:

```shell
$ journalctl -b -f -u kubelet.service
```

## Mitigation

The mitigation depends on what is causing the Kubelets to become
unresponsive. Check for widespread networking issues, or node-level
configuration issues.

See [Kubernetes Docs - kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
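
If you have SSH access to an affected node, a minimal sketch for checking and
restarting the kubelet (assuming a systemd-managed kubelet):

```shell
# Inspect the service state and the most recent log lines
systemctl status kubelet.service
journalctl -u kubelet.service -n 100 --no-pager

# Restart the kubelet and confirm the node becomes Ready again
sudo systemctl restart kubelet.service
kubectl get nodes -w
```
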
23 changes: 23 additions & 0 deletions content/runbooks/kubernetes/KubeletPlegDurationHigh.md
@@ -0,0 +1,23 @@
---
title: Kubelet Pod Lifecycle Event Generator Duration High
weight: 20
---

# KubeletPlegDurationHigh

## Meaning

The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of
XX seconds on node.

## Impact

TODO

## Diagnosis

TODO

## Mitigation

TODO
23 changes: 23 additions & 0 deletions content/runbooks/kubernetes/KubeletPodStartUpLatencyHigh.md
@@ -0,0 +1,23 @@
---
title: Kubelet Pod Start Up Latency High
weight: 20
---

# KubeletPodStartUpLatencyHigh

## Meaning

Kubelet Pod startup 99th percentile latency is XX seconds on node.

## Impact

Slow pod starts.

## Diagnosis

This is usually caused by exhausted IOPS for the node storage.
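
A hedged sketch for confirming storage saturation, assuming SSH access to the
node and the `sysstat` package installed there:

```shell
# Check node conditions reported by the kubelet (for example DiskPressure)
kubectl describe node "$NODE_NAME" | grep -A 10 "Conditions:"

# On the node: watch per-device utilization; %util near 100 indicates exhausted IOPS
iostat -dx 5
```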

## Mitigation

[Cordon and drain the node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) and delete it.
If the issue persists, look into the node logs.
27 changes: 27 additions & 0 deletions content/runbooks/kubernetes/KubeletServerCertificateExpiration.md
@@ -0,0 +1,27 @@
---
title: Kubelet Server Certificate Expiration
weight: 20
---

# KubeletServerCertificateExpiration

## Meaning

The server certificate for the Kubelet on a node expires soon or has already expired.

## Impact

**Critical**: The cluster will be in an inoperable state.

## Diagnosis

Check when the certificate was issued and when it expires.
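
A hedged sketch for inspecting the serving certificate remotely, assuming
network access to the kubelet port (10250 by default); `$NODE_IP` is a
placeholder:

```shell
# Print the validity dates of the certificate served on the kubelet HTTPS port
echo | openssl s_client -connect "$NODE_IP":10250 2>/dev/null \
  | openssl x509 -noout -dates
```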

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
28 changes: 28 additions & 0 deletions content/runbooks/kubernetes/KubeletServerCertificateRenewalErrors.md
@@ -0,0 +1,28 @@
---
title: Kubelet Server Certificate Renewal Errors
weight: 20
---

# KubeletServerCertificateRenewalErrors

## Meaning

Kubelet on node has failed to renew its server certificate
(XX errors in the last 5 minutes)

## Impact

**Critical**: The cluster will be in an inoperable state.

## Diagnosis

Check when the certificate was issued and when it expires.
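
When kubelet serving certificates are issued through the cluster
(`serverTLSBootstrap: true`), renewals can get stuck waiting for CSR approval.
A hedged sketch:

```shell
# Serving certificate renewals create CSRs that may need manual approval
kubectl get csr | grep -i pending

# Approve a specific request after verifying it (the name is illustrative)
kubectl certificate approve csr-example
```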

## Mitigation

Update the certificates on the cluster control plane nodes and the worker nodes.
Refer to the documentation of the tool used to create the cluster.

If only a single node is affected, another option is to delete that node.

In extreme situations, recreate the cluster.
33 changes: 26 additions & 7 deletions content/runbooks/kubernetes/KubeletTooManyPods.md
@@ -1,27 +1,46 @@
---
title: Kubelet Too Many Pods
weight: 20
---

# KubeletTooManyPods

## Meaning

The alert fires when a specific node is running >95% of its capacity of pods
(110 by default).

<details>
<summary>Full context</summary>

Kubelets have a configuration that limits how many Pods they can run.
The default value of this is 110 Pods per Kubelet, but it is configurable
(and this alert takes that configuration into account with the
`kube_node_status_capacity_pods` metric).

</details>

## Impact

Running many pods (more than 110) on a single node places a strain on the
Container Runtime Interface (CRI), Container Network Interface (CNI),
and the operating system itself. Approaching that limit may affect performance
and availability of that node.

## Diagnosis

Check the number of pods on a given node by running:

```shell
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node>
```

## Mitigation

Since Kubernetes only officially supports [110 pods per node](https://kubernetes.io/docs/setup/best-practices/cluster-large/),
you should preferably move pods onto other nodes or expand your cluster with more worker nodes.

If you're certain the node can handle more pods, you can raise the max pods
per node limit by changing `maxPods` in your [KubeletConfiguration](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/)
(for kubeadm-based clusters) or changing the setting in your cloud provider's
dashboard (if supported).
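
A minimal sketch for checking pod capacity and current usage per node, useful
before and after changing `maxPods` (read-only commands; `$NODE_NAME` is a
placeholder):

```shell
# Pod capacity reported by each kubelet
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS_CAPACITY:.status.capacity.pods

# Number of pods currently scheduled on a given node
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE_NAME" \
  --no-headers | wc -l
```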
