Merge pull request prometheus-operator#13 from nvtkaszpir/runbooks-node
paulfantom authored Feb 13, 2023
2 parents 73710f0 + 441e935 commit ea632f2
Showing 13 changed files with 235 additions and 27 deletions.
22 changes: 22 additions & 0 deletions content/runbooks/node/NodeClockNotSynchronising.md
@@ -0,0 +1,22 @@
---
title: Node Clock Not Synchronising
weight: 20
---

# NodeClockNotSynchronising

## Meaning

The clock on the node is not synchronising.

## Impact

Time is not automatically synchronising on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications.

## Diagnosis

TODO
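
A minimal starting point, assuming the node uses systemd with chrony as the time-sync daemon (adjust for systemd-timesyncd or ntpd):

```shell
# Is system clock synchronization active at all?
$ timedatectl status

# Is the sync service running? (chronyd is an assumption; swap in your daemon)
$ systemctl status chronyd

# Can chrony reach its configured time sources?
$ chronyc sources -v
```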

## Mitigation

See [Node Clock Skew Detected]({{< ref "./NodeClockSkewDetected.md" >}}) for mitigation steps.
33 changes: 33 additions & 0 deletions content/runbooks/node/NodeClockSkewDetected.md
@@ -0,0 +1,33 @@
---
title: Node Clock Skew Detected
weight: 20
---

# NodeClockSkewDetected

## Meaning

Clock skew has been detected on the node.

## Impact

Time is skewed on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications.

## Diagnosis

TODO
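
A minimal sketch for inspecting the current offset, assuming chrony is the time-sync daemon:

```shell
# Show the current offset, stratum, and sync status
$ chronyc tracking

# Per-source statistics, including measured offsets
$ chronyc sourcestats
```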

## Mitigation

Ensure the time synchronization service is running.
Configure proper time servers.
Ensure time is synced at server start, especially when using
low power modes or hibernation.
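
A minimal sketch of the first steps, assuming systemd with chrony:

```shell
# Make sure the sync service starts now and on every boot
$ systemctl enable --now chronyd

# Time servers are configured via server/pool directives in /etc/chrony.conf

# Force an immediate correction of a large offset
$ chronyc makestep
```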

A resource-consuming process can cause clock issues on given hardware,
so consider moving it to a different server.

On physical servers, check whether the on-board battery requires replacement.
Check for hardware errors.
Check for firmware updates.
Consider using newer hardware (such as the server mainboard).
7 changes: 6 additions & 1 deletion content/runbooks/node/NodeFileDescriptorLimit.md
@@ -1,3 +1,8 @@
---
title: Node File Descriptor Limit
weight: 20
---

# NodeFileDescriptorLimit

## Meaning
@@ -17,7 +22,7 @@ node.
You can open a shell on the node and use the standard Linux utilities to
diagnose the issue:

```console
```shell
$ NODE_NAME='<value of instance label from alert>'

$ oc debug "node/$NODE_NAME"
11 changes: 7 additions & 4 deletions content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md
@@ -1,8 +1,13 @@
---
title: Node Filesystem Almost Out Of Files
weight: 20
---

# NodeFilesystemAlmostOutOfFiles

## Meaning

This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather
This alert is similar to the NodeFilesystemSpaceFillingUp alert, but rather
than being based on a prediction that a filesystem will run out of inodes in a
certain amount of time, it uses simple static thresholds. The alert fires
at a `warning` level with 5% of available inodes left, and at a `critical` level
@@ -18,10 +23,8 @@ of the cluster.

## Diagnosis

Refer to the [NodeFilesystemFilesFillingUp][1] runbook.

## Mitigation

Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
See [Node Filesystem Files Filling Up]({{< ref "./NodeFilesystemFilesFillingUp.md" >}})

[1]: ./NodeFilesystemFilesFillingUp.md
13 changes: 7 additions & 6 deletions content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md
@@ -1,8 +1,13 @@
---
title: Node Filesystem Almost Out Of Space
weight: 20
---

# NodeFilesystemAlmostOutOfSpace

## Meaning

This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather
This alert is similar to the NodeFilesystemSpaceFillingUp alert, but rather
than being based on a prediction that a filesystem will become full in a certain
amount of time, it uses simple static thresholds. The alert fires at a
`warning` level with 5% space left, and at a `critical` level with 3% space left.
@@ -17,10 +22,6 @@ of the cluster.

## Diagnosis

Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.

## Mitigation

Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.

[1]: ./NodeFilesystemSpaceFillingUp.md
See [Node Filesystem Space Filling Up]({{< ref "./NodeFilesystemSpaceFillingUp.md" >}})
13 changes: 9 additions & 4 deletions content/runbooks/node/NodeFilesystemFilesFillingUp.md
@@ -1,8 +1,13 @@
---
title: Node Filesystem Files Filling Up
weight: 20
---

# NodeFilesystemFilesFillingUp

## Meaning

This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but
This alert is similar to the NodeFilesystemSpaceFillingUp alert, but
predicts the filesystem will run out of inodes rather than bytes of storage
space. The alert fires at a `critical` level when the filesystem is predicted to
run out of available inodes within four hours.
@@ -21,7 +26,7 @@ Note the `instance` and `mountpoint` labels from the alert. You can graph the
usage history of this filesystem with the following query in the OpenShift web
console:

```text
```promql
node_filesystem_files_free{
instance="<value of instance label from alert>",
mountpoint="<value of mountpoint label from alert>"
@@ -31,7 +36,7 @@ node_filesystem_files_free{
You can also open a debug session on the node and use the standard Linux
utilities to locate the source of the usage:

```console
```shell
$ MOUNT_POINT='<value of mountpoint label from alert>'
$ NODE_NAME='<value of instance label from alert>'

@@ -50,4 +55,4 @@ size. You may be able to solve the problem, or buy time, by increasing the size of
the storage volume. Otherwise, determine the application that is creating large
numbers of files and adjust its configuration or provide it dedicated storage.
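
A hedged helper for locating inode-heavy directories, assuming GNU `du` is available inside the mount point:

```shell
# Count inodes per directory under the affected mountpoint, largest first
$ du --inodes -x "$MOUNT_POINT" | sort -rn | head -n 20
```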

[1]: ./NodeFilesystemSpaceFillingUp.md
See [Node Filesystem Space Filling Up]({{< ref "./NodeFilesystemSpaceFillingUp.md" >}}) for additional mitigation steps.
29 changes: 19 additions & 10 deletions content/runbooks/node/NodeFilesystemSpaceFillingUp.md
@@ -1,3 +1,8 @@
---
title: Node Filesystem Space Filling Up
weight: 20
---

# NodeFilesystemSpaceFillingUp

## Meaning
@@ -11,8 +16,12 @@ time is less than 4h.
<details>
<summary>Full context</summary>

The filesystem on Kubernetes nodes mainly consists of the operating system, [container ephemeral storage][1], container images, and container logs.
Since Kubelet automatically handles [cleaning up old logs][2] and [deleting unused images][3], container ephemeral storage is a common cause of this alert. Although this alert may be triggered before Kubelet's garbage collection kicks in.
The filesystem on Kubernetes nodes mainly consists of the operating system,
[container ephemeral storage][1], container images, and container logs.
Since Kubelet automatically handles [cleaning up old logs][2] and
[deleting unused images][3], container ephemeral storage is a common cause of
this alert. The alert may, however, be triggered before the kubelet's garbage
collection kicks in.

</details>

@@ -31,7 +40,7 @@ and/or recent offenders. Is this an irregular condition, e.g. a process failing
to clean up behind itself, or is this organic growth? If monitoring is enabled,
the following metric can be watched in PromQL:

```console
```promql
node_filesystem_free_bytes
```

@@ -44,31 +53,31 @@ removing unused images solves that issue:

Debug the node by accessing the node filesystem:

```console
```shell
$ NODE_NAME=<instance label from alert>
$ kubectl -n default debug node/$NODE_NAME
$ chroot /host
```

Remove dangling images:

```console
```shell
# Assumes a Docker runtime on the node; adapt for containerd/CRI-O
$ docker image prune -f
```

Remove unused images:

```console
```shell
# Assumes crictl is available (CRI runtimes such as containerd or CRI-O)
$ crictl rmi --prune
```

Exit debug:

```console
```shell
$ exit
$ exit
```

[1]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#local-ephemeral-storage
[2]: https://kubernetes.io/docs/concepts/cluster-administration/logging/
[3]: https://kubernetes.io/docs/concepts/architecture/garbage-collection/#containers-images
[1]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#local-ephemeral-storage
[2]: https://kubernetes.io/docs/concepts/cluster-administration/logging/
[3]: https://kubernetes.io/docs/concepts/architecture/garbage-collection/#containers-images
24 changes: 24 additions & 0 deletions content/runbooks/node/NodeHighNumberConntrackEntriesUsed.md
@@ -0,0 +1,24 @@
---
title: Node High Number Conntrack Entries Used
weight: 20
---

# NodeHighNumberConntrackEntriesUsed

## Meaning

The number of conntrack entries is getting close to the limit.

## Impact

When the limit is reached, new connections are dropped, degrading service quality.

## Diagnosis

Check the current conntrack usage on the node.
Check which applications are generating large numbers of connections.
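
A minimal sketch, assuming the `nf_conntrack` module is loaded and conntrack-tools is installed on the node:

```shell
# Current usage vs. limit
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Summary of conntrack events and errors
$ conntrack -S

# Sample entries to spot which addresses dominate the table
$ conntrack -L | head -n 50
```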

## Mitigation

Migrate some pods to other nodes.
Raise the conntrack limit directly on the node, remembering to make it persistent across node reboots.
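
A minimal sketch for raising the limit (the value is an example; size it to the node's memory):

```shell
# Raise the limit at runtime
$ sysctl -w net.netfilter.nf_conntrack_max=262144

# Persist the change across reboots
$ echo 'net.netfilter.nf_conntrack_max = 262144' | sudo tee /etc/sysctl.d/90-conntrack.conf
```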
32 changes: 32 additions & 0 deletions content/runbooks/node/NodeNetworkReceiveErrs.md
@@ -0,0 +1,32 @@
---
title: Node Network Receive Errors
weight: 20
---

# NodeNetworkReceiveErrs

## Meaning

Network interface is reporting many receive errors.

## Impact

Applications on the node may no longer be able to communicate with other services.
Network-attached storage may suffer performance issues or even data loss.

## Diagnosis

Investigate networking issues on the node and on connected hardware.
Check physical cables, networking firewall rules, and so on.
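
A minimal sketch for inspecting error counters, assuming `eth0` is the affected interface:

```shell
# Kernel-level RX/TX statistics, including errors and drops
$ ip -s link show dev eth0

# Driver/NIC-level counters, filtered for errors and drops
$ ethtool -S eth0 | grep -iE 'err|drop'
```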

## Mitigation

In general the mitigation landscape is quite vast; some suggestions (see also the sketch after this list):

- Ensure some node capacity is left unallocated (CPU/memory) for handling
  networking.
- [Increase TX queue length](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface)
- Spread services to other nodes/pods.
- Replace physical cables, change ports.
- Look into introducing Quality of Service or other
  [TCP congestion avoidance algorithms](https://en.wikipedia.org/wiki/TCP_congestion_control).
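
One hedged example for relieving receive-side pressure — enlarging the NIC ring buffers, a technique not listed above — assuming `eth0` and a driver that supports larger rings:

```shell
# Inspect current and maximum supported ring buffer sizes
$ ethtool -g eth0

# Enlarge the RX ring (4096 is an example; stay within the reported maximum)
$ ethtool -G eth0 rx 4096
```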
34 changes: 34 additions & 0 deletions content/runbooks/node/NodeNetworkTransmitErrs.md
@@ -0,0 +1,34 @@
---
title: Node Network Transmit Errors
weight: 20
---

# NodeNetworkTransmitErrs

## Meaning

Network interface is reporting many transmit errors.

## Impact

Applications on the node may no longer be able to communicate with other services.
Network-attached storage may suffer performance issues or even data loss.

## Diagnosis

Investigate networking issues on the node and on connected hardware.
Check network interface saturation.
Check CPU saturation.
Check physical cables, networking firewall rules, and so on.
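
A minimal sketch, assuming `eth0` is the affected interface and the sysstat package is installed:

```shell
# TX error counters
$ ip -s link show dev eth0

# Interface throughput over five one-second samples, to judge saturation
$ sar -n DEV 1 5

# CPU saturation over the same window
$ mpstat 1 5
```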

## Mitigation

In general the mitigation landscape is quite vast; some suggestions (see also the sketch after this list):

- Ensure some node capacity is left unallocated (CPU/memory) for handling
  networking.
- [Increase TX queue length](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface)
- Spread services to other nodes/pods.
- Replace physical cables, change ports.
- Look into introducing Quality of Service or other
  [TCP congestion avoidance algorithms](https://en.wikipedia.org/wiki/TCP_congestion_control).
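
A hedged example of the TX queue length suggestion above, assuming `eth0` (the value is illustrative):

```shell
# Inspect the current queue length (reported as qlen)
$ ip link show dev eth0

# Increase it; persist via a udev rule or your network configuration for reboots
$ ip link set dev eth0 txqueuelen 2000
```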
13 changes: 11 additions & 2 deletions content/runbooks/node/NodeRAIDDegraded.md
@@ -1,7 +1,14 @@
---
title: Node RAID Degraded
weight: 20
---

# NodeRAIDDegraded

## Meaning

RAID Array is degraded.

This alert is triggered when a node has a storage configuration with RAID array,
and the array is reporting as being in a degraded state due to one or more disk
failures.
@@ -17,7 +24,7 @@ You can open a shell on the node and use the standard Linux utilities to
diagnose the issue, but you may need to install additional software in the debug
container:

```console
```shell
$ NODE_NAME='<value of instance label from alert>'

$ oc debug "node/$NODE_NAME"
@@ -26,6 +33,8 @@ $ cat /proc/mdstat

## Mitigation

Cordon and drain the node if possible, then proceed with RAID recovery.
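
A minimal sketch, assuming kubectl access and a Linux mdraid array; `md0` and `sdb1` are hypothetical device names:

```shell
# Move workloads off the node before touching the array
$ kubectl cordon "$NODE_NAME"
$ kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data

# Add a replacement disk to the degraded array and watch the rebuild
$ mdadm --manage /dev/md0 --add /dev/sdb1
$ watch cat /proc/mdstat
```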

See the Red Hat Enterprise Linux [documentation][1] for potential steps.

[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices
[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices
8 changes: 8 additions & 0 deletions content/runbooks/node/NodeRAIDDiskFailure.md
@@ -0,0 +1,8 @@
---
title: Node RAID Disk Failure
weight: 20
---

# NodeRAIDDiskFailure

See [Node RAID Degraded]({{< ref "./NodeRAIDDegraded.md" >}})
23 changes: 23 additions & 0 deletions content/runbooks/node/NodeTextFileCollectorScrapeError.md
@@ -0,0 +1,23 @@
---
title: Node Text File Collector Scrape Error
weight: 20
---

# NodeTextFileCollectorScrapeError

## Meaning

Node Exporter text file collector failed to scrape.

## Impact

Missing metrics from additional scripts.

## Diagnosis

- Check node_exporter logs.
- Check the script supervisor (such as systemd or cron) for more information about failed script executions; see the sketch below.
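
A minimal sketch, assuming node_exporter runs as a systemd unit and uses the textfile directory below (both paths and `example.prom` are assumptions; adjust to your deployment):

```shell
# Look for textfile collector errors in the exporter logs
$ journalctl -u node_exporter | grep -i textfile

# Inspect the collected files and verify their metric syntax
$ ls -l /var/lib/node_exporter/textfile_collector
$ promtool check metrics < /var/lib/node_exporter/textfile_collector/example.prom
```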

## Mitigation

Check that the provided configuration is valid and that files were not renamed during upgrades.
