Merge pull request prometheus-operator#13 from nvtkaszpir/runbooks-node
paulfantom authored Feb 13, 2023
2 parents 73710f0 + 441e935 commit ea632f2
Showing 13 changed files with 235 additions and 27 deletions.
22 changes: 22 additions & 0 deletions content/runbooks/node/NodeClockNotSynchronising.md
@@ -0,0 +1,22 @@
---
title: Node Clock Not Synchronising
weight: 20
---

# NodeClockNotSynchronising

## Meaning

The clock on the node is not synchronising.

## Impact

Time is not automatically synchronising on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications.

## Diagnosis

TODO
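
A minimal starting point, assuming the node uses systemd with chrony as the time-sync daemon (adjust for systemd-timesyncd or ntpd):

```shell
# Is system clock synchronization active at all?
$ timedatectl status

# Is the sync service running? (chronyd is an assumption; swap in your daemon)
$ systemctl status chronyd

# Can chrony reach its configured time sources?
$ chronyc sources -v
```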

## Mitigation

See [Node Clock Skew Detected]({{< ref "./NodeClockSkewDetected.md" >}}) for mitigation steps.
33 changes: 33 additions & 0 deletions content/runbooks/node/NodeClockSkewDetected.md
@@ -0,0 +1,33 @@
---
title: Node Clock Skew Detected
weight: 20
---

# NodeClockSkewDetected

## Meaning

Clock skew has been detected on the node.

## Impact

Time is skewed on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications.

## Diagnosis

TODO
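
A minimal sketch for inspecting the current offset, assuming chrony is the time-sync daemon:

```shell
# Show the current offset, stratum, and sync status
$ chronyc tracking

# Per-source statistics, including measured offsets
$ chronyc sourcestats
```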

## Mitigation

Ensure the time synchronization service is running.
Configure proper time servers.
Ensure time is synced at server start, especially when using
low power modes or hibernation.
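
A minimal sketch of the first steps, assuming systemd with chrony:

```shell
# Make sure the sync service starts now and on every boot
$ systemctl enable --now chronyd

# Time servers are configured via server/pool directives in /etc/chrony.conf

# Force an immediate correction of a large offset
$ chronyc makestep
```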

A resource-consuming process can cause clock issues on given hardware,
so consider moving it to a different server.

On physical servers, check whether the on-board battery requires replacement.
Check for hardware errors.
Check for firmware updates.
Consider using newer hardware (such as the server mainboard).
7 changes: 6 additions & 1 deletion content/runbooks/node/NodeFileDescriptorLimit.md
@@ -1,3 +1,8 @@
---
title: Node File Descriptor Limit
weight: 20
---

# NodeFileDescriptorLimit

## Meaning
@@ -17,7 +22,7 @@ node.
You can open a shell on the node and use the standard Linux utilities to
diagnose the issue:

```console
```shell
$ NODE_NAME='<value of instance label from alert>'

$ oc debug "node/$NODE_NAME"
11 changes: 7 additions & 4 deletions content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md
@@ -1,8 +1,13 @@
---
title: Node Filesystem Almost Out Of Files
weight: 20
---

# NodeFilesystemAlmostOutOfFiles

## Meaning

This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather
This alert is similar to the NodeFilesystemSpaceFillingUp alert, but rather
than being based on a prediction that a filesystem will run out of inodes in a
certain amount of time, it uses simple static thresholds. The alert fires
at a `warning` level with 5% of available inodes left, and at a `critical` level
@@ -18,10 +23,8 @@ of the cluster.

## Diagnosis

Refer to the [NodeFilesystemFilesFillingUp][1] runbook.

## Mitigation

Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
See [Node Filesystem Files Filling Up]({{< ref "./NodeFilesystemFilesFillingUp.md" >}})

[1]: ./NodeFilesystemFilesFillingUp.md
13 changes: 7 additions & 6 deletions content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md
@@ -1,8 +1,13 @@
---
title: Node Filesystem Almost Out Of Space
weight: 20
---

# NodeFilesystemAlmostOutOfSpace

## Meaning

This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather
This alert is similar to the NodeFilesystemSpaceFillingUp alert, but rather
than being based on a prediction that a filesystem will become full in a certain
amount of time, it uses simple static thresholds. The alert fires at a
`warning` level with 5% space left, and at a `critical` level with 3% space left.
@@ -17,10 +22,6 @@ of the cluster.

## Diagnosis

Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.

## Mitigation

Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.

[1]: ./NodeFilesystemSpaceFillingUp.md
See [Node Filesystem Space Filling Up]({{< ref "./NodeFilesystemSpaceFillingUp.md" >}})
13 changes: 9 additions & 4 deletions content/runbooks/node/NodeFilesystemFilesFillingUp.md
@@ -1,8 +1,13 @@
---
title: Node Filesystem Files Filling Up
weight: 20
---

# NodeFilesystemFilesFillingUp

## Meaning

This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but
This alert is similar to the NodeFilesystemSpaceFillingUp alert, but
predicts the filesystem will run out of inodes rather than bytes of storage
space. The alert fires at a `critical` level when the filesystem is predicted to
run out of available inodes within four hours.
@@ -21,7 +26,7 @@ Note the `instance` and `mountpoint` labels from the alert. You can graph the
usage history of this filesystem with the following query in the OpenShift web
console:

```text
```promql
node_filesystem_files_free{
instance="<value of instance label from alert>",
mountpoint="<value of mountpoint label from alert>"
@@ -31,7 +36,7 @@ node_filesystem_files_free{
You can also open a debug session on the node and use the standard Linux
utilities to locate the source of the usage:

```console
```shell
$ MOUNT_POINT='<value of mountpoint label from alert>'
$ NODE_NAME='<value of instance label from alert>'

@@ -50,4 +55,4 @@ size. You may be able to solve the problem, or buy time, by increasing the size of
the storage volume. Otherwise, determine the application that is creating large
numbers of files and adjust its configuration or provide it dedicated storage.
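
A hedged helper for locating inode-heavy directories, assuming GNU `du` is available inside the mount point:

```shell
# Count inodes per directory under the affected mountpoint, largest first
$ du --inodes -x "$MOUNT_POINT" | sort -rn | head -n 20
```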

[1]: ./NodeFilesystemSpaceFillingUp.md
See [Node Filesystem Space Filling Up]({{< ref "./NodeFilesystemSpaceFillingUp.md" >}}) for additional mitigation steps.
29 changes: 19 additions & 10 deletions content/runbooks/node/NodeFilesystemSpaceFillingUp.md
@@ -1,3 +1,8 @@
---
title: Node Filesystem Space Filling Up
weight: 20
---

# NodeFilesystemSpaceFillingUp

## Meaning
@@ -11,8 +16,12 @@ time is less than 4h.
<details>
<summary>Full context</summary>

The filesystem on Kubernetes nodes mainly consists of the operating system, [container ephemeral storage][1], container images, and container logs.
Since Kubelet automatically handles [cleaning up old logs][2] and [deleting unused images][3], container ephemeral storage is a common cause of this alert. Although this alert may be triggered before Kubelet's garbage collection kicks in.
The filesystem on Kubernetes nodes mainly consists of the operating system,
[container ephemeral storage][1], container images, and container logs.
Since Kubelet automatically handles [cleaning up old logs][2] and
[deleting unused images][3], container ephemeral storage is a common cause of
this alert. The alert may, however, be triggered before the kubelet's garbage
collection kicks in.

</details>

@@ -31,7 +40,7 @@ and/or recent offenders. Is this an irregular condition, e.g. a process failing
to clean up behind itself, or is this organic growth? If monitoring is enabled,
the following metric can be watched in PromQL:

```console
```promql
node_filesystem_free_bytes
```

@@ -44,31 +53,31 @@ removing unused images solves that issue:

Debug the node by accessing the node filesystem:

```console
```shell
$ NODE_NAME=<instance label from alert>
$ kubectl -n default debug node/$NODE_NAME
$ chroot /host
```

Remove dangling images:

```console
```shell
# Assumes a Docker runtime on the node; adapt for containerd/CRI-O
$ docker image prune -f
```

Remove unused images:

```console
```shell
# Assumes crictl is available (CRI runtimes such as containerd or CRI-O)
$ crictl rmi --prune
```

Exit debug:

```console
```shell
$ exit
$ exit
```

[1]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#local-ephemeral-storage
[2]: https://kubernetes.io/docs/concepts/cluster-administration/logging/
[3]: https://kubernetes.io/docs/concepts/architecture/garbage-collection/#containers-images
[1]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#local-ephemeral-storage
[2]: https://kubernetes.io/docs/concepts/cluster-administration/logging/
[3]: https://kubernetes.io/docs/concepts/architecture/garbage-collection/#containers-images
24 changes: 24 additions & 0 deletions content/runbooks/node/NodeHighNumberConntrackEntriesUsed.md
@@ -0,0 +1,24 @@
---
title: Node High Number Conntrack Entries Used
weight: 20
---

# NodeHighNumberConntrackEntriesUsed

## Meaning

The number of conntrack entries is getting close to the limit.

## Impact

When the limit is reached, new connections are dropped, degrading service quality.

## Diagnosis

Check the current conntrack usage on the node.
Check which applications are generating large numbers of connections.
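
A minimal sketch, assuming the `nf_conntrack` module is loaded and conntrack-tools is installed on the node:

```shell
# Current usage vs. limit
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Summary of conntrack events and errors
$ conntrack -S

# Sample entries to spot which addresses dominate the table
$ conntrack -L | head -n 50
```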

## Mitigation

Migrate some pods to other nodes.
Raise the conntrack limit directly on the node, remembering to make it persistent across node reboots.
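
A minimal sketch for raising the limit (the value is an example; size it to the node's memory):

```shell
# Raise the limit at runtime
$ sysctl -w net.netfilter.nf_conntrack_max=262144

# Persist the change across reboots
$ echo 'net.netfilter.nf_conntrack_max = 262144' | sudo tee /etc/sysctl.d/90-conntrack.conf
```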
32 changes: 32 additions & 0 deletions content/runbooks/node/NodeNetworkReceiveErrs.md
@@ -0,0 +1,32 @@
---
title: Node Network Receive Errors
weight: 20
---

# NodeNetworkReceiveErrs

## Meaning

Network interface is reporting many receive errors.

## Impact

Applications on the node may no longer be able to communicate with other services.
Network-attached storage may suffer performance issues or even data loss.

## Diagnosis

Investigate networking issues on the node and on connected hardware.
Check physical cables, networking firewall rules, and so on.
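
A minimal sketch for inspecting error counters, assuming `eth0` is the affected interface:

```shell
# Kernel-level RX/TX statistics, including errors and drops
$ ip -s link show dev eth0

# Driver/NIC-level counters, filtered for errors and drops
$ ethtool -S eth0 | grep -iE 'err|drop'
```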

## Mitigation

In general the mitigation landscape is quite vast; some suggestions (see also the sketch after this list):

- Ensure some node capacity is left unallocated (CPU/memory) for handling
  networking.
- [Increase TX queue length](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface)
- Spread services to other nodes/pods.
- Replace physical cables, change ports.
- Look into introducing Quality of Service or other
  [TCP congestion avoidance algorithms](https://en.wikipedia.org/wiki/TCP_congestion_control).
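
One hedged example for relieving receive-side pressure — enlarging the NIC ring buffers, a technique not listed above — assuming `eth0` and a driver that supports larger rings:

```shell
# Inspect current and maximum supported ring buffer sizes
$ ethtool -g eth0

# Enlarge the RX ring (4096 is an example; stay within the reported maximum)
$ ethtool -G eth0 rx 4096
```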
34 changes: 34 additions & 0 deletions content/runbooks/node/NodeNetworkTransmitErrs.md
@@ -0,0 +1,34 @@
---
title: Node Network Transmit Errors
weight: 20
---

# NodeNetworkTransmitErrs

## Meaning

Network interface is reporting many transmit errors.

## Impact

Applications on the node may no longer be able to communicate with other services.
Network-attached storage may suffer performance issues or even data loss.

## Diagnosis

Investigate networking issues on the node and on connected hardware.
Check network interface saturation.
Check CPU saturation.
Check physical cables, networking firewall rules, and so on.
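
A minimal sketch, assuming `eth0` is the affected interface and the sysstat package is installed:

```shell
# TX error counters
$ ip -s link show dev eth0

# Interface throughput over five one-second samples, to judge saturation
$ sar -n DEV 1 5

# CPU saturation over the same window
$ mpstat 1 5
```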

## Mitigation

In general the mitigation landscape is quite vast; some suggestions (see also the sketch after this list):

- Ensure some node capacity is left unallocated (CPU/memory) for handling
  networking.
- [Increase TX queue length](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface)
- Spread services to other nodes/pods.
- Replace physical cables, change ports.
- Look into introducing Quality of Service or other
  [TCP congestion avoidance algorithms](https://en.wikipedia.org/wiki/TCP_congestion_control).
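
A hedged example of the TX queue length suggestion above, assuming `eth0` (the value is illustrative):

```shell
# Inspect the current queue length (reported as qlen)
$ ip link show dev eth0

# Increase it; persist via a udev rule or your network configuration for reboots
$ ip link set dev eth0 txqueuelen 2000
```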
13 changes: 11 additions & 2 deletions content/runbooks/node/NodeRAIDDegraded.md
@@ -1,7 +1,14 @@
---
title: Node RAID Degraded
weight: 20
---

# NodeRAIDDegraded

## Meaning

RAID Array is degraded.

This alert is triggered when a node has a storage configuration with RAID array,
and the array is reporting as being in a degraded state due to one or more disk
failures.
@@ -17,7 +24,7 @@ You can open a shell on the node and use the standard Linux utilities to
diagnose the issue, but you may need to install additional software in the debug
container:

```console
```shell
$ NODE_NAME='<value of instance label from alert>'

$ oc debug "node/$NODE_NAME"
@@ -26,6 +33,8 @@ $ cat /proc/mdstat

## Mitigation

Cordon and drain the node if possible, then proceed with RAID recovery.
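
A minimal sketch, assuming kubectl access and a Linux mdraid array; `md0` and `sdb1` are hypothetical device names:

```shell
# Move workloads off the node before touching the array
$ kubectl cordon "$NODE_NAME"
$ kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data

# Add a replacement disk to the degraded array and watch the rebuild
$ mdadm --manage /dev/md0 --add /dev/sdb1
$ watch cat /proc/mdstat
```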

See the Red Hat Enterprise Linux [documentation][1] for potential steps.

[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices
[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices
8 changes: 8 additions & 0 deletions content/runbooks/node/NodeRAIDDiskFailure.md
@@ -0,0 +1,8 @@
---
title: Node RAID Disk Failure
weight: 20
---

# NodeRAIDDiskFailure

See [Node RAID Degraded]({{< ref "./NodeRAIDDegraded.md" >}})
23 changes: 23 additions & 0 deletions content/runbooks/node/NodeTextFileCollectorScrapeError.md
@@ -0,0 +1,23 @@
---
title: Node Text File Collector Scrape Error
weight: 20
---

# NodeTextFileCollectorScrapeError

## Meaning

Node Exporter text file collector failed to scrape.

## Impact

Missing metrics from additional scripts.

## Diagnosis

- Check node_exporter logs.
- Check the script supervisor (such as systemd or cron) for more information about failed script executions; see the sketch below.
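
A minimal sketch, assuming node_exporter runs as a systemd unit and uses the textfile directory below (both paths and `example.prom` are assumptions; adjust to your deployment):

```shell
# Look for textfile collector errors in the exporter logs
$ journalctl -u node_exporter | grep -i textfile

# Inspect the collected files and verify their metric syntax
$ ls -l /var/lib/node_exporter/textfile_collector
$ promtool check metrics < /var/lib/node_exporter/textfile_collector/example.prom
```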

## Mitigation

Check that the provided configuration is valid and that files were not renamed during upgrades.
