forked from prometheus-operator/runbooks
Commit
Merge pull request prometheus-operator#13 from nvtkaszpir/runbooks-node
Showing 13 changed files with 235 additions and 27 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
--- | ||
title: Node Clock Not Synchronising | ||
weight: 20 | ||
--- | ||
|
||
# NodeClockNotSynchronising | ||
|
||
## Meaning | ||
|
||
The node's clock is not synchronising. | ||
|
||
## Impact | ||
|
||
Time is not automatically synchronizing on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications. | ||
|
||
## Diagnosis | ||
|
||
TODO | ||
|
||
## Mitigation | ||
|
||
See [Node Clock Skew Detected]({{< ref "./NodeClockSkewDetected.md" >}}) for mitigation steps. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
--- | ||
title: Node Clock Skew Detected | ||
weight: 20 | ||
--- | ||
|
||
# NodeClockSkewDetected | ||
|
||
## Meaning | ||
|
||
Clock skew detected. | ||
|
||
## Impact | ||
|
||
Time is skewed on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications. | ||
|
||
## Diagnosis | ||
|
||
TODO | ||
|
||
## Mitigation | ||
|
||
Ensure a time synchronization service is running. | ||
Configure correct time servers. | ||
Ensure time is synced on server start, especially when using | ||
low power mode or hibernation. | ||
|
||
A resource-intensive process can cause clock issues on the given hardware; | ||
if so, move it to different servers. | ||
|
||
On physical servers, check whether the on-board battery requires replacement. | ||
Check for hardware errors. | ||
Check for firmware updates. | ||
Consider moving to newer hardware (such as a newer server mainboard). |
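As a starting point for checking whether the time synchronization service is doing its job, the sketch below parses the `KEY=VALUE` output of `timedatectl show` and reads the `NTPSynchronized` property. It assumes a systemd-based node; the function names are hypothetical and not part of this runbook.

```python
import subprocess


def parse_timedatectl_show(output: str) -> dict:
    """Parse KEY=VALUE lines as printed by `timedatectl show`."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that carry no '=' separator
            props[key.strip()] = value.strip()
    return props


def is_clock_synchronized(output: str) -> bool:
    """Return True when the NTPSynchronized property is reported as 'yes'."""
    return parse_timedatectl_show(output).get("NTPSynchronized") == "yes"


def read_timedatectl() -> str:
    """Run `timedatectl show` on the node (assumes systemd is present)."""
    result = subprocess.run(
        ["timedatectl", "show"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a node, `is_clock_synchronized(read_timedatectl())` returning `False` suggests the synchronization service is stopped or cannot reach its time servers.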
24 changes: 24 additions & 0 deletions
content/runbooks/node/NodeHighNumberConntrackEntriesUsed.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
--- | ||
title: Node High Number Conntrack Entries Used | ||
weight: 20 | ||
--- | ||
|
||
# NodeHighNumberConntrackEntriesUsed | ||
|
||
## Meaning | ||
|
||
The number of conntrack entries is getting close to the limit. | ||
|
||
## Impact | ||
|
||
When the limit is reached, some connections will be dropped, degrading service quality. | ||
|
||
## Diagnosis | ||
|
||
Check the current conntrack entry count and limit on the node. | ||
Check which apps are generating a lot of connections. | ||
|
||
## Mitigation | ||
|
||
Migrate some pods to other nodes. | ||
Increase the conntrack limit directly on the node, and make the change persistent across node reboots. |
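To check the current conntrack value on a node, a minimal sketch is to read the kernel's counters and compute how full the table is. It assumes the `nf_conntrack` module is loaded, so that `nf_conntrack_count` and `nf_conntrack_max` exist under `/proc/sys/net/netfilter`; the function name is hypothetical.

```python
from pathlib import Path

# Location of the conntrack counters when the nf_conntrack module is loaded.
PROC_NETFILTER = Path("/proc/sys/net/netfilter")


def conntrack_usage(base: Path = PROC_NETFILTER) -> float:
    """Return the fraction of the conntrack table in use (0.0 to 1.0)."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    limit = int((base / "nf_conntrack_max").read_text().strip())
    return count / limit
```

Calling `conntrack_usage()` on the node returns a ratio that can be compared with the threshold configured in your alerting rule. If the limit is raised, the matching sysctl key is `net.netfilter.nf_conntrack_max`.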
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
--- | ||
title: Node Network Receive Errors | ||
weight: 20 | ||
--- | ||
|
||
# NodeNetworkReceiveErrs | ||
|
||
## Meaning | ||
|
||
Network interface is reporting many receive errors. | ||
|
||
## Impact | ||
|
||
Applications on the node may no longer be able to communicate with other services. | ||
Network-attached storage may suffer performance issues or even data loss. | ||
|
||
## Diagnosis | ||
|
||
Investigate networking issues on the node and on connected hardware. | ||
Check physical cables, firewall rules, and so on. | ||
|
||
## Mitigation | ||
|
||
The range of possible mitigations is broad. Some suggestions: | ||
|
||
- Ensure some node capacity is left unallocated (cpu/memory) for handling | ||
networking. | ||
- [Increase TX queue length](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface) | ||
- Spread services to other nodes/pods. | ||
- Replace physical cables, change ports. | ||
- Look into introducing Quality of Service or other | ||
[TCP congestion avoidance algorithms](https://en.wikipedia.org/wiki/TCP_congestion_control) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
--- | ||
title: Node Network Transmit Errors | ||
weight: 20 | ||
--- | ||
|
||
# NodeNetworkTransmitErrs | ||
|
||
## Meaning | ||
|
||
Network interface is reporting many transmit errors. | ||
|
||
## Impact | ||
|
||
Applications on the node may no longer be able to communicate with other services. | ||
Network-attached storage may suffer performance issues or even data loss. | ||
|
||
## Diagnosis | ||
|
||
Investigate networking issues on the node and on connected hardware. | ||
Check network interface saturation. | ||
Check CPU usage saturation. | ||
Check physical cables, firewall rules, and so on. | ||
|
||
## Mitigation | ||
|
||
The range of possible mitigations is broad. Some suggestions: | ||
|
||
- Ensure some node capacity is left unallocated (cpu/memory) for handling | ||
networking. | ||
- [Increase TX queue length](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface) | ||
- Spread services to other nodes/pods. | ||
- Replace physical cables, change ports. | ||
- Look into introducing Quality of Service or other | ||
[TCP congestion avoidance algorithms](https://en.wikipedia.org/wiki/TCP_congestion_control) |
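When investigating receive or transmit errors, it can help to see which interface the kernel is counting them against. The sketch below parses the text of `/proc/net/dev` and extracts the error and drop counters per interface. The function name and the returned dictionary keys are hypothetical.

```python
def parse_net_dev_errors(text: str) -> dict:
    """Extract rx/tx error and drop counters from /proc/net/dev content."""
    stats = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return stats


def read_net_dev(path: str = "/proc/net/dev") -> str:
    """Read the kernel's per-interface counters on a Linux node."""
    with open(path, encoding="ascii") as handle:
        return handle.read()
```

Running `parse_net_dev_errors(read_net_dev())` twice, a few seconds apart, shows whether the counters are still growing and on which interface.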
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
--- | ||
title: Node RAID Disk Failure | ||
weight: 20 | ||
--- | ||
|
||
# NodeRAIDDiskFailure | ||
|
||
See [Node RAID Degraded]({{< ref "./NodeRAIDDegraded.md" >}}) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
--- | ||
title: Node Text File Collector Scrape Error | ||
weight: 20 | ||
--- | ||
|
||
# NodeTextFileCollectorScrapeError | ||
|
||
## Meaning | ||
|
||
The Node Exporter text file collector failed to scrape. | ||
|
||
## Impact | ||
|
||
Missing metrics from additional scripts. | ||
|
||
## Diagnosis | ||
|
||
- Check node_exporter logs | ||
- Check script supervisor (like systemd or cron) for more information about failed script execution | ||
|
||
## Mitigation | ||
|
||
Check that the provided configuration is valid and that files were not renamed during upgrades. |
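One way to spot renamed or stale files is to list what is actually in the textfile directory. The sketch below reports, for each file, whether it has the `.prom` suffix that the text file collector picks up and how long ago it was last modified. The function name is hypothetical, and the directory is whatever you pass to node_exporter's `--collector.textfile.directory` flag.

```python
import time
from pathlib import Path


def list_textfile_entries(directory, now=None):
    """Return (name, has_prom_suffix, age_seconds) for each file in directory."""
    now = time.time() if now is None else now
    entries = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            age = now - path.stat().st_mtime
            entries.append((path.name, path.suffix == ".prom", age))
    return entries
```

A file without the `.prom` suffix is not read by the collector, and a large age suggests the script that writes the file has stopped running.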