forked from prometheus-operator/runbooks
Commit
Merge pull request prometheus-operator#5 from thaum-xyz/port-from-openshift
Showing 19 changed files with 852 additions and 0 deletions.

# AlertmanagerFailedReload

## Meaning

The alert `AlertmanagerFailedReload` is triggered when the Alertmanager
instance for the cluster monitoring stack has consistently failed to reload its
configuration for a certain period.

## Impact

Alerts for cluster components may not be delivered as expected.

## Diagnosis

Check the logs for the `alertmanager-main` pods in the `monitoring` namespace:

```console
$ kubectl -n monitoring logs -l 'alertmanager=main'
```
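
If the logs alone are not conclusive, the standard Alertmanager metric
`alertmanager_config_last_reload_successful` reports whether the last reload
succeeded. A minimal PromQL sketch to find the failing instances, assuming the
metric is scraped by your Prometheus:

```console
alertmanager_config_last_reload_successful == 0
```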

## Mitigation

The resolution depends on the particular issue reported in the logs.
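
If the logs point to an invalid configuration, `amtool` (shipped in the
official Alertmanager image) can validate it before the next reload attempt.
A sketch, assuming the usual operator-managed pod name and config mount path,
both of which may differ in your deployment:

```console
$ kubectl -n monitoring exec alertmanager-main-0 -c alertmanager -- \
    amtool check-config /etc/alertmanager/config/alertmanager.yaml
```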

---
title: etcd
bookCollapseSection: true
bookFlatSection: true
weight: 10
---

# etcdBackendQuotaLowSpace

## Meaning

This alert fires when the total existing DB size exceeds 95% of the maximum
DB quota. The consumed space is represented in Prometheus by the metric
`etcd_mvcc_db_total_size_in_bytes`, and the DB quota size is defined by
`etcd_server_quota_backend_bytes`.

## Impact

In case the DB size exceeds the DB quota, no writes can be performed anymore on
the etcd cluster. This further prevents any updates in the cluster, such as the
creation of pods.

## Diagnosis

The following two approaches can be used for the diagnosis.

### CLI Checks

To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ NAMESPACE="kube-etcd"
$ kubectl exec -it -c etcdctl -n $NAMESPACE $(kubectl get po -l app=etcd -oname -n $NAMESPACE | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh
```

Validate that the `etcdctl` command is available:

```console
$ etcdctl version
```

`etcdctl` can be used to fetch the DB size of the etcd endpoints:

```console
$ etcdctl endpoint status -w table
```

### PromQL queries

Check the percentage consumption of the etcd DB with the following query in the
metrics console:

```console
(etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100
```

Check the DB size in MB that can be reduced after defragmentation:

```console
(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
```

## Mitigation

### Capacity planning

If `etcd_mvcc_db_total_size_in_bytes` shows that you are growing close to
`etcd_server_quota_backend_bytes`, etcd is almost at maximum capacity and it is
time to start planning for a new cluster.

In the meantime, before that migration happens, you can use defragmentation to
gain some time.

### Defrag

When the etcd DB size increases, we can defragment the existing etcd DB to
optimize DB consumption, as described [here][etcdDefragmentation]. Run the
following command in all etcd pods:

```console
$ etcdctl defrag
```

To validate, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should now be available.
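
If the DB had already exceeded the quota before the defragmentation, etcd
raises a `NOSPACE` alarm that keeps blocking writes until it is cleared. A
sketch of checking and disarming it with `etcdctl`, to be run only after space
has actually been reclaimed:

```console
$ etcdctl alarm list
$ etcdctl alarm disarm
```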

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/

# etcdGRPCRequestsSlow

## Meaning

This alert fires when the 99th percentile of etcd gRPC request latency is too
high.

## Impact

When requests are too slow, they can lead to various scenarios such as leader
election failures and slow reads and writes.

## Diagnosis

This could be the result of a slow disk (due to a fragmented state) or of CPU
contention.

### Slow disk

One of the most common reasons for slow gRPC requests is a slow disk. Checking
disk-related metrics and dashboards should provide a clearer picture.

#### PromQL queries used to troubleshoot

Check how slow the etcd gRPC requests are by using the following query in the
metrics console:

```console
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type))
```

That result should give a rough timeline of when the issue started.

`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration and `etcd_server_leader_changes_seen_total` reports the leader
changes. To rule out a slow disk and confirm that the disk is reasonably fast,
the 99th percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be
less than 10ms. Query in the metrics UI:

```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```

#### Console dashboards

In the OpenShift console, under the Observe section, select the etcd dashboard.
It contains both RPC rate and Disk Sync Duration panels, which will assist with
further diagnosis.

### Resource exhaustion

etcd can also respond slowly because of CPU resource exhaustion. This has been
seen in cases where one application requested too much CPU, which led to this
alert firing for multiple methods.

If this is the case, we often see `etcd_disk_wal_fsync_duration_seconds_bucket`
getting slower as well.

To confirm that this is the cause of the slow requests, either:

1. In the OpenShift console, on the main page under "Cluster utilization", view
   the requested CPU versus the available CPU.

2. Run the following PromQL query to see the top consumers of CPU:

```console
topk(25, sort_desc(
  sum by (namespace) (
    (
      sum(avg_over_time(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod)
      *
      on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
    )
    *
    on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
  )
))
```
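
In addition, if the etcd pods themselves are being CPU-throttled, the cAdvisor
throttling metrics will show it. A sketch of such a query; the `kube-etcd`
namespace and the `etcd.*` pod name pattern are assumptions that should be
adapted to your deployment:

```console
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="kube-etcd", pod=~"etcd.*"}[5m]))
/
sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="kube-etcd", pod=~"etcd.*"}[5m]))
```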

## Mitigation

### Fragmented state

In the case of a slow disk or when the etcd DB size increases, we can
defragment the existing etcd DB to optimize DB consumption, as described
[here][etcdDefragmentation]. Run the following command in all etcd pods:

```console
$ etcdctl defrag
```

To validate, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should now be available.
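
A sketch of that validation, run from the `etcdctl` container of an etcd pod
(see the other etcd runbooks for how to exec into it):

```console
$ etcdctl endpoint status -w table
```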

Further info on etcd best practices can be found in the
[OpenShift docs][etcdPractices].

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_

# etcdHighFsyncDurations

## Meaning

This alert fires when the 99th percentile of etcd disk fsync duration is too
high for 10 minutes.

## Impact

When this happens, it can lead to various scenarios such as leader election
failures, frequent leader elections, and slow reads and writes.

## Diagnosis

This is usually the result of a slow disk, either because of a fragmented state
in etcd or simply because the disk itself is slow.

### Slow disk

Checking disk-related metrics and dashboards should provide a clearer picture.

#### PromQL queries used to troubleshoot

`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration and `etcd_server_leader_changes_seen_total` reports the leader
changes. To rule out a slow disk and confirm that the disk is reasonably fast,
the 99th percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be
less than 10ms. Query in the metrics UI:

```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```
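
Frequent leader changes are a typical symptom of fsync being too slow. A sketch
of a query to count leader changes per instance over the last hour, based on
the `etcd_server_leader_changes_seen_total` metric mentioned above:

```console
increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h])
```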

## Mitigation

### Fragmented state

In the case of a slow disk or when the etcd DB size increases, we can
defragment the existing etcd DB to optimize DB consumption, as described
[here][etcdDefragmentation]. Run the following command in all etcd pods:

```console
$ etcdctl defrag
```

To validate, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should now be available.

Further info on etcd best practices can be found in the
[OpenShift docs][etcdPractices].

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md

# etcdHighNumberOfFailedGRPCRequests

## Meaning

This alert fires when at least 50% of etcd gRPC requests failed in the past 10
minutes.

## Impact

First establish which gRPC method is failing; this will be visible in the
alert. If it is not part of the alert, the following query will display the
method and the etcd instance that have failing requests:

```sh
100 * sum without(grpc_type, grpc_code)
  (rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job="etcd"}[5m]))
  / sum without(grpc_type, grpc_code)
  (rate(grpc_server_handled_total{job="etcd"}[5m])) > 5 and on()
  (sum(cluster_infrastructure_provider{type!~"ipi|BareMetal"} == bool 1))
```
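
If the `cluster_infrastructure_provider` condition at the end of that query
does not exist on your cluster (it is OpenShift-specific), a simpler sketch
that just ranks failing methods per instance is:

```sh
topk(10, sum by (grpc_method, instance) (
  rate(grpc_server_handled_total{job="etcd", grpc_code!="OK"}[5m])
))
```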

## Diagnosis

All gRPC errors should also be logged in each respective etcd instance's logs.
You can get the instance name from the alert that is firing or by running the
query detailed above. Those etcd instance logs should serve as further insight
into what is wrong.

To get the logs of the etcd containers, either check the instance from the
alert and read its logs directly, or run the following:

```sh
NAMESPACE="kube-etcd"
kubectl logs -n $NAMESPACE -l app=etcd -c etcd
```
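
To narrow the output down to gRPC-related failures, a sketch using `grep`; the
`"rpc error"` pattern is an assumption about the log format and may need
adjusting:

```sh
kubectl logs -n $NAMESPACE -l app=etcd -c etcd --tail=1000 | grep -i "rpc error"
```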

## Mitigation

Depending on the above diagnosis, the issue will most likely be described in
the error log lines of either etcd or openshift-etcd-operator. The most likely
causes tend to be networking issues.

# etcdInsufficientMembers

## Meaning

This alert fires when there are fewer instances available than are needed by
etcd to be healthy.

## Impact

When etcd does not have a majority of instances available, the Kubernetes and
OpenShift APIs will reject read and write requests, and operations that
preserve the health of workloads cannot be performed.

## Diagnosis

This can occur when multiple control plane nodes are powered off or are unable
to connect to each other via the network. Check that all control plane nodes
are powered on and that network connections between each machine are
functional.

Check for any other critical, warning, or info alerts that are firing and can
assist with the diagnosis.

Log in to the cluster and check whether any of the master nodes are in the
`NotReady` state:

```console
$ kubectl get nodes -l node-role.kubernetes.io/master=
```
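
If a node shows up as `NotReady`, describing it lists the failing conditions
and recent events; the node name below is a placeholder:

```console
$ kubectl describe node <master-node-name>
```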

### General etcd health

To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh
```

Validate that the `etcdctl` command is available:

```console
$ etcdctl version
```

Run the following command to get the health of etcd:

```console
$ etcdctl endpoint health -w table
```
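
To see which members the cluster expects and compare that with what is actually
reachable, listing the members from the same `etcdctl` session can also help:

```console
$ etcdctl member list -w table
```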

## Mitigation

### Disaster and recovery

If an upgrade is in progress, the alert may resolve automatically after some
time, once the master node comes up again. If the MCO (Machine Config Operator)
is not working on the master node, check the cloud provider to verify whether
the master node instances are running.

If you are running on AWS, an AWS instance retirement might require a manual
reboot of the master node.

As a last resort, if none of the above fixes the issue and the alert is still
firing, follow the steps for etcd-specific issues described in the
[disaster and recovery docs][docs].

[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html