Commit

Merge pull request prometheus-operator#5 from thaum-xyz/port-from-openshift
paulfantom authored Nov 25, 2021
2 parents 6dd0987 + 89112c8 commit 62590db
Showing 19 changed files with 852 additions and 0 deletions.
23 changes: 23 additions & 0 deletions content/runbooks/alertmanager/AlertmanagerFailedReload.md
@@ -0,0 +1,23 @@
# AlertmanagerFailedReload

## Meaning

The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance
for the cluster monitoring stack has consistently failed to reload its
configuration for a certain period.
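
The alert is typically derived from the `alertmanager_config_last_reload_successful`
metric. A representative PromQL sketch (the exact rule and labels in your stack may
differ):

```console
max_over_time(alertmanager_config_last_reload_successful{namespace="monitoring"}[5m]) == 0
```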

## Impact

Alerts for cluster components may not be delivered as expected.

## Diagnosis

Check the logs for the `alertmanager-main` pods in the `monitoring` namespace:

```console
$ kubectl -n monitoring logs -l 'alertmanager=main'
```

## Mitigation

The resolution depends on the particular issue reported in the logs.
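
If the logs point to a configuration problem, the rendered configuration can be
validated offline with `amtool`. A minimal sketch, assuming the kube-prometheus
defaults of a secret named `alertmanager-main` with an `alertmanager.yaml` key and
a locally installed `amtool`:

```console
$ kubectl -n monitoring get secret alertmanager-main \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > /tmp/alertmanager.yaml
$ amtool check-config /tmp/alertmanager.yaml
```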
7 changes: 7 additions & 0 deletions content/runbooks/etcd/_index.md
@@ -0,0 +1,7 @@
---
title: etcd
bookCollapseSection: true
bookFlatSection: true
weight: 10
---

81 changes: 81 additions & 0 deletions content/runbooks/etcd/etcdBackendQuotaLowSpace.md
@@ -0,0 +1,81 @@
# etcdBackendQuotaLowSpace

## Meaning

This alert fires when the total existing DB size exceeds 95% of the maximum
DB quota. The consumed space is represented in Prometheus by the metric
`etcd_mvcc_db_total_size_in_bytes`, and the DB quota size is defined by
`etcd_server_quota_backend_bytes`.

## Impact

If the DB size exceeds the DB quota, no more writes can be performed on the etcd
cluster. This in turn prevents any updates in the cluster, such as the creation
of pods.

## Diagnosis

The following two approaches can be used for the diagnosis.

### CLI Checks

To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ NAMESPACE="kube-etcd"
$ kubectl exec -it -c etcdctl -n $NAMESPACE $(kubectl get po -l app=etcd -oname -n $NAMESPACE | awk -F"/" 'NR==1{ print $2 }') -- sh
```

Validate that the `etcdctl` command is available:

```console
$ etcdctl version
```

`etcdctl` can be used to fetch the DB size of the etcd endpoints.

```console
$ etcdctl endpoint status -w table
```

### PromQL queries

Check the percentage consumption of etcd DB with the following query in the
metrics console:

```console
(etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100
```

Check the DB size in MB that can be reduced after defragmentation:

```console
(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
```

## Mitigation

### Capacity planning

If `etcd_mvcc_db_total_size_in_bytes` shows that you are growing close to
`etcd_server_quota_backend_bytes`, etcd has almost reached its maximum capacity
and it is time to start planning for a new cluster.

In the meantime, before the migration happens, you can use defragmentation to gain
some time.

### Defrag

When the etcd DB size increases, we can defragment the existing etcd DB to
optimize DB consumption, as described [here][etcdDefragmentation]. Run the
following command in all etcd pods.

```console
$ etcdctl defrag
```
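
To run it across every member in one pass, a minimal shell sketch (assuming the
`kube-etcd` namespace and `app=etcd` label used earlier in this runbook):

```console
$ NAMESPACE="kube-etcd"
$ for pod in $(kubectl get po -n $NAMESPACE -l app=etcd -oname | awk -F"/" '{ print $2 }'); do
    kubectl exec -n $NAMESPACE -c etcdctl "$pod" -- etcdctl defrag
  done
```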

As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
96 changes: 96 additions & 0 deletions content/runbooks/etcd/etcdGRPCRequestsSlow.md
@@ -0,0 +1,96 @@
# etcdGRPCRequestsSlow

## Meaning

This alert fires when the 99th percentile of etcd gRPC request latency is too high.

## Impact

When requests are too slow, they can lead to problems such as leader election
failures and slow reads and writes.

## Diagnosis

This could be the result of a slow disk (due to fragmented state) or CPU contention.

### Slow disk

One of the most common reasons for slow gRPC requests is a slow disk. Checking
disk-related metrics and dashboards should provide a clearer picture.

#### PromQL queries used to troubleshoot

Check how slow the etcd gRPC requests are by using the following query in the
metrics console:

```console
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type))
```

That result should give a rough timeline of when the issue started.

`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration, and `etcd_server_leader_changes_seen_total` reports the leader changes.
To rule out a slow disk and confirm that the disk is reasonably fast, the 99th
percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be less
than 10ms. Query in the metrics UI:

```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```

#### Console dashboards

In the OpenShift console, under the Observe section, select the etcd dashboard.
Both the RPC rate and Disk Sync Duration panels there will assist with further
diagnosis.

### Resource exhaustion

etcd can respond more slowly due to CPU resource exhaustion. This has been seen in
cases where one application was requesting too much CPU, which led to this alert
firing for multiple methods.

If this is the case, we often also see higher
`etcd_disk_wal_fsync_duration_seconds_bucket` values.

To confirm this is the cause of the slow requests, either:

1. In the OpenShift console, on the main page under "Cluster utilization", view
   the requested CPU vs. available CPU.

2. Run the following PromQL query to see the top consumers of CPU:

```console
topk(25, sort_desc(
sum by (namespace) (
(
sum(avg_over_time(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod)
*
on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
)
*
on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
)
))
```

## Mitigation

### Fragmented state

In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption, as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.

```console
$ etcdctl defrag
```

As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.

Further info on etcd best practices can be found in the [OpenShift docs
here][etcdPractices].

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
55 changes: 55 additions & 0 deletions content/runbooks/etcd/etcdHighFsyncDurations.md
@@ -0,0 +1,55 @@
# etcdHighFsyncDurations

## Meaning

This alert fires when the 99th percentile of etcd disk fsync duration is too
high for 10 minutes.

## Impact

When this happens, it can lead to problems such as leader election failure,
frequent leader elections, and slow reads and writes.

## Diagnosis

This could be the result of a slow disk, either due to fragmented state in etcd or
simply due to slow disk hardware.

### Slow disk

Checking disk-related metrics and dashboards should provide a clearer picture.

#### PromQL queries used to troubleshoot

`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration, and `etcd_server_leader_changes_seen_total` reports the leader changes.
To rule out a slow disk and confirm that the disk is reasonably fast, the 99th
percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be less
than 10ms. Query in the metrics UI:

```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```

## Mitigation

### Fragmented state

In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption, as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.

```console
$ etcdctl defrag
```

As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.

Further info on etcd best practices can be found in the [OpenShift docs
here][etcdPractices].

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
41 changes: 41 additions & 0 deletions content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md
@@ -0,0 +1,41 @@
# etcdHighNumberOfFailedGRPCRequests

## Meaning

This alert fires when at least 50% of etcd gRPC requests failed in the past 10
minutes.

## Impact

First establish which gRPC method is failing; this will be visible in the alert.
If it's not part of the alert, the following query will display the method and
etcd instance that have failing requests:

```sh
100 * sum without(grpc_type, grpc_code)
(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job="etcd"}[5m]))
/ sum without(grpc_type, grpc_code)
(rate(grpc_server_handled_total{job="etcd"}[5m])) > 5 and on()
(sum(cluster_infrastructure_provider{type!~"ipi|BareMetal"} == bool 1))
```

## Diagnosis

All the gRPC errors should also be logged in each respective etcd instance's logs.
You can get the instance name from the alert that is firing or by running the
query detailed above. Those etcd instance logs should provide further insight
into what is wrong.

To get the logs of the etcd containers, either check the instance from the alert
and read its logs directly, or run the following:

```sh
NAMESPACE="kube-etcd"
kubectl logs -n $NAMESPACE -l app=etcd -c etcd
```

## Mitigation

Depending on the above diagnosis, the issue will most likely be described in the
error log lines of either etcd or openshift-etcd-operator. The most common causes
tend to be networking issues.
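
To help rule out connectivity problems between members, it can also be useful to
check member and endpoint health from inside an etcd pod (a sketch, reusing the
`exec` approach shown in the other etcd runbooks):

```sh
etcdctl member list -w table
etcdctl endpoint health -w table
```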
65 changes: 65 additions & 0 deletions content/runbooks/etcd/etcdInsufficientMembers.md
@@ -0,0 +1,65 @@
# etcdInsufficientMembers

## Meaning

This alert fires when there are fewer instances available than are needed by
etcd to be healthy.
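
The alert is typically based on comparing the number of healthy members against
quorum. A representative PromQL sketch (the exact rule and job label in your stack
may differ):

```console
sum by (job) (up{job=~".*etcd.*"} == bool 1)
  < ((count by (job) (up{job=~".*etcd.*"}) + 1) / 2)
```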

## Impact

When etcd does not have a majority of instances available, the Kubernetes and
OpenShift APIs will reject read and write requests, and operations that preserve
the health of workloads cannot be performed.

## Diagnosis

This can occur when multiple control plane nodes are powered off or are unable to
connect to each other via the network. Check that all control plane nodes are
powered on and that network connections between each machine are functional.

Check for any other critical, warning, or info alerts firing that can assist with
the diagnosis.

Log in to the cluster and check the health of the master nodes to see whether any
of them is in `NotReady` state:

```console
$ kubectl get nodes -l node-role.kubernetes.io/master=
```

### General etcd health

To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- sh
```

Validate that the `etcdctl` command is available:

```console
$ etcdctl version
```

Run the following command to get the health of etcd:

```console
$ etcdctl endpoint health -w table
```

## Mitigation

### Disaster and recovery

If an upgrade is in progress, the alert may resolve automatically after some time,
when the master node comes back up. If the MCO is not working on the master node,
check the cloud provider to verify whether the master node instances are running.

If you are running on AWS, AWS instance retirement might require a manual reboot
of the master node.
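
One way to check the instance state, assuming the AWS CLI is configured and you
know the instance ID of the affected master node (a sketch, not the only
approach):

```console
$ aws ec2 describe-instance-status --instance-ids <instance-id> --include-all-instances
```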

As a last resort, if none of the above fixes the issue and the alert is still
firing, for etcd-specific issues follow the steps described in the [disaster and
recovery docs][docs].

[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html