Commit

Merge pull request prometheus-operator#5 from thaum-xyz/port-from-openshift
paulfantom authored Nov 25, 2021
2 parents 6dd0987 + 89112c8 commit 62590db
Showing 19 changed files with 852 additions and 0 deletions.
23 changes: 23 additions & 0 deletions content/runbooks/alertmanager/AlertmanagerFailedReload.md
@@ -0,0 +1,23 @@
# AlertmanagerFailedReload

## Meaning

The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance
for the cluster monitoring stack has consistently failed to reload its
configuration for a certain period.
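
The alert is typically derived from the `alertmanager_config_last_reload_successful`
metric. A representative PromQL sketch (the exact rule and labels in your stack may
differ):

```console
max_over_time(alertmanager_config_last_reload_successful{namespace="monitoring"}[5m]) == 0
```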

## Impact

Alerts for cluster components may not be delivered as expected.

## Diagnosis

Check the logs for the `alertmanager-main` pods in the `monitoring` namespace:

```console
$ kubectl -n monitoring logs -l 'alertmanager=main'
```

## Mitigation

The resolution depends on the particular issue reported in the logs.
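
If the logs point to a configuration problem, the rendered configuration can be
validated offline with `amtool`. A minimal sketch, assuming the kube-prometheus
defaults of a secret named `alertmanager-main` with an `alertmanager.yaml` key and
a locally installed `amtool`:

```console
$ kubectl -n monitoring get secret alertmanager-main \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > /tmp/alertmanager.yaml
$ amtool check-config /tmp/alertmanager.yaml
```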
7 changes: 7 additions & 0 deletions content/runbooks/etcd/_index.md
@@ -0,0 +1,7 @@
---
title: etcd
bookCollapseSection: true
bookFlatSection: true
weight: 10
---

81 changes: 81 additions & 0 deletions content/runbooks/etcd/etcdBackendQuotaLowSpace.md
@@ -0,0 +1,81 @@
# etcdBackendQuotaLowSpace

## Meaning

This alert fires when the total existing DB size exceeds 95% of the maximum
DB quota. The consumed space is represented in Prometheus by the metric
`etcd_mvcc_db_total_size_in_bytes`, and the DB quota size is defined by
`etcd_server_quota_backend_bytes`.

## Impact

If the DB size exceeds the DB quota, no more writes can be performed on the etcd
cluster. This in turn prevents any updates in the cluster, such as the creation
of pods.

## Diagnosis

The following two approaches can be used for the diagnosis.

### CLI Checks

To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ NAMESPACE="kube-etcd"
$ kubectl exec -it -c etcdctl -n $NAMESPACE $(kubectl get po -l app=etcd -oname -n $NAMESPACE | awk -F"/" 'NR==1{ print $2 }') -- sh
```

Validate that the `etcdctl` command is available:

```console
$ etcdctl version
```

`etcdctl` can be used to fetch the DB size of the etcd endpoints.

```console
$ etcdctl endpoint status -w table
```

### PromQL queries

Check the percentage consumption of etcd DB with the following query in the
metrics console:

```console
(etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100
```

Check the DB size in MB that can be reduced after defragmentation:

```console
(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
```

## Mitigation

### Capacity planning

If `etcd_mvcc_db_total_size_in_bytes` shows that you are growing close to
`etcd_server_quota_backend_bytes`, etcd has almost reached its maximum capacity
and it is time to start planning for a new cluster.

In the meantime, before the migration happens, you can use defragmentation to gain
some time.

### Defrag

When the etcd DB size increases, we can defragment the existing etcd DB to
optimize DB consumption, as described [here][etcdDefragmentation]. Run the
following command in all etcd pods.

```console
$ etcdctl defrag
```
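
To run it across every member in one pass, a minimal shell sketch (assuming the
`kube-etcd` namespace and `app=etcd` label used earlier in this runbook):

```console
$ NAMESPACE="kube-etcd"
$ for pod in $(kubectl get po -n $NAMESPACE -l app=etcd -oname | awk -F"/" '{ print $2 }'); do
    kubectl exec -n $NAMESPACE -c etcdctl "$pod" -- etcdctl defrag
  done
```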

As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
96 changes: 96 additions & 0 deletions content/runbooks/etcd/etcdGRPCRequestsSlow.md
@@ -0,0 +1,96 @@
# etcdGRPCRequestsSlow

## Meaning

This alert fires when the 99th percentile of etcd gRPC request latency is too high.

## Impact

When requests are too slow, they can lead to problems such as leader election
failures and slow reads and writes.

## Diagnosis

This could be the result of a slow disk (due to fragmented state) or CPU contention.

### Slow disk

One of the most common reasons for slow gRPC requests is a slow disk. Checking
disk-related metrics and dashboards should provide a clearer picture.

#### PromQL queries used to troubleshoot

Check how slow the etcd gRPC requests are by using the following query in the
metrics console:

```console
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type))
```

That result should give a rough timeline of when the issue started.

`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration, and `etcd_server_leader_changes_seen_total` reports the leader changes.
To rule out a slow disk and confirm that the disk is reasonably fast, the 99th
percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be less
than 10ms. Query in the metrics UI:

```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```

#### Console dashboards

In the OpenShift console, under the Observe section, select the etcd dashboard.
Both the RPC rate and Disk Sync Duration panels there will assist with further
diagnosis.

### Resource exhaustion

etcd can respond more slowly due to CPU resource exhaustion. This has been seen in
cases where one application was requesting too much CPU, which led to this alert
firing for multiple methods.

If this is the case, we often also see higher
`etcd_disk_wal_fsync_duration_seconds_bucket` values.

To confirm this is the cause of the slow requests, either:

1. In the OpenShift console, on the main page under "Cluster utilization", view
   the requested CPU vs. available CPU.

2. Run the following PromQL query to see the top consumers of CPU:

```console
topk(25, sort_desc(
sum by (namespace) (
(
sum(avg_over_time(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod)
*
on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
)
*
on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
)
))
```

## Mitigation

### Fragmented state

In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption, as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.

```console
$ etcdctl defrag
```

As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.

Further info on etcd best practices can be found in the [OpenShift docs
here][etcdPractices].

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
55 changes: 55 additions & 0 deletions content/runbooks/etcd/etcdHighFsyncDurations.md
@@ -0,0 +1,55 @@
# etcdHighFsyncDurations

## Meaning

This alert fires when the 99th percentile of etcd disk fsync duration is too
high for 10 minutes.

## Impact

When this happens, it can lead to problems such as leader election failure,
frequent leader elections, and slow reads and writes.

## Diagnosis

This could be the result of a slow disk, either due to fragmented state in etcd or
simply due to slow disk hardware.

### Slow disk

Checking disk-related metrics and dashboards should provide a clearer picture.

#### PromQL queries used to troubleshoot

`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration, and `etcd_server_leader_changes_seen_total` reports the leader changes.
To rule out a slow disk and confirm that the disk is reasonably fast, the 99th
percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be less
than 10ms. Query in the metrics UI:

```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```

## Mitigation

### Fragmented state

In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption, as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.

```console
$ etcdctl defrag
```

As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.

Further info on etcd best practices can be found in the [OpenShift docs
here][etcdPractices].

[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
41 changes: 41 additions & 0 deletions content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md
@@ -0,0 +1,41 @@
# etcdHighNumberOfFailedGRPCRequests

## Meaning

This alert fires when at least 50% of etcd gRPC requests failed in the past 10
minutes.

## Impact

First establish which gRPC method is failing; this will be visible in the alert.
If it's not part of the alert, the following query will display the method and
etcd instance that have failing requests:

```sh
100 * sum without(grpc_type, grpc_code)
(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job="etcd"}[5m]))
/ sum without(grpc_type, grpc_code)
(rate(grpc_server_handled_total{job="etcd"}[5m])) > 5 and on()
(sum(cluster_infrastructure_provider{type!~"ipi|BareMetal"} == bool 1))
```

## Diagnosis

All the gRPC errors should also be logged in each respective etcd instance's logs.
You can get the instance name from the alert that is firing or by running the
query detailed above. Those etcd instance logs should provide further insight
into what is wrong.

To get the logs of the etcd containers, either check the instance from the alert
and read its logs directly, or run the following:

```sh
NAMESPACE="kube-etcd"
kubectl logs -n $NAMESPACE -l app=etcd -c etcd
```

## Mitigation

Depending on the above diagnosis, the issue will most likely be described in the
error log lines of either etcd or openshift-etcd-operator. The most common causes
tend to be networking issues.
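
To help rule out connectivity problems between members, it can also be useful to
check member and endpoint health from inside an etcd pod (a sketch, reusing the
`exec` approach shown in the other etcd runbooks):

```sh
etcdctl member list -w table
etcdctl endpoint health -w table
```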
65 changes: 65 additions & 0 deletions content/runbooks/etcd/etcdInsufficientMembers.md
@@ -0,0 +1,65 @@
# etcdInsufficientMembers

## Meaning

This alert fires when there are fewer instances available than are needed by
etcd to be healthy.
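
The alert is typically based on comparing the number of healthy members against
quorum. A representative PromQL sketch (the exact rule and job label in your stack
may differ):

```console
sum by (job) (up{job=~".*etcd.*"} == bool 1)
  < ((count by (job) (up{job=~".*etcd.*"}) + 1) / 2)
```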

## Impact

When etcd does not have a majority of instances available, the Kubernetes and
OpenShift APIs will reject read and write requests, and operations that preserve
the health of workloads cannot be performed.

## Diagnosis

This can occur when multiple control plane nodes are powered off or are unable to
connect to each other via the network. Check that all control plane nodes are
powered on and that network connections between each machine are functional.

Check for any other critical, warning, or info alerts firing that can assist with
the diagnosis.

Log in to the cluster and check the health of the master nodes to see whether any
of them is in `NotReady` state:

```console
$ kubectl get nodes -l node-role.kubernetes.io/master=
```

### General etcd health

To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- sh
```

Validate that the `etcdctl` command is available:

```console
$ etcdctl version
```

Run the following command to get the health of etcd:

```console
$ etcdctl endpoint health -w table
```

## Mitigation

### Disaster and recovery

If an upgrade is in progress, the alert may resolve automatically after some time,
when the master node comes back up. If the MCO is not working on the master node,
check the cloud provider to verify whether the master node instances are running.

If you are running on AWS, AWS instance retirement might require a manual reboot
of the master node.
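
One way to check the instance state, assuming the AWS CLI is configured and you
know the instance ID of the affected master node (a sketch, not the only
approach):

```console
$ aws ec2 describe-instance-status --instance-ids <instance-id> --include-all-instances
```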

As a last resort, if none of the above fixes the issue and the alert is still
firing, for etcd-specific issues follow the steps described in the [disaster and
recovery docs][docs].

[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html