Merge pull request prometheus-operator#16 from nvtkaszpir/runbooks-alertmanager
paulfantom authored Feb 13, 2023
2 parents ce0a5fb + 30c548d commit 5d07fa8
Showing 5 changed files with 25 additions and 14 deletions.
@@ -15,7 +15,7 @@ Alerts could be notified multiple times unless pods are crashing too fast and no a

## Diagnosis

```bash
```shell
kubectl get pod -l app=alertmanager

NAMESPACE NAME READY STATUS RESTARTS AGE
@@ -26,7 +26,7 @@ default alertmanager-main-2 2/2 Running 0 43d

Find the root cause by looking at events for a given pod/deployment

```
```shell
kubectl get events --field-selector involvedObject.name=alertmanager-main-0
```
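
If the events do not point at the root cause, the logs from the previous (crashed) container run usually do. A minimal sketch; the pod and container names below are taken from the listing above and may differ in your cluster:

```shell
# Logs of the last terminated container instance (assumed pod/container names).
kubectl logs alertmanager-main-0 -c alertmanager --previous
```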

2 changes: 1 addition & 1 deletion content/runbooks/alertmanager/AlertmanagerClusterDown.md
@@ -18,7 +18,7 @@ You have an unstable cluster; if everything goes wrong you will lose the whole c
Verify why pods are not running.
You can get a big picture with `events`.

```bash
```shell
$ kubectl get events --field-selector involvedObject.kind=Pod | grep alertmanager
```
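
A quick status overview of the pods themselves complements the events. The label selector below matches the `kubectl get pod` example used elsewhere in these runbooks and may differ in your cluster:

```shell
# Show pod state, restart counts and node placement for the alertmanager pods (assumed label).
kubectl get pod -l app=alertmanager -o wide
```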

13 changes: 8 additions & 5 deletions content/runbooks/alertmanager/AlertmanagerFailedReload.md
@@ -1,5 +1,5 @@
---
title: AlertmanagerFailedReload
title: Alertmanager Failed Reload
weight: 20
---

@@ -14,14 +14,15 @@ configuration for a certain period.
## Impact

The impact depends on the type of error you will find in the logs.
Most of the time, previous configuration is still working, thanks to multiple instances, so avoid deleting existing pods.
Most of the time, previous configuration is still working, thanks to multiple
instances, so avoid deleting existing pods.

## Diagnosis

Verify if there is an error in `config-reloader` container logs.
Here is an example with network issues.

```bash
```shell
$ kubectl logs sts/alertmanager-main -c config-reloader

level=error ts=2021-09-24T11:24:52.69629226Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9093/alertmanager/-/reload\": dial tcp [::1]:9093: connect: connection refused"
@@ -31,5 +32,7 @@ You can also verify the `alertmanager.yaml` file directly (default: `/etc/alertmanag

## Mitigation

Running [amtool check-config alertmanager.yaml](https://github.com/prometheus/alertmanager#amtool) on your configuration file will help you detect problems related to syntax.
You could also roll back `alertmanager.yaml` to the previous version in order to get back to a stable state.
Running [amtool check-config alertmanager.yaml](https://github.com/prometheus/alertmanager#amtool)
on your configuration file will help you detect problems related to syntax.
You could also roll back `alertmanager.yaml` to the previous version in order
to get back to a stable state.
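
A minimal sketch of the syntax check, assuming `amtool` is installed locally and that the configuration lives in the usual prometheus-operator secret; the secret name and key below are assumptions, so adjust them to your cluster:

```shell
# Assumption: the configuration is stored in a secret named alertmanager-main
# under the key alertmanager.yaml.
kubectl get secret alertmanager-main -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml

# Validate the syntax before re-applying or rolling back.
amtool check-config alertmanager.yaml
```
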
18 changes: 13 additions & 5 deletions content/runbooks/alertmanager/AlertmanagerFailedToSendAlerts.md
@@ -11,15 +11,20 @@ At least one instance is unable to route alerts to the corresponding integration

## Impact

No impact since another instance should be able to send the notification, unless `AlertmanagerClusterFailedToSendAlerts` is also triggered for the same integration.
No impact since another instance should be able to send the notification,
unless `AlertmanagerClusterFailedToSendAlerts` is also triggered for the same integration.

## Diagnosis

Verify the amount of failed notifications per alertmanager instance for a specific integration.
Verify the amount of failed notifications per alertmanager instance for
a specific integration.

You can look at metrics exposed in the Prometheus console using PromQL. For example, the following query will display the number of failed notifications per instance for the PagerDuty integration. We have 3 instances involved in the example below.
You can look at metrics exposed in the Prometheus console using PromQL.
For example, the following query will display the number of failed
notifications per instance for the PagerDuty integration.
We have 3 instances involved in the example below.

```
```promql
rate(alertmanager_notifications_total{integration="pagerduty"}[5m])
```
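
The query above uses the overall notification counter; the companion failure counter narrows the result to failures only. A sketch of running that query outside the console via the Prometheus HTTP API; the namespace, service name, and port-forward below are assumptions for a typical kube-prometheus setup:

```shell
# Assumption: the Prometheus API is reachable locally, e.g. after running
#   kubectl -n monitoring port-forward svc/prometheus-k8s 9090
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(alertmanager_notifications_failed_total{integration="pagerduty"}[5m])'
```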

@@ -30,6 +35,9 @@

Depending on the integration, you can have a look at the alertmanager logs and act accordingly (network, authorization token, firewall...)

```
Depending on the integration, you can have a look at the alertmanager logs
and act accordingly (network, authorization token, firewall...)

```shell
kubectl -n monitoring logs -l 'alertmanager=main' -c alertmanager
```
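
To narrow the output to a single integration, the same logs can be filtered; the integration name below is only an example:

```shell
# Grep the alertmanager logs for entries mentioning the failing integration.
kubectl -n monitoring logs -l 'alertmanager=main' -c alertmanager | grep -i pagerduty
```
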
@@ -15,7 +15,7 @@ At least one of the alertmanager cluster members cannot be found.

Check if the IP addresses discovered by the alertmanager cluster are the same as those in the alertmanager Service. The following example shows a possible inconsistency in Endpoint IP addresses:

```bash
```shell
$ kubectl describe svc alertmanager-main

Name: alertmanager-main

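You can also compare against the peers the running instances have actually discovered by asking one of the pods directly. A sketch, assuming `amtool` ships in the alertmanager container and the API is served on the default port without a route prefix:

```shell
# Endpoints backing the Service, for comparison with the discovered peers.
kubectl get endpoints alertmanager-main -o wide

# Peers as seen by a running instance; adjust --alertmanager.url if a route
# prefix such as /alertmanager is configured.
kubectl exec alertmanager-main-0 -c alertmanager -- amtool cluster show --alertmanager.url=http://localhost:9093
```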