Merge pull request prometheus-operator#16 from nvtkaszpir/runbooks-alertmanager
paulfantom authored Feb 13, 2023
2 parents ce0a5fb + 30c548d commit 5d07fa8
Showing 5 changed files with 25 additions and 14 deletions.
@@ -15,7 +15,7 @@ Alerts could be notified multiple times unless pods are crashing too fast and no a

## Diagnosis

```bash
```shell
kubectl get pod -l app=alertmanager

NAMESPACE NAME READY STATUS RESTARTS AGE
@@ -26,7 +26,7 @@ default alertmanager-main-2 2/2 Running 0 43d

Find the root cause by looking at events for a given pod/deployment

```
```shell
kubectl get events --field-selector involvedObject.name=alertmanager-main-0
```
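
If the events do not point at the root cause, the logs from the previous (crashed) container run usually do. A minimal sketch; the pod and container names below are taken from the listing above and may differ in your cluster:

```shell
# Logs of the last terminated container instance (assumed pod/container names).
kubectl logs alertmanager-main-0 -c alertmanager --previous
```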

2 changes: 1 addition & 1 deletion content/runbooks/alertmanager/AlertmanagerClusterDown.md
@@ -18,7 +18,7 @@ You have an unstable cluster; if everything goes wrong you will lose the whole c
Verify why pods are not running.
You can get a big picture with `events`.

```bash
```shell
$ kubectl get events --field-selector involvedObject.kind=Pod | grep alertmanager
```
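
A quick status overview of the pods themselves complements the events. The label selector below matches the `kubectl get pod` example used elsewhere in these runbooks and may differ in your cluster:

```shell
# Show pod state, restart counts and node placement for the alertmanager pods (assumed label).
kubectl get pod -l app=alertmanager -o wide
```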

13 changes: 8 additions & 5 deletions content/runbooks/alertmanager/AlertmanagerFailedReload.md
@@ -1,5 +1,5 @@
---
title: AlertmanagerFailedReload
title: Alertmanager Failed Reload
weight: 20
---

@@ -14,14 +14,15 @@ configuration for a certain period.
## Impact

The impact depends on the type of error you will find in the logs.
Most of the time, previous configuration is still working, thanks to multiple instances, so avoid deleting existing pods.
Most of the time, previous configuration is still working, thanks to multiple
instances, so avoid deleting existing pods.

## Diagnosis

Verify if there is an error in `config-reloader` container logs.
Here is an example with network issues.

```bash
```shell
$ kubectl logs sts/alertmanager-main -c config-reloader

level=error ts=2021-09-24T11:24:52.69629226Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9093/alertmanager/-/reload\": dial tcp [::1]:9093: connect: connection refused"
@@ -31,5 +32,7 @@ You can also verify the `alertmanager.yaml` file directly (default: `/etc/alertmanag

## Mitigation

Running [amtool check-config alertmanager.yaml](https://github.com/prometheus/alertmanager#amtool) on your configuration file will help you detect problems related to syntax.
You could also roll back `alertmanager.yaml` to the previous version in order to get back to a stable state.
Running [amtool check-config alertmanager.yaml](https://github.com/prometheus/alertmanager#amtool)
on your configuration file will help you detect problems related to syntax.
You could also roll back `alertmanager.yaml` to the previous version in order
to get back to a stable state.
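
A minimal sketch of the syntax check, assuming `amtool` is installed locally and that the configuration lives in the usual prometheus-operator secret; the secret name and key below are assumptions, so adjust them to your cluster:

```shell
# Assumption: the configuration is stored in a secret named alertmanager-main
# under the key alertmanager.yaml.
kubectl get secret alertmanager-main -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml

# Validate the syntax before re-applying or rolling back.
amtool check-config alertmanager.yaml
```
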
18 changes: 13 additions & 5 deletions content/runbooks/alertmanager/AlertmanagerFailedToSendAlerts.md
@@ -11,15 +11,20 @@ At least one instance is unable to route alerts to the corresponding integration

## Impact

No impact since another instance should be able to send the notification, unless `AlertmanagerClusterFailedToSendAlerts` is also triggered for the same integration.
No impact since another instance should be able to send the notification,
unless `AlertmanagerClusterFailedToSendAlerts` is also triggered for the same integration.

## Diagnosis

Verify the amount of failed notifications per alertmanager instance for a specific integration.
Verify the amount of failed notifications per alertmanager instance for
a specific integration.

You can look at metrics exposed in the Prometheus console using PromQL. For example, the following query will display the number of failed notifications per instance for the PagerDuty integration. We have 3 instances involved in the example below.
You can look at metrics exposed in the Prometheus console using PromQL.
For example, the following query will display the number of failed
notifications per instance for the PagerDuty integration.
We have 3 instances involved in the example below.

```
```promql
rate(alertmanager_notifications_total{integration="pagerduty"}[5m])
```
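
The query above uses the overall notification counter; the companion failure counter narrows the result to failures only. A sketch of running that query outside the console via the Prometheus HTTP API; the namespace, service name, and port-forward below are assumptions for a typical kube-prometheus setup:

```shell
# Assumption: the Prometheus API is reachable locally, e.g. after running
#   kubectl -n monitoring port-forward svc/prometheus-k8s 9090
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(alertmanager_notifications_failed_total{integration="pagerduty"}[5m])'
```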

@@ -30,6 +35,9 @@

Depending on the integration, you can have a look at the alertmanager logs and act accordingly (network, authorization token, firewall...)

```
Depending on the integration, you can have a look at the alertmanager logs
and act accordingly (network, authorization token, firewall...)

```shell
kubectl -n monitoring logs -l 'alertmanager=main' -c alertmanager
```
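
To narrow the output to a single integration, the same logs can be filtered; the integration name below is only an example:

```shell
# Grep the alertmanager logs for entries mentioning the failing integration.
kubectl -n monitoring logs -l 'alertmanager=main' -c alertmanager | grep -i pagerduty
```
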
@@ -15,7 +15,7 @@ At least one of the alertmanager cluster members cannot be found.

Check if the IP addresses discovered by the alertmanager cluster are the same as those in the alertmanager Service. The following example shows a possible inconsistency in Endpoint IP addresses:

```bash
```shell
$ kubectl describe svc alertmanager-main

Name: alertmanager-main

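You can also compare against the peers the running instances have actually discovered by asking one of the pods directly. A sketch, assuming `amtool` ships in the alertmanager container and the API is served on the default port without a route prefix:

```shell
# Endpoints backing the Service, for comparison with the discovered peers.
kubectl get endpoints alertmanager-main -o wide

# Peers as seen by a running instance; adjust --alertmanager.url if a route
# prefix such as /alertmanager is configured.
kubectl exec alertmanager-main-0 -c alertmanager -- amtool cluster show --alertmanager.url=http://localhost:9093
```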