Merge pull request prometheus-operator#15 from nvtkaszpir/runbooks-prometheus
paulfantom authored Feb 13, 2023
2 parents a6bc8ea + b786728 commit ce0a5fb
Showing 18 changed files with 393 additions and 13 deletions.
21 changes: 17 additions & 4 deletions content/runbooks/prometheus/PrometheusBadConfig.md
@@ -1,17 +1,30 @@
---
title: Prometheus Bad Config
weight: 20
---

# PrometheusBadConfig

## Meaning

This alert fires when Prometheus cannot successfully reload its configuration
file because the file has incorrect content.

## Impact

The configuration cannot be reloaded, so Prometheus keeps operating with the
last known good configuration.
Configuration changes in any Prometheus, Probe, PodMonitor,
or ServiceMonitor objects may not be picked up by the Prometheus server.

## Diagnosis

Check the Prometheus container logs for an explanation of which part of the
configuration is problematic.

This usually happens when ServiceMonitors or PodMonitors share the same
job label.
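
For example, a minimal check, assuming the pod is named `prometheus-k8s-0` and
the config-reloader sidecar uses the default container name from recent
prometheus-operator releases (adjust both to your setup):

```shell
# Look for configuration-related errors reported by the Prometheus container
$ kubectl -n <namespace> logs prometheus-k8s-0 -c prometheus | grep -i "config"

# The config-reloader sidecar usually logs the exact parse error too
$ kubectl -n <namespace> logs prometheus-k8s-0 -c config-reloader | tail -n 20
```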

## Mitigation

Remove the conflicting configuration option.
21 changes: 16 additions & 5 deletions content/runbooks/prometheus/PrometheusDuplicateTimestamps.md
@@ -1,23 +1,34 @@
---
title: Prometheus Duplicate Timestamps
weight: 20
---

# PrometheusDuplicateTimestamps

Find the Prometheus Pod that this alert concerns.

```shell
$ kubectl -n <namespace> get pod
prometheus-k8s-0 2/2 Running 1 122m
prometheus-k8s-1 2/2 Running 1 122m
```

Look at the logs of each of them; there should be a log line such as:

```shell
$ kubectl -n <namespace> logs prometheus-k8s-0
level=warn ts=2021-01-04T15:08:55.613Z caller=scrape.go:1372 component="scrape manager" scrape_pool=default/main-ingress-nginx-controller/0 target=http://10.0.7.3:10254/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=16
```

Now there is a judgement call to make; this could be the result of:

* Faulty configuration. This can be resolved by removing the offending
  `ServiceMonitor` or `PodMonitor` object, identified through the
  `scrape_pool` label in the log line, which has the format
  `<namespace>/<service-monitor-name>/<endpoint-id>`
  (see the example after this list).

* The target is reporting faulty data. Sometimes this can be resolved by
  restarting the target; otherwise it may need to be fixed in the code of the
  offending application.
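
As an example, not a definitive fix, using the `scrape_pool` value from the
log line above (`default/main-ingress-nginx-controller/0`), you could inspect
and, if appropriate, remove the offending object:

```shell
# scrape_pool=<namespace>/<service-monitor-name>/<endpoint-id>
$ kubectl -n default get servicemonitor main-ingress-nginx-controller -o yaml

# Only if the configuration really is at fault:
$ kubectl -n default delete servicemonitor main-ingress-nginx-controller
```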

Further reading: [Debugging out-of-order samples](https://www.robustperception.io/debugging-out-of-order-samples)
@@ -0,0 +1,24 @@
---
title: Prometheus Error Sending Alerts To Any Alertmanager
weight: 20
---

# PrometheusErrorSendingAlertsToAnyAlertmanager

## Meaning

Prometheus has encountered errors sending alerts to any Alertmanager.

## Impact

All alerts may be lost.

## Diagnosis

Check for connectivity issues between Prometheus and the Alertmanager cluster.
Check NetworkPolicies and network saturation.
Check whether Alertmanager is overloaded or short on resources.
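
A minimal sketch for checking which Alertmanagers Prometheus has discovered,
assuming the pod is named `prometheus-k8s-0` (adjust to your setup):

```shell
# Forward the Prometheus API to your workstation
$ kubectl -n <namespace> port-forward prometheus-k8s-0 9090 &

# List active and dropped Alertmanagers as seen by Prometheus
$ curl -s http://localhost:9090/api/v1/alertmanagers
```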

## Mitigation

Run multiple Alertmanager instances and spread them across nodes.
@@ -0,0 +1,24 @@
---
title: Prometheus Error Sending Alerts To Some Alertmanagers
weight: 20
---

# PrometheusErrorSendingAlertsToSomeAlertmanagers

## Meaning

Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.

## Impact

Some alerts may be lost.

## Diagnosis

Check for connectivity issues between Prometheus and the affected Alertmanager.
Check NetworkPolicies and network saturation.
Check whether Alertmanager is overloaded or short on resources.
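
To see which Alertmanager is affected, you can query the per-Alertmanager
error ratio (metric names taken from the upstream alert expression); a sketch
assuming the Prometheus API is reachable on `localhost:9090`, for example via
`kubectl port-forward`:

```shell
# Error ratio per Alertmanager over the last 5 minutes
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])'
```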

## Mitigation

Run multiple Alertmanager instances and spread them across nodes.
22 changes: 22 additions & 0 deletions content/runbooks/prometheus/PrometheusLabelLimitHit.md
@@ -0,0 +1,22 @@
---
title: Prometheus Label Limit Hit
weight: 20
---

# PrometheusLabelLimitHit

## Meaning

Prometheus has dropped targets because some scrape configs have exceeded the label limit.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Check the Prometheus UI targets page (:9090/targets) to see which targets are
being dropped and what scrape error they report.
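
A hedged sketch for finding the affected scrape pools via the metric used by
the upstream alert expression (verify the metric name on your Prometheus
version), assuming the API is reachable on `localhost:9090`:

```shell
# Scrape pools that have exceeded a label limit in the last hour
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=increase(prometheus_target_scrape_pool_exceeded_label_limits_total[1h]) > 0'
```
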
## Mitigation

Consider sharding Prometheus.
Increase the scrape interval so targets are scraped less frequently.
22 changes: 22 additions & 0 deletions content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
@@ -0,0 +1,22 @@
---
title: Prometheus Missing Rule Evaluations
weight: 20
---

# PrometheusMissingRuleEvaluations

## Meaning

Prometheus is missing rule evaluations due to slow rule group evaluation.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Check which rules fail and try to calculate them differently.
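
One way to find the slow groups is to compare each rule group's last
evaluation duration with its evaluation interval; a sketch assuming the
Prometheus API is reachable on `localhost:9090`, for example via
`kubectl port-forward`:

```shell
# Rule groups whose last evaluation took longer than their interval
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds'
```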

## Mitigation

Sometimes giving Prometheus more CPU is the only way to fix it.
@@ -0,0 +1,24 @@
---
title: Prometheus Not Connected To Alertmanagers
weight: 20
---

# PrometheusNotConnectedToAlertmanagers

## Meaning

Prometheus is not connected to any Alertmanagers.

## Impact

Sending alerts is not possible.

## Diagnosis

Check for connectivity issues between Prometheus and Alertmanager.
Check NetworkPolicies and network saturation.
Check whether Alertmanager is overloaded or short on resources.
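
A minimal sketch, assuming the operator-managed defaults (the
`alertmanager-operated` Service and standard labels); adjust names to your
setup:

```shell
# Are the Alertmanager pods running, and does the operator-managed Service have endpoints?
$ kubectl -n <namespace> get pods -l app.kubernetes.io/name=alertmanager
$ kubectl -n <namespace> get endpoints alertmanager-operated

# Does the Prometheus object reference an Alertmanager at all?
$ kubectl -n <namespace> get prometheus -o yaml | grep -A 5 alerting
```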

## Mitigation

Run multiple Alertmanager instances and spread them across nodes.
22 changes: 22 additions & 0 deletions content/runbooks/prometheus/PrometheusNotIngestingSamples.md
@@ -0,0 +1,22 @@
---
title: Prometheus Not Ingesting Samples
weight: 20
---

# PrometheusNotIngestingSamples

## Meaning

Prometheus is not ingesting samples.

## Impact

Missing metrics.

## Diagnosis

TODO

## Mitigation

TODO
@@ -0,0 +1,26 @@
---
title: Prometheus Notification Queue Running Full
weight: 20
---

# PrometheusNotificationQueueRunningFull

## Meaning

The Prometheus alert notification queue is predicted to run full in less than 30 minutes.

## Impact

Alerts may fail to be sent.

## Diagnosis

Check the Prometheus container logs for an explanation of which part of the
configuration is problematic.
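
You can also check how full the queue currently is; a sketch assuming the
Prometheus API is reachable on `localhost:9090`, for example via
`kubectl port-forward`:

```shell
# Fraction of the notification queue currently in use
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_notifications_queue_length / prometheus_notifications_queue_capacity'
```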

## Mitigation

Remove the conflicting configuration option.

Check if there is an option to decrease the number of alerts firing,
for example by sharding Prometheus.
@@ -1,3 +1,8 @@
---
title: Prometheus Out Of Order Timestamps
weight: 20
---

# PrometheusOutOfOrderTimestamps

More information: [Debugging out-of-order samples](https://www.robustperception.io/debugging-out-of-order-samples)
24 changes: 24 additions & 0 deletions content/runbooks/prometheus/PrometheusRemoteStorageFailures.md
@@ -0,0 +1,24 @@
---
title: Prometheus Remote Storage Failures
weight: 20
---

# PrometheusRemoteStorageFailures

## Meaning

Prometheus fails to send samples to remote storage.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Check the Prometheus logs and the remote storage logs.
Investigate network issues.
Check configuration and credentials.
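
For example, assuming the pod is named `prometheus-k8s-0` (adjust to your
setup), remote write problems usually show up in the Prometheus logs:

```shell
# Remote write errors are logged by the remote storage queue manager
$ kubectl -n <namespace> logs prometheus-k8s-0 -c prometheus | grep -i "remote"
```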

## Mitigation

TODO
29 changes: 29 additions & 0 deletions content/runbooks/prometheus/PrometheusRemoteWriteBehind.md
@@ -0,0 +1,29 @@
---
title: Prometheus Remote Write Behind
weight: 20
---

# PrometheusRemoteWriteBehind

## Meaning

Prometheus remote write is behind.

## Impact

Metrics and alerts may be missing or inaccurate.
Increased data lag between locations.

## Diagnosis

Check the Prometheus logs and the remote storage logs.
Investigate network issues.
Check configuration and credentials.
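
To see how far behind remote write is, you can evaluate the expression used by
the upstream alert, without its threshold; a sketch assuming the Prometheus
API is reachable on `localhost:9090` (verify the metric names on your
Prometheus version):

```shell
# Seconds by which remote write lags behind the newest ingested sample, per remote
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_remote_storage_highest_timestamp_in_seconds - ignoring(remote_name, url) group_right prometheus_remote_storage_queue_highest_sent_timestamp_seconds'
```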

## Mitigation

Probably the amount of data sent to the remote system is too high
for the available network bandwidth.
You may need to limit which metrics are sent to minimize transfers.

See [Prometheus Remote Storage Failures]({{< ref "./PrometheusRemoteStorageFailures.md" >}})
33 changes: 33 additions & 0 deletions content/runbooks/prometheus/PrometheusRemoteWriteDesiredShards.md
@@ -0,0 +1,33 @@
---
title: Prometheus Remote Write Desired Shards
weight: 20
---

# PrometheusRemoteWriteDesiredShards

## Meaning

The Prometheus remote write desired shards calculation wants to run
more shards than the configured maximum.


## Impact

Metrics and alerts may be missing or inaccurate.


## Diagnosis

Check metrics cardinality.

Check the Prometheus logs and the remote storage logs.
Investigate network issues.
Check configuration and credentials.
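
To confirm, compare the desired shard count with the configured maximum (the
metrics used by the upstream alert expression); a sketch assuming the
Prometheus API is reachable on `localhost:9090`:

```shell
# Remotes where the desired number of shards exceeds the configured maximum
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_remote_storage_shards_desired > prometheus_remote_storage_shards_max'
```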

## Mitigation

Probably the amount of data sent to the remote system is too high
for the available network bandwidth.
You may need to limit which metrics are sent to minimize transfers.

See [Prometheus Remote Storage Failures]({{< ref "./PrometheusRemoteStorageFailures.md" >}})
26 changes: 24 additions & 2 deletions content/runbooks/prometheus/PrometheusRuleFailures.md
@@ -1,5 +1,27 @@
---
title: Prometheus Rule Failures
weight: 20
---

# PrometheusRuleFailures

## Meaning

Prometheus is failing rule evaluations.
Rules may be incorrect or failing to evaluate.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Your best starting point is the rules page of the Prometheus UI (:9090/rules).
It will show the error.

You can also evaluate the rule expression yourself, using the UI, or maybe
using PromLens to help debug expression issues.
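
The same information is available from the rules API; a sketch assuming the
Prometheus API is reachable on `localhost:9090` (for example via
`kubectl port-forward`) and `jq` is installed:

```shell
# List rules whose evaluation is failing, together with their last error
$ curl -s http://localhost:9090/api/v1/rules \
    | jq '.data.groups[].rules[] | select(.health == "err") | {name, lastError}'
```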

## Mitigation

Fix the failing rules.