Merge pull request prometheus-operator#15 from nvtkaszpir/runbooks-prometheus
paulfantom authored Feb 13, 2023
2 parents a6bc8ea + b786728 commit ce0a5fb
Showing 18 changed files with 393 additions and 13 deletions.
21 changes: 17 additions & 4 deletions content/runbooks/prometheus/PrometheusBadConfig.md
@@ -1,17 +1,30 @@
---
title: Prometheus Bad Config
weight: 20
---

# PrometheusBadConfig

## Meaning

This alert fires when Prometheus cannot successfully reload its configuration
file because the file has incorrect content.

## Impact

The configuration cannot be reloaded, so Prometheus keeps operating with the
last known good configuration.
Configuration changes in any Prometheus, Probe, PodMonitor,
or ServiceMonitor objects may not be picked up by the Prometheus server.

## Diagnosis

Check the Prometheus container logs for an explanation of which part of the
configuration is problematic.

This usually happens when ServiceMonitors or PodMonitors share the same
job label.
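
For example, a minimal check, assuming the pod is named `prometheus-k8s-0` and
the config-reloader sidecar uses the default container name from recent
prometheus-operator releases (adjust both to your setup):

```shell
# Look for configuration-related errors reported by the Prometheus container
$ kubectl -n <namespace> logs prometheus-k8s-0 -c prometheus | grep -i "config"

# The config-reloader sidecar usually logs the exact parse error too
$ kubectl -n <namespace> logs prometheus-k8s-0 -c config-reloader | tail -n 20
```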

## Mitigation

Remove the conflicting configuration option.
21 changes: 16 additions & 5 deletions content/runbooks/prometheus/PrometheusDuplicateTimestamps.md
@@ -1,23 +1,34 @@
---
title: Prometheus Duplicate Timestamps
weight: 20
---

# PrometheusDuplicateTimestamps

Find the Prometheus Pod that this alert concerns.

```shell
$ kubectl -n <namespace> get pod
prometheus-k8s-0 2/2 Running 1 122m
prometheus-k8s-1 2/2 Running 1 122m
```

Look at the logs of each of them; there should be a log line such as:

```shell
$ kubectl -n <namespace> logs prometheus-k8s-0
level=warn ts=2021-01-04T15:08:55.613Z caller=scrape.go:1372 component="scrape manager" scrape_pool=default/main-ingress-nginx-controller/0 target=http://10.0.7.3:10254/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=16
```

Now there is a judgement call to make; this could be the result of:

* Faulty configuration. This can be resolved by removing the offending
  `ServiceMonitor` or `PodMonitor` object, identified through the
  `scrape_pool` label in the log line, which has the format
  `<namespace>/<service-monitor-name>/<endpoint-id>`
  (see the example after this list).

* The target is reporting faulty data. Sometimes this can be resolved by
  restarting the target; otherwise it may need to be fixed in the code of the
  offending application.
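
As an example, not a definitive fix, using the `scrape_pool` value from the
log line above (`default/main-ingress-nginx-controller/0`), you could inspect
and, if appropriate, remove the offending object:

```shell
# scrape_pool=<namespace>/<service-monitor-name>/<endpoint-id>
$ kubectl -n default get servicemonitor main-ingress-nginx-controller -o yaml

# Only if the configuration really is at fault:
$ kubectl -n default delete servicemonitor main-ingress-nginx-controller
```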

Further reading: [Debugging out-of-order samples](https://www.robustperception.io/debugging-out-of-order-samples)
@@ -0,0 +1,24 @@
---
title: Prometheus Error Sending Alerts To Any Alertmanager
weight: 20
---

# PrometheusErrorSendingAlertsToAnyAlertmanager

## Meaning

Prometheus has encountered errors sending alerts to any Alertmanager.

## Impact

All alerts may be lost.

## Diagnosis

Check for connectivity issues between Prometheus and the Alertmanager cluster.
Check NetworkPolicies and network saturation.
Check whether Alertmanager is overloaded or short on resources.
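
A minimal sketch for checking which Alertmanagers Prometheus has discovered,
assuming the pod is named `prometheus-k8s-0` (adjust to your setup):

```shell
# Forward the Prometheus API to your workstation
$ kubectl -n <namespace> port-forward prometheus-k8s-0 9090 &

# List active and dropped Alertmanagers as seen by Prometheus
$ curl -s http://localhost:9090/api/v1/alertmanagers
```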

## Mitigation

Run multiple Alertmanager instances and spread them across nodes.
@@ -0,0 +1,24 @@
---
title: Prometheus Error Sending Alerts To Some Alertmanagers
weight: 20
---

# PrometheusErrorSendingAlertsToSomeAlertmanagers

## Meaning

Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.

## Impact

Some alerts may be lost.

## Diagnosis

Check for connectivity issues between Prometheus and the affected Alertmanager.
Check NetworkPolicies and network saturation.
Check whether Alertmanager is overloaded or short on resources.
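
To see which Alertmanager is affected, you can query the per-Alertmanager
error ratio (metric names taken from the upstream alert expression); a sketch
assuming the Prometheus API is reachable on `localhost:9090`, for example via
`kubectl port-forward`:

```shell
# Error ratio per Alertmanager over the last 5 minutes
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])'
```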

## Mitigation

Run multiple Alertmanager instances and spread them across nodes.
22 changes: 22 additions & 0 deletions content/runbooks/prometheus/PrometheusLabelLimitHit.md
@@ -0,0 +1,22 @@
---
title: Prometheus Label Limit Hit
weight: 20
---

# PrometheusLabelLimitHit

## Meaning

Prometheus has dropped targets because some scrape configs have exceeded the label limit.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Check the Prometheus UI targets page (:9090/targets) to see which targets are
being dropped and what scrape error they report.
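
A hedged sketch for finding the affected scrape pools via the metric used by
the upstream alert expression (verify the metric name on your Prometheus
version), assuming the API is reachable on `localhost:9090`:

```shell
# Scrape pools that have exceeded a label limit in the last hour
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=increase(prometheus_target_scrape_pool_exceeded_label_limits_total[1h]) > 0'
```
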
## Mitigation

Consider sharding Prometheus.
Increase the scrape interval so targets are scraped less frequently.
22 changes: 22 additions & 0 deletions content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
@@ -0,0 +1,22 @@
---
title: Prometheus Missing Rule Evaluations
weight: 20
---

# PrometheusMissingRuleEvaluations

## Meaning

Prometheus is missing rule evaluations due to slow rule group evaluation.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Check which rules fail and try to calculate them differently.
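
One way to find the slow groups is to compare each rule group's last
evaluation duration with its evaluation interval; a sketch assuming the
Prometheus API is reachable on `localhost:9090`, for example via
`kubectl port-forward`:

```shell
# Rule groups whose last evaluation took longer than their interval
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds'
```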

## Mitigation

Sometimes giving Prometheus more CPU is the only way to fix it.
@@ -0,0 +1,24 @@
---
title: Prometheus Not Connected To Alertmanagers
weight: 20
---

# PrometheusNotConnectedToAlertmanagers

## Meaning

Prometheus is not connected to any Alertmanagers.

## Impact

Sending alerts is not possible.

## Diagnosis

Check for connectivity issues between Prometheus and Alertmanager.
Check NetworkPolicies and network saturation.
Check whether Alertmanager is overloaded or short on resources.
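
A minimal sketch, assuming the operator-managed defaults (the
`alertmanager-operated` Service and standard labels); adjust names to your
setup:

```shell
# Are the Alertmanager pods running, and does the operator-managed Service have endpoints?
$ kubectl -n <namespace> get pods -l app.kubernetes.io/name=alertmanager
$ kubectl -n <namespace> get endpoints alertmanager-operated

# Does the Prometheus object reference an Alertmanager at all?
$ kubectl -n <namespace> get prometheus -o yaml | grep -A 5 alerting
```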

## Mitigation

Run multiple Alertmanager instances and spread them across nodes.
22 changes: 22 additions & 0 deletions content/runbooks/prometheus/PrometheusNotIngestingSamples.md
@@ -0,0 +1,22 @@
---
title: Prometheus Not Ingesting Samples
weight: 20
---

# PrometheusNotIngestingSamples

## Meaning

Prometheus is not ingesting samples.

## Impact

Missing metrics.

## Diagnosis

TODO

## Mitigation

TODO
@@ -0,0 +1,26 @@
---
title: Prometheus Notification Queue Running Full
weight: 20
---

# PrometheusNotificationQueueRunningFull

## Meaning

The Prometheus alert notification queue is predicted to run full in less than 30 minutes.

## Impact

Alerts may fail to be sent.

## Diagnosis

Check the Prometheus container logs for an explanation of which part of the
configuration is problematic.
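
You can also check how full the queue currently is; a sketch assuming the
Prometheus API is reachable on `localhost:9090`, for example via
`kubectl port-forward`:

```shell
# Fraction of the notification queue currently in use
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_notifications_queue_length / prometheus_notifications_queue_capacity'
```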

## Mitigation

Remove the conflicting configuration option.

Check if there is an option to decrease the number of alerts firing,
for example by sharding Prometheus.
@@ -1,3 +1,8 @@
---
title: Prometheus Out Of Order Timestamps
weight: 20
---

# PrometheusOutOfOrderTimestamps

More information: [Debugging out-of-order samples](https://www.robustperception.io/debugging-out-of-order-samples)
24 changes: 24 additions & 0 deletions content/runbooks/prometheus/PrometheusRemoteStorageFailures.md
@@ -0,0 +1,24 @@
---
title: Prometheus Remote Storage Failures
weight: 20
---

# PrometheusRemoteStorageFailures

## Meaning

Prometheus fails to send samples to remote storage.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Check the Prometheus logs and the remote storage logs.
Investigate network issues.
Check configuration and credentials.
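
For example, assuming the pod is named `prometheus-k8s-0` (adjust to your
setup), remote write problems usually show up in the Prometheus logs:

```shell
# Remote write errors are logged by the remote storage queue manager
$ kubectl -n <namespace> logs prometheus-k8s-0 -c prometheus | grep -i "remote"
```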

## Mitigation

TODO
29 changes: 29 additions & 0 deletions content/runbooks/prometheus/PrometheusRemoteWriteBehind.md
@@ -0,0 +1,29 @@
---
title: Prometheus Remote Write Behind
weight: 20
---

# PrometheusRemoteWriteBehind

## Meaning

Prometheus remote write is behind.

## Impact

Metrics and alerts may be missing or inaccurate.
Increased data lag between locations.

## Diagnosis

Check the Prometheus logs and the remote storage logs.
Investigate network issues.
Check configuration and credentials.
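
To see how far behind remote write is, you can evaluate the expression used by
the upstream alert, without its threshold; a sketch assuming the Prometheus
API is reachable on `localhost:9090` (verify the metric names on your
Prometheus version):

```shell
# Seconds by which remote write lags behind the newest ingested sample, per remote
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_remote_storage_highest_timestamp_in_seconds - ignoring(remote_name, url) group_right prometheus_remote_storage_queue_highest_sent_timestamp_seconds'
```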

## Mitigation

Probably the amount of data sent to the remote system is too high
for the available network bandwidth.
You may need to limit which metrics are sent to minimize transfers.

See [Prometheus Remote Storage Failures]({{< ref "./PrometheusRemoteStorageFailures.md" >}})
33 changes: 33 additions & 0 deletions content/runbooks/prometheus/PrometheusRemoteWriteDesiredShards.md
@@ -0,0 +1,33 @@
---
title: Prometheus Remote Write Desired Shards
weight: 20
---

# PrometheusRemoteWriteDesiredShards

## Meaning

The Prometheus remote write desired shards calculation wants to run
more shards than the configured maximum.


## Impact

Metrics and alerts may be missing or inaccurate.


## Diagnosis

Check metrics cardinality.

Check the Prometheus logs and the remote storage logs.
Investigate network issues.
Check configuration and credentials.
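
To confirm, compare the desired shard count with the configured maximum (the
metrics used by the upstream alert expression); a sketch assuming the
Prometheus API is reachable on `localhost:9090`:

```shell
# Remotes where the desired number of shards exceeds the configured maximum
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=prometheus_remote_storage_shards_desired > prometheus_remote_storage_shards_max'
```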

## Mitigation

Probably the amount of data sent to the remote system is too high
for the available network bandwidth.
You may need to limit which metrics are sent to minimize transfers.

See [Prometheus Remote Storage Failures]({{< ref "./PrometheusRemoteStorageFailures.md" >}})
26 changes: 24 additions & 2 deletions content/runbooks/prometheus/PrometheusRuleFailures.md
@@ -1,5 +1,27 @@
---
title: Prometheus Rule Failures
weight: 20
---

# PrometheusRuleFailures

## Meaning

Prometheus is failing rule evaluations.
Rules may be incorrect or failing to evaluate.

## Impact

Metrics and alerts may be missing or inaccurate.

## Diagnosis

Your best starting point is the rules page of the Prometheus UI (:9090/rules).
It will show the error.

You can also evaluate the rule expression yourself, using the UI, or maybe
using PromLens to help debug expression issues.
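
The same information is available from the rules API; a sketch assuming the
Prometheus API is reachable on `localhost:9090` (for example via
`kubectl port-forward`) and `jq` is installed:

```shell
# List rules whose evaluation is failing, together with their last error
$ curl -s http://localhost:9090/api/v1/rules \
    | jq '.data.groups[].rules[] | select(.health == "err") | {name, lastError}'
```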

## Mitigation

Fix the failing rules.