forked from prometheus-operator/runbooks
Merge pull request prometheus-operator#15 from nvtkaszpir/runbooks-pr…
…ometheus
Showing 18 changed files with 393 additions and 13 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,17 +1,30 @@ | ||
--- | ||
title: Prometheus Bad Config | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusBadConfig | ||
|
||
## Meaning | ||
|
||
Alert fires when Prometheus cannot successfully reload the configuration file due to the file having incorrect content. | ||
Alert fires when Prometheus cannot successfully reload the configuration file | ||
due to the file having incorrect content. | ||
|
||
## Impact | ||
|
||
Configuration cannot be reloaded and prometheus operates with last known good configuration. Configuration changes in any of Prometheus, Probe, PodMonitor, or ServiceMonitor objects may not be picked up by prometheus server. | ||
Configuration cannot be reloaded and prometheus operates with last known good | ||
configuration. | ||
Configuration changes in any of Prometheus, Probe, PodMonitor, | ||
or ServiceMonitor objects may not be picked up by prometheus server. | ||
|
||
## Diagnosis | ||
|
||
Check prometheus container logs for an explanation of which part of the configuration is problematic. Usually this can occur when ServiceMonitors or PodMonitors share the same job label. | ||
Check prometheus container logs for an explanation of which part of the | ||
configuration is problematic. | ||
|
||
Usually this can occur when ServiceMonitors or | ||
PodMonitors share the same job label. | ||
|
||
## Mitigation | ||
|
||
Remove conflicting configuration option. | ||
Remove conflicting configuration option. |
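
As a rough sketch of the Diagnosis step above, the helper below prints only the error-level lines from the Prometheus container. The function name `prometheus_error_logs` is hypothetical, and both the container name `prometheus` and the `level=error` log key are assumptions about a typical prometheus-operator deployment.

```shell
# Sketch: show error-level log lines from the "prometheus" container.
# $1 = namespace, $2 = pod name (for example prometheus-k8s-0).
# Assumption: the container is named "prometheus" and logs use a level= key.
prometheus_error_logs() {
  namespace="$1"
  pod="$2"
  # -i matches both level=error and level=ERROR spellings.
  kubectl -n "$namespace" logs "$pod" -c prometheus | grep -i 'level=error'
}
```

It could be called as `prometheus_error_logs <namespace> prometheus-k8s-0`. An empty result means no error-level lines were found in the current log.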
21 changes: 16 additions & 5 deletions
content/runbooks/prometheus/PrometheusDuplicateTimestamps.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,23 +1,34 @@ | ||
--- | ||
title: Prometheus Duplicate Timestamps | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusDuplicateTimestamps | ||
|
||
Find the Prometheus Pod that the alert refers to. | ||
|
||
```bash | ||
```shell | ||
$ kubectl -n <namespace> get pod | ||
prometheus-k8s-0 2/2 Running 1 122m | ||
prometheus-k8s-1 2/2 Running 1 122m | ||
``` | ||
|
||
Look at the logs of each of them; there should be a log line such as: | ||
|
||
```bash | ||
```shell | ||
$ kubectl -n <namespace> logs prometheus-k8s-0 | ||
level=warn ts=2021-01-04T15:08:55.613Z caller=scrape.go:1372 component="scrape manager" scrape_pool=default/main-ingress-nginx-controller/0 target=http://10.0.7.3:10254/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=16 | ||
``` | ||
|
||
Now there is a judgement call to make. This could be the result of: | ||
|
||
* Faulty configuration, which could be resolved by removing the offending `ServiceMonitor` or `PodMonitor` object, which can be identified through the `scrape_pool` label in the log line, which is in the format of `<namespace>/<service-monitor-name>/<endpoint-id>`. | ||
* The target is reporting faulty data, sometimes this can be resolved by restarting the target, or it might need to be fixed in code of the offending application. | ||
* Faulty configuration, which could be resolved by removing the offending | ||
`ServiceMonitor` or `PodMonitor` object, which can be identified through | ||
the `scrape_pool` label in the log line, which is in the format of | ||
`<namespace>/<service-monitor-name>/<endpoint-id>`. | ||
|
||
* The target is reporting faulty data, sometimes this can be resolved by | ||
restarting the target, or it might need to be fixed in code of the offending | ||
application. | ||
|
||
Further reading: https://www.robustperception.io/debugging-out-of-order-samples | ||
Further reading [blog](https://www.robustperception.io/debugging-out-of-order-samples) |
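
The `scrape_pool` lookup described in the list above can be sketched as follows. The log line is the runbook's sample, trimmed for brevity. The variable names are illustrative, and the split assumes the `<namespace>/<service-monitor-name>/<endpoint-id>` format stated in the runbook.

```shell
# Sample log line from the runbook, trimmed; only scrape_pool matters here.
log_line='level=warn caller=scrape.go:1372 component="scrape manager" scrape_pool=default/main-ingress-nginx-controller/0 msg="Error on ingesting samples with different value but same timestamp" num_dropped=16'

# Extract the value of the scrape_pool key (everything up to the next space).
scrape_pool="$(printf '%s\n' "$log_line" | sed -n 's/.*scrape_pool=\([^ ]*\).*/\1/p')"

# Split <namespace>/<service-monitor-name>/<endpoint-id> into its parts.
namespace="${scrape_pool%%/*}"   # text before the first slash
rest="${scrape_pool#*/}"         # text after the first slash
monitor="${rest%/*}"             # text before the last slash
endpoint="${rest##*/}"           # text after the last slash

# Print the command that would show the offending object.
echo "kubectl -n $namespace get servicemonitor $monitor -o yaml"
```

If the object turns out to be a `PodMonitor`, replace `servicemonitor` with `podmonitor` in the printed command.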
24 changes: 24 additions & 0 deletions
content/runbooks/prometheus/PrometheusErrorSendingAlertsToAnyAlertmanager.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
--- | ||
title: Prometheus Error Sending Alerts To Any Alertmanager | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusErrorSendingAlertsToAnyAlertmanager | ||
|
||
## Meaning | ||
|
||
Prometheus has encountered errors sending alerts to any Alertmanager. | ||
|
||
## Impact | ||
|
||
All alerts may be lost. | ||
|
||
## Diagnosis | ||
|
||
Check for connectivity issues between Prometheus and the Alertmanager cluster. | ||
Check NetworkPolicies and network saturation. | ||
Check whether Alertmanager is overloaded or lacks resources. | ||
|
||
## Mitigation | ||
|
||
Run multiple Alertmanager instances and spread them across nodes. |
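
One way to start the connectivity check is to probe Alertmanager's `/-/healthy` endpoint. This is a minimal sketch: the function name `alertmanager_healthy` is hypothetical, and the base URL is an assumption about how Alertmanager is reachable from wherever the command runs.

```shell
# Sketch: succeed (exit 0) when Alertmanager answers its health endpoint.
# $1 = base URL, for example http://localhost:9093 behind a port-forward
# (an assumption; use whatever address applies in your cluster).
alertmanager_healthy() {
  # -f fails on HTTP errors, --max-time bounds a hanging connection.
  curl -fsS --max-time 5 -o /dev/null "$1/-/healthy"
}
```

A success only proves reachability from the machine running the check. To test the path that Prometheus itself uses, the same request has to be made from inside the Prometheus Pod's network.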
24 changes: 24 additions & 0 deletions
content/runbooks/prometheus/PrometheusErrorSendingAlertsToSomeAlertmanagers.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
--- | ||
title: Prometheus Error Sending Alerts To Some Alertmanagers | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusErrorSendingAlertsToSomeAlertmanagers | ||
|
||
## Meaning | ||
|
||
Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. | ||
|
||
## Impact | ||
|
||
Some alerts may be lost. | ||
|
||
## Diagnosis | ||
|
||
Check for connectivity issues between Prometheus and Alertmanager. | ||
Check NetworkPolicies and network saturation. | ||
Check whether Alertmanager is overloaded or lacks resources. | ||
|
||
## Mitigation | ||
|
||
Run multiple Alertmanager instances and spread them across nodes. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
--- | ||
title: Prometheus Label Limit Hit | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusLabelLimitHit | ||
|
||
## Meaning | ||
|
||
Prometheus has dropped targets because some scrape configs have exceeded the labels limit. | ||
|
||
## Impact | ||
|
||
Metrics and alerts may be missing or inaccurate. | ||
|
||
## Diagnosis | ||
|
||
|
||
## Mitigation | ||
|
||
Consider sharding Prometheus. | ||
Increase the scrape interval so that scrapes run less frequently. |
22 changes: 22 additions & 0 deletions
content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
--- | ||
title: Prometheus Missing Rule Evaluations | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusMissingRuleEvaluations | ||
|
||
## Meaning | ||
|
||
Prometheus is missing rule evaluations due to slow rule group evaluation. | ||
|
||
## Impact | ||
|
||
Metrics and alerts may be missing or inaccurate. | ||
|
||
## Diagnosis | ||
|
||
Check which rules fail and try to calculate them differently. | ||
|
||
## Mitigation | ||
|
||
Sometimes giving more CPU is the only way to fix it. |
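
To find slow rule groups, one option is to compare each group's last evaluation duration with its interval through the Prometheus HTTP query API. This is a sketch under assumptions: the function name `promql_query` is hypothetical, and the base URL depends on how Prometheus is exposed.

```shell
# Sketch: run an instant PromQL query against the Prometheus HTTP API.
# $1 = Prometheus base URL (assumption, for example http://localhost:9090),
# $2 = PromQL expression.
promql_query() {
  # --get sends the URL-encoded data as a query string instead of a POST body.
  curl -fsS --get --data-urlencode "query=$2" "$1/api/v1/query"
}
```

For example, `promql_query http://localhost:9090 'prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds'` should list rule groups whose last evaluation took longer than their configured interval.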
24 changes: 24 additions & 0 deletions
content/runbooks/prometheus/PrometheusNotConnectedToAlertmanagers.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
--- | ||
title: Prometheus Not Connected To Alertmanagers | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusNotConnectedToAlertmanagers | ||
|
||
## Meaning | ||
|
||
Prometheus is not connected to any Alertmanagers. | ||
|
||
## Impact | ||
|
||
Sending alerts is not possible. | ||
|
||
## Diagnosis | ||
|
||
Check for connectivity issues between Prometheus and Alertmanager. | ||
Check NetworkPolicies and network saturation. | ||
Check whether Alertmanager is overloaded or lacks resources. | ||
|
||
## Mitigation | ||
|
||
Run multiple Alertmanager instances and spread them across nodes. |
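
Prometheus reports which Alertmanagers it has discovered through its `/api/v1/alertmanagers` endpoint, which can help tell a discovery problem from a network problem. A minimal sketch, in which the function name `prometheus_alertmanagers` and the base URL are assumptions:

```shell
# Sketch: list the Alertmanagers that Prometheus currently knows about.
# $1 = Prometheus base URL (assumption, for example http://localhost:9090).
prometheus_alertmanagers() {
  # The JSON response contains activeAlertmanagers and droppedAlertmanagers.
  curl -fsS "$1/api/v1/alertmanagers"
}
```

An empty `activeAlertmanagers` list suggests that discovery or configuration is at fault, not the network path.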
22 changes: 22 additions & 0 deletions
content/runbooks/prometheus/PrometheusNotIngestingSamples.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
--- | ||
title: Prometheus Not Ingesting Samples | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusNotIngestingSamples | ||
|
||
## Meaning | ||
|
||
Prometheus is not ingesting samples. | ||
|
||
## Impact | ||
|
||
Missing metrics. | ||
|
||
## Diagnosis | ||
|
||
TODO | ||
|
||
## Mitigation | ||
|
||
TODO |
26 changes: 26 additions & 0 deletions
content/runbooks/prometheus/PrometheusNotificationQueueRunningFull.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
--- | ||
title: Prometheus Notification Queue Running Full | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusNotificationQueueRunningFull | ||
|
||
## Meaning | ||
|
||
The Prometheus alert notification queue is predicted to run full in less than 30m. | ||
|
||
## Impact | ||
|
||
Alerts may fail to be sent. | ||
|
||
## Diagnosis | ||
|
||
Check prometheus container logs for an explanation of which part of the | ||
configuration is problematic. | ||
|
||
## Mitigation | ||
|
||
Remove conflicting configuration option. | ||
|
||
Check if there is an option to decrease number of alerts firing, | ||
for example by sharding prometheus. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,8 @@ | ||
--- | ||
title: Prometheus Out Of Order Timestamps | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusOutOfOrderTimestamps | ||
|
||
More information in https://www.robustperception.io/debugging-out-of-order-samples | ||
More information in [blog](https://www.robustperception.io/debugging-out-of-order-samples) |
24 changes: 24 additions & 0 deletions
content/runbooks/prometheus/PrometheusRemoteStorageFailures.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
--- | ||
title: Prometheus Remote Storage Failures | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusRemoteStorageFailures | ||
|
||
## Meaning | ||
|
||
Prometheus fails to send samples to remote storage. | ||
|
||
## Impact | ||
|
||
Metrics and alerts may be missing or inaccurate. | ||
|
||
## Diagnosis | ||
|
||
Check prometheus logs and remote storage logs. | ||
Investigate network issues. | ||
Check configs and credentials. | ||
|
||
## Mitigation | ||
|
||
TODO |
29 changes: 29 additions & 0 deletions
content/runbooks/prometheus/PrometheusRemoteWriteBehind.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
--- | ||
title: Prometheus Remote Write Behind | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusRemoteWriteBehind | ||
|
||
## Meaning | ||
|
||
Prometheus remote write is behind. | ||
|
||
## Impact | ||
|
||
Metrics and alerts may be missing or inaccurate. | ||
Increased data lag between locations. | ||
|
||
## Diagnosis | ||
|
||
Check prometheus logs and remote storage logs. | ||
Investigate network issues. | ||
Check configs and credentials. | ||
|
||
## Mitigation | ||
|
||
The amount of data sent to the remote system is probably too high | ||
for the given network connectivity speed. | ||
You may need to limit which metrics are sent to minimize transfers. | ||
|
||
See [Prometheus Remote Storage Failures]({{< ref "./PrometheusRemoteStorageFailures.md" >}}) |
33 changes: 33 additions & 0 deletions
content/runbooks/prometheus/PrometheusRemoteWriteDesiredShards.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
--- | ||
title: Prometheus Remote Write Desired Shards | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusRemoteWriteDesiredShards | ||
|
||
## Meaning | ||
|
||
Prometheus remote write desired shards calculation wants to run | ||
more than configured max shards. | ||
|
||
|
||
## Impact | ||
|
||
Metrics and alerts may be missing or inaccurate. | ||
|
||
|
||
## Diagnosis | ||
|
||
Check metrics cardinality. | ||
|
||
Check prometheus logs and remote storage logs. | ||
Investigate network issues. | ||
Check configs and credentials. | ||
|
||
## Mitigation | ||
|
||
The amount of data sent to the remote system is probably too high | ||
for the given network connectivity speed. | ||
You may need to limit which metrics are sent to minimize transfers. | ||
|
||
See [Prometheus Remote Storage Failures]({{< ref "./PrometheusRemoteStorageFailures.md" >}}) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,27 @@ | ||
--- | ||
title: Prometheus Rule Failures | ||
weight: 20 | ||
--- | ||
|
||
# PrometheusRuleFailures | ||
|
||
Your best starting point is the rules page of the Prometheus UI (:9090/rules). It will show the error. | ||
## Meaning | ||
|
||
Prometheus is failing rule evaluations. | ||
Prometheus rules are incorrect or failed to calculate. | ||
|
||
## Impact | ||
|
||
Metrics and alerts may be missing or inaccurate. | ||
|
||
## Diagnosis | ||
|
||
Your best starting point is the rules page of the Prometheus UI (:9090/rules). | ||
It will show the error. | ||
|
||
You can also evaluate the rule expression yourself, using the UI, or maybe | ||
using PromLens to help debug expression issues. | ||
|
||
## Mitigation | ||
|
||
You can also evaluate the rule expression yourself, using the UI, or maybe using PromLens to help debug expression issues. | ||
Fix rules. |
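
The same errors that the rules page shows are also exposed by the `/api/v1/rules` endpoint. The sketch below pulls out the `lastError` fields with plain `grep`. The function name `prometheus_failing_rules` is hypothetical, the base URL is an assumption, and the pattern assumes compact JSON output, so a JSON-aware tool is a more robust choice when one is available.

```shell
# Sketch: print the lastError fields reported by the rules API.
# $1 = Prometheus base URL (assumption, for example http://localhost:9090).
# Assumption: the response is compact JSON with no space after the colon.
prometheus_failing_rules() {
  # -o prints only the matching part; long messages containing escaped
  # quotes are cut at the first quote.
  curl -fsS "$1/api/v1/rules" | grep -o '"lastError":"[^"]*"'
}
```

No output means that no rule currently reports an error through the API.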