Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run ListMetrics calls with goroutines. #53

Draft
wants to merge 13 commits into
base: live
Choose a base branch
from

Conversation

cristiangreco
Copy link

Execute ListMetrics calls in separate goroutines (one for each metric),
in a similar way to how GetMetricData requests are handled.

This change removes semaphore usage (it's not used in GetMetricData as
well) for the sake of making it easier to reason about the code. It can
easily be added back in case we hit any issue around e.g. rate limits.

In local tests, the speedup seems to be particularly effective when
requesting 4 or more metrics in parallel (e.g. especially with EC2/EBS).

kgeckhart and others added 13 commits October 7, 2021 08:24
Eliminate global labelMap and build a labelSet for metrics in ensureLabelConsistencyForMetrics
* Pass logger to the different structs

* Use a builder for the AWS services

* Use job specific loggers to propagate properties
* Merge latest release 0.34.0-alpha

* Add missing metric
* Move AWS metrics back to prometheus.go and expose them as array in update.go

* The go workflow should run on PR's for live

* Remove unused field
Execute ListMetrics calls in separate goroutines (one for each metric),
in a similar way to how GetMetricData requests are handled.

This change removes semaphore usage (it's not used in GetMetricData as
well) for the sake of making it easier to reason about the code. It can
easily be added back in case we hit any issue around e.g. rate limits.

In local tests, the speedup seems to be particularly effective when
requesting 4 or more metrics in parallel (e.g. especially with EC2/EBS).
Copy link

@kgeckhart kgeckhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change removes semaphore usage (it's not used in GetMetricData as well) for the sake of making it easier to reason about the code. It can easily be added back in case we hit any issue around e.g. rate limits.

Slight concern, errors aren't propagated out of the parallel sections of the code ATM so we might want to make sure we are monitoring errors a little closer after this.

Copy link

@IfSentient IfSentient left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about ratelimiting from AWS in the case of large numbers of metrics?

metricsList, err := getFullMetricsList(ctx, svc.Namespace, m, clientCloudwatch)

if err != nil {
level.Error(logger).Log("msg", "Failed to get full metric list", "err", err, "metric_name", m.Name, "namespace", svc.Namespace)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add a prom metrics to track both this errors and the scenario below with zero resources?

Copy link

@ferruvich ferruvich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we roll this out to a single cluster first to observe any impact?

LGTM as we're going to do so

@CLAassistant
Copy link

CLAassistant commented Jun 15, 2022

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants