[FEATURE] Start GetMetricData calls before all ListMetrics calls finish #1094

Open
1 task done
kgeckhart opened this issue Aug 6, 2023 · 0 comments
Labels
enhancement New feature or request

kgeckhart commented Aug 6, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Feature description

After the most recent batch of performance fixes, discovery jobs which pull metrics from large AWS environments have very reasonable resource utilization 🎉. The current limitation that can still cause longer scrape times is that all ListMetrics calls must complete before any GetMetricData calls are made. Since the APIs have independent rate limits (ListMetrics is 25 TPS and GetMetricData is ~50 TPS), we can safely start calling GetMetricData before ListMetrics completes.

I think the most idiomatic Go way of going about this is to introduce channels to runDiscoveryJob. The challenge is that the current code is really complex and relatively untestable. I think it needs to be refactored first to dramatically reduce the risk of such an impactful change. I would like to start by decomposing the main steps of runDiscoveryJob into smaller composable/testable "dataflows", listed below:

  1. GetResources
  2. ListMetrics
  3. AssociateMetricsToResources
  4. GetMetricData

At this point we should be able to have solid test coverage across the complex logic used by each flow and confidence that runDiscoveryJob flows the data between them appropriately. After this, introducing channels should hopefully be as simple as introducing a new strategy for how runDiscoveryJob composes the flow of data, which can be gated behind a feature flag. This level of decoupling will make it much easier to test the complex cases channels require, like shutdown and error propagation.

If this pattern works out well, I think it should be adaptable to reduce the amount of copied code that CustomNamespace requires. A CustomNamespace job should be a composition of the ListMetrics and GetMetricData dataflows.

What might the configuration look like?

Ideally, no configuration changes are required
