Adds provisioner limits based on cpu and mem #817

suket22 · 2021-11-19T17:12:26Z

1. Issue, if available:
N/A

2. Description of changes:
Adds limits to the Karpenter provisioner based on CPU and Memory. With the current implementation these are soft limits and aren't strictly enforced.

The way the limiting works is that we keep launching worker nodes as long as we have any amount of capacity left. Only once we've actually hit or crossed our limits do we stop provisioning. It's a pretty naive implementation, and can be problematic sometimes.

Scenario 1 -
You've got 2 CPUs left, and 10 new pods come in, each asking for 1 CPU. In an ideal world, we should only provision a tiny instance with 2 CPUs that can host 2 pods, and the other 8 remain stuck in pending. However, what this implementation will do is provision whatever instance type Karpenter would choose anyway based on our bin packing logic and fit all 10 pods. The next batch of pending pods, however, will be entirely stuck since we'd detect we've crossed our limits.
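A minimal sketch of the soft-limit behaviour in Scenario 1, not the code in this PR. The `Resources` type, the `Exceeded` and `Provision` helpers, the whole-core units, and the 16 CPU node size are all assumptions made up for illustration.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity. Units are an assumption
// for this sketch: CPU in whole cores.
type Resources map[string]int64

// Exceeded reports whether any limited resource has reached or crossed its
// limit. A resource with no entry in limits is treated as unlimited.
func Exceeded(limits, provisioned Resources) bool {
	for name, limit := range limits {
		if provisioned[name] >= limit {
			return true
		}
	}
	return false
}

// Provision adds a node of the given size only if no limit has been hit yet.
// The node size is never compared with the remaining headroom, which is
// what makes the limit soft. provisioned must be a non-nil map.
func Provision(limits, provisioned, node Resources) bool {
	if Exceeded(limits, provisioned) {
		return false
	}
	for name, qty := range node {
		provisioned[name] += qty
	}
	return true
}

func main() {
	limits := Resources{"cpu": 100}
	provisioned := Resources{"cpu": 98} // 2 CPUs of headroom left

	// Ten 1-CPU pods get packed onto one hypothetical 16 CPU node.
	fmt.Println(Provision(limits, provisioned, Resources{"cpu": 16}), provisioned["cpu"])
	// The next batch is blocked because the limit has now been crossed.
	fmt.Println(Provision(limits, provisioned, Resources{"cpu": 4}), provisioned["cpu"])
}
```

In this sketch the first call succeeds and overshoots the limit (98 + 16 = 114), and the second call is refused, which mirrors the "next batch is stuck" outcome described above.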

Scenario 2 -
If you set the CPU or Memory to 0 in the provisioner, Karpenter won't be able to launch any worker nodes at all. This is by design and can help when you're looking to block instance creation entirely. Will be more useful once I add GPUs in.
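One way to picture the difference between a limit of 0 and no limit at all is to model each limit as an optional value. This is only a sketch: the `Limits` struct, its pointer fields, and the units are hypothetical and are not the types added in this PR.

```go
package main

import "fmt"

// Limits holds optional per-resource ceilings. A nil pointer means the
// resource is not limited; a pointer to 0 blocks all provisioning.
type Limits struct {
	CPU    *int64 // millicores (assumed unit)
	Memory *int64 // bytes (assumed unit)
}

// reached reports whether a set limit has been hit or crossed.
func reached(limit *int64, used int64) bool {
	return limit != nil && used >= *limit
}

// Blocked reports whether provisioning should stop for the given usage.
func (l Limits) Blocked(usedCPU, usedMemory int64) bool {
	return reached(l.CPU, usedCPU) || reached(l.Memory, usedMemory)
}

func main() {
	zero := int64(0)
	fmt.Println(Limits{}.Blocked(0, 0))           // no limits set: false
	fmt.Println(Limits{CPU: &zero}.Blocked(0, 0)) // CPU limit of 0: true
}
```

Because the comparison is "used >= limit", a limit of 0 is already reached with nothing provisioned, while an unset limit never blocks.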

What is to come next?

Adding GPU support. That should come quick - I just need to play around with it a little more.
Adding documentation on how this limiting works.
Adding metrics for the resource consumption in the cluster per provisioner.

Testing done so far

It's backwards compatible so if you haven't used limits before, your provisioning will continue to remain unlimited.
Specifying just CPU, just memory, and a combination of both.
Did small scale-ups to verify that we allow provisioning until we detect that we've hit/crossed our limits, and that it then blocks all provisioning.

3. Does this change impact docs?

[] Yes, PR includes docs updates
Yes, issue opened: link to issue
No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify · 2021-11-19T17:12:32Z

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 2591f0a

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/619ebf3aefc2cb0008c10508

ellistarn

Nice! I really like this approach. Some quick comments.

ellistarn · 2021-11-23T06:43:46Z

pkg/apis/provisioning/v1alpha5/provisioner_status.go

@@ -29,6 +30,9 @@ type ProvisionerStatus struct {
 	// its target, and indicates whether or not those conditions are met.
 	// +optional
 	Conditions apis.Conditions `json:"conditions,omitempty"`
+
+	// Resources is the list of resources that have been provisioned.
+	Resources v1.ResourceList `json:"resources,omitempty"`


It would be nice if we had a mechanism to express the amount of resources reserved by pods as well as the total amount of capacity in the current nodes. Not that you need to implement them both, but we should make sure it makes sense on the API.

Provisioner.Status.Capacity
Provisioner.Status.Reserved

Maybe we want to be cautious of the Reserved bit, since it will mean updating on every pod event.

Thoughts?

Nice call. I think there's definite value in having that in ProvisionerSpec. Might even help show how much better the instance utilization is with Karpenter. I think in the API though, it could maybe be Provisioner.Status.Requests?

Agreed though, that it might mean we update the provisioner on each pod event so I'm a little concerned from a scaling front.

Controller-runtime will batch the reconciles and has a serial execution guarantee. I think this will roughly translate into a linear stream of writes to the API Server as it detects new resources. AFAICT, it shouldn't be enough to overwhelm anything.
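To make the Capacity versus Reserved idea from this thread concrete, here is a rough sketch of how the two totals could be derived. Everything in it is hypothetical: the `Status`, `Node`, and `Pod` types, the `Summarize` helper, and the millicore units are stand-ins, and the real API would presumably use Kubernetes resource lists instead.

```go
package main

import "fmt"

// ResourceList maps a resource name to a quantity (CPU in millicores here).
type ResourceList map[string]int64

// Node and Pod are simplified stand-ins for the Kubernetes objects.
type Node struct{ Capacity ResourceList }
type Pod struct{ Requests ResourceList }

// Status carries the two fields floated in the review: total node capacity
// and the amount reserved by pod requests.
type Status struct {
	Capacity ResourceList
	Reserved ResourceList
}

// add accumulates src into dst.
func add(dst, src ResourceList) {
	for name, qty := range src {
		dst[name] += qty
	}
}

// Summarize totals node capacity and pod requests for one provisioner.
func Summarize(nodes []Node, pods []Pod) Status {
	s := Status{Capacity: ResourceList{}, Reserved: ResourceList{}}
	for _, n := range nodes {
		add(s.Capacity, n.Capacity)
	}
	for _, p := range pods {
		add(s.Reserved, p.Requests)
	}
	return s
}

func main() {
	nodes := []Node{{ResourceList{"cpu": 4000}}, {ResourceList{"cpu": 4000}}}
	pods := []Pod{{ResourceList{"cpu": 1500}}, {ResourceList{"cpu": 2500}}}

	s := Summarize(nodes, pods)
	fmt.Printf("cpu reserved %d of %d millicores\n", s.Reserved["cpu"], s.Capacity["cpu"])
}
```

Note that the Reserved total changes whenever a pod is added or removed, which is the update-frequency concern raised above; the Capacity total only changes on node events.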

suket22 · 2021-11-23T19:49:32Z

pkg/controllers/provisioning/launcher.go

+	provisioner, err := l.updateState(ctx, provisioner)
+	if err != nil {
+		return fmt.Errorf("unable to determine status of provisioner")
+	}


I'm not sure why I didn't need this in my earlier revisions. Maybe I was doing something else that was forcing the provisioner to get updated, or I wasn't testing my scale-ups fast enough, but without this the state in this provisioner object is really off. Since it's hitting a cache, I don't mind moving this into the for loop below either, so we reduce the chance of staleness.

ellistarn

Nice! We just need a unit test now.

Limit provisioner by cpu and memory

632e737

suket22 commented Nov 19, 2021

pkg/controllers/provisioning/resourcecounter.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/apis/provisioning/v1alpha5/limits.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/apis/provisioning/v1alpha5/provisioner.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/apis/provisioning/v1alpha5/resources.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/controllers/provisioning/resourcecounter.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/controllers/provisioning/scheduling/suite_test.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/controllers/provisioning/provisioner.go (outdated, resolved)

ellistarn reviewed Nov 19, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

suket22 and others added 3 commits November 19, 2021 12:41

Update pkg/apis/provisioning/v1alpha5/provisioner.go

db8dbb1

Co-authored-by: Ellis Tarn <[email protected]>

Update pkg/controllers/provisioning/provisioner.go

5a7759a

Co-authored-by: Ellis Tarn <[email protected]>

Update pkg/controllers/provisioning/resourcecounter.go

5c5b3cf

Co-authored-by: Ellis Tarn <[email protected]>

JacobGabrielson reviewed Nov 19, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

suket22 and others added 7 commits November 22, 2021 14:52

Separate controller for counting resources

e00778a

Separate controller for counting resources

4a6cd5e

Etarn fixes

7633fc9

Merge pull request #1 from ellistarn/limitsImpl

9310ccd

Etarn fixes

Some more fixes - don't default the status

217942f

Remove extra logging statement

2f291aa

Fix defaults, fix binding errors

9d67667

suket22 changed the title ~~[WIP] Adds provisioner limits based on cpu and mem~~ Adds provisioner limits based on cpu and mem Nov 23, 2021

ellistarn reviewed Nov 23, 2021

cmd/controller/main.go (outdated, resolved)

ellistarn reviewed Nov 23, 2021

suket22 added 2 commits November 23, 2021 11:37

More minor fixes

3285786

Remove extra patch from rbac

4531d68

suket22 mentioned this pull request Nov 23, 2021

Emit Resource Limits in Metrics Controller #840

Closed

suket22 commented Nov 23, 2021

ellistarn reviewed Nov 23, 2021

pkg/apis/provisioning/v1alpha5/limits.go (outdated, resolved)

Don't reassign the provisioner in the launcher

555d18d

suket22 commented Nov 24, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/apis/provisioning/v1alpha5/provisioner_validation.go (outdated, resolved)

pkg/controllers/provisioning/launcher.go (outdated, resolved)

pkg/controllers/provisioning/suite_test.go (resolved)

Addressing more comments on the PR

f866907

ellistarn reviewed Nov 24, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/apis/provisioning/v1alpha5/provisioner.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/controllers/counter/controller.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/controllers/counter/controller.go (resolved)

ellistarn reviewed Nov 24, 2021

pkg/controllers/counter/controller.go (outdated, resolved)

suket22 and others added 6 commits November 24, 2021 13:51

Adds basic unit test

74d6bd9

Update pkg/controllers/counter/controller.go

afa1796

Co-authored-by: Ellis Tarn <[email protected]>

Update pkg/apis/provisioning/v1alpha5/provisioner.go

9974847

Co-authored-by: Ellis Tarn <[email protected]>

Update pkg/controllers/counter/controller.go

dfbc043

Co-authored-by: Ellis Tarn <[email protected]>

More refactoring

3036eea

Merge branch 'main' into limitsImpl

ca36e42

ellistarn reviewed Nov 24, 2021

pkg/controllers/provisioning/suite_test.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/controllers/provisioning/launcher.go (outdated, resolved)

ellistarn reviewed Nov 24, 2021

pkg/controllers/provisioning/controller.go (outdated, resolved)

ellistarn previously approved these changes Nov 24, 2021

Update pkg/controllers/provisioning/suite_test.go

26c1831

Co-authored-by: Ellis Tarn <[email protected]>

suket22 dismissed ellistarn’s stale review via 26c1831 November 24, 2021 22:38

Apply suggestions from code review

2591f0a

Co-authored-by: Ellis Tarn <[email protected]>

ellistarn approved these changes Nov 24, 2021

suket22 closed this Nov 24, 2021

suket22 reopened this Nov 24, 2021

suket22 merged commit 4bef8e9 into aws:main Nov 24, 2021

njtran mentioned this pull request Nov 24, 2021

added default provisioner limits into docs #851

Merged

3 tasks

suket22 deleted the limitsImpl branch January 5, 2022 20:38
