aws-eks: spot interrupt handler for ASG capacity is incompatible with latest/support EKS versions #33108

Open
1 task
shapirov103 opened this issue Jan 23, 2025 · 3 comments
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. effort/medium Medium work item – several days of effort p1

Comments

@shapirov103

Describe the bug

Cluster provisioning fails when spotInterruptHandler is set to true and ASG capacity is used with the latest Kubernetes versions.

I created a simple cluster with EKS Blueprints for CDK and used the ASG capacity provider.
The CDK code calls cluster.addAutoScalingGroupCapacity with spotInterruptHandler set to true (the default setting).

Getting the following exception:

Received response status [FAILED] from custom resource. Message returned: Error: b'Release "asgtestchartspotinterrupthandler88cd0a56" does not exist. Installing it now.\nError: unable to build kubernetes objects from release manifest: resource mapping not found for name: "asgtestchartspotinterrupthandler88cd0a56-aws-node-termination-h" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"\nensure CRDs are installed first\n'
Logs: /aws/lambda/asg-test-awscdkawseksKubectlProvid-Handler886CB40B-VjmYzuKObYxM
at invokeUserFunction (/var/task/framework.js:2:6)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async onEvent (/var/task/framework.js:1:369)
at async Runtime.handler (/var/task/cfn-response.js:1:1837) (RequestId: d7f295fe-9046-45b5-bbe3-fcc72f9cc84d)

The results are similar when the Kubernetes version is set to 1.29 or 1.30.

I narrowed it down to this code: https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-eks/lib/cluster.ts#L1163-L1176

Why is the Helm chart version hardcoded?

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

Cluster provisioned with ASG capacity and spot interrupt handler installed.

Current Behavior

CFN provisioning failed with the exception described in the body of the issue.

Reproduction Steps

Create a simple cluster with EKS Blueprints for CDK and use the ASG capacity provider.
The CDK code calls cluster.addAutoScalingGroupCapacity with spotInterruptHandler set to true (the default setting).

Possible Solution

Maintain a map of node termination handler chart versions that are supported by the latest Kubernetes/EKS versions, or allow customers to pass the version (less preferred).
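
As a rough illustration of the first option, the lookup could be as small as the sketch below. It is not CDK code: the helper names are hypothetical, and it only uses the two chart versions mentioned in this issue (0.18.0, which aws-eks pins today, and 0.25.1 from the workaround). The single cut-off at Kubernetes 1.25 is an assumption based on PodSecurityPolicy (policy/v1beta1) being removed in that release, which matches the error above; a real map would need to be verified per chart release.

```typescript
// Hypothetical sketch, not part of aws-cdk-lib.
// Assumption: chart 0.18.0 is only usable while policy/v1beta1
// PodSecurityPolicy still exists (Kubernetes < 1.25); from 1.25 on,
// fall back to the chart version suggested in this issue (0.25.1).
const PSP_REMOVED_IN_MINOR = 25;

// Extract the minor version from strings such as "1.30" or "1.29.3".
function parseMinor(kubernetesVersion: string): number {
  const match = /^1\.(\d+)/.exec(kubernetesVersion.trim());
  if (!match) {
    throw new Error(`Unrecognized Kubernetes version: ${kubernetesVersion}`);
  }
  return Number(match[1]);
}

// Pick a node termination handler chart version for a cluster version.
function nodeTerminationHandlerChartVersion(kubernetesVersion: string): string {
  return parseMinor(kubernetesVersion) >= PSP_REMOVED_IN_MINOR ? '0.25.1' : '0.18.0';
}
```

With this in place, the hardcoded version in addSpotInterruptHandler could be replaced by a lookup keyed on the cluster's Kubernetes version, at the cost of maintaining the table with each new EKS release.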

Additional Information/Context

A potential workaround is to disable the spot interrupt handler and install the node termination handler Helm chart with a compatible chart version, e.g.:

version: '0.25.1',
repository: 'oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler',
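
A slightly fuller sketch of that workaround follows. The interface and function names are hypothetical. The chart name, namespace, and node selector mirror the built-in addSpotInterruptHandler call quoted later in this thread, and the 'Ec2Spot' label value is my assumption for what LifecycleLabel.SPOT resolves to. Treat it as a starting point, not a verified configuration.

```typescript
// Hypothetical helper: builds the settings for installing the node
// termination handler from the OCI registry instead of eks-charts.
interface NodeTerminationHandlerChart {
  chart: string;
  version: string;
  repository: string;
  namespace: string;
  values: { nodeSelector: { lifecycle: string } };
}

function nodeTerminationHandlerChart(version: string = '0.25.1'): NodeTerminationHandlerChart {
  return {
    chart: 'aws-node-termination-handler',
    version,
    repository: 'oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler',
    namespace: 'kube-system',
    // Assumption: spot nodes carry the label lifecycle=Ec2Spot,
    // as selected by the built-in handler via LifecycleLabel.SPOT.
    values: { nodeSelector: { lifecycle: 'Ec2Spot' } },
  };
}

// Intended use in a CDK stack (not executed here):
//   cluster.addAutoScalingGroupCapacity('spot', { /* ... */ spotInterruptHandler: false });
//   cluster.addHelmChart('node-termination-handler', nodeTerminationHandlerChart());
```

Keeping the version as a parameter lets each stack pin whichever chart release matches its cluster version.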

CDK CLI Version

2.173.4

Framework Version

No response

Node.js Version

20.10

OS

MacOS

Language

TypeScript

Language Version

No response

Other information

No response

@shapirov103 shapirov103 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 23, 2025
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Jan 23, 2025
@pahud pahud self-assigned this Jan 24, 2025
@pahud
Contributor

pahud commented Jan 24, 2025

Yes, spot-interrupt-handler seems deprecated in eks-charts

https://github.com/aws/eks-charts?tab=readme-ov-file#aws-node-termination-handler

private addSpotInterruptHandler() {
  if (!this._spotInterruptHandler) {
    this._spotInterruptHandler = this.addHelmChart('spot-interrupt-handler', {
      chart: 'aws-node-termination-handler',
      version: '0.18.0',
      repository: 'https://aws.github.io/eks-charts',
      namespace: 'kube-system',
      values: {
        nodeSelector: {
          lifecycle: LifecycleLabel.SPOT,
        },
      },
    });
  }

I suggest we deprecate the spotInterruptHandler prop in aws-eks as well. If users need to opt in to this feature, they should install oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler explicitly in their CDK code and specify their desired version.

@pahud pahud added p1 effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Jan 24, 2025
@pahud pahud removed their assignment Jan 24, 2025
@shapirov103
Author

Deprecation works, as long as it is clear that this flag only applies to EKS versions that are long out of regular support. However, the docs, including the API docs, should highlight that the node termination handler should be installed (or at least that installing it is a good idea).

@pahud
Contributor

pahud commented Jan 29, 2025

@shapirov103 Agreed. I'll bring it up with the team for review.
