
Will not disrupt bad node: Cannot disrupt NodeClaim: state node isn't initialized #6803

Open · diranged opened this issue Aug 19, 2024 · 9 comments
Labels
bug Something isn't working triage/needs-investigation Issues that need to be investigated before triaging


@diranged

Description

Observed Behavior:

A few days ago we rolled out the 0.37.0 -> 1.0.0 upgrade, and it happened to roll out right as some nodes were booting up. This led to two clusters (out of ~15) having a NodeClaim in an Unready state. Here is an example of one currently in this state:

% k get nodeclaim kube-system-hsqcm -o wide
NAME                TYPE          CAPACITY    ZONE         NODE                                          READY   AGE     ID                                      NODEPOOL      NODECLASS
kube-system-hsqcm   c6gd.xlarge   on-demand   eu-west-1c   ip-100-64-188-29.eu-west-1.compute.internal   False   2d15h   aws:///eu-west-1c/i-...   kube-system   default-nvme

Looking at the status more, we can see that it reports the nodeclaim as being not "initialized":

% k get nodeclaim kube-system-hsqcm -o jsonpath='{.status.conditions[2]}'     
{"lastTransitionTime":"2024-08-16T23:53:43Z","message":"Node status is NotReady","reason":"NodeNotReady","status":"False","type":"Initialized"}

When we check the node itself, we see that it is missing the karpenter.sh/initialized label - though it does have the karpenter.sh/registered: "true" label:

% k get node ip-100-64-188-29.eu-west-1.compute.internal -o yaml | yq .metadata.labels | grep 'karpenter.sh'
karpenter.sh/capacity-type: on-demand
karpenter.sh/nodepool: ...
karpenter.sh/registered: "true"

I dug through the Karpenter logs as best I could, and found zero references to the NodeClaim name, the Node name, or the node's instance ID.

Expected Behavior:

When it comes to disrupting the NodeClaim, I don't understand why the initialized bit has to be set to true. If the NodeClaim is in an "unready" state for some amount of time, we want Karpenter to reap the node and make sure it goes away, not leave it around. This, for example, is bad:

 % k describe nodeclaim kube-system-hsqcm
Name:         kube-system-hsqcm
...
Events:
  Type    Reason             Age                     From       Message
  ----    ------             ----                    ----       -------
  Normal  DisruptionBlocked  22s (x1835 over 2d16h)  karpenter  Cannot disrupt NodeClaim: state node isn't initialized

I will follow up here with a little digging into our logs to show the errors we do see from the controller when this happened.

@diranged diranged added bug Something isn't working needs-triage Issues that need to be triaged labels Aug 19, 2024
@diranged
Author

Logs

While we don't seem to have logs that specifically mention these nodeclaims/nodes, I can see exactly when the event happened. First, look at the creationTimestamp for the node:

% k get node ip-100-64-188-29.eu-west-1.compute.internal -o jsonpath='{.metadata.creationTimestamp}'
2024-08-16T23:53:43Z

Given that these logs are in CloudWatch (because we run Karpenter on Fargate), we used Datadog to do a bit of digging through the logs. I can really only give screenshots here, so forgive me.
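
(If you want to run the same kind of search against CloudWatch directly instead of Datadog, something along these lines should work; the log group name and the millisecond timestamps are placeholders for your own setup:)

% aws logs filter-log-events \
    --log-group-name "$KARPENTER_LOG_GROUP" \
    --filter-pattern '"kube-system-hsqcm"' \
    --start-time "$START_MS" \
    --end-time "$END_MS"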

As the upgrade begins... we get some webhook errors.

As part of the upgrade, we delete the existing Validating/Mutating webhook configurations. We can see the existing Karpenter pods suddenly start throwing errors about that:

[screenshot]

During this, two nodeclaims are initialized

Here we can see two nodeclaims happen to get created right as this is going on. Oddly, though, neither of these is the offending nodeclaim that is broken.

[screenshot]

1.0.0 starts coming up

Now the 1.0.0 pods start coming up, but of course the 0.37.1 pods still own the lease:

[screenshot]

The nodeclaim does show up

After digging further, I did find the offending nodeclaim, and we can see it was created right in the middle of the cutover. You can see that it gets "registered".

[screenshot]

The old karpenter pod goes down right after the registration happens

Here we can see that right after the registration event, the old karpenter pod is shut down and stops doing work.

[screenshot]

@diranged
Author

So it seems to me that there are two issues here:

  1. Karpenter doesn't handle the handoff from one pod to another cleanly here. It seems that some state between the registration and initialization steps can be lost and is never recovered when the new Karpenter pod comes up.
  2. An old, unready, and uninitialized NodeClaim should be eligible for disruption so that this situation is automatically repaired (a rough query for spotting these is sketched below).
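
As a stopgap for spotting the second case, a query like this (just a sketch, assuming jq is available and that the NodeClaim reports the Initialized condition shown above) lists the stuck NodeClaims:

% k get nodeclaims -o json | jq -r '
    .items[]
    | select(any(.status.conditions[]?; .type == "Initialized" and .status == "False"))
    | .metadata.name'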

@diranged
Author

(I can leave this bad node around for a little while, in case there is any further troubleshooting or testing we want to do on it.)

@jmdeal
Contributor

jmdeal commented Aug 20, 2024

Looking at the status more, we can see that it reports the nodeclaim as being not "initialized"

Taking a look at the status condition, we see that the NodeClaim is not ready because of the status condition of the node, not any internal state in Karpenter. Are you able to confirm that the matching Node for the NodeClaim is not ready?

$ kg node $NODE_NAME -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

When it comes to disrupting the NodeClaim - I don't understand why the initialized bit has to be set to true

Karpenter can't disrupt uninitialized nodes because one of the things initialization tracks is extended resource registration. Let's say you have an instance that supports an extended resource, but that resource takes 5 minutes to register after the node joins the cluster. If the node was a valid disruption candidate, Karpenter could consolidate it as an empty node before the resources register and before the pending workload pods have a chance to schedule. This would result in an instance launch loop where Karpenter launches a node, the node registers, and Karpenter consolidates it before the resources have a chance to register.
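
One way to sanity check this on a stuck node is to compare what the NodeClaim expects against what the node has actually registered; if an extended resource the NodeClaim reports in its capacity hasn't shown up in the node's allocatable yet, initialization won't complete. Roughly (assuming the NodeClaim status exposes capacity, as it does in recent versions):

$ kg nodeclaim $NODECLAIM_NAME -o jsonpath='{.status.capacity}'
$ kg node $NODE_NAME -o jsonpath='{.status.allocatable}'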

There's definitely an argument to be made for initialization having a timeout, like registration. Typically, though, initialization failures have to do with misconfiguration of things like device drivers, where relaunching the node won't fix the issue.

@jmdeal jmdeal self-assigned this Aug 20, 2024
@diranged
Author

Thanks for the response, and sorry for my delayed reply! I understand the situation; in fact, we had an instance with a bunch of Unknown-state NodeClaims due to an invalid startup taint configuration, where we ended up with nodes just sitting around that weren't in use. I think that as operators we should have control over a timeout so that any node that is not initialized/registered within a set time can be reaped.
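
In the meantime, the workaround we're leaning towards is a periodic cleanup job along these lines (just a sketch; the one-hour threshold is arbitrary, and it assumes the Initialized condition's lastTransitionTime is a reasonable proxy for how long the NodeClaim has been stuck):

% k get nodeclaims -o json | jq -r '
    ((now - 3600) | floor | todate) as $cutoff
    | .items[]
    | select(any(.status.conditions[]?; .type == "Initialized" and .status == "False" and .lastTransitionTime < $cutoff))
    | .metadata.name' \
  | while read -r name; do k delete nodeclaim "$name"; done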

@awoimbee

awoimbee commented Oct 21, 2024

I have the same issue ("Cannot disrupt Node: state node isn't initialized"); it led to errors where Karpenter wanted to schedule pods onto the broken nodes but the cluster scheduler would refuse to, so the pods were never scheduled.
The message is printed here. We can see that StateNode.Initialized() depends only on in.Node != nil && in.Node.Labels[v1.NodeInitializedLabelKey] == "true" here. And indeed my broken nodes have karpenter.sh/registered=true but not karpenter.sh/initialized=true.
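
A quick way to list nodes in this state (registered but never initialized), if you're only matching on those two labels, is a label selector:

kubectl get nodes -l 'karpenter.sh/registered=true,karpenter.sh/initialized!=true'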

This issue is unrelated to the status condition, as my broken nodes were marked as "Ready".
I have no idea (yet?) why the kube scheduler did not want to schedule pods on these nodes.

@lyanhuigit

We have seen a similar issue in our environment. For various reasons, some NodeClaims do not have karpenter.sh/initialized=true, which prevents consolidating/shutting down the node, so a node can be stuck in an uninitialized status for a long period of time. If it's a race condition, can we get the node consolidated if it has not reached Initialized status for a period of time (like 1 hour)?

@makzzz1986

We have noticed that some nodes are kept "uninitialized" for 10 minutes without any explanation in the logs, even when they become Ready in the cluster.

@jmdeal jmdeal added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Jan 27, 2025
@seanmorton

seanmorton commented Jan 30, 2025

We also commonly see this issue in our cluster, with the same situation @awoimbee is describing: the NodeClaims remain Unknown for 10+ minutes until we manually delete them. This typically happens when we scale up nodes quickly.

Some reproduction steps: I just spun up 50 nodes instantaneously on EKS and 2 of them hit this condition. Happy to share more details too.
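
Roughly what I did, sketched with a throwaway pause deployment sized so each replica forces its own node (the image and CPU request are just illustrative):

$ kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7
$ kubectl set resources deployment inflate --requests=cpu=3
$ kubectl scale deployment inflate --replicas=50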
