Will not disrupt bad node: Cannot disrupt NodeClaim: state node isn't initialized #6803
Comments
Logs

While we don't seem to have logs that specifically talk about these nodeclaims/nodes - I can see exactly when the event happened. First, look at the node's creation timestamp:

% k get node ip-100-64-188-29.eu-west-1.compute.internal -o jsonpath='{.metadata.creationTimestamp}'
2024-08-16T23:53:43Z

Given that these logs are in CloudWatch (because we run Karpenter on Fargate), we used Datadog to do a bit of digging on the logs. I can really only give screenshots here... so forgive me.

As the upgrade begins, we get some webhook errors. As part of the upgrade, we delete the existing Validating/Mutating webhook configurations. We can see the existing Karpenter pods suddenly start throwing errors about that.

During this, two nodeclaims are initialized. Here we can see two nodeclaims happen to get created right as this is going on... oddly though, neither of these is the offending nodeclaim that is broken.

1.0.0 starts coming up.

The nodeclaim does show up. After digging further, I did find the offending nodeclaim and we can see it's created right in the middle of the cutover. You can see that it gets "registered".

The old karpenter pod goes down right after the registration happens. Here we can see that right after the registration event, the old karpenter pod is shut down and stops doing work.
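To line the offending nodeclaim up against that cutover window, the same kind of check works on the NodeClaim itself (a sketch; $NODECLAIM_NAME is a placeholder, and this assumes the NodeClaim reports a Registered status condition):

% k get nodeclaim $NODECLAIM_NAME -o jsonpath='{.metadata.creationTimestamp}'
% k get nodeclaim $NODECLAIM_NAME -o jsonpath='{.status.conditions[?(@.type=="Registered")].lastTransitionTime}'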
So it seems to me that there are 2 issues here...
(I can leave this bad node around for a little while... if there is any further troubleshooting or testing we want to do on it?)
Taking a look at the status condition, we see that the NodeClaim is not ready because of the status condition of the node, not any internal state in Karpenter. Are you able to confirm that the matching Node for the NodeClaim is not ready?

$ kg node $NODE_NAME -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
Karpenter can't disrupt uninitialized nodes because one of the things initialization tracks is extended resource registration. Let's say you have an instance that supports an extended resource, but that resource takes 5 minutes to register after the node joins the cluster. If the node was a valid disruption candidate, Karpenter could consolidate it as an empty node before any resources register and the pending workload pods could schedule. This would result in an instance launch loop where Karpenter launches a node, the node registers, and Karpenter consolidates it before the resources have a chance to register. There's definitely an argument to be made for initialization having a timeout, like registration. Typically, though, initialization failures have to do with misconfiguration of things like device drivers, where relaunching the node won't fix the issue.
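If it helps to verify that on a stuck node, a rough check is to compare the node's allocatable resources against what the instance type should expose, and to read the NodeClaim's Initialized condition directly (a sketch; $NODE_NAME and $NODECLAIM_NAME are placeholders):

$ kubectl get node $NODE_NAME -o jsonpath='{.status.allocatable}'
$ kubectl get nodeclaim $NODECLAIM_NAME -o jsonpath='{.status.conditions[?(@.type=="Initialized")]}'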
Thanks for the response - and sorry for my delayed reply! I understand the situation - and in fact, we had an instance with a bunch of
I have the same issue ("Cannot disrupt NodeClaim: state node isn't initialized"). This issue is unrelated to the status condition, as my broken nodes were marked as "Ready".
We have seen a similar issue in our environment. For various reasons, some of the nodeclaims do not have karpenter.sh/initialized=true, which prevents consolidating/shutting down the node, so a node can be stuck in an uninitialized status for a long period of time. If it's a race condition, can we get the node consolidated if it has not been in the Initialized status for a period of time (like 1 hour)?
We have noticed that some nodes are kept "uninitialized" for 10 minutes without any explanation in the logs, even when they become ready in the cluster.
We also commonly see this issue in our cluster, with the same situation as @awoimbee is describing, and the nodeclaims remain uninitialized.

Some reproduction steps: I just spun up 50 nodes instantaneously on EKS and 2 of them hit this condition. Happy to share more details too.
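For spotting affected capacity, a rough way to list nodes and nodeclaims stuck in this state (a sketch, based on the registered/initialized labels and conditions described in this issue):

$ kubectl get nodes -l 'karpenter.sh/registered=true,!karpenter.sh/initialized'
$ kubectl get nodeclaims -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Initialized")].status}{"\n"}{end}'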
Description
Observed Behavior:
A few days ago we rolled out the 0.37.0 -> 1.0.0 upgrade, and it happened to roll out right as some nodes were booting up. This led to having two clusters (out of ~15) that had a NodeClaim in an Unready state. Here is an example of one currently in this state. Looking at the status more, we can see that it reports the nodeclaim as being not "initialized".

When we check the node itself, we see that it is missing the karpenter.sh/initialized label - though it does have the karpenter.sh/registered: "true" label. I dug through the Karpenter logs as best we could - and we find zero references to either the NodeClaim name, the Node name, or the node's Instance ID.
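For reference, that label state can be confirmed directly from the node object (a sketch; $NODE_NAME is a placeholder for the affected node):

% k get node $NODE_NAME -o jsonpath='{.metadata.labels.karpenter\.sh/registered}'
% k get node $NODE_NAME -o jsonpath='{.metadata.labels.karpenter\.sh/initialized}'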
Expected Behavior:

When it comes to disrupting the NodeClaim - I don't understand why the initialized bit has to be set to true. If the NodeClaim is in an "unready" state for some amount of time, we want Karpenter to reap the node and make sure it goes away... not leave it around. This, for example, is bad:

% k describe nodeclaim kube-system-hsqcm
Name:         kube-system-hsqcm
...
Events:
  Type    Reason             Age                     From       Message
  ----    ------             ----                    ----       -------
  Normal  DisruptionBlocked  22s (x1835 over 2d16h)  karpenter  Cannot disrupt NodeClaim: state node isn't initialized
I will follow up here with a little digging into our logs to show the errors we do see from the controller when this happened.