
Will not disrupt bad node: Cannot disrupt NodeClaim: state node isn't initialized #6803

Open · diranged opened this issue Aug 19, 2024 · 9 comments
Labels
bug Something isn't working triage/needs-investigation Issues that need to be investigated before triaging


@diranged

Description

Observed Behavior:

A few days ago we rolled out the 0.37.0 -> 1.0.0 upgrade, and it happened to roll out right as some nodes were booting up. This led to two clusters (out of ~15) having a NodeClaim in an Unready state. Here is an example of one currently in this state:

% k get nodeclaim kube-system-hsqcm -o wide
NAME                TYPE          CAPACITY    ZONE         NODE                                          READY   AGE     ID                                      NODEPOOL      NODECLASS
kube-system-hsqcm   c6gd.xlarge   on-demand   eu-west-1c   ip-100-64-188-29.eu-west-1.compute.internal   False   2d15h   aws:///eu-west-1c/i-...   kube-system   default-nvme

Looking at the status more, we can see that it reports the nodeclaim as being not "initialized":

% k get nodeclaim kube-system-hsqcm -o jsonpath='{.status.conditions[2]}'     
{"lastTransitionTime":"2024-08-16T23:53:43Z","message":"Node status is NotReady","reason":"NodeNotReady","status":"False","type":"Initialized"}

When we check the node itself, we see that it is missing the karpenter.sh/initialized label - though it does have the karpenter.sh/registered: "true" label:

% k get node ip-100-64-188-29.eu-west-1.compute.internal -o yaml | yq .metadata.labels | grep 'karpenter.sh'
karpenter.sh/capacity-type: on-demand
karpenter.sh/nodepool: ...
karpenter.sh/registered: "true"

I dug through the Karpenter logs as best I could, and found zero references to the NodeClaim name, the Node name, or the node's instance ID.

Expected Behavior:

When it comes to disrupting the NodeClaim, I don't understand why the initialized bit has to be set to true. If the NodeClaim is in an "unready" state for some amount of time, we want Karpenter to reap the node and make sure it goes away, not leave it around. This, for example, is bad:

 % k describe nodeclaim kube-system-hsqcm
Name:         kube-system-hsqcm
...
Events:
  Type    Reason             Age                     From       Message
  ----    ------             ----                    ----       -------
  Normal  DisruptionBlocked  22s (x1835 over 2d16h)  karpenter  Cannot disrupt NodeClaim: state node isn't initialized

I will follow up here with a little digging into our logs to show the errors we do see from the controller when this happened.

@diranged diranged added bug Something isn't working needs-triage Issues that need to be triaged labels Aug 19, 2024
@diranged
Author

Logs

While we don't seem to have logs that specifically mention these nodeclaims/nodes, I can see exactly when the event happened. First, look at the creationTimestamp for the node:

% k get node ip-100-64-188-29.eu-west-1.compute.internal -o jsonpath='{.metadata.creationTimestamp}'
2024-08-16T23:53:43Z

Given that these logs are in CloudWatch (because we run Karpenter on Fargate), we used Datadog to do a bit of digging through the logs. I can really only give screenshots here, so forgive me.
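
(If you want to run the same kind of search against CloudWatch directly instead of Datadog, something along these lines should work; the log group name and the millisecond timestamps are placeholders for your own setup:)

% aws logs filter-log-events \
    --log-group-name "$KARPENTER_LOG_GROUP" \
    --filter-pattern '"kube-system-hsqcm"' \
    --start-time "$START_MS" \
    --end-time "$END_MS"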

As the upgrade begins... we get some webhook errors.

As part of the upgrade, we delete the existing Validating/Mutating webhook configurations. We can see the existing Karpenter pods suddenly start throwing errors about that:

[screenshot]

During this, two nodeclaims are initialized

Here we can see two nodeclaims happen to get created right as this is going on. Oddly, though, neither of these is the offending nodeclaim that is broken.

[screenshot]

1.0.0 starts coming up

Now the 1.0.0 pods start coming up, but of course the 0.37.1 pods still own the lease:

[screenshot]

The nodeclaim does show up

After digging further, I did find the offending nodeclaim, and we can see it was created right in the middle of the cutover. You can see that it gets "registered".

[screenshot]

The old karpenter pod goes down right after the registration happens

Here we can see that right after the registration event, the old karpenter pod is shut down and stops doing work.

[screenshot]

@diranged
Author

So it seems to me that there are two issues here:

  1. Karpenter doesn't handle the handoff from one pod to another cleanly here. It seems that some state between the registration and initialization steps can be lost and is never recovered when the new Karpenter pod comes up.
  2. An old, unready, and uninitialized NodeClaim should be eligible for disruption so that this situation is automatically repaired (a rough query for spotting these is sketched below).
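
As a stopgap for spotting the second case, a query like this (just a sketch, assuming jq is available and that the NodeClaim reports the Initialized condition shown above) lists the stuck NodeClaims:

% k get nodeclaims -o json | jq -r '
    .items[]
    | select(any(.status.conditions[]?; .type == "Initialized" and .status == "False"))
    | .metadata.name'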

@diranged
Author

(I can leave this bad node around for a little while, in case there is any further troubleshooting or testing we want to do on it.)

@jmdeal
Contributor

jmdeal commented Aug 20, 2024

Looking at the status more, we can see that it reports the nodeclaim as being not "initialized"

Taking a look at the status condition, we see that the NodeClaim is not ready because of the status condition of the node, not any internal state in Karpenter. Are you able to confirm that the matching Node for the NodeClaim is not ready?

$ kg node $NODE_NAME -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

When it comes to disrupting the NodeClaim - I don't understand why the initialized bit has to be set to true

Karpenter can't disrupt uninitialized nodes because one of the things initialization tracks is extended resource registration. Let's say you have an instance that supports an extended resource, but that resource takes 5 minutes to register after the node joins the cluster. If the node was a valid disruption candidate, Karpenter could consolidate it as an empty node before the resources register and before the pending workload pods have a chance to schedule. This would result in an instance launch loop where Karpenter launches a node, the node registers, and Karpenter consolidates it before the resources have a chance to register.
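
One way to sanity check this on a stuck node is to compare what the NodeClaim expects against what the node has actually registered; if an extended resource the NodeClaim reports in its capacity hasn't shown up in the node's allocatable yet, initialization won't complete. Roughly (assuming the NodeClaim status exposes capacity, as it does in recent versions):

$ kg nodeclaim $NODECLAIM_NAME -o jsonpath='{.status.capacity}'
$ kg node $NODE_NAME -o jsonpath='{.status.allocatable}'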

There's definitely an argument to be made for initialization having a timeout, like registration. Typically, though, initialization failures have to do with misconfiguration of things like device drivers, where relaunching the node won't fix the issue.

@jmdeal jmdeal self-assigned this Aug 20, 2024
@diranged
Author

Thanks for the response, and sorry for my delayed reply! I understand the situation; in fact, we had an instance with a bunch of Unknown-state NodeClaims due to an invalid startup taint configuration, where we ended up with nodes just sitting around that weren't in use. I think that as operators we should have control over a timeout so that any node that is not initialized/registered within a set time can be reaped.
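
In the meantime, the workaround we're leaning towards is a periodic cleanup job along these lines (just a sketch; the one-hour threshold is arbitrary, and it assumes the Initialized condition's lastTransitionTime is a reasonable proxy for how long the NodeClaim has been stuck):

% k get nodeclaims -o json | jq -r '
    ((now - 3600) | floor | todate) as $cutoff
    | .items[]
    | select(any(.status.conditions[]?; .type == "Initialized" and .status == "False" and .lastTransitionTime < $cutoff))
    | .metadata.name' \
  | while read -r name; do k delete nodeclaim "$name"; done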

@awoimbee

awoimbee commented Oct 21, 2024

I have the same issue ("Cannot disrupt Node: state node isn't initialized"); it led to errors where Karpenter wanted to schedule pods onto the broken nodes but the cluster scheduler would refuse to, so the pods were never scheduled.
The message is printed here. We can see that StateNode.Initialized() depends only on in.Node != nil && in.Node.Labels[v1.NodeInitializedLabelKey] == "true" here. And indeed my broken nodes have karpenter.sh/registered=true but not karpenter.sh/initialized=true.
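
A quick way to list nodes in this state (registered but never initialized), if you're only matching on those two labels, is a label selector:

kubectl get nodes -l 'karpenter.sh/registered=true,karpenter.sh/initialized!=true'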

This issue is unrelated to the status condition, as my broken nodes were marked as "Ready".
I have no idea (yet?) why the kube scheduler did not want to schedule pods on these nodes.

@lyanhuigit

We have seen a similar issue in our environment. For various reasons, some NodeClaims do not have karpenter.sh/initialized=true, which prevents consolidating/shutting down the node, so a node can be stuck in an uninitialized status for a long period of time. If it's a race condition, can we get the node consolidated if it has not reached Initialized status for a period of time (like 1 hour)?

@makzzz1986

We have noticed that some nodes are kept "uninitialized" for 10 minutes without any explanation in the logs, even when they become Ready in the cluster.

@jmdeal jmdeal added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Jan 27, 2025
@seanmorton

seanmorton commented Jan 30, 2025

We also commonly see this issue in our cluster, with the same situation @awoimbee is describing: the NodeClaims remain Unknown for 10+ minutes until we manually delete them. This typically happens when we scale up nodes quickly.

Some reproduction steps: I just spun up 50 nodes instantaneously on EKS and 2 of them hit this condition. Happy to share more details too.
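
Roughly what I did, sketched with a throwaway pause deployment sized so each replica forces its own node (the image and CPU request are just illustrative):

$ kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7
$ kubectl set resources deployment inflate --requests=cpu=3
$ kubectl scale deployment inflate --replicas=50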
