Fix VMWare VM freezing bug by reverting #51066 #67825
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/ok-to-test
/assign @divyenpatel @BaluDontu
/cc @dougm PTAL
/sig storage
@nikopen Can you verify the workflow and behavior when the Node on which the Pod is present is powered off?
@divyenpatel @BaluDontu and I discussed this issue. vSphere allows multi-attach so that, on node power-off, the pods on the failed node can quickly be scheduled on another worker node, reducing pod downtime. Since the node is powered off, attaching the disk to the new worker node succeeds instantly and the pods can come up sooner. But in the non-powered-off scenario, this appears to be causing an issue. Note that vSphere volumes do not support ReadWriteMany. We have the following questions:
It's OK to revert this change, but it would be better to have clarity on the above questions.
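The ReadWriteMany point above is the crux: whether a volume may be attached to more than one node at a time should follow from its access modes rather than be assumed per cloud provider. Below is a minimal, hypothetical Go sketch of that decision; the type and function names are illustrative, not the actual Kubernetes attach/detach controller code.

```go
package main

import "fmt"

// AccessMode mirrors the Kubernetes PersistentVolume access-mode concept
// (illustrative re-declaration, not the real k8s.io/api types).
type AccessMode string

const (
	ReadWriteOnce AccessMode = "ReadWriteOnce"
	ReadOnlyMany  AccessMode = "ReadOnlyMany"
	ReadWriteMany AccessMode = "ReadWriteMany"
)

// allowsMultiAttach reports whether a volume with the given access modes
// may safely be attached to more than one node at the same time.
// A vSphere VMDK-backed volume only supports ReadWriteOnce, so the
// controller should wait for a detach instead of attaching the disk
// to a second VM.
func allowsMultiAttach(modes []AccessMode) bool {
	for _, m := range modes {
		if m == ReadWriteMany || m == ReadOnlyMany {
			return true
		}
	}
	return false
}

func main() {
	vsphereVolumeModes := []AccessMode{ReadWriteOnce}
	fmt.Println("multi-attach allowed:", allowsMultiAttach(vsphereVolumeModes)) // false
}
```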
@SandeepPissay I noticed another variant of this bug today. The problem is that the vSphere cloud provider is not detaching volumes from old nodes at all. Say I have a pod with a volume on node X and node X is powered off; in the vSphere cloud provider the node API object then gets removed from etcd. This actually causes
But because multiattach is enabled for vSphere, the disk in question gets attached to Y. But when I tried to restart node X, it refused to boot with:
So it appears that using multiattach to get around some of the detach issues is problematic and needs more testing. To answer your questions:
a. When a worker node is powered off, a pod on it will eventually be evicted and started on a healthy node. Only when the pod gets evicted should detach be performed. If the underlying volume type does not support multi-attach, then attaching the volume to another node should not be allowed. GCE, AWS, etc. don't allow a volume to be attached to a different node.
b. Currently all cloud providers are moving towards a model where the node API object is not deleted when a node is shut down. There is a lengthy discussion of this here - #45986 - so vSphere will be the only cloud provider left that deletes the node object when a node is shut down. The upshot of that decision is that on AWS/GCE, when a node is shut down and pods get evicted from the stopped node, volumes get "force detached" from the powered-off node 6 minutes after that event. There is an ongoing feature to try to taint the node with a "Shutdown" taint, so that the attach/detach controller can know the node has been shut down and detach volumes sooner. Another side effect of the decision is that pods running on a shut-down node don't necessarily get deleted on AWS/GCE/OpenStack; they remain in the "Unknown" state. One win I see from this approach for vSphere is the bug I mentioned above: because vSphere deletes the node API object when a node is turned off, the node object is removed from the internal node map cache. This means detach is not possible from that node. Keeping the node object around would fix that (though there may be other ways to work around this).
@divyenpatel @BaluDontu @SandeepPissay as @gnufied said, #51066 was a custom vSphere patch that introduced different behaviour from all other cloud providers, but it caused a much bigger problem: freezing a whole VM for a minute or more, i.e. freezing all pods running on that VM. This PR reverts that change and returns to the behaviour all other cloud providers have. The solution for the 6-minute shutdown wait time will be implemented with #45986. Could you please merge this and also approve cherry-picks for all previous major versions affected? That would be 1.8, 1.9, 1.10, and 1.11. Thanks
/lgtm
With this change, the Pod is never able to come up on another Node. Disks remain attached to the powered-off Node, and pod creation on the new Node is failing.
Log
@divyenpatel is that because of the #67900 issue I filed? Anyhow, using multiattach to mask a detach problem seems incorrect.
/retest
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-kops-aws
/retest
Automatic merge from submit-queue (batch tested with PRs 67745, 67432, 67569, 67825, 67943). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.
@nikopen Can you please confirm whether your 1.9 clusters are affected by #67900? As per your comment #67825 (comment), detach from shutdown nodes in 1.9 works, but my testing says otherwise; I think @divyenpatel has also confirmed this. In 1.9, when a node is shut down it gets removed from the api-server and from the nodeinfo map, and hence
@gnufied The repro comment stands true, tested on both non-patched and patched 1.9.10 clusters. I think the shutdown was issued via ssh && sudo shutdown; does this remove it from the vSphere inventory? It probably shouldn't, but I didn't verify this last detail. I can attempt it again tomorrow and let you know.
@gnufied @divyenpatel Just tried a VM hard shutdown on a patched cluster, again couldn't repro #67900. Volume got re-attached in 6 minutes as expected.
What this PR does / why we need it: kube-controller-manager, vSphere specific: when the controller tries to attach a volume to Node A that is already attached to Node B, Node A freezes until the volume is attached. Kubernetes keeps trying to attach the volume because it thinks the volume is 'multi-attachable' when it's not. #51066 is the culprit.
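In other words, the behaviour being restored is a guard in the attach path: if a single-attach volume is still recorded as attached to another node, the controller should not issue a new attach call at all, since the blocking vSphere attach is what freezes Node A. Below is a simplified, hypothetical Go sketch of such a guard, with assumed names and types rather than the real reconciler code.

```go
package main

import (
	"errors"
	"fmt"
)

// attachedNodes maps a volume name to the nodes it is currently attached to
// (a stand-in for the controller's actual-state-of-world cache).
type attachedNodes map[string][]string

var errMultiAttach = errors.New("volume is already exclusively attached to another node")

// reconcileAttach only records and triggers an attach when it is safe to do so.
// multiAttachable would come from the volume's access modes (see the earlier
// sketch); vSphere VMDK volumes are not multi-attachable.
func reconcileAttach(state attachedNodes, volume, targetNode string, multiAttachable bool) error {
	for _, node := range state[volume] {
		if node != targetNode && !multiAttachable {
			// Do not call the cloud provider: wait for a detach instead of
			// issuing an attach that blocks and freezes the target VM.
			return errMultiAttach
		}
	}
	state[volume] = append(state[volume], targetNode)
	return nil
}

func main() {
	state := attachedNodes{"pvc-123": {"node-b"}}
	if err := reconcileAttach(state, "pvc-123", "node-a", false); err != nil {
		fmt.Println("skipping attach:", err)
	}
}
```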
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes vmware-archive#500 / vmware-archive#502 (same issue)
Special notes for your reviewer:
Vsphere installation, any k8s version from 1.8 and above, pod with attached PV/PVC/VMDK:
1. `kubectl delete po/[pod] --force --grace-period=0`
2. `kubectl describe [pod]` and attempt to Ping it or SSH into it.
3. `kubectl get node` shows it as 'NotReady'. The new node is frozen until the volume is attached - usually a 1-minute freeze for 1 volume in a low-load cluster, and many minutes more with higher loads and more volumes involved.

Tested a custom patched 1.9.10 kube-controller-manager with #51066 reverted and the above bug is resolved - can't repro it anymore. The new node doesn't freeze at all, and attaching happens quite quickly, in a few seconds.
Release note: