Fix VMWare VM freezing bug by reverting #51066 #67825

Merged
merged 1 commit on Aug 29, 2018

Conversation

nikopen
Contributor

@nikopen nikopen commented Aug 24, 2018

What this PR does / why we need it: kube-controller-manager, vSphere-specific: when the controller tries to attach a volume to Node A that is already attached to Node B, Node A freezes until the volume is attached. Kubernetes keeps retrying the attach because it treats the volume as multi-attachable when it is not. #51066 is the culprit.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes vmware-archive#500 / vmware-archive#502 (same issue)

Special notes for your reviewer:

  • Repro:

vSphere installation, any k8s version 1.8 or above, a pod with an attached PV/PVC/VMDK:

  1. cordon the node which the pod is in
  2. kubectl delete po/[pod] --force --grace-period=0
  3. the pod is immediately rescheduled to a new node. Grab the new node from kubectl describe [pod] and attempt to ping it or SSH into it.
  4. pings/SSH fail to reach the new node, and kubectl get node shows it as 'NotReady'. The new node stays frozen until the volume is attached - usually about a 1-minute freeze for a single volume in a low-load cluster, and many minutes more with higher load and more volumes involved.
  • Patch verification:

Tested a custom-patched 1.9.10 kube-controller-manager with #51066 reverted; the above bug is resolved and can no longer be reproduced. The new node doesn't freeze at all, and attaching happens quite quickly, within a few seconds.

Release note:

Fix VSphere VM Freezing bug by reverting #51066 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 24, 2018
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Aug 24, 2018
@gnufied
Member

gnufied commented Aug 24, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 24, 2018
@gnufied
Member

gnufied commented Aug 24, 2018

/assign @divyenpatel @BaluDontu

@frapposelli
Member

/cc @dougm

PTAL

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 24, 2018
@dims
Member

dims commented Aug 24, 2018

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Aug 24, 2018
@divyenpatel
Member

@nikopen Can you verify the workflow and behavior when the node on which the pod is present is powered off?

@SandeepPissay
Contributor

@divyenpatel, @BaluDontu and I discussed this issue. vSphere allows multi-attach to make sure that on node power-off, the pods on the failed node can quickly be scheduled on another worker node, reducing pod downtime. Since the node is powered off, attaching the disk to the new worker node succeeds instantly and the pods can come up sooner. But in the non-powered-off scenario this appears to be causing an issue. Note that vSphere volumes do not support ReadWriteMany.

We have the following questions:

  1. What is the Kubernetes recommendation for pod downtime when a worker node is powered off?
  2. How do other cloud providers like AWS, Azure, and GCE handle pod downtime in the case of node power-off?

It's OK to revert this change, but it's better to have clarity on the above questions.
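
For context on that trade-off, here is a minimal, self-contained sketch of the guard the revert restores. The types and names are illustrative only (this is not the actual attach/detach controller code): a volume whose access mode does not allow multi-attach must not be attached to a second node while it is still attached to the first.

```go
package main

import "fmt"

// AccessMode is a simplified stand-in for the Kubernetes access-mode concept.
type AccessMode string

const (
	ReadWriteOnce AccessMode = "ReadWriteOnce" // single-node access (e.g. a vSphere VMDK)
	ReadWriteMany AccessMode = "ReadWriteMany" // multi-node access
)

// Volume is an illustrative record of a volume and the nodes it is attached to.
type Volume struct {
	Name       string
	AccessMode AccessMode
	AttachedTo []string // node names the volume is currently attached to
}

// canAttach reports whether vol may be attached to node. Volumes that do not
// support multi-attach must be detached from their current node first, which
// is the behaviour this PR restores for vSphere.
func canAttach(vol Volume, node string) (bool, error) {
	if vol.AccessMode == ReadWriteMany {
		return true, nil // multi-attach is allowed for this access mode
	}
	for _, n := range vol.AttachedTo {
		if n != node {
			return false, fmt.Errorf("multi-attach error: volume %q is already attached to node %q", vol.Name, n)
		}
	}
	return true, nil
}

func main() {
	vmdk := Volume{
		Name:       "pvc-example",
		AccessMode: ReadWriteOnce,
		AttachedTo: []string{"node-b"},
	}
	if ok, err := canAttach(vmdk, "node-a"); !ok {
		// Instead of freezing node-a, the controller should wait for the detach.
		fmt.Println("attach refused:", err)
	}
}
```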

@gnufied
Member

gnufied commented Aug 25, 2018

@SandeepPissay I noticed another variant of this bug today. The problem is that the vSphere cloud provider is not detaching volumes from old nodes at all. Say I have a pod with a volume on node X and node X is powered off; in the vSphere cloud provider the node API object then gets removed from etcd. This causes DetachDisk to fail, so a detach from node X is never even attempted.

But because multi-attach is enabled for vSphere, the disk in question gets attached to node Y. When I then tried to restart node X, it refused to boot with:

"Cannot open the disk '/vmfs/volumes/5b0f026b-31c5ecfd-1537-4ccc6ab9973d/kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk' or one of the snapshot disks it depends on.

So, it appears that using multiattach to get around some of the detach issues is problematic and needs more testing.

To answer your questions:

a. When a worker node is powered off, a pod on it will eventually be evicted and started on a healthy node. Only when the pod gets evicted should the detach be performed, and if the underlying volume type does not support multi-attach then the controller should not allow attaching the volume to another node. GCE, AWS, etc. don't allow a volume to be attached to a different node.

b. Currently all cloud providers are moving towards a model where the node API object is not deleted when a node is shut down. There is a lengthy discussion of this in #45986, so vSphere will be the only cloud provider left that deletes the node object when a node is shut down.

The upshot of that decision is that on AWS/GCE, when a node is shut down and pods get evicted from the stopped node, volumes get "force detached" from the powered-off node 6 minutes after that event. There is an ongoing feature to taint the node with a "Shutdown" taint, so that the attach/detach controller knows the node has been shut down and can detach volumes sooner.

Also, another side effect of that decision is that pods running on a shut-down node don't necessarily get deleted on AWS/GCE/OpenStack; they remain in an "Unknown" state. One win I see in this approach for vSphere is the bug I mentioned above: because vSphere deletes the node API object when a node is turned off, the node is also removed from the internal node map cache. This means a detach from that node is not possible. Keeping the node object around would fix that (though there may be other ways to work around this).

cc @saad-ali @sjenning @childsb
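
The detach failure described above can be pictured with a small sketch, again illustrative only (a hypothetical cache, not the vSphere cloud provider's real node manager): once the node's entry is gone from the in-memory node map, the provider can no longer resolve the node to a VM, so the detach fails before it ever reaches vCenter.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeCache stands in for the cloud provider's in-memory node-to-VM map.
// The real vSphere provider keeps richer state; this is a simplified model.
type nodeCache struct {
	vmByNode map[string]string // node name -> VM reference
}

var errNodeNotFound = errors.New("node not found in cache")

// detachDisk looks up the VM backing the node and would then issue the detach.
// If the node API object was deleted on shutdown (taking the cache entry with
// it), the lookup fails and the disk stays attached to the powered-off VM.
func (c *nodeCache) detachDisk(volPath, nodeName string) error {
	vm, ok := c.vmByNode[nodeName]
	if !ok {
		return fmt.Errorf("detach %s from %s: %w", volPath, nodeName, errNodeNotFound)
	}
	fmt.Printf("would detach %s from VM %s\n", volPath, vm)
	return nil
}

func main() {
	c := &nodeCache{vmByNode: map[string]string{"node-x": "vm-1234"}}

	// Node X is powered off; its API object (and with it the cache entry) is removed.
	delete(c.vmByNode, "node-x")

	// The detach is never attempted against vCenter, so the disk remains
	// attached to the powered-off VM (which may later refuse to boot).
	if err := c.detachDisk("[datastore] kubevols/pvc-example.vmdk", "node-x"); err != nil {
		fmt.Println("detach skipped:", err)
	}
}
```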

@nikopen
Contributor Author

nikopen commented Aug 27, 2018

@divyenpatel @BaluDontu @SandeepPissay

As @gnufied said, #51066 was a custom vSphere patch that introduced different behaviour from all other cloud providers, but it caused a much bigger problem: freezing a whole VM for a minute or more, i.e. freezing all pods running on that VM.

This PR reverts that change and returns to the behaviour all other cloud providers have. The solution for the 6-minute shutdown wait time will be implemented with #45986.

Could you please merge this and also approve cherry-picks for all previous major versions affected? That would be 1.8, 1.9, 1.10, 1.11.

Thanks
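
As a side note on the 6-minute figure mentioned above, the force-detach behaviour can be sketched roughly as follows (illustrative only; the constant name is made up, and the real attach/detach controller tracks this per volume and node internally):

```go
package main

import (
	"fmt"
	"time"
)

// maxWaitForUnmount is the grace period after which a volume on an
// unresponsive node is force-detached. Six minutes matches the delay
// described in the discussion above; the constant name is illustrative.
const maxWaitForUnmount = 6 * time.Minute

// shouldForceDetach reports whether enough time has passed since the pod
// using the volume was evicted from the unreachable node.
func shouldForceDetach(evictedAt, now time.Time) bool {
	return now.Sub(evictedAt) >= maxWaitForUnmount
}

func main() {
	evictedAt := time.Now().Add(-7 * time.Minute) // pod evicted 7 minutes ago
	if shouldForceDetach(evictedAt, time.Now()) {
		fmt.Println("force-detaching volume from the powered-off node")
	}
}
```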

@gnufied
Member

gnufied commented Aug 27, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2018
@divyenpatel
Member

divyenpatel commented Aug 27, 2018

@nikopen

With this change, the pod is never able to come up on another node. The disk remains attached to the powered-off node, and pod creation on the new node fails.

Events:
  Type     Reason              Age               From                       Message
  ----     ------              ----              ----                       -------
  Normal   Scheduled           9m                default-scheduler          Successfully assigned default/wordpress-mysql-779b4ddc75-xm5vb to kubernetes-node4
  Warning  FailedAttachVolume  9m                attachdetach-controller    Multi-Attach error for volume "pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c" Volume is already used by pod(s) wordpress-mysql-779b4ddc75-wwdmr
  Warning  FailedMount         21s (x4 over 7m)  kubelet, kubernetes-node4  Unable to mount volumes for pod "wordpress-mysql-779b4ddc75-xm5vb_default(1cdad0c4-aa2c-11e8-8703-0050569d768c)": timeout expired waiting for volumes to attach or mount for pod "default"/"wordpress-mysql-779b4ddc75-xm5vb". list of unmounted volumes=[mysql-persistent-storage]. list of unattached volumes=[mysql-persistent-storage default-token-4kd4n]
# kubectl get pods
NAME                               READY     STATUS              RESTARTS   AGE
wordpress-mysql-779b4ddc75-wwdmr   1/1       Unknown             0          17m
wordpress-mysql-779b4ddc75-xm5vb   0/1       ContainerCreating   0          9m48s
# kubectl get nodes
NAME                STATUS                     ROLES     AGE       VERSION
kubernetes-master   Ready,SchedulingDisabled   <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node1    Ready                      <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node2    NotReady                   <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node3    Ready                      <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node4    Ready                      <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty

Log

{"log":"W0827 19:05:07.450164       1 reconciler.go:370] Multi-Attach error for volume \"pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] c848845b-381e-d536-d784-02002c5be5de/kubernetes-dynamic-pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c.vmdk\") from node \"kubernetes-node4\" Volume is already used by pods default/wordpress-mysql-779b4ddc75-wwdmr on node kubernetes-node2\n","stream":"stderr","time":"2018-08-27T19:05:07.450201326Z"}
{"log":"I0827 19:05:07.450290       1 event.go:221] Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"wordpress-mysql-779b4ddc75-xm5vb\", UID:\"1cdad0c4-aa2c-11e8-8703-0050569d768c\", APIVersion:\"v1\", ResourceVersion:\"2113\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume \"pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c\" Volume is already used by pod(s) wordpress-mysql-779b4ddc75-wwdmr\n","stream":"stderr","time":"2018-08-27T19:05:07.450420381Z"}

@gnufied
Member

gnufied commented Aug 27, 2018

@divyenpatel is that because of #67900, the issue I filed? Anyhow, using multi-attach to mask a detach problem seems incorrect.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@nikopen
Contributor Author

nikopen commented Aug 29, 2018

/test pull-kubernetes-e2e-gce

@frapposelli
Member

/test pull-kubernetes-e2e-kops-aws

@nikopen
Contributor Author

nikopen commented Aug 29, 2018

/test pull-kubernetes-e2e-gce

@nikopen
Contributor Author

nikopen commented Aug 29, 2018

/test pull-kubernetes-e2e-kops-aws

1 similar comment
@nikopen
Contributor Author

nikopen commented Aug 29, 2018

/test pull-kubernetes-e2e-kops-aws

@frapposelli
Member

/test pull-kubernetes-e2e-gce

@frapposelli
Member

/test pull-kubernetes-e2e-kops-aws

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 67745, 67432, 67569, 67825, 67943). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@gnufied
Member

gnufied commented Aug 30, 2018

@nikopen Can you please confirm whether your 1.9 clusters are affected by #67900? Because as per your comment #67825 (comment), detach from shut-down nodes in 1.9 works.

But my testing says otherwise, and I think @divyenpatel has also confirmed this. In 1.9, when a node is shut down it gets removed from the API server and from the node-info map, and hence the DetachDisk call fails. There is another PR for this, #63413, so I must not be alone in this.

@nikopen
Contributor Author

nikopen commented Aug 30, 2018

@gnufied The repro comment holds true; it was tested in both non-patched and patched 1.9.10 clusters. I think the shutdown was issued via ssh && sudo shutdown - does this remove the VM from the vSphere inventory? It probably shouldn't, but I didn't verify this last detail.

I can attempt it tomorrow again and let you know.

@nikopen
Contributor Author

nikopen commented Aug 31, 2018

@gnufied @divyenpatel Just tried a VM hard shutdown on a patched cluster and again couldn't repro #67900. The volume got re-attached after 6 minutes, as expected.

k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
…5-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066

Cherry pick of #67825 on release-1.10.

#67825: Fix VMWare VM freezing bug by reverting #51066
k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
…5-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066

Cherry pick of #67825 on release-1.9.

#67825: Fix VMWare VM freezing bug by reverting #51066
k8s-ci-robot added a commit that referenced this pull request Sep 10, 2018
…5-upstream-release-1.11

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066
@gnufied gnufied mentioned this pull request Oct 11, 2018
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
sig/storage Categorizes an issue or PR as relevant to SIG Storage.
size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Attaching and deleting volumes seems fuzzy