
Fix vsphere detach bug #69691

Closed

Conversation

gnufied
Member

@gnufied gnufied commented Oct 11, 2018

Fixes Kubernetes not detaching volumes from shutdown nodes even after the pods that were using the volume have been deleted. Fixes #67900

This is a variant of #63413

/sig storage

Fix volume detaches from powered-off vSphere VMs

@k8s-ci-robot k8s-ci-robot added the do-not-merge/cherry-pick-not-approved label Oct 11, 2018
@k8s-ci-robot
Contributor

@gnufied: This PR is not for the master branch but does not have the cherry-pick-approved label. Adding the do-not-merge/cherry-pick-not-approved label.

To approve the cherry-pick, please assign the patch release manager for the release branch by writing /assign @username in a comment when ready.

The list of patch release managers for each release can be found here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the sig/storage, size/L, do-not-merge/release-note-label-needed, cncf-cla: yes, and needs-kind labels Oct 11, 2018
@gnufied
Member Author

gnufied commented Oct 11, 2018

/assign @SandeepPissay

@gnufied
Member Author

gnufied commented Oct 11, 2018

/sig vmware

@k8s-ci-robot k8s-ci-robot added the area/provider/vmware label Oct 11, 2018
@gnufied
Member Author

gnufied commented Oct 11, 2018

/test pull-kubernetes-verify

@gnufied
Member Author

gnufied commented Oct 11, 2018

/test pull-kubernetes-integration

Contributor

@abrarshivani abrarshivani left a comment


Deletion of entries from the configmap is missing. Do you plan to handle it in this PR?
Rest looks good to me.
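As an aside on that configmap cleanup: a minimal sketch of what removing a node's entry could look like, using the client-go API of that era (pre-context signatures). The configmap name, namespace, and key layout here are assumptions for illustration, not this PR's actual code.

```go
package example

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	// Assumed namespace and name; the real configmap used by the PR may differ.
	vcpConfigMapNamespace = "kube-system"
	vcpConfigMapName      = "vsphere-cloud-provider-node-uuids"
)

// removeNodeEntry deletes the entry for nodeName from the configmap, if present.
func removeNodeEntry(client kubernetes.Interface, nodeName string) error {
	cm, err := client.CoreV1().ConfigMaps(vcpConfigMapNamespace).Get(vcpConfigMapName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("fetching configmap: %v", err)
	}
	if _, ok := cm.Data[nodeName]; !ok {
		return nil // nothing to clean up
	}
	delete(cm.Data, nodeName)
	_, err = client.CoreV1().ConfigMaps(vcpConfigMapNamespace).Update(cm)
	return err
}
```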

@abrarshivani
Contributor

@gnufied Can you please specify how you have tested this change?

@roberthbailey roberthbailey removed their request for review October 12, 2018 08:34
@gnufied
Member Author

gnufied commented Oct 12, 2018

@abrarshivani I am still testing it with ESXi 6.7. I will update this PR with confirmation.

@gnufied
Member Author

gnufied commented Oct 16, 2018

/test pull-kubernetes-integration

@gnufied
Member Author

gnufied commented Oct 16, 2018

@abrarshivani alright, so I have tested this with a multi-node cluster deployed using plain k8s on ESXi 6.7, and it does indeed appear to work: volumes are being detached without problems.

I am not handling removal of entries from the configmap, though. I think that is something that should be done when the node is terminated and removed from the vCenter inventory as well.

@nikopen
Contributor

nikopen commented Oct 16, 2018

@gnufied is this fix only needed in 1.11? Is any other issue linked with this PR (apart from #67825)?

@gnufied
Member Author

gnufied commented Oct 16, 2018

@nikopen this fix is needed for 1.9, 1.10, and 1.11 (or whichever k8s versions are still considered supported, except master). The PR you linked (#67825) just disables the multi-attach behaviour for vSphere; the underlying problem remains: in these versions we are NOT detaching volumes from shutdown nodes at all, even after the pods that were running on those nodes have been deleted.

@nikopen
Contributor

nikopen commented Oct 16, 2018

Indeed, #67825 simply worked around the issue.

Did something else fix this in 1.12 onwards?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Oct 16, 2018
@abrarshivani
Contributor

@gnufied Thanks for testing.
/lgtm
/approve

@abrarshivani
Contributor

/approve

@saad-ali
Member

/approve

I'm ok with this. Changing RBAC rules in a minor release might be problematic, but I'm not the expert in that area.

/assign @liggitt

Member

@liggitt liggitt left a comment


If you want to specify the role/binding in a YAML file for vSphere deployers to install, that seems more reasonable than setting this up on all clusters, even non-vSphere ones.

@@ -489,6 +489,13 @@ func ClusterRoles() []rbacv1.ClusterRole {
eventsRule(),
},
},
{
ObjectMeta: metav1.ObjectMeta{Name: "system:vsphere-cloud-provider"},
Member


This is a really broad permission (global read/write of all configmaps), which doesn't seem necessary. Also, we are actively working to remove provider-specific roles from the default policy set up on all clusters (aws was the only one added before this was well-considered, and is being removed in
#66635)

Member Author


Reduced the scope to just the namespace. It is no longer created as a ClusterRole.
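For illustration, a namespaced Role along those lines might look like the sketch below, using the same rbacv1 types as the diff above; the role name, namespace, and verb list are assumptions, not necessarily the exact contents of this PR.

```go
package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// vsphereCloudProviderRole returns a Role scoped to a single namespace,
// instead of a cluster-wide ClusterRole granting access to all configmaps.
func vsphereCloudProviderRole() rbacv1.Role {
	return rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "system:vsphere-cloud-provider", // assumed name
			Namespace: metav1.NamespaceSystem,          // i.e. kube-system
		},
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{""},
				Resources: []string{"configmaps"},
				Verbs:     []string{"get", "list", "watch", "create", "update"},
			},
		},
	}
}
```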

@gnufied
Member Author

gnufied commented Oct 17, 2018

@liggitt just so I understand this correctly: your recommendation is to provide a YAML file with the necessary roles and bindings, and then it is the job of the installation tool to actually apply that YAML. Is there any other example of this?

@liggitt
Member

liggitt commented Oct 17, 2018

Similar to the YAML manifests for the default storage class when running in a particular cloud provider env:

https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/storage-class/aws/default.yaml

@liggitt
Member

liggitt commented Oct 18, 2018

@nikopen we don't need this in 1.12 because we are not deleting Node API objects when a node is shut down, and hence we don't need to save the node's UUID in a configmap.

Backporting that change to 1.11 seems like a more reasonable approach than this.

@gnufied
Member Author

gnufied commented Oct 18, 2018

@nikopen we don't need this in 1.12 because we are not deleting Node API objects when a node is shut down, and hence we don't need to save the node's UUID in a configmap.

Backporting that change to 1.11 seems like a more reasonable approach than this.

I am not comfortable with changing the lifecycle of Node API objects in a minor release, with a backport all the way back to 1.9. We may have customer scripts/utilities etc. that depend on it. Also, we don't know what breaks in Kubernetes itself if we do that. Having said that, I am not super familiar with the vSphere code and I will have to rely on @SandeepPissay @abrarshivani to verify whether that is something worth considering.

VCP fails to detach disks from a powered-off node that is still in the
vCenter inventory. That happens because VCP tries to detach disks after
the node manager has already unregistered the node's information.

Use configmap as a fallback when node is not available
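A rough sketch of the control flow this commit describes: try the live node manager first and fall back to the persisted configmap entry. The type and method names here are hypothetical, chosen only to illustrate the idea, not taken from the PR's actual code.

```go
package example

import "fmt"

// NodeManager knows about nodes that are currently registered with the
// vSphere cloud provider.
type NodeManager interface {
	// GetNodeUUID returns the vSphere VM UUID for a registered node.
	GetNodeUUID(nodeName string) (string, error)
}

// ConfigMapStore is the persisted node-name -> UUID mapping.
type ConfigMapStore interface {
	// LookupUUID returns the UUID saved for a node, if any.
	LookupUUID(nodeName string) (string, bool)
}

// nodeUUIDForDetach prefers the live node manager, but falls back to the
// configmap cache when the node has already been unregistered (e.g. the
// Node API object was deleted after the VM was powered off).
func nodeUUIDForDetach(nm NodeManager, cm ConfigMapStore, nodeName string) (string, error) {
	if uuid, err := nm.GetNodeUUID(nodeName); err == nil {
		return uuid, nil
	}
	if uuid, ok := cm.LookupUUID(nodeName); ok {
		return uuid, nil
	}
	return "", fmt.Errorf("no UUID known for node %q; cannot detach its volumes", nodeName)
}
```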
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Oct 18, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abrarshivani, gnufied, saad-ali
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: eparis

If they are not already assigned, you can assign the PR to them by writing /assign @eparis in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gnufied
Member Author

gnufied commented Oct 18, 2018

@liggitt moved the RBAC policy installation to a cluster add-on. It is currently only installed via ./hack/local-up-cluster.sh since ./kube-up.sh does not support vSphere. PTAL.

Install cloud-provider-specific RBAC policies
@gnufied
Member Author

gnufied commented Oct 18, 2018

/test pull-kubernetes-e2e-gce

@gnufied
Member Author

gnufied commented Oct 18, 2018

/test pull-kubernetes-e2e-kops-aws

@SandeepPissay
Contributor

@nikopen we don't need this in 1.12 because we are not deleting Node API objects when a node is shut down, and hence we don't need to save the node's UUID in a configmap.

Backporting that change to 1.11 seems like a more reasonable approach than this.

I am not comfortable with changing the lifecycle of Node API objects in a minor release, with a backport all the way back to 1.9. We may have customer scripts/utilities etc. that depend on it. Also, we don't know what breaks in Kubernetes itself if we do that. Having said that, I am not super familiar with the vSphere code and I will have to rely on @SandeepPissay @abrarshivani to verify whether that is something worth considering.

@gnufied If the node object is not deleted on node shutdown, then there is no need for this change. Are you sure that the node object is not getting deleted? Is this the behavior in earlier releases of k8s? My understanding is that it is not the case. So, how do you plan to fix this issue for earlier k8s releases?

@gnufied
Member Author

gnufied commented Oct 18, 2018

@SandeepPissay yes, in releases prior to 1.12 the node object is deleted when a node is shut down, and that is why I proposed this PR only for the 1.9, 1.10, and 1.11 releases.

What @liggitt is asking is: instead of this PR (which uses a configmap cache to store the node name/UUID mapping), can we backport the PR that changed the node deletion behavior on shutdown?

@nikopen
Contributor

nikopen commented Oct 19, 2018

FYI 1.9 is not active anymore - it's just 1.10 and 1.11, per the three actively supported versions.

@nikopen
Contributor

nikopen commented Oct 19, 2018

IMO, since at this point this is about bug fixing for previous releases, if backporting the 1.12 PR has the same effect (fixes the bug) then it's a viable option. Do you expect any different behaviour with that, or would extra PRs be needed as well?

@SandeepPissay
Contributor

Backporting the 1.12 PR is a better solution. Do you know which PR fixes the node removal bug?

@gnufied
Member Author

gnufied commented Oct 19, 2018

The PR that changed the lifecycle of node objects for vSphere nodes is #66619, but that PR alone isn't enough to fix the bug in question because it relies on a new cloud provider interface function, InstanceShutdownByProviderID, which was introduced by a different PR - #59323 (which exists in 1.11 but not in 1.10).

It is possible to rework a patch that just changes the shutdown behaviour for VMs and isn't a backport of the 1.12 PR. Another thing to keep in mind: changing the lifecycle of node objects (i.e. keeping them around when a node is shut down) won't fix the detach issue by itself. Volumes still will not be detached, but the user will then have the option of force-deleting pods to force detach of volumes from powered-down nodes. So vSphere behaviour basically becomes the same as GCE's.
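For reference, a paraphrased sketch of the cloud provider hook mentioned above, roughly as it appears in the in-tree cloudprovider package around 1.11/1.12 (not a verbatim copy of the interface):

```go
package example

import "context"

// Instances is a small subset of the cloudprovider.Instances interface.
type Instances interface {
	// InstanceShutdownByProviderID returns true if the instance is shut down
	// (powered off) but still exists in the cloud provider's inventory. The
	// node lifecycle logic can use this to keep the Node API object around
	// instead of deleting it, which is what makes force-deleting pods enough
	// to trigger volume detach.
	InstanceShutdownByProviderID(ctx context.Context, providerID string) (bool, error)
}
```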

@nikopen
Contributor

nikopen commented Oct 24, 2018

@gnufied so if #66619 is backported to 1.10 + 1.11 and #59323 to 1.10, it should be good to go?

@gnufied
Member Author

gnufied commented Oct 26, 2018

@nikopen I have opened a new, smaller PR - #70291 - which just changes the behaviour of shutdown nodes. If we backport it, then at least when a user force-deletes their pods, the volumes will be detached.

@gnufied
Member Author

gnufied commented Nov 10, 2018

Since we went with a different solution, closing.
/close

Labels
area/provider/vmware - Issues or PRs related to vmware provider
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/cherry-pick-not-approved - Indicates that a PR is not yet approved to merge into a release branch.
kind/bug - Categorizes issue or PR as related to a bug.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/storage - Categorizes an issue or PR as relevant to SIG Storage.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.