Fix vsphere detach bug #69691
Conversation
@gnufied: This PR is not for the master branch but does not have the cherry-pick approval label. To approve the cherry-pick, please assign the patch release manager for the release branch in a comment. The list of patch release managers for each release can be found here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 3ad84c6 to a44180e
/assign @SandeepPissay
/sig vmware
/test pull-kubernetes-verify
/test pull-kubernetes-integration
Deletion of entries from configmap is missing. Do you plan to handle it in this PR?
Rest looks good to me.
@gnufied Can you please specify how you have tested this change?
@abrarshivani I am still testing it with ESXi 6.7. I will update this PR with confirmation.
/test pull-kubernetes-integration
@abrarshivani alright, so I have tested this with a multi-node cluster deployed using plain k8s on ESXi 6.7, and it does indeed appear to work: volumes are being detached without problems. I am not handling removal of entries from the configmap though. I think that is something that should be done when the node is terminated and removed from the vCenter inventory too.
@nikopen this fix is needed for 1.9, 1.10 and 1.11 (or whichever k8s versions are still considered supported, except master). The PR you linked (#67825) just disables the multi-attach issue with vSphere; the underlying problem remains - we are NOT detaching volumes from shutdown nodes at all in these versions, even after the pods that were running on the shutdown nodes have been deleted.
Indeed, 67825 simply worked around the issue. Did something else fix this in 1.12 onwards? /lgtm
@gnufied Thanks for testing.
/approve |
/approve I'm ok with this. Changing RBAC rules in a minor release might be problematic, but I'm not the expert in that area. /assign @liggitt
If you want to specify the role/binding in a yaml file for vsphere deployers to install, that seems more reasonable than setting this up on all clusters, even non-vsphere ones
@@ -489,6 +489,13 @@ func ClusterRoles() []rbacv1.ClusterRole {
			eventsRule(),
		},
	},
	{
		ObjectMeta: metav1.ObjectMeta{Name: "system:vsphere-cloud-provider"},
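For context, the kind of broad, cluster-wide configmap permission discussed in the review comment below would look roughly like this. This is a minimal sketch using plain rbacv1 types; the exact rules are an assumption, since the diff above is truncated and does not show them.

```go
package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// vsphereCloudProviderClusterRole sketches a cluster-scoped role that grants
// read/write access to every configmap in every namespace when bound with a
// ClusterRoleBinding. The breadth of this rule is what the review comment
// below objects to; the actual rules in the PR are not shown in the diff above.
func vsphereCloudProviderClusterRole() rbacv1.ClusterRole {
	return rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "system:vsphere-cloud-provider"},
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{""},
				Resources: []string{"configmaps"},
				Verbs:     []string{"get", "list", "watch", "create", "update"},
			},
		},
	}
}
```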
This is a really broad permission (global read/write of all configmaps), which doesn't seem necessary. Also, we are actively working to remove provider-specific roles from the default policy set up on all clusters (aws was the only one added before this was well-considered, and is being removed in #66635).
Reduced the scope to just the namespace. It is no longer created as a ClusterRole.
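A namespace-scoped variant along the lines of that change might look roughly like the following. This is a sketch only; the namespace, role name, and verbs are assumptions for illustration, not the PR's actual code.

```go
package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// vsphereCloudProviderRole sketches the narrower form after the review: a
// namespaced Role (here assumed to live in kube-system) rather than a
// ClusterRole, so the cloud provider can only touch configmaps in that one
// namespace. The namespace and name choices are assumptions for illustration.
func vsphereCloudProviderRole() rbacv1.Role {
	return rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "vsphere-cloud-provider",
			Namespace: "kube-system",
		},
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{""},
				Resources: []string{"configmaps"},
				Verbs:     []string{"get", "create", "update"},
			},
		},
	}
}
```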
@liggitt just so I understand this correctly - your recommendation is to provide a YAML file with the necessary roles and bindings, and it is then the job of the installation tool to actually apply that YAML. Is there any other example of this?
Similar to the yaml manifests for the default storage class when running in a particular cloud provider env:
Backporting that change to 1.11 seems like a more reasonable approach than this.
I am not comfortable with changing the lifecycle of node API objects in a minor release that has to be backported all the way back to 1.9. There may be customer scripts/utilities etc. that depend on it. Also - we don't know what breaks in Kubernetes itself if we do that. Having said that, I am not super familiar with the vSphere code and I will have to rely on @SandeepPissay @abrarshivani to verify whether that is something worth considering.
VCP fails to detach a disk from a powered-off node that is still in the vCenter inventory. This is caused by VCP trying to detach disks after the node manager has already unregistered the node information. Use a configmap as a fallback when the node is not available.
Force-pushed from 4c861dc to 70fa029
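The fallback described in that commit message could be implemented along these lines. This is a minimal client-go sketch; the configmap name, namespace, key layout, and the in-memory map standing in for the node manager are all assumptions for illustration, not the PR's actual code.

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeUUIDWithFallback resolves a node's vSphere UUID. It first consults the
// in-memory registration map kept by the node manager (stubbed here as a plain
// map); if the node has already been unregistered (e.g. its Node object was
// deleted on shutdown), it falls back to a configmap that records the
// node-name -> UUID mapping while nodes are registered.
func nodeUUIDWithFallback(ctx context.Context, client kubernetes.Interface,
	registered map[string]string, nodeName string) (string, error) {

	if uuid, ok := registered[nodeName]; ok {
		return uuid, nil
	}

	// Fallback path: the node is gone from the node manager, so read the
	// persisted mapping. Configmap name, namespace, and key layout here are
	// assumptions for illustration, not the PR's actual values.
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx,
		"vsphere-cloud-provider", metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("node %q is not registered and the fallback configmap is unavailable: %v", nodeName, err)
	}
	uuid, ok := cm.Data[nodeName]
	if !ok {
		return "", fmt.Errorf("no UUID recorded for node %q", nodeName)
	}
	return uuid, nil
}
```

The point of the configmap is that it survives deletion of the Node object, which is what makes it usable as a lookup source at detach time for a shut-down node.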
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: abrarshivani, gnufied, saad-ali. If they are not already assigned, you can assign the PR to them in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval in a comment.
@liggitt moved the rbac policy installation to a cluster add-on. It is currently only installed via
Install cloudprovider specific rbac policies
Force-pushed from 70fa029 to ae4aafa
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-kops-aws
@gnufied If the node object is not deleted on node shutdown, then there is no need for this change. Are you sure that the node object is not getting deleted? Is this the behavior in earlier releases of k8s? My understanding is that it is not the case. So, how do you plan to fix this issue for earlier k8s releases?
@SandeepPissay yes, in releases prior to 1.12 the node object is deleted when a node is shut down, and that is why I proposed this PR only for the 1.9, 1.10 and 1.11 releases. What @liggitt is saying is - instead of this PR (which uses a configmap cache for storing the node name/UUID mapping) - can we backport the PR that changed the node deletion behavior on shutdown?
FYI 1.9 is not active anymore - it's just 1.10 and 1.11, as per the three active supported versions.
IMO at this point, since it's about bugfixing for previous releases, if backporting the 1.12 PR has the same effect (fixes the bug) then it's a viable option. Do you expect any different behaviour with that, or extra PRs needed as well?
Backporting the 1.12 PR is a better solution. Do you know which PR fixes the node removal bug?
The PR that changed the lifecycle of the node object for vSphere nodes is #66619, but that PR alone isn't enough to fix the bug in question because it relies on a new cloud interface function. It is possible to rework a patch that just changes the shutdown behaviour for VMs and isn't a backport of the 1.12 PR. Another thing to keep in mind is that changing the lifecycle of node objects (i.e. keeping them around in case of shutdown) won't fix the detach issue by itself: volumes still will not be detached. But then the user will have the option of force-deleting pods to force detach of volumes from powered-down nodes. So basically, vSphere behaviour will become the same as GCE.
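For context, the "new cloud interface function" referred to above (its exact name is not quoted in the comment) is the kind of hook sketched below; the name and signature shown are assumptions for illustration only, not the real cloud-provider API.

```go
package example

import "context"

// ShutdownChecker sketches the shape of the cloud-provider hook that the 1.12
// node-lifecycle change depends on: instead of deleting the Node object, the
// node controller asks the cloud provider whether the backing VM is merely
// powered off. Method name and signature are illustrative assumptions.
type ShutdownChecker interface {
	// InstanceShutdown reports whether the instance backing providerID is
	// powered off, as opposed to removed from the vCenter inventory.
	InstanceShutdown(ctx context.Context, providerID string) (bool, error)
}
```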
Since we went with a different solution, closing.
Fixes Kubernetes not detaching volumes from shutdown nodes even after the pods that are using the volume have been deleted. Fixes #67900
This is a variant of #63413
/sig storage