Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issue that pull image failed from a cross-subscription Azure Container Registry by SP access #66429

Merged
merged 1 commit into from
Jul 23, 2018

Conversation

andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Jul 20, 2018

What this PR does / why we need it:
after granting sp access to azure ACR , pull image from ACR would fail, and after wait about 15-30min(or restart kubelet directly), pull image would succeed. Root cause is that servicePrincipalToken needs to be refreshed when doing registryClient.List, otherwise it will always return empty registry list. Pull image error would be like following:

Events:
  Type     Reason                 Age              From                               Message
  ----     ------                 ----             ----                               -------
  Warning  FailedScheduling       8m (x3 over 8m)  default-scheduler                  0/1 nodes are available: 1 Insufficient cpu.
  Normal   Scheduled              8m               default-scheduler                  Successfully assigned nginx-server-776564f79c-zhtjk to aks-nodepool1-20881069-0
  Normal   SuccessfulMountVolume  8m               kubelet, aks-nodepool1-20881069-0  MountVolume.SetUp succeeded for volume "default-token-4t7tk"
  Normal   SuccessfulMountVolume  8m               kubelet, aks-nodepool1-20881069-0  MountVolume.SetUp succeeded for volume "pvc-5c1f0521-739f-11e8-9b69-0a58ac1f09c2"
  Warning  Failed                 8m (x5 over 8m)  kubelet, aks-nodepool1-20881069-0  Error: ImagePullBackOff
  Normal   BackOff                8m (x5 over 8m)  kubelet, aks-nodepool1-20881069-0  Back-off pulling image "andyacr.azurecr.io/nginx-server:1.0.0"
  Warning  Failed                 8m (x2 over 8m)  kubelet, aks-nodepool1-20881069-0  Error: ErrImagePull
  Warning  Failed                 8m (x2 over 8m)  kubelet, aks-nodepool1-20881069-0  Failed to pull image "andyacr.azurecr.io/nginx-server:1.0.0": rpc error: code = Unknown desc = Error response from daemon: Get https://andyacr.azurecr.io/v2/nginx-server/manifests/1.0.0: unauthorized: authentication required

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #65225

Special notes for your reviewer:
After discuss with dong, registryClient.List won't be necessary, instead we return {"*.azurecr.io", "*.azurecr.cn", "*.azurecr.de", "*.azurecr.us"} like aws, gce code logic, it will do the url matching.
I will cherry pick this PR to all supported version, every version has this issue.

Neet to mention that this PR would also make image pull from a cross-subscription ACR work after we grant SP access to an ACR in a different subs. Original code won’t work because it’s using the SubscriptionID of current AKS cluster, since ACR is in a different subs, registryClient.List will never get cross subs ACR repos.

Release note:

fix issue that pull image failed from a cross-subscription Azure Container Registry

/sig azure
/assign @feiskyer @khenidak @brendandburns @karataliu

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jul 20, 2018
@k8s-ci-robot k8s-ci-robot added sig/azure size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 20, 2018
@andyzhangx
Copy link
Member Author

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 20, 2018
@@ -177,6 +170,16 @@ func (a *acrProvider) Provide() credentialprovider.DockerConfig {
defer cancel()

glog.V(4).Infof("listing registries")

var err error
a.servicePrincipalToken, err = auth.GetServicePrincipalToken(a.config, a.environment)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will call AAD API every time to get token every time 'Provide' is called.
Can it cache 'servicePrincipalToken', and 'registryClient', and only do a re-fetch if later 'registryClient.List' fails?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to do this only when not getting the expected result.

Since the list is actually successful but without the expected result, you can add a check of the list, e.g. len(result) == 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karataliu @feiskyer The Provide func call happens when there is pull image requirement, and if pull image fails, it will try to rerun Provide by lookup func every 5 min, so not every kubelet will call this func only when there is pull image failure, related invocation code is here:

creds, withCredentials := keyring.Lookup(repoToPull)

@karataliu
Copy link
Contributor

Seems the config key can be a glob pattern, so we might just remove the list acr part.

func urlsMatchStr(glob string, target string) (bool, error) {

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 20, 2018
@andyzhangx andyzhangx changed the title fix acr could not be listed in sp issue [WIP]fix acr could not be listed in sp issue Jul 20, 2018
@andyzhangx
Copy link
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 20, 2018
@andyzhangx andyzhangx closed this Jul 20, 2018
@k8s-ci-robot k8s-ci-robot added retest-not-required-docs-only size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 20, 2018
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 20, 2018
@andyzhangx
Copy link
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 20, 2018
@andyzhangx
Copy link
Member Author

@karataliu code changed, and test pass, PTAL

@andyzhangx andyzhangx changed the title [WIP]fix acr could not be listed in sp issue fix acr could not be listed in sp issue Jul 20, 2018
for ix := range result {
loginServer := getLoginServer(result[ix])
var cred *credentialprovider.DockerConfigEntry
if a.config.UseManagedIdentityExtension {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, we keep the original logic for UseManagedIdentityExtension, while for sp, we always return loginServer as *.azurecr.io, *.azurecr.cn and that would save ARM calls

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the new method

@andyzhangx
Copy link
Member Author

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-kops-aws

@feiskyer
Copy link
Member

/lgtm
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 23, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 49670be into kubernetes:master Jul 23, 2018
k8s-github-robot pushed a commit that referenced this pull request Jul 30, 2018
…6429-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #66429: fix acr sp access issue

Cherry pick of #66429 on release-1.10.

#66429: fix acr sp access issue
k8s-github-robot pushed a commit that referenced this pull request Aug 2, 2018
…6429-upstream-release-1.11

Automatic merge from submit-queue.

Automated cherry pick of #66429: fix acr sp access issue

Cherry pick of #66429 on release-1.11.

#66429: fix acr sp access issue
@andyzhangx andyzhangx changed the title fix acr could not be listed in sp issue fix issue that pull image failed from a cross-subscription Azure Container Registry Aug 13, 2018
@keyoke
Copy link

keyoke commented Aug 27, 2018

I'm sorry but maybe I am missing something but I'm not sure how the code solves the cross-sub acr access issue for MSI?
We are still using registryClient.List(ctx) will list this list registries across subscriptions? -

if a.config.UseManagedIdentityExtension {

@andyzhangx
Copy link
Member Author

@keyoke, this PR fix cross-sub acr access issue for sp, not MSI

@andyzhangx andyzhangx changed the title fix issue that pull image failed from a cross-subscription Azure Container Registry fix issue that pull image failed from a cross-subscription Azure Container Registry by SP access Aug 27, 2018
@keyoke
Copy link

keyoke commented Aug 27, 2018

@andyzhangx - Ok thanks any plans for us to achieve the same with MSI?

@andyzhangx
Copy link
Member Author

I filed another issue here: #67892, I will dig into the code and find a way. You could also comment if you have any better idea, thanks.

k8s-github-robot pushed a commit that referenced this pull request Sep 2, 2018
…6429-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #66429: fix acr sp access issue

Cherry pick of #66429 on release-1.9.

#66429: fix acr sp access issue

**Release note**:

```
fix issue that pull image failed from a cross-subscription Azure Container Registry
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kubelet needs restart after granting sp access to azure acr
8 participants