DefaultRole being returned for new pods under load #100

Open
jrnt30 opened this issue Sep 22, 2017 · 2 comments

Comments

@jrnt30
Collaborator

jrnt30 commented Sep 22, 2017

In a comment on #92, @dadux mentioned a few issues they were seeing with the DefaultRole being returned for pods under load. I wanted to get this out there so we can track it.

We've tried running off this branch in our dev environments, but still get a lot of "default role" under load (our build-engineering team spinning up 100s of concurrent jobs).

I want to dig into the lifecycle of the callback handlers a bit more closely. The only way I think this could be occurring is if:

  • Indexer gets an Add/Update event for a new pod that contains the IP but not the annotation (this would add the pod to the IP -> Pod store at that point)
  • Request comes in for credentials, GetRoleMapping is hit and returns the partial pod representation (missing explicit annotation) and falls back to the default role
  • Some time later, Update event is received with the fully represented Pod that contains the appropriate annotation information
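The sequence above can be sketched in a few lines of Go. This is only an illustration of the suspected race, not kube2iam's actual code: the `pod` and `podStore` types, the `onEvent` and `roleFor` functions, the role names, and the annotation key are all assumptions made up for the example.

```go
package main

import "fmt"

// pod is a hypothetical, stripped-down stand-in for a Kubernetes pod object.
type pod struct {
	IP          string
	Annotations map[string]string
}

// roleAnnotation is an assumed annotation key, used here only for illustration.
const roleAnnotation = "iam.amazonaws.com/role"

// podStore is a hypothetical IP -> pod index fed by Add/Update events.
type podStore struct {
	byIP        map[string]pod
	defaultRole string
}

func newPodStore(defaultRole string) *podStore {
	return &podStore{byIP: make(map[string]pod), defaultRole: defaultRole}
}

// onEvent mimics an Add/Update handler: it indexes any pod that has an IP,
// whether or not the role annotation is present yet.
func (s *podStore) onEvent(p pod) {
	if p.IP != "" {
		s.byIP[p.IP] = p
	}
}

// roleFor mimics a lookup that falls back to the default role immediately
// when the indexed pod has no role annotation.
func (s *podStore) roleFor(ip string) string {
	if p, ok := s.byIP[ip]; ok {
		if role, ok := p.Annotations[roleAnnotation]; ok {
			return role
		}
	}
	return s.defaultRole
}

func main() {
	s := newPodStore("default-role")

	// Step 1: first event carries the IP but no annotation.
	s.onEvent(pod{IP: "10.0.0.5"})

	// Step 2: a credentials request arrives and gets the default role.
	fmt.Println(s.roleFor("10.0.0.5"))

	// Step 3: a later Update event carries the annotation.
	s.onEvent(pod{IP: "10.0.0.5", Annotations: map[string]string{roleAnnotation: "build-role"}})
	fmt.Println(s.roleFor("10.0.0.5"))
}
```

If this is what is happening, any request that lands between the first and second event gets the default role even though the pod is annotated correctly.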
@jtblin
Owner

jtblin commented Nov 27, 2017

As per #92 (comment), I'm thinking of moving the fallback to the default role further down the chain so that we return an error from extractRoleARN. That will trigger the exponential backoff retry of the operation, which should be able to pick up the updated annotation. There will be some latency impact, but hopefully it's acceptable.
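A rough sketch of that idea, using only the standard library: the lookup returns an error while the annotation is missing, a retry loop with exponential backoff gives the Update event time to arrive, and the default role is applied only once the retries are exhausted. The names `errNoRoleYet`, `retryWithBackoff`, and `resolveRole`, the attempt count, and the delays are hypothetical and are not taken from the kube2iam code base.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoRoleYet is a hypothetical error for a pod that is indexed
// but does not carry a role annotation yet.
var errNoRoleYet = errors.New("pod has no role annotation yet")

// retryWithBackoff calls op until it succeeds or attempts run out,
// doubling the wait between consecutive tries.
func retryWithBackoff(attempts int, initial time.Duration, op func() (string, error)) (string, error) {
	var lastErr error
	wait := initial
	for i := 0; i < attempts; i++ {
		role, err := op()
		if err == nil {
			return role, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return "", lastErr
}

// resolveRole applies the default role only after the retries are exhausted,
// instead of falling back on the first failed lookup.
func resolveRole(lookup func() (string, error), defaultRole string) string {
	role, err := retryWithBackoff(4, time.Millisecond, lookup)
	if err != nil {
		return defaultRole
	}
	return role
}

func main() {
	calls := 0
	// Simulated lookup: the annotation only becomes visible on the third call.
	lookup := func() (string, error) {
		calls++
		if calls < 3 {
			return "", errNoRoleYet
		}
		return "build-role", nil
	}
	fmt.Println(resolveRole(lookup, "default-role"), "after", calls, "calls")
}
```

The trade-off is the one mentioned above: pods that genuinely have no annotation now wait for the full retry budget before they receive the default role.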

@jagregory

Hey folks, I think I'm hitting this issue. I have a cronjob which kicks off a pod once a minute, and it fairly frequently fails the first time a pod comes up in that minute, and then the second or third retry usually works.

I've just put in a retry on startup so it should back off and retry for a while. Will see if that makes any difference.
