
kube-dns-anti-affinity: kube-dns never-co-located-in-the-same-node #52193

Merged

Conversation

@StevenACoffman (Contributor) commented Sep 8, 2017

What this PR does / why we need it:

This upstreams the kubernetes/kops#2705 pull request by @jamesbucher, which was originally made against kops.
Please see kubernetes/kops#2705 for more details, including a lengthy discussion.

Briefly, given the constraints of how the system works today:

  • if you need multiple DNS pods primarily for availability, then requiredDuringSchedulingIgnoredDuringExecution makes sense, because putting more than one DNS pod on the same node isn't useful (this variant is sketched just below)
  • if you need multiple DNS pods primarily for performance, then preferredDuringSchedulingIgnoredDuringExecution makes sense, because it will allow the DNS pods to schedule even if they can't be spread across nodes (this variant is sketched later in the thread, where it is adopted)
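
A minimal sketch of the "required" variant for kube-dns (illustrative, not the exact diff in this PR; the k8s-app: kube-dns label and the kubernetes.io/hostname topology key are the conventional choices for kube-dns):

    affinity:
      podAntiAffinity:
        # Hard rule: never schedule two kube-dns pods onto the same node.
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values: ["kube-dns"]
          topologyKey: kubernetes.io/hostname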

Which issue this PR fixes

fixes kubernetes/kops#2693

Release note:

Improve resilience by annotating kube-dns addon with podAntiAffinity to prefer scheduling on different nodes.

@k8s-ci-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) labels Sep 8, 2017
@k8s-ci-robot (Contributor)

Hi @StevenACoffman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label Sep 8, 2017
@k8s-github-robot

Adding do-not-merge/release-note-label-needed because the release note process has not been followed.
One of the following labels is required "release-note", "release-note-action-required", or "release-note-none".
Please see: https://github.com/kubernetes/community/blob/master/contributors/devel/pull-requests.md#write-release-notes-if-needed.

@k8s-github-robot added the do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels) label Sep 8, 2017
@chrislovecnm (Contributor)

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label Sep 8, 2017
@StevenACoffman (Contributor, Author)

/retest

@StevenACoffman (Contributor, Author)

Not sure why this change would have caused the test failure... perhaps the test environment has only a single node?

@StevenACoffman (Contributor, Author)

@justinsb @chrislovecnm I'm not familiar enough with the test suites to understand why this change would cause these errors:

W0909 20:42:43.501] error: context "e2e-f8n-agent-pr-93-0" does not exist
W0909 21:04:47.369] Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
W0909 21:04:47.635] error: context "e2e-f8n-agent-pr-93-0" does not exist
W0909 21:06:54.694] 2017/09/09 21:06:54 main.go:233: Something went wrong: error starting federation: error during ./federation/cluster/federation-up.sh: exit status 124
I0909 21:06:55.796] Starting pull-kubernetes-federation-e2e-gce-26305...
E0909 21:06:55.796] Command failed
I0909 21:06:55.797] process 20344 exited with code 1 after 39.8m
E0909 21:06:55.797] FAIL: pull-kubernetes-federation-e2e-gce

@bowei (Member) commented Sep 11, 2017

@csbell @kubernetes/sig-federation-misc for federation e2e failures

@@ -44,7 +44,38 @@ spec:
        k8s-app: kube-dns
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        # For 1.6, we keep the old tolerations in case of a downgrade to 1.5
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
        # For 1.6, we keep the old affinity annotation in case of a downgrade to 1.5
A Member commented on the diff above:

We might not need these alpha annotations anymore. This PR will likely go in for 1.9, and 1.5 is already 4 major versions away.
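
For context, the alpha tolerations annotation shown above has had a first-class pod-spec field equivalent since 1.6; a minimal sketch (illustrative, not the exact kube-dns manifest):

    spec:
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"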

@MrHohn (Member) commented Sep 15, 2017

/retest

Ref #40063, #41125

cc @thockin

@chrislovecnm (Contributor)

The main question raised on the kops PR was whether we should use required or preferred. @thockin et al, any thoughts?

@ghost added the sig/network (Categorizes an issue or PR as relevant to SIG Network) label and removed the sig/federation label Oct 2, 2017
@k8s-github-robot added the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD) label Oct 4, 2017
@k8s-github-robot removed the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD) label Oct 4, 2017
@johanneswuerbach (Contributor)

Any updates on this? We also just lost cluster DNS service because all DNS pods were located on the same node.

Is there anything I could help with?

@bowei (Member) commented Oct 11, 2017

There is a discussion on the kops repo about using "preferred" vs "required". kubernetes/kops#2705

It seems to me we should go with "preferred" for now, as it is strictly better than what we had before and does not risk kube-dns instances going unscheduled because of the constraint.

If the PR is updated to preferred, I can lgtm.
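
A minimal sketch of the "preferred" variant with a weight (illustrative; the weight of 100 is an assumption, not necessarily the value this PR ends up using):

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values: ["kube-dns"]
            topologyKey: kubernetes.io/hostname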

@StevenACoffman (Contributor, Author)

@bowei Done!

@bowei (Member) commented Oct 11, 2017

/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Oct 11, 2017
@k8s-github-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Oct 11, 2017
@StevenACoffman (Contributor, Author)

Looks like the tests passed, and the weight is corrected. How does this look to you now, @MrHohn and @bowei?

@MrHohn (Member) commented Oct 16, 2017

Thanks!
/lgtm

Could you update the release note as well? Ref PULL_REQUEST_TEMPLATE.

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Oct 16, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, MrHohn, StevenACoffman

Associated issue: 2705

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes) label and removed the do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels) label Oct 16, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 53106, 52193, 51250, 52449, 53861). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot merged commit ef87482 into kubernetes:master Oct 16, 2017
@evanj commented Oct 16, 2017

YAY! Thanks! I noticed today that one of our clusters was running both of its kube-dns instances on a single machine, then found this issue while searching around for bugs or mentions. I was going to attempt this myself, but this is even better!

@StevenACoffman deleted the kube-dns-anti-affinity branch October 17, 2017 01:46
k8s-github-robot pushed a commit that referenced this pull request Oct 18, 2017
…affinity

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Revert "kube-dns-anti-affinity: kube-dns never-co-located-in-the-same-node"

Reverts #52193

As this has slowed down scheduling of kube-dns pods significantly (fixes #54164)

/cc @bowei @MrHohn @StevenACoffman
@luxas (Member) commented Oct 19, 2017

This got reverted now, but please do keep kubeadm in sync at all times. The code lives in cmd/kubeadm as @johanneswuerbach pointed out (thanks!)

@jonastl commented Nov 27, 2017

  1. For what reason was this reverted?
  2. Was "Kube-dns pods coming up super slowly in large clusters" (#54164) referring to an actual production cluster, or a "toy test" setup?
  3. What is the current plan for ensuring the intended behavior of this PR?

I just opened a ticket on this very issue (in the wrong project, unfortunately), as a result of the described behavior causing problems in one of our production clusters.

We have three small control nodes which host critical infrastructure pods, and a slew of preemptible nodes (on GKE). Without this patch we end up with poor scheduling decisions, where the scheduler stacks kube-dns pods on top of each other on the same nodes, causing our own workload pods to fail scheduling because kube-dns has consumed the Allocatable resources.

We had hoped to use the merged patch, together with node anti-affinity to keep kube-dns away from the preemptible worker nodes, to avoid the problem described in #41125 (spotty DNS service).

Without this patch we'll end up with dozens of kube-dns pods (created by kube-dns-autoscaler), all stacked onto 3 x 2-core machines, consuming all allocatable CPU and preventing our own critical backplane pods from being scheduled.

Right now it seems our only option is to buy more machines for the sole purpose of hosting kube-dns. The pods must run somewhere in the cluster, but we don't want them on preemptible nodes, and we don't want more than one on each control node. Since the revert prevents us from telling the scheduler "not more than one per node!", it seems we'll need a new set of machines dedicated to hosting kube-dns. This increases our cost and adds more moving parts to manage (an additional node group).

@bowei (Member) commented Nov 27, 2017

Unfortunately the scheduling feature (anti-affinity) does not currently scale very well, so it cannot be used for a system service such as this. When the scheduling team improves the performance of the feature, we can re-examine the scheduler constraint...

@jonastl commented Nov 27, 2017

Bowei, do you have a proposal or suggestion for how to cope with kube-dns until such a fix is out?
Is a dedicated kube-dns "server farm" the only option?

Edit: If we had self-hosted k8s, we could of course have disabled the kube-dns autoscaler, set the deployment replica count manually, and revised the deployment to enable anti-affinity ourselves. Unfortunately we're using Google, and they have a reconciliation process that tends to overwrite/revert customer-made changes in kube-system.
This sort of issue straddles k8s and Google, and I'm not sure whether to carry on this discussion in a k8s context or through our Google support channel. They tend to point us here anyway as soon as it even has a potential of being a k8s concern.

@bowei (Member) commented Nov 27, 2017

It is possible to run a customized version of kube-dns on the cluster (a command sketch follows the steps below):

NOTE: after following these steps, you will be responsible for maintaining the configuration of the DNS system. Changes can always be reverted to return to the original configuration.

  • Download your current cluster kube-dns yaml:
    • kubectl get -n kube-system -o yaml deployment/kube-dns > my-kube-dns.yaml
    • Modify the YAML with your custom settings.
    • Change the name of the deployment in the metadata section to my-kube-dns.
    • Remove the label kubernetes.io/cluster-service: "true" from my-kube-dns.yaml. (The label is used by the Kubernetes addon manager to distinguish user resources from those managed by the Kubernetes system itself)
    • Make sure the cluster domain is set properly in my-kube-dns.yaml. (Search for cluster.local)
  • Scale the dns autoscaler replicas to 0. You can use kubectl scale --namespace kube-system deploy/kube-dns-autoscaler --replicas 0.
  • The next steps will disrupt DNS temporarily in the cluster:
  • Create the deployment my-kube-dns.yaml as constructed above. ClusterFirst requests will be sent to my-kube-dns as well as kube-dns because they have the same application label.
  • Scale the existing kube-system:kube-dns replicas to 0 to remove traffic from kube-dns. You can do this by setting the DNS autoscaler to have 0 replicas.
  • Scale up my-kube-dns by setting replicas to the appropriate value.
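
A condensed sketch of the steps above as commands (the deployment name my-kube-dns and the final replica count are illustrative assumptions):

    # 1. Export the current manifest, then edit it: rename the deployment to
    #    my-kube-dns, drop the cluster-service/addon-manager labels, and
    #    verify the cluster domain (cluster.local).
    kubectl get -n kube-system -o yaml deployment/kube-dns > my-kube-dns.yaml
    # 2. Disable the DNS autoscaler so it stops managing replica counts.
    kubectl scale --namespace kube-system deploy/kube-dns-autoscaler --replicas=0
    # 3. Create the customized deployment (brief DNS disruption is possible).
    kubectl create -f my-kube-dns.yaml
    # 4. Drain traffic from the stock deployment and scale up the custom one.
    kubectl scale --namespace kube-system deploy/kube-dns --replicas=0
    kubectl scale --namespace kube-system deploy/my-kube-dns --replicas=2  # pick a count appropriate for your cluster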

@jonastl commented Nov 27, 2017

Excellent. Just one question.
What about the label addonmanager.kubernetes.io/mode: Reconcile?
I take it that should be removed as well.

@bowei (Member) commented Nov 27, 2017

Yes, anything related to the addon manager will need to be removed.
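
For example, the metadata of the copied manifest would end up looking roughly like this sketch (the label names are the ones mentioned in this thread; the rest of the manifest is omitted):

    metadata:
      name: my-kube-dns
      namespace: kube-system
      labels:
        k8s-app: kube-dns
        # removed: kubernetes.io/cluster-service: "true"
        # removed: addonmanager.kubernetes.io/mode: Reconcile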

@jonastl commented Nov 27, 2017

Perfect, thanks!

@StevenACoffman (Contributor, Author) commented Nov 28, 2017

Unfortunately the scheduling feature (anti-affinity) does not currently scale very well, so it cannot be used for a system service such as this. When the scheduling team improves the performance of the feature, we can re-examine the scheduler constraint...

@bowei Is there an issue tracking the anti-affinity performance/scaling problem we can link to this PR to remind us to revisit it?

@MrHohn (Member) commented Nov 28, 2017

@bowei Is there an issue tracking the anti-affinity performance/scaling problem we can link to this PR to remind us to revisit it?

I believe that issue is #54189.

Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/network - Categorizes an issue or PR as relevant to SIG Network.
  • size/M - Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Kube-dns pods shouldn't all be on one node
10 participants