
kube-dns-anti-affinity: kube-dns never-co-located-in-the-same-node #52193

Merged

Conversation

@StevenACoffman (Contributor) commented Sep 8, 2017

What this PR does / why we need it:

This upstreams the kubernetes/kops#2705 pull request by @jamesbucher, which was originally made against kops.
Please see kubernetes/kops#2705 for more details, including a lengthy discussion.

Briefly, given the constraints of how the system works today:

  • if you need multiple DNS pods primarily for availability, then requiredDuringSchedulingIgnoredDuringExecution makes sense, because putting more than one DNS pod on the same node isn't useful (this variant is sketched just below)
  • if you need multiple DNS pods primarily for performance, then preferredDuringSchedulingIgnoredDuringExecution makes sense, because it will allow the DNS pods to schedule even if they can't be spread across nodes (this variant is sketched later in the thread, where it is adopted)
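
A minimal sketch of the "required" variant for kube-dns (illustrative, not the exact diff in this PR; the k8s-app: kube-dns label and the kubernetes.io/hostname topology key are the conventional choices for kube-dns):

    affinity:
      podAntiAffinity:
        # Hard rule: never schedule two kube-dns pods onto the same node.
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values: ["kube-dns"]
          topologyKey: kubernetes.io/hostname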

Which issue this PR fixes

fixes kubernetes/kops#2693

Release note:

Improve resilience by annotating kube-dns addon with podAntiAffinity to prefer scheduling on different nodes.

@k8s-ci-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) labels Sep 8, 2017
@k8s-ci-robot (Contributor)

Hi @StevenACoffman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label Sep 8, 2017
@k8s-github-robot

Adding do-not-merge/release-note-label-needed because the release note process has not been followed.
One of the following labels is required "release-note", "release-note-action-required", or "release-note-none".
Please see: https://github.com/kubernetes/community/blob/master/contributors/devel/pull-requests.md#write-release-notes-if-needed.

@k8s-github-robot added the do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels) label Sep 8, 2017
@chrislovecnm (Contributor)

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label Sep 8, 2017
@StevenACoffman (Contributor, Author)

/retest

@StevenACoffman (Contributor, Author)

Not sure why this change would have caused the test failure... perhaps the test environment has only a single node?

@StevenACoffman (Contributor, Author)

@justinsb @chrislovecnm I'm not familiar enough with the test suites to understand why this change would cause these errors:

W0909 20:42:43.501] error: context "e2e-f8n-agent-pr-93-0" does not exist
W0909 21:04:47.369] Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
W0909 21:04:47.635] error: context "e2e-f8n-agent-pr-93-0" does not exist
W0909 21:06:54.694] 2017/09/09 21:06:54 main.go:233: Something went wrong: error starting federation: error during ./federation/cluster/federation-up.sh: exit status 124
I0909 21:06:55.796] Starting pull-kubernetes-federation-e2e-gce-26305...
E0909 21:06:55.796] Command failed
I0909 21:06:55.797] process 20344 exited with code 1 after 39.8m
E0909 21:06:55.797] FAIL: pull-kubernetes-federation-e2e-gce

@bowei (Member) commented Sep 11, 2017

@csbell @kubernetes/sig-federation-misc for federation e2e failures

@@ -44,7 +44,38 @@ spec:
        k8s-app: kube-dns
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        # For 1.6, we keep the old tolerations in case of a downgrade to 1.5
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
        # For 1.6, we keep the old affinity annotation in case of a downgrade to 1.5
A Member commented on the diff above:

We might not need these alpha annotations anymore. This PR will likely go in for 1.9, and 1.5 is already 4 major versions away.
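
For context, the alpha tolerations annotation shown above has had a first-class pod-spec field equivalent since 1.6; a minimal sketch (illustrative, not the exact kube-dns manifest):

    spec:
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"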

@MrHohn (Member) commented Sep 15, 2017

/retest

Ref #40063, #41125

cc @thockin

@chrislovecnm (Contributor)

The main question raised on the kops PR was whether we should use required or preferred. @thockin et al, any thoughts?

@ghost added the sig/network (Categorizes an issue or PR as relevant to SIG Network) label and removed the sig/federation label Oct 2, 2017
@k8s-github-robot added the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD) label Oct 4, 2017
@k8s-github-robot removed the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD) label Oct 4, 2017
@johanneswuerbach (Contributor)

Any updates on this? We also just lost cluster DNS service because all DNS pods were located on the same node.

Is there anything I could help with?

@bowei (Member) commented Oct 11, 2017

There is a discussion on the kops repo about using "preferred" vs "required". kubernetes/kops#2705

It seems to me we should go with "preferred" for now, as it is strictly better than what we had before and does not risk kube-dns instances going unscheduled because of the constraint.

If the PR is updated to preferred, I can lgtm.
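
A minimal sketch of the "preferred" variant with a weight (illustrative; the weight of 100 is an assumption, not necessarily the value this PR ends up using):

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values: ["kube-dns"]
            topologyKey: kubernetes.io/hostname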

@StevenACoffman (Contributor, Author)

@bowei Done!

@bowei (Member) commented Oct 11, 2017

/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Oct 11, 2017
@k8s-github-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Oct 11, 2017
@StevenACoffman (Contributor, Author)

Looks like the tests passed, and the weight is corrected. How does this look to you now, @MrHohn and @bowei?

@MrHohn (Member) commented Oct 16, 2017

Thanks!
/lgtm

Could you update the release note as well? Ref PULL_REQUEST_TEMPLATE.

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Oct 16, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, MrHohn, StevenACoffman

Associated issue: 2705

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes) label and removed the do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels) label Oct 16, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 53106, 52193, 51250, 52449, 53861). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot merged commit ef87482 into kubernetes:master Oct 16, 2017
@evanj commented Oct 16, 2017

YAY! Thanks! I noticed today that one of our clusters was running both of its kube-dns instances on a single machine, then found this issue while searching around for bugs or mentions. I was going to attempt this myself, but this is even better!

@StevenACoffman deleted the kube-dns-anti-affinity branch October 17, 2017 01:46
k8s-github-robot pushed a commit that referenced this pull request Oct 18, 2017
…affinity

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Revert "kube-dns-anti-affinity: kube-dns never-co-located-in-the-same-node"

Reverts #52193

As this has slowed down scheduling of kube-dns pods significantly (fixes #54164)

/cc @bowei @MrHohn @StevenACoffman
@luxas (Member) commented Oct 19, 2017

This got reverted now, but please do keep kubeadm in sync at all times. The code lives in cmd/kubeadm as @johanneswuerbach pointed out (thanks!)

@jonastl commented Nov 27, 2017

  1. For what reason was this reverted?
  2. Was "Kube-dns pods coming up super slowly in large clusters" (#54164) referring to an actual production cluster, or a "toy test" setup?
  3. What is the current plan for ensuring the intended behavior of this PR?

I just opened a ticket on this very issue (in the wrong project, unfortunately), as a result of the described behavior causing problems in one of our production clusters.

We have three small control nodes which host critical infrastructure pods, and a slew of preemptible nodes (on GKE). Without this patch we end up with poor scheduling decisions, where the scheduler stacks kube-dns pods on top of each other on the same nodes, causing our own workload pods to fail scheduling because kube-dns has consumed the Allocatable resources.

We had hoped to use the merged patch, together with node anti-affinity to keep kube-dns away from the preemptible worker nodes, to avoid the problem described in #41125 (spotty DNS service).

Without this patch we'll end up with dozens of kube-dns pods (created by kube-dns-autoscaler), all stacked onto 3 x 2-core machines, consuming all allocatable CPU and preventing our own critical backplane pods from being scheduled.

Right now it seems our only option is to buy more machines for the sole purpose of hosting kube-dns. The pods must run somewhere in the cluster, but we don't want them on preemptible nodes, and we don't want more than one on each control node. Since the revert prevents us from telling the scheduler "not more than one per node!", it seems we'll need a new set of machines dedicated to hosting kube-dns. This increases our cost and adds more moving parts to manage (an additional node group).

@bowei (Member) commented Nov 27, 2017

Unfortunately the scheduling feature (anti-affinity) does not currently scale very well, so it cannot be used for a system service such as this. When the scheduling team improves the performance of the feature, we can re-examine the scheduler constraint...

@jonastl commented Nov 27, 2017

Bowei, do you have a proposal or suggestion for how to cope with kube-dns until such a fix is out?
Is a dedicated kube-dns "server farm" the only option?

Edit: If we had self-hosted k8s, we could of course have disabled the kube-dns autoscaler, set the deployment replica count manually, and revised the deployment to enable anti-affinity ourselves. Unfortunately we're using Google, and they have a reconciliation process that tends to overwrite/revert customer-made changes in kube-system.
This sort of issue straddles k8s and Google, and I'm not sure whether to carry on this discussion in a k8s context or through our Google support channel. They tend to point us here anyway as soon as it even has a potential of being a k8s concern.

@bowei (Member) commented Nov 27, 2017

It is possible to run a customized version of kube-dns on the cluster (a command sketch follows the steps below):

NOTE: after following these steps, you will be responsible for maintaining the configuration of the DNS system. Changes can always be reverted to return to the original configuration.

  • Download your current cluster kube-dns yaml:
    • kubectl get -n kube-system -o yaml deployment/kube-dns > my-kube-dns.yaml
    • Modify the YAML with your custom settings.
    • Change the name of the deployment in the metadata section to my-kube-dns.
    • Remove the label kubernetes.io/cluster-service: "true" from my-kube-dns.yaml. (The label is used by the Kubernetes addon manager to distinguish user resources from those managed by the Kubernetes system itself)
    • Make sure the cluster domain is set properly in my-kube-dns.yaml. (Search for cluster.local)
  • Scale the dns autoscaler replicas to 0. You can use kubectl scale --namespace kube-system deploy/kube-dns-autoscaler --replicas 0.
  • The next steps will disrupt DNS temporarily in the cluster:
  • Create the deployment my-kube-dns.yaml as constructed above. ClusterFirst requests will be sent to my-kube-dns as well as kube-dns because they have the same application label.
  • Scale the existing kube-system:kube-dns replicas to 0 to remove traffic from kube-dns. You can do this by setting the DNS autoscaler to have 0 replicas.
  • Scale up my-kube-dns by setting replicas to the appropriate value.
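
A condensed sketch of the steps above as commands (the deployment name my-kube-dns and the final replica count are illustrative assumptions):

    # 1. Export the current manifest, then edit it: rename the deployment to
    #    my-kube-dns, drop the cluster-service/addon-manager labels, and
    #    verify the cluster domain (cluster.local).
    kubectl get -n kube-system -o yaml deployment/kube-dns > my-kube-dns.yaml
    # 2. Disable the DNS autoscaler so it stops managing replica counts.
    kubectl scale --namespace kube-system deploy/kube-dns-autoscaler --replicas=0
    # 3. Create the customized deployment (brief DNS disruption is possible).
    kubectl create -f my-kube-dns.yaml
    # 4. Drain traffic from the stock deployment and scale up the custom one.
    kubectl scale --namespace kube-system deploy/kube-dns --replicas=0
    kubectl scale --namespace kube-system deploy/my-kube-dns --replicas=2  # pick a count appropriate for your cluster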

@jonastl commented Nov 27, 2017

Excellent. Just one question.
What about the label addonmanager.kubernetes.io/mode: Reconcile?
I take it that should be removed as well.

@bowei (Member) commented Nov 27, 2017

Yes, anything related to the addon manager will need to be removed.
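
For example, the metadata of the copied manifest would end up looking roughly like this sketch (the label names are the ones mentioned in this thread; the rest of the manifest is omitted):

    metadata:
      name: my-kube-dns
      namespace: kube-system
      labels:
        k8s-app: kube-dns
        # removed: kubernetes.io/cluster-service: "true"
        # removed: addonmanager.kubernetes.io/mode: Reconcile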

@jonastl commented Nov 27, 2017

Perfect, thanks!

@StevenACoffman (Contributor, Author) commented Nov 28, 2017

Unfortunately the scheduling feature (anti-affinity) does not currently scale very well, so it cannot be used for a system service such as this. When the scheduling team improves the performance of the feature, we can re-examine the scheduler constraint...

@bowei Is there an issue tracking the anti-affinity performance/scaling problem we can link to this PR to remind us to revisit it?

@MrHohn (Member) commented Nov 28, 2017

@bowei Is there an issue tracking the anti-affinity performance/scaling problem we can link to this PR to remind us to revisit it?

I believe that issue is #54189.

Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/network - Categorizes an issue or PR as relevant to SIG Network.
  • size/M - Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Kube-dns pods shouldn't all be on one node
10 participants