Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make healthchecks skippable, and check masters only #56130

Merged
merged 2 commits into from
Nov 24, 2017

Conversation

anguslees
Copy link
Member

@anguslees anguslees commented Nov 21, 2017

What this PR does / why we need it:

Previously kubeadm would abort if any node was not Ready. This is obviously infeasible in a non-trivial (esp. baremetal) cluster.

This PR makes two changes:

  • Allows kubeadm healthchecks to be selectively skipped (made non-fatal) with --ignore-checks-errors.
  • Check only that the master nodes are Ready.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes kubernetes/kubeadm#539

Special notes for your reviewer:

Builds on #56072

Release note:

kubeadm health checks can also be skipped with `--ignore-preflight-errors`

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 21, 2017
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 21, 2017
@anguslees anguslees changed the title Kubeadm nodehealth Make healthchecks skippable, and check masters only Nov 21, 2017
@anguslees
Copy link
Member Author

/assign @luxas @kad

@@ -106,16 +107,25 @@ func apiServerHealthy(client clientset.Interface) error {
return nil
}

// nodesHealthy checks whether all Nodes in the cluster are in the Running state
func nodesHealthy(client clientset.Interface) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually keep this check and produce warning in case some nodes are not in healthy state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree if possible to plumb through a warning -- users may want a heads up if their cluster isn't at full strength.

Copy link
Member Author

@anguslees anguslees Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, it is easy to add a warning of course, and I'm happy to do so if you insist.

I strongly disagree in principle however: kubeadm is not a monitoring service - if the admin wants to be informed that they have a node down, they shouldn't run kubeadm to discover that. Furthermore, I think the general posture of "we should protect the kubeadm user from unspecified problems that don't actually prevent an upgrade" has a real danger of making kubeadm fragile and not useful for real work (eg kubernetes/kubeadm#539 made kubeadm unusable for me). Imo kubeadm must be a do-what-I-say tool.

@@ -106,16 +107,25 @@ func apiServerHealthy(client clientset.Interface) error {
return nil
}

// nodesHealthy checks whether all Nodes in the cluster are in the Running state
func nodesHealthy(client clientset.Interface) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree if possible to plumb through a warning -- users may want a heads up if their cluster isn't at full strength.

@jbeda
Copy link
Contributor

jbeda commented Nov 22, 2017

/approve

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 22, 2017
@jbeda jbeda added this to the v1.9 milestone Nov 22, 2017
@anguslees
Copy link
Member Author

/kind cleanup
/priority important-soon
/sig cluster-lifecycle
/area kubeadm

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. area/kubeadm labels Nov 22, 2017
@k8s-github-robot k8s-github-robot added milestone/needs-approval needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed milestone/incomplete-labels labels Nov 22, 2017
@luxas
Copy link
Member

luxas commented Nov 22, 2017

LGTM overall after a rebase

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 23, 2017
@luxas luxas added status/approved-for-milestone priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed milestone/needs-approval priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 23, 2017
@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 23, 2017
@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 23, 2017
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 23, 2017
Copy link
Member

@luxas luxas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@cblecker
Copy link
Member

/lgtm cancel
Removing LGTM to stop retesting loop. This is not building:

W1123 17:36:02.397] # k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade
W1123 17:36:02.398] cmd/kubeadm/app/cmd/upgrade/apply.go:122:178: flags.parent.ignoreChecksErrorsSet undefined (type *cmdUpgradeFlags has no field or method ignoreChecksErrorsSet)
W1123 17:36:02.398] cmd/kubeadm/app/cmd/upgrade/plan.go:58:166: parentFlags.ignoreChecksErrorsSet undefined (type *cmdUpgradeFlags has no field or method ignoreChecksErrorsSet)

@anguslees This needs attention ASAP.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 23, 2017
@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request Current

@anguslees @cblecker @justinsb @kad @luxas @mikedanese

Note: This pull request is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Pull Request Labels
  • sig/cluster-lifecycle: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/cleanup: Adding tests, refactoring, fixing old bugs.
Help

@cblecker
Copy link
Member

/retest

1 similar comment
@anguslees
Copy link
Member Author

/retest

Copy link
Member

@luxas luxas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 24, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anguslees, jbeda, luxas

Associated issue: 539

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 58fca39 into kubernetes:master Nov 24, 2017
@xiangpengzhao
Copy link
Contributor

@anguslees since we changed the flag name to --ignore-preflight-errors in #56208, can you please change the release note in this PR?

I see the release note kubeadm health checks can also be skipped with --ignore-checks-errors in CHANGELOG-1.9.md. This will mislead users :)

@luxas
Copy link
Member

luxas commented Dec 11, 2017

@xiangpengzhao I fixed it here and in release notes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kubeadm refuses to upgrade if a node is unhealthy