Kubernetes has a rich end-to-end (e2e) testing infrastructure, which allows detailed testing of clusters and their assets. Most settings and tools for that can be found in the test-infra GitHub repository.
Kubernetes uses applications such as the web-based testgrid for monitoring the status of e2e tests. test-infra also hosts the configuration of the individual test jobs, as well as the Docker images that contain the tools to invoke those jobs.
The following folder contains all the test jobs originated by SIG Cluster Lifecycle (the SIG that maintains kubeadm): sig-cluster-lifecycle
Please note that this document only covers details on the `kubeadm*.yaml` files and only some of the parameters that these files contain.
For example, let's have a look at this file: `kubeadm-kinder.yaml`
It contains a list of jobs such as:
```yaml
- name: ci-kubernetes-e2e-kubeadm-foo
  interval: 2h
  ...
```
In this case, `ci-kubernetes-e2e-kubeadm-foo` is a test job that runs every 2 hours and it also has a set of other parameters defined for it. Such test jobs use an image that will run as a container inside a Pod of the Kubernetes Prow cluster. Prow jobs execute a tool called kinder.
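To make the shape of these job definitions more concrete, here is a simplified, hypothetical sketch of a periodic job entry; the image, checked-out repositories and command below are illustrative only, and the authoritative definitions live in `kubeadm-kinder.yaml`:

```yaml
# Simplified, hypothetical sketch of a periodic Prow job entry.
# The image, extra_refs and command are illustrative only;
# see kubeadm-kinder.yaml for the real definitions.
- name: ci-kubernetes-e2e-kubeadm-foo
  interval: 2h
  decorate: true
  extra_refs:
  # repositories checked out inside the test Pod (branch name illustrative),
  # so that kinder and its workflow files are available to the job
  - org: kubernetes
    repo: kubeadm
    base_ref: master
  spec:
    containers:
    - image: example.registry/kinder-test-image:latest  # hypothetical image
      command:
      - "<entrypoint that invokes a kinder workflow>"
  annotations:
    testgrid-dashboards: sig-cluster-lifecycle-kubeadm
    testgrid-tab-name: kubeadm-foo
```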
kubeadm uses a deployment tool called kinder. kinder is based on kind and it is used for a variety of tests, such as upgrades and skew tests.
Kinder uses test workflow files that run sequences of tasks, such as "upgrade", "run e2e conformance tests", "run e2e kubeadm tests".
More information on kinder workflows can be found here. That document must be updated every time kinder workflows are added or deleted.
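As an illustration only, a workflow file is a YAML document that lists ordered tasks; the field names and task commands below are hypothetical and may not match the actual kinder schema, which is described in the document linked above:

```yaml
# Hypothetical sketch of a kinder test workflow.
# Field names and task commands are illustrative; see the kinder
# documentation for the real workflow schema.
version: 1
summary: |
  Create a cluster with kubeadm, upgrade it and run the e2e kubeadm tests.
tasks:
- name: create-cluster
  cmd: "<command that creates a kinder cluster with an initial Kubernetes version>"
- name: upgrade
  cmd: "<command that upgrades the cluster to a newer Kubernetes version>"
- name: e2e-kubeadm
  cmd: "<command that runs the e2e kubeadm tests against the cluster>"
```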
Testgrid contains elements like dashboard groups, dashboards and tabs.
As an overview:
- SIG Cluster Lifecycle dashboards reside in a dashboard group that can be found here.
- Inside this dashboard group there is a dashboard for kubeadm.
- Inside this dashboard there are individual tabs such as `kubeadm-foo` (which is the tab name for the job `ci-kubernetes-e2e-kubeadm-foo`).
In the YAML structure of jobs such as `ci-kubernetes-e2e-kubeadm-foo`, the following list of annotations can be seen:
```yaml
annotations:
  testgrid-dashboards: sig-cluster-lifecycle-kubeadm,sig-release-master-informing
  testgrid-tab-name: kubeadm-foo
  testgrid-alert-email: [email protected]
  description: "OWNER: sig-cluster-lifecycle: some description here"
  testgrid-num-columns-recent: "20"
  testgrid-num-failures-to-alert: "4"
  testgrid-alert-stale-results-hours: "8"
```
These annotations configure the testgrid entries for this job from the job configuration itself. They contain information such as the dashboards where this job will appear, the tab name, where to send email alerts, a description, and other settings.
For more information about configuring testgrid see this page.
The script `kinder/hack/update-workflows.sh` can be used to update the workflows in the current clone of the kubeadm repository and also update the test-infra jobs in a local `kubernetes/test-infra` clone.
It requires the following environment variable to be set:
- `TEST_INFRA_KUBEADM_PATH`: passes `--path-test-infra` to the `kinder/ci/tools/update-workflows` tool.

The `update-workflows` tool uses templates to generate YAML files. The templates are located in `kinder/ci/tools/update-workflows/templates`.
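For example, assuming a local `kubernetes/test-infra` clone under `~/go/src/k8s.io/test-infra` (the path is illustrative), the script can be invoked from the kinder directory as `TEST_INFRA_KUBEADM_PATH=~/go/src/k8s.io/test-infra ./hack/update-workflows.sh`, and the regenerated workflow and job files can then be reviewed with `git diff` in both clones.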
Let's define the size of the Kubernetes support skew as `N`. For `N=3`, 3 releases will be in support.
When a new release is about to happen, the kubeadm test workflows and jobs have to be updated two times (a worked example follows this list):
- When the new `release-x.yy` branch of `kubernetes/kubernetes` is created and the `release-x.yy-*` testgrid dashboards are created:
  - Edit `kinder/hack/update-workflows.sh`:
    - The MINOR value of `KUBERNETES_VERSION` must be incremented.
    - `SKEW_SIZE` must be set to `N+1` to also include the new version.
  - Run `./hack/update-workflows.sh` from the kinder directory.
- When the oldest release (i.e. `x.yy-N`) goes out of support:
  - Edit `kinder/hack/update-workflows.sh`:
    - `SKEW_SIZE` must be set back to `N`, to exclude the oldest release.
  - Run `./hack/update-workflows.sh` from the kinder directory.
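As a worked example (the numbers are illustrative): assume `N=3` and the supported releases are `x.yy`, `x.yy-1` and `x.yy-2`. When the `release-x.yy+1` branch is created, the MINOR value of `KUBERNETES_VERSION` is bumped to `yy+1` and `SKEW_SIZE` is set to `4`, so that jobs cover `x.yy+1`, `x.yy`, `x.yy-1` and `x.yy-2`. Once `x.yy-2` goes out of support, `SKEW_SIZE` is set back to `3`.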
Different kubeadm test jobs run at different frequencies. Jobs against the latest development branch run more often.
Test job failures can trigger email alerts. This can be configured using annotations:
testgrid-alert-email: [email protected]
testgrid-num-failures-to-alert: "4"
testgrid-alert-stale-results-hours: "8"
- `testgrid-alert-email` should be set to `[email protected]` in the case of SIG Cluster Lifecycle. This is a mailing list (Google Group) that will receive the alerts.
- `testgrid-alert-stale-results-hours` means an alert will be sent in case the job is in a stale state and is not reporting new status after N hours. Usually a restart by a test-infra "on-call" operator is required in such cases.
- `testgrid-num-failures-to-alert` sets the number of failed runs ("red runs") after which an alert will be sent.
Jobs that run against the development `kubernetes/kubernetes` branch should send email alerts more often (e.g. after 8 hours), while jobs for the older branches should report less often (e.g. after 24 hours).
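For example, this policy can be reflected per job with the staleness annotation (the values below simply restate the guideline above):

```yaml
# Illustrative: a job tracking the development branch
annotations:
  testgrid-alert-stale-results-hours: "8"
---
# Illustrative: a job tracking an older release branch
annotations:
  testgrid-alert-stale-results-hours: "24"
```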
Certain test jobs maintained by SIG Cluster Lifecycle can be present in release-blocking or release-informing dashboards, such as: https://testgrid.k8s.io/sig-release-master-informing
These test jobs are of higher importance as they can block a Kubernetes release if they are failing. Such dashboards are managed by SIG Release, but SIG Cluster Lifecycle can propose changes by adding or removing test jobs, preferably near the beginning of a release cycle.
SIG Cluster Lifecycle must be responsive in case the release team tries to contact them about failing jobs.
Please note that if a job is consistently failing or flaky, it will be removed from the release dashboards by SIG Release or SIG Testing.