Concepts
The Concepts section helps you learn about the parts of the Kubernetes
system and the abstractions Kubernetes uses to represent your cluster, and
helps you obtain a deeper understanding of how Kubernetes works.
Overview
Cluster Architecture
Containers
Workloads
Storage
Configuration
Security
Policies
Cluster Administration
Extending Kubernetes
Overview
Get a high-level outline of Kubernetes and the components it is built from.
What is Kubernetes?
Kubernetes Components
The Kubernetes API lets you query and manipulate the state of objects in
Kubernetes. The core of Kubernetes' control plane is the API server and the
HTTP API that it exposes. Users, the different parts of your cluster, and
external components all communicate with one another through the API
server.
Virtualized deployment era: Virtualization allows you to run multiple Virtual Machines (VMs) on a single physical server's CPU. Each VM is a full machine running all the components, including its own operating system, on top of the virtualized hardware.
Container deployment era: Containers are similar to VMs, but they have
relaxed isolation properties to share the Operating System (OS) among the
applications. Therefore, containers are considered lightweight. Similar to a
VM, a container has its own filesystem, share of CPU, memory, process
space, and more. As they are decoupled from the underlying infrastructure,
they are portable across clouds and OS distributions.
Containers have become popular because they provide extra benefits, such as agile application creation and deployment, continuous development and integration, and environmental consistency across development, testing, and production.
That's how Kubernetes comes to the rescue! Kubernetes provides you with a
framework to run distributed systems resiliently. It takes care of scaling and
failover for your application, provides deployment patterns, and more. For
example, Kubernetes can easily manage a canary deployment for your
system.
What's next
• Take a look at the Kubernetes Components
• Ready to Get Started?
Kubernetes Components
A Kubernetes cluster consists of the components that represent the control
plane and a set of machines called nodes.
The worker node(s) host the Pods that are the components of the application
workload. The control plane manages the worker nodes and the Pods in the
cluster. In production environments, the control plane usually runs across
multiple computers and a cluster usually runs multiple nodes, providing
fault-tolerance and high availability.
This document outlines the various components you need to have a complete
and working Kubernetes cluster.
Here's the diagram of a Kubernetes cluster with all the components tied
together.
[Diagram: the components of a Kubernetes cluster: the control plane (kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and the optional cloud-controller-manager) and a set of nodes, each running a kubelet and kube-proxy.]
kube-apiserver
The API server is a component of the Kubernetes control plane that exposes
the Kubernetes API. The API server is the front end for the Kubernetes
control plane.
etcd
Consistent and highly-available key value store used as Kubernetes' backing
store for all cluster data.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a back-up plan for that data.
You can find in-depth information about etcd in the official documentation.
kube-scheduler
Control plane component that watches for newly created Pods with no
assigned node, and selects a node for them to run on.
Factors taken into account for scheduling decisions include: individual and
collective resource requirements, hardware/software/policy constraints,
affinity and anti-affinity specifications, data locality, inter-workload
interference, and deadlines.
kube-controller-manager
Control Plane component that runs controller processes.
cloud-controller-manager
A Kubernetes control plane component that embeds cloud-specific control
logic. The cloud controller manager lets you link your cluster into your cloud
provider's API, and separates out the components that interact with that
cloud platform from components that just interact with your cluster.
Node Components
kubelet
An agent that runs on each node in the cluster. It makes sure that containers
are running in a Pod.
The kubelet takes a set of PodSpecs that are provided through various
mechanisms and ensures that the containers described in those PodSpecs
are running and healthy. The kubelet doesn't manage containers which were
not created by Kubernetes.
kube-proxy
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept. It maintains network rules on nodes; these rules allow network communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system packet filtering layer if there is one and it's available. Otherwise, kube-proxy forwards the traffic itself.
Container runtime
The container runtime is the software that is responsible for running containers.
Addons
Addons use Kubernetes resources (DaemonSet, Deployment, etc) to implement cluster features. Because these provide cluster-level features, namespaced resources for addons belong within the kube-system namespace.
DNS
While the other addons are not strictly required, all Kubernetes clusters
should have cluster DNS, as many examples rely on it.
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your
environment, which serves DNS records for Kubernetes services.
Web UI (Dashboard)
Dashboard is a general purpose, web-based UI for Kubernetes clusters. It allows users to manage and troubleshoot applications running in the cluster, as well as the cluster itself.
Cluster-level Logging
A cluster-level logging mechanism is responsible for saving container logs to a central log store with a search/browsing interface.
What's next
The Kubernetes API
The core of Kubernetes' control plane is the API server. The API server exposes an HTTP API that lets end users, different parts of your cluster, and external components communicate with one another.
The Kubernetes API lets you query and manipulate the state of API objects
in Kubernetes (for example: Pods, Namespaces, ConfigMaps, and Events).
Consider using one of the client libraries if you are writing an application
using the Kubernetes API.
OpenAPI specification
Complete API details are documented using OpenAPI.
The Kubernetes API server serves an OpenAPI spec via the /openapi/v2
endpoint. You can request the response format using request headers as
follows:
Valid request header values for OpenAPI v2 queries:
Accept-Encoding: gzip (not supplying this header is also acceptable)
Accept: application/com.github.proto-openapi.spec.v2@v1.0+protobuf (mainly for intra-cluster use)
Accept: application/json (default)
Accept: * (serves application/json)
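For instance, one way to fetch the spec as JSON (a sketch, assuming kubectl is configured for your cluster and local port 8080 is free):
# Proxy the API server to localhost using your kubeconfig credentials
kubectl proxy --port=8080 &
# Request the OpenAPI v2 document in JSON format
curl -H "Accept: application/json" http://localhost:8080/openapi/v2 | head -c 300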
Persistence
Kubernetes stores the serialized state of objects by writing them into etcd.
API groups and versioning
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports multiple API versions, each at a different API path, such as /api/v1 or /apis/rbac.authorization.k8s.io/v1alpha1.
Versioning is done at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-life and/or experimental APIs.
For example, suppose there are two API versions, v1 and v1beta1, for the same resource. If you originally created an object using the v1beta1 version of its API, you can later read, update, or delete that object using either the v1beta1 or the v1 API version.
API changes
Any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, the Kubernetes API is designed to continuously change and grow. The Kubernetes project aims to not break compatibility with existing clients, and to maintain that compatibility for a length of time so that other projects have an opportunity to adapt.
In general, new API resources and new resource fields can be added frequently. Elimination of resources or fields requires following the API deprecation policy.
Refer to API versions reference for more details on the API version level
definitions.
API Extension
The Kubernetes API can be extended in one of two ways:
1. Custom resources let you declaratively define how the API server
should provide your chosen resource API.
2. You can also extend the Kubernetes API by implementing an
aggregation layer.
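As an illustration of the first approach, here is a minimal sketch of a CustomResourceDefinition; the group, kind, and schema fields below are invented for this example and are not part of any real API:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must match <plural>.<group>
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronSpec:
                  type: string
                replicas:
                  type: integer
Once such a CustomResourceDefinition is created, the API server begins serving the new resource at a path such as /apis/stable.example.com/v1/namespaces/*/crontabs.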
What's next
• Learn how to extend the Kubernetes API by adding your own
CustomResourceDefinition.
• Controlling Access To The Kubernetes API describes how the cluster
manages authentication and authorization for API access.
• Learn about API endpoints, resource types and samples by reading API
Reference.
• Learn about what constitutes a compatible change, and how to change
the API, from API changes.
Understanding Kubernetes
Objects
This page explains how Kubernetes objects are represented in the
Kubernetes API, and how you can express them in .yaml format.
The status describes the current state of the object, supplied and updated
by the Kubernetes system and its components. The Kubernetes control plane
continually and actively manages every object's actual state to match the
desired state you supplied.
For more information on the object spec, status, and metadata, see the
Kubernetes API Conventions.
Here's an example .yaml file that shows the required fields and object spec
for a Kubernetes Deployment:
application/deployment.yaml
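A minimal sketch of such a Deployment (the name, label, and image values below are illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2   # tells the Deployment to run 2 Pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80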
One way to create a Deployment using a .yaml file like the one above is to
use the kubectl apply command in the kubectl command-line interface,
passing the .yaml file as an argument. Here's an example:
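Assuming the manifest above is saved locally as application/deployment.yaml, the command would look something like:
kubectl apply -f application/deployment.yaml
The output is similar to this: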
deployment.apps/nginx-deployment created
Required Fields
In the .yaml file for the Kubernetes object you want to create, you'll need to set values for the following fields:
• apiVersion - which version of the Kubernetes API you're using to create this object
• kind - what kind of object you want to create
• metadata - data that helps uniquely identify the object, including a name string, UID, and optional namespace
• spec - what state you desire for the object
The precise format of the object spec is different for every Kubernetes
object, and contains nested fields specific to that object. The Kubernetes API
Reference can help you find the spec format for all of the objects you can
create using Kubernetes. For example, the spec format for a Pod can be
found in PodSpec v1 core, and the spec format for a Deployment can be
found in DeploymentSpec v1 apps.
What's next
• Learn about the most important basic Kubernetes objects, such as Pod.
• Learn about controllers in Kubernetes.
• Using the Kubernetes API explains some more API concepts.
Kubernetes Object Management
The kubectl command-line tool supports several different ways to create and manage Kubernetes objects.
Management techniques
Warning: A Kubernetes object should be managed using only one
technique. Mixing and matching techniques for the same object
results in undefined behavior.
Imperative commands
When using imperative commands, a user operates directly on live objects in
a cluster. The user provides operations to the kubectl command as
arguments or flags.
This is the simplest way to get started or to run a one-off task in a cluster.
Because this technique operates directly on live objects, it provides no
history of previous configurations.
Examples
Run an instance of the nginx container by creating a Deployment object:
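A sketch of such a command (nginx here is just an example image):
kubectl create deployment nginx --image nginx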
Trade-offs
Advantages compared to object configuration:
Imperative object configuration
In imperative object configuration, the kubectl command specifies the operation (create, replace, etc.), optional flags, and at least one file name. The file specified must contain a full definition of the object in YAML or JSON format.
Trade-offs
Advantages compared to imperative commands:
Declarative object configuration
When using declarative object configuration, a user operates on object configuration files stored locally; the user does not define the operations to be taken on the files. Create, update, and delete operations are automatically detected per-object by kubectl apply.
Examples
Process all object configuration files in the configs directory, and create or
patch the live objects. You can first diff to see what changes are going to be
made, and then apply:
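A sketch of those two commands, assuming the manifests live in a local configs/ directory:
# Preview the changes that would be made to live objects
kubectl diff -f configs/
# Apply (create or patch) the objects
kubectl apply -f configs/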
Trade-offs
Advantages compared to imperative object configuration:
• Changes made directly to live objects are retained, even if they are not
merged back into the configuration files.
• Declarative object configuration has better support for operating on
directories and automatically detecting operation types (create, patch,
delete) per-object.
What's next
• Managing Kubernetes Objects Using Imperative Commands
• Managing Kubernetes Objects Using Object Configuration (Imperative)
• Managing Kubernetes Objects Using Object Configuration (Declarative)
• Managing Kubernetes Objects Using Kustomize (Declarative)
• Kubectl Command Reference
• Kubectl Book
• Kubernetes API Reference
Object Names and IDs
Each object in your cluster has a Name that is unique for that type of resource. Every Kubernetes object also has a UID that is unique across your whole cluster.
For example, you can only have one Pod named myapp-1234 within the same namespace, but you can have one Pod and one Deployment that are each named myapp-1234.
Names
A client-provided string that refers to an object in a resource URL, such as /
api/v1/pods/some-name.
Only one object of a given kind can have a given name at a time. However, if
you delete the object, you can make a new object with the same name.
Below are three types of commonly used name constraints for resources: DNS Subdomain Names, DNS Label Names, and Path Segment Names. For example, here's the manifest for a Pod named nginx-demo:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-demo
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
UIDs
A Kubernetes system-generated string used to uniquely identify objects.
Every object created over the whole lifetime of a Kubernetes cluster has a
distinct UID. It is intended to distinguish between historical occurrences of
similar entities.
What's next
• Read about labels in Kubernetes.
• See the Identifiers and Names in Kubernetes design document.
Namespaces
Kubernetes supports multiple virtual clusters backed by the same physical
cluster. These virtual clusters are called namespaces.
Viewing namespaces
You can list the current namespaces in a cluster using:
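A sketch of that command and its typical output (the exact namespaces and ages will differ in your cluster):
kubectl get namespace
NAME              STATUS   AGE
default           Active   1d
kube-node-lease   Active   1d
kube-public       Active   1d
kube-system       Active   1d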
Not all objects are in a namespace. Most Kubernetes resources (for example, Pods, Services, and Deployments) are in a namespace, but namespace resources themselves and low-level resources such as Nodes and PersistentVolumes are not. To see which Kubernetes resources are and aren't in a namespace:
# In a namespace
kubectl api-resources --namespaced=true
# Not in a namespace
kubectl api-resources --namespaced=false
What's next
• Learn more about creating a new namespace.
• Learn more about deleting a namespace.
"metadata": {
"labels": {
"key1" : "value1",
"key2" : "value2"
}
}
Labels allow for efficient queries and watches and are ideal for use in UIs
and CLIs. Non-identifying information should be recorded using annotations.
Motivation
Labels enable users to map their own organizational structures onto system
objects in a loosely coupled fashion, without requiring clients to store these
mappings.
Example labels:
• "release" : "stable", "release" : "canary"
• "environment" : "dev", "environment" : "qa", "environment" : "production"
• "tier" : "frontend", "tier" : "backend", "tier" : "cache"
• "partition" : "customerA", "partition" : "customerB"
• "track" : "daily", "track" : "weekly"
These are just examples of commonly used labels; you are free to develop your own conventions. Keep in mind that label keys must be unique for a given object.
Syntax and character set
Labels are key/value pairs. Valid label keys have two segments: an optional
prefix and name, separated by a slash (/). The name segment is required
and must be 63 characters or less, beginning and ending with an
alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_),
dots (.), and alphanumerics between. The prefix is optional. If specified, the
prefix must be a DNS subdomain: a series of DNS labels separated by dots
(.), not longer than 253 characters in total, followed by a slash (/).
If the prefix is omitted, the label Key is presumed to be private to the user.
Automated system components (e.g. kube-scheduler, kube-controller-
manager, kube-apiserver, kubectl, or other third-party automation) which
add labels to end-user objects must specify a prefix.
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core
components.
Valid label values must be 63 characters or less and must be empty or begin
and end with an alphanumeric character ([a-z0-9A-Z]) with dashes (-),
underscores (_), dots (.), and alphanumerics between.
For example, here's the configuration file for a Pod that has two labels environment: production and app: nginx:
apiVersion: v1
kind: Pod
metadata:
  name: label-demo
  labels:
    environment: production
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
Label selectors
Unlike names and UIDs, labels do not provide uniqueness. In general, we
expect many objects to carry the same label(s).
Via a label selector, the client/user can identify a set of objects. The label
selector is the core grouping primitive in Kubernetes.
The API currently supports two types of selectors: equality-based and set-
based. A label selector can be made of multiple requirements which are
comma-separated. In the case of multiple requirements, all must be satisfied
so the comma separator acts as a logical AND (&&) operator.
Note: For some API types, such as ReplicaSets, the label selectors
of two instances must not overlap within a namespace, or the
controller can see that as conflicting instructions and fail to
determine how many replicas should be present.
Equality-based requirement
Equality- or inequality-based requirements allow filtering by label keys and
values. Matching objects must satisfy all of the specified label constraints,
though they may have additional labels as well. Three kinds of operators are admitted: =, ==, and !=. The first two represent equality (and are simply synonyms), while the latter represents inequality. For example:
environment = production
tier != frontend
The former selects all resources with key equal to environment and value equal to production. The latter selects all resources with key equal to tier and value distinct from frontend, and all resources with no labels with the tier key. One could filter for resources in production excluding frontend using the comma operator: environment=production,tier!=frontend
One usage scenario for equality-based label requirements is for Pods to specify node selection criteria. For example, the sample Pod below selects nodes with the label accelerator=nvidia-tesla-p100:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
    - name: cuda-test
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100
Set-based requirement
Set-based label requirements allow filtering keys according to a set of values. Three kinds of operators are supported: in, notin and exists (only the key identifier). For example:
environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
• The first example selects all resources with key equal to environment and value equal to production or qa.
• The second example selects all resources with key equal to tier and values other than frontend and backend, and all resources with no labels with the tier key.
• The third example selects all resources including a label with key partition; no values are checked.
• The fourth example selects all resources without a label with key partition; no values are checked.
API
LIST and WATCH filtering
LIST and WATCH operations may specify label selectors to filter the sets of
objects returned using a query parameter. Both requirements are permitted
(presented here as they would appear in a URL query string):
• equality-based requirements: ?labelSelector=environment%3Dproduction,tier%3Dfrontend
• set-based requirements: ?labelSelector=environment+in+%28production%2Cqa%29%2Ctier+in+%28frontend%29
Both label selector styles can be used to list or watch resources via a REST client. For example, targeting the apiserver with kubectl and using equality-based requirements, one may write:
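A sketch of such commands:
# equality-based selector
kubectl get pods -l environment=production,tier=frontend
# set-based selector (note the quoting needed by the shell)
kubectl get pods -l 'environment in (production),tier in (frontend)'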
Set references in API objects
Some Kubernetes objects, such as services and replicationcontrollers, also use label selectors to specify sets of other resources, such as pods. The set of pods that a service targets is defined with a label selector. Similarly, the population of pods that a replicationcontroller should manage is also defined with a label selector.
Label selectors for both objects are defined in json or yaml files using maps, and only equality-based requirement selectors are supported:
"selector": {
"component" : "redis",
}
or
selector:
  component: redis
Newer resources, such as Job, Deployment, ReplicaSet, and DaemonSet, support set-based requirements as well:
selector:
  matchLabels:
    component: redis
  matchExpressions:
    - {key: tier, operator: In, values: [cache]}
    - {key: environment, operator: NotIn, values: [dev]}
One use case for selecting over labels is to constrain the set of nodes onto
which a pod can schedule. See the documentation on node selection for
more information.
Annotations
You can use Kubernetes annotations to attach arbitrary non-identifying
metadata to objects. Clients such as tools and libraries can retrieve this
metadata.
"metadata": {
"annotations": {
"key1" : "value1",
"key2" : "value2"
}
}
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core
components.
For example, here's the configuration file for a Pod that has the annotation imageregistry: https://hub.docker.com/ :
apiVersion: v1
kind: Pod
metadata:
  name: annotations-demo
  annotations:
    imageregistry: "https://hub.docker.com/"
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
What's next
Learn more about Labels and Selectors.
Field Selectors
Field selectors let you select Kubernetes resources based on the value of one or more resource fields. Here are some examples of field selector queries:
• metadata.name=my-service
• metadata.namespace!=default
• status.phase=Pending
This kubectl command selects all Pods for which the value of the status.phase field is Running:
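A sketch of that command:
kubectl get pods --field-selector status.phase=Running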
Supported fields
Supported field selectors vary by Kubernetes resource type. All resource
types support the metadata.name and metadata.namespace fields. Using
unsupported field selectors produces an error. For example:
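A sketch of such a query; the field name foo.bar is deliberately bogus, and the API server rejects it with an error explaining that it is not a known field selector:
kubectl get ingress --field-selector foo.bar=baz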
Supported operators
You can use the =, ==, and != operators with field selectors (= and == mean
the same thing). This kubectl command, for example, selects all Kubernetes
Services that aren't in the default namespace:
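A sketch of that command:
kubectl get services --all-namespaces --field-selector metadata.namespace!=default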
Chained selectors
As with label and other selectors, field selectors can be chained together as
a comma-separated list. This kubectl command selects all Pods for which
the status.phase does not equal Running and the spec.restartPolicy
field equals Always:
kubectl get pods --field-selector=status.phase!=Running,spec.restartPolicy=Always
Recommended Labels
You can visualize and manage Kubernetes objects with more tools than
kubectl and the dashboard. A common set of labels allows tools to work
interoperably, describing objects in a common manner that all tools can
understand.
Labels
In order to take full advantage of using these labels, they should be applied on every resource object. The shared labels are:
• app.kubernetes.io/name - the name of the application
• app.kubernetes.io/instance - a unique name identifying the instance of an application
• app.kubernetes.io/version - the current version of the application
• app.kubernetes.io/component - the component within the architecture
• app.kubernetes.io/part-of - the name of a higher level application this one is part of
• app.kubernetes.io/managed-by - the tool being used to manage the operation of an application
To illustrate these labels in action, consider the following StatefulSet object:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
The name of an application and the instance name are recorded separately.
For example, WordPress has an app.kubernetes.io/name of wordpress while
it has an instance name, represented as app.kubernetes.io/instance with
a value of wordpress-abcxzy. This enables the application and instance of
the application to be identifiable. Every instance of an application must have
a unique name.
Examples
To illustrate different ways to use these labels the following examples have
varying complexity.
A Simple Stateless Service
Consider the case for a simple stateless service deployed using Deployment and Service objects. The following two snippets represent how the labels could be used in their simplest form.
The Deployment is used to oversee the pods running the application itself.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: myservice
    app.kubernetes.io/instance: myservice-abcxzy
...
The Service is used to expose the application:
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: myservice
    app.kubernetes.io/instance: myservice-abcxzy
...
Web Application With A Database
Consider a slightly more complicated application: a web application (WordPress) using a database (MySQL), installed using Helm. The following snippets illustrate the start of objects used to deploy this application.
The start of the following Deployment is used for WordPress:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: wordpress
    app.kubernetes.io/instance: wordpress-abcxzy
    app.kubernetes.io/version: "4.9.4"
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/component: server
    app.kubernetes.io/part-of: wordpress
...
MySQL is exposed as a StatefulSet with metadata for both it and the larger application it belongs to:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
...
The Service is used to expose MySQL as part of WordPress:
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
...
With the MySQL StatefulSet and Service you'll notice that information about both MySQL and WordPress, the broader application, is included.
Cluster Architecture
The architectural concepts behind Kubernetes.
Nodes
Controllers
Nodes
Kubernetes runs your workload by placing containers into Pods to run on
Nodes. A node may be a virtual or physical machine, depending on the
cluster. Each node contains the services necessary to run Pods, managed by
the control plane.
The components on a node include the kubelet, a container runtime, and the
kube-proxy.
Management
There are two main ways to have Nodes added to the API server:
1. The kubelet on a node self-registers to the control plane.
2. You (or another human user) manually add a Node object.
After you create a Node object, or the kubelet on a node self-registers, the
control plane checks whether the new Node object is valid. For example, if
you try to create a Node from the following JSON manifest:
{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {
    "name": "10.240.79.157",
    "labels": {
      "name": "my-first-k8s-node"
    }
  }
}
Kubernetes creates a Node object internally (the representation), and checks that a kubelet has registered to the API server that matches the metadata.name field of the Node. If the node is healthy (that is, all necessary services are running), it is eligible to run a Pod; otherwise, that node is ignored for any cluster activity until it becomes healthy.
Note:
Kubernetes keeps the object for the invalid Node and continues checking to see whether it becomes healthy.
Self-registration of Nodes
When the kubelet flag --register-node is true (the default), the kubelet
will attempt to register itself with the API server. This is the preferred
pattern, used by most distros.
When you want to create Node objects manually, set the kubelet flag --
register-node=false.
You can use labels on Nodes in conjunction with node selectors on Pods to
control scheduling. For example, you can constrain a Pod to only be eligible
to run on a subset of the available nodes.
Node status
A Node's status contains the following information:
• Addresses
• Conditions
• Capacity and Allocatable
• Info
You can use kubectl to view a Node's status and other details:
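A sketch of that command (substitute a real node name from your cluster):
kubectl describe node <node-name>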
Conditions
The conditions field describes the status of all Running nodes. Examples of conditions include Ready, DiskPressure, MemoryPressure, PIDPressure, and NetworkUnavailable. The node condition is represented as a JSON object. For example, the following structure describes a healthy node:
"conditions": [
{
"type": "Ready",
"status": "True",
"reason": "KubeletReady",
"message": "kubelet is posting ready status",
"lastHeartbeatTime": "2019-06-05T18:38:35Z",
"lastTransitionTime": "2019-06-05T11:41:27Z"
}
]
If the Status of the Ready condition remains Unknown or False for longer
than the pod-eviction-timeout (an argument passed to the kube-
controller-manager), all the Pods on the node are scheduled for deletion by
the node controller. The default eviction timeout duration is five minutes.
In some cases when the node is unreachable, the API server is unable to
communicate with the kubelet on the node. The decision to delete the pods
cannot be communicated to the kubelet until communication with the API
server is re-established. In the meantime, the pods that are scheduled for
deletion may continue to run on the partitioned node.
The node controller does not force delete pods until it is confirmed that they
have stopped running in the cluster. You can see the pods that might be
running on an unreachable node as being in the Terminating or Unknown
state. In cases where Kubernetes cannot deduce from the underlying
infrastructure if a node has permanently left a cluster, the cluster
administrator may need to delete the node object by hand. Deleting the node
object from Kubernetes causes all the Pod objects running on the node to be
deleted from the API server, and frees up their names.
Capacity and Allocatable
The fields in the capacity block indicate the total amount of resources that a
Node has. The allocatable block indicates the amount of resources on a
Node that is available to be consumed by normal Pods.
You may read more about capacity and allocatable resources while learning
how to reserve compute resources on a Node.
Info
Describes general information about the node, such as kernel version,
Kubernetes version (kubelet and kube-proxy version), Docker version (if
used), and OS name. This information is gathered by Kubelet from the node.
Node controller
The node controller is a Kubernetes control plane component that manages
various aspects of nodes.
The node controller has multiple roles in a node's life. The first is assigning
a CIDR block to the node when it is registered (if CIDR assignment is turned
on).
The second is keeping the node controller's internal list of nodes up to date
with the cloud provider's list of available machines. When running in a cloud
environment, whenever a node is unhealthy, the node controller asks the
cloud provider if the VM for that node is still available. If not, the node
controller deletes the node from its list of nodes.
The third is monitoring the nodes' health. The node controller is responsible
for updating the NodeReady condition of NodeStatus to ConditionUnknown
when a node becomes unreachable (i.e. the node controller stops receiving
heartbeats for some reason, for example due to the node being down), and
then later evicting all the pods from the node (using graceful termination) if
the node continues to be unreachable. (The default timeouts are 40s to start
reporting ConditionUnknown and 5m after that to start evicting pods.) The
node controller checks the state of each node every --node-monitor-period
seconds.
Heartbeats
There are two forms of heartbeats: updates of NodeStatus and the Lease
object. Each Node has an associated Lease object in the kube-node-lease
namespace. Lease is a lightweight resource, which improves the
performance of the node heartbeats as the cluster scales.
The kubelet is responsible for creating and updating the NodeStatus and a
Lease object.
Reliability
In most cases, the node controller limits the eviction rate to --node-
eviction-rate (default 0.1) per second, meaning it won't evict pods from
more than 1 node per 10 seconds.
The node eviction behavior changes when a node in a given availability zone
becomes unhealthy. The node controller checks what percentage of nodes in
the zone are unhealthy (NodeReady condition is ConditionUnknown or
ConditionFalse) at the same time. If the fraction of unhealthy nodes is at
least --unhealthy-zone-threshold (default 0.55) then the eviction rate is
reduced: if the cluster is small (i.e. has less than or equal to --large-
cluster-size-threshold nodes - default 50) then evictions are stopped,
otherwise the eviction rate is reduced to --secondary-node-eviction-rate
(default 0.01) per second. The reason these policies are implemented per
availability zone is because one availability zone might become partitioned
from the master while the others remain connected. If your cluster does not
span multiple cloud provider availability zones, then there is only one
availability zone (the whole cluster).
A key reason for spreading your nodes across availability zones is so that the
workload can be shifted to healthy zones when one entire zone goes down.
Therefore, if all nodes in a zone are unhealthy then the node controller
evicts at the normal rate of --node-eviction-rate. The corner case is when
all zones are completely unhealthy (i.e. there are no healthy nodes in the
cluster). In such a case, the node controller assumes that there's some
problem with master connectivity and stops all evictions until some
connectivity is restored.
The node controller is also responsible for evicting pods running on nodes
with NoExecute taints, unless those pods tolerate that taint. The node
controller also adds taints corresponding to node problems like node
unreachable or not ready. This means that the scheduler won't place Pods
onto unhealthy nodes.
Node capacity
Node objects track information about the Node's resource capacity (for
example: the amount of memory available, and the number of CPUs). Nodes
that self register report their capacity during registration. If you manually
add a Node, then you need to set the node's capacity information when you
add it.
The Kubernetes scheduler ensures that there are enough resources for all
the Pods on a Node. The scheduler checks that the sum of the requests of
containers on the node is no greater than the node's capacity. That sum of
requests includes all containers managed by the kubelet, but excludes any
containers started directly by the container runtime, and also excludes any
processes running outside of the kubelet's control.
Node topology
FEATURE STATE: Kubernetes v1.16 [alpha]
If you have enabled the TopologyManager feature gate, then the kubelet can
use topology hints when making resource assignment decisions. See Control
Topology Management Policies on a Node for more information.
Graceful Node Shutdown
The kubelet attempts to detect node system shutdown and terminates pods running on the node (this requires the GracefulNodeShutdown feature gate). During a graceful shutdown, the kubelet terminates regular pods first and critical pods second. Graceful node shutdown is configured with two KubeletConfiguration options:
• ShutdownGracePeriod:
◦ Specifies the total duration that the node should delay the
shutdown by. This is the total grace period for pod termination for
both regular and critical pods.
• ShutdownGracePeriodCriticalPods:
◦ Specifies the duration used to terminate critical pods during a
node shutdown. This should be less than ShutdownGracePeriod.
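A minimal sketch of those two options in a kubelet configuration file, assuming a kubelet version that supports graceful node shutdown (the durations are illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# total shutdown delay granted to the node
shutdownGracePeriod: 30s
# portion of the delay reserved for critical pods; must be less than shutdownGracePeriod
shutdownGracePeriodCriticalPods: 10s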
What's next
• Learn about the components that make up a node.
• Read the API definition for Node.
• Read the Node section of the architecture design document.
• Read about taints and tolerations.
Control Plane-Node
Communication
This document catalogs the communication paths between the control plane
(really the apiserver) and the Kubernetes cluster. The intent is to allow users
to customize their installation to harden the network configuration such that
the cluster can be run on an untrusted network (or on fully public IPs on a
cloud provider).
Nodes should be provisioned with the public root certificate for the cluster
such that they can connect securely to the apiserver along with valid client
credentials. A good approach is that the client credentials provided to the
kubelet are in the form of a client certificate. See kubelet TLS bootstrapping
for automated provisioning of kubelet client certificates.
The control plane components also communicate with the cluster apiserver
over the secure port.
As a result, the default operating mode for connections from the nodes and
pods running on the nodes to the control plane is secured by default and can
run over untrusted and/or public networks.
apiserver to kubelet
The connections from the apiserver to the kubelet are used for:
• Fetching logs for pods.
• Attaching (through kubectl) to running pods.
• Providing the kubelet's port-forwarding functionality.
These connections terminate at the kubelet's HTTPS endpoint. By default, the apiserver does not verify the kubelet's serving certificate, which makes the connection subject to man-in-the-middle attacks and unsafe to run over untrusted and/or public networks. To verify this connection, use the --kubelet-certificate-authority flag to supply the apiserver with a root certificate bundle to use to verify the kubelet's serving certificate.
If that is not possible, use SSH tunneling between the apiserver and kubelet if required to avoid connecting over an untrusted or public network.
SSH tunnels
Kubernetes supports SSH tunnels to protect the control plane to nodes
communication paths. In this configuration, the apiserver initiates an SSH
tunnel to each node in the cluster (connecting to the ssh server listening on
port 22) and passes all traffic destined for a kubelet, node, pod, or service
through the tunnel. This tunnel ensures that the traffic is not exposed
outside of the network in which the nodes are running.
SSH tunnels are currently deprecated so you shouldn't opt to use them
unless you know what you are doing. The Konnectivity service is a
replacement for this communication channel.
Konnectivity service
FEATURE STATE: Kubernetes v1.18 [beta]
As a replacement to the SSH tunnels, the Konnectivity service provides a TCP level proxy for control plane to cluster communication. The Konnectivity server runs in the control plane network, Konnectivity agents run in the nodes network, and all control plane to nodes traffic goes through the connections that the agents initiate and maintain.
Controllers
In robotics and automation, a control loop is a non-terminating loop that regulates the state of a system. One example of a control loop is a thermostat in a room.
When you set the temperature, that's telling the thermostat about your desired state. The actual room temperature is the current state. The thermostat acts to bring the current state closer to the desired state, by turning equipment on or off.
In Kubernetes, controllers are control loops that watch the state of your
cluster, then make or request changes where needed. Each controller tries
to move the current cluster state closer to the desired state.
Controller pattern
A controller tracks at least one Kubernetes resource type. These objects have a spec field that represents the desired state. The controller(s) for that resource are responsible for making the current state come closer to that desired state.
The controller might carry the action out itself; more commonly, in Kubernetes, a controller will send messages to the API server that have useful side effects. You'll see examples of this below.
Control via API server
The Job controller is an example of a Kubernetes built-in controller. Built-in controllers manage state by interacting with the cluster API server. Job is a Kubernetes resource that runs a Pod, or perhaps several Pods, to carry out a task and then stop. (Once scheduled, Pod objects become part of the desired state for a kubelet.)
When the Job controller sees a new task it makes sure that, somewhere in
your cluster, the kubelets on a set of Nodes are running the right number of
Pods to get the work done. The Job controller does not run any Pods or
containers itself. Instead, the Job controller tells the API server to create or
remove Pods. Other components in the control plane act on the new
information (there are new Pods to schedule and run), and eventually the
work is done.
After you create a new Job, the desired state is for that Job to be completed.
The Job controller makes the current state for that Job be nearer to your
desired state: creating Pods that do the work you wanted for that Job, so
that the Job is closer to completion.
Controllers also update the objects that configure them. For example: once
the work is done for a Job, the Job controller updates that Job object to mark
it Finished.
(This is a bit like how some thermostats turn a light off to indicate that your
room is now at the temperature you set).
Direct control
In contrast with Job, some controllers need to make changes to things outside of your cluster.
For example, if you use a control loop to make sure there are enough Nodes
in your cluster, then that controller needs something outside the current
cluster to set up new Nodes when needed.
Controllers that interact with external state find their desired state from the
API server, then communicate directly with an external system to bring the
current state closer in line.
The important point here is that the controller makes some change to bring
about your desired state, and then reports current state back to your
cluster's API server. Other control loops can observe that reported data and
take their own actions.
Desired versus current state
Kubernetes takes a cloud-native view of systems, and is able to handle constant change.
Your cluster could be changing at any point as work happens and control
loops automatically fix failures. This means that, potentially, your cluster
never reaches a stable state.
As long as the controllers for your cluster are running and able to make
useful changes, it doesn't matter if the overall state is or is not stable.
Design
It's useful to have simple controllers rather than one, monolithic set of
control loops that are interlinked. Controllers can fail, so Kubernetes is
designed to allow for that.
Note:
For example, you can have Deployments and Jobs; these both
create Pods. The Job controller does not delete the Pods that your
Deployment created, because there is information (labels) the
controllers can use to tell those Pods apart.
Ways of running controllers
Kubernetes comes with a set of built-in controllers that run inside the kube-
controller-manager. These built-in controllers provide important core
behaviors.
The Deployment controller and Job controller are examples of controllers
that come as part of Kubernetes itself ("built-in" controllers). Kubernetes
lets you run a resilient control plane, so that if any of the built-in controllers
were to fail, another part of the control plane will take over the work.
You can find controllers that run outside the control plane, to extend
Kubernetes. Or, if you want, you can write a new controller yourself. You can
run your own controller as a set of Pods, or externally to Kubernetes. What
fits best will depend on what that particular controller does.
What's next
Cloud Controller Manager
The cloud controller manager is a Kubernetes control plane component that embeds cloud-specific control logic. It lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
Design
[Diagram: a Kubernetes cluster, showing the control plane (kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and the optional cloud-controller-manager) and the nodes, each running a kubelet.]
The cloud controller manager runs in the control plane as a replicated set of
processes (usually, these are containers in Pods). Each cloud-controller-
manager implements multiple controllers in a single process.
Cloud controller manager functions
The controllers inside the cloud controller manager include the following.
Node controller
The node controller is responsible for updating Node objects when new servers are created in your cloud infrastructure. It performs the following functions:
1. Initialize a Node object for each server that the controller discovers through the cloud provider API.
2. Annotate and label the Node object with cloud-specific information, such as the region the node is deployed into and the resources (CPU, memory, etc) that it has available.
3. Obtain the node's hostname and network addresses.
4. Verify the node's health. In case a node becomes unresponsive, this controller checks with your cloud provider's API to see if the server has been deactivated / deleted / terminated. If the node has been deleted from the cloud, the controller deletes the Node object from your Kubernetes cluster.
Some cloud provider implementations split this into a node controller and a
separate node lifecycle controller.
Route controller
The route controller is responsible for configuring routes in the cloud
appropriately so that containers on different nodes in your Kubernetes
cluster can communicate with each other.
Depending on the cloud provider, the route controller might also allocate
blocks of IP addresses for the Pod network.
Service controller
Services integrate with cloud infrastructure components such as managed
load balancers, IP addresses, network packet filtering, and target health
checking. The service controller interacts with your cloud provider's APIs to
set up load balancers and other infrastructure components when you
declare a Service resource that requires them.
Authorization
This section breaks down the access that the cloud controller manager requires on various API objects, in order to perform its operations.
Node controller
The Node controller only works with Node objects. It requires full access to
read and modify Node objects.
v1/Node:
• Get
• List
• Create
• Update
• Patch
• Watch
• Delete
Route controller
The route controller listens to Node object creation and configures routes
appropriately. It requires Get access to Node objects.
v1/Node:
• Get
Service controller
The service controller listens to Service object Create, Update and Delete
events and then configures Endpoints for those Services appropriately.
v1/Service:
• List
• Get
• Watch
• Patch
• Update
Others
The implementation of the core of the cloud controller manager requires
access to create Event objects, and to ensure secure operation, it requires
access to create ServiceAccounts.
v1/Event:
• Create
• Patch
• Update
v1/ServiceAccount:
• Create
The RBAC ClusterRole for the cloud controller manager looks like:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloud-controller-manager
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - persistentvolumes
  verbs:
  - get
  - list
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - create
  - get
  - list
  - watch
  - update
What's next
Cloud Controller Manager Administration has instructions on running and
managing the cloud controller manager.
Containers
Technology for packaging an application along with its runtime
dependencies.
Each container that you run is repeatable; the standardization from having
dependencies included means that you get the same behavior wherever you
run it.
Container images
A container image is a ready-to-run software package, containing everything
needed to run an application: the code and any runtime it requires,
application and system libraries, and default values for any essential
settings.
Container runtimes
The container runtime is the software that is responsible for running
containers.
What's next
• Read about container images
• Read about Pods
Images
A container image represents binary data that encapsulates an application
and all its software dependencies. Container images are executable software
bundles that can run standalone and that make very well defined
assumptions about their runtime environment.
Image names
Container images are usually given a name such as pause, example/mycontainer, or kube-apiserver. Images can also include a registry hostname; for example: fictional.registry.example/imagename, and possibly a port number as well; for example: fictional.registry.example:10443/imagename.
If you don't specify a registry hostname, Kubernetes assumes that you mean
the Docker public registry.
After the image name part you can add a tag (in the same way you would when using commands such as docker or podman). Tags let you identify different versions of the same series of images.
Caution:
You should avoid using the latest tag when deploying containers
in production, as it is harder to track which version of the image is
running and more difficult to roll back to a working version.
Updating images
The default pull policy is IfNotPresent which causes the kubelet to skip
pulling an image if it already exists. If you would like to always force a pull,
you can do one of the following:
Using a private registry
Configuring nodes to authenticate to a private registry
Docker stores keys for private registries in the $HOME/.dockercfg or $HOME/.docker/config.json file. If you put the same file in the search paths list below, kubelet uses it as the credential provider when pulling images:
• {--root-dir:-/var/lib/kubelet}/config.json
• {cwd of kubelet}/config.json
• ${HOME}/.docker/config.json
• /.docker/config.json
• {--root-dir:-/var/lib/kubelet}/.dockercfg
• {cwd of kubelet}/.dockercfg
• ${HOME}/.dockercfg
• /.dockercfg
Here are the recommended steps to configuring your nodes to use a private
registry. In this example, run these on your desktop/laptop:
1. Run docker login [server] for each set of credentials you want to
use. This updates $HOME/.docker/config.json on your PC.
2. View $HOME/.docker/config.json in an editor to ensure it contains
just the credentials you want to use.
3. Get a list of your nodes; for example:
◦ if you want the names: nodes=$( kubectl get nodes -o jsonpath='{range .items[*].metadata}{.name} {end}' )
◦ if you want to get the IP addresses: nodes=$( kubectl get nodes -o jsonpath='{range .items[*].status.addresses[?(@.type=="ExternalIP")]}{.address} {end}' )
4. Copy your local .docker/config.json to one of the search paths listed above.
◦ for example, to test this out: for n in $nodes; do scp ~/.docker/config.json root@"$n":/var/lib/kubelet/config.json; done
You must ensure all nodes in the cluster have the same .docker/
config.json. Otherwise, pods will run on some nodes and fail to run on
others. For example, if you use node autoscaling, then each instance
template needs to include the .docker/config.json or mount a drive that
contains it.
All pods will have read access to images in any private registry once private
registry keys are added to the .docker/config.json.
Pre-pulled images
Note: This approach is suitable if you can control node
configuration. It will not work reliably if your cloud provider
manages nodes and replaces them automatically.
By default, the kubelet tries to pull each image from the specified registry.
However, if the imagePullPolicy property of the container is set to IfNotPr
esent or Never, then a local image is used (preferentially or exclusively,
respectively).
Specifying imagePullSecrets on a Pod
Kubernetes supports specifying container image registry keys on a Pod. One way to create the registry key is with the kubectl create secret docker-registry command (see the sketch below). If you already have a Docker credentials file then, rather than creating a Secret from the command line, you can import the credentials file as a Kubernetes Secret.
Create a Secret based on existing Docker credentials explains how to set
this up.
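For reference, a sketch of creating such a registry Secret from the command line; the secret name and the DOCKER_* values are placeholders you would replace:
kubectl create secret docker-registry myregistrykey \
  --docker-server=DOCKER_REGISTRY_SERVER \
  --docker-username=DOCKER_USER \
  --docker-password=DOCKER_PASSWORD \
  --docker-email=DOCKER_EMAIL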
Note: Pods can only reference image pull secrets in their own
namespace, so this process needs to be done one time per
namespace.
Now, you can create pods which reference that secret by adding an imagePullSecrets section to a Pod definition.
For example:
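A sketch of such a Pod; the secret name myregistrykey and the image reference are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-test
spec:
  containers:
  - name: app
    image: fictional.registry.example/imagename:v1
  imagePullSecrets:
  - name: myregistrykey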
This needs to be done for each pod that is using a private registry.
Use cases
There are a number of solutions for configuring private registries. Here are
some common use cases and suggested solutions.
If you need access to multiple registries, you can create one secret for each registry. Kubelet will merge any imagePullSecrets into a single virtual .docker/config.json.
What's next
• Read the OCI Image Manifest Specification
Container Environment
This page describes the resources available to Containers in the Container
environment.
Container environment
The Kubernetes Container environment provides several important
resources to Containers:
Container information
The hostname of a Container is the name of the Pod in which the Container is running. It is available through the hostname command or the gethostname function call in libc.
User defined environment variables from the Pod definition are also
available to the Container, as are any environment variables specified
statically in the Docker image.
Cluster information
A list of all services that were running when a Container was created is
available to that Container as environment variables. Those environment
variables match the syntax of Docker links.
For a service named foo that maps to a Container named bar, the following variables are defined:
FOO_SERVICE_HOST=<the host the service is running on>
FOO_SERVICE_PORT=<the port the service is running on>
Services have dedicated IP addresses and are available to the Container via DNS, if the DNS addon is enabled.
What's next
• Learn more about Container lifecycle hooks.
• Get hands-on experience attaching handlers to Container lifecycle
events.
Runtime Class
FEATURE STATE: Kubernetes v1.20 [stable]
This page describes the RuntimeClass resource and runtime selection mechanism. RuntimeClass is a feature for selecting the container runtime configuration. The container runtime configuration is used to run a Pod's containers.
Motivation
You can set a different RuntimeClass between different Pods to provide a
balance of performance versus security. For example, if part of your
workload deserves a high level of information security assurance, you might
choose to schedule those Pods so that they run in a container runtime that
uses hardware virtualization. You'd then benefit from the extra isolation of
the alternative runtime, at the expense of some additional overhead.
You can also use RuntimeClass to run different Pods with the same container
runtime but with different settings.
Setup
Ensure the RuntimeClass feature gate is enabled (it is by default). See
Feature Gates for an explanation of enabling feature gates. The RuntimeCla
ss feature gate must be enabled on apiservers and kubelets.
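A minimal sketch of a RuntimeClass resource; the name and handler values are illustrative and must match a handler configured in your CRI implementation:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  # name used to reference this class from a Pod's runtimeClassName
  name: myclass
# name of the corresponding CRI configuration (handler)
handler: myconfiguration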
Usage
Once RuntimeClasses are configured for the cluster, using them is very
simple. Specify a runtimeClassName in the Pod spec. For example:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  runtimeClassName: myclass
  # ...
This will instruct the Kubelet to use the named RuntimeClass to run this
pod. If the named RuntimeClass does not exist, or the CRI cannot run the
corresponding handler, the pod will enter the Failed terminal phase. Look
for a corresponding event for an error message.
CRI Configuration
For more details on setting up CRI runtimes, see CRI installation.
dockershim
Kubernetes' built-in dockershim CRI does not support runtime handlers.
containerd
Runtime handlers are configured through containerd's configuration at /etc/containerd/config.toml. Valid handlers are configured under the runtimes section:
[plugins.cri.containerd.runtimes.${HANDLER_NAME}]
CRI-O
Runtime handlers are configured through CRI-O's configuration at /etc/crio/crio.conf. Valid handlers are configured under the crio.runtime tables:
[crio.runtime.runtimes.${HANDLER_NAME}]
runtime_path = "${PATH_TO_BINARY}"
Scheduling
FEATURE STATE: Kubernetes v1.16 [beta]
If the supported nodes are tainted to prevent other RuntimeClass pods from
running on the node, you can add tolerations to the RuntimeClass. As with
the nodeSelector, the tolerations are merged with the pod's tolerations in
admission, effectively taking the union of the set of nodes tolerated by each.
To learn more about configuring the node selector and tolerations, see
Assigning Pods to Nodes.
Pod Overhead
FEATURE STATE: Kubernetes v1.18 [beta]
You can specify overhead resources that are associated with running a Pod.
Declaring overhead allows the cluster (including the scheduler) to account
for it when making decisions about Pods and resources.
To use Pod overhead, you must have the PodOverhead feature gate enabled
(it is on by default).
What's next
• RuntimeClass Design
• RuntimeClass Scheduling Design
• Read about the Pod Overhead concept
• PodOverhead Feature Design
Container Lifecycle Hooks
This page describes how kubelet-managed Containers can use the Container
lifecycle hook framework to run code triggered by events during their
management lifecycle.
Overview
Analogous to many programming language frameworks that have
component lifecycle hooks, such as Angular, Kubernetes provides Containers
with lifecycle hooks. The hooks enable Containers to be aware of events in
their management lifecycle and run code implemented in a handler when
the corresponding lifecycle hook is executed.
Container hooks
There are two hooks that are exposed to Containers:
PostStart: executed immediately after a container is created.
PreStop: called immediately before a container is terminated.
Hook handler calls are synchronous within the context of the Pod containing
the Container. Note, however, that for a PostStart hook, the Container
ENTRYPOINT and the hook fire asynchronously. If the hook takes too long to
run or hangs, the Container cannot reach a Running state.
PreStop hooks are not executed asynchronously from the signal to stop the
Container; the hook must complete its execution before the signal can be
sent. If a PreStop hook hangs during execution, the Pod's phase will be
Terminating and remain there until the Pod is killed after its
terminationGracePeriodSeconds expires. This grace period applies to the total time it takes
for both the PreStop hook to execute and for the Container to stop normally.
If, for example, terminationGracePeriodSeconds is 60, and the hook takes
55 seconds to complete, and the Container takes 10 seconds to stop
normally after receiving the signal, then the Container will be killed before it
can stop normally, since terminationGracePeriodSeconds is less than the
total time (55+10) it takes for these two things to happen.
Users should make their hook handlers as lightweight as possible. There are
cases, however, when long running commands make sense, such as when
saving state prior to stopping a Container.
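For illustration only (a minimal sketch, not an example defined on this page;
the image and commands are placeholders), a Pod can attach exec handlers to
both hooks:
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo postStart handler ran > /usr/share/message"]
      preStop:
        exec:
          command: ["/bin/sh", "-c", "nginx -s quit; while killall -0 nginx; do sleep 1; done"]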
Generally, only single deliveries are made. If, for example, an HTTP hook
receiver is down and is unable to take traffic, there is no attempt to resend.
In some rare cases, however, double delivery may occur. For instance, if a
kubelet restarts in the middle of sending a hook, the hook might be resent
after the kubelet comes back up.
Debugging Hook handlers
If a handler fails, the kubelet broadcasts an event that you can see with
kubectl describe pod <name-of-pod>. For example, the events for a failed
PostStart handler look similar to this:
Events:
  FirstSeen  LastSeen  Count  From                                                    SubObjectPath          Type     Reason               Message
  ---------  --------  -----  ----                                                    -------------          ----     ------               -------
  1m         1m        1      {default-scheduler }                                                           Normal   Scheduled            Successfully assigned test-1730497541-cq1d2 to gke-test-cluster-default-pool-a07e5d30-siqd
  1m         1m        1      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}  spec.containers{main}  Normal   Pulling              pulling image "test:1.0"
  1m         1m        1      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}  spec.containers{main}  Normal   Created              Created container with docker id 5c6a256a2567; Security:[seccomp=unconfined]
  1m         1m        1      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}  spec.containers{main}  Normal   Pulled               Successfully pulled image "test:1.0"
  1m         1m        1      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}  spec.containers{main}  Normal   Started              Started container with docker id 5c6a256a2567
  38s        38s       1      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}  spec.containers{main}  Normal   Killing              Killing container with docker id 5c6a256a2567: PostStart handler: Error executing in Docker Container: 1
  37s        37s       1      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}  spec.containers{main}  Normal   Killing              Killing container with docker id 8df9fdfd7054: PostStart handler: Error executing in Docker Container: 1
  38s        37s       2      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}                         Warning  FailedSync           Error syncing pod, skipping: failed to "StartContainer" for "main" with RunContainerError: "PostStart handler: Error executing in Docker Container: 1"
  1m         22s       2      {kubelet gke-test-cluster-default-pool-a07e5d30-siqd}  spec.containers{main}  Warning  FailedPostStartHook
What's next
• Learn more about the Container environment.
• Get hands-on experience attaching handlers to Container lifecycle
events.
Workloads
Understand Pods, the smallest deployable compute object in Kubernetes,
and the higher-level abstractions that help you to run them.
However, to make life considerably easier, you don't need to manage each
Pod directly. Instead, you can use workload resources that manage a set of
pods on your behalf. These resources configure controllers that make sure
the right number of the right kind of pod are running, to match the state you
specified.
What's next
As well as reading about each resource, you can learn about specific tasks
that relate to them:
There are two supporting concepts that provide background about how
Kubernetes manages Pods for applications:
Once your application is running, you might want to make it available on the
internet as a Service or, for web applications only, using an Ingress.
Pods
Pods are the smallest deployable units of computing that you can create and
manage in Kubernetes.
As well as application containers, a Pod can contain init containers that run
during Pod startup. You can also inject ephemeral containers for debugging
if your cluster offers this.
What is a Pod?
Note: While Kubernetes supports more container runtimes than
just Docker, Docker is the most commonly known runtime, and it
helps to describe Pods using some terminology from Docker.
Using Pods
Usually you don't need to create Pods directly, even singleton Pods. Instead,
create them using workload resources such as Deployment or Job. If your
Pods need to track state, consider the StatefulSet resource.
Each Pod is meant to run a single instance of a given application. If you want
to scale your application horizontally (to provide more overall resources by
running more instances), you should use multiple Pods, one for each
instance. In Kubernetes, this is typically referred to as replication.
Replicated Pods are usually created and managed as a group by a workload
resource and its controller.
See Pods and controllers for more information on how Kubernetes uses
workload resources, and their controllers, to implement application scaling
and auto-healing.
For example, you might have a container that acts as a web server for files
in a shared volume, and a separate "sidecar" container that updates those
files from a remote source, as in the following diagram:
Some Pods have init containers as well as app containers. Init containers run
and complete before the app containers are started.
Pods natively provide two kinds of shared resources for their constituent
containers: networking and storage.
Here are some examples of workload resources that manage one or more
Pods:
• Deployment
• StatefulSet
• DaemonSet
Pod templates
Controllers for workload resources create Pods from a pod template and
manage those Pods on your behalf.
Each controller for a workload resource uses the PodTemplate inside the
workload object to make actual Pods. The PodTemplate is part of the desired
state of whatever workload resource you used to run your app.
The sample below is a manifest for a simple Job with a template that starts
one container. The container in that Pod prints a message then pauses.
apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    # This is the pod template
    spec:
      containers:
      - name: hello
        image: busybox
        command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
      restartPolicy: OnFailure
    # The pod template ends here
For example, the StatefulSet controller ensures that the running Pods match
the current pod template for each StatefulSet object. If you edit the
StatefulSet to change its pod template, the StatefulSet starts to create new
Pods based on the updated template. Eventually, all of the old Pods are
replaced with new Pods, and the update is complete.
Each workload resource implements its own rules for handling changes to
the Pod template. If you want to read more about StatefulSet specifically,
read Update strategy in the StatefulSet Basics tutorial.
On Nodes, the kubelet does not directly observe or manage any of the
details around pod templates and updates; those details are abstracted
away. That abstraction and separation of concerns simplifies system
semantics, and makes it feasible to extend the cluster's behavior without
changing existing code.
Storage in Pods
A Pod can specify a set of shared storage volumes. All containers in the Pod
can access the shared volumes, allowing those containers to share data.
Volumes also allow persistent data in a Pod to survive in case one of the
containers within needs to be restarted. See Storage for more information
on how Kubernetes implements shared storage and makes it available to
Pods.
Pod networking
Each Pod is assigned a unique IP address for each address family. Every
container in a Pod shares the network namespace, including the IP address
and network ports. Inside a Pod (and only then), the containers that belong
to the Pod can communicate with one another using localhost. When
containers in a Pod communicate with entities outside the Pod, they must
coordinate how they use the shared network resources (such as ports).
Within a Pod, containers share an IP address and port space, and can find
each other via localhost. The containers in a Pod can also communicate
with each other using standard inter-process communications like SystemV
semaphores or POSIX shared memory. Containers in different Pods have
distinct IP addresses and can not communicate by IPC without special
configuration. Containers that want to interact with a container running in a
different Pod can use IP networking to communicate.
Containers within the Pod see the system hostname as being the same as the
configured name for the Pod. There's more about this in the networking
section.
Static Pods
Static Pods are managed directly by the kubelet daemon on a specific node,
without the API server observing them. Whereas most Pods are managed by
the control plane (for example, a Deployment), for static Pods, the kubelet
directly supervises each static Pod (and restarts it if it fails).
Static Pods are always bound to one Kubelet on a specific node. The main
use for static Pods is to run a self-hosted control plane: in other words, using
the kubelet to supervise the individual control plane components.
The kubelet automatically tries to create a mirror Pod on the Kubernetes API
server for each static Pod. This means that the Pods running on a node are
visible on the API server, but cannot be controlled from there.
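As a hedged illustration, a static Pod is an ordinary Pod manifest that you
place in the directory the kubelet watches (set by the kubelet's
staticPodPath option; /etc/kubernetes/manifests is a common default, but
check your kubelet configuration). For example:
apiVersion: v1
kind: Pod
metadata:
  name: static-web           # hypothetical name
  labels:
    role: myrole
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - name: web
      containerPort: 80
      protocol: TCP
The kubelet on that node creates and supervises this Pod directly; a mirror
Pod for it then shows up on the API server, as described above.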
What's next
• Learn about the lifecycle of a Pod.
• Learn about RuntimeClass and how you can use it to configure different
Pods with different container runtime configurations.
• Read about Pod topology spread constraints.
• Read about PodDisruptionBudget and how you can use it to manage
application availability during disruptions.
• Pod is a top-level resource in the Kubernetes REST API. The Pod object
definition describes the object in detail.
• The Distributed System Toolkit: Patterns for Composite Containers
explains common layouts for Pods with more than one container.
To understand the context for why Kubernetes wraps a common Pod API in
other resources (such as StatefulSets or Deployments), you can read about
the prior art, including:
• Aurora
• Borg
• Marathon
• Omega
• Tupperware.
Pod Lifecycle
This page describes the lifecycle of a Pod. Pods follow a defined lifecycle,
starting in the Pending phase, moving through Running if at least one of its
primary containers starts OK, and then through either the Succeeded or
Failed phases depending on whether any container in the Pod terminated in
failure.
In the Kubernetes API, Pods have both a specification and an actual status.
The status for a Pod object consists of a set of Pod conditions. You can also
inject custom readiness information into the condition data for a Pod, if that
is useful to your application.
Pods are only scheduled once in their lifetime. Once a Pod is scheduled
(assigned) to a Node, the Pod runs on that Node until it stops or is
terminated.
Pod lifetime
Like individual application containers, Pods are considered to be relatively
ephemeral (rather than durable) entities. Pods are created, assigned a
unique ID (UID), and scheduled to nodes where they remain until
termination (according to restart policy) or deletion. If a Node dies, the Pods
scheduled to that node are scheduled for deletion after a timeout period.
(Diagram: a multi-container Pod that contains a file puller and a web server
that uses a persistent volume for shared storage between the containers.)
Pod phase
A Pod's status field is a PodStatus object, which has a phase field.
The phase of a Pod is a simple, high-level summary of where the Pod is in its
lifecycle. The phase is not intended to be a comprehensive rollup of
observations of container or Pod state, nor is it intended to be a
comprehensive state machine.
The number and meanings of Pod phase values are tightly guarded. Other
than what is documented here, nothing should be assumed about Pods that
have a given phase value.
Here are the possible values for phase:
Pending: The Pod has been accepted by the Kubernetes cluster, but one or
more of the containers has not been set up and made ready to run. This
includes time a Pod spends waiting to be scheduled as well as the time spent
downloading container images over the network.
Running: The Pod has been bound to a node, and all of the containers have
been created. At least one container is still running, or is in the process
of starting or restarting.
Succeeded: All containers in the Pod have terminated in success, and will
not be restarted.
Failed: All containers in the Pod have terminated, and at least one
container has terminated in failure. That is, the container either exited
with non-zero status or was terminated by the system.
Unknown: For some reason the state of the Pod could not be obtained. This
phase typically occurs due to an error in communicating with the node where
the Pod should be running.
Container states
As well as the phase of the Pod overall, Kubernetes tracks the state of each
container inside a Pod. You can use container lifecycle hooks to trigger
events to run at certain points in a container's lifecycle.
Once the scheduler assigns a Pod to a Node, the kubelet starts creating
containers for that Pod using a container runtime. There are three possible
container states: Waiting, Running, and Terminated.
To check the state of a Pod's containers, you can use kubectl describe
pod <name-of-pod>. The output shows the state for each container within
that Pod.
Waiting
If a container is not in either the Running or Terminated state, it is Waiting.
A container in the Waiting state is still running the operations it requires in
order to complete start up: for example, pulling the container image from a
container image registry, or applying Secret data. When you use kubectl to
query a Pod with a container that is Waiting, you also see a Reason field to
summarize why the container is in that state.
Running
The Running status indicates that a container is executing without issues. If
there was a postStart hook configured, it has already executed and
finished. When you use kubectl to query a Pod with a container that is
Running, you also see information about when the container entered the Running
state.
Terminated
A container in the Terminated state began execution and then either ran to
completion or failed for some reason. When you use kubectl to query a Pod
with a container that is Terminated, you see a reason, an exit code, and the
start and finish time for that container's period of execution.
If a container has a preStop hook configured, that runs before the container
enters the Terminated state.
Pod conditions
A Pod has a PodStatus, which has an array of PodConditions through which
the Pod has or has not passed:
PodScheduled: the Pod has been scheduled to a node.
ContainersReady: all containers in the Pod are ready.
Initialized: all init containers have started successfully.
Ready: the Pod is able to serve requests and should be added to the load
balancing pools of all matching Services.
Pod readiness
FEATURE STATE: Kubernetes v1.14 [stable]
Your application can inject extra feedback or signals into PodStatus: Pod
readiness. To use this, set readinessGates in the Pod's spec to specify a list
of additional conditions that the kubelet evaluates for Pod readiness.
Here is an example:
kind: Pod
...
spec:
  readinessGates:
    - conditionType: "www.example.com/feature-1"
status:
  conditions:
    - type: Ready                              # a built in PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
    - type: "www.example.com/feature-1"        # an extra PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
    - containerID: docker://abcd...
      ready: true
...
The Pod conditions you add must have names that meet the Kubernetes label
key format.
When a Pod's containers are Ready but at least one custom condition is
missing or False, the kubelet sets the Pod's condition to ContainersReady.
Container probes
A Probe is a diagnostic performed periodically by the kubelet on a Container.
To perform a diagnostic, the kubelet calls a Handler implemented by the
container. There are three types of handlers:
ExecAction: executes a specified command inside the container. The
diagnostic is considered successful if the command exits with a status code
of 0.
TCPSocketAction: performs a TCP check against the Pod's IP address on a
specified port. The diagnostic is considered successful if the port is open.
HTTPGetAction: performs an HTTP GET request against the Pod's IP address on
a specified port and path. The diagnostic is considered successful if the
response has a status code greater than or equal to 200 and less than 400.
The kubelet can optionally perform and react to three kinds of probes on
running containers:
livenessProbe: indicates whether the container is running. If the liveness
probe fails, the kubelet kills the container, and the container is subjected
to its restart policy.
readinessProbe: indicates whether the container is ready to respond to
requests. If the readiness probe fails, the endpoints controller removes the
Pod's IP address from the endpoints of all Services that match the Pod.
startupProbe: indicates whether the application within the container has
started. All other probes are disabled if a startup probe is provided, until
it succeeds.
If you'd like your container to be killed and restarted if a probe fails, then
specify a liveness probe, and specify a restartPolicy of Always or
OnFailure.
If you'd like to start sending traffic to a Pod only when a probe succeeds,
specify a readiness probe. In this case, the readiness probe might be the
same as the liveness probe, but the existence of the readiness probe in the
spec means that the Pod will start without receiving any traffic and only
start receiving traffic after the probe starts succeeding. If your container
needs to work on loading large data, configuration files, or migrations
during startup, specify a readiness probe.
If you want your container to be able to take itself down for maintenance,
you can specify a readiness probe that checks an endpoint specific to
readiness that is different from the liveness probe.
Note: If you just want to be able to drain requests when the Pod is
deleted, you do not necessarily need a readiness probe; on
deletion, the Pod automatically puts itself into an unready state
regardless of whether the readiness probe exists. The Pod remains
in the unready state while it waits for the containers in the Pod to
stop.
Startup probes are useful for Pods that have containers that take a long time
to come into service. Rather than set a long liveness interval, you can
configure a separate configuration for probing the container as it starts up,
allowing a time longer than the liveness interval would allow.
If your container usually starts in more than initialDelaySeconds +
failureThreshold × periodSeconds, you should specify a startup probe
that checks the same endpoint as the liveness probe. The default for
periodSeconds is 10 seconds. You should then set its failureThreshold high
enough to allow the container to start, without changing the default values
of the liveness probe. This helps to protect against deadlocks.
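A minimal sketch of that pattern (the endpoint, port, and thresholds are
illustrative, not values from this page):
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: nginx
    startupProbe:
      httpGet:
        path: /healthz       # illustrative endpoint, same as the liveness probe
        port: 80
      periodSeconds: 10
      failureThreshold: 30   # allows up to 300 seconds for startup
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 10
      failureThreshold: 3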
Termination of Pods
Because Pods represent processes running on nodes in the cluster, it is
important to allow those processes to gracefully terminate when they are no
longer needed (rather than being abruptly stopped with a KILL signal and
having no chance to clean up).
The design aim is for you to be able to request deletion and know when
processes terminate, but also be able to ensure that deletes eventually
complete. When you request deletion of a Pod, the cluster records and
tracks the intended grace period before the Pod is allowed to be forcefully
killed. With that forceful shutdown tracking in place, the kubelet attempts
graceful shutdown.
Typically, the container runtime sends a TERM signal to the main process in
each container. Many container runtimes respect the STOPSIGNAL value
defined in the container image and send this instead of TERM. Once the
grace period has expired, the KILL signal is sent to any remaining
processes, and the Pod is then deleted from the API Server. If the kubelet or
the container runtime's management service is restarted while waiting for
processes to terminate, the cluster retries from the start including the full
original grace period.
An example flow:
1. You use the kubectl tool to manually delete a specific Pod, with the
default grace period (30 seconds).
2. The Pod in the API server is updated with the time beyond which the
   Pod is considered "dead" along with the grace period. If you use
   kubectl describe to check on the Pod you're deleting, that Pod shows up
   as "Terminating".
3. On the node where the Pod is running: as soon as the kubelet sees that
   a Pod has been marked as terminating (a graceful shutdown duration has
   been set), the kubelet begins the local Pod shutdown process.
   1. If one of the Pod's containers has defined a preStop hook, the
      kubelet runs that hook inside of the container. If the preStop hook
      is still running after the grace period expires, the kubelet
      requests a small, one-off grace period extension of 2 seconds.
By default, all deletes are graceful within 30 seconds. The kubectl delete
command supports the --grace-period=<seconds> option which allows you
to override the default and specify your own value.
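For example (the Pod name is hypothetical):
kubectl delete pod mypod --grace-period=60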
Setting the grace period to 0 forcibly and immediately deletes the Pod from
the API server. If the pod was still running on a node, that forcible deletion
triggers the kubelet to begin immediate cleanup.
When a force deletion is performed, the API server does not wait for
confirmation from the kubelet that the Pod has been terminated on the node
it was running on. It removes the Pod in the API immediately so a new Pod
can be created with the same name. On the node, Pods that are set to
terminate immediately will still be given a small grace period before being
force killed.
If you need to force-delete Pods that are part of a StatefulSet, refer to the
task documentation for deleting Pods from a StatefulSet.
What's next
• Get hands-on experience attaching handlers to Container lifecycle
events.
• For detailed information about Pod / Container status in the API, see
PodStatus and ContainerStatus.
Init Containers
This page provides an overview of init containers: specialized containers
that run before app containers in a Pod. Init containers can contain utilities
or setup scripts not present in an app image.
You can specify init containers in the Pod specification alongside the
containers array (which describes app containers).
If a Pod's init container fails, the kubelet repeatedly restarts that init
container until it succeeds. However, if the Pod has a restartPolicy of
Never, and an init container fails during startup of that Pod, Kubernetes
treats the overall Pod as failed.
To specify an init container for a Pod, add the initContainers field into the
Pod specification, as an array of objects of type Container, alongside the app
containers array. The status of the init containers is returned in the
.status.initContainerStatuses field as an array of the container statuses (similar to
the .status.containerStatuses field).
If you specify multiple init containers for a Pod, kubelet runs each init
container sequentially. Each init container must succeed before the next can
run. When all of the init containers have run to completion, kubelet
initializes the application containers for the Pod and runs them as usual.
Using init containers
Because init containers have separate images from app containers, they
have some advantages for start-up related code:
• Init containers can contain utilities or custom code for setup that are
not present in an app image. For example, there is no need to make an
image FROM another image just to use a tool like sed, awk, python, or
dig during setup.
• The application image builder and deployer roles can work
independently without the need to jointly build a single app image.
• Init containers can run with a different view of the filesystem than app
containers in the same Pod. Consequently, they can be given access to
Secrets that app containers cannot access.
• Because init containers run to completion before any app containers
start, init containers offer a mechanism to block or delay app container
startup until a set of preconditions are met. Once preconditions are
met, all of the app containers in a Pod can start in parallel.
• Init containers can securely run utilities or custom code that would
otherwise make an app container image less secure. By keeping
unnecessary tools separate you can limit the attack surface of your app
container image.
Examples
Here are some ideas for how to use init containers:
• Register this Pod with a remote server from the downward API with a
command like:
• Wait for some time before starting the app container with a command
like
sleep 60
This example defines a simple Pod that has two init containers. The first
waits for myservice, and the second waits for mydb. Once both init
containers complete, the Pod runs the app container from its spec section.
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
  - name: init-mydb
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done"]
pod/myapp-pod created
Name: myapp-pod
Namespace: default
[...]
Labels: app=myapp
Status: Pending
[...]
Init Containers:
init-myservice:
[...]
State: Running
[...]
init-mydb:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Containers:
myapp-container:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Events:
  FirstSeen  LastSeen  Count  From                    SubObjectPath                        Type    Reason     Message
  ---------  --------  -----  ----                    -------------                        ----    ------     -------
  16s        16s       1      {default-scheduler }                                         Normal  Scheduled  Successfully assigned myapp-pod to 172.17.4.201
  16s        16s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Pulling    pulling image "busybox"
  13s        13s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Pulled     Successfully pulled image "busybox"
  13s        13s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Created    Created container with docker id 5ced34a04634; Security:[seccomp=unconfined]
  13s        13s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Started    Started container with docker id 5ced34a04634
---
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
---
apiVersion: v1
kind: Service
metadata:
  name: mydb
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9377
service/myservice created
service/mydb created
You'll then see that those init containers complete, and that the myapp-pod
Pod moves into the Running state:
This simple example should provide some inspiration for you to create your
own init containers. What's next contains a link to a more detailed example.
Detailed behavior
During Pod startup, the kubelet delays running init containers until the
networking and storage are ready. Then the kubelet runs the Pod's init
containers in the order they appear in the Pod's spec.
Each init container must exit successfully before the next container starts. If
a container fails to start due to the runtime or exits with failure, it is retried
according to the Pod restartPolicy. However, if the Pod restartPolicy is
set to Always, the init containers use restartPolicy OnFailure.
A Pod cannot be Ready until all init containers have succeeded. The ports on
an init container are not aggregated under a Service. A Pod that is
initializing is in the Pending state but should have a condition Initialized
set to true.
If the Pod restarts, or is restarted, all init containers must execute again.
Changes to the init container spec are limited to the container image field.
Altering an init container image field is equivalent to restarting the Pod.
The name of each app and init container in a Pod must be unique; a
validation error is thrown for any container sharing a name with another.
Resources
Given the ordering and execution for init containers, the following rules for
resource usage apply:
Quota and limits are applied based on the effective Pod request and limit.
Pod level control groups (cgroups) are based on the effective Pod request
and limit, the same as the scheduler.
Pod restart reasons
A Pod can restart, causing re-execution of init containers, for the following
reasons:
• A user updates the Pod specification, causing the init container image
to change. Any changes to the init container image restarts the Pod.
App container image changes only restart the app container.
• The Pod infrastructure container is restarted. This is uncommon and
would have to be done by someone with root access to nodes.
• All containers in a Pod are terminated while restartPolicy is set to
Always, forcing a restart, and the init container completion record has
been lost due to garbage collection.
What's next
• Read about creating a Pod that has an init container
• Learn how to debug init containers
Pod Topology Spread Constraints
You can use topology spread constraints to control how Pods are spread
across your cluster among failure-domains such as regions, zones, nodes,
and other user-defined topology domains. This can help to achieve high
availability as well as efficient resource utilization.
Note: In versions of Kubernetes before v1.19, you must enable the
EvenPodsSpread feature gate on the API server and the scheduler
in order to use Pod topology spread constraints.
Prerequisites
Node Labels
Topology spread constraints rely on node labels to identify the topology
domain(s) that each Node is in. For example, a Node might have labels:
node=node1,zone=us-east-1a,region=us-east-1
Instead of manually applying labels, you can also reuse the well-known
labels that are created and populated automatically on most clusters.
API
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  topologySpreadConstraints:
  - maxSkew: <integer>
    topologyKey: <string>
    whenUnsatisfiable: <string>
    labelSelector: <object>
You can define one or multiple topologySpreadConstraint to instruct the
kube-scheduler how to place each incoming Pod in relation to the existing
Pods across your cluster. The fields are:
You can read more about this field by running kubectl explain
Pod.spec.topologySpreadConstraints.
Example: One TopologySpreadConstraint
Suppose you have a 4-node cluster where 3 Pods labeled foo:bar are
located in node1, node2 and node3 respectively:
pods/topology-spread-constraints/one-constraint.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
topologyKey: zone implies the even distribution will only be applied to the
nodes which have label pair "zone:<any value>" present.
whenUnsatisfiable: DoNotSchedule tells the scheduler to let it stay pending
if the incoming Pod can't satisfy the constraint.
If the scheduler placed this incoming Pod into "zoneA", the Pods distribution
would become [3, 1], hence the actual skew is 2 (3 - 1) - which violates
maxSkew: 1. In this example, the incoming Pod can only be placed onto "zoneB":
You can tweak the Pod spec to meet various kinds of requirements:
• Change maxSkew to a bigger value like "2" so that the incoming Pod can
be placed onto "zoneA" as well.
• Change topologyKey to "node" so as to distribute the Pods evenly
across nodes instead of zones. In the above example, if maxSkew
remains "1", the incoming Pod can only be placed onto "node4".
• Change whenUnsatisfiable: DoNotSchedule to whenUnsatisfiable:
  ScheduleAnyway to ensure the incoming Pod is always schedulable (assuming
  other scheduling APIs are satisfied). However, it is preferred to place it
  onto the topology domain which has fewer matching Pods. (Be aware that this
  preference is jointly normalized with other internal scheduling priorities,
  such as resource usage ratio.)
Example: Multiple TopologySpreadConstraints
This builds upon the previous example. Suppose you have a 4-node cluster
where 3 Pods labeled foo:bar are located in node1, node2 and node3
respectively:
pods/topology-spread-constraints/two-constraints.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
In this case, to match the first constraint, the incoming Pod can only be
placed onto "zoneB"; while in terms of the second constraint, the incoming
Pod can only be placed onto "node4". Then the results of 2 constraints are
ANDed, so the only viable option is to place on "node4".
Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster
across 2 zones:
To overcome this situation, you can either increase the maxSkew or modify
one of the constraints to use whenUnsatisfiable: ScheduleAnyway.
Conventions
• Only the Pods holding the same namespace as the incoming Pod can be
matching candidates.
• If the incoming Pod has node affinity (spec.affinity.nodeAffinity) defined
  and you know that "zoneC" must be excluded, you can compose the yaml as
  below, so that "mypod" will be placed onto "zoneB" instead of "zoneC".
  Similarly, spec.nodeSelector is also respected.
pods/topology-spread-constraints/one-constraint-with-nodeaffinity.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: NotIn
            values:
            - zoneC
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
Cluster-level default constraints
You can set default topology spread constraints for a cluster, as part of the
PodTopologySpread plugin configuration in a scheduling profile. For example:
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
When default Pod topology spreading is enabled for the cluster and no
explicit defaults are configured, kube-scheduler applies the following
built-in default constraints:
defaultConstraints:
  - maxSkew: 3
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: ScheduleAnyway
The PodTopologySpread plugin does not score the nodes that don't
have the topology keys specified in the spreading constraints.
If you don't want to use the default Pod spreading constraints for your
cluster, you can disable those defaults by setting defaultingType to List
and leaving empty defaultConstraints in the PodTopologySpread plugin
configuration:
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints: []
          defaultingType: List
Comparison with PodAffinity/PodAntiAffinity
In Kubernetes, directives related to "Affinity" control how Pods are
scheduled - more packed or more scattered:
• For PodAffinity, you can try to pack any number of Pods into qualifying
  topology domain(s).
• For PodAntiAffinity, only one Pod can be scheduled into a single
  topology domain.
For finer control, you can specify topology spread constraints to distribute
Pods across different topology domains - to achieve either high availability
or cost-saving. This can also help on rolling update workloads and scaling
out replicas smoothly. See Motivation for more details.
Known Limitations
Disruptions
This guide is for application owners who want to build highly available
applications, and thus need to understand what types of disruptions can
happen to Pods.
Kubernetes offers features to help you run highly available applications even
when you introduce frequent voluntary disruptions.
Cluster managers and hosting providers should use tools which respect
PodDisruptionBudgets by calling the Eviction API instead of directly deleting
pods or deployments.
For example, the kubectl drain subcommand lets you mark a node as
going out of service. When you run kubectl drain, the tool tries to evict all
of the Pods on the Node you're taking out of service. The eviction request
that kubectl submits on your behalf may be temporarily rejected, so the tool
periodically retries all failed requests until all Pods on the target node are
terminated, or until a configurable timeout is reached.
The group of pods that comprise the application is specified using a label
selector, the same as the one used by the application's controller
(deployment, stateful-set, etc).
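As an illustrative sketch (the names and label are placeholders, not values
from this page), a PodDisruptionBudget that keeps at least 2 matching Pods
available looks like this:
apiVersion: policy/v1beta1   # policy/v1 on Kubernetes v1.21 and later
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp             # must match the labels used by the application's controller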
PodDisruptionBudget example
Consider a cluster with 3 nodes, node-1 through node-3. The cluster is
running several applications. One of them has 3 replicas initially called pod-
a, pod-b, and pod-c. Another, unrelated pod without a PDB, called pod-x, is
also shown. Initially, the pods are laid out as follows:
All 3 pods are part of a deployment, and they collectively have a PDB which
requires there be at least 2 of the 3 pods to be available at all times.
For example, assume the cluster administrator wants to reboot into a new
kernel version to fix a bug in the kernel. The cluster administrator first tries
to drain node-1 using the kubectl drain command. That tool tries to evict
pod-a and pod-x. This succeeds immediately. Both pods go into the
terminating state at the same time. This puts the cluster in this state:
(Note: for a StatefulSet, pod-a, which would be called something like pod-0,
would need to terminate completely before its replacement, which is also
called pod-0 but has a different UID, could be created. Otherwise, the
example applies to a StatefulSet as well.)
Now, the cluster administrator tries to drain node-2. In the meantime, the
deployment has created a replacement for pod-a, called pod-d, on another
node. The drain command will try to evict the two pods in some order, say
pod-b first and then pod-d. It will succeed at evicting pod-b. But, when it
tries to evict pod-d, it will be refused because that would leave only one
pod available for the deployment.
At this point, the cluster administrator needs to add a node back to the
cluster to proceed with the upgrade.
You can see how Kubernetes varies the rate at which disruptions can
happen, according to:
What's next
• Follow steps to protect your application by configuring a Pod Disruption
Budget.
Ephemeral Containers
FEATURE STATE: Kubernetes v1.16 [alpha]
Sometimes it's necessary to inspect the state of an existing Pod, for
example to troubleshoot a hard-to-reproduce bug. In these cases you can run
an ephemeral container in an existing Pod to inspect its state and run
arbitrary commands.
What is an ephemeral container?
Ephemeral containers differ from other containers in that they lack
guarantees for resources or execution, and they will never be automatically
restarted, so they are not appropriate for building applications. Ephemeral
containers are described using the same ContainerSpec as regular
containers, but many fields are incompatible and disallowed for ephemeral
containers.
• Ephemeral containers may not have ports, so fields such as ports, live
nessProbe, readinessProbe are disallowed.
• Pod resource allocations are immutable, so setting resources is
disallowed.
• For a complete list of allowed fields, see the EphemeralContainer
reference documentation.
{
    "apiVersion": "v1",
    "kind": "EphemeralContainers",
    "metadata": {
        "name": "example-pod"
    },
    "ephemeralContainers": [{
        "command": [
            "sh"
        ],
        "image": "busybox",
        "imagePullPolicy": "IfNotPresent",
        "name": "debugger",
        "stdin": true,
        "tty": true,
        "terminationMessagePolicy": "File"
    }]
}
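One way to apply this (assuming the JSON above is saved as ec.json, and that
the target Pod is the example-pod named in it) is to call the Pod's
ephemeralcontainers subresource directly:
kubectl replace --raw /api/v1/namespaces/default/pods/example-pod/ephemeralcontainers -f ec.json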
The API server responds with the updated EphemeralContainers object:
{
    "kind": "EphemeralContainers",
    "apiVersion": "v1",
    "metadata": {
        "name": "example-pod",
        "namespace": "default",
        "selfLink": "/api/v1/namespaces/default/pods/example-pod/ephemeralcontainers",
        "uid": "a14a6d9b-62f2-4119-9d8e-e2ed6bc3a47c",
        "resourceVersion": "15886",
        "creationTimestamp": "2019-08-29T06:41:42Z"
    },
    "ephemeralContainers": [
        {
            "name": "debugger",
            "image": "busybox",
            "command": [
                "sh"
            ],
            "resources": {},
            "terminationMessagePolicy": "File",
            "imagePullPolicy": "IfNotPresent",
            "stdin": true,
            "tty": true
        }
    ]
}
You can view the state of the newly created ephemeral container using kube
ctl describe:
...
Ephemeral Containers:
debugger:
Container ID: docker://
cf81908f149e7e9213d3c3644eda55c72efaff67652a2685c1146f0ce151e80f
Image: busybox
Image ID: docker-pullable://
busybox@sha256:9f1003c480699be56815db0f8146ad2e22efea85129b5b5983
d0e0fb52d9ab70
Port: <none>
Host Port: <none>
Command:
sh
State: Running
Started: Thu, 29 Aug 2019 06:42:21 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts: <none>
...
You can interact with the new ephemeral container in the same way as other
containers using kubectl attach, kubectl exec, and kubectl logs, for
example:
kubectl attach -it example-pod -c debugger
Workload Resources
Deployments
ReplicaSet
StatefulSets
DaemonSet
Jobs
Garbage Collection
CronJob
ReplicationController
Deployments
A Deployment provides declarative updates for Pods and ReplicaSets.
Use Case
The following are typical use cases for Deployments:
Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to
bring up three nginx Pods:
controllers/nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
In this example:
Before you begin, make sure your Kubernetes cluster is up and running.
Follow the steps given below to create the above Deployment:
Note: You can specify the --record flag to write the command
executed in the resource annotation kubernetes.io/change-
cause. The recorded change is useful for future introspection. For
example, to see the commands executed in each Deployment
revision.
When you inspect the Deployments in your cluster, the following fields
are displayed:
3. Run the kubectl get deployments again a few seconds later. The
output is similar to this:
Notice that the Deployment has created all three replicas, and all
replicas are up-to-date (they contain the latest Pod template) and
available.
The created ReplicaSet ensures that there are three nginx Pods.
Note:
Pod-template-hash label
Caution: Do not change this label.
Updating a Deployment
Note: A Deployment's rollout is triggered if and only if the
Deployment's Pod template (that is, .spec.template) is changed,
for example if the labels or container images of the template are
updated. Other updates, such as scaling the Deployment, do not
trigger a rollout.
1. Let's update the nginx Pods to use the nginx:1.16.1 image instead of
the nginx:1.14.2 image.
kubectl --record deployment.apps/nginx-deployment set image
deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1
deployment.apps/nginx-deployment edited
or
• After the rollout succeeds, you can view the Deployment by running kub
ectl get deployments. The output is similar to this:
• Run kubectl get rs to see that the Deployment updated the Pods by
creating a new ReplicaSet and scaling it up to 3 replicas, as well as
scaling down the old ReplicaSet to 0 replicas.
kubectl get rs
• Running get pods should now show only the new Pods:
kubectl get pods
Next time you want to update these Pods, you only need to update the
Deployment's Pod template again.
Deployment ensures that only a certain number of Pods are down while
they are being updated. By default, it ensures that at least 75% of the
desired number of Pods are up (25% max unavailable).
For example, if you look at the above Deployment closely, you will see
that it first created a new Pod, then deleted some old Pods, and created
new ones. It does not kill old Pods until a sufficient number of new Pods
have come up, and does not create new Pods until a sufficient number
of old Pods have been killed. It makes sure that at least 2 Pods are
available and that at most 4 Pods in total are available.
Name: nginx-deployment
Namespace: default
CreationTimestamp: Thu, 30 Nov 2017 10:56:25 +0000
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=2
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3
available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas
created)
Events:
Type Reason Age From
Message
---- ------ ---- ----
-------
Normal ScalingReplicaSet 2m deployment-controller
Scaled up replica set nginx-deployment-2035384211 to 3
Normal ScalingReplicaSet 24s deployment-controller
Scaled up replica set nginx-deployment-1564180365 to 1
Normal ScalingReplicaSet 22s deployment-controller
Scaled down replica set nginx-deployment-2035384211 to 2
Normal ScalingReplicaSet 22s deployment-controller
Scaled up replica set nginx-deployment-1564180365 to 2
Normal ScalingReplicaSet 19s deployment-controller
Scaled down replica set nginx-deployment-2035384211 to 1
Normal ScalingReplicaSet 19s deployment-controller
Scaled up replica set nginx-deployment-1564180365 to 3
Normal ScalingReplicaSet 14s deployment-controller
Scaled down replica set nginx-deployment-2035384211 to 0
Here you see that when you first created the Deployment, it created a
ReplicaSet (nginx-deployment-2035384211) and scaled it up to 3
replicas directly. When you updated the Deployment, it created a new
ReplicaSet (nginx-deployment-1564180365) and scaled it up to 1 and
then scaled down the old ReplicaSet to 2, so that at least 2 Pods were
available and at most 4 Pods were created at all times. It then
continued scaling up and down the new and the old ReplicaSet, with
the same rolling update strategy. Finally, you'll have 3 available replicas
in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
• The rollout gets stuck. You can verify it by checking the rollout status:
• Press Ctrl-C to stop the above rollout status watch. For more
information on stuck rollouts, read more here.
kubectl get rs
• Looking at the Pods created, you see that 1 Pod created by new
ReplicaSet is stuck in an image pull loop.
NAME                                READY   STATUS             RESTARTS   AGE
nginx-deployment-1564180365-70iae   1/1     Running            0          25s
nginx-deployment-1564180365-jbqqo   1/1     Running            0          25s
nginx-deployment-1564180365-hysrc   1/1     Running            0          25s
nginx-deployment-3066724191-08mng   0/1     ImagePullBackOff   0          6s
Name: nginx-deployment
Namespace: default
CreationTimestamp: Tue, 15 Mar 2016 14:48:04 -0700
Labels: app=nginx
Selector: app=nginx
Replicas: 3 desired | 1 updated | 4 total | 3
available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.161
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
OldReplicaSets: nginx-deployment-1564180365 (3/3
replicas created)
NewReplicaSet: nginx-deployment-3066724191 (1/1
replicas created)
Events:
  FirstSeen  LastSeen  Count  From                      SubObjectPath  Type    Reason             Message
  ---------  --------  -----  ----                      -------------  ----    ------             -------
  1m         1m        1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-2035384211 to 3
  22s        22s       1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-1564180365 to 1
  22s        22s       1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled down replica set nginx-deployment-2035384211 to 2
  22s        22s       1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-1564180365 to 2
  21s        21s       1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled down replica set nginx-deployment-2035384211 to 1
  21s        21s       1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-1564180365 to 3
  13s        13s       1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled down replica set nginx-deployment-2035384211 to 0
  13s        13s       1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-3066724191 to 1
deployments "nginx-deployment"
REVISION CHANGE-CAUSE
1 kubectl apply --filename=https://k8s.io/examples/
controllers/nginx-deployment.yaml --record=true
2 kubectl set image deployment.v1.apps/nginx-
deployment nginx=nginx:1.16.1 --record=true
3 kubectl set image deployment.v1.apps/nginx-
deployment nginx=nginx:1.161 --record=true
1. Now you've decided to undo the current rollout and roll back to the
previous revision:
Name: nginx-deployment
Namespace: default
CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=4
kubernetes.io/change-cause=kubectl
set image deployment.v1.apps/nginx-deployment nginx=nginx:
1.16.1 --record=true
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3
available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas
created)
Events:
Type Reason Age From
Message
---- ------ ---- ----
-------
Normal ScalingReplicaSet 12m deployment-controller
Scaled up replica set nginx-deployment-75675f5897 to 3
Normal ScalingReplicaSet 11m deployment-controller
Scaled up replica set nginx-deployment-c4747d96c to 1
Normal ScalingReplicaSet 11m deployment-controller
Scaled down replica set nginx-deployment-75675f5897 to 2
Normal ScalingReplicaSet 11m deployment-controller
Scaled up replica set nginx-deployment-c4747d96c to 2
Normal ScalingReplicaSet 11m deployment-controller
Scaled down replica set nginx-deployment-75675f5897 to 1
Normal ScalingReplicaSet 11m deployment-controller
Scaled up replica set nginx-deployment-c4747d96c to 3
Normal ScalingReplicaSet 11m deployment-controller
Scaled down replica set nginx-deployment-75675f5897 to 0
Normal ScalingReplicaSet 11m deployment-controller
Scaled up replica set nginx-deployment-595696685f to 1
Normal DeploymentRollback 15s deployment-controller
Rolled back deployment "nginx-deployment" to revision 2
Normal ScalingReplicaSet 15s deployment-controller
Scaled down replica set nginx-deployment-595696685f to 0
Scaling a Deployment
You can scale a Deployment by using the following command:
kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
deployment.apps/nginx-deployment scaled
Assuming horizontal Pod autoscaling is enabled in your cluster, you can also
set up an autoscaler for your Deployment to choose the number of Pods to run
based on the CPU utilization of your existing Pods:
kubectl autoscale deployment.v1.apps/nginx-deployment --min=10 --max=15 --cpu-percent=80
deployment.apps/nginx-deployment scaled
Proportional scaling
RollingUpdate Deployments support running multiple versions of an
application at the same time. When you or an autoscaler scales a
RollingUpdate Deployment that is in the middle of a rollout (either in
progress or paused), the Deployment controller balances the additional
replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order
to mitigate risk. This is called proportional scaling.
kubectl get rs
• Then a new scaling request for the Deployment comes along. The
autoscaler increments the Deployment replicas to 15. The Deployment
controller needs to decide where to add these new 5 replicas. If you
weren't using proportional scaling, all 5 of them would be added in the
new ReplicaSet. With proportional scaling, you spread the additional
replicas across all ReplicaSets. Bigger proportions go to the
ReplicaSets with the most replicas and lower proportions go to
ReplicaSets with less replicas. Any leftovers are added to the
ReplicaSet with the most replicas. ReplicaSets with zero replicas are
not scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2
replicas are added to the new ReplicaSet. The rollout process should
eventually move all replicas to the new ReplicaSet, assuming the new
replicas become healthy. To confirm this, run:
The rollout status confirms how the replicas were added to each ReplicaSet.
kubectl get rs
• For example, with a Deployment that was just created, get the
Deployment details:
kubectl get rs
deployments "nginx"
REVISION CHANGE-CAUSE
1 <none>
kubectl get rs
• You can make as many updates as you wish, for example, update the
resources that will be used:
The initial state of the Deployment prior to pausing it will continue its
function, but new updates to the Deployment will not have any effect as
long as the Deployment is paused.
deployment.apps/nginx-deployment resumed
Watch the status of the rollout until it's done.
•
kubectl get rs -w
kubectl get rs
Deployment status
A Deployment enters various states during its lifecycle. It can be
progressing while rolling out a new ReplicaSet, it can be complete, or it can
fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
• The Deployment creates a new ReplicaSet.
• The Deployment is scaling up its newest ReplicaSet.
• The Deployment is scaling down its older ReplicaSet(s).
• New Pods become ready or available (ready for at least MinReadySeconds).
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following
characteristics:
• All of the replicas associated with the Deployment have been updated
to the latest version you've specified, meaning any updates you've
requested have been completed.
• All of the replicas associated with the Deployment are available.
• No old replicas for the Deployment are running.
You can check whether a Deployment has completed by using kubectl rollout status. If the rollout completed successfully, the command returns a zero exit code.
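A minimal check, using the nginx-deployment example from this page:
kubectl rollout status deployment/nginx-deployment
echo $?
A zero exit status indicates that the rollout completed successfully.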
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet
without ever completing. This can occur due to some of the following
factors:
• Insufficient quota
• Readiness probe failures
• Image pull errors
• Insufficient permissions
• Limit ranges
• Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment spec: .spec.progressDeadlineSeconds. This field denotes the number of seconds the Deployment controller waits before indicating (in the Deployment status) that the Deployment progress has stalled.
The following kubectl command sets the spec with progressDeadlineSeconds to make the controller report lack of progress for a Deployment after 10 minutes:
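A sketch of such a command, using the nginx-deployment example from this page (600 seconds corresponds to the 10 minutes mentioned above):
kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'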
deployment.apps/nginx-deployment patched
Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following attributes to the Deployment's .status.conditions:
• Type=Progressing
• Status=False
• Reason=ProgressDeadlineExceeded
You may experience transient errors with your Deployments, either due to a
low timeout that you have set or due to any other kind of error that can be
treated as transient. For example, let's suppose you have insufficient quota.
If you describe the Deployment you will notice the following section:
<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate
Once the underlying issue is resolved (for example, once sufficient quota becomes available), the Deployment controller completes the rollout and the Deployment status changes to a successful condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
Type=Available with Status=True means that your Deployment has
minimum availability. Minimum availability is dictated by the parameters
specified in the deployment strategy. Type=Progressing with Status=True
means that your Deployment is either in the middle of a rollout and it is
progressing or that it has successfully completed its progress and the
minimum required new replicas are available (see the Reason of the
condition for the particulars - in our case Reason=NewReplicaSetAvailable
means that the Deployment is complete).
If a Deployment has exceeded its progression deadline, kubectl rollout status returns a non-zero exit code, which you can check with echo $?.
Clean up Policy
You can set .spec.revisionHistoryLimit field in a Deployment to specify
how many old ReplicaSets for this Deployment you want to retain. The rest
will be garbage-collected in the background. By default, it is 10.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the
Deployment, you can create multiple Deployments, one for each release,
following the canary pattern described in managing resources.
Writing a Deployment Spec
As with all other Kubernetes configs, a Deployment needs .apiVersion, .kind, and .metadata fields. For general information about working with config
files, see deploying applications, configuring containers, and using kubectl
to manage resources documents. The name of a Deployment object must be
a valid DNS subdomain name.
Pod Template
The .spec.template and .spec.selector are the only required fields of the .spec.
Replicas
.spec.replicas is an optional field that specifies the number of desired
Pods. It defaults to 1.
Selector
.spec.selector is a required field that specifies a label selector for the
Pods targeted by this Deployment.
A Deployment may terminate Pods whose labels match the selector if their
template is different from .spec.template or if the total number of such
Pods exceeds .spec.replicas. It brings up new Pods with .spec.template
if the number of Pods is less than the desired number.
Note: You should not create other Pods whose labels match this
selector, either directly, by creating another Deployment, or by
creating another controller such as a ReplicaSet or a
ReplicationController. If you do so, the first Deployment thinks
that it created these other Pods. Kubernetes does not stop you
from doing this.
Strategy
.spec.strategy specifies the strategy used to replace old Pods by new
ones. .spec.strategy.type can be "Recreate" or "RollingUpdate".
"RollingUpdate" is the default value.
Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update; the value can be an absolute number or a percentage of desired Pods. For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts.
Once new Pods are ready, old ReplicaSet can be scaled down further,
followed by scaling up the new ReplicaSet, ensuring that the total number of
Pods available at all times during the update is at least 70% of the desired
Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods; the value can be an absolute number or a percentage. For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total
number of old and new Pods does not exceed 130% of desired Pods. Once
old Pods have been killed, the new ReplicaSet can be scaled up further,
ensuring that the total number of Pods running at any time during the
update is at most 130% of desired Pods.
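As an illustrative sketch (the 30% values simply match the examples above), these fields live under the Deployment's rolling update strategy:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 30%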
Paused
.spec.paused is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused Deployment and one that is not paused is that any changes to the PodTemplateSpec of the paused Deployment will not trigger new rollouts as long as it is paused. A Deployment is not paused by default when it is created.
ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at
any given time. As such, it is often used to guarantee the availability of a
specified number of identical Pods.
A Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods. This means that you may never need to manipulate ReplicaSet objects directly: use a Deployment instead, and define your application in its spec section.
Example
controllers/frontend.yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: frontend
labels:
app: guestbook
tier: frontend
spec:
# modify replicas according to your case
replicas: 3
selector:
matchLabels:
tier: frontend
template:
metadata:
labels:
tier: frontend
spec:
containers:
- name: php-redis
image: gcr.io/google_samples/gb-frontend:v3
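Assuming you have saved the manifest above as frontend.yaml, you can submit it to the cluster with:
kubectl apply -f frontend.yaml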
You can then list the current ReplicaSets deployed with kubectl get rs, and check on the state of the ReplicaSet with kubectl describe rs/frontend; the output is similar to this:
Name: frontend
Namespace: default
Selector: tier=frontend
Labels: app=guestbook
tier=frontend
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"ReplicaSet","meta
data":{"annotations":{},"labels":{"app":"guestbook","tier":"front
end"},"name":"frontend",...
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: tier=frontend
Containers:
php-redis:
Image: gcr.io/google_samples/gb-frontend:v3
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 117s replicaset-controller Created
pod: frontend-wtsmm
Normal SuccessfulCreate 116s replicaset-controller Created
pod: frontend-b2zdv
Normal SuccessfulCreate 116s replicaset-controller Created
pod: frontend-vcmts
And lastly you can check for the Pods brought up (for example, with kubectl get pods).
You can also verify that the owner reference of these pods is set to the
frontend ReplicaSet. To do this, get the yaml of one of the Pods running:
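For example, using one of the Pod names from the events above (your generated Pod names will differ):
kubectl get pods frontend-b2zdv -o yaml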
The output will look similar to this, with the frontend ReplicaSet's info set in
the metadata's ownerReferences field:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2020-02-12T07:06:16Z"
generateName: frontend-
labels:
tier: frontend
name: frontend-b2zdv
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: frontend
uid: f391f6db-bb9b-4c09-ae74-6a1f77f3d5cf
...
Take the previous frontend ReplicaSet example, and the Pods specified in
the following manifest:
pods/pod-rs.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod1
labels:
tier: frontend
spec:
containers:
- name: hello1
image: gcr.io/google-samples/hello-app:2.0
---
apiVersion: v1
kind: Pod
metadata:
name: pod2
labels:
tier: frontend
spec:
containers:
- name: hello2
image: gcr.io/google-samples/hello-app:1.0
As those Pods do not have a Controller (or any object) as their owner
reference and match the selector of the frontend ReplicaSet, they will
immediately be acquired by it.
Suppose you create the Pods after the frontend ReplicaSet has been
deployed and has set up its initial Pod replicas to fulfill its replica count
requirement:
The new Pods will be acquired by the ReplicaSet, and then immediately
terminated as the ReplicaSet would be over its desired count.
The output shows that the new Pods are either already terminated, or in the
process of being terminated:
If instead you create the Pods before the frontend ReplicaSet exists, you will see that the ReplicaSet has acquired those Pods and has only created new ones according to its spec until the number of its new Pods plus the original Pods matches its desired count, as fetching the Pods again shows.
Pod Template
The .spec.template is a pod template which is also required to have labels
in place. In our frontend.yaml example we had one label: tier: frontend.
Be careful not to overlap with the selectors of other controllers, lest they try
to adopt this Pod.
Pod Selector
The .spec.selector field is a label selector. As discussed earlier these are
the labels used to identify potential Pods to acquire. In our frontend.yaml
example, the selector was:
matchLabels:
tier: frontend
Replicas
You can specify how many Pods should run concurrently by setting .spec.re
plicas. The ReplicaSet will create/delete its Pods to match this number.
When using the REST API or the client-go library, you must set propagationPolicy to Background or Foreground in the -d option. For example:
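A minimal sketch of such a call, assuming kubectl proxy is serving the API on localhost:8080 and that you are deleting the frontend ReplicaSet in the default namespace:
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
  -H "Content-Type: application/json"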
You can also delete a ReplicaSet without affecting any of its Pods, for example by using kubectl delete with the --cascade=false option. Once the original is deleted, you can create a new ReplicaSet to replace it.
As long as the old and new .spec.selector are the same, then the new one
will adopt the old Pods. However, it will not make any effort to make existing
Pods match a new, different pod template. To update Pods to a new spec in a
controlled way, use a Deployment, as ReplicaSets do not support a rolling
update directly.
Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas field. The ReplicaSet controller ensures that a desired number of
Pods with a matching label selector are available and operational.
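For example, to scale the frontend ReplicaSet from this page to 5 replicas (the count is illustrative):
kubectl scale rs/frontend --replicas=5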
ReplicaSet as a Horizontal Pod Autoscaler Target
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA). That
is, a ReplicaSet can be auto-scaled by an HPA. Here is an example HPA
targeting the ReplicaSet we created in the previous example.
controllers/hpa-rs.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: frontend-scaler
spec:
scaleTargetRef:
kind: ReplicaSet
name: frontend
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 50
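Assuming the manifest above is saved as hpa-rs.yaml, you can create the autoscaler with kubectl apply -f hpa-rs.yaml. Alternatively, roughly the same behaviour can be requested imperatively:
kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50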
Alternatives to ReplicaSet
Deployment (recommended)
Deployment is an object which can own ReplicaSets and update them and
their Pods via declarative, server-side rolling updates. While ReplicaSets can
be used independently, today they're mainly used by Deployments as a
mechanism to orchestrate Pod creation, deletion and updates. When you use
Deployments you don't have to worry about managing the ReplicaSets that
they create. Deployments own and manage their ReplicaSets. As such, it is
recommended to use Deployments when you want ReplicaSets.
Bare Pods
Unlike the case where a user directly created Pods, a ReplicaSet replaces
Pods that are deleted or terminated for any reason, such as in the case of
node failure or disruptive node maintenance, such as a kernel upgrade. For
this reason, we recommend that you use a ReplicaSet even if your
application requires only a single Pod. Think of it similarly to a process
supervisor, only it supervises multiple Pods across multiple nodes instead of
individual processes on a single node. A ReplicaSet delegates local container
restarts to some agent on the node (for example, Kubelet or Docker).
Job
Use a Job instead of a ReplicaSet for Pods that are expected to terminate on
their own (that is, batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicaSet for Pods that provide a machine-
level function, such as machine monitoring or machine logging. These Pods
have a lifetime that is tied to a machine lifetime: such Pods need to be running on the machine before other Pods start, and are safe to terminate when the machine is otherwise ready to be rebooted or shut down.
ReplicationController
ReplicaSets are the successors to ReplicationControllers. The two serve the
same purpose, and behave similarly, except that a ReplicationController
does not support set-based selector requirements as described in the labels
user guide. As such, ReplicaSets are preferred over ReplicationControllers.
StatefulSets
StatefulSet is the workload API object used to manage stateful applications.
If you want to use storage volumes to provide persistence for your workload,
you can use a StatefulSet as part of the solution. Although individual Pods in
a StatefulSet are susceptible to failure, the persistent Pod identifiers make it
easier to match existing volumes to the new Pods that replace any that have
failed.
Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following:
• Stable, unique network identifiers.
• Stable, persistent storage.
• Ordered, graceful deployment and scaling.
• Ordered, automated rolling updates.
Limitations
• The storage for a given Pod must either be provisioned by a
PersistentVolume Provisioner based on the requested storage class,
or pre-provisioned by an admin.
• Deleting and/or scaling a StatefulSet down will not delete the volumes
associated with the StatefulSet. This is done to ensure data safety,
which is generally more valuable than an automatic purge of all related
StatefulSet resources.
• StatefulSets currently require a Headless Service to be responsible for
the network identity of the Pods. You are responsible for creating this
Service.
• StatefulSets do not provide any guarantees on the termination of pods
when a StatefulSet is deleted. To achieve ordered and graceful
termination of the pods in the StatefulSet, it is possible to scale the
StatefulSet down to 0 prior to deletion.
• When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's possible to get into a broken state that requires manual intervention to repair.
Components
The example below demonstrates the components of a StatefulSet.
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
ports:
- port: 80
name: web
clusterIP: None
selector:
app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
selector:
matchLabels:
app: nginx # has to match .spec.template.metadata.labels
serviceName: "nginx"
replicas: 3 # by default is 1
template:
metadata:
labels:
app: nginx # has to match .spec.selector.matchLabels
spec:
terminationGracePeriodSeconds: 10
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-storage-class"
resources:
requests:
storage: 1Gi
Pod Selector
You must set the .spec.selector field of a StatefulSet to match the labels
of its .spec.template.metadata.labels. Prior to Kubernetes 1.8, the .spec.selector field was defaulted when omitted. In 1.8 and later versions,
failing to specify a matching Pod Selector will result in a validation error
during StatefulSet creation.
Pod Identity
StatefulSet Pods have a unique identity that is comprised of an ordinal, a
stable network identity, and stable storage. The identity sticks to the Pod,
regardless of which node it's (re)scheduled on.
Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be
assigned an integer ordinal, from 0 up through N-1, that is unique over the
Set.
Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the
StatefulSet and the ordinal of the Pod. The pattern for the constructed
hostname is $(statefulset name)-$(ordinal). The example above will create three Pods named web-0, web-1, web-2. A StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form: $(service name).$(namespace).svc.cluster.local, where "cluster.local" is the cluster
domain. As each Pod is created, it gets a matching DNS subdomain, taking
the form: $(podname).$(governing service domain), where the governing
service is defined by the serviceName field on the StatefulSet.
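As a rough illustration (assuming the web/nginx example above is running in your current namespace), you can resolve one of these names from a temporary Pod:
kubectl run -it --rm --restart=Never --image=busybox:1.28 dns-test -- nslookup web-0.nginx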
Depending on how DNS is configured in your cluster, you may not be able to
look up the DNS name for a newly-run Pod immediately. This behavior can
occur when other clients in the cluster have already sent queries for the
hostname of the Pod before it was created. Negative caching (normal in
DNS) means that the results of previous failed lookups are remembered and
reused, even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few
options:
• Query the Kubernetes API directly (for example, using a watch) rather
than relying on DNS lookups.
• Decrease the time of caching in your Kubernetes DNS provider
(typically this means editing the config map for CoreDNS, which
currently caches for 30 seconds).
As mentioned in the limitations section, you are responsible for creating the
Headless Service responsible for the network identity of the pods.
Here are some examples of choices for Cluster Domain, Service name,
StatefulSet name, and how that affects the DNS names for the StatefulSet's
Pods.
Cluster Domain   Service (ns/name)   StatefulSet (ns/name)   StatefulSet Domain                Pod DNS
cluster.local    default/nginx       default/web             nginx.default.svc.cluster.local   web-{0..N-1}.nginx.default.svc.cluster.local
cluster.local    foo/nginx           foo/web                 nginx.foo.svc.cluster.local       web-{0..N-1}.nginx.foo.svc.cluster.local
kube.local       foo/nginx           foo/web                 nginx.foo.svc.kube.local          web-{0..N-1}.nginx.foo.svc.kube.local
Stable Storage
Kubernetes creates one PersistentVolume for each VolumeClaimTemplate. In
the nginx example above, each Pod will receive a single PersistentVolume
with a StorageClass of my-storage-class and 1 GiB of provisioned storage.
If no StorageClass is specified, then the default StorageClass will be used.
When a Pod is (re)scheduled onto a node, its volumeMounts mount the
PersistentVolumes associated with its PersistentVolume Claims. Note that,
the PersistentVolumes associated with the Pods' PersistentVolume Claims
are not deleted when the Pods, or StatefulSet are deleted. This must be done
manually.
Deployment and Scaling Guarantees
When the nginx example above is created, three Pods will be deployed in the
order web-0, web-1, web-2. web-1 will not be deployed before web-0 is
Running and Ready, and web-2 will not be deployed until web-1 is Running
and Ready. If web-0 should fail, after web-1 is Running and Ready, but
before web-2 is launched, web-2 will not be launched until web-0 is
successfully relaunched and becomes Running and Ready.
Update Strategies
In Kubernetes 1.7 and later, StatefulSet's .spec.updateStrategy field
allows you to configure and disable automated rolling updates for
containers, labels, resource request/limits, and annotations for the Pods in a
StatefulSet.
On Delete
The OnDelete update strategy implements the legacy (1.6 and prior)
behavior. When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods in a
StatefulSet. Users must manually delete Pods to cause the controller to
create new Pods that reflect modifications made to a StatefulSet's .spec.te
mplate.
Rolling Updates
The RollingUpdate update strategy implements automated, rolling update
for the Pods in a StatefulSet. It is the default strategy when .spec.updateStrategy is left unspecified. When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller will delete and recreate
each Pod in the StatefulSet. It will proceed in the same order as Pod
termination (from the largest ordinal to the smallest), updating each Pod one
at a time. It will wait until an updated Pod is Running and Ready prior to
updating its predecessor.
Partitions
The RollingUpdate update strategy can be partitioned by specifying a .spec.updateStrategy.rollingUpdate.partition. If a partition is specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's .spec.template is updated. All Pods with an ordinal that is less than the partition will not be updated and, even if they are deleted, they will be recreated at the previous version.
Forced Rollback
When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's possible to get into a broken state that requires manual intervention to repair.
In this state, it's not enough to revert the Pod template to a good
configuration. Due to a known issue, StatefulSet will continue to wait for the
broken Pod to become Ready (which never happens) before it will attempt to
revert it back to the working configuration.
After reverting the template, you must also delete any Pods that StatefulSet
had already attempted to run with the bad configuration. StatefulSet will
then begin to recreate the Pods using the reverted template.
What's next
• Follow an example of deploying a stateful application.
• Follow an example of deploying Cassandra with Stateful Sets.
• Follow an example of running a replicated stateful application.
DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes
are added to the cluster, Pods are added to them. As nodes are removed
from the cluster, those Pods are garbage collected. Deleting a DaemonSet
will clean up the Pods it created.
In a simple case, one DaemonSet, covering all nodes, would be used for each
type of daemon. A more complex setup might use multiple DaemonSets for a
single type of daemon, but with different flags and/or different memory and
cpu requests for different hardware types.
controllers/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
namespace: kube-system
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
template:
metadata:
labels:
name: fluentd-elasticsearch
spec:
tolerations:
# this toleration is to have the daemonset runnable on master nodes
# remove it if your masters can't run pods
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd-elasticsearch
image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
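Assuming the manifest above is saved as daemonset.yaml, create the DaemonSet with:
kubectl apply -f daemonset.yaml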
Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion, kind,
and metadata fields. For general information about working with config files,
see running stateless applications, configuring containers, and object
management using kubectl documents.
Pod Template
The .spec.template is one of the required fields in .spec. It is a pod template with exactly the same schema as a Pod, except it is nested and does not have an apiVersion or kind. In addition, a Pod template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.
Pod Selector
The .spec.selector field is a pod selector. It works the same as the .spec.
selector of a Job.
As of Kubernetes 1.8, you must specify a pod selector that matches the
labels of the .spec.template. The pod selector will no longer be defaulted
when left empty. Selector defaulting was not compatible with kubectl
apply. Also, once a DaemonSet is created, its .spec.selector can not be
mutated. Mutating the pod selector can lead to the unintentional orphaning
of Pods, and it was found to be confusing to users.
A DaemonSet ensures that all eligible nodes run a copy of a Pod. Normally, the node that a Pod runs on is selected by the Kubernetes scheduler. However, DaemonSet pods are created and scheduled by the DaemonSet controller instead, which introduces inconsistencies in Pod behavior and in how features such as preemption are handled. When DaemonSet Pods are scheduled by the default scheduler instead, a node affinity term like the following is added to each DaemonSet Pod (rather than using the .spec.nodeName term), and the default scheduler then binds the Pod to the target host:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- target-host-name
Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly
matching nodes and delete Pods from newly not-matching nodes.
You can modify the Pods that a DaemonSet creates. However, Pods do not
allow all fields to be updated. Also, the DaemonSet controller will use the
original template the next time a node (even with the same name) is created.
Alternatives to DaemonSet
Init scripts
It is possible to run daemon processes by directly starting them on a node (for example, using init, upstartd, or systemd). However, there are several advantages to running such processes via a DaemonSet:
• Ability to monitor and manage logs for daemons in the same way as applications.
• Same config language and tools (e.g. Pod templates, kubectl) for
daemons and applications.
• Running daemons in containers with resource limits increases isolation
between daemons from app containers. However, this can also be
accomplished by running the daemons in a container but not in a Pod
(e.g. start directly via Docker).
Bare Pods
It is possible to create Pods directly which specify a particular node to run
on. However, a DaemonSet replaces Pods that are deleted or terminated for
any reason, such as in the case of node failure or disruptive node
maintenance, such as a kernel upgrade. For this reason, you should use a
DaemonSet rather than creating individual Pods.
Static Pods
It is possible to create Pods by writing a file to a certain directory watched
by Kubelet. These are called static pods. Unlike DaemonSet, static Pods
cannot be managed with kubectl or other Kubernetes API clients. Static
Pods do not depend on the apiserver, making them useful in cluster
bootstrapping cases. Also, static Pods may be deprecated in the future.
Deployments
DaemonSets are similar to Deployments in that they both create Pods, and those Pods have processes which are not expected to terminate (e.g. web servers, storage servers). Use a Deployment for stateless services, like frontends, where scaling the number of replicas and rolling out updates are more important than controlling exactly which host the Pod runs on. Use a DaemonSet when it is important that a copy of a Pod always runs on all or certain hosts, and when it needs to start before other Pods.
Jobs
A Job creates one or more Pods and ensures that a specified number of them
successfully terminate. As pods successfully complete, the Job tracks the
successful completions. When a specified number of successful completions
is reached, the task (ie, Job) is complete. Deleting a Job will clean up the
Pods it created.
A simple case is to create one Job object in order to reliably run one Pod to
completion. The Job object will start a new Pod if the first Pod fails or is
deleted (for example due to a node hardware failure or a node reboot).
controllers/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: pi
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print
bpi(2000)"]
restartPolicy: Never
backoffLimit: 4
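Assuming the manifest above is saved as job.yaml, you can run the example with:
kubectl apply -f job.yaml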
job.batch/pi created
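Check on the status of the Job (named pi in this example) with:
kubectl describe jobs/pi
The output is similar to this: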
Name: pi
Namespace: default
Selector: controller-uid=c9948307-e56d-4b5d-8302-
ae2d7b7da67c
Labels: controller-uid=c9948307-e56d-4b5d-8302-
ae2d7b7da67c
job-name=pi
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"batch/
v1","kind":"Job","metadata":{"annotations":
{},"name":"pi","namespace":"default"},"spec":{"backoffLimit":
4,"template":...
Parallelism: 1
Completions: 1
Start Time: Mon, 02 Dec 2019 15:20:11 +0200
Completed At: Mon, 02 Dec 2019 15:21:16 +0200
Duration: 65s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
job-name=pi
Containers:
pi:
Image: perl
Port: <none>
Host Port: <none>
Command:
perl
-Mbignum=bpi
-wle
print bpi(2000)
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 14m job-controller Created pod:
pi-5rwd7
To list all the Pods that belong to a Job in a machine readable form, you can
use a command like this:
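A sketch of such a command, using the job-name=pi label shown in the output above:
pods=$(kubectl get pods --selector=job-name=pi --output=jsonpath='{.items[*].metadata.name}')
echo $pods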
pi-5rwd7
Here, the selector is the same as the selector for the Job. The --
output=jsonpath option specifies an expression that just gets the name
from each Pod in the returned list.
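To view the standard output of one of the Pods (reusing the $pods variable from the sketch above), something like this works:
kubectl logs $pods
The output is similar to this: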
3.141592653589793238462643383279502884197169399375105820974944592
30781640628620899862803482534211706798214808651328230664709384460
95505822317253594081284811174502841027019385211055596446229489549
30381964428810975665933446128475648233786783165271201909145648566
92346034861045432664821339360726024914127372458700660631558817488
15209209628292540917153643678925903600113305305488204665213841469
51941511609433057270365759591953092186117381932611793105118548074
46237996274956735188575272489122793818301194912983367336244065664
30860213949463952247371907021798609437027705392171762931767523846
74818467669405132000568127145263560827785771342757789609173637178
72146844090122495343014654958537105079227968925892354201995611212
90219608640344181598136297747713099605187072113499999983729780499
51059731732816096318595024459455346908302642522308253344685035261
93118817101000313783875288658753320838142061717766914730359825349
04287554687311595628638823537875937519577818577805321712268066130
01927876611195909216420198938095257201065485863278865936153381827
96823030195203530185296899577362259941389124972177528347913151557
48572424541506959508295331168617278558890750983817546374649393192
55060400927701671139009848824012858361603563707660104710181942955
59619894676783744944825537977472684710404753464620804668425906949
12933136770289891521047521620569660240580381501935112533824300355
87640247496473263914199272604269922796782354781636009341721641219
92458631503028618297455570674983850549458858692699569092721079750
93029553211653449872027559602364806654991198818347977535663698074
26542527862551818417574672890977772793800081647060016145249192173
21721477235014144197356854816136115735255213347574184946843852332
39073941433345477624168625189835694855620992192221842725502542568
87671790494601653466804988627232791786085784383827967976681454100
95388378636095068006422512520511739298489608412848862694560424196
52850222106611863067442786220391949450471237137869609563643719172
874677646575739624138908658326459958133904780275901
Pod Template
The .spec.template is the only required field of the .spec.
In addition to required fields for a Pod, a pod template in a Job must specify appropriate labels (see pod selector) and an appropriate restart policy. Only a RestartPolicy equal to Never or OnFailure is allowed.
Pod selector
The .spec.selector field is optional. In almost all cases you should not
specify it. See section specifying your own pod selector.
There are three main types of task suitable to run as a Job:
1. Non-parallel Jobs
◦ normally, only one Pod is started, unless the Pod fails.
◦ the Job is complete as soon as its Pod terminates successfully.
2. Parallel Jobs with a fixed completion count:
◦ specify a non-zero positive value for .spec.completions.
◦ the Job represents the overall task, and is complete when there is
one successful Pod for each value in the range 1 to .spec.completions.
◦ not implemented yet: Each Pod is passed a different index in the
range 1 to .spec.completions.
3. Parallel Jobs with a work queue:
◦ do not specify .spec.completions, default to .spec.parallelism.
◦ the Pods must coordinate amongst themselves or an external
service to determine what each should work on. For example, a
Pod might fetch a batch of up to N items from the work queue.
◦ each Pod is independently capable of determining whether or not
all its peers are done, and thus that the entire Job is done.
◦ when any Pod from the Job terminates with success, no new Pods
are created.
◦ once at least one Pod has terminated with success and all Pods are
terminated, then the Job is completed with success.
◦ once any Pod has exited with success, no other Pod should still be
doing any work for this task or writing any output. They should all
be in the process of exiting.
For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When both are unset, both are defaulted to 1.
For a fixed completion count Job, you should set .spec.completions to the
number of completions needed. You can set .spec.parallelism, or leave it
unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
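As an illustrative sketch (the name and values here are made up), a fixed completion count Job that processes five work items, at most two at a time, could look like this:
apiVersion: batch/v1
kind: Job
metadata:
  name: process-items
spec:
  completions: 5    # the Job is done after 5 successful Pod completions
  parallelism: 2    # at most 2 Pods run at the same time
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing one work item"]
      restartPolicy: Never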
For more information about how to make use of the different types of job,
see the job patterns section.
Controlling parallelism
The requested parallelism (.spec.parallelism) can be set to any non-negative value; if it is unspecified, it defaults to 1. The actual parallelism (the number of Pods running at any instant) may be more or less than the requested parallelism, for a variety of reasons:
• For fixed completion count Jobs, the actual number of pods running in
parallel will not exceed the number of remaining completions. Higher
values of .spec.parallelism are effectively ignored.
• For work queue Jobs, no new Pods are started after any Pod has
succeeded -- remaining Pods are allowed to complete, however.
• If the Job Controller has not had time to react.
• If the Job controller failed to create Pods for any reason (lack of ResourceQuota, lack of permission, etc.), then there may be fewer pods than
requested.
• The Job controller may throttle new Pod creation due to excessive
previous pod failures in the same Job.
• When a Pod is gracefully shut down, it takes time to stop.
Handling Pod and container failures
A container in a Pod may fail for a number of reasons, such as because the
process in it exited with a non-zero exit code, or the container was killed for
exceeding a memory limit, etc. If this happens, and the .spec.template.spec.restartPolicy = "OnFailure", then the Pod stays on the node, but the
container is re-run. Therefore, your program needs to handle the case when
it is restarted locally, or else specify .spec.template.spec.restartPolicy
= "Never". See pod lifecycle for more information on restartPolicy.
An entire Pod can also fail, for a number of reasons, such as when the pod is
kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a
container of the Pod fails and the .spec.template.spec.restartPolicy =
"Never". When a Pod fails, then the Job controller starts a new Pod. This
means that your application needs to handle the case when it is restarted in
a new pod. In particular, it needs to handle temporary files, locks,
incomplete output and the like caused by previous runs.
Another way to terminate a Job is to set an active deadline by specifying the .spec.activeDeadlineSeconds field; once the deadline is reached, all running Pods of the Job are terminated and the Job status becomes type: Failed with reason: DeadlineExceeded. The example below sets both a backoff limit and an active deadline:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-timeout
spec:
backoffLimit: 5
activeDeadlineSeconds: 100
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print
bpi(2000)"]
restartPolicy: Never
Note that both the Job spec and the Pod template spec within the Job have
an activeDeadlineSeconds field. Ensure that you set this field at the proper
level.
Keep in mind that the restartPolicy applies to the Pod, and not to the Job
itself: there is no automatic Job restart once the Job status is type: Failed.
That is, the Job termination mechanisms activated with .spec.activeDeadlineSeconds and .spec.backoffLimit result in a permanent Job failure that
requires manual intervention to resolve.
Clean up finished jobs automatically
Finished Jobs are usually no longer needed in the system. Keeping them
around in the system will put pressure on the API server. If the Jobs are
managed directly by a higher level controller, such as CronJobs, the Jobs can
be cleaned up by CronJobs based on the specified capacity-based cleanup
policy.
Another way to clean up finished Jobs automatically is to use a TTL mechanism provided by the TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job. When the TTL controller cleans up the Job, it will delete the Job cascadingly,
i.e. delete its dependent objects, such as Pods, together with the Job. Note
that when the Job is deleted, its lifecycle guarantees, such as finalizers, will
be honored.
For example:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-ttl
spec:
ttlSecondsAfterFinished: 100
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print
bpi(2000)"]
restartPolicy: Never
Job patterns
The Job object can be used to process a set of independent but related work items in parallel. There are several different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:
• One Job object for each work item, vs. a single Job object for all work
items. The latter is better for large numbers of work items. The former
creates some overhead for the user and for the system to manage large
numbers of Job objects.
• Number of pods created equals number of work items, vs. each Pod can
process multiple work items. The former typically requires less
modification to existing code and containers. The latter is better for
large numbers of work items, for similar reasons to the previous bullet.
• Several approaches use a work queue. This requires running a queue
service, and modifications to the existing program or container to make
it use the work queue. Other approaches are easier to adapt to an
existing containerised application.
Advanced usage
Specifying your own Pod selector
Normally, when you create a Job object, you do not specify .spec.selector.
The system defaulting logic adds this field when the Job is created. It picks a
selector value that will not overlap with any other jobs.
However, in some cases, you might need to override this automatically set
selector. To do this, you can specify the .spec.selector of the Job.
Be very careful when doing this. If you specify a label selector which is not
unique to the pods of that Job, and which matches unrelated Pods, then pods
of the unrelated job may be deleted, or this Job may count other Pods as
completing it, or one or both Jobs may refuse to create Pods or run to
completion. If a non-unique selector is chosen, then other controllers (e.g.
ReplicationController) and their Pods may behave in unpredictable ways too.
Kubernetes will not stop you from making a mistake when specifying .spec.
selector.
Here is an example of a case when you might want to use this feature.
Say Job old is already running. You want existing Pods to keep running, but
you want the rest of the Pods it creates to use a different pod template and
for the Job to have a new name. You cannot update the Job because these
fields are not updatable. Therefore, you delete Job old but leave its pods
running, using kubectl delete jobs/old --cascade=false. Before
deleting it, you make a note of what selector it uses:
kind: Job
metadata:
name: old
...
spec:
selector:
matchLabels:
controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
Then you create a new Job with name new and you explicitly specify the same selector. Since the existing Pods have label controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002, they are controlled by Job new as well.
You need to specify manualSelector: true in the new Job since you are not
using the selector that the system normally generates for you automatically.
kind: Job
metadata:
name: new
...
spec:
manualSelector: true
selector:
matchLabels:
controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002. Setting manualSelector: true tells the system that you know what you are doing and to allow this mismatch.
Alternatives
Bare Pods
When the node that a Pod is running on reboots or fails, the pod is
terminated and will not be restarted. However, a Job will create new Pods to
replace terminated ones. For this reason, we recommend that you use a Job
rather than a bare Pod, even if your application requires only a single Pod.
Replication Controller
Jobs are complementary to Replication Controllers. A Replication Controller
manages Pods which are not expected to terminate (e.g. web servers), and a
Job manages Pods that are expected to terminate (e.g. batch tasks).
As discussed in Pod Lifecycle, Job is only appropriate for pods with RestartPolicy equal to OnFailure or Never. (Note: If RestartPolicy is not set, the default value is Always.)
Cron Jobs
You can use a CronJob to create a Job that will run at specified times/dates,
similar to the Unix tool cron.
Garbage Collection
The role of the Kubernetes garbage collector is to delete certain objects that
once had an owner, but no longer have an owner.
Owners and dependents
Some Kubernetes objects are owners of other objects. For example, a
ReplicaSet is the owner of a set of Pods. The owned objects are called
dependents of the owner object. Every dependent object has a metadata.ownerReferences field that points to the owning object.
controllers/replicaset.yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: my-repset
spec:
replicas: 3
selector:
matchLabels:
pod-is-for: garbage-collection-example
template:
metadata:
labels:
pod-is-for: garbage-collection-example
spec:
containers:
- name: nginx
image: nginx
If you create the ReplicaSet and then view the Pod metadata, you can see
OwnerReferences field:
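Assuming the manifest above is saved as replicaset.yaml, you can try this with:
kubectl apply -f replicaset.yaml
kubectl get pods --output=yaml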
The output shows that the Pod owner is a ReplicaSet named my-repset:
apiVersion: v1
kind: Pod
metadata:
...
ownerReferences:
- apiVersion: apps/v1
controller: true
blockOwnerDeletion: true
kind: ReplicaSet
name: my-repset
uid: d9607e19-f88f-11e6-a518-42010a800195
...
Note: Cross-namespace owner references are disallowed by design.
Once the "deletion in progress" state is set, the garbage collector deletes the
object's dependents. Once the garbage collector has deleted all "blocking"
dependents (objects with ownerReference.blockOwnerDeletion=true), it
deletes the owner object.
Known issues
Tracked at #26120
What's next
Design Doc 1
Design Doc 2
TTL Controller for Finished Resources
The TTL controller provides a TTL (time to live) mechanism to limit the
lifetime of resource objects that have finished execution. TTL controller only
handles Jobs for now, and may be expanded to handle other resources that
will finish execution, such as Pods and custom resources.
Alpha Disclaimer: this feature is currently alpha, and can be enabled with both kube-apiserver and kube-controller-manager feature gate TTLAfterFinished.
TTL Controller
The TTL controller only supports Jobs for now. A cluster operator can use
this feature to clean up finished Jobs (either Complete or Failed)
automatically by specifying the .spec.ttlSecondsAfterFinished field of a
Job, as in this example. The TTL controller will assume that a resource is
eligible to be cleaned up TTL seconds after the resource has finished, in
other words, when the TTL has expired. When the TTL controller cleans up a
resource, it will delete it cascadingly, that is to say it will delete its
dependent objects together with it. Note that when the resource is deleted,
its lifecycle guarantees, such as finalizers, will be honored.
The TTL seconds can be set at any time. Here are some examples for setting
the .spec.ttlSecondsAfterFinished field of a Job:
• Specify this field in the resource manifest, so that a Job can be cleaned
up automatically some time after it finishes.
• Set this field of existing, already finished resources, to adopt this new
feature.
• Use a mutating admission webhook to set this field dynamically at
resource creation time. Cluster administrators can use this to enforce a
TTL policy for finished resources.
• Use a mutating admission webhook to set this field dynamically after
the resource has finished, and choose different TTL values based on
resource status, labels, etc.
Caveat
Updating TTL Seconds
Note that the TTL period, e.g. .spec.ttlSecondsAfterFinished field of
Jobs, can be modified after the resource is created or has finished. However,
once the Job becomes eligible to be deleted (when the TTL has expired), the
system won't guarantee that the Jobs will be kept, even if an update to
extend the TTL returns a successful API response.
Time Skew
Because TTL controller uses timestamps stored in the Kubernetes resources
to determine whether the TTL has expired or not, this feature is sensitive to
time skew in the cluster, which may cause TTL controller to clean up
resource objects at the wrong time.
In Kubernetes, it's required to run NTP on all nodes (see #6159) to avoid
time skew. Clocks aren't always correct, but the difference should be very
small. Please be aware of this risk when setting a non-zero TTL.
What's next
• Clean up Jobs automatically
• Design doc
CronJob
FEATURE STATE: Kubernetes v1.8 [beta]
One CronJob object is like one line of a crontab (cron table) file. It runs a job
periodically on a given schedule, written in Cron format.
Caution:
When creating the manifest for a CronJob resource, make sure the name you
provide is a valid DNS subdomain name. The name must be no longer than
52 characters. This is because the CronJob controller will automatically
append 11 characters to the job name provided and there is a constraint
that the maximum length of a Job name is no more than 63 characters.
CronJob
CronJobs are useful for creating periodic and recurring tasks, like running
backups or sending emails. CronJobs can also schedule individual tasks for a
specific time, such as scheduling a Job for when your cluster is likely to be
idle.
Example
This example CronJob manifest prints the current time and a hello message
every minute:
application/job/cronjob.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: hello
spec:
schedule: "*/1 * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: hello
image: busybox
imagePullPolicy: IfNotPresent
args:
- /bin/sh
- -c
- date; echo Hello from the Kubernetes cluster
restartPolicy: OnFailure
(Running Automated Tasks with a CronJob takes you through this example in
more detail).
CronJob limitations
A cron job creates a job object about once per execution time of its schedule.
We say "about" because there are certain circumstances where two jobs
might be created, or no job might be created. We attempt to make these
rare, but do not completely prevent them. Therefore, jobs should be
idempotent.
For every CronJob, the CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs an error.
For example, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00, and its startingDeadlineSeconds field is not set. If the CronJob controller happens to be down from 08:29:00 to 10:21:00, the job will not start, as the number of jobs which missed their schedule is greater than 100.
To illustrate this concept further, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00, and its startingDeadlineSeconds is set to 200 seconds. If the CronJob controller happens to be down for the same period as the previous example (08:29:00 to 10:21:00), the Job will still start at 10:22:00. This happens because the controller now checks how many missed schedules happened in the last 200 seconds (that is, 3 missed schedules), rather than from the last scheduled time until now.
The CronJob is only responsible for creating Jobs that match its schedule,
and the Job in turn is responsible for the management of the Pods it
represents.
New controller
There's an alternative implementation of the CronJob controller, available as
an alpha feature since Kubernetes 1.20. To select version 2 of the CronJob
controller, pass the following feature gate flag to the kube-controller-
manager.
--feature-gates="CronJobControllerV2=true"
What's next
Cron expression format documents the format of CronJob schedule fields.
For instructions on creating and working with cron jobs, and for an example
of CronJob manifest, see Running automated tasks with cron jobs.
ReplicationController
Note: A Deployment that configures a ReplicaSet is now the
recommended way to set up replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one time; it makes sure that a pod or a homogeneous set of pods is always up and available. The following example ReplicationController config runs three copies of the nginx web server:
controllers/replication.yaml
apiVersion: v1
kind: ReplicationController
metadata:
name: nginx
spec:
replicas: 3
selector:
app: nginx
template:
metadata:
name: nginx
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
Run the example by downloading the example file and then running this command:
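Assuming the file is saved as replication.yaml, that command is:
kubectl apply -f replication.yaml
The output is similar to this: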
replicationcontroller/nginx created
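Check on the status of the ReplicationController with:
kubectl describe replicationcontrollers/nginx
The output is similar to this: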
Name: nginx
Namespace: default
Selector: app=nginx
Labels: app=nginx
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 0 Running / 3 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
FirstSeen  LastSeen  Count  From                      SubobjectPath  Type    Reason            Message
---------  --------  -----  ----                      -------------  ----    ------            -------
20s        20s       1      {replication-controller }                Normal  SuccessfulCreate  Created pod: nginx-qrm3m
20s        20s       1      {replication-controller }                Normal  SuccessfulCreate  Created pod: nginx-3ntk0
20s        20s       1      {replication-controller }                Normal  SuccessfulCreate  Created pod: nginx-4ok8v
Here, three pods are created, but none is running yet, perhaps because the image is being pulled. A little later, the same command may show that all three pods are Running.
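To list all the pods that belong to the ReplicationController in a machine readable form, you can use a command like this (the app=nginx selector matches the one shown above):
pods=$(kubectl get pods --selector=app=nginx --output=jsonpath={.items..metadata.name})
echo $pods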
Here, the selector is the same as the selector for the ReplicationController
(seen in the kubectl describe output), and in a different form in replication.yaml. The --output=jsonpath option specifies an expression that just gets the name from each pod in the returned list.
Pod Template
The .spec.template is the only required field of the .spec.
Pod Selector
The .spec.selector field is a label selector. A ReplicationController
manages all the pods with labels that match the selector. It does not
distinguish between pods that it created or deleted and pods that another
person or process created or deleted. This allows the ReplicationController
to be replaced without affecting the running pods.
Also you should not normally create any pods whose labels match this
selector, either directly, with another ReplicationController, or with another
controller such as Job. If you do so, the ReplicationController thinks that it
created the other pods. Kubernetes does not stop you from doing this.
Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas to the number of pods you would like to have running concurrently.
The number running at any time may be higher or lower, such as if the
replicas were just increased or decreased, or if a pod is gracefully shutdown,
and a replacement starts early.
Deleting a ReplicationController and its Pods
To delete a ReplicationController and all its pods, use kubectl delete; kubectl scales the ReplicationController to zero and waits for it to delete each pod before deleting the ReplicationController itself. When using the REST API or go client library, you need to do the steps explicitly (scale replicas to 0, wait for pod deletions, then delete the ReplicationController).
Scaling
The ReplicationController makes it easy to scale the number of replicas up
or down, either manually or by an auto-scaling control agent, by simply
updating the replicas field.
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a
service by replacing pods one-by-one.
Ideally, the rolling update controller would take application readiness into
account, and would ensure that a sufficient number of pods were
productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one
differentiating label, such as the image tag of the primary container of the
pod, since it is typically image updates that motivate rolling updates.
Multiple release tracks
In addition to running multiple releases of an application while a rolling
update is in progress, it's common to run multiple releases for an extended
period of time, or even continuously, using multiple release tracks. The
tracks would be differentiated by labels.
For instance, a service might target all pods with tier in (frontend),
environment in (prod). Now say you have 10 replicated pods that make
up this tier. But you want to be able to 'canary' a new version of this
component. You could set up a ReplicationController with replicas set to 9
for the bulk of the replicas, with labels tier=frontend,
environment=prod, track=stable, and another ReplicationController with
replicas set to 1 for the canary, with labels tier=frontend,
environment=prod, track=canary. Now the service is covering both the
canary and non-canary pods. But you can mess with the
ReplicationControllers separately to test things out, monitor the results, etc.
API Object
Replication controller is a top-level resource in the Kubernetes REST API.
More details about the API object can be found at: ReplicationController API
object.
Alternatives to ReplicationController
ReplicaSet
ReplicaSet is the next-generation ReplicationController that supports the
new set-based label selector. It's mainly used by Deployment as a
mechanism to orchestrate pod creation, deletion and updates. Note that we
recommend using Deployments instead of directly using Replica Sets, unless
you require custom update orchestration or don't require updates at all.
Deployment (Recommended)
Deployment is a higher-level API object that updates its underlying Replica
Sets and their Pods. Deployments are recommended if you want this rolling
update functionality because, they are declarative, server-side, and have
additional features.
Bare Pods
Unlike in the case where a user directly created pods, a
ReplicationController replaces pods that are deleted or terminated for any
reason, such as in the case of node failure or disruptive node maintenance,
such as a kernel upgrade. For this reason, we recommend that you use a
ReplicationController even if your application requires only a single pod.
Think of it similarly to a process supervisor, only it supervises multiple pods
across multiple nodes instead of individual processes on a single node. A
ReplicationController delegates local container restarts to some agent on
the node (for example, Kubelet or Docker).
Job
Use a Job instead of a ReplicationController for pods that are expected to
terminate on their own (that is, batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicationController for pods that provide a
machine-level function, such as machine monitoring or machine logging.
These pods have a lifetime that is tied to a machine lifetime: the pod needs to be running on the machine before other pods start, and it is safe to terminate when the machine is otherwise ready to be rebooted or shut down.
Service
Service Topology
EndpointSlices
Ingress
Ingress Controllers
Network Policies
IPv4/IPv6 dual-stack
Service
An abstract way to expose an application running on a set of Pods as a
network service.
Motivation
Kubernetes Pods are created and destroyed to match the state of your
cluster. Pods are nonpermanent resources. If you use a Deployment to run
your app, it can create and destroy Pods dynamically.
Each Pod gets its own IP address; however, in a Deployment, the set of Pods running at one moment in time could be different from the set of Pods running that application a moment later.
This leads to a problem: if some set of Pods (call them "backends") provides
functionality to other Pods (call them "frontends") inside your cluster, how
do the frontends find out and keep track of which IP address to connect to,
so that the frontend can use the backend part of the workload?
Enter Services.
Service resources
Defining a Service
For example, suppose you have a set of Pods that each listen on TCP port
9376 and carry a label app=MyApp:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
Port definitions in Pods have names, and you can reference these names in
the targetPort attribute of a Service. This works even if there is a mixture
of Pods in the Service using a single configured name, with the same
network protocol available via different port numbers. This offers a lot of
flexibility for deploying and evolving your Services. For example, you can
change the port numbers that Pods expose in the next version of your
backend software, without breaking clients.
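As a sketch, here is a Pod that names its container port and a Service that refers to that port by name (the names used here are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: MyApp
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - name: http-web-svc        # named container port
      containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: http-web-svc    # refers to the port name, not the number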
The default protocol for Services is TCP; you can also use any other
supported protocol.
As many Services need to expose more than one port, Kubernetes supports
multiple port definitions on a Service object. Each port definition can have
the same protocol, or a different one.
Services most commonly abstract access to Kubernetes Pods, but they can also abstract other kinds of backends: for example, an external database cluster, a Service in a different namespace or cluster, or a workload that is only partially migrated to Kubernetes. In any of these scenarios you can define a Service without a Pod selector. For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9376
Because this Service has no selector, the corresponding Endpoints object is not created automatically. You can manually map the Service to the network address and port where it's running, by adding an Endpoints object manually:

apiVersion: v1
kind: Endpoints
metadata:
name: my-service
subsets:
- addresses:
- ip: 192.0.2.42
ports:
- port: 9376
The name of the Endpoints object must be a valid DNS subdomain name.
Note:
The endpoint IPs must not be: loopback (127.0.0.0/8 for IPv4, ::1/128 for IPv6), or link-local (169.254.0.0/16 and 224.0.0.0/24 for IPv4, fe80::/64 for IPv6).
EndpointSlices
Application protocol
This field follows standard Kubernetes label syntax. Values should either be IANA standard service names or domain prefixed names such as mycompany.com/my-custom-protocol.
A question that pops up every now and then is why Kubernetes relies on
proxying to forward inbound traffic to backends. What about other
approaches? For example, would it be possible to configure DNS records
that have multiple A values (or AAAA for IPv6), and rely on round-robin
name resolution?
User space proxy mode
In this mode, kube-proxy watches the Kubernetes master for the addition
and removal of Service and Endpoint objects. For each Service it opens a
port (randomly chosen) on the local node. Any connections to this "proxy
port" are proxied to one of the Service's backend Pods (as reported via
Endpoints). kube-proxy takes the SessionAffinity setting of the Service
into account when deciding which backend Pod to use.
Lastly, the user-space proxy installs iptables rules which capture traffic to the Service's clusterIP (which is virtual) and port. The rules redirect that traffic to the proxy port, which proxies it to a backend Pod.
[Diagram: userspace proxy mode. The client connects to the Service's clusterIP; iptables rules on the Node redirect the traffic to kube-proxy, which proxies it to a backend Pod.]
iptables proxy mode
In this mode, kube-proxy watches the Kubernetes control plane for the
addition and removal of Service and Endpoint objects. For each Service, it
installs iptables rules, which capture traffic to the Service's clusterIP and port, and redirect that traffic to one of the Service's backend sets. For each
Endpoint object, it installs iptables rules which select a backend Pod.
If kube-proxy is running in iptables mode and the first Pod that's selected
does not respond, the connection fails. This is different from userspace
mode: in that scenario, kube-proxy would detect that the connection to the
first Pod had failed and would automatically retry with a different backend
Pod.
You can use Pod readiness probes to verify that backend Pods are working
OK, so that kube-proxy in iptables mode only sees backends that test out as
healthy. Doing this means you avoid having traffic sent via kube-proxy to a
Pod that's known to have failed.
[Diagram: iptables proxy mode. The client connects to the Service's clusterIP; iptables rules installed by kube-proxy redirect the traffic directly to one of the backend Pods (labelled app=MyApp, listening on port 9376).]
IPVS proxy mode
The IPVS proxy mode is based on netfilter hook functions, similar to iptables mode, but it uses a hash table as the underlying data structure and works in kernel space. That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic.
IPVS provides more options for balancing traffic to backend Pods; these are:
• rr: round-robin
• lc: least connection (smallest number of open connections)
• dh: destination hashing
• sh: source hashing
• sed: shortest expected delay
• nq: never queue
Note: To run kube-proxy in IPVS mode, you must make IPVS available on the node before starting kube-proxy. When kube-proxy starts in IPVS proxy mode, it verifies whether the IPVS kernel modules are available; if they are not detected, kube-proxy falls back to running in iptables proxy mode.
[Diagram: IPVS proxy mode. The client connects to the Service's clusterIP, which IPVS treats as a virtual server; IPVS forwards the traffic to one of the backend Pods (the real servers).]
In these proxy models, the traffic bound for the Service's IP:Port is proxied
to an appropriate backend without the clients knowing anything about
Kubernetes or Services or Pods.
If you want to make sure that connections from a particular client are passed to the same Pod each time, you can select session affinity based on the client's IP address by setting service.spec.sessionAffinity to "ClientIP" (the default is "None"). You can also set the maximum session sticky time by setting service.spec.sessionAffinityConfig.clientIP.timeoutSeconds appropriately (the default value is 10800, which works out to be 3 hours).
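For example, a sketch of a Service that pins each client IP to the same backend Pod for one hour:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600   # maximum session sticky time, in seconds
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376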
Multi-Port Services
For some Services, you need to expose more than one port. Kubernetes lets
you configure multiple port definitions on a Service object. When using
multiple ports for a Service, you must give all of your ports names so that
these are unambiguous. For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- name: http
protocol: TCP
port: 80
targetPort: 9376
- name: https
protocol: TCP
port: 443
targetPort: 9377
Note: As with Kubernetes names in general, names for ports must only contain lowercase alphanumeric characters and -, and must begin and end with an alphanumeric character. For example, the names 123-abc and web are valid, but 123_abc and -web are not.
Choosing your own IP address
You can specify your own cluster IP address as part of a Service creation request. To do this, set the .spec.clusterIP field. You might do this, for example, if you already have an existing DNS entry that you wish to reuse, or if you have legacy systems that are configured for a specific IP address and are difficult to reconfigure.
The IP address that you choose must be a valid IPv4 or IPv6 address from
within the service-cluster-ip-range CIDR range that is configured for the
API server. If you try to create a Service with an invalid clusterIP address
value, the API server will return a 422 HTTP status code to indicate that
there's a problem.
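For example, a sketch of a Service that requests a specific cluster IP (the address shown is only illustrative and must fall inside your cluster's service-cluster-ip-range):
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  clusterIP: 10.96.100.100   # must be within the configured service CIDR
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376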
Discovering services
Environment variables
When a Pod is run on a Node, the kubelet adds a set of environment
variables for each active Service. It supports both Docker links compatible
variables (see makeLinkVariables) and simpler {SVCNAME}_SERVICE_HOST
and {SVCNAME}_SERVICE_PORT variables, where the Service name is upper-
cased and dashes are converted to underscores.
For example, the Service redis-master which exposes TCP port 6379 and
has been allocated cluster IP address 10.0.0.11, produces the following
environment variables:
REDIS_MASTER_SERVICE_HOST=10.0.0.11
REDIS_MASTER_SERVICE_PORT=6379
REDIS_MASTER_PORT=tcp://10.0.0.11:6379
REDIS_MASTER_PORT_6379_TCP=tcp://10.0.0.11:6379
REDIS_MASTER_PORT_6379_TCP_PROTO=tcp
REDIS_MASTER_PORT_6379_TCP_PORT=6379
REDIS_MASTER_PORT_6379_TCP_ADDR=10.0.0.11
Note:
When you have a Pod that needs to access a Service, and you are
using the environment variable method to publish the port and
cluster IP to the client Pods, you must create the Service before
the client Pods come into existence. Otherwise, those client Pods
won't have their environment variables populated.
If you only use DNS to discover the cluster IP for a Service, you
don't need to worry about this ordering issue.
DNS
You can (and almost always should) set up a DNS service for your
Kubernetes cluster using an add-on.
Kubernetes also supports DNS SRV (Service) records for named ports. If the my-service.my-ns Service has a port named http with the protocol set to TCP, you can do a DNS SRV query for _http._tcp.my-service.my-ns to discover the port number for http, as well as the IP address.
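For instance, from a Pod whose image includes standard DNS tooling you could look the record up like this (a sketch; my-service and my-ns are the placeholder names used above, and the lookup relies on the cluster's DNS search path):
nslookup -type=SRV _http._tcp.my-service.my-ns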
Headless Services
Sometimes you don't need load-balancing and a single Service IP. In this
case, you can create what are termed "headless" Services, by explicitly
specifying "None" for the cluster IP (.spec.clusterIP).
You can use a headless Service to interface with other service discovery
mechanisms, without being tied to Kubernetes' implementation.
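A minimal headless Service sketch (the name and selector are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None     # makes the Service headless
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376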
With selectors
For headless Services that define selectors, the endpoints controller creates
Endpoints records in the API, and modifies the DNS configuration to return
records (addresses) that point directly to the Pods backing the Service.
Without selectors
For headless Services that do not define selectors, the endpoints controller does not create Endpoints records. However, the DNS system looks for and configures either:
• CNAME records for ExternalName-type Services.
• A records for any Endpoints that share a name with the Service, for all other types.
Publishing Services (ServiceTypes)
For some parts of your application (for example, frontends) you may want to expose a Service onto an external IP address, one that's outside of your cluster.
You can also use Ingress to expose your Service. Ingress is not a Service
type, but it acts as the entry point for your cluster. It lets you consolidate
your routing rules into a single resource as it can expose multiple services
under the same IP address.
Type NodePort
If you set the type field to NodePort, the Kubernetes control plane allocates
a port from a range specified by --service-node-port-range flag (default:
30000-32767). Each node proxies that port (the same port number on every
Node) into your Service. Your Service reports the allocated port in its .spec.ports[*].nodePort field.
If you want to specify particular IP(s) to proxy the port, you can set the --nodeport-addresses flag in kube-proxy to particular IP block(s); this is supported since Kubernetes v1.10. This flag takes a comma-delimited list of IP blocks (e.g. 10.0.0.0/8, 192.0.2.0/25) to specify IP address ranges that kube-proxy should consider as local to this node.
For example, if you start kube-proxy with the --nodeport-addresses=127.0.0.0/8 flag, kube-proxy only selects the loopback interface for NodePort Services. The default for --nodeport-addresses is an empty list. This means that kube-proxy should consider all available network interfaces for NodePort. (That's also compatible with earlier Kubernetes releases.)
If you want a specific port number, you can specify a value in the nodePort
field. The control plane will either allocate you that port or report that the
API transaction failed. This means that you need to take care of possible
port collisions yourself. You also have to use a valid port number, one that's
inside the range configured for NodePort use.
Using a NodePort gives you the freedom to set up your own load balancing
solution, to configure environments that are not fully supported by
Kubernetes, or even to just expose one or more nodes' IPs directly.
For example:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  selector:
    app: MyApp
  ports:
      # By default and for convenience, the `targetPort` is set to the same value as the `port` field.
    - port: 80
      targetPort: 80
      # Optional field
      # By default and for convenience, the Kubernetes control plane will allocate a port from a range (default: 30000-32767)
      nodePort: 30007
Type LoadBalancer
On cloud providers which support external load balancers, setting the type
field to LoadBalancer provisions a load balancer for your Service. The actual
creation of the load balancer happens asynchronously, and information about the provisioned balancer is published in the Service's .status.loadBalancer field. For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
clusterIP: 10.0.171.239
type: LoadBalancer
status:
loadBalancer:
ingress:
- ip: 192.0.2.127
Traffic from the external load balancer is directed at the backend Pods. The
cloud provider decides how it is load balanced.
Note:
By default, for LoadBalancer type of Services, when there is more than one
port defined, all ports must have the same protocol, and the protocol must
be one which is supported by the cloud provider.
Note: The set of protocols that can be used for LoadBalancer type
of Services is still defined by the cloud provider.
Starting in v1.20, you can optionally disable node port allocation for a Service of Type=LoadBalancer by setting the field spec.allocateLoadBalancerNodePorts to false. This should only be used for load balancer implementations that route traffic directly to pods as opposed to using node ports. By default, spec.allocateLoadBalancerNodePorts is true and type LoadBalancer Services will continue to allocate node ports. If spec.allocateLoadBalancerNodePorts is set to false on an existing Service with allocated node ports, those node ports will NOT be de-allocated automatically. You must explicitly remove the nodePorts entry in every Service port to de-allocate those node ports. You must enable the ServiceLBNodePortControl feature gate to use this field.
To set an internal load balancer, add one of the following annotations to your
Service depending on the cloud Service provider you're using.
• Default
• GCP
• AWS
• Azure
• IBM Cloud
• OpenStack
• Baidu Cloud
• Tencent Cloud
• Alibaba Cloud
[...]
metadata:
  name: my-service
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
[...]
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
[...]
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
[...]
[...]
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/ibm-load-balancer-cloud-provider-ip-type: "private"
[...]
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
[...]
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/cce-load-balancer-internal-vpc: "true"
[...]
[...]
metadata:
  annotations:
    service.kubernetes.io/qcloud-loadbalancer-internal-subnetid: subnet-xxxxx
[...]
[...]
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "intranet"
[...]
For partial TLS / SSL support on clusters running on AWS, you can add three
annotations to a LoadBalancer service:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
The first specifies the ARN of the certificate to use. It can be either a
certificate from a third party issuer that was uploaded to IAM or one created
within AWS Certificate Manager.
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: (https|http|ssl|tcp)
The second annotation specifies which protocol a Pod speaks. For HTTPS
and SSL, the ELB expects the Pod to authenticate itself over the encrypted
connection, using a certificate.
HTTP and HTTPS selects layer 7 proxying: the ELB terminates the
connection with the user, parses headers, and injects the X-Forwarded-For
header with the user's IP address (Pods only see the IP address of the ELB
at the other end of its connection) when forwarding requests.
TCP and SSL selects layer 4 proxying: the ELB forwards traffic without
modifying the headers.
In a mixed-use environment where some ports are secured and others are
left unencrypted, you can use the following annotations:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443,8443"
In the above example, if the Service contained three ports, 80, 443, and
8443, then 443 and 8443 would use the SSL certificate, but 80 would just be
proxied HTTP.
From Kubernetes v1.9 onwards you can use predefined AWS SSL policies
with HTTPS or SSL listeners for your Services. To see which policies are
available for use, you can use the aws command line tool:
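For instance (a sketch assuming the Classic ELB API and a configured AWS CLI):
aws elb describe-load-balancer-policies --query 'PolicyDescriptions[].PolicyName'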
You can then specify any one of those policies using the "service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy" annotation; for example:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: "ELBSecurityPolicy-TLS-1-2-2017-01"
To enable PROXY protocol support for clusters running on AWS, you can use
the following service annotation:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
Since version 1.3.0, the use of this annotation applies to all ports proxied by
the ELB and cannot be configured otherwise.
There are several annotations to manage access logs for ELB Services on
AWS.
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    # Specifies whether access logs are enabled for the load balancer
    service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval: "60"
    # The interval for publishing the access logs. You can specify an interval of either 5 or 60 (minutes).
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name: "my-bucket"
    # The name of the Amazon S3 bucket where the access logs are stored
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix: "my-bucket-prefix/prod"
    # The logical hierarchy you created for your Amazon S3 bucket, for example `my-bucket-prefix/prod`
Connection draining for Classic ELBs can be managed with the annotation service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled set to the value of "true". The annotation service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout can also be used to set the maximum time, in seconds, to keep the existing connections open before deregistering the instances.
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
There are other annotations to manage Classic Elastic Load Balancers that
are described below.
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    # The time, in seconds, that the connection is allowed to be idle (no data has been sent
    # over the connection) before it is closed by the load balancer

    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    # Specifies whether cross-zone load balancing is enabled for the load balancer

    service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: "environment=prod,owner=devops"
    # A comma-separated list of key-value pairs which will be recorded as
    # additional tags in the ELB.

    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: ""
    # The number of successive successful health checks required for a backend to
    # be considered healthy for traffic. Defaults to 2, must be between 2 and 10

    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
    # The number of unsuccessful health checks required for a backend to be
    # considered unhealthy for traffic. Defaults to 6, must be between 2 and 10

    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
    # The approximate interval, in seconds, between health checks of an
    # individual instance. Defaults to 10, must be between 5 and 300

    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
    # The amount of time, in seconds, during which no response means a failed
    # health check. This value must be less than the service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval
    # value. Defaults to 5, must be between 2 and 60

    service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-53fae93f"
    # A list of existing security groups to be added to ELB created. Unlike the annotation
    # service.beta.kubernetes.io/aws-load-balancer-extra-security-groups, this replaces all other
    # security groups previously assigned to the ELB.

    service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: "sg-53fae93f,sg-42efd82e"
    # A list of additional security groups to be added to the ELB

    service.beta.kubernetes.io/aws-load-balancer-target-node-labels: "ingress-gw,gw-name=public-api"
    # A comma separated list of key-value pairs which are used
    # to select the target nodes for the load balancer
To use a Network Load Balancer on AWS instead of a Classic ELB, set the service.beta.kubernetes.io/aws-load-balancer-type annotation to "nlb":

metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
Note: NLB only works with certain instance classes; see the AWS
documentation on Elastic Load Balancing for a list of supported
instance types.
You can also use NLB Services with the internal load balancer annotation.
In order for client traffic to reach instances behind an NLB, the Node security groups are modified with IP rules that permit health checks, client traffic, and MTU discovery. In order to limit which client IPs can access the Network Load Balancer, specify loadBalancerSourceRanges.
spec:
loadBalancerSourceRanges:
- "143.231.0.0/16"
There are other annotations for managing Cloud Load Balancers on TKE as
shown below.
metadata:
  name: my-service
  annotations:
    # Bind Loadbalancers with specified nodes
    service.kubernetes.io/qcloud-loadbalancer-backends-label: key in (value1, value2)
Type ExternalName
Services of type ExternalName map a Service to a DNS name rather than to a typical selector. This Service definition, for example, maps the my-service Service in the prod namespace to my.database.example.com:
apiVersion: v1
kind: Service
metadata:
name: my-service
namespace: prod
spec:
type: ExternalName
externalName: my.database.example.com
Warning: You may have trouble using ExternalName for some common protocols, including HTTP and HTTPS, because the hostname used by clients inside your cluster is different from the name that the ExternalName references. For protocols that use hostnames, this difference may lead to errors or unexpected responses.
External IPs
If there are external IPs that route to one or more cluster nodes, Kubernetes
Services can be exposed on those externalIPs. Traffic that ingresses into
the cluster with the external IP (as destination IP), on the Service port, will
be routed to one of the Service endpoints. externalIPs are not managed by
Kubernetes and are the responsibility of the cluster administrator.
In the Service spec, externalIPs can be specified along with any of the ServiceTypes. In the example below, "my-service" can be accessed by clients on "80.11.12.10:80" (externalIP:port).
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- name: http
protocol: TCP
port: 80
targetPort: 9376
externalIPs:
- 80.11.12.10
Shortcomings
Using the userspace proxy for VIPs works at small to medium scale, but will
not scale to very large clusters with thousands of Services. The original
design proposal for portals has more details on this.
The Type field is designed as nested functionality - each level adds to the previous. This is not strictly required on all cloud providers (e.g. Google Compute Engine does not need to allocate a NodePort to make LoadBalancer work, but AWS does) but the current API requires it.
Virtual IP implementation
The previous information should be sufficient for many people who just want
to use Services. However, there is a lot going on behind the scenes that may
be worth understanding.
Avoiding collisions
In order to allow you to choose a port number for your Services, we must
ensure that no two Services can collide. Kubernetes does that by allocating
each Service its own IP address.
Service IP addresses
Userspace
When a client connects to the Service's virtual IP address, the iptables rule
kicks in, and redirects the packets to the proxy's own port. The "Service
proxy" chooses a backend, and starts proxying traffic from the client to the
backend.
This means that Service owners can choose any port they want without risk
of collision. Clients can simply connect to an IP and port, without being
aware of which Pods they are actually accessing.
iptables
Again, consider the image processing application described above. When the
backend Service is created, the Kubernetes control plane assigns a virtual IP
address, for example 10.0.0.1. Assuming the Service port is 1234, the
Service is observed by all of the kube-proxy instances in the cluster. When a
proxy sees a new Service, it installs a series of iptables rules which redirect
from the virtual IP address to per-Service rules. The per-Service rules link to
per-Endpoint rules which redirect traffic (using destination NAT) to the
backends.
When a client connects to the Service's virtual IP address the iptables rule
kicks in. A backend is chosen (either based on session affinity or randomly)
and packets are redirected to the backend. Unlike the userspace proxy,
packets are never copied to userspace, the kube-proxy does not have to be
running for the virtual IP address to work, and Nodes see traffic arriving
from the unaltered client IP address.
This same basic flow executes when traffic comes in through a node-port or
through a load-balancer, though in those cases the client IP does get altered.
IPVS
iptables operations slow down dramatically in a large scale cluster, e.g. 10,000 Services. IPVS is designed for load balancing and based on in-kernel hash tables. So you can achieve performance consistency across a large number of Services with IPVS-based kube-proxy. Meanwhile, IPVS-based kube-proxy has more sophisticated load balancing algorithms (least conns, locality, weighted, persistence).
API Object
Service is a top-level resource in the Kubernetes REST API. You can find
more details about the API object at: Service API object.
Supported protocols
TCP
You can use TCP for any kind of Service, and it's the default network
protocol.
UDP
You can use UDP for most Services. For type=LoadBalancer Services, UDP
support depends on the cloud provider offering this facility.
SCTP
When using a network plugin that supports SCTP traffic, you can use SCTP
for most Services. For type=LoadBalancer Services, SCTP support depends
on the cloud provider offering this facility. (Most do not).
Warnings
Support for multihomed SCTP associations
Warning:
Windows
Userspace kube-proxy
HTTP
If your cloud provider supports it, you can use a Service in LoadBalancer
mode to set up external HTTP / HTTPS reverse proxying, forwarded to the
Endpoints of the Service.
Note: You can also use Ingress in place of Service to expose HTTP/
HTTPS Services.
PROXY protocol
If your cloud provider supports it, you can use a Service in LoadBalancer
mode to configure a load balancer outside of Kubernetes itself, that will
forward connections prefixed with PROXY protocol.
The load balancer will send an initial series of octets describing the
incoming connection, similar to this example
Service Topology
FEATURE STATE: Kubernetes v1.17 [alpha]
Service Topology enables a service to route traffic based upon the Node
topology of the cluster. For example, a service can specify that traffic be
preferentially routed to endpoints that are on the same Node as the client,
or in the same availability zone.
Introduction
By default, traffic sent to a ClusterIP or NodePort Service may be routed to
any backend address for the Service. Since Kubernetes 1.7 it has been
possible to route "external" traffic to the Pods running on the Node that
received the traffic, but this is not supported for ClusterIP Services, and
more complex topologies — such as routing zonally — have not been
possible. The Service Topology feature resolves this by allowing the Service
creator to define a policy for routing traffic based upon the Node labels for
the originating and destination Nodes.
By using Node label matching between the source and destination, the
operator may designate groups of Nodes that are "closer" and "farther" from
one another, using whatever metric makes sense for that operator's
requirements. For many operators in public clouds, for example, there is a
preference to keep service traffic within the same zone, because interzonal
traffic has a cost associated with it, while intrazonal traffic does not. Other
common needs include being able to route traffic to a local Pod managed by
a DaemonSet, or keeping traffic to Nodes connected to the same top-of-rack
switch for the lowest latency.
Consider a cluster with Nodes that are labeled with their hostname, zone name, and region name. Then you can set the topologyKeys values of a Service to direct traffic as in the examples below.
Constraints
• Service topology is not compatible with
externalTrafficPolicy=Local, and therefore a Service cannot use
both of these features. It is possible to use both features in the same
cluster on different Services, just not on the same Service.
• Topology keys must be valid label keys and at most 16 keys may be
specified.
• The catch-all value, "*", must be the last value in the topology keys, if
it is used.
Examples
The following are common examples of using the Service Topology feature.
A Service that prefers node-local endpoints, but falls back to any endpoint in the cluster if none are available on the same node:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
- "*"
A Service that only routes to endpoints within the client's zone or, failing that, the client's region (traffic fails if neither is available):
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "topology.kubernetes.io/zone"
- "topology.kubernetes.io/region"
A Service that prefers node-local endpoints, then zonal, then regional endpoints, and finally falls back to any endpoint in the cluster:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
- "topology.kubernetes.io/zone"
- "topology.kubernetes.io/region"
- "*"
What's next
• Read about enabling Service Topology
• Read Connecting Applications with Services
DNS for Services and Pods
Introduction
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and
configures the kubelets to tell individual containers to use the DNS Service's
IP to resolve DNS names.
The following sections detail the supported DNS record types and layout. Any other layout, names, or queries that happen to work are considered implementation details and are subject to change without warning. For the most up-to-date specification, see Kubernetes DNS-Based Service Discovery.
Services
A/AAAA records
"Normal" (not headless) Services are assigned a DNS A or AAAA record,
depending on the IP family of the service, for a name of the form my-
svc.my-namespace.svc.cluster-domain.example. This resolves to the
cluster IP of the Service.
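For example, from inside a Pod with DNS tooling you could resolve such a name like this (a sketch; my-svc, my-namespace, and cluster-domain.example are the placeholders used above):
nslookup my-svc.my-namespace.svc.cluster-domain.example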
Pods
A/AAAA records
In general a Pod has the following DNS resolution:
pod-ip-address.my-namespace.pod.cluster-domain.example.
For example, a Pod with IP 172.17.0.3 in the default namespace, in a cluster whose domain is cluster.local, would have the DNS name:
172-17-0-3.default.pod.cluster.local.
Any Pods created by a Deployment or DaemonSet and exposed by a Service have the following DNS resolution available:
pod-ip-address.deployment-name.my-namespace.svc.cluster-domain.example.
The Pod spec has an optional hostname field, which can be used to specify the Pod's hostname. When specified, it takes precedence over the Pod's name as the hostname of the Pod. For example, given a Pod with hostname set to "my-host", the Pod will have its hostname set to "my-host".
The Pod spec also has an optional subdomain field which can be used to specify its subdomain. For example, a Pod with hostname set to "foo", and subdomain set to "bar", in namespace "my-namespace", will have the fully qualified domain name (FQDN) "foo.bar.my-namespace.svc.cluster-domain.example".
Example:
apiVersion: v1
kind: Service
metadata:
name: default-subdomain
spec:
selector:
name: busybox
clusterIP: None
ports:
- name: foo # Actually, no port is needed.
port: 1234
targetPort: 1234
---
apiVersion: v1
kind: Pod
metadata:
name: busybox1
labels:
name: busybox
spec:
hostname: busybox-1
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
---
apiVersion: v1
kind: Pod
metadata:
name: busybox2
labels:
name: busybox
spec:
hostname: busybox-2
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
If there exists a headless Service in the same namespace as the Pod and with the same name as the subdomain, the cluster's DNS server also returns an A or AAAA record for the Pod's fully qualified hostname. For example, given a Pod with the hostname set to "busybox-1" and the subdomain set to "default-subdomain", and a headless Service named "default-subdomain" in the same namespace, the Pod will see its own FQDN as "busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example". DNS serves an A or AAAA record at that name, pointing to the Pod's IP. Both Pods "busybox1" and "busybox2" can have their distinct A or AAAA records.
The Endpoints object can specify the hostname for any endpoint addresses,
along with its IP.
Note: Because A or AAAA records are not created for Pod names, hostname is required for the Pod's A or AAAA record to be created. A Pod with no hostname but with subdomain will only create the A or AAAA record for the headless Service (default-subdomain.my-namespace.svc.cluster-domain.example), pointing to the Pod's IP address. Also, the Pod needs to become ready in order to have a record, unless publishNotReadyAddresses=True is set on the Service.
When a Pod is configured to have a fully qualified domain name (FQDN), its hostname is the short hostname. For example, if you have a Pod with the fully qualified domain name busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example, then by default the hostname command inside that Pod returns busybox-1 and the hostname --fqdn command returns the FQDN.
When you set setHostnameAsFQDN: true in the Pod spec, the kubelet writes
the Pod's FQDN into the hostname for that Pod's namespace. In this case,
both hostname and hostname --fqdn return the Pod's FQDN.
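A sketch of a Pod spec using this field (the Pod, label, and subdomain names are the ones from the example above):
apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  labels:
    name: busybox
spec:
  hostname: busybox-1
  subdomain: default-subdomain
  setHostnameAsFQDN: true    # the hostname inside the Pod becomes the FQDN
  containers:
  - image: busybox:1.28
    command: ["sleep", "3600"]
    name: busybox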
Note: If a Pod enables setHostnameAsFQDN and its FQDN is longer than 64 characters (the Linux kernel limit for a hostname), the Pod will fail to start.
Pod's DNS Policy
DNS policies can be set on a per-Pod basis. Kubernetes supports the following Pod-specific DNS policies, specified in the dnsPolicy field of a Pod spec:
• "Default": The Pod inherits the name resolution configuration from the
node that the pods run on. See related discussion for more details.
• "ClusterFirst": Any DNS query that does not match the configured
cluster domain suffix, such as "www.kubernetes.io", is forwarded to
the upstream nameserver inherited from the node. Cluster
administrators may have extra stub-domain and upstream DNS servers
configured. See related discussion for details on how DNS queries are
handled in those cases.
• "ClusterFirstWithHostNet": For Pods running with hostNetwork, you
should explicitly set its DNS policy "ClusterFirstWithHostNet".
• "None": It allows a Pod to ignore DNS settings from the Kubernetes
environment. All DNS settings are supposed to be provided using the d
nsConfig field in the Pod Spec. See Pod's DNS config subsection below.
The example below shows a Pod with its DNS policy set to "ClusterFirstWithHostNet" because it has hostNetwork set to true.
apiVersion: v1
kind: Pod
metadata:
name: busybox
namespace: default
spec:
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
imagePullPolicy: IfNotPresent
name: busybox
restartPolicy: Always
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
The dnsConfig field is optional and it can work with any dnsPolicy settings.
However, when a Pod's dnsPolicy is set to "None", the dnsConfig field has
to be specified.
The properties a user can specify in the dnsConfig field are nameservers (a list of up to 3 DNS server IP addresses), searches (a list of DNS search domains for hostname lookup), and options (an optional list of objects where each object has a name and an optional value). For example:
service/networking/custom-dns.yaml
apiVersion: v1
kind: Pod
metadata:
namespace: default
name: dns-example
spec:
containers:
- name: test
image: nginx
dnsPolicy: "None"
dnsConfig:
nameservers:
- 1.2.3.4
searches:
- ns1.svc.cluster-domain.example
- my.dns.search.suffix
options:
- name: ndots
value: "2"
- name: edns0
When the Pod above is created, the container test gets the following
contents in its /etc/resolv.conf file:
nameserver 1.2.3.4
search ns1.svc.cluster-domain.example my.dns.search.suffix
options ndots:2 edns0
For IPv6 setup, the search path and name server should be set up like this:
kubectl exec -it dns-example -- cat /etc/resolv.conf
nameserver fd00:79:30::a
search default.svc.cluster-domain.example svc.cluster-
domain.example cluster-domain.example
options ndots:5
Feature availability
Pod DNS Config and the "None" DNS policy are generally available in current Kubernetes releases.
What's next
For guidance on administering DNS configurations, check Configure DNS
Service
Connecting Applications with
Services
The Kubernetes model for connecting
containers
Now that you have a continuously running, replicated application you can
expose it on a network. Before discussing the Kubernetes approach to
networking, it is worthwhile to contrast it with the "normal" way networking
works with Docker.
service/networking/run-my-nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
This makes it accessible from any node in your cluster. Check the nodes the
Pod is running on:
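For instance (a sketch assuming the Deployment manifest above is saved as run-my-nginx.yaml):
kubectl apply -f ./run-my-nginx.yaml
kubectl get pods -l run=my-nginx -o wide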
You should be able to ssh into any node in your cluster and curl both IPs.
Note that the containers are not using port 80 on the node, nor are there
any special NAT rules to route traffic to the pod. This means you can run
multiple nginx pods on the same node all using the same containerPort and
access them from any other pod or node in your cluster using IP. Like
Docker, ports can still be published to the host node's interfaces, but the
need for this is radically diminished because of the networking model.
You can read more about how we achieve this if you're curious.
Creating a Service
So we have pods running nginx in a flat, cluster wide, address space. In
theory, you could talk to these pods directly, but what happens when a node
dies? The pods die with it, and the Deployment will create new ones, with
different IPs. This is the problem a Service solves.
You can create a Service for your 2 nginx replicas with kubectl expose:
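For example (the Deployment above is named my-nginx):
kubectl expose deployment/my-nginx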
service/my-nginx exposed
service/networking/nginx-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
ports:
- port: 80
protocol: TCP
selector:
run: my-nginx
This specification will create a Service which targets TCP port 80 on any Pod with the run: my-nginx label, and expose it on an abstracted Service port (targetPort: is the port the container accepts traffic on, port: is the abstracted Service port, which can be any port other pods use to access the Service). View the Service API object to see the list of supported fields in a service definition. Check your Service:
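For example (assuming the Service was created in the default namespace):
kubectl describe svc my-nginx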
Name: my-nginx
Namespace: default
Labels: run=my-nginx
Annotations: <none>
Selector: run=my-nginx
Type: ClusterIP
IP: 10.0.162.149
Port: <unset> 80/TCP
Endpoints: 10.244.2.5:80,10.244.3.4:80
Session Affinity: None
Events: <none>
Environment Variables
When a Pod runs on a Node, the kubelet adds a set of environment variables
for each active Service. This introduces an ordering problem. To see why,
inspect the environment of your running nginx Pods (your Pod name will be
different):
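For example (substitute one of your own Pod names):
kubectl exec <pod-name> -- printenv | grep SERVICE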
KUBERNETES_SERVICE_HOST=10.0.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
Note there's no mention of your Service. This is because you created the
replicas before the Service. Another disadvantage of doing this is that the
scheduler might put both Pods on the same machine, which will take your
entire Service down if it dies. We can do this the right way by killing the 2
Pods and waiting for the Deployment to recreate them. This time around the
Service exists before the replicas. This will give you scheduler-level Service
spreading of your Pods (provided all your nodes have equal capacity), as
well as the right environment variables:
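One way to do that (a sketch, assuming the Deployment is named my-nginx):
kubectl scale deployment my-nginx --replicas=0
kubectl scale deployment my-nginx --replicas=2
kubectl get pods -l run=my-nginx -o wide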
You may notice that the pods have different names, since they are killed and
recreated.
KUBERNETES_SERVICE_PORT=443
MY_NGINX_SERVICE_HOST=10.0.162.149
KUBERNETES_SERVICE_HOST=10.0.0.1
MY_NGINX_SERVICE_PORT=80
KUBERNETES_SERVICE_PORT_HTTPS=443
DNS
Kubernetes offers a DNS cluster addon Service that automatically assigns
dns names to other Services. You can check if it's running on your cluster:
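For example (the add-on Service is named kube-dns and lives in the kube-system namespace):
kubectl get services kube-dns --namespace=kube-system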
The rest of this section will assume you have a Service with a long lived IP
(my-nginx), and a DNS server that has assigned a name to that IP. Here we
use the CoreDNS cluster addon (application name kube-dns), so you can
talk to the Service from any pod in your cluster using standard methods (e.g.
gethostbyname()). If CoreDNS isn't running, you can enable it by referring to the CoreDNS README or Installing CoreDNS. Let's run another curl application to test this:
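A sketch of one way to do this, using a throwaway interactive Pod:
kubectl run curl --image=radial/busyboxplus:curl -i --tty
# then, at the resulting shell prompt:
nslookup my-nginx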
Name: my-nginx
Address 1: 10.0.162.149
Securing the Service
So far we have only accessed the nginx server from within the cluster. Before exposing the Service to the internet, you want to make sure the communication channel is secure. For this, you will need:
• Self signed certificates for https (unless you already have an identity certificate)
• An nginx server configured to use the certificates
• A secret that makes the certificates accessible to pods
You can acquire all these from the nginx https example. This requires having
go and make tools installed. If you don't want to install those, then follow
the manual steps later. In short:
secret/nginxsecret created
NAME                  TYPE                                   DATA   AGE
default-token-il9rc   kubernetes.io/service-account-token   1      1d
nginxsecret           kubernetes.io/tls                     2      1m
configmap/nginxconfigmap created
The following are the manual steps to follow in case you run into problems running make (on Windows, for example):
Use the output from the previous commands to create a yaml file as follows.
The base64 encoded value should all be on a single line.
apiVersion: "v1"
kind: "Secret"
metadata:
name: "nginxsecret"
namespace: "default"
type: kubernetes.io/tls
data:
tls.crt: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURIekNDQWdlZ
0F3SUJBZ0lKQUp5M3lQK0pzMlpJTUEwR0NTcUdTSWIzRFFFQkJRVUFNQ1l4RVRBUE
JnTlYKQkFNVENHNW5hVzU0YzNaak1SRXdEd1lEVlFRS0V3aHVaMmx1ZUhOMll6QWV
GdzB4TnpFd01qWXdOekEzTVRKYQpGdzB4T0RFd01qWXdOekEzTVRKYU1DWXhFVEFQ
QmdOVkJBTVRDRzVuYVc1NGMzWmpNUkV3RHdZRFZRUUtFd2h1CloybHVlSE4yWXpDQ
0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSjFxSU1SOV
dWM0IKMlZIQlRMRmtobDRONXljMEJxYUhIQktMSnJMcy8vdzZhU3hRS29GbHlJSU9
4NGUrMlN5ajBFcndCLzlYTnBwbQppeW1CL3JkRldkOXg5UWhBQUxCZkVaTmNiV3Ns
TVFVcnhBZW50VWt1dk1vLzgvMHRpbGhjc3paenJEYVJ4NEo5Ci82UVRtVVI3a0ZTW
UpOWTVQZkR3cGc3dlVvaDZmZ1Voam92VG42eHNVR0M2QURVODBpNXFlZWhNeVI1N2
lmU2YKNHZpaXdIY3hnL3lZR1JBRS9mRTRqakxCdmdONjc2SU90S01rZXV3R0ljNDF
hd05tNnNTSzRqYUNGeGpYSnZaZQp2by9kTlEybHhHWCtKT2l3SEhXbXNhdGp4WTRa
NVk3R1ZoK0QrWnYvcW1mMFgvbVY0Rmo1NzV3ajFMWVBocWtsCmdhSXZYRyt4U1FVQ
0F3RUFBYU5RTUU0d0hRWURWUjBPQkJZRUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYn
hFVjcKTUI4R0ExVWRJd1FZTUJhQUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYnhFVjd
NQXdHQTFVZEV3UUZNQU1CQWY4dwpEUVlKS29aSWh2Y05BUUVGQlFBRGdnRUJBRVhT
MW9FU0lFaXdyMDhWcVA0K2NwTHI3TW5FMTducDBvMm14alFvCjRGb0RvRjdRZnZqe
E04Tzd2TjB0clcxb2pGSW0vWDE4ZnZaL3k4ZzVaWG40Vm8zc3hKVmRBcStNZC9jTS
tzUGEKNmJjTkNUekZqeFpUV0UrKzE5NS9zb2dmOUZ3VDVDK3U2Q3B5N0M3MTZvUXR
UakViV05VdEt4cXI0Nk1OZWNCMApwRFhWZmdWQTRadkR4NFo3S2RiZDY5eXM3OVFH
Ymg5ZW1PZ05NZFlsSUswSGt0ejF5WU4vbVpmK3FqTkJqbWZjCkNnMnlwbGQ0Wi8rU
UNQZjl3SkoybFIrY2FnT0R4elBWcGxNSEcybzgvTHFDdnh6elZPUDUxeXdLZEtxaU
MwSVEKQ0I5T2wwWW5scE9UNEh1b2hSUzBPOStlMm9KdFZsNUIyczRpbDlhZ3RTVXF
xUlU9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
tls.key: "LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBT
kJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQ2RhaURFZlZsZH
dkbFIKd1V5eFpJWmVEZWNuTkFhbWh4d1NpeWF5N1AvOE9ta3NVQ3FCWmNpQ0RzZUh
2dGtzbzlCSzhBZi9WemFhWm9zcApnZjYzUlZuZmNmVUlRQUN3WHhHVFhHMXJKVEVG
SzhRSHA3VkpMcnpLUC9QOUxZcFlYTE0yYzZ3MmtjZUNmZitrCkU1bEVlNUJVbUNUV
09UM3c4S1lPNzFLSWVuNEZJWTZMMDUrc2JGQmd1Z0ExUE5JdWFubm9UTWtlZTRuMG
4rTDQKb3NCM01ZUDhtQmtRQlAzeE9JNHl3YjREZXUraURyU2pKSHJzQmlIT05Xc0R
adXJFaXVJMmdoY1kxeWIyWHI2UAozVFVOcGNSbC9pVG9zQngxcHJHclk4V09HZVdP
eGxZZmcvbWIvNnBuOUYvNWxlQlkrZStjSTlTMkQ0YXBKWUdpCkwxeHZzVWtGQWdNQ
kFBRUNnZ0VBZFhCK0xkbk8ySElOTGo5bWRsb25IUGlHWWVzZ294RGQwci9hQ1Zkan
k4dlEKTjIwL3FQWkUxek1yall6Ry9kVGhTMmMwc0QxaTBXSjdwR1lGb0xtdXlWTjl
tY0FXUTM5SjM0VHZaU2FFSWZWNgo5TE1jUHhNTmFsNjRLMFRVbUFQZytGam9QSFlh
UUxLOERLOUtnNXNrSE5pOWNzMlY5ckd6VWlVZWtBL0RBUlBTClI3L2ZjUFBacDRuR
WVBZmI3WTk1R1llb1p5V21SU3VKdlNyblBESGtUdW1vVlVWdkxMRHRzaG9reUxiTW
VtN3oKMmJzVmpwSW1GTHJqbGtmQXlpNHg0WjJrV3YyMFRrdWtsZU1jaVlMbjk4QWx
iRi9DSmRLM3QraTRoMTVlR2ZQegpoTnh3bk9QdlVTaDR2Q0o3c2Q5TmtEUGJvS2Jn
eVVHOXBYamZhRGR2UVFLQmdRRFFLM01nUkhkQ1pKNVFqZWFKClFGdXF4cHdnNzhZT
jQyL1NwenlUYmtGcVFoQWtyczJxWGx1MDZBRzhrZzIzQkswaHkzaE9zSGgxcXRVK3
NHZVAKOWRERHBsUWV0ODZsY2FlR3hoc0V0L1R6cEdtNGFKSm5oNzVVaTVGZk9QTDh
PTm1FZ3MxMVRhUldhNzZxelRyMgphRlpjQ2pWV1g0YnRSTHVwSkgrMjZnY0FhUUtC
Z1FEQmxVSUUzTnNVOFBBZEYvL25sQVB5VWs1T3lDdWc3dmVyClUycXlrdXFzYnBkS
i9hODViT1JhM05IVmpVM25uRGpHVHBWaE9JeXg5TEFrc2RwZEFjVmxvcG9HODhXYk
9lMTAKMUdqbnkySmdDK3JVWUZiRGtpUGx1K09IYnRnOXFYcGJMSHBzUVpsMGhucDB
YSFNYVm9CMUliQndnMGEyOFVadApCbFBtWmc2d1BRS0JnRHVIUVV2SDZHYTNDVUsx
NFdmOFhIcFFnMU16M2VvWTBPQm5iSDRvZUZKZmcraEppSXlnCm9RN3hqWldVR3BIc
3AyblRtcHErQWlSNzdyRVhsdlhtOElVU2FsbkNiRGlKY01Pc29RdFBZNS9NczJMRm
5LQTQKaENmL0pWb2FtZm1nZEN0ZGtFMXNINE9MR2lJVHdEbTRpb0dWZGIwMllnbzF
yb2htNUpLMUI3MkpBb0dBUW01UQpHNDhXOTVhL0w1eSt5dCsyZ3YvUHM2VnBvMjZl
TzRNQ3lJazJVem9ZWE9IYnNkODJkaC8xT2sybGdHZlI2K3VuCnc1YytZUXRSTHlhQ
md3MUtpbGhFZDBKTWU3cGpUSVpnQWJ0LzVPbnlDak9OVXN2aDJjS2lrQ1Z2dTZsZl
BjNkQKckliT2ZIaHhxV0RZK2Q1TGN1YSt2NzJ0RkxhenJsSlBsRzlOZHhrQ2dZRUF
5elIzT3UyMDNRVVV6bUlCRkwzZAp4Wm5XZ0JLSEo3TnNxcGFWb2RjL0d5aGVycjFD
ZzE2MmJaSjJDV2RsZkI0VEdtUjZZdmxTZEFOOFRwUWhFbUtKCnFBLzVzdHdxNWd0W
GVLOVJmMWxXK29xNThRNTBxMmk1NVdUTThoSDZhTjlaMTltZ0FGdE5VdGNqQUx2dF
YxdEYKWSs4WFJkSHJaRnBIWll2NWkwVW1VbGc9Ci0tLS0tRU5EIFBSSVZBVEUgS0V
ZLS0tLS0K"
NAME                  TYPE                                   DATA   AGE
default-token-il9rc   kubernetes.io/service-account-token   1      1d
nginxsecret           kubernetes.io/tls                     2      1m
Now modify your nginx replicas to start an https server using the certificate
in the secret, and the Service, to expose both ports (80 and 443):
service/networking/nginx-secure-app.yaml
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
type: NodePort
ports:
- port: 8080
targetPort: 80
protocol: TCP
name: http
- port: 443
protocol: TCP
name: https
selector:
run: my-nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 1
template:
metadata:
labels:
run: my-nginx
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
- name: configmap-volume
configMap:
name: nginxconfigmap
containers:
- name: nginxhttps
image: bprashanth/nginxhttps:1.0
ports:
- containerPort: 443
- containerPort: 80
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
- mountPath: /etc/nginx/conf.d
name: configmap-volume
At this point you can reach the nginx server from any node.
Note how we supplied the -k parameter to curl in the last step; this is because we don't know anything about the pods running nginx at certificate generation time, so we have to tell curl to ignore the CName mismatch. By creating a Service we linked the CName used in the certificate with the actual DNS name used by pods during Service lookup. Let's test this from a pod (the same secret is being reused for simplicity; the pod only needs nginx.crt to access the Service):
service/networking/curlpod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: curl-deployment
spec:
selector:
matchLabels:
app: curlpod
replicas: 1
template:
metadata:
labels:
app: curlpod
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
containers:
- name: curlpod
command:
- sh
- -c
- while true; do sleep 1; done
image: radial/busyboxplus:curl
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
kubectl apply -f ./curlpod.yaml
kubectl get pods -l app=curlpod
$ curl https://<EXTERNAL-IP>:<NODE-PORT> -k
...
<h1>Welcome to nginx!</h1>
Let's now recreate the Service to use a cloud load balancer, just change the
Type of my-nginx Service from NodePort to LoadBalancer:
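For example (a sketch; change the type interactively and then read back the external IP):
kubectl edit svc my-nginx
# change .spec.type from NodePort to LoadBalancer in the editor, then:
kubectl get svc my-nginx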
curl https://<EXTERNAL-IP> -k
...
<title>Welcome to nginx!</title>
The IP address in the EXTERNAL-IP column is the one that is available on the
public internet. The CLUSTER-IP is only available inside your cluster/private
cloud network.
Note that on AWS, type LoadBalancer creates an ELB, which uses a (long)
hostname, not an IP. It's too long to fit in the standard kubectl get svc
output, in fact, so you'll need to do kubectl describe service my-nginx to
see it. You'll see something like this:
What's next
• Learn more about Using a Service to Access an Application in a Cluster
• Learn more about Connecting a Front End to a Back End Using a
Service
• Learn more about Creating an External Load Balancer
EndpointSlices
FEATURE STATE: Kubernetes v1.17 [beta]
Motivation
The Endpoints API has provided a simple and straightforward way of
tracking network endpoints in Kubernetes. Unfortunately as Kubernetes
clusters and Services have grown to handle and send more traffic to more
backend Pods, limitations of that original API became more visible. Most
notably, those included challenges with scaling to larger numbers of network
endpoints.
Since all network endpoints for a Service were stored in a single Endpoints
resource, those resources could get quite large. That affected the
performance of Kubernetes components (notably the master control plane)
and resulted in significant amounts of network traffic and processing when
Endpoints changed. EndpointSlices help you mitigate those issues as well as
provide an extensible platform for additional features such as topological
routing.
EndpointSlice resources
In Kubernetes, an EndpointSlice contains references to a set of network
endpoints. The control plane automatically creates EndpointSlices for any
Kubernetes Service that has a selector specified. These EndpointSlices
include references to all the Pods that match the Service selector.
EndpointSlices group network endpoints together by unique combinations of protocol, port number, and Service name. The name of an EndpointSlice object must be a valid DNS subdomain name.
As an example, here's a sample EndpointSlice resource for a Kubernetes Service named example.
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
name: example-abc
labels:
kubernetes.io/service-name: example
addressType: IPv4
ports:
- name: http
protocol: TCP
port: 80
endpoints:
- addresses:
- "10.1.2.3"
conditions:
ready: true
hostname: pod-1
topology:
kubernetes.io/hostname: node-1
topology.kubernetes.io/zone: us-west2-a
EndpointSlices can act as the source of truth for kube-proxy when it comes
to how to route internal traffic. When enabled, they should provide a
performance improvement for services with large numbers of endpoints.
Address types
EndpointSlices support three address types:
• IPv4
• IPv6
• FQDN (Fully Qualified Domain Name)
Conditions
The EndpointSlice API stores conditions about endpoints that may be useful
for consumers. The three conditions are ready, serving, and terminating.
Ready
Serving
serving is identical to the ready condition, except it does not account for
terminating states. Consumers of the EndpointSlice API should check this
condition if they care about pod readiness while the pod is also terminating.
Terminating
Topology information
FEATURE STATE: Kubernetes v1.20 [deprecated]
Each endpoint within an EndpointSlice can carry topology information in the kubernetes.io/hostname, topology.kubernetes.io/zone, and topology.kubernetes.io/region labels. The values of these labels are derived from resources associated with each endpoint in a slice. The hostname label represents the value of the NodeName field on the corresponding Pod. The zone and region labels represent the value of the labels with the same names on the corresponding Node.
Management
Most often, the control plane (specifically, the endpoint slice controller)
creates and manages EndpointSlice objects. There are a variety of other use
cases for EndpointSlices, such as service mesh implementations, that could
result in other entities or controllers managing additional sets of
EndpointSlices.
Ownership
In most use cases, EndpointSlices are owned by the Service that the
endpoint slice object tracks endpoints for. This ownership is indicated by an
owner reference on each EndpointSlice as well as a kubernetes.io/
service-name label that enables simple lookups of all EndpointSlices
belonging to a Service.
EndpointSlice mirroring
In some cases, applications create custom Endpoints resources. To ensure
that these applications do not need to concurrently write to both Endpoints
and EndpointSlice resources, the cluster's control plane mirrors most
Endpoints resources to corresponding EndpointSlices.
The control plane tries to fill EndpointSlices as full as possible, but does not actively rebalance them. The logic is fairly straightforward:
1. Iterate through existing EndpointSlices, remove endpoints that are no longer desired, and update matching endpoints that have changed.
2. Iterate through the EndpointSlices modified in the first step and fill them up with any new endpoints needed.
3. If there are still new endpoints left to add, try to fit them into a previously unchanged slice and/or create new ones.
In practice, this less than ideal distribution should be rare. Most changes
processed by the EndpointSlice controller will be small enough to fit in an
existing EndpointSlice, and if not, a new EndpointSlice is likely going to be
necessary soon anyway. Rolling updates of Deployments also provide a
natural repacking of EndpointSlices with all Pods and their corresponding
endpoints getting replaced.
Duplicate endpoints
Due to the nature of EndpointSlice changes, endpoints may be represented
in more than one EndpointSlice at the same time. This naturally occurs as
changes to different EndpointSlice objects can arrive at the Kubernetes
client watch/cache at different times. Implementations using EndpointSlices
must be able to handle an endpoint appearing in more than one slice. A
reference implementation of how to perform endpoint deduplication can be
found in the EndpointSliceCache implementation in kube-proxy.
What's next
• Learn about Enabling EndpointSlices
• Read Connecting Applications with Services
Ingress
FEATURE STATE: Kubernetes v1.19 [stable]
Terminology
For clarity, this guide defines the following terms:
What is Ingress?
Ingress exposes HTTP and HTTPS routes from outside the cluster to
services within the cluster. Traffic routing is controlled by rules defined on
the Ingress resource.
A simple example is an Ingress that sends all its traffic to one Service; see
the "Ingress backed by a single Service" section below.
Prerequisites
You must have an Ingress controller to satisfy an Ingress. Only creating an
Ingress resource has no effect. You may need to deploy an Ingress controller
such as ingress-nginx. You can choose from a number of Ingress controllers.
The Ingress resource
A minimal Ingress resource example:
service/networking/minimal-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minimal-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
rules:
- http:
paths:
- path: /testpath
pathType: Prefix
backend:
service:
name: test
port:
number: 80
The Ingress spec has all the information needed to configure a load balancer
or proxy server. Most importantly, it contains a list of rules matched against
all incoming requests. An Ingress resource only supports rules for directing
HTTP(S) traffic.
Ingress rules
DefaultBackend
An Ingress with no rules sends all traffic to a single default backend. The
defaultBackend is conventionally a configuration option of the Ingress
controller and is not specified in your Ingress resources.
If none of the hosts or paths match the HTTP request in the Ingress objects,
the traffic is routed to your default backend.
Resource backends
A Resource backend is a reference to another Kubernetes resource within the
same namespace as the Ingress object. A common use case is an Ingress that
serves static assets from an object storage backend, as in the following
example:
service/networking/ingress-resource-backend.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-resource-backend
spec:
defaultBackend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: static-assets
rules:
- http:
paths:
- path: /icons
pathType: ImplementationSpecific
backend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: icon-assets
After creating the Ingress above, you can view it with the following
command:
kubectl describe ingress ingress-resource-backend
Name: ingress-resource-backend
Namespace: default
Address:
Default backend: APIGroup: k8s.example.com, Kind:
StorageBucket, Name: static-assets
Rules:
Host Path Backends
---- ---- --------
*
/icons APIGroup: k8s.example.com, Kind:
StorageBucket, Name: icon-assets
Annotations: <none>
Events: <none>
Path types
Each path in an Ingress is required to have a corresponding path type. Paths
that do not include an explicit pathType will fail validation. There are three
supported path types:
• ImplementationSpecific: With this path type, matching is up to the
IngressClass. Implementations can treat this as a separate pathType or
treat it identically to the Prefix or Exact path types.
• Exact: Matches the URL path exactly and with case sensitivity.
• Prefix: Matches based on a URL path prefix split by /. Matching is case
sensitive and done on a path element by element basis.
Examples
Kind    Path(s)                      Request path(s)  Matches?
Prefix  /                            (all paths)      Yes
Exact   /foo                         /foo             Yes
Exact   /foo                         /bar             No
Exact   /foo                         /foo/            No
Exact   /foo/                        /foo             No
Prefix  /foo                         /foo, /foo/      Yes
Prefix  /foo/                        /foo, /foo/      Yes
Prefix  /aaa/bb                      /aaa/bbb         No
Prefix  /aaa/bbb                     /aaa/bbb         Yes
Prefix  /aaa/bbb/                    /aaa/bbb         Yes, ignores trailing slash
Prefix  /aaa/bbb                     /aaa/bbb/        Yes, matches trailing slash
Prefix  /aaa/bbb                     /aaa/bbb/ccc     Yes, matches subpath
Prefix  /aaa/bbb                     /aaa/bbbxyz      No, does not match string prefix
Prefix  /, /aaa                      /aaa/ccc         Yes, matches /aaa prefix
Prefix  /, /aaa, /aaa/bbb            /aaa/bbb         Yes, matches /aaa/bbb prefix
Prefix  /, /aaa, /aaa/bbb            /ccc             Yes, matches / prefix
Prefix  /aaa                         /ccc             No, uses default backend
Mixed   /foo (Prefix), /foo (Exact)  /foo             Yes, prefers Exact
Multiple matches
In some cases, multiple paths within an Ingress will match a request. In
those cases precedence is given first to the longest matching path. If two
paths are still equally matched, precedence is given to paths with an exact
path type over prefix path type.
Hostname wildcards
Hosts can be precise matches (for example "foo.bar.com") or a wildcard (for
example "*.foo.com"). Precise matches require that the HTTP host header
matches the host field. Wildcard matches require the HTTP host header to be
equal to the suffix of the wildcard rule.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-wildcard-host
spec:
rules:
- host: "foo.bar.com"
http:
paths:
- pathType: Prefix
path: "/bar"
backend:
service:
name: service1
port:
number: 80
- host: "*.foo.com"
http:
paths:
- pathType: Prefix
path: "/foo"
backend:
service:
name: service2
port:
number: 80
Ingress class
Ingresses can be implemented by different controllers, often with different
configuration. Each Ingress should specify a class, a reference to an
IngressClass resource that contains additional configuration including the
name of the controller that should implement the class.
service/networking/external-lb.yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb
spec:
controller: example.com/ingress-controller
parameters:
apiGroup: k8s.example.com
kind: IngressParameters
name: external-lb
Deprecated annotation
Before the IngressClass resource and ingressClassName field were added in
Kubernetes 1.18, Ingress classes were specified with a
kubernetes.io/ingress.class annotation on the Ingress. This annotation was
never formally defined, but was widely supported by Ingress controllers.
Default IngressClass
You can mark a particular IngressClass as default for your cluster. Setting
the ingressclass.kubernetes.io/is-default-class annotation to true on
an IngressClass resource will ensure that new Ingresses without an
ingressClassName field specified will be assigned this default IngressClass.
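For example, the following sketch marks an IngressClass as the cluster
default; it reuses the external-lb class shown above, and the annotation
value must be the quoted string "true":
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
  annotations:
    # makes this the IngressClass used when ingressClassName is omitted
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  controller: example.com/ingress-controller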
Types of Ingress
Ingress backed by a single Service
There are existing Kubernetes concepts that allow you to expose a single
Service (see alternatives). You can also do this with an Ingress by specifying
a default backend with no rules.
service/networking/test-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: test-ingress
spec:
defaultBackend:
service:
name: test
port:
number: 80
If you create it using kubectl apply -f you should be able to view the state
of the Ingress you just added by running kubectl get ingress test-ingress.
Simple fanout
A fanout configuration routes traffic from a single IP address to more than
one Service, based on the HTTP URI being requested. For example:
service/networking/simple-fanout-example.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: simple-fanout-example
spec:
rules:
- host: foo.bar.com
http:
paths:
- path: /foo
pathType: Prefix
backend:
service:
name: service1
port:
number: 4200
- path: /bar
pathType: Prefix
backend:
service:
name: service2
port:
number: 8080
kubectl describe ingress simple-fanout-example
Name: simple-fanout-example
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:4200 (10.8.0.90:4200)
/bar service2:8080 (10.8.0.91:8080)
Events:
Type Reason Age From
Message
---- ------ ---- ----
-------
Normal ADD 22s loadbalancer-controller
default/test
Note: Depending on the Ingress controller you are using, you may
need to create a default-http-backend Service.
Name based virtual hosting
Name-based virtual hosts support routing HTTP traffic to multiple host names
at the same IP address. The following Ingress tells the backing load balancer
to route requests based on the Host header.
service/networking/name-virtual-host-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress
spec:
rules:
- host: foo.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: bar.foo.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
If you create an Ingress resource without any hosts defined in the rules,
then any web traffic to the IP address of your Ingress controller can be
matched without a name based virtual host being required.
For example, the following Ingress routes traffic requested for
first.bar.com to service1, second.bar.com to service2, and any traffic
whose request host header doesn't match first.bar.com or second.bar.com
to service3.
service/networking/name-virtual-host-ingress-no-third-host.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress-no-third-host
spec:
rules:
- host: first.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: second.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
- http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service3
port:
number: 80
TLS
You can secure an Ingress by specifying a Secret that contains a TLS private
key and certificate. The Ingress resource only supports a single TLS port,
443, and assumes TLS termination at the ingress point (traffic to the Service
and its Pods is in plaintext). If the TLS configuration section in an Ingress
specifies different hosts, they are multiplexed on the same port according to
the hostname specified through the SNI TLS extension (provided the Ingress
controller supports SNI). The TLS secret must contain keys named tls.crt
and tls.key that contain the certificate and private key to use for TLS. For
example:
apiVersion: v1
kind: Secret
metadata:
name: testsecret-tls
namespace: default
data:
tls.crt: base64 encoded cert
tls.key: base64 encoded key
type: kubernetes.io/tls
Note: Keep in mind that TLS will not work on the default rule
because the certificates would have to be issued for all the
possible sub-domains. Therefore, hosts in the tls section need to
explicitly match the host in the rules section.
service/networking/tls-example-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: tls-example-ingress
spec:
tls:
- hosts:
- https-example.foo.com
secretName: testsecret-tls
rules:
- host: https-example.foo.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: service1
port:
number: 80
Load balancing
An Ingress controller is bootstrapped with some load balancing policy
settings that it applies to all Ingress, such as the load balancing algorithm,
backend weight scheme, and others. More advanced load balancing concepts
(for example, persistent sessions or dynamic weights) are not yet exposed
through the Ingress. You can instead get those features through the load
balancer used for a Service.
It's also worth noting that even though health checks are not exposed
directly through the Ingress, there exist parallel concepts in Kubernetes
such as readiness probes that allow you to achieve the same end result.
Please review the controller specific documentation to see how they handle
health checks (for example: nginx, or GCE).
Updating an Ingress
To update an existing Ingress to add a new Host, you can update it by editing
the resource:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From
Message
---- ------ ---- ----
-------
Normal ADD 35s loadbalancer-controller
default/test
kubectl edit ingress test
This pops up an editor with the existing configuration in YAML format.
Modify it to include the new Host:
spec:
rules:
- host: foo.bar.com
http:
paths:
- backend:
service:
name: service1
port:
number: 80
path: /foo
pathType: Prefix
- host: bar.baz.com
http:
paths:
- backend:
service:
name: service2
port:
number: 80
path: /foo
pathType: Prefix
..
After you save your changes, kubectl updates the resource in the API server,
which tells the Ingress controller to reconfigure the load balancer.
Verify this:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
bar.baz.com
/foo service2:80 (10.8.0.91:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From
Message
---- ------ ---- ----
-------
Normal ADD 45s loadbalancer-controller
default/test
Failing across availability zones
Techniques for spreading traffic across failure domains differ between cloud
providers. Please check the documentation of the relevant Ingress controller
for details.
Alternatives
You can expose a Service in multiple ways that don't directly involve the
Ingress resource:
• Use Service.Type=LoadBalancer
• Use Service.Type=NodePort
What's next
• Learn about the Ingress API
• Learn about Ingress controllers
• Set up Ingress on Minikube with the NGINX Controller
Ingress Controllers
In order for the Ingress resource to work, the cluster must have an ingress
controller running. Unlike other types of controllers which run as part of the
kube-controller-manager binary, Ingress controllers are not started
automatically with a cluster. Use this page to choose the ingress controller
implementation that best fits your cluster.
Additional controllers
Caution: This section links to third party projects that provide
functionality required by Kubernetes. The Kubernetes project
authors aren't responsible for these projects. This page follows
CNCF website guidelines by listing projects alphabetically. To add
a project to this list, read the content guide before submitting a
change.
Using multiple Ingress controllers
You may deploy any number of ingress controllers within a cluster. When you
create an ingress, you should indicate the class of each ingress so that the
correct controller acts on it if more than one controller exists within your
cluster.
If you do not define a class, your cloud provider may use a default ingress
controller.
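As a sketch (the Ingress and Service names here are hypothetical), newer
controllers select a class through the spec.ingressClassName field, which
must name an existing IngressClass; older controllers may instead rely on the
kubernetes.io/ingress.class annotation:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  # must match the metadata.name of an IngressClass in the cluster
  ingressClassName: external-lb
  defaultBackend:
    service:
      name: example-service
      port:
        number: 80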
Ideally, all ingress controllers should fulfill this specification, but the various
ingress controllers operate slightly differently.
What's next
• Learn more about Ingress.
• Set up Ingress on Minikube with the NGINX Controller.
Network Policies
If you want to control traffic flow at the IP address or port level (OSI layer 3
or 4), then you might consider using Kubernetes NetworkPolicies for
particular applications in your cluster. NetworkPolicies are an application-
centric construct which allow you to specify how a pod is allowed to
communicate with various network "entities" (we use the word "entity" here
to avoid overloading the more common terms such as "endpoints" and
"services", which have specific Kubernetes connotations) over the network.
The entities that a Pod can communicate with are identified through a
combination of the following 3 identifiers:
1. Other pods that are allowed (exception: a pod cannot block access to
itself)
2. Namespaces that are allowed
3. IP blocks (exception: traffic to and from the node where a Pod is
running is always allowed, regardless of the IP address of the Pod or
the node)
Prerequisites
Network policies are implemented by the network plugin. To use network
policies, you must be using a networking solution which supports
NetworkPolicy. Creating a NetworkPolicy resource without a controller that
implements it will have no effect.
Isolated and Non-isolated Pods
By default, pods are non-isolated; they accept traffic from any source. Pods
become isolated by having a NetworkPolicy that selects them. Once there is
any NetworkPolicy in a namespace selecting a particular pod, that pod will
reject any connections that are not allowed by any NetworkPolicy.
Network policies do not conflict; they are additive. If any policy or policies
select a pod, the pod is restricted to what is allowed by the union of those
policies' ingress/egress rules. Thus, order of evaluation does not affect the
policy result.
The NetworkPolicy resource
An example NetworkPolicy might look like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: test-network-policy
namespace: default
spec:
podSelector:
matchLabels:
role: db
policyTypes:
- Ingress
- Egress
ingress:
- from:
- ipBlock:
cidr: 172.17.0.0/16
except:
- 172.17.1.0/24
- namespaceSelector:
matchLabels:
project: myproject
- podSelector:
matchLabels:
role: frontend
ports:
- protocol: TCP
port: 6379
egress:
- to:
- ipBlock:
cidr: 10.0.0.0/24
ports:
- protocol: TCP
port: 5978
Note: POSTing this to the API server for your cluster will have no
effect unless your chosen networking solution supports network
policy.
So, the example NetworkPolicy:
1. isolates "role=db" pods in the "default" namespace for both ingress and
egress traffic (if they weren't already isolated)
2. (Ingress rules) allows connections to all pods in the "default" namespace
with the label "role=db" on TCP port 6379 from:
◦ any pod in the "default" namespace with the label "role=frontend"
◦ any pod in a namespace with the label "project=myproject"
◦ IP addresses in 172.17.0.0/16 except 172.17.1.0/24
3. (Egress rules) allows connections from any pod in the "default" namespace
with the label "role=db" to CIDR 10.0.0.0/24 on TCP port 5978
Behavior of to and from selectors
There are four kinds of selectors that can be specified in an ingress from
section or egress to section: podSelector, namespaceSelector, a combination
of namespaceSelector and podSelector, and ipBlock. Be careful to use correct
YAML syntax when combining them; for example, this policy:
...
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
podSelector:
matchLabels:
role: client
...
contains a single from element allowing connections from Pods with the
label role=client in namespaces with the label user=alice. But this policy:
...
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
- podSelector:
matchLabels:
role: client
...
contains two elements in the from array, and allows connections from Pods
in the local Namespace with the label role=client, or from any Pod in any
namespace with the label user=alice.
Cluster ingress and egress mechanisms often require rewriting the source
or destination IP of packets. In cases where this happens, it is not defined
whether this happens before or after NetworkPolicy processing, and the
behavior may be different for different combinations of network plugin,
cloud provider, Service implementation, etc.
In the case of ingress, this means that in some cases you may be able to
filter incoming packets based on the actual original source IP, while in other
cases, the "source IP" that the NetworkPolicy acts on may be the IP of a
LoadBalancer or of the Pod's node, etc.
For egress, this means that connections from pods to Service IPs that get
rewritten to cluster-external IPs may or may not be subject to ipBlock-based
policies.
Default policies
By default, if no policies exist in a namespace, then all ingress and egress
traffic is allowed to and from pods in that namespace. The following
examples let you change the default behavior in that namespace.
Default deny all ingress traffic
You can create a "default" isolation policy for a namespace by creating a
NetworkPolicy that selects all pods but does not allow any ingress traffic to
those pods.
service/networking/network-policy-default-deny-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
spec:
podSelector: {}
policyTypes:
- Ingress
This ensures that even pods that aren't selected by any other NetworkPolicy
will still be isolated. This policy does not change the default egress isolation
behavior.
Default allow all ingress traffic
If you want to allow all traffic to all pods in a namespace (even if policies
are added that cause some pods to be treated as "isolated"), you can create a
policy that explicitly allows all ingress traffic in that namespace.
service/networking/network-policy-allow-all-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-all-ingress
spec:
podSelector: {}
ingress:
- {}
policyTypes:
- Ingress
Default deny all egress traffic
You can create a "default" egress isolation policy for a namespace by
creating a NetworkPolicy that selects all pods but does not allow any egress
traffic from those pods.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
spec:
podSelector: {}
policyTypes:
- Egress
This ensures that even pods that aren't selected by any other NetworkPolicy
will not be allowed egress traffic. This policy does not change the default
ingress isolation behavior.
Default allow all egress traffic
If you want to allow all traffic from all pods in a namespace (even if policies
are added that cause some pods to be treated as "isolated"), you can create a
policy that explicitly allows all egress traffic in that namespace.
service/networking/network-policy-allow-all-egress.yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-all-egress
spec:
podSelector: {}
egress:
- {}
policyTypes:
- Egress
Default deny all ingress and all egress traffic
You can create a "default" policy for a namespace which prevents all ingress
and egress traffic by creating the following NetworkPolicy in that namespace.
service/networking/network-policy-default-deny-all.yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
This ensures that even pods that aren't selected by any other NetworkPolicy
will not be allowed ingress or egress traffic.
SCTP support
FEATURE STATE: Kubernetes v1.19 [beta]
As a beta feature, SCTP support is enabled by default. When enabled, you can
set the protocol field of a NetworkPolicy rule to SCTP.
Note: You must be using a CNI plugin that supports SCTP protocol
NetworkPolicies.
What's next
• See the Declare Network Policy walkthrough for further examples.
• See more recipes for common scenarios enabled by the NetworkPolicy
resource.
Adding entries to Pod /etc/hosts
with HostAliases
Adding entries to a Pod's /etc/hosts file provides Pod-level override of
hostname resolution when DNS and other options are not applicable. You
can add these custom entries with the HostAliases field in PodSpec.
Default hosts file content
Start an Nginx Pod which is assigned a Pod IP:
kubectl run nginx --image nginx
pod/nginx created
By default, the hosts file only includes IPv4 and IPv6 boilerplate entries like
localhost and the Pod's own hostname.
Adding additional entries with hostAliases
In addition to the default boilerplate, you can add additional entries to the
hosts file. For example, the following configuration resolves foo.local and
bar.local to 127.0.0.1, and foo.remote and bar.remote to 10.1.2.3, through
the hostAliases field in the Pod spec:
apiVersion: v1
kind: Pod
metadata:
name: hostaliases-pod
spec:
restartPolicy: Never
hostAliases:
- ip: "127.0.0.1"
hostnames:
- "foo.local"
- "bar.local"
- ip: "10.1.2.3"
hostnames:
- "foo.remote"
- "bar.remote"
containers:
- name: cat-hosts
image: busybox
command:
- cat
args:
- "/etc/hosts"
pod/hostaliases-pod created
Examine the Pod's details to see its IPv4 address and its status:
kubectl get pod --output=wide
Caution:
If you make manual changes to the hosts file, those changes are
lost when the container exits.
IPv4/IPv6 dual-stack
FEATURE STATE: Kubernetes v1.16 [alpha]
IPv4/IPv6 dual-stack enables the allocation of both IPv4 and IPv6 addresses
to Pods and Services.
Prerequisites
The following prerequisites are needed in order to utilize IPv4/IPv6 dual-
stack Kubernetes clusters: provider support for dual-stack networking
(routable IPv4/IPv6 network interfaces on nodes) and a network plugin that
supports dual-stack.
Enable IPv4/IPv6 dual-stack
To enable IPv4/IPv6 dual-stack, enable the IPv6DualStack feature gate for
the relevant components of your cluster, and set dual-stack cluster network
assignments:
• kube-apiserver:
◦ --feature-gates="IPv6DualStack=true"
◦ --service-cluster-ip-range=<IPv4 CIDR>,<IPv6 CIDR>
• kube-controller-manager:
◦ --feature-gates="IPv6DualStack=true"
◦ --cluster-cidr=<IPv4 CIDR>,<IPv6 CIDR>
◦ --service-cluster-ip-range=<IPv4 CIDR>,<IPv6 CIDR>
◦ --node-cidr-mask-size-ipv4|--node-cidr-mask-size-ipv6
defaults to /24 for IPv4 and /64 for IPv6
• kubelet:
◦ --feature-gates="IPv6DualStack=true"
• kube-proxy:
◦ --cluster-cidr=<IPv4 CIDR>,<IPv6 CIDR>
◦ --feature-gates="IPv6DualStack=true"
Note:
Services
If your cluster has dual-stack enabled, you can create Services which can
use IPv4, IPv6, or both.
The address family of a Service defaults to the address family of the first
service cluster IP range (configured via the --service-cluster-ip-range
flag to the kube-controller-manager).
When you define a Service you can optionally configure it as dual stack. To
specify the behavior you want, you set the .spec.ipFamilyPolicy field to
one of the following values:
• SingleStack: Single-stack Service. The control plane allocates a cluster IP
for the Service, using the first configured service cluster IP range.
• PreferDualStack: Allocates IPv4 and IPv6 cluster IPs for the Service (if
the cluster has dual-stack enabled).
• RequireDualStack: Allocates Service cluster IPs from both the IPv4 and
IPv6 address ranges (only allowed on clusters with dual-stack enabled).
If you would like to define which IP family to use for single stack or define
the order of IP families for dual-stack, you can choose the address families
by setting an optional field, .spec.ipFamilies, on the Service.
• ["IPv4"]
• ["IPv6"]
• ["IPv4","IPv6"] (dual stack)
• ["IPv6","IPv4"] (dual stack)
The first family you list is used for the legacy .spec.ClusterIP field.
Dual-stack Service configuration scenarios
These examples demonstrate the behavior of various dual-stack Service
configuration scenarios.
This Service specification does not explicitly define .spec.ipFamilyPolicy.
When you create this Service, Kubernetes assigns a cluster IP from the first
configured service-cluster-ip-range and sets .spec.ipFamilyPolicy to
SingleStack.
service/networking/dual-stack-default-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
This Service specification explicitly sets .spec.ipFamilyPolicy to
PreferDualStack. When you create this Service on a dual-stack cluster,
Kubernetes assigns both IPv4 and IPv6 addresses for it.
service/networking/dual-stack-preferred-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
ipFamilyPolicy: PreferDualStack
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
This Service specification explicitly sets both ipFamilies and
ipFamilyPolicy. When Kubernetes assigns addresses for this Service, the
IPv6 address is listed first in .spec.clusterIPs because IPv6 is the first
entry in the ipFamilies list.
service/networking/dual-stack-preferred-ipfamilies-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
ipFamilyPolicy: PreferDualStack
ipFamilies:
- IPv6
- IPv4
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
Dual-stack defaults on existing Services
These examples demonstrate the default behavior when dual-stack is newly
enabled on a cluster where Services already exist. The following Service does
not explicitly define .spec.ipFamilyPolicy or .spec.ipFamilies:
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
When dual-stack is enabled on the cluster, the control plane updates the
existing Service, setting .spec.ipFamilyPolicy to SingleStack and
.spec.ipFamilies to the address family of the existing cluster IP:
apiVersion: v1
kind: Service
metadata:
labels:
app: MyApp
name: my-service
spec:
clusterIP: 10.0.197.123
clusterIPs:
- 10.0.197.123
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: MyApp
type: ClusterIP
status:
loadBalancer: {}
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
apiVersion: v1
kind: Service
metadata:
labels:
app: MyApp
name: my-service
spec:
clusterIP: None
clusterIPs:
- None
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: MyApp
Switching Services between single-stack and dual-stack
On a cluster with dual-stack enabled, you can switch a Service between
single-stack and dual-stack by changing .spec.ipFamilyPolicy. For example,
to change a Service from single-stack to dual-stack, change ipFamilyPolicy
from SingleStack to PreferDualStack (or RequireDualStack).
Before:
spec:
ipFamilyPolicy: SingleStack
After:
spec:
ipFamilyPolicy: PreferDualStack
Egress traffic
If you want to enable egress traffic in order to reach off-cluster destinations
(for example, the public Internet) from a Pod that uses non-publicly routable
IPv6 addresses, you need to enable the Pod to use a publicly routed IPv6
address via a mechanism such as transparent proxying or IP masquerading.
The ip-masq-agent project supports IP masquerading on dual-stack clusters.
What's next
• Validate IPv4/IPv6 dual-stack networking
Storage
Ways to provide both long-term and temporary storage to Pods in your
cluster.
Volumes
Persistent Volumes
Volume Snapshots
Storage Classes
Storage Capacity
Ephemeral Volumes
Background
Docker has a concept of volumes, though it is somewhat looser and less
managed. A Docker volume is a directory on disk or in another container.
Docker provides volume drivers, but the functionality is somewhat limited.
Kubernetes supports many types of volumes. A Pod can use any number of
volume types simultaneously. Ephemeral volume types have a lifetime of a
pod, but persistent volumes exist beyond the lifetime of a pod. Consequently,
a volume outlives any containers that run within the pod, and data is
preserved across container restarts. When a pod ceases to exist, Kubernetes
destroys ephemeral volumes; persistent volumes are not destroyed.
At its core, a volume is just a directory, possibly with some data in it, which
is accessible to the containers in a pod. How that directory comes to be, the
medium that backs it, and the contents of it are determined by the particular
volume type used.
To use a volume, specify the volumes to provide for the Pod in
.spec.volumes and declare where to mount those volumes into containers in
.spec.containers[*].volumeMounts. A process in a container sees a filesystem view
composed from their Docker image and volumes. The Docker image is at the
root of the filesystem hierarchy. Volumes mount at the specified paths within
the image. Volumes can not mount onto other volumes or have hard links to
other volumes. Each Container in the Pod's configuration must
independently specify where to mount each volume.
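As an illustrative sketch (the image, commands, and paths are arbitrary), the
following Pod shares one emptyDir volume between two containers; note that
each container declares its own volumeMounts entry for the shared volume:
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-example
spec:
  containers:
  - name: writer
    image: busybox
    # writes a file into the shared volume
    command: ["sh", "-c", "echo hello > /data/out.txt && sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox
    # sees the same data, mounted at a different path
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /input
  volumes:
  - name: shared-data
    emptyDir: {}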
Types of Volumes
Kubernetes supports several types of volumes.
awsElasticBlockStore
An awsElasticBlockStore volume mounts an Amazon Web Services (AWS)
EBS volume into your pod. Unlike emptyDir, which is erased when a pod is
removed, the contents of an EBS volume are persisted and the volume is
unmounted. This means that an EBS volume can be pre-populated with data,
and that data can be shared between pods.
Note: You must create an EBS volume by using aws ec2 create-
volume or the AWS API before you can use it.
There are some restrictions when using an awsElasticBlockStore volume:
• the nodes on which pods are running must be AWS EC2 instances
• those instances need to be in the same region and availability zone as
the EBS volume
• EBS only supports a single EC2 instance mounting a volume
Before you can use an EBS volume with a pod, you need to create it, for
example:
aws ec2 create-volume --availability-zone=eu-west-1a --size=10 --volume-type=gp2
Make sure the zone matches the zone you brought up your cluster in. Check
that the size and EBS volume type are suitable for your use.
apiVersion: v1
kind: Pod
metadata:
name: test-ebs
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /test-ebs
name: test-volume
volumes:
- name: test-volume
# This AWS EBS volume must already exist.
awsElasticBlockStore:
volumeID: "<volume id>"
fsType: ext4
azureDisk
The azureDisk volume type mounts a Microsoft Azure Data Disk into a pod.
The CSIMigration feature for azureDisk, when enabled, redirects all plugin
operations from the existing in-tree plugin to the disk.csi.azure.com
Container Storage Interface (CSI) Driver. In order to use this feature, the
Azure Disk CSI Driver must be installed on the cluster and the
CSIMigration and CSIMigrationAzureDisk features must be enabled.
azureFile
The azureFile volume type mounts a Microsoft Azure File volume (SMB 2.1
and 3.0) into a pod.
The CSIMigration feature for azureFile, when enabled, redirects all plugin
operations from the existing in-tree plugin to the file.csi.azure.com
Container Storage Interface (CSI) Driver. In order to use this feature, the
Azure File CSI Driver must be installed on the cluster and the CSIMigration
and CSIMigrationAzureFile alpha features must be enabled.
cephfs
A cephfs volume allows an existing CephFS volume to be mounted into your
Pod. Unlike emptyDir, which is erased when a pod is removed, the contents
of a cephfs volume are preserved and the volume is merely unmounted. This
means that a cephfs volume can be pre-populated with data, and that data
can be shared between pods. The cephfs volume can be mounted by
multiple writers simultaneously.
Note: You must have your own Ceph server running with the share
exported before you can use it.
cinder
Note: Kubernetes must be configured with the OpenStack cloud provider.
The cinder volume type is used to mount the OpenStack Cinder volume into
your pod.
apiVersion: v1
kind: Pod
metadata:
name: test-cinder
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-cinder-container
volumeMounts:
- mountPath: /test-cinder
name: test-volume
volumes:
- name: test-volume
# This OpenStack volume must already exist.
cinder:
volumeID: "<volume id>"
fsType: ext4
The CSIMigration feature for Cinder, when enabled, redirects all plugin
operations from the existing in-tree plugin to the
cinder.csi.openstack.org Container Storage Interface (CSI) Driver. In
order to use this feature, the OpenStack Cinder CSI Driver must be installed
on the cluster and the CSIMigration and CSIMigrationOpenStack beta
features must be enabled.
configMap
A ConfigMap provides a way to inject configuration data into pods. The data
stored in a ConfigMap can be referenced in a volume of type configMap and
then consumed by containerized applications running in a pod.
apiVersion: v1
kind: Pod
metadata:
name: configmap-pod
spec:
containers:
- name: test
image: busybox
volumeMounts:
- name: config-vol
mountPath: /etc/config
volumes:
- name: config-vol
configMap:
name: log-config
items:
- key: log_level
path: log_level
Note: You must create a ConfigMap before you can use it. A container using
a ConfigMap as a subPath volume mount will not receive ConfigMap updates.
downwardAPI
A downwardAPI volume makes downward API data available to applications.
It mounts a directory and writes the requested data in plain text files.
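The following sketch (the Pod name and label are illustrative) exposes the
Pod's own labels to its container as a file under /etc/podinfo:
apiVersion: v1
kind: Pod
metadata:
  name: downwardapi-example
  labels:
    zone: us-east-coast
spec:
  containers:
  - name: client-container
    image: busybox
    # prints the projected labels file, then idles
    command: ["sh", "-c", "cat /etc/podinfo/labels; sleep 3600"]
    volumeMounts:
    - name: podinfo
      mountPath: /etc/podinfo
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: "labels"
        fieldRef:
          fieldPath: metadata.labels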
emptyDir
An emptyDir volume is first created when a Pod is assigned to a node, and
exists as long as that Pod is running on that node. As the name says, the
emptyDir volume is initially empty. All containers in the Pod can read and write
the same files in the emptyDir volume, though that volume can be mounted
at the same or different paths in each container. When a Pod is removed
from a node for any reason, the data in the emptyDir is deleted permanently.
Note: A container crashing does not remove a Pod from a node.
The data in an emptyDir volume is safe across container crashes.
apiVersion: v1
kind: Pod
metadata:
name: test-pd
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /cache
name: cache-volume
volumes:
- name: cache-volume
emptyDir: {}
fc (fibre channel)
An fc volume type allows an existing fibre channel block storage volume to
mount in a Pod. You can specify single or multiple target world wide names
(WWNs) using the parameter targetWWNs in your Volume configuration. If
multiple WWNs are specified, targetWWNs expect that those WWNs are
from multi-path connections.
flocker (deprecated)
Flocker is an open-source, clustered container data volume manager. Flocker
provides management and orchestration of data volumes backed by a variety
of storage backends. A flocker volume allows a Flocker dataset to be
mounted into a Pod.
Note: You must have your own Flocker installation running before
you can use it.
gcePersistentDisk
A gcePersistentDisk volume mounts a Google Compute Engine (GCE)
persistent disk (PD) into your Pod. Unlike emptyDir, which is erased when a
pod is removed, the contents of a PD are preserved and the volume is merely
unmounted. This means that a PD can be pre-populated with data, and that
data can be shared between pods.
Using a GCE persistent disk with a Pod controlled by a ReplicaSet will fail
unless the PD is read-only or the replica count is 0 or 1.
Before you can use a GCE persistent disk with a Pod, you need to create it,
for example:
gcloud compute disks create --size=500GB --zone=us-central1-a my-data-disk
apiVersion: v1
kind: Pod
metadata:
name: test-pd
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /test-pd
name: test-volume
volumes:
- name: test-volume
# This GCE PD must already exist.
gcePersistentDisk:
pdName: my-data-disk
fsType: ext4
The Regional persistent disks feature allows the creation of persistent disks
that are available in two zones within the same region. In order to use this
feature, the volume must be provisioned as a PersistentVolume; referencing
the volume directly from a pod is not supported.
apiVersion: v1
kind: PersistentVolume
metadata:
name: test-volume
spec:
capacity:
storage: 400Gi
accessModes:
- ReadWriteOnce
gcePersistentDisk:
pdName: my-data-disk
fsType: ext4
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- us-central1-a
- us-central1-b
The CSIMigration feature for GCE PD, when enabled, redirects all plugin
operations from the existing in-tree plugin to the pd.csi.storage.gke.io
Container Storage Interface (CSI) Driver. In order to use this feature, the
GCE PD CSI Driver must be installed on the cluster and the CSIMigration
and CSIMigrationGCE beta features must be enabled.
gitRepo (deprecated)
Warning: The gitRepo volume type is deprecated. To provision a
container with a git repo, mount an EmptyDir into an InitContainer
that clones the repo using git, then mount the EmptyDir into the
Pod's container.
apiVersion: v1
kind: Pod
metadata:
name: server
spec:
containers:
- image: nginx
name: nginx
volumeMounts:
- mountPath: /mypath
name: git-volume
volumes:
- name: git-volume
gitRepo:
repository: "git@somewhere:me/my-git-repository.git"
revision: "22f1d8406d464b0c0874075539c1f2e96c253775"
glusterfs
A glusterfs volume allows a Glusterfs (an open source networked
filesystem) volume to be mounted into your Pod. Unlike emptyDir, which is
erased when a Pod is removed, the contents of a glusterfs volume are
preserved and the volume is merely unmounted. This means that a glusterfs
volume can be pre-populated with data, and that data can be shared
between pods. GlusterFS can be mounted by multiple writers
simultaneously.
hostPath
A hostPath volume mounts a file or directory from the host node's
filesystem into your Pod. This is not something that most Pods will need, but
it offers a powerful escape hatch for some applications.
In addition to the required path property, you can optionally specify a type
for a hostPath volume.
Value              Behavior
(empty string)     Empty string (default) is for backward compatibility,
                   which means that no checks will be performed before
                   mounting the hostPath volume.
DirectoryOrCreate  If nothing exists at the given path, an empty directory
                   will be created there as needed with permission set to
                   0755, having the same group and ownership with Kubelet.
Directory          A directory must exist at the given path
FileOrCreate       If nothing exists at the given path, an empty file will be
                   created there as needed with permission set to 0644,
                   having the same group and ownership with Kubelet.
File               A file must exist at the given path
Socket             A UNIX socket must exist at the given path
CharDevice         A character device must exist at the given path
BlockDevice        A block device must exist at the given path
Watch out when using this type of volume: hostPath volumes expose the
node's filesystem to the Pod, which presents security risks, and Pods with
identical configuration may behave differently on different nodes because
the files on each node differ. For example:
apiVersion: v1
kind: Pod
metadata:
name: test-pd
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /test-pd
name: test-volume
volumes:
- name: test-volume
hostPath:
# directory location on host
path: /data
# this field is optional
type: Directory
apiVersion: v1
kind: Pod
metadata:
name: test-webserver
spec:
containers:
- name: test-webserver
image: k8s.gcr.io/test-webserver:latest
volumeMounts:
- mountPath: /var/local/aaa
name: mydir
- mountPath: /var/local/aaa/1.txt
name: myfile
volumes:
- name: mydir
hostPath:
# Ensure the file directory is created.
path: /var/local/aaa
type: DirectoryOrCreate
- name: myfile
hostPath:
path: /var/local/aaa/1.txt
type: FileOrCreate
iscsi
An iscsi volume allows an existing iSCSI (SCSI over IP) volume to be
mounted into your Pod. Unlike emptyDir, which is erased when a Pod is
removed, the contents of an iscsi volume are preserved and the volume is
merely unmounted. This means that an iscsi volume can be pre-populated
with data, and that data can be shared between pods.
Note: You must have your own iSCSI server running with the
volume created before you can use it.
local
A local volume represents a mounted local storage device such as a disk,
partition or directory.
Local volumes can only be used as a statically created PersistentVolume;
dynamic provisioning is not supported. You must set a PersistentVolume
nodeAffinity when using local volumes, so that Kubernetes schedules Pods
using that volume onto the correct node. For example:
apiVersion: v1
kind: PersistentVolume
metadata:
name: example-pv
spec:
capacity:
storage: 100Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
storageClassName: local-storage
local:
path: /mnt/disks/ssd1
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- example-node
nfs
An nfs volume allows an existing NFS (Network File System) share to be
mounted into a Pod. Unlike emptyDir, which is erased when a Pod is
removed, the contents of an nfs volume are preserved and the volume is
merely unmounted. This means that an NFS volume can be pre-populated
with data, and that data can be shared between pods. NFS can be mounted
by multiple writers simultaneously.
Note: You must have your own NFS server running with the share
exported before you can use it.
persistentVolumeClaim
A persistentVolumeClaim volume is used to mount a PersistentVolume into
a Pod. PersistentVolumeClaims are a way for users to "claim" durable
storage (such as a GCE PersistentDisk or an iSCSI volume) without knowing
the details of the particular cloud environment.
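As a minimal sketch (the claim name and mount path are illustrative, and the
referenced PVC must already exist in the same namespace), a Pod consumes a
claim like this:
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - mountPath: /data
      name: my-volume
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      # refers to a PersistentVolumeClaim in the same namespace
      claimName: my-claim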
portworxVolume
A portworxVolume is an elastic block storage layer that runs
hyperconverged with Kubernetes. Portworx fingerprints storage in a server,
tiers based on capabilities, and aggregates capacity across multiple servers.
Portworx runs in-guest in virtual machines or on bare metal Linux nodes.
apiVersion: v1
kind: Pod
metadata:
name: test-portworx-volume-pod
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /mnt
name: pxvol
volumes:
- name: pxvol
# This Portworx volume must already exist.
portworxVolume:
volumeID: "pxvol"
fsType: "<fs-type>"
Note: Make sure you have an existing PortworxVolume with name
pxvol before using it in the Pod.
projected
A projected volume maps several existing volume sources into the same
directory.
• secret
• downwardAPI
• configMap
• serviceAccountToken
All sources are required to be in the same namespace as the Pod. For more
details, see the all-in-one volume design document.
apiVersion: v1
kind: Pod
metadata:
name: volume-test
spec:
containers:
- name: container-test
image: busybox
volumeMounts:
- name: all-in-one
mountPath: "/projected-volume"
readOnly: true
volumes:
- name: all-in-one
projected:
sources:
- secret:
name: mysecret
items:
- key: username
path: my-group/my-username
- downwardAPI:
items:
- path: "labels"
fieldRef:
fieldPath: metadata.labels
- path: "cpu_limit"
resourceFieldRef:
containerName: container-test
resource: limits.cpu
- configMap:
name: myconfigmap
items:
- key: config
path: my-group/my-config
apiVersion: v1
kind: Pod
metadata:
name: volume-test
spec:
containers:
- name: container-test
image: busybox
volumeMounts:
- name: all-in-one
mountPath: "/projected-volume"
readOnly: true
volumes:
- name: all-in-one
projected:
sources:
- secret:
name: mysecret
items:
- key: username
path: my-group/my-username
- secret:
name: mysecret2
items:
- key: password
path: my-group/my-password
mode: 511
Each projected volume source is listed in the spec under sources. The
parameters are nearly the same with two exceptions:
• For secrets, the secretName field has been changed to name to be
consistent with ConfigMap naming.
• The defaultMode can only be specified at the projected level and not for
each volume source. However, as illustrated above, you can explicitly set
the mode for each individual projection.
A Pod can also use a projected volume with a serviceAccountToken source to
mount an injected service account token (see the sketch after this
paragraph). This token can be used by a Pod's containers to access the
Kubernetes API server. The audience field contains the intended audience of
the token. A recipient of the token must identify itself with an identifier
specified in the audience of the token, and otherwise should reject the
token. This field is optional and it defaults to the identifier of the API server.
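A minimal sketch follows; the Pod name, audience value, expiry, and path are
illustrative rather than required values:
apiVersion: v1
kind: Pod
metadata:
  name: sa-token-example
spec:
  containers:
  - name: container-test
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: token-vol
      mountPath: /service-account
      readOnly: true
  volumes:
  - name: token-vol
    projected:
      sources:
      - serviceAccountToken:
          # intended audience of the token (illustrative)
          audience: api
          # requested token lifetime in seconds (illustrative)
          expirationSeconds: 3600
          path: token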
quobyte
A quobyte volume allows an existing Quobyte volume to be mounted into
your Pod.
Note: You must have your own Quobyte setup and running with
the volumes created before you can use it.
rbd
An rbd volume allows a Rados Block Device (RBD) volume to mount into your
Pod. Unlike emptyDir, which is erased when a pod is removed, the contents
of an rbd volume are preserved and the volume is merely unmounted. This
means that an RBD volume can be pre-populated with data, and that data
can be shared between pods.
Note: You must have a Ceph installation running before you can
use RBD.
scaleIO (deprecated)
ScaleIO is a software-based storage platform that uses existing hardware to
create clusters of scalable shared block networked storage. The scaleIO
volume plugin allows deployed pods to access existing ScaleIO volumes. For
information about dynamically provisioning new volumes for persistent
volume claims, see ScaleIO persistent volumes.
Note: You must have an existing ScaleIO cluster already set up and
running with the volumes created before you can use them.
apiVersion: v1
kind: Pod
metadata:
name: pod-0
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: pod-0
volumeMounts:
- mountPath: /test-pd
name: vol-0
volumes:
- name: vol-0
scaleIO:
gateway: https://localhost:443/api
system: scaleio
protectionDomain: sd0
storagePool: sp1
volumeName: vol-0
secretRef:
name: sio-secret
fsType: xfs
secret
A secret volume is used to pass sensitive information, such as passwords, to
Pods. You can store secrets in the Kubernetes API and mount them as files
for use by pods without coupling to Kubernetes directly. secret volumes are
backed by tmpfs (a RAM-backed filesystem) so they are never written to
non-volatile storage.
Note: You must create a Secret in the Kubernetes API before you
can use it.
storageOS
A storageos volume allows an existing StorageOS volume to mount into
your Pod.
Caution: You must run the StorageOS container on each node that
wants to access StorageOS volumes or that will contribute storage
capacity to the pool. For installation instructions, consult the
StorageOS documentation.
apiVersion: v1
kind: Pod
metadata:
labels:
name: redis
role: master
name: test-storageos-redis
spec:
containers:
- name: master
image: kubernetes/redis:v1
env:
- name: MASTER
value: "true"
ports:
- containerPort: 6379
volumeMounts:
- mountPath: /redis-master-data
name: redis-data
volumes:
- name: redis-data
storageos:
      # The `redis-vol01` volume must already exist within StorageOS
      # in the `default` namespace.
      volumeName: redis-vol01
fsType: ext4
vsphereVolume
Note: You must configure the Kubernetes vSphere Cloud Provider.
For cloudprovider configuration, refer to the vSphere Getting
Started guide.
Note: You must create vSphere VMDK volume using one of the
following methods before using with a Pod.
First ssh into ESX, then use the following command to create a VMDK:
vmkfstools -c 2G /vmfs/volumes/DatastoreName/volumes/myDisk.vmdk
apiVersion: v1
kind: Pod
metadata:
name: test-vmdk
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /test-vmdk
name: test-volume
volumes:
- name: test-volume
# This VMDK volume must already exist.
vsphereVolume:
volumePath: "[DatastoreName] volumes/myDisk"
fsType: ext4
Note:
• diskformat
• hostfailurestotolerate
• forceprovisioning
• cachereservation
• diskstripes
• objectspacereservation
• iopslimit
To turn off the vsphereVolume plugin from being loaded by the controller
manager and the kubelet, you need to set this feature flag to true. You must
install a csi.vsphere.vmware.com CSI driver on all worker nodes.
Using subPath
Sometimes, it is useful to share one volume for multiple uses in a single pod.
The volumeMounts.subPath property specifies a sub-path inside the
referenced volume instead of its root.
The following example shows how to configure a Pod with a LAMP stack
(Linux Apache MySQL PHP) using a single, shared volume. This sample
subPath configuration is not recommended for production use.
The PHP application's code and assets map to the volume's html folder and
the MySQL database is stored in the volume's mysql folder. For example:
apiVersion: v1
kind: Pod
metadata:
name: my-lamp-site
spec:
containers:
- name: mysql
image: mysql
env:
- name: MYSQL_ROOT_PASSWORD
value: "rootpasswd"
volumeMounts:
- mountPath: /var/lib/mysql
name: site-data
subPath: mysql
- name: php
image: php:7.0-apache
volumeMounts:
- mountPath: /var/www/html
name: site-data
subPath: html
volumes:
- name: site-data
persistentVolumeClaim:
claimName: my-lamp-site-data
Using subPath with expanded environment variables
Use the subPathExpr field to construct subPath directory names from
downward API environment variables. The subPath and subPathExpr
properties are mutually exclusive. In this example, a Pod uses subPathExpr
to create a directory named after the Pod within the hostPath volume
/var/log/pods:
apiVersion: v1
kind: Pod
metadata:
name: pod1
spec:
containers:
- name: container1
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
image: busybox
command: [ "sh", "-c", "while [ true ]; do echo 'Hello';
sleep 10; done | tee -a /logs/hello.txt" ]
volumeMounts:
- name: workdir1
mountPath: /logs
subPathExpr: $(POD_NAME)
restartPolicy: Never
volumes:
- name: workdir1
hostPath:
path: /var/log/pods
Resources
The storage media (such as Disk or SSD) of an emptyDir volume is
determined by the medium of the filesystem holding the kubelet root dir
(typically /var/lib/kubelet). There is no limit on how much space an
emptyDir or hostPath volume can consume, and no isolation between containers
or between pods.
Out-of-tree volume plugins
The out-of-tree volume plugins include Container Storage Interface (CSI)
and FlexVolume. These plugins enable storage vendors to create custom
storage plugins without adding their plugin source code to the Kubernetes
repository.
Previously, all volume plugins were "in-tree". The "in-tree" plugins were
built, linked, compiled, and shipped with the core Kubernetes binaries. This
meant that adding a new storage system to Kubernetes (a volume plugin)
required checking code into the core Kubernetes code repository.
csi
Container Storage Interface (CSI) defines a standard interface for container
orchestration systems (like Kubernetes) to expose arbitrary storage systems
to their container workloads.
Note: Support for CSI spec versions 0.2 and 0.3 are deprecated in
Kubernetes v1.13 and will be removed in a future release.
• driver: A string value that specifies the name of the volume driver to
use. This value must correspond to the value returned in the
GetPluginInfoResponse by the CSI driver as defined in the CSI spec. It is used by
Kubernetes to identify which CSI driver to call out to, and by CSI driver
components to identify which PV objects belong to the CSI driver.
• volumeHandle: A string value that uniquely identifies the volume. This
value must correspond to the value returned in the volume.id field of
the CreateVolumeResponse by the CSI driver as defined in the CSI
spec. The value is passed as volume_id on all calls to the CSI volume
driver when referencing the volume.
• readOnly: An optional boolean value indicating whether the volume is
to be "ControllerPublished" (attached) as read only. Default is false.
This value is passed to the CSI driver via the readonly field in the
ControllerPublishVolumeRequest.
• fsType: If the PV's VolumeMode is Filesystem then this field may be
used to specify the filesystem that should be used to mount the volume.
If the volume has not been formatted and formatting is supported, this
value will be used to format the volume. This value is passed to the CSI
driver via the VolumeCapability field of
ControllerPublishVolumeRequest, NodeStageVolumeRequest, and NodePublishVolumeRequest.
• volumeAttributes: A map of string to string that specifies static
properties of a volume. This map must correspond to the map returned
in the volume.attributes field of the CreateVolumeResponse by the
CSI driver as defined in the CSI spec. The map is passed to the CSI
driver via the volume_context field in the
ControllerPublishVolumeRequest, NodeStageVolumeRequest, and NodePublishVolumeRequest.
• controllerPublishSecretRef: A reference to the secret object
containing sensitive information to pass to the CSI driver to complete
the CSI ControllerPublishVolume and ControllerUnpublishVolume
calls. This field is optional, and may be empty if no secret is required. If
the Secret contains more than one secret, all secrets are passed.
• nodeStageSecretRef: A reference to the secret object containing
sensitive information to pass to the CSI driver to complete the CSI
NodeStageVolume call. This field is optional, and may be empty if no secret
is required. If the Secret contains more than one secret, all secrets are
passed.
• nodePublishSecretRef: A reference to the secret object containing
sensitive information to pass to the CSI driver to complete the CSI
NodePublishVolume call. This field is optional, and may be empty if no
secret is required. If the secret object contains more than one secret,
all secrets are passed.
Vendors with external CSI drivers can implement raw block volume support
in Kubernetes workloads.
You can directly configure CSI volumes within the Pod specification. Volumes
specified in this way are ephemeral and do not persist across pod restarts.
See Ephemeral Volumes for more information.
For more information on how to develop a CSI driver, refer to the
kubernetes-csi documentation
flexVolume
FlexVolume is an out-of-tree plugin interface that has existed in Kubernetes
since version 1.2 (before CSI). It uses an exec-based model to interface with
drivers. The FlexVolume driver binaries must be installed in a pre-defined
volume plugin path on each node and in some cases the control plane nodes
as well.
Mount propagation
Mount propagation allows for sharing volumes mounted by a container to
other containers in the same pod, or even to other pods on the same node.
• None - This volume mount will not receive any subsequent mounts that
are mounted to this volume or any of its subdirectories by the host. In
similar fashion, no mounts created by the container will be visible on
the host. This is the default mode.
• HostToContainer - This volume mount will receive all subsequent mounts
that are mounted to this volume or any of its subdirectories.
In other words, if the host mounts anything inside the volume mount,
the container will see it mounted there.
Similarly, if any Pod with Bidirectional mount propagation to the
same volume mounts anything there, the container with HostToContainer
mount propagation will see it.
• Bidirectional - This volume mount behaves the same as the HostToContainer
mount. In addition, all volume mounts created by the container are
propagated back to the host and to all containers of all pods that use the
same volume.
A typical use case for this mode is a Pod with a FlexVolume or CSI
driver or a Pod that needs to mount something on the host using a
hostPath volume.
Configuration
Before mount propagation can work properly on some deployments
(CoreOS, RedHat/CentOS, Ubuntu) mount share must be configured correctly
in Docker. Edit your Docker's systemd service file and set MountFlags as follows:
MountFlags=shared
What's next
Follow an example of deploying WordPress and MySQL with Persistent
Volumes.
Persistent Volumes
This document describes the current state of persistent volumes in
Kubernetes. Familiarity with volumes is suggested.
Introduction
Managing storage is a distinct problem from managing compute instances.
The PersistentVolume subsystem provides an API for users and
administrators that abstracts details of how storage is provided from how it
is consumed. To do this, we introduce two new API resources:
PersistentVolume and PersistentVolumeClaim.
Provisioning
There are two ways PVs may be provisioned: statically or dynamically.
Static
A cluster administrator creates a number of PVs. They carry the details of
the real storage, which is available for use by cluster users. They exist in
the Kubernetes API and are available for consumption.
Dynamic
When none of the static PVs the administrator created match a user's
PersistentVolumeClaim, the cluster may try to dynamically provision a
volume specially for the PVC. This provisioning is based on StorageClasses:
the PVC must request a storage class and the administrator must have
created and configured that class for dynamic provisioning to occur. Claims
that request the class "" effectively disable dynamic provisioning for
themselves.
Binding
A user creates, or in the case of dynamic provisioning, has already created,
a PersistentVolumeClaim with a specific amount of storage requested and
with certain access modes. A control loop in the master watches for new
PVCs, finds a matching PV (if possible), and binds them together. If a PV was
dynamically provisioned for a new PVC, the loop will always bind that PV to
the PVC. Otherwise, the user will always get at least what they asked for,
but the volume may be in excess of what was requested. Once bound,
PersistentVolumeClaim binds are exclusive, regardless of how they were
bound. A PVC to PV binding is a one-to-one mapping, using a ClaimRef
which is a bi-directional binding between the PersistentVolume and the
PersistentVolumeClaim.
Claims will remain unbound indefinitely if a matching volume does not exist.
Claims will be bound as matching volumes become available. For example, a
cluster provisioned with many 50Gi PVs would not match a PVC requesting
100Gi. The PVC can be bound when a 100Gi PV is added to the cluster.
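For reference, a minimal sketch of such a claim (the name, size, and storage
class are illustrative):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      # the control loop looks for a PV with at least this much capacity
      storage: 8Gi
  storageClassName: slow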
Using
Pods use claims as volumes. The cluster inspects the claim to find the bound
volume and mounts that volume for a Pod. For volumes that support multiple
access modes, the user specifies which mode is desired when using their
claim as a volume in a Pod.
Once a user has a claim and that claim is bound, the bound PV belongs to
the user for as long as they need it. Users schedule Pods and access their
claimed PVs by including a persistentVolumeClaim section in a Pod's
volumes block. See Claims As Volumes for more details on this.
Storage Object in Use Protection
The purpose of the Storage Object in Use Protection feature is to ensure
that PersistentVolumeClaims (PVCs) in active use by a Pod and
PersistentVolume (PVs) that are bound to PVCs are not removed from the
system, as this may result in data loss.
Note: PVC is in active use by a Pod when a Pod object exists that
is using the PVC.
If a user deletes a PVC in active use by a Pod, the PVC is not removed
immediately. PVC removal is postponed until the PVC is no longer actively
used by any Pods. Also, if an admin deletes a PV that is bound to a PVC, the
PV is not removed immediately. PV removal is postponed until the PV is no
longer bound to a PVC.
You can see that a PVC is protected when the PVC's status is Terminating
and the Finalizers list includes kubernetes.io/pvc-protection. Similarly,
a PV is protected when the PV's status is Terminating and the Finalizers
list includes kubernetes.io/pv-protection.
Reclaiming
When a user is done with their volume, they can delete the PVC objects from
the API, which allows reclamation of the resource. The reclaim policy for a
PersistentVolume tells the cluster what to do with the volume after it has
been released of its claim. Currently, volumes can either be Retained,
Recycled, or Deleted.
Retain
The Retain reclaim policy allows for manual reclamation of the resource.
When the PersistentVolumeClaim is deleted, the PersistentVolume still exists
and the volume is considered "released". But it is not yet available for
another claim because the previous claimant's data remains on the volume.
An administrator can manually reclaim the volume with the following steps:
1. Delete the PersistentVolume. The associated storage asset in external
infrastructure (such as an AWS EBS, GCE PD, Azure Disk, or Cinder volume)
still exists after the PV is deleted.
2. Manually clean up the data on the associated storage asset accordingly.
3. Manually delete the associated storage asset, or if you want to reuse the
same storage asset, create a new PersistentVolume with the storage asset
definition.
Delete
For volume plugins that support the Delete reclaim policy, deletion removes
both the PersistentVolume object from Kubernetes, as well as the associated
storage asset in the external infrastructure, such as an AWS EBS, GCE PD,
Azure Disk, or Cinder volume. Volumes that were dynamically provisioned
inherit the reclaim policy of their StorageClass, which defaults to Delete.
The administrator should configure the StorageClass according to users'
expectations; otherwise, the PV must be edited or patched after it is created.
See Change the Reclaim Policy of a PersistentVolume.
Recycle
Warning: The Recycle reclaim policy is deprecated. Instead, the recommended approach is to use dynamic provisioning.
If supported by the underlying volume plugin, the Recycle reclaim policy performs a basic scrub (rm -rf /thevolume/*) on the volume and makes it available again for a new claim. An administrator can also configure a custom recycler Pod template; the particular path specified in that template's volumes part is replaced with the particular path of the volume that is being recycled.
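For reference, a custom recycler Pod template looks roughly like the following sketch; the busybox-based scrub command is a commonly used example, the image is illustrative, and the hostPath path is a placeholder that gets replaced as described above:
apiVersion: v1
kind: Pod
metadata:
  name: pv-recycler
  namespace: default
spec:
  restartPolicy: Never
  volumes:
  - name: vol
    hostPath:
      path: /any/path/it/will/be/replaced   # replaced with the volume's path at recycle time
  containers:
  - name: pv-recycler
    image: "k8s.gcr.io/busybox"
    # scrub everything under /scrub and fail if anything is left behind
    command: ["/bin/sh", "-c", "test -e /scrub && rm -rf /scrub/..?* /scrub/.[!.]* /scrub/* && test -z \"$(ls -A /scrub)\" || exit 1"]
    volumeMounts:
    - name: vol
      mountPath: /scrub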
Reserving a PersistentVolume
The control plane can bind PersistentVolumeClaims to matching
PersistentVolumes in the cluster. However, if you want a PVC to bind to a
specific PV, you need to pre-bind them.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: foo-pvc
  namespace: foo
spec:
  storageClassName: "" # Empty string must be explicitly set otherwise default StorageClass will be set
  volumeName: foo-pv
  ...
This method does not guarantee any binding privileges to the
PersistentVolume. If other PersistentVolumeClaims could use the PV that you
specify, you first need to reserve that storage volume. Specify the relevant
PersistentVolumeClaim in the claimRef field of the PV so that other PVCs
can not bind to it.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: foo-pv
spec:
  storageClassName: ""
  claimRef:
    name: foo-pvc
    namespace: foo
  ...
This is useful if you want to consume PersistentVolumes that have their persistentVolumeReclaimPolicy set to Retain, including cases where you are reusing an existing PV.
Expanding Persistent Volumes Claims
Support for expanding PersistentVolumeClaims (PVCs) is enabled by default. You can expand the following types of volumes:
• gcePersistentDisk
• awsElasticBlockStore
• Cinder
• glusterfs
• rbd
• Azure File
• Azure Disk
• Portworx
• FlexVolumes
• CSI
You can only expand a PVC if its storage class's allowVolumeExpansion field
is set to true.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gluster-vol-default
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://192.168.10.100:8080"
  restuser: ""
  secretNamespace: ""
  secretName: ""
allowVolumeExpansion: true
To request a larger volume for a PVC, edit the PVC object and specify a
larger size. This triggers expansion of the volume that backs the underlying
PersistentVolume. A new PersistentVolume is never created to satisfy the
claim. Instead, an existing volume is resized.
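For example, if a claim (hypothetically named myclaim here) was originally created with an 8Gi request against an expandable class, you could edit it so that spec.resources.requests.storage asks for more; a minimal sketch of the edited claim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim                           # hypothetical claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gluster-vol-default   # a class with allowVolumeExpansion: true
  resources:
    requests:
      storage: 16Gi                       # was 8Gi; volumes can only be grown, not shrunk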
Support for expanding CSI volumes is enabled by default but it also requires
a specific CSI driver to support volume expansion. Refer to documentation
of the specific CSI driver for more information.
You can only resize volumes containing a file system if the file system is XFS,
Ext3, or Ext4.
When a volume contains a file system, the file system is only resized when a
new Pod is using the PersistentVolumeClaim in ReadWrite mode. File system
expansion is either done when a Pod is starting up or when a Pod is running
and the underlying file system supports online expansion.
Resizing an in-use PersistentVolumeClaim
In this case, you don't need to delete and recreate a Pod or deployment that
is using an existing PVC. Any in-use PVC automatically becomes available to
its Pod as soon as its file system has been expanded. This feature has no
effect on PVCs that are not in use by a Pod or deployment. You must create a
Pod that uses the PVC before the expansion can complete.
Persistent Volumes
Each PV contains a spec and status, which is the specification and status of the volume. The name of a PersistentVolume object must be a valid DNS subdomain name.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0003
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    path: /tmp
    server: 172.17.0.2
Capacity
Generally, a PV will have a specific storage capacity. This is set using the
PV's capacity attribute. See the Kubernetes Resource Model to understand
the units expected by capacity.
Currently, storage size is the only resource that can be set or requested.
Future attributes may include IOPS, throughput, etc.
Volume Mode
FEATURE STATE: Kubernetes v1.18 [stable]
You can set the value of volumeMode to Block to use a volume as a raw block
device. Such volume is presented into a Pod as a block device, without any
filesystem on it. This mode is useful to provide a Pod the fastest possible way
to access a volume, without any filesystem layer between the Pod and the
volume. On the other hand, the application running in the Pod must know
how to handle a raw block device. See Raw Block Volume Support for an
example on how to use a volume with volumeMode: Block in a Pod.
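As a sketch, a PersistentVolume and a matching claim that use a raw block device might look like this; the Fibre Channel details, names, and sizes are illustrative placeholders:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: block-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  volumeMode: Block                # expose the volume as a raw block device
  persistentVolumeReclaimPolicy: Retain
  fc:
    targetWWNs: ["50060e801049cfd1"]
    lun: 0
    readOnly: false
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block                # must match the PV's volumeMode to bind
  resources:
    requests:
      storage: 10Gi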
Access Modes
A PersistentVolume can be mounted on a host in any way supported by the
resource provider. As shown in the table below, providers will have different
capabilities and each PV's access modes are set to the specific modes
supported by that particular volume. For example, NFS can support multiple
read/write clients, but a specific NFS PV might be exported on the server as
read-only. Each PV gets its own set of access modes describing that specific
PV's capabilities.
In the CLI, the access modes are abbreviated to:
• RWO - ReadWriteOnce
• ROX - ReadOnlyMany
• RWX - ReadWriteMany
Class
A PV can have a class, which is specified by setting the storageClassName
attribute to the name of a StorageClass. A PV of a particular class can only
be bound to PVCs requesting that class. A PV with no storageClassName has
no class and can only be bound to PVCs that request no particular class.
Reclaim Policy
Current reclaim policies are:
• Retain -- manual reclamation
• Recycle -- basic scrub (rm -rf /thevolume/*)
• Delete -- associated storage asset such as an AWS EBS, GCE PD, Azure Disk, or OpenStack Cinder volume is deleted
Currently, only NFS and HostPath support recycling. AWS EBS, GCE PD,
Azure Disk, and Cinder volumes support deletion.
Mount Options
A Kubernetes administrator can specify additional mount options for when a
Persistent Volume is mounted on a node.
The following volume types support mount options:
• AWSElasticBlockStore
• AzureDisk
• AzureFile
• CephFS
• Cinder (OpenStack block storage)
• GCEPersistentDisk
• Glusterfs
• NFS
• Quobyte Volumes
• RBD (Ceph Block Device)
• StorageOS
• VsphereVolume
• iSCSI
Mount options are not validated, so mount will simply fail if one is invalid.
Node Affinity
Note: For most volume types, you do not need to set this field. It is
automatically populated for AWS EBS, GCE PD and Azure Disk
volume block types. You need to explicitly set this for local
volumes.
A PV can specify node affinity to define constraints that limit what nodes this
volume can be accessed from. Pods that use a PV will only be scheduled to
nodes that are selected by the node affinity.
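For example, a local PersistentVolume typically pins itself to the node that holds the disk through nodeAffinity; the node name, path, and capacity below are hypothetical:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1            # the disk that exists only on one node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - example-node       # only this node can use the volume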
Phase
A volume will be in one of the following phases:
• Available -- a free resource that is not yet bound to a claim
• Bound -- the volume is bound to a claim
• Released -- the claim has been deleted, but the resource is not yet reclaimed by the cluster
• Failed -- the volume has failed its automatic reclamation
The CLI will show the name of the PVC bound to the PV.
PersistentVolumeClaims
Each PVC contains a spec and status, which is the specification and status of
the claim. The name of a PersistentVolumeClaim object must be a valid DNS
subdomain name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
  storageClassName: slow
  selector:
    matchLabels:
      release: "stable"
    matchExpressions:
      - {key: environment, operator: In, values: [dev]}
Access Modes
Claims use the same conventions as volumes when requesting storage with
specific access modes.
Volume Modes
Claims use the same convention as volumes to indicate the consumption of
the volume as either a filesystem or block device.
Resources
Claims, like Pods, can request specific quantities of a resource. In this case,
the request is for storage. The same resource model applies to both volumes
and claims.
Selector
Claims can specify a label selector to further filter the set of volumes. Only
the volumes whose labels match the selector can be bound to the claim. The
selector can consist of two fields:
• matchLabels - the volume must have a label with this value
• matchExpressions - a list of requirements made by specifying key, list of values, and operator that relates the key and values. Valid operators include In, NotIn, Exists, and DoesNotExist.
Class
A claim can request a particular class by specifying the name of a
StorageClass using the attribute storageClassName. Only PVs of the
requested class, ones with the same storageClassName as the PVC, can be
bound to the PVC.
PVCs don't necessarily have to request a class. A PVC with its storageClassName set equal to "" is always interpreted to be requesting a PV with no class, so it can only be bound to PVs with no class (no annotation or one set equal to ""). A PVC with no storageClassName is not quite the same and is treated differently by the cluster, depending on whether the DefaultStorageClass admission plugin is turned on.
Claims As Volumes
Pods access storage by using the claim as a volume. Claims must exist in the
same namespace as the Pod using the claim. The cluster finds the claim in
the Pod's namespace and uses it to get the PersistentVolume backing the
claim. The volume is then mounted to the host and into the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: myfrontend
      image: nginx
      volumeMounts:
      - mountPath: "/var/www/html"
        name: mypd
  volumes:
    - name: mypd
      persistentVolumeClaim:
        claimName: myclaim
A Note on Namespaces
PersistentVolume binds are exclusive, and since PersistentVolumeClaims are namespaced objects, mounting claims with "Many" modes (ROX, RWX) is only possible within one namespace.
Raw Block Volume Support
The following volume plugins support raw block volumes, including dynamic
provisioning where applicable:
• AWSElasticBlockStore
• AzureDisk
• CSI
• FC (Fibre Channel)
• GCEPersistentDisk
• iSCSI
• Local volume
• OpenStack Cinder
• RBD (Ceph Block Device)
• VsphereVolume
Note: When adding a raw block device for a Pod, you specify the
device path in the container instead of a mount path.
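A Pod consuming a raw block PVC therefore uses volumeDevices with a devicePath instead of volumeMounts with a mountPath. The following is a minimal sketch; the claim name refers to the hypothetical block-pvc from the earlier Volume Mode sketch, and the image and device path are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  containers:
    - name: fc-container
      image: fedora:26
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:                 # note: volumeDevices, not volumeMounts
        - name: data
          devicePath: /dev/xvda      # where the raw device appears in the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: block-pvc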
Volume Snapshot and Restore Volume from Snapshot Support
Volume snapshots only support the out-of-tree CSI volume plugins. For details, see Volume Snapshots. In-tree volume plugins are deprecated. You can read about the deprecated volume plugins in the Volume Plugin FAQ (https://github.com/kubernetes/community/blob/master/sig-storage/volume-plugin-faq.md).
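To restore a volume from a snapshot, you create a new PVC whose dataSource references a VolumeSnapshot. The following is a minimal sketch; the PVC name, snapshot name, storage class, and size are hypothetical, and the class must be backed by a CSI driver that supports snapshots:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: csi-hostpath-sc        # hypothetical CSI-backed class
  dataSource:
    name: new-snapshot-test                # an existing VolumeSnapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi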
Volume Cloning
Volume Cloning is only available for CSI volume plugins.
Create PersistentVolumeClaim from an existing PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
spec:
  storageClassName: my-csi-plugin
  dataSource:
    name: existing-src-pvc-name
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
Writing Portable Configuration
If you're writing configuration templates or examples that run on a wide range of clusters and need persistent storage, it is recommended that you use the following pattern; a minimal example PVC is sketched after this list.
• Give the user the option of providing a storage class name when instantiating the template.
◦ If the user provides a storage class name, put that value into the persistentVolumeClaim.storageClassName field. This will cause the PVC to match the right storage class if the cluster has StorageClasses enabled by the admin.
◦ If the user does not provide a storage class name, leave the persistentVolumeClaim.storageClassName field as nil. This will cause a
PV to be automatically provisioned for the user with the default
StorageClass in the cluster. Many cluster environments have a
default StorageClass installed, or administrators can create their
own default StorageClass.
• In your tooling, watch for PVCs that are not getting bound after some
time and surface this to the user, as this may indicate that the cluster
has no dynamic storage support (in which case the user should create a
matching PV) or the cluster has no storage system (in which case the
user cannot deploy config requiring PVCs).
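As a sketch of this pattern, a portable PVC shipped in such a template might leave storageClassName unset entirely so that the cluster's default StorageClass (if any) is used; the name and size here are hypothetical:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data        # hypothetical claim shipped with the template
spec:
  accessModes:
    - ReadWriteOnce
  # storageClassName is intentionally omitted; the user can add one,
  # otherwise the cluster's default StorageClass (if any) is used.
  resources:
    requests:
      storage: 5Gi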
What's next
• Learn more about Creating a PersistentVolume.
• Learn more about Creating a PersistentVolumeClaim.
• Read the Persistent Storage design document.
Reference
• PersistentVolume
• PersistentVolumeSpec
• PersistentVolumeClaim
• PersistentVolumeClaimSpec
Volume Snapshots
In Kubernetes, a VolumeSnapshot represents a snapshot of a volume on a
storage system. This document assumes that you are already familiar with
Kubernetes persistent volumes.
Introduction
Similar to how API resources PersistentVolume and PersistentVolumeClaim are used to provision volumes for users and administrators, VolumeSnapshotContent and VolumeSnapshot API resources are provided to create volume snapshots for users and administrators.
Pre-provisioned
A cluster administrator creates a number of VolumeSnapshotContents, which carry the details of the real volume snapshots on the storage system and are available for consumption by cluster users.
Dynamic
Instead of using a pre-existing snapshot, you can request that a snapshot be dynamically taken from a PersistentVolumeClaim. The VolumeSnapshotClass specifies storage provider-specific parameters to use when taking the snapshot.
Delete
Deletion is triggered by deleting the VolumeSnapshot object, and the DeletionPolicy will be followed. If the DeletionPolicy is Delete, then the underlying storage snapshot will be deleted along with the VolumeSnapshotContent object. If the DeletionPolicy is Retain, then both the underlying snapshot and VolumeSnapshotContent remain.
VolumeSnapshots
Each VolumeSnapshot contains a spec and a status.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: pvc-test
If you want to pre-provision a snapshot, you instead specify a volumeSnapshotContentName as the source for the snapshot:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
spec:
  source:
    volumeSnapshotContentName: test-content
Volume Snapshot Contents
Each VolumeSnapshotContent contains a spec, which is the specification of the volume snapshot content. In dynamic provisioning, the snapshot common controller creates VolumeSnapshotContent objects like this:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-72d9a349-aacd-42d2-a240-d775650d2455
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    volumeHandle: ee0cfb94-f8d4-11e9-b2d8-0242ac110002
  volumeSnapshotClassName: csi-hostpath-snapclass
  volumeSnapshotRef:
    name: new-snapshot-test
    namespace: default
    uid: 72d9a349-aacd-42d2-a240-d775650d2455
For pre-provisioned snapshots, you (as cluster administrator) create the VolumeSnapshotContent yourself, referencing the existing snapshot on the storage system by its snapshotHandle:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: new-snapshot-content-test
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    snapshotHandle: 7bdd0de3-aaeb-11e8-9aae-0242ac110002
  volumeSnapshotRef:
    name: new-snapshot-test
    namespace: default
For more details, see Volume Snapshot and Restore Volume from Snapshot.
CSI Volume Cloning
Introduction
The CSI Volume Cloning feature adds support for specifying existing PVCs in
the dataSource field to indicate a user would like to clone a Volume.
A Clone is defined as a duplicate of an existing Kubernetes Volume that can
be consumed as any standard Volume would be. The only difference is that
upon provisioning, rather than creating a "new" empty Volume, the back end
device creates an exact duplicate of the specified Volume.
Provisioning
Clones are provisioned just like any other PVC with the exception of adding
a dataSource that references an existing PVC in the same namespace.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-of-pvc-1
  namespace: myns
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: cloning
  resources:
    requests:
      storage: 5Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-1
Usage
Upon availability of the new PVC, the cloned PVC is consumed the same as any other PVC. It's also expected at this point that the newly created PVC is an independent object. It can be consumed, cloned, snapshotted, or deleted independently and without consideration for its original dataSource PVC. This also implies that the source is not linked in any way to the newly created clone; it may be modified or deleted without affecting the newly created clone.
Storage Classes
This document describes the concept of a StorageClass in Kubernetes.
Familiarity with volumes and persistent volumes is suggested.
Introduction
A StorageClass provides a way for administrators to describe the "classes"
of storage they offer. Different classes might map to quality-of-service levels,
or to backup policies, or to arbitrary policies determined by the cluster
administrators. Kubernetes itself is unopinionated about what classes
represent. This concept is sometimes called "profiles" in other storage
systems.
The StorageClass Resource
Each StorageClass contains the fields provisioner, parameters, and reclaimPolicy, which are used when a PersistentVolume belonging to the class needs to be dynamically provisioned.
Administrators can specify a default StorageClass just for PVCs that don't
request any particular class to bind to: see the PersistentVolumeClaim
section for details.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
  - debug
volumeBindingMode: Immediate
Provisioner
Each StorageClass has a provisioner that determines what volume plugin is
used for provisioning PVs. This field must be specified.
You are not restricted to specifying the "internal" provisioners listed here
(whose names are prefixed with "kubernetes.io" and shipped alongside
Kubernetes). You can also run and specify external provisioners, which are
independent programs that follow a specification defined by Kubernetes.
Authors of external provisioners have full discretion over where their code
lives, how the provisioner is shipped, how it needs to be run, what volume
plugin it uses (including Flex), etc. The repository kubernetes-sigs/sig-
storage-lib-external-provisioner houses a library for writing external
provisioners that implements the bulk of the specification. Some external
provisioners are listed under the repository kubernetes-sigs/sig-storage-lib-
external-provisioner.
Reclaim Policy
PersistentVolumes that are dynamically created by a StorageClass will have
the reclaim policy specified in the reclaimPolicy field of the class, which
can be either Delete or Retain. If no reclaimPolicy is specified when a
StorageClass object is created, it will default to Delete.
Allow Volume Expansion
PersistentVolumes can be configured to be expandable. When the underlying StorageClass has the field allowVolumeExpansion set to true, users can resize the volume by editing the corresponding PVC object.
Note: You can only use the volume expansion feature to grow a Volume, not to shrink it.
Mount Options
PersistentVolumes that are dynamically created by a StorageClass will have
the mount options specified in the mountOptions field of the class.
If the volume plugin does not support mount options but mount options are
specified, provisioning will fail. Mount options are not validated on either
the class or PV, so mount of the PV will simply fail if one is invalid.
Volume Binding Mode
The volumeBindingMode field controls when volume binding and dynamic provisioning should occur. When unset, Immediate mode is used by default.
The Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created. For storage backends that are topology-constrained and not globally accessible from all Nodes in the cluster, PersistentVolumes will be bound or provisioned without knowledge of the Pod's scheduling requirements. This may result in unschedulable Pods.
A cluster administrator can address this issue by specifying the WaitForFirstConsumer mode, which delays the binding and provisioning of a PersistentVolume until a Pod using the PersistentVolumeClaim is created. The following plugins support WaitForFirstConsumer with dynamic provisioning:
• AWSElasticBlockStore
• GCEPersistentDisk
• AzureDisk
Allowed Topologies
When a cluster operator specifies the WaitForFirstConsumer volume binding mode, it is no longer necessary to restrict provisioning to specific topologies in most situations. However, if still required, allowedTopologies can be specified. This example demonstrates how to restrict the topology of provisioned volumes to specific zones:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-a
    - us-central1-b
Parameters
There can be at most 512 parameters defined for a StorageClass. The total
length of the parameters object including its keys and values cannot exceed
256 KiB.
AWS EBS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "10"
  fsType: ext4
• type: io1, gp2, sc1, st1. See AWS docs for details. Default: gp2.
• zone (Deprecated): AWS zone. If neither zone nor zones is specified,
volumes are generally round-robin-ed across all active zones where
Kubernetes cluster has a node. zone and zones parameters must not be
used at the same time.
• zones (Deprecated): A comma separated list of AWS zone(s). If neither
zone nor zones is specified, volumes are generally round-robin-ed
across all active zones where Kubernetes cluster has a node. zone and
zones parameters must not be used at the same time.
• iopsPerGB: only for io1 volumes. I/O operations per second per GiB. The AWS volume plugin multiplies this with the size of the requested volume to compute IOPS of the volume and caps it at 20,000 IOPS (the maximum supported by AWS, see AWS docs). A string is expected here, i.e. "10", not 10.
• fsType: fsType that is supported by kubernetes. Default: "ext4".
• encrypted: denotes whether the EBS volume should be encrypted or
not. Valid values are "true" or "false". A string is expected here, i.e. "
true", not true.
• kmsKeyId: optional. The full Amazon Resource Name of the key to use
when encrypting the volume. If none is supplied but encrypted is true,
a key is generated by AWS. See AWS docs for valid ARN value.
GCE PD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  fstype: ext4
  replication-type: none
• fstype: ext4 or xfs. Default: ext4. The defined filesystem type must be
supported by the host operating system.
Glusterfs
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://127.0.0.1:8081"
  clusterid: "630372ccdc720a92c681fb928f27b53f"
  restauthenabled: "true"
  restuser: "admin"
  secretNamespace: "default"
  secretName: "heketi-secret"
  gidMin: "40000"
  gidMax: "50000"
  volumetype: "replicate:3"
• resturl: Gluster REST service/Heketi service url which provision
gluster volumes on demand. The general format should be IPaddress:P
ort and this is a mandatory parameter for GlusterFS dynamic
provisioner. If Heketi service is exposed as a routable service in
openshift/kubernetes setup, this can have a format similar to http://heketi-storage-project.cloudapps.mystorage.com where the fqdn is a resolvable Heketi service url.
• gidMin, gidMax : The minimum and maximum value of GID range for
the StorageClass. A unique value (GID) in this range ( gidMin-gidMax )
will be used for dynamically provisioned volumes. These are optional
values. If not specified, the volume will be provisioned with a value
between 2000-2147483647 which are defaults for gidMin and gidMax
respectively.
OpenStack Cinder
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gold
provisioner: kubernetes.io/cinder
parameters:
  availability: nova
Note: FEATURE STATE: Kubernetes v1.11 [deprecated] This internal provisioner of OpenStack is deprecated. Please use the external cloud provider for OpenStack.
vSphere
There are two types of provisioners for vSphere storage classes:
• CSI provisioner: csi.vsphere.vmware.com
• vCP provisioner: kubernetes.io/vsphere-volume
In-tree provisioners are deprecated.
The following examples use the VMware Cloud Provider (vCP) StorageClass provisioner.
1. Create a StorageClass with a user specified disk format.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: zeroedthick
diskformat: thin, zeroedthick and eagerzeroedthick. Default: "thin".
2. Create a StorageClass with a disk format on a user specified datastore.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: zeroedthick
  datastore: VSANDatastore
datastore: The user can also specify the datastore in the StorageClass.
The volume will be created on the datastore specified in the
StorageClass, which in this case is VSANDatastore. This field is
optional. If the datastore is not specified, then the volume will be
created on the datastore specified in the vSphere config file used to
initialize the vSphere Cloud Provider.
There are a few vSphere examples that you can try out for persistent volume management inside Kubernetes for vSphere.
Ceph RBD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.16.153.105:6789
  adminId: kube
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-secret-user
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
• userId: Ceph client ID that is used to map the RBD image. Default is
the same as adminId.
Quobyte
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/quobyte
parameters:
  quobyteAPIServer: "http://138.68.74.142:7860"
  registry: "138.68.74.142:7861"
  adminSecretName: "quobyte-admin-secret"
  adminSecretNamespace: "kube-system"
  user: "root"
  group: "root"
  quobyteConfig: "BASE"
  quobyteTenant: "DEFAULT"
• registry: Quobyte registry to use to mount the volume. You can specify the registry as a <host>:<port> pair or, if you want to specify multiple registries, put a comma between them, e.g. <host1>:<port>,<host2>:<port>,<host3>:<port>. The host can be an IP address or, if you have a working DNS, you can also provide the DNS names.
Azure Disk
Azure Unmanaged Disk storage class:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Standard_LRS
  location: eastus
  storageAccount: azure_storage_account_name
Azure Disk storage class (starting from v1.7.2):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/azure-disk
parameters:
  storageaccounttype: Standard_LRS
  kind: Shared
Azure File
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
  location: eastus
  storageAccount: azure_storage_account_name
Portworx Volume
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-io-priority-high
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "1"
  snap_interval: "70"
  priority_io: "high"
ScaleIO
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/scaleio
parameters:
  gateway: https://192.168.99.200:443/api
  system: scaleio
  protectionDomain: pd0
  storagePool: sp1
  storageMode: ThinProvisioned
  secretRef: sio-secret
  readOnly: false
  fsType: xfs
StorageOS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/storageos
parameters:
  pool: default
  description: Kubernetes volume
  fsType: ext4
  adminSecretNamespace: default
  adminSecretName: storageos-secret
The StorageOS Kubernetes volume plugin can use a Secret object to specify
an endpoint and credentials to access the StorageOS API. This is only
required when the defaults have been changed. The secret must be created
with type kubernetes.io/storageos as shown in the following command:
kubectl create secret generic storageos-secret \
--type="kubernetes.io/storageos" \
--from-literal=apiAddress=tcp://localhost:5705 \
--from-literal=apiUsername=storageos \
--from-literal=apiPassword=storageos \
--namespace=default
Local
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
Local volumes do not currently support dynamic provisioning; however, a StorageClass should still be created to delay volume binding until Pod scheduling. This is specified by the WaitForFirstConsumer volume binding mode.
Volume Snapshot Classes
This document describes the concept of VolumeSnapshotClass in Kubernetes. Familiarity with volume snapshots and storage classes is suggested.
Introduction
Just like StorageClass provides a way for administrators to describe the
"classes" of storage they offer when provisioning a volume,
VolumeSnapshotClass provides a way to describe the "classes" of storage
when provisioning a volume snapshot.
The VolumeSnapshotClass Resource
Each VolumeSnapshotClass contains the driver, deletionPolicy, and parameters fields, which are used when a VolumeSnapshot belonging to the class needs to be dynamically provisioned.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
driver: hostpath.csi.k8s.io
deletionPolicy: Delete
parameters:
Administrators can specify a default VolumeSnapshotClass for VolumeSnapshots that don't request any particular class by adding the snapshot.storage.kubernetes.io/is-default-class: "true" annotation:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: hostpath.csi.k8s.io
deletionPolicy: Delete
parameters:
Driver
Volume snapshot classes have a driver that determines what CSI volume
plugin is used for provisioning VolumeSnapshots. This field must be
specified.
DeletionPolicy
Volume snapshot classes have a deletionPolicy. It enables you to configure
what happens to a VolumeSnapshotContent when the VolumeSnapshot
object it is bound to is to be deleted. The deletionPolicy of a volume
snapshot can either be Retain or Delete. This field must be specified.
Parameters
Volume snapshot classes have parameters that describe volume snapshots
belonging to the volume snapshot class. Different parameters may be
accepted depending on the driver.
Dynamic Volume Provisioning
Dynamic volume provisioning allows storage volumes to be created on-demand. Without dynamic provisioning, cluster administrators have to manually make calls to their cloud or storage provider to create new storage volumes, and then create PersistentVolume objects to represent them in Kubernetes. The dynamic provisioning feature eliminates the need for cluster administrators to pre-provision storage.
Background
The implementation of dynamic volume provisioning is based on the API
object StorageClass from the API group storage.k8s.io. A cluster
administrator can define as many StorageClass objects as needed, each
specifying a volume plugin (aka provisioner) that provisions a volume and
the set of parameters to pass to that provisioner when provisioning. A
cluster administrator can define and expose multiple flavors of storage (from
the same or different storage systems) within a cluster, each with a custom
set of parameters. This design also ensures that end users don't have to
worry about the complexity and nuances of how storage is provisioned, but
still have the ability to select from multiple storage options.
Enabling Dynamic Provisioning
To enable dynamic provisioning, a cluster administrator needs to pre-create one or more StorageClass objects for users. The following manifest creates a storage class "fast" which provisions SSD-like persistent disks.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
Using Dynamic Provisioning
Users request dynamically provisioned storage by including a storage class in their PersistentVolumeClaim. To select the "fast" storage class, for example, a user would create the following PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast
  resources:
    requests:
      storage: 30Gi
Defaulting Behavior
Dynamic provisioning can be enabled on a cluster such that all claims are dynamically provisioned if no storage class is specified. Note that there can be at most one default storage class on a cluster, or a PersistentVolumeClaim without storageClassName explicitly specified cannot be created.
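An administrator marks a StorageClass as the default with the storageclass.kubernetes.io/is-default-class annotation; the class below is a hypothetical sketch of what that looks like:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-default          # hypothetical name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard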
Topology Awareness
In Multi-Zone clusters, Pods can be spread across Zones in a Region. Single-
Zone storage backends should be provisioned in the Zones where Pods are
scheduled. This can be accomplished by setting the Volume Binding Mode.
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Last modified August 05, 2020 at 3:17 AM PST: Replace special quote
characters with normal ones. (c6a96128c)
Edit this page Create child page Create an issue
• Background
• Enabling Dynamic Provisioning
• Using Dynamic Provisioning
• Defaulting Behavior
• Topology Awareness
Storage Capacity
Storage capacity is limited and may vary depending on the node on which a
pod runs: network-attached storage might not be accessible by all nodes, or
storage is local to a node to begin with.
This page describes how Kubernetes keeps track of storage capacity and
how the scheduler uses that information to schedule Pods onto nodes that
have access to enough storage capacity for the remaining missing volumes.
Without storage capacity tracking, the scheduler may choose a node that
doesn't have enough capacity to provision a volume and multiple scheduling
retries will be needed.
API
There are two API extensions for this feature:
• CSIStorageCapacity objects: these get produced by a CSI driver in the namespace where the driver is installed. Each object contains capacity information for one storage class and defines which nodes have access to that storage.
• The CSIDriverSpec.StorageCapacity field: when set to true, the Kubernetes scheduler will consider storage capacity for volumes that use the CSI driver.
Scheduling
Storage capacity information is used by the Kubernetes scheduler if:
• the CSIStorageCapacity feature gate is true,
• a Pod uses a volume that has not been created yet,
• that volume uses a StorageClass which references a CSI driver and uses the WaitForFirstConsumer volume binding mode, and
• the CSIDriver object for the driver has StorageCapacity set to true.
In that case, the scheduler only considers nodes for the Pod which have
enough storage available to them. This check is very simplistic and only
compares the size of the volume against the capacity listed in CSIStorageCa
pacity objects with a topology that includes the node.
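To illustrate, a CSIStorageCapacity object published by a driver looks roughly like the following sketch, assuming the alpha storage.k8s.io/v1alpha1 API that this feature relied on at the time; the names, namespace, topology label, and capacity value are hypothetical:
apiVersion: storage.k8s.io/v1alpha1
kind: CSIStorageCapacity
metadata:
  name: example-capacity
  namespace: csi-driver-namespace    # namespace where the CSI driver runs
storageClassName: fast
capacity: 100Gi                      # space currently available in this topology segment
nodeTopology:
  matchLabels:
    topology.example.com/zone: zone-a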
For volumes with Immediate volume binding mode, the storage driver
decides where to create the volume, independently of Pods that will use the
volume. The scheduler then schedules Pods onto nodes where the volume is
available after the volume has been created.
For CSI ephemeral volumes, scheduling always happens without considering
storage capacity. This is based on the assumption that this volume type is
only used by special CSI drivers which are local to a node and do not need
significant resources there.
Rescheduling
When a node has been selected for a Pod with WaitForFirstConsumer volumes, that decision is still tentative. The next step is that the CSI storage driver gets asked to create the volume with a hint that the volume is supposed to be available on the selected node. Because Kubernetes might have chosen a node based on out-dated capacity information, it is possible that the volume cannot really be created. The node selection is then reset and the Kubernetes scheduler tries again to find a node for the Pod.
Limitations
Storage capacity tracking increases the chance that scheduling works on the
first try, but cannot guarantee this because the scheduler has to decide
based on potentially out-dated information. Usually, the same retry
mechanism as for scheduling without any storage capacity information
handles scheduling failures.
One situation where scheduling can fail permanently is when a Pod uses
multiple volumes: one volume might have been created already in a topology
segment which then does not have enough capacity left for another volume.
Manual intervention is necessary to recover from this, for example by
increasing capacity or deleting the volume that was already created. Further
work is needed to handle this automatically.
Enabling storage capacity tracking
Storage capacity tracking is an alpha feature and is only enabled when the CSIStorageCapacity feature gate and the storage.k8s.io/v1alpha1 API group are enabled. In addition to enabling the feature in the cluster, a CSI driver also has to
support it. Please refer to the driver's documentation for details.
What's next
• For more information on the design, see the Storage Capacity
Constraints for Pod Scheduling KEP.
• For more information on further development of this feature, see the
enhancement tracking issue #1472.
• Learn about Kubernetes Scheduler
Ephemeral Volumes
This document describes ephemeral volumes in Kubernetes. Familiarity with
volumes is suggested, in particular PersistentVolumeClaim and
PersistentVolume.
Some applications need additional storage but don't care whether that data is
stored persistently across restarts. For example, caching services are often
limited by memory size and can move infrequently used data into storage
that is slower than memory with little impact on overall performance.
Ephemeral volumes are specified inline in the Pod spec, which simplifies
application deployment and management.
Kubernetes supports several different kinds of ephemeral volumes for different purposes:
• emptyDir: empty at Pod startup, with storage coming locally from the
kubelet base directory (usually the root disk) or RAM
• configMap, downwardAPI, secret: inject different kinds of Kubernetes
data into a Pod
• CSI ephemeral volumes: similar to the previous volume kinds, but
provided by special CSI drivers which specifically support this feature
• generic ephemeral volumes, which can be provided by all storage
drivers that also support persistent volumes
The advantage of using third-party drivers is that they can offer functionality
that Kubernetes itself does not support, for example storage with different
performance characteristics than the disk that is managed by kubelet, or
injecting different data.
Here's an example manifest for a Pod that uses CSI ephemeral storage:
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app
spec:
  containers:
    - name: my-frontend
      image: busybox
      volumeMounts:
      - mountPath: "/data"
        name: my-csi-inline-vol
      command: [ "sleep", "1000000" ]
  volumes:
    - name: my-csi-inline-vol
      csi:
        driver: inline.storage.kubernetes.io
        volumeAttributes:
          foo: bar
Here's an example manifest for a Pod that uses a generic ephemeral volume:
kind: Pod
apiVersion: v1
metadata:
  name: my-app
spec:
  containers:
    - name: my-frontend
      image: busybox
      volumeMounts:
      - mountPath: "/scratch"
        name: scratch-volume
      command: [ "sleep", "1000000" ]
  volumes:
    - name: scratch-volume
      ephemeral:
        volumeClaimTemplate:
          metadata:
            labels:
              type: my-frontend-volume
          spec:
            accessModes: [ "ReadWriteOnce" ]
            storageClassName: "scratch-storage-class"
            resources:
              requests:
                storage: 1Gi
While these PVCs exist, they can be used like any other PVC. In particular,
they can be referenced as data source in volume cloning or snapshotting.
The PVC object also holds the current status of the volume.
PersistentVolumeClaim naming
Naming of the automatically created PVCs is deterministic: the name is a
combination of Pod name and volume name, with a hyphen (-) in the middle.
In the example above, the PVC name will be my-app-scratch-volume. This
deterministic naming makes it easier to interact with the PVC because one
does not have to search for it once the Pod name and volume name are
known.
This deterministic naming also introduces a potential conflict between different Pods (a Pod "pod-a" with volume "scratch" and another Pod with name "pod" and volume "a-scratch" both end up with the same PVC name "pod-a-scratch") and between Pods and manually created PVCs. Such conflicts are detected: a PVC is only used for an ephemeral volume if it was created for the Pod. This check is based on the ownership relationship. An existing PVC is not overwritten or modified. But this does not resolve the conflict because without the right PVC, the Pod cannot start.
Caution: Take care when naming Pods and volumes inside the
same namespace, so that these conflicts can't occur.
Security
Enabling the GenericEphemeralVolume feature allows users to create PVCs
indirectly if they can create Pods, even if they do not have permission to
create PVCs directly. Cluster administrators must be aware of this. If this
does not fit their security model, they have two choices:
• Explicitly disable the feature through the feature gate, to avoid being
surprised when some future Kubernetes version enables it by default.
• Use a Pod Security Policy where the volumes list does not contain the ephemeral volume type.
The normal namespace quota for PVCs in a namespace still applies, so even
if users are allowed to use this new mechanism, they cannot use it to
circumvent other policies.
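For example, a namespace ResourceQuota along these lines still caps how many PVCs can exist and how much storage they can request, whether the PVCs are created directly or through generic ephemeral volumes; the namespace name and the numbers here are arbitrary placeholders:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: example-ns
spec:
  hard:
    persistentvolumeclaims: "5"     # at most 5 PVCs in the namespace
    requests.storage: 10Gi          # total requested storage across all PVCs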
What's next
Ephemeral volumes managed by kubelet
See local ephemeral storage.
CSI ephemeral volumes
• For more information on the design, see the Ephemeral Inline CSI
volumes KEP.
• For more information on further development of this feature, see the
enhancement tracking issue #596.
Node-specific Volume Limits
This page describes the maximum number of volumes that can be attached to a Node for various cloud providers.
Cloud providers like Google, Amazon, and Microsoft typically have a limit on
how many volumes can be attached to a Node. It is important for Kubernetes
to respect those limits. Otherwise, Pods scheduled on a Node could get
stuck waiting for volumes to attach.
Kubernetes default limits
The Kubernetes scheduler has default limits on the number of volumes that can be attached to a Node:
• Amazon Elastic Block Store (EBS): 39 volumes per Node
• Google Persistent Disk: 16 volumes per Node
• Microsoft Azure Disk Storage: 16 volumes per Node
Custom limits
You can change these limits by setting the value of the KUBE_MAX_PD_VOLS
environment variable, and then starting the scheduler. CSI drivers might
have a different procedure, see their documentation on how to customize
their limits.
Use caution if you set a limit that is higher than the default limit. Consult
the cloud provider's documentation to make sure that Nodes can actually
support the limit you set.
Dynamic volume limits
Dynamic volume limits are supported for the following volume types:
• Amazon EBS
• Google Persistent Disk
• Azure Disk
• CSI
Configuration
Resources that Kubernetes provides for configuring Pods.
ConfigMaps
Secrets
Configuration Best Practices
General Configuration Tips
• Write your configuration files using YAML rather than JSON. Though
these formats can be used interchangeably in almost all scenarios,
YAML tends to be more user-friendly.
• Group related objects into a single file whenever it makes sense. One
file is often easier to manage than several. See the guestbook-all-in-
one.yaml file as an example of this syntax.
Services
• Create a Service before its corresponding backend workloads
(Deployments or ReplicaSets), and before any workloads that need to
access it. When Kubernetes starts a container, it provides environment
variables pointing to all the Services which were running when the
container was started. For example, if a Service named foo exists, all
containers will get the following variables in their initial environment:
FOO_SERVICE_HOST=<the host the Service is running on>
FOO_SERVICE_PORT=<the port the Service is running on>
If you only need access to the port for debugging purposes, you can use
the apiserver proxy or kubectl port-forward.
If you explicitly need to expose a Pod's port on the node, consider using
a NodePort Service before resorting to hostPort.
Using Labels
• Define and use labels that identify semantic attributes of your
application or Deployment, such as { app: myapp, tier: frontend,
phase: test, deployment: v3 }. You can use these labels to select
the appropriate Pods for other resources; for example, a Service that
selects all tier: frontend Pods, or all phase: test components of app: myapp. See the guestbook app for examples of this approach. A minimal Service selecting on such labels is sketched below.
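The following is a hypothetical sketch of such a Service; the names, labels, and ports are placeholders:
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:                # matches Pods labeled app: myapp and tier: frontend
    app: myapp
    tier: frontend
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080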
Container Images
The imagePullPolicy and the tag of the image affect when the kubelet
attempts to pull the specified image.
• imagePullPolicy is omitted and the image tag is present but not :latest: IfNotPresent is applied.
Note: To make sure the container always uses the same version of
the image, you can specify its digest; replace <image-name>:<tag>
with <image-name>@<digest> (for example, image@sha256:45b23d
ee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5c
b2). The digest uniquely identifies a specific version of the image,
so it is never updated by Kubernetes unless you change the digest
value.
Note: You should avoid using the :latest tag when deploying
containers in production as it is harder to track which version of
the image is running and more difficult to roll back properly.
• Use label selectors for get and delete operations instead of specific
object names. See the sections on label selectors and using labels
effectively.
ConfigMaps
A ConfigMap is an API object used to store non-confidential data in key-
value pairs. Pods can consume ConfigMaps as environment variables,
command-line arguments, or as configuration files in a volume.
For example, imagine that you are developing an application that you can
run on your own computer (for development) and in the cloud (to handle
real traffic). You write the code to look in an environment variable named DA
TABASE_HOST. Locally, you set that variable to localhost. In the cloud, you
set it to refer to a Kubernetes Service that exposes the database component
to your cluster. This lets you fetch a container image running in the cloud
and debug the exact same code locally if needed.
A ConfigMap is not designed to hold large chunks of data. The data stored in
a ConfigMap cannot exceed 1 MiB. If you need to store settings that are
larger than this limit, you may want to consider mounting a volume or use a
separate database or file service.
ConfigMap object
A ConfigMap is an API object that lets you store configuration for other
objects to use. Unlike most Kubernetes objects that have a spec, a ConfigMap has data and binaryData fields. These fields accept key-value pairs as their values. Both the data field and the binaryData are optional. The data field is designed to contain UTF-8 byte sequences while the binaryData field is designed to contain binary data as base64-encoded strings.
Each key under the data or the binaryData field must consist of
alphanumeric characters, -, _ or .. The keys stored in data must not overlap
with the keys in the binaryData field.
Here's an example ConfigMap that has some keys with single values, and
other keys where the value looks like a fragment of a configuration format.
apiVersion: v1
kind: ConfigMap
metadata:
  name: game-demo
data:
  # property-like keys; each key maps to a simple value
  player_initial_lives: "3"
  ui_properties_file_name: "user-interface.properties"
  # file-like keys
  game.properties: |
    enemy.types=aliens,monsters
    player.maximum-lives=5
  user-interface.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
There are four different ways that you can use a ConfigMap to configure a container inside a Pod:
1. Inside a container command and args
2. Environment variables for a container
3. Add a file in read-only volume, for the application to read
4. Write code to run inside the Pod that uses the Kubernetes API to read a ConfigMap
The fourth method means you have to write code to read the ConfigMap and
its data. However, because you're using the Kubernetes API directly, your
application can subscribe to get updates whenever the ConfigMap changes,
and react when that happens. By accessing the Kubernetes API directly, this
technique also lets you access a ConfigMap in a different namespace.
Here's an example Pod that uses values from game-demo to configure a Pod:
apiVersion: v1
kind: Pod
metadata:
  name: configmap-demo-pod
spec:
  containers:
    - name: demo
      image: alpine
      command: ["sleep", "3600"]
      env:
        # Define the environment variable
        - name: PLAYER_INITIAL_LIVES # Notice that the case is different here
                                     # from the key name in the ConfigMap.
          valueFrom:
            configMapKeyRef:
              name: game-demo           # The ConfigMap this value comes from.
              key: player_initial_lives # The key to fetch.
        - name: UI_PROPERTIES_FILE_NAME
          valueFrom:
            configMapKeyRef:
              name: game-demo
              key: ui_properties_file_name
      volumeMounts:
      - name: config
        mountPath: "/config"
        readOnly: true
  volumes:
    # You set volumes at the Pod level, then mount them into containers inside that Pod
    - name: config
      configMap:
        # Provide the name of the ConfigMap you want to mount.
        name: game-demo
        # An array of keys from the ConfigMap to create as files
        items:
        - key: "game.properties"
          path: "game.properties"
        - key: "user-interface.properties"
          path: "user-interface.properties"
For this example, defining a volume and mounting it inside the demo
container as /config creates two files, /config/game.properties and /
config/user-interface.properties, even though there are four keys in
the ConfigMap. This is because the Pod definition specifies an items array in
the volumes section. If you omit the items array entirely, every key in the
ConfigMap becomes a file with the same name as the key, and you get 4
files.
Using ConfigMaps
ConfigMaps can be mounted as data volumes. ConfigMaps can also be used
by other parts of the system, without being directly exposed to the Pod. For
example, ConfigMaps can hold data that other parts of the system should
use for configuration.
For example, you might encounter addons or operators that adjust their
behavior based on a ConfigMap.
Using ConfigMaps as files from a Pod
To consume a ConfigMap in a volume in a Pod:
1. Create a ConfigMap or use an existing one. Multiple Pods can reference the same ConfigMap.
2. Modify your Pod definition to add a volume under .spec.volumes[]. Name the volume anything, and have a .spec.volumes[].configMap.name field set to reference your ConfigMap object.
3. Add a .spec.containers[].volumeMounts[] to each container that needs the ConfigMap. Specify .spec.containers[].volumeMounts[].readOnly = true and .spec.containers[].volumeMounts[].mountPath to an unused directory name where you would like the ConfigMap to appear.
4. Modify your image or command line so that the program looks for files in that directory. Each key in the ConfigMap data map becomes the filename under mountPath.
This is an example of a Pod that mounts a ConfigMap in a volume:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: mypod
      image: redis
      volumeMounts:
        - name: foo
          mountPath: "/etc/foo"
          readOnly: true
  volumes:
    - name: foo
      configMap:
        name: myconfigmap
If there are multiple containers in the Pod, then each container needs its
own volumeMounts block, but only one .spec.volumes is needed per
ConfigMap.
Immutable ConfigMaps
FEATURE STATE: Kubernetes v1.19 [beta]
The Kubernetes feature Immutable Secrets and ConfigMaps provides an option to set individual Secrets and ConfigMaps as immutable. For clusters that extensively use ConfigMaps (at least tens of thousands of unique ConfigMap to Pod mounts), preventing changes to their data has the following advantages:
• protects you from accidental (or unwanted) updates that could cause applications outages
• improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for ConfigMaps marked as immutable.
You can create an immutable ConfigMap by setting the immutable field to true. For example:
apiVersion: v1
kind: ConfigMap
metadata:
  ...
data:
  ...
immutable: true
Once a ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data or the binaryData field. You can only delete and recreate the ConfigMap.
What's next
• Read about Secrets.
• Read Configure a Pod to Use a ConfigMap.
• Read The Twelve-Factor App to understand the motivation for
separating code from configuration.
Secrets
Kubernetes Secrets let you store and manage sensitive information, such as
passwords, OAuth tokens, and ssh keys. Storing confidential information in a
Secret is safer and more flexible than putting it verbatim in a Pod definition
or in a container image. See Secrets design document for more information.
Overview of Secrets
To use a Secret, a Pod needs to reference the Secret. A Secret can be used with a Pod in three ways:
• As files in a volume mounted on one or more of its containers.
• As container environment variables.
• By the kubelet when pulling images for the Pod.
The name of a Secret object must be a valid DNS subdomain name. You can
specify the data and/or the stringData field when creating a configuration
file for a Secret. The data and the stringData fields are optional. The values
for all keys in the data field have to be base64-encoded strings. If the
conversion to base64 string is not desirable, you can choose to specify the s
tringData field instead, which accepts arbitrary strings as values.
The keys of data and stringData must consist of alphanumeric characters,
-, _ or .. All key-value pairs in the stringData field are internally merged
into the data field. If a key appears in both the data and the stringData
field, the value specified in the stringData field takes precedence.
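As an illustration, the following hypothetical Secret provides one key as base64 in data and another as plain text in stringData; if the same key appeared in both fields, the stringData value would take precedence:
apiVersion: v1
kind: Secret
metadata:
  name: mixed-secret          # hypothetical name
type: Opaque
data:
  username: YWRtaW4=          # "admin", base64-encoded
stringData:
  config.yaml: |
    apiUrl: https://example.com/api
    password: t0p-Secret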
Types of Secret
When creating a Secret, you can specify its type using the type field of the S
ecret resource, or certain equivalent kubectl command line flags (if
available). The Secret type is used to facilitate programmatic handling of the
Secret data.
Kubernetes provides several builtin types for some common usage scenarios.
These types vary in terms of the validations performed and the constraints
Kubernetes imposes on them.
You can define and use your own Secret type by assigning a non-empty
string as the type value for a Secret object. An empty string is treated as an
Opaque type. Kubernetes doesn't impose any constraints on the type name.
However, if you are using one of the builtin types, you must meet all the
requirements defined for that type.
Opaque secrets
Opaque is the default Secret type if omitted from a Secret configuration file.
When you create a Secret using kubectl, you will use the generic
subcommand to indicate an Opaque Secret type. For example, the following command creates an empty Secret of type Opaque:
kubectl create secret generic empty-secret
Service account token Secrets
A kubernetes.io/service-account-token type of Secret is used to store a token that identifies a service account. When using this Secret type, you need to ensure that the kubernetes.io/service-account.name annotation is set to an existing service account name. For example:
apiVersion: v1
kind: Secret
metadata:
  name: secret-sa-sample
  annotations:
    kubernetes.io/service-account.name: "sa-name"
type: kubernetes.io/service-account-token
data:
  # You can include additional key value pairs as you do with Opaque Secrets
  extra: YmFyCg==
Docker config Secrets
You can use one of the following type values to create a Secret to store the credentials for accessing a Docker registry for images:
• kubernetes.io/dockercfg
• kubernetes.io/dockerconfigjson
The kubernetes.io/dockercfg type is reserved to store a serialized ~/.dockercfg, which is the legacy format for configuring Docker command lines. When using this Secret type, you have to ensure the Secret data field contains a .dockercfg key whose value is the content of a ~/.dockercfg file encoded in base64 format. For example:
apiVersion: v1
kind: Secret
metadata:
  name: secret-dockercfg
type: kubernetes.io/dockercfg
data:
  .dockercfg: |
    "<base64 encoded ~/.dockercfg file>"
Note: If you do not want to perform the base64 encoding, you can
choose to use the stringData field instead.
When you create these types of Secrets using a manifest, the API server checks whether the expected key exists in the data field, and it verifies that the value provided can be parsed as valid JSON. The API server doesn't validate whether the JSON actually is a Docker config file.
When you do not have a Docker config file, or you want to use kubectl to create a Docker registry Secret, you can do:
kubectl create secret docker-registry secret-tiger-docker \
  --docker-username=tiger \
  --docker-password=pass113 \
  --docker-email=tiger@acme.com
This command creates a Secret of type kubernetes.io/dockerconfigjson. If you dump the .dockerconfigjson content from the data field, you will get the following JSON document, which is a valid Docker configuration created on the fly:
{
    "auths": {
        "https://index.docker.io/v1/": {
            "username": "tiger",
            "password": "pass113",
            "email": "tiger@acme.com",
            "auth": "dGlnZXI6cGFzczExMw=="
        }
    }
}
Basic authentication Secret
The kubernetes.io/basic-auth type is provided for storing credentials
needed for basic authentication. When using this Secret type, the data field of the Secret must contain the following two keys:
• username: the user name for authentication
• password: the password or token for authentication
Both values for the above two keys are base64 encoded strings. You can, of
course, provide the clear text content using the stringData for Secret
creation.
apiVersion: v1
kind: Secret
metadata:
  name: secret-basic-auth
type: kubernetes.io/basic-auth
stringData:
  username: admin
  password: t0p-Secret
The basic authentication Secret type is provided only for user's convenience.
You can create an Opaque for credentials used for basic authentication.
However, using the builtin Secret type helps unify the formats of your
credentials and the API server does verify if the required keys are provided
in a Secret configuration.
SSH authentication secrets
The builtin type kubernetes.io/ssh-auth is provided for storing data used in SSH authentication. When using this Secret type, you will have to specify a ssh-privatekey key-value pair in the data (or stringData) field as the SSH credential to use. For example:
apiVersion: v1
kind: Secret
metadata:
  name: secret-ssh-auth
type: kubernetes.io/ssh-auth
data:
  # the data is abbreviated in this example
  ssh-privatekey: |
    MIIEpQIBAAKCAQEAulqb/Y ...
The SSH authentication Secret type is provided only for user's convenience.
You can create an Opaque for credentials used for SSH authentication.
However, using the builtin Secret type helps unify the formats of your
credentials and the API server does verify if the required keys are provided
in a Secret configuration.
TLS secrets
Kubernetes provides a builtin Secret type kubernetes.io/tls for storing a certificate and its associated key that are typically used for TLS. This data is primarily used with TLS termination of the Ingress resource, but may be
used with other resources or directly by a workload. When using this type of
Secret, the tls.key and the tls.crt key must be provided in the data (or s
tringData) field of the Secret configuration, although the API server doesn't
actually validate the values for each key.
apiVersion: v1
kind: Secret
metadata:
  name: secret-tls
type: kubernetes.io/tls
data:
  # the data is abbreviated in this example
  tls.crt: |
    MIIC2DCCAcCgAwIBAgIBATANBgkqh ...
  tls.key: |
    MIIEpgIBAAKCAQEA7yn3bRHQ5FHMQ ...
The TLS Secret type is provided for user's convenience. You can create an O
paque for credentials used for TLS server and/or client. However, using the
builtin Secret type helps ensure the consistency of Secret format in your
project; the API server does verify if the required keys are provided in a
Secret configuration.
When creating a TLS Secret using kubectl, you can use the tls subcommand as shown in the following example:
kubectl create secret tls my-tls-secret \
  --cert=path/to/cert/file \
  --key=path/to/key/file
The public/private key pair must exist beforehand. The public key certificate for --cert must be .PEM encoded (Base64-encoded DER format), and match the given private key for --key. The private key must be in what is commonly called PEM private key format, unencrypted. In both cases, the initial and the last lines from PEM (for example, -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- for a certificate) are not included.
Bootstrap token Secrets
A bootstrap token Secret can be created by explicitly specifying the Secret
type as bootstrap.kubernetes.io/token. This type of Secret is designed for
tokens used during the node bootstrap process. It stores tokens used to sign
well-known ConfigMaps.
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-5emitj
  namespace: kube-system
type: bootstrap.kubernetes.io/token
data:
  auth-extra-groups: c3lzdGVtOmJvb3RzdHJhcHBlcnM6a3ViZWFkbTpkZWZhdWx0LW5vZGUtdG9rZW4=
  expiration: MjAyMC0wOS0xM1QwNDozOToxMFo=
  token-id: NWVtaXRq
  token-secret: a3E0Z2lodnN6emduMXAwcg==
  usage-bootstrap-authentication: dHJ1ZQ==
  usage-bootstrap-signing: dHJ1ZQ==
A bootstrap type Secret has the following keys specified under data: token-id,
token-secret, expiration, usage-bootstrap-<usage> (for example,
usage-bootstrap-authentication and usage-bootstrap-signing), and
auth-extra-groups, as shown in the example above.
The above YAML may look confusing because the values are all in base64
encoded strings. In fact, you can create an identical Secret using the
following YAML:
apiVersion: v1
kind: Secret
metadata:
  # Note how the Secret is named
  name: bootstrap-token-5emitj
  # A bootstrap token Secret usually resides in the kube-system namespace
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  auth-extra-groups: "system:bootstrappers:kubeadm:default-node-token"
  expiration: "2020-09-13T04:39:10Z"
  # This token ID is used in the name
  token-id: "5emitj"
  token-secret: "kq4gihvszzgn1p0r"
  # This token can be used for authentication
  usage-bootstrap-authentication: "true"
  # and it can be used for signing
  usage-bootstrap-signing: "true"
Creating a Secret
There are several options to create a Secret: with kubectl, from a
configuration file, or with the kustomize tool.
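For example, a minimal kubectl sketch (the Secret name and literal values are illustrative; they match the values that appear in the editing example below):

kubectl create secret generic mysecret \
  --from-literal=username=admin \
  --from-literal=password=1f2d1e2e67df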
Editing a Secret
An existing Secret may be edited with the following command:
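For example, assuming a Secret named mysecret:

kubectl edit secrets mysecret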
This will open the default configured editor and allow for updating the
base64 encoded Secret values in the data field:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  username: YWRtaW4=
  password: MWYyZDFlMmU2N2Rm
kind: Secret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: { ... }
  creationTimestamp: 2016-01-22T18:41:56Z
  name: mysecret
  namespace: default
  resourceVersion: "164619"
  uid: cfee02d6-c137-11e5-8d73-42010af00002
type: Opaque
Using Secrets
Secrets can be mounted as data volumes or exposed as environment
variables to be used by a container in a Pod. Secrets can also be used by
other parts of the system, without being directly exposed to the Pod. For
example, Secrets can hold credentials that other parts of the system should
use to interact with external systems on your behalf.
Using Secrets as files from a Pod
To consume a Secret in a volume in a Pod:
1. Create a secret or use an existing one. Multiple Pods can reference the
same secret.
2. Modify your Pod definition to add a volume under .spec.volumes[]. Name
the volume anything, and have a .spec.volumes[].secret.secretName field
equal to the name of the Secret object.
3. Add a .spec.containers[].volumeMounts[] to each container that needs
the secret. Specify .spec.containers[].volumeMounts[].readOnly = true
and .spec.containers[].volumeMounts[].mountPath to an unused directory
name where you would like the secrets to appear.
4. Modify your image or command line so that the program looks for files
in that directory. Each key in the secret data map becomes the filename
under mountPath.
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
readOnly: true
volumes:
- name: foo
secret:
secretName: mysecret
If there are multiple containers in the Pod, then each container needs its
own volumeMounts block, but only one .spec.volumes is needed per Secret.
You can package many files into one secret, or use many secrets, whichever
is convenient.
You can also control the paths within the volume where Secret keys are
projected. You can use the .spec.volumes[].secret.items field to change
the target path of each key:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
readOnly: true
volumes:
- name: foo
secret:
secretName: mysecret
items:
- key: username
path: my-group/my-username
You can set the file access permission bits for a single Secret key. If you
don't specify any permissions, 0644 is used by default. You can also set a
default mode for the entire Secret volume and override per key if needed.
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
volumes:
- name: foo
secret:
secretName: mysecret
defaultMode: 0400
Then, the secret will be mounted on /etc/foo and all the files created by the
secret volume mount will have permission 0400.
Note that the JSON spec doesn't support octal notation, so use the value 256
for 0400 permissions. If you use YAML instead of JSON for the Pod, you can
use octal notation to specify permissions in a more natural way.
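For example, a sketch of the relevant volume fragment from a JSON Pod spec (only the fields shown here; everything else is omitted):

"volumes": [
  {
    "name": "foo",
    "secret": {
      "secretName": "mysecret",
      "defaultMode": 256
    }
  }
]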
Note if you kubectl exec into the Pod, you need to follow the symlink to
find the expected file mode. For example,
cd /etc/foo
ls -l
total 0
lrwxrwxrwx 1 root root 15 May 18 00:18 password -> ..data/
password
lrwxrwxrwx 1 root root 15 May 18 00:18 username -> ..data/
username
cd /etc/foo/..data
ls -l
total 8
-r-------- 1 root root 12 May 18 00:18 password
-r-------- 1 root root 5 May 18 00:18 username
You can also use mapping, as in the previous example, and specify different
permissions for different files like this:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
volumes:
- name: foo
secret:
secretName: mysecret
items:
- key: username
path: my-group/my-username
mode: 0777
Note that this permission value might be displayed in decimal notation if you
read it later.
Inside the container that mounts a secret volume, the secret keys appear as
files and the secret values are base64 decoded and stored inside these files.
This is the result of commands executed inside the container from the
example above:
ls /etc/foo/
username
password
cat /etc/foo/username
admin
cat /etc/foo/password
1f2d1e2e67df
The program in a container is responsible for reading the secrets from the
files.
Mounted Secrets are updated automatically
When a Secret that is consumed in a volume is updated, the projected keys are
eventually updated as well; the kubelet refreshes its local copy on its periodic
sync. Secrets consumed as environment variables, however, are not updated
automatically after a Secret update.
Using Secrets as environment variables
To use a Secret in an environment variable in a Pod:
1. Create a secret or use an existing one. Multiple Pods can reference the
same secret.
2. Modify your Pod definition in each container that you wish to consume
the value of a secret key to add an environment variable for each secret
key you wish to consume. The environment variable that consumes the
secret key should populate the secret's name and key in env[].valueFr
om.secretKeyRef.
3. Modify your image and/or command line so that the program looks for
values in the specified environment variables.
apiVersion: v1
kind: Pod
metadata:
name: secret-env-pod
spec:
containers:
- name: mycontainer
image: redis
env:
- name: SECRET_USERNAME
valueFrom:
secretKeyRef:
name: mysecret
key: username
- name: SECRET_PASSWORD
valueFrom:
secretKeyRef:
name: mysecret
key: password
restartPolicy: Never
Inside a container that consumes a Secret in environment variables, the secret
keys appear as normal environment variables containing the base64-decoded
values of the secret data. This is the result of commands executed inside the
container from the example above:
echo $SECRET_USERNAME
admin
echo $SECRET_PASSWORD
1f2d1e2e67df
Immutable Secrets
FEATURE STATE: Kubernetes v1.19 [beta]
Marking a Secret as immutable:
• protects you from accidental (or unwanted) updates that could cause
application outages
• improves performance of your cluster by significantly reducing load on
kube-apiserver, by closing watches for Secrets marked as immutable.
You can create an immutable Secret by setting its immutable field to true.
For example:
apiVersion: v1
kind: Secret
metadata:
...
data:
...
immutable: true
Using imagePullSecrets
The imagePullSecrets field is a list of references to secrets in the same
namespace. You can use an imagePullSecrets to pass a secret that contains
a Docker (or other) image registry password to the kubelet. The kubelet
uses this information to pull a private image on behalf of your Pod. See the
PodSpec API for more information about the imagePullSecrets field.
You can learn how to specify ImagePullSecrets from the container images
documentation.
Details
Restrictions
Secret volume sources are validated to ensure that the specified object
reference actually points to an object of type Secret. Therefore, a secret
needs to be created before any Pods that depend on it.
The kubelet only supports the use of secrets for Pods where the secrets are
obtained from the API server. This includes any Pods created using kubectl,
or indirectly via a replication controller. It does not include Pods created as
a result of the kubelet --manifest-url flag, its --config flag, or its REST
API (these are not common ways to create Pods.)
Use cases
Use-Case: As container environment variables
Create a Secret:
apiVersion: v1
kind: Secret
metadata:
name: mysecret
type: Opaque
data:
USER_NAME: YWRtaW4=
PASSWORD: MWYyZDFlMmU2N2Rm
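You could apply this manifest and then consume every key of the Secret as a container environment variable via envFrom, as the Pod below does. A sketch (the file name is illustrative):

kubectl apply -f mysecret.yaml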
apiVersion: v1
kind: Pod
metadata:
name: secret-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "/bin/sh", "-c", "env" ]
envFrom:
- secretRef:
name: mysecret
restartPolicy: Never
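Use-Case: Pod with ssh keys
For this use case you first need a Secret containing SSH key data. A sketch (the key file paths are illustrative):

kubectl create secret generic ssh-key-secret \
  --from-file=ssh-privatekey=/path/to/.ssh/id_rsa \
  --from-file=ssh-publickey=/path/to/.ssh/id_rsa.pub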
Now you can create a Pod which references the secret with the ssh key and
consumes it in a volume:
apiVersion: v1
kind: Pod
metadata:
name: secret-test-pod
labels:
name: secret-test
spec:
volumes:
- name: secret-volume
secret:
secretName: ssh-key-secret
containers:
- name: ssh-test-container
image: mySshImage
volumeMounts:
- name: secret-volume
readOnly: true
mountPath: "/etc/secret-volume"
When the container's command runs, the pieces of the key will be available
in:
/etc/secret-volume/ssh-publickey
/etc/secret-volume/ssh-privatekey
The container is then free to use the secret data to establish an ssh
connection.
Note: Think carefully before sending your own SSH keys: other users of the
cluster may have access to the Secret.
Use-Case: Pods with prod / test credentials
This example illustrates one Pod which consumes a Secret containing production
credentials and another Pod which consumes a Secret with test environment
credentials. After defining the Secrets and Pods (for example, in a
kustomization.yaml), apply all of those objects on the API server:
kubectl apply -k .
Both containers will have the following files present on their filesystems
with the values for each container's environment:
/etc/secret-volume/username
/etc/secret-volume/password
Note how the specs for the two Pods differ only in one field; this facilitates
creating Pods with different capabilities from a common Pod template.
You could further simplify the base Pod specification by using two service
accounts:
apiVersion: v1
kind: Pod
metadata:
name: prod-db-client-pod
labels:
name: prod-db-client
spec:
serviceAccount: prod-db-client
containers:
- name: db-client-container
image: myClientImage
Use-case: dotfiles in a secret volume
You can make your data "hidden" by defining a key that begins with a dot. For
example, when the following Secret is mounted into a volume, secret-volume:
apiVersion: v1
kind: Secret
metadata:
name: dotfile-secret
data:
.secret-file: dmFsdWUtMg0KDQo=
---
apiVersion: v1
kind: Pod
metadata:
name: secret-dotfiles-pod
spec:
volumes:
- name: secret-volume
secret:
secretName: dotfile-secret
containers:
- name: dotfile-test-container
image: k8s.gcr.io/busybox
command:
- ls
- "-l"
- "/etc/secret-volume"
volumeMounts:
- name: secret-volume
readOnly: true
mountPath: "/etc/secret-volume"
The volume will contain a single file, called .secret-file, and the dotfile-
test-container will have this file present at the path /etc/secret-
volume/.secret-file.
Note: Files beginning with dot characters are hidden from the
output of ls -l; you must use ls -la to see them when listing
directory contents.
Use-case: Secret visible to one container in a Pod
Consider a program that needs to handle HTTP requests, do some complex
business logic, and then sign some messages with an HMAC. Because it has
complex application logic, there might be an unnoticed remote file reading
exploit in the server, which could expose the private key to an attacker. This
could be divided into two processes in two containers: a frontend container
which handles user interaction and business logic, but which cannot see the
private key; and a signer container that can see the private key, and responds
to simple signing requests from the frontend.
With this partitioned approach, an attacker now has to trick the application
server into doing something rather arbitrary, which may be harder than
getting it to read a file.
Best practices
Clients that use the Secret API
When deploying applications that interact with the Secret API, you should
limit access using authorization policies such as RBAC.
For these reasons watch and list requests for secrets within a namespace
are extremely powerful capabilities and should be avoided, since listing
secrets allows the clients to inspect the values of all secrets that are in that
namespace. The ability to watch and list all secrets in a cluster should be
reserved for only the most privileged, system-level components.
Applications that need to access the Secret API should perform get requests
on the secrets they need. This lets administrators restrict access to all
secrets while white-listing access to individual instances that the app needs.
For improved performance over a looping get, clients can design resources
that reference a Secret and then watch the resource, re-requesting the Secret
when the reference changes. Additionally, a "bulk watch" API to let clients
watch individual resources has also been proposed, and will likely be
available in future releases of Kubernetes.
Security properties
Protections
Because secrets can be created independently of the Pods that use them,
there is less risk of the secret being exposed during the workflow of
creating, viewing, and editing Pods. The system can also take additional
precautions with Secrets, such as avoiding writing them to disk where
possible.
A secret is only sent to a node if a Pod on that node requires it. The kubelet
stores the secret into a tmpfs so that the secret is not written to disk
storage. Once the Pod that depends on the secret is deleted, the kubelet will
delete its local copy of the secret data as well.
There may be secrets for several Pods on the same node. However, only the
secrets that a Pod requests are potentially visible within its containers.
Therefore, one Pod does not have access to the secrets of another Pod.
You can enable encryption at rest for secret data, so that the secrets are not
stored in the clear into etcd.
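A sketch of what such an encryption configuration for the kube-apiserver can look like (the key name and the base64-encoded key value are placeholders):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}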
Risks
• In the API server, secret data is stored in etcd; therefore:
◦ Administrators should enable encryption at rest for cluster data
(requires v1.13 or later).
◦ Administrators should limit access to etcd to admin users.
◦ Administrators may want to wipe/shred disks used by etcd when
no longer in use.
◦ If running etcd in a cluster, administrators should make sure to
use SSL/TLS for etcd peer-to-peer communication.
• If you configure the secret through a manifest (JSON or YAML) file
which has the secret data encoded as base64, sharing this file or
checking it in to a source repository means the secret is compromised.
Base64 encoding is not an encryption method and is considered the
same as plain text.
• Applications still need to protect the value of secret after reading it
from the volume, such as not accidentally logging it or transmitting it to
an untrusted party.
• A user who can create a Pod that uses a secret can also see the value of
that secret. Even if the API server policy does not allow that user to
read the Secret, the user could run a Pod which exposes the secret.
• Currently, anyone with root permission on any node can read any secret
from the API server, by impersonating the kubelet. It is a planned
feature to only send secrets to nodes that actually require them, to
restrict the impact of a root exploit on a single node.
What's next
• Learn how to manage Secret using kubectl
• Learn how to manage Secret using config file
• Learn how to manage Secret using kustomize
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Last modified November 13, 2020 at 11:16 PM PST: Env var are not updated
after a secret update (d77262740)
Edit this page Create child page Create an issue
• Overview of Secrets
• Types of Secret
◦ Opaque secrets
◦ Service account token Secrets
◦ Docker config Secrets
◦ Basic authentication Secret
◦ SSH authentication secrets
◦ TLS secrets
◦ Bootstrap token Secrets
• Creating a Secret
• Editing a Secret
• Using Secrets
◦ Using Secrets as files from a Pod
◦ Using Secrets as environment variables
• Immutable Secrets
◦ Using imagePullSecrets
◦ Arranging for imagePullSecrets to be automatically attached
◦ Automatic mounting of manually created Secrets
• Details
◦ Restrictions
◦ Secret and Pod lifetime interaction
• Use cases
◦ Use-Case: As container environment variables
◦ Use-Case: Pod with ssh keys
◦ Use-Case: Pods with prod / test credentials
◦ Use-case: dotfiles in a secret volume
◦ Use-case: Secret visible to one container in a Pod
• Best practices
◦ Clients that use the Secret API
• Security properties
◦ Protections
◦ Risks
• What's next
Managing Resources for
Containers
When you specify a Pod, you can optionally specify how much of each
resource a Container needs. The most common resources to specify are CPU
and memory (RAM); there are others.
When you specify the resource request for Containers in a Pod, the
scheduler uses this information to decide which node to place the Pod on.
When you specify a resource limit for a Container, the kubelet enforces
those limits so that the running container is not allowed to use more of that
resource than the limit you set. The kubelet also reserves at least the
request amount of that system resource specifically for that container to
use.
For example, if you set a memory request of 256 MiB for a container, and that
container is in a Pod scheduled to a Node with 8GiB of memory and no other
Pods, then the container can try to use more RAM.
If you set a memory limit of 4GiB for that Container, the kubelet (and
container runtime) enforce the limit. The runtime prevents the container
from using more than the configured resource limit. For example: when a
process in the container tries to consume more than the allowed amount of
memory, the system kernel terminates the process that attempted the
allocation, with an out of memory (OOM) error.
Note: If a Container specifies its own memory limit, but does not
specify a memory request, Kubernetes automatically assigns a
memory request that matches the limit. Similarly, if a Container
specifies its own CPU limit, but does not specify a CPU request,
Kubernetes automatically assigns a CPU request that matches the
limit.
Resource types
CPU and memory are each a resource type. A resource type has a base unit.
CPU represents compute processing and is specified in units of Kubernetes
CPUs. Memory is specified in units of bytes. If you're using Kubernetes
v1.14 or newer, you can specify huge page resources. Huge pages are a
Linux-specific feature where the node kernel allocates blocks of memory
that are much larger than the default page size.
For example, on a system where the default page size is 4KiB, you could
specify a limit, hugepages-2Mi: 80Mi. If the container tries allocating over
40 2MiB huge pages (a total of 80 MiB), that allocation fails.
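A sketch of how such a huge page limit could look inside a container's resources block (the Pod name, container name, and image are illustrative; requests and limits for huge pages are shown equal):

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: example
    image: images.my-company.example/app:v4
    resources:
      requests:
        hugepages-2Mi: 80Mi
        memory: 128Mi
      limits:
        hugepages-2Mi: 80Mi
        memory: 128Mi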
Each Container of a Pod can specify one or more of the following:
• spec.containers[].resources.limits.cpu
• spec.containers[].resources.limits.memory
• spec.containers[].resources.limits.hugepages-<size>
• spec.containers[].resources.requests.cpu
• spec.containers[].resources.requests.memory
• spec.containers[].resources.requests.hugepages-<size>
Meaning of memory
Limits and requests for memory are measured in bytes. You can express
memory as a plain integer or as a fixed-point number using one of these
suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei,
Pi, Ti, Gi, Mi, Ki. For example, the following represent roughly the same
value: 128974848, 129e6, 129M, 123Mi.
Here's an example. The following Pod has two Containers. Each Container
has a request of 0.25 cpu and 64MiB (2²⁶ bytes) of memory. Each Container
has a limit of 0.5 cpu and 128MiB of memory. You can say the Pod has a
request of 0.5 cpu and 128 MiB of memory, and a limit of 1 cpu and 256MiB
of memory.
apiVersion: v1
kind: Pod
metadata:
name: frontend
spec:
containers:
- name: app
image: images.my-company.example/app:v4
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
- name: log-aggregator
image: images.my-company.example/log-aggregator:v6
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
If a Container exceeds its memory request, it is likely that its Pod will be
evicted whenever the node runs out of memory.
A Container might or might not be allowed to exceed its CPU limit for
extended periods of time. However, it will not be killed for excessive CPU
usage.
Pods use ephemeral local storage for scratch space, caching, and for logs.
The kubelet can provide scratch space to Pods using local ephemeral
storage to mount emptyDir volumes into containers.
The kubelet also uses this kind of storage to hold node-level container logs,
container images, and the writable layers of running containers.
As a beta feature, Kubernetes lets you track, reserve and limit the amount of
ephemeral local storage a Pod can consume.
• Single filesystem
• Two filesystems
In this configuration, you place all different kinds of ephemeral local data
(emptyDir volumes, writeable layers, container images, logs) into one
filesystem. The most effective way to configure the kubelet is to dedicate
this filesystem to Kubernetes (kubelet) data.
The kubelet also writes node-level container logs and treats these similarly
to ephemeral local storage.
The kubelet writes logs to files inside its configured log directory (/var/log
by default); and has a base directory for other locally stored data (/var/
lib/kubelet by default).
Your node can have as many other filesystems, not used for Kubernetes, as
you like.
You have a filesystem on the node that you're using for ephemeral data that
comes from running Pods: logs, and emptyDir volumes. You can use this
filesystem for other data (for example: system logs not related to
Kubernetes); it can even be the root filesystem.
The kubelet also writes node-level container logs into the first filesystem,
and treats these similarly to ephemeral local storage.
The first filesystem does not hold any image layers or writeable layers.
Your node can have as many other filesystems, not used for Kubernetes, as
you like.
The kubelet can measure how much local storage it is using. It does this
provided that the LocalStorageCapacityIsolation feature gate is enabled (the
feature is on by default), and you have set up the node using one of the
supported configurations for local ephemeral storage.
If you have a different configuration, then the kubelet does not apply
resource limits for ephemeral local storage.
• spec.containers[].resources.limits.ephemeral-storage
• spec.containers[].resources.requests.ephemeral-storage
Limits and requests for ephemeral-storage are measured in bytes. You can
express storage as a plain integer or as a fixed-point number using one of
these suffixes: E, P, T, G, M, K. You can also use the power-of-two
equivalents: Ei, Pi, Ti, Gi, Mi, Ki. For example, the following represent
roughly the same value: 128974848, 129e6, 129M, 123Mi.
In the following example, the Pod has two Containers. Each Container has a
request of 2GiB of local ephemeral storage. Each Container has a limit of
4GiB of local ephemeral storage. Therefore, the Pod has a request of 4GiB of
local ephemeral storage, and a limit of 8GiB of local ephemeral storage.
apiVersion: v1
kind: Pod
metadata:
name: frontend
spec:
containers:
- name: app
image: images.my-company.example/app:v4
resources:
requests:
ephemeral-storage: "2Gi"
limits:
ephemeral-storage: "4Gi"
- name: log-aggregator
image: images.my-company.example/log-aggregator:v6
resources:
requests:
ephemeral-storage: "2Gi"
limits:
ephemeral-storage: "4Gi"
The scheduler ensures that the sum of the resource requests of the
scheduled Containers is less than the capacity of the node.
If a Pod is using more ephemeral storage than you allow it to, the kubelet
sets an eviction signal that triggers Pod eviction.
For pod-level isolation the kubelet works out an overall Pod storage limit by
summing the limits for the containers in that Pod. In this case, if the sum of
the local ephemeral storage usage from all containers and also the Pod's
emptyDir volumes exceeds the overall Pod storage limit, then the kubelet also
marks the Pod for eviction.
Caution:
If the kubelet is not measuring local ephemeral storage, then a
Pod that exceeds its local storage limit will not be evicted for
breaching local storage resource limits.
• Periodic scanning
• Filesystem project quota
The kubelet performs regular, scheduled checks that scan each emptyDir
volume, container log directory, and writeable container layer.
Note:
In this mode, the kubelet does not track open file descriptors for
deleted files.
Note: Project quotas let you monitor storage use; they do not
enforce limits.
Kubernetes uses project IDs starting from 1048576. The IDs in use are
registered in /etc/projects and /etc/projid. If project IDs in this range
are used for other purposes on the system, those project IDs must be
registered in /etc/projects and /etc/projid so that Kubernetes does not
use them.
Quotas are faster and more accurate than directory scanning. When a
directory is assigned to a project, all files created under a directory are
created in that project, and the kernel merely has to keep track of how many
blocks are in use by files in that project.
If a file is created and deleted, but has an open file descriptor, it continues to
consume space. Quota tracking records that space accurately whereas
directory scans overlook the storage used by deleted files.
• Ensure that the root filesystem (or optional runtime filesystem) has
project quotas enabled. All XFS filesystems support project quotas. For
ext4 filesystems, you need to enable the project quota tracking feature
while the filesystem is not mounted.
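For example, for an ext4 filesystem on an unmounted block device, you could enable the project quota feature like this (a sketch; the device path is illustrative):

# For ext4, with /dev/block-device not mounted
sudo tune2fs -O project -Q prjquota /dev/block-device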
Extended resources
Extended resources are fully-qualified resource names outside the
kubernetes.io domain. They allow cluster operators to advertise, and users to
consume, non-Kubernetes-built-in resources.
There are two steps required to use Extended Resources. First, the cluster
operator must advertise an Extended Resource. Second, users must request
the Extended Resource in Pods.
See Device Plugin for how to advertise device plugin managed resources on
each node.
Other resources
Example:
Here is an example showing how to use curl to form an HTTP request that
advertises five "example.com/foo" resources on node k8s-node-1 whose
master is k8s-master.
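A sketch of that request, based on the description above (the node and master names are the ones mentioned; ~1 is the JSON-Patch encoding of the / in example.com/foo):

curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "5"}]' \
  http://k8s-master:8080/api/v1/nodes/k8s-node-1/status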
Cluster-level extended resources are not tied to nodes. They are usually
managed by scheduler extenders, which handle the resource consumption
and resource quota.
You can specify the extended resources that are handled by scheduler
extenders in scheduler policy configuration.
Example:
The following configuration for a scheduler policy indicates that the cluster-
level extended resource "example.com/foo" is handled by the scheduler
extender.
• The scheduler sends a Pod to the scheduler extender only if the Pod
requests "example.com/foo".
• The ignoredByScheduler field specifies that the scheduler does not
check the "example.com/foo" resource in its PodFitsResources
predicate.
{
"kind": "Policy",
"apiVersion": "v1",
"extenders": [
{
"urlPrefix":"<extender-endpoint>",
"bindVerb": "bind",
"managedResources": [
{
"name": "example.com/foo",
"ignoredByScheduler": true
}
]
}
]
}
A Pod is scheduled only if all of the resource requests are satisfied, including
CPU, memory and any extended resources. The Pod remains in the PENDING
state as long as the resource request cannot be satisfied.
Example:
apiVersion: v1
kind: Pod
metadata:
name: my-pod
spec:
containers:
- name: my-container
image: myimage
resources:
requests:
cpu: 2
example.com/foo: 1
limits:
example.com/foo: 1
PID limiting
Process ID (PID) limits allow for the configuration of a kubelet to limit the
number of PIDs that a given Pod can consume. See Pid Limiting for
information.
Troubleshooting
My Pods are pending with event message failedScheduling
If the scheduler cannot find any node where a Pod can fit, the Pod remains
unscheduled until a place can be found. An event is produced each time the
scheduler fails to find a place for the Pod, like this:
Events:
FirstSeen LastSeen Count From Subobject
PathReason Message
36s 5s 6 {scheduler }
FailedScheduling Failed for reason PodExceedsFreeCPU and
possibly others
You can check node capacities and amounts allocated with the kubectl
describe nodes command. For example:
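For the node shown in the output below, the command would be:

kubectl describe nodes e2e-test-node-pool-4lw4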
Name: e2e-test-node-pool-4lw4
[ ... lines removed for clarity ...]
Capacity:
cpu: 2
memory: 7679792Ki
pods: 110
Allocatable:
cpu: 1800m
memory: 7474992Ki
pods: 110
[ ... lines removed for clarity ...]
Non-terminated Pods: (5 in total)
Namespace Name CPU
Requests CPU Limits Memory Requests Memory Limits
--------- ----
------------ ---------- --------------- -------------
kube-system fluentd-gcp-v1.38-28bv1 100m
(5%) 0 (0%) 200Mi (2%) 200Mi (2%)
kube-system kube-dns-3297075139-61lj3 260m
(13%) 0 (0%) 100Mi (1%) 170Mi (2%)
kube-system kube-proxy-e2e-test-... 100m
(5%) 0 (0%) 0 (0%) 0 (0%)
kube-system monitoring-influxdb-grafana-v4-z1m12 200m
(10%) 200m (10%) 600Mi (8%) 600Mi (8%)
kube-system node-problem-detector-v0.1-fj7m3 20m
(1%) 200m (10%) 20Mi (0%) 100Mi (1%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
680m (34%) 400m (20%) 920Mi (11%) 1070Mi (13%)
In the preceding output, you can see that if a Pod requests more than 1120m
CPUs or 6.23Gi of memory, it will not fit on the node.
By looking at the Pods section, you can see which Pods are taking up space
on the node.
The amount of resources available to Pods is less than the node capacity,
because system daemons use a portion of the available resources. The
allocatable field of NodeStatus gives the amount of resources that are
available to Pods. For more information, see Node Allocatable Resources.
The resource quota feature can be configured to limit the total amount of
resources that can be consumed. If used in conjunction with namespaces, it
can prevent one team from hogging all the resources.
My Container is terminated
Your Container might get terminated because it is resource-starved. To
check whether a Container is being killed because it is hitting a resource
limit, call kubectl describe pod on the Pod of interest:
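For the Pod shown in the output below:

kubectl describe pod simmemleak-hra99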
Name: simmemleak-hra99
Namespace: default
Image(s): saadali/simmemleak
Node: kubernetes-node-tf0f/
10.240.216.66
Labels: name=simmemleak
Status: Running
Reason:
Message:
IP: 10.244.2.75
Replication Controllers: simmemleak (1/1 replicas created)
Containers:
simmemleak:
Image: saadali/simmemleak
Limits:
cpu: 100m
memory: 50Mi
State: Running
Started: Tue, 07 Jul 2015 12:54:41 -0700
Last Termination State: Terminated
Exit Code: 1
Started: Fri, 07 Jul 2015 12:54:30 -0700
Finished: Fri, 07 Jul 2015 12:54:33 -0700
Ready: False
Restart Count: 5
Conditions:
Type Status
Ready False
Events:
FirstSeen
LastSeen Count
From
SubobjectPath Reason Message
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51
-0700 1
{scheduler }
scheduled Successfully assigned simmemleak-hra99 to
kubernetes-node-tf0f
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51
-0700 1 {kubelet kubernetes-node-tf0f} implicitly
required container POD pulled Pod container image
"k8s.gcr.io/pause:0.8.0" already present on machine
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51
-0700 1 {kubelet kubernetes-node-tf0f} implicitly
required container POD created Created with docker id
6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51
-0700 1 {kubelet kubernetes-node-tf0f} implicitly
required container POD started Started with docker id
6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51
-0700 1 {kubelet kubernetes-node-tf0f}
spec.containers{simmemleak} created Created with
docker id 87348f12526a
In the preceding example, the Restart Count: 5 indicates that the
simmemleak Container in the Pod was terminated and restarted five times.
You can call kubectl get pod with the -o go-template=... option to fetch
the status of previously terminated Containers:
kubectl get pod -o go-template='{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}' simmemleak-hra99
You can see that the Container was terminated because of reason:OOM
Killed, where OOM stands for Out Of Memory.
What's next
• Get hands-on experience assigning Memory resources to Containers
and Pods.
• For more details about the difference between requests and limits, see
Resource QoS.
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Last modified December 17, 2020 at 8:43 PM PST: Fix typo in manage-
resources-containers.md (815fe3790)
Edit this page Create child page Create an issue
With kubeconfig files, you can organize your clusters, users, and
namespaces. You can also define contexts to quickly and easily switch
between clusters and namespaces.
Context
A context element in a kubeconfig file is used to group access parameters
under a convenient name. Each context has three parameters: cluster,
namespace, and user. By default, the kubectl command-line tool uses
parameters from the current context to communicate with the cluster.
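A minimal kubeconfig sketch showing one context (all names and the server URL are illustrative):

apiVersion: v1
kind: Config
clusters:
- name: development
  cluster:
    server: https://1.2.3.4:6443
contexts:
- name: dev-frontend
  context:
    cluster: development
    namespace: frontend
    user: developer
users:
- name: developer
  user: {}
current-context: dev-frontend

You could then switch to that context with kubectl config use-context dev-frontend.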
Here are the rules that kubectl uses when it merges kubeconfig files:
1. If the --kubeconfig flag is set, use only the specified file. Do not
merge. Only one instance of this flag is allowed.
2. Determine the context to use based on the first hit in this chain:
3. Determine the cluster and user. At this point, there might or might not
be a context. Determine the cluster and user based on the first hit in
this chain, which is run twice: once for user and once for cluster:
6. For any information still missing, use default values and potentially
prompt for authentication information.
File references
File and path references in a kubeconfig file are relative to the location of
the kubeconfig file. File references on the command line are relative to the
current working directory. In $HOME/.kube/config, relative paths are stored
relatively, and absolute paths are stored absolutely.
What's next
• Configure Access to Multiple Clusters
• kubectl config
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Last modified May 30, 2020 at 3:10 PM PST: add en pages (ecc27bbbe)
Edit this page Create child page Create an issue
Pods can have priority. Priority indicates the importance of a Pod relative to
other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt
(evict) lower priority Pods to make scheduling of the pending Pod possible.
Warning:
In a cluster where not all users are trusted, a malicious user could
create Pods at the highest possible priorities, causing other Pods
to be evicted/not get scheduled. An administrator can use
ResourceQuota to prevent users from creating pods at high
priorities.
PriorityClass
A PriorityClass is a non-namespaced object that defines a mapping from a
priority class name to the integer value of the priority. The name is specified
in the name field of the PriorityClass object's metadata. The value is specified
in the required value field. The higher the value, the higher the priority. The
name of a PriorityClass object must be a valid DNS subdomain name, and it
cannot be prefixed with system-.
A PriorityClass object can have any 32-bit integer value smaller than or
equal to 1 billion. Larger numbers are reserved for critical system Pods that
should not normally be preempted or evicted. A cluster admin should create
one PriorityClass object for each such mapping that they want.
• If you delete a PriorityClass, existing Pods that use the name of the
deleted PriorityClass remain unchanged, but you cannot create more
Pods that use the name of the deleted PriorityClass.
Example PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service
pods only."
Non-preempting PriorityClass
FEATURE STATE: Kubernetes v1.19 [beta]
An example use case is for data science workloads. A user may submit a job
that they want to be prioritized above other workloads, but do not wish to
discard existing work by preempting running pods. The high priority job
with PreemptionPolicy: Never will be scheduled ahead of other queued
pods, as soon as sufficient cluster resources "naturally" become free.
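A sketch of a non-preempting PriorityClass (the name and description are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."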
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
priorityClassName: high-priority
Preemption
When Pods are created, they go to a queue and wait to be scheduled. The
scheduler picks a Pod from the queue and tries to schedule it on a Node. If
no Node is found that satisfies all the specified requirements of the Pod,
preemption logic is triggered for the pending Pod. Let's call the pending Pod
P. Preemption logic tries to find a Node where removal of one or more Pods
with lower priority than P would enable P to be scheduled on that Node. If
such a Node is found, one or more lower priority Pods get evicted from the
Node. After the Pods are gone, P can be scheduled on the Node.
Please note that Pod P is not necessarily scheduled to the "nominated Node".
After victim Pods are preempted, they get their graceful termination period.
If another node becomes available while scheduler is waiting for the victim
Pods to terminate, scheduler will use the other node to schedule Pod P. As a
result nominatedNodeName and nodeName of Pod spec are not always the
same. Also, if scheduler preempts Pods on Node N, but then a higher priority
Pod than Pod P arrives, scheduler may give Node N to the new higher
priority Pod. In such a case, scheduler clears nominatedNodeName of Pod P.
By doing this, scheduler makes Pod P eligible to preempt Pods on another
Node.
Limitations of preemption
When Pods are preempted, the victims get their graceful termination period.
They have that much time to finish their work and exit. If they don't, they
are killed. This graceful termination period creates a time gap between the
point that the scheduler preempts Pods and the time when the pending Pod
(P) can be scheduled on the Node (N). In the meantime, the scheduler keeps
scheduling other pending Pods. As victims exit or get terminated, the
scheduler tries to schedule Pods in the pending queue. Therefore, there is
usually a time gap between the point that scheduler preempts victims and
the time that Pod P is scheduled. In order to minimize this gap, one can set
graceful termination period of lower priority Pods to zero or a small number.
A Node is considered for preemption only when the answer to this question
is yes: "If all the Pods with lower priority than the pending Pod are removed
from the Node, can the pending Pod be scheduled on the Node?"
Suppose the pending Pod P has inter-Pod anti-affinity with a lower-priority Pod
Q that is running on another Node in the same topology zone as the candidate
Node N. If Pod Q were removed from its Node, the Pod anti-affinity violation
would be gone, and Pod P could possibly be scheduled on Node N; however, the
scheduler only considers preempting Pods on the candidate Node itself, so it
does not perform this kind of cross-node preemption.
Troubleshooting
Pod priority and pre-emption can have unwanted side effects. Here are some
examples of potential problems and ways to deal with them.
To address the problem, you can change the priorityClassName for those
Pods to use lower priority classes, or leave that field empty. An empty
priorityClassName is resolved to zero by default.
When a Pod is preempted, there will be events recorded for the preempted
Pod. Preemption should happen only when a cluster does not have enough
resources for a Pod. In such cases, preemption happens only when the
priority of the pending Pod (preemptor) is higher than the victim Pods.
Preemption must not happen when there is no pending Pod, or when the
pending Pods have equal or lower priority than the victims. If preemption
happens in such scenarios, please file an issue.
While the preemptor Pod is waiting for the victims to go away, a higher
priority Pod may be created that fits on the same Node. In this case, the
scheduler will schedule the higher priority Pod instead of the preemptor.
This is expected behavior: the Pod with the higher priority should take the
place of a Pod with a lower priority.
When there are multiple nodes available for preemption, the scheduler tries
to choose the node with a set of Pods with lowest priority. However, if such
Pods have PodDisruptionBudget that would be violated if they are
preempted then the scheduler may choose another node with higher priority
Pods.
When multiple nodes exist for preemption and none of the above scenarios
apply, the scheduler chooses a node with the lowest priority.
The only component that considers both QoS and Pod priority is kubelet out-
of-resource eviction. The kubelet ranks Pods for eviction first by whether or
not their usage of the starved resource exceeds requests, then by Priority,
and then by the consumption of the starved compute resource relative to the
Pods' scheduling requests. See evicting end-user pods for more details.
kubelet out-of-resource eviction does not evict Pods when their usage does
not exceed their requests. If a Pod with lower priority is not exceeding its
requests, it won't be evicted. Another Pod with higher priority that exceeds
its requests may be evicted.
What's next
• Read about using ResourceQuotas in connection with PriorityClasses:
limit Priority Class consumption by default
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Last modified October 07, 2020 at 7:16 PM PST: Revise cluster management
task (59dcd57cc)
Edit this page Create child page Create an issue
Cloud
In many ways, the Cloud (or co-located servers, or the corporate datacenter)
is the trusted computing base of a Kubernetes cluster. If the Cloud layer is
vulnerable (or configured in a vulnerable way) then there is no guarantee
that the components built on top of this base are secure. Each cloud
provider makes security recommendations for running workloads securely in
their environment.
Infrastructure security
Suggestions for securing your infrastructure in a Kubernetes cluster:
Network access to API Server (Control plane): All access to the Kubernetes
control plane is not allowed publicly on the internet and is controlled by
network access control lists restricted to the set of IP addresses needed to
administer the cluster.
Network access to Nodes (nodes): Nodes should be configured to only accept
connections (via network access control lists) from the control plane on the
specified ports, and to accept connections for services in Kubernetes of type
NodePort and LoadBalancer. If possible, these nodes should not be exposed on
the public internet entirely.
Kubernetes access to Cloud Provider API: Each cloud provider needs to grant a
different set of permissions to the Kubernetes control plane and nodes. It is
best to provide the cluster with cloud provider access that follows the
principle of least privilege for the resources it needs to administer. The
Kops documentation provides information about IAM policies and roles.
Access to etcd: Access to etcd (the datastore of Kubernetes) should be limited
to the control plane only. Depending on your configuration, you should attempt
to use etcd over TLS. More information can be found in the etcd documentation.
etcd Encryption: Wherever possible it's a good practice to encrypt all drives
at rest, but since etcd holds the state of the entire cluster (including
Secrets) its disk should especially be encrypted at rest.
Cluster
There are two areas of concern for securing Kubernetes: securing the cluster
components that are configurable, and securing the applications which run in
the cluster.
Container
Container security is outside the scope of this guide. Here are general
recommendations and links to explore this topic:
Code
Application code is one of the primary attack surfaces over which you have
the most control. While securing application code is outside of the
Kubernetes security topic, here are recommendations to protect application
code:
Code security
Access over TLS only: If your code needs to communicate by TCP, perform a TLS
handshake with the client ahead of time. With the exception of a few cases,
encrypt everything in transit. Going one step further, it's a good idea to
encrypt network traffic between services. This can be done through a process
known as mutual TLS (mTLS), which performs a two-sided verification of
communication between two certificate-holding services.
Limiting port ranges of communication: This recommendation may be a bit
self-explanatory, but wherever possible you should only expose the ports on
your service that are absolutely essential for communication or metric
gathering.
3rd Party Dependency Security: It is a good practice to regularly scan your
application's third party libraries for known security vulnerabilities. Each
programming language has a tool for performing this check automatically.
Static Code Analysis: Most languages provide a way for a snippet of code to be
analyzed for any potentially unsafe coding practices. Whenever possible you
should perform checks using automated tooling that can scan codebases for
common security errors. Some of the tools can be found at
https://owasp.org/www-community/Source_Code_Analysis_Tools
Dynamic probing attacks: There are a few automated tools that you can run
against your service to try some of the well known service attacks. These
include SQL injection, CSRF, and XSS. One of the most popular dynamic analysis
tools is the OWASP Zed Attack proxy tool.
What's next
Learn about related Kubernetes security topics:
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Policy Types
There is an immediate need for base policy definitions to broadly cover the
security spectrum. These should range from highly restricted to highly
flexible:
Baseline/Default
The Baseline/Default policy is aimed at ease of adoption for common
containerized workloads while preventing known privilege escalations. This
policy is targeted at application operators and developers of non-critical
applications. The following listed controls should be enforced/disallowed:
Control: Host Namespaces
Policy: Sharing the host namespaces must be disallowed.
Restricted Fields:
spec.hostNetwork
spec.hostPID
spec.hostIPC

Control: Capabilities
Restricted Fields:
spec.containers[*].securityContext.capabilities.add
spec.initContainers[*].securityContext.capabilities.add

Control: HostPath Volumes
Restricted Fields:
spec.volumes[*].hostPath

Control: Host Ports
Restricted Fields:
spec.containers[*].ports[*].hostPort
spec.initContainers[*].ports[*].hostPort

Control: SELinux (optional)
Restricted Fields:
spec.securityContext.seLinuxOptions
spec.containers[*].securityContext.seLinuxOptions
spec.initContainers[*].securityContext.seLinuxOptions

Control: /proc Mount Type
Restricted Fields:
spec.containers[*].securityContext.procMount
spec.initContainers[*].securityContext.procMount

Control: Sysctls
Restricted Fields:
spec.securityContext.sysctls
Allowed Values:
kernel.shm_rmid_forced
net.ipv4.ip_local_port_range
net.ipv4.tcp_syncookies
net.ipv4.ping_group_range
undefined/empty
Restricted
The Restricted policy is aimed at enforcing current Pod hardening best
practices, at the expense of some compatibility. It is targeted at operators
and developers of security-critical applications, as well as lower-trust
users. The following listed controls should be enforced/disallowed:
Everything from the default profile.

Control: Volume Types
Policy: In addition to restricting HostPath volumes, the restricted profile
limits usage of non-core volume types to those defined through
PersistentVolumes.
Restricted Fields:
spec.volumes[*].hostPath
spec.volumes[*].gcePersistentDisk
spec.volumes[*].awsElasticBlockStore
spec.volumes[*].gitRepo
spec.volumes[*].nfs
spec.volumes[*].iscsi
spec.volumes[*].glusterfs
spec.volumes[*].rbd
spec.volumes[*].flexVolume
spec.volumes[*].cinder
spec.volumes[*].cephFS
spec.volumes[*].flocker
spec.volumes[*].fc
spec.volumes[*].azureFile
spec.volumes[*].vsphereVolume
spec.volumes[*].quobyte
spec.volumes[*].azureDisk
spec.volumes[*].portworxVolume
spec.volumes[*].scaleIO
spec.volumes[*].storageos
spec.volumes[*].csi

Control: Running as Non-root
Restricted Fields:
spec.securityContext.runAsNonRoot
spec.containers[*].securityContext.runAsNonRoot
spec.initContainers[*].securityContext.runAsNonRoot

Control: Non-root groups (optional)
Restricted Fields:
spec.securityContext.runAsGroup
spec.securityContext.supplementalGroups[*]
spec.securityContext.fsGroup
spec.containers[*].securityContext.runAsGroup
spec.initContainers[*].securityContext.runAsGroup
Allowed Values:
non-zero
undefined / nil (except for `*.runAsGroup`)

Control: Seccomp
Policy: The RuntimeDefault seccomp profile must be required, or allow specific
additional profiles.
Restricted Fields:
spec.securityContext.seccompProfile.type
spec.containers[*].securityContext.seccompProfile
spec.initContainers[*].securityContext.seccompProfile
Allowed Values:
'runtime/default'
undefined / nil
Policy Instantiation
Decoupling policy definition from policy instantiation allows for a common
understanding and consistent language of policies across clusters,
independent of the underlying enforcement mechanism.
PodSecurityPolicy
• Privileged
• Baseline
• Restricted
FAQ
Why isn't there a profile between privileged and default?
The three profiles defined here have a clear linear progression from most
secure (restricted) to least secure (privileged), and cover a broad set of
workloads. Privileges required above the baseline policy are typically very
application specific, so we do not offer a standard profile in this niche. This
is not to say that the privileged profile should always be used in this case,
but that policies in this space need to be defined on a case-by-case basis.
SIG Auth may reconsider this position in the future, should a clear need for
other profiles arise.
What about sandboxed Pods?
The protections necessary for sandboxed workloads can differ from others.
For example, the need to restrict privileged permissions is lessened when
the workload is isolated from the underlying kernel. This allows for
workloads requiring heightened permissions to still be isolated.
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Last modified September 19, 2020 at 4:48 PM PST: Contex to Context
(70eba58d3)
Edit this page Create child page Create an issue
• Policy Types
• Policies
◦ Privileged
◦ Baseline/Default
◦ Restricted
• Policy Instantiation
• FAQ
◦ Why isn't there a profile between privileged and default?
◦ What's the difference between a security policy and a security
context?
◦ What profiles should I apply to my Windows Pods?
◦ What about sandboxed Pods?
If your cluster uses a private certificate authority, you need a copy of that CA
certificate configured into your ~/.kube/config on the client, so that you
can trust the connection and be confident it was not intercepted.
Authentication
Once TLS is established, the HTTP request moves to the Authentication
step. This is shown as step 1 in the diagram. The cluster creation script or
cluster admin configures the API server to run one or more Authenticator
modules. Authenticators are described in more detail in Authentication.
The input to the authentication step is the entire HTTP request; however, it
typically just examines the headers and/or client certificate.
Authorization
After the request is authenticated as coming from a specific user, the
request must be authorized. This is shown as step 2 in the diagram.
A request must include the username of the requester, the requested action,
and the object affected by the action. The request is authorized if an existing
policy declares that the user has permissions to complete the requested
action.
For example, if Bob has the policy below, then he can read pods only in the
namespace projectCaribou:
{
"apiVersion": "abac.authorization.kubernetes.io/v1beta1",
"kind": "Policy",
"spec": {
"user": "bob",
"namespace": "projectCaribou",
"resource": "pods",
"readonly": true
}
}
If Bob makes the following request, Bob is authorized because he is allowed to
read objects in the projectCaribou namespace:
{
"apiVersion": "authorization.k8s.io/v1beta1",
"kind": "SubjectAccessReview",
"spec": {
"resourceAttributes": {
"namespace": "projectCaribou",
"verb": "get",
"group": "unicorn.example.org",
"resource": "pods"
}
}
}
If Bob makes a request to write (create or update) to the objects in the
projectCaribou namespace, his authorization is denied. If Bob makes a request
to read (get) objects in a different namespace such as projectFish, then his
authorization is denied.
API server ports and IPs
The Kubernetes API server serves HTTP on two ports:
1. localhost port: intended for testing and bootstrap, and for other control
plane components on the master node; served without TLS, and requests bypass
the authentication and authorization modules.
2. "Secure port": used whenever possible; serves TLS and runs requests through
the authentication and authorization modules before admission control.
What's next
Read more documentation on authentication, authorization and API access
control:
• Authenticating
◦ Authenticating with Bootstrap Tokens
• Admission Controllers
◦ Dynamic Admission Control
• Authorization
◦ Role Based Access Control
◦ Attribute Based Access Control
◦ Node Authorization
◦ Webhook Authorization
• Certificate Signing Requests
◦ including CSR approval and certificate signing
• Service accounts
◦ Developer guide
◦ Administration
Feedback
Was this page helpful?
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
• Transport security
• Authentication
• Authorization
• Admission control
• API server ports and IPs
• What's next
Policies
Policies you can configure that apply to groups of resources.
Limit Ranges
Resource Quotas
Limit Ranges
By default, containers run with unbounded compute resources on a
Kubernetes cluster. With resource quotas, cluster administrators can restrict
resource consumption and creation on a namespace basis. Within a
namespace, a Pod or Container can consume as much CPU and memory as
defined by the namespace's resource quota. There is a concern that one Pod
or Container could monopolize all available resources. A LimitRange is a
policy to constrain resource allocations (to Pods or Containers) in a
namespace.
Enabling LimitRange
LimitRange support has been enabled by default since Kubernetes 1.10.
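As an example of the kind of constraint a LimitRange can express, the following sketch sets a default memory request and limit for Containers in a namespace (the name and values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container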
In the case where the total limits of the namespace are less than the sum of
the limits of the Pods/Containers, there may be contention for resources. In
this case, the Containers or Pods will not be created.
What's next
Refer to the LimitRanger design document for more information.
Yes No
Thanks for the feedback. If you have a specific, answerable question about
how to use Kubernetes, ask it on Stack Overflow. Open an issue in the
GitHub repo if you want to report a problem or suggest an improvement.
Last modified July 17, 2020 at 4:11 PM PST: Fix links in concepts section (2)
(c8f470487)
Edit this page Create child page Create an issue
• Enabling LimitRange
◦ Overview of Limit Range
• What's next
Resource Quotas
When several users or teams share a cluster with a fixed number of nodes,
there is a concern that one team could use more than its fair share of
resources.
• Users create resources (pods, services, etc.) in the namespace, and the
quota system tracks usage to ensure it does not exceed hard resource
limits defined in a ResourceQuota.
In the case where the total capacity of the cluster is less than the sum of the
quotas of the namespaces, there may be contention for resources. This is
handled on a first-come-first-served basis.
• requests.nvidia.com/gpu: 4
For example, if an operator wants to quota storage with gold storage class
separate from bronze storage class, the operator can define a quota as
follows:
• gold.storageclass.storage.k8s.io/requests.storage: 500Gi
• bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
Here is an example set of resources users may want to put under object
count quota:
• count/persistentvolumeclaims
• count/services
• count/secrets
• count/configmaps
• count/replicationcontrollers
• count/deployments.apps
• count/replicasets.apps
• count/statefulsets.apps
• count/jobs.batch
• count/cronjobs.batch
The same syntax can be used for custom resources. For example, to create a
quota on a widgets custom resource in the example.com API group, use
count/widgets.example.com.
When using count/* resource quota, an object is charged against the quota
if it exists in server storage. These types of quotas are useful to protect
against exhaustion of storage resources. For example, you may want to limit
the number of Secrets in a server given their large size. Too many Secrets in
a cluster can actually prevent servers and controllers from starting. You can
set a quota for Jobs to protect against a poorly configured CronJob. CronJobs
that create too many Jobs in a namespace can lead to a denial of service.
For example, pods quota counts and enforces a maximum on the number of
pods created in a single namespace that are not terminal. You might want to
set a pods quota on a namespace to avoid the case where a user creates
many small pods and exhausts the cluster's supply of Pod IPs.
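A sketch of an object count quota that combines some of the resources listed above (the quota name and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    count/secrets: "10"
    count/configmaps: "20"
    pods: "40"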
Quota Scopes
Each quota can have an associated set of scopes. A quota will only measure
usage for a resource if it matches the intersection of enumerated scopes.
Scope: Description
• Terminating: Match pods where .spec.activeDeadlineSeconds >= 0
• NotTerminating: Match pods where .spec.activeDeadlineSeconds is nil
• BestEffort: Match pods that have best effort quality of service.
• NotBestEffort: Match pods that do not have best effort quality of service.
• PriorityClass: Match pods that reference the specified priority class.
The BestEffort scope restricts a quota to tracking the following resource:
• pods
The Terminating, NotTerminating, NotBestEffort and PriorityClass
scopes restrict a quota to tracking the following resources:
• pods
• cpu
• memory
• requests.cpu
• requests.memory
• limits.cpu
• limits.memory
Note that you cannot specify both the Terminating and the NotTerminating
scopes in the same quota, and you cannot specify both the BestEffort and N
otBestEffort scopes in the same quota either.
The scopeSelector supports the following values in the operator field:
• In
• NotIn
• Exists
• DoesNotExist
When using one of the following values as the scopeName when defining the
scopeSelector, the operator must be Exists.
• Terminating
• NotTerminating
• BestEffort
• NotBestEffort
If the operator is In or NotIn, the values field must have at least one value.
For example:
scopeSelector:
matchExpressions:
- scopeName: PriorityClass
operator: In
values:
- middle
When a quota is scoped for priority class using the scopeSelector field, the quota object is restricted to tracking only the following resources:
• pods
• cpu
• memory
• ephemeral-storage
• limits.cpu
• limits.memory
• limits.ephemeral-storage
• requests.cpu
• requests.memory
• requests.ephemeral-storage
This example creates a quota object and matches it with pods at specific
priorities. The example works as follows:
• Pods in the cluster have one of the three priority classes, "low",
"medium", "high".
• One quota object is created for each priority.
apiVersion: v1
kind: List
items:
- apiVersion: v1
kind: ResourceQuota
metadata:
name: pods-high
spec:
hard:
cpu: "1000"
memory: 200Gi
pods: "10"
scopeSelector:
matchExpressions:
- operator : In
scopeName: PriorityClass
values: ["high"]
- apiVersion: v1
kind: ResourceQuota
metadata:
name: pods-medium
spec:
hard:
cpu: "10"
memory: 20Gi
pods: "10"
scopeSelector:
matchExpressions:
- operator : In
scopeName: PriorityClass
values: ["medium"]
- apiVersion: v1
kind: ResourceQuota
metadata:
name: pods-low
spec:
hard:
cpu: "5"
memory: 10Gi
pods: "10"
scopeSelector:
matchExpressions:
- operator : In
scopeName: PriorityClass
values: ["low"]
resourcequota/pods-high created
resourcequota/pods-medium created
resourcequota/pods-low created
Name: pods-high
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 1k
memory 0 200Gi
pods 0 10
Name: pods-low
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 5
memory 0 10Gi
pods 0 10
Name: pods-medium
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 10
memory 0 20Gi
pods 0 10
Create a pod with priority "high". Save the following YAML to a file high-
priority-pod.yml.
apiVersion: v1
kind: Pod
metadata:
name: high-priority
spec:
containers:
- name: high-priority
image: ubuntu
command: ["/bin/sh"]
args: ["-c", "while true; do echo hello; sleep 10;done"]
resources:
requests:
memory: "10Gi"
cpu: "500m"
limits:
memory: "10Gi"
cpu: "500m"
priorityClassName: high
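Apply the manifest and check the quotas again; the file name matches the one used above:
kubectl create -f ./high-priority-pod.yml
kubectl describe quota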
Verify that "Used" stats for "high" priority quota, pods-high, has changed
and that the other two quotas are unchanged.
Name: pods-high
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 500m 1k
memory 10Gi 200Gi
pods 1 10
Name: pods-low
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 5
memory 0 10Gi
pods 0 10
Name: pods-medium
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 10
memory 0 20Gi
pods 0 10
NAME AGE
compute-resources 30s
object-counts 32s
Name: compute-resources
Namespace: myspace
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
requests.cpu 0 1
requests.memory 0 1Gi
requests.nvidia.com/gpu 0 4
Name: object-counts
Namespace: myspace
Resource Used Hard
-------- ---- ----
configmaps 0 10
persistentvolumeclaims 0 4
pods 0 4
replicationcontrollers 0 20
secrets 1 10
services 0 10
services.loadbalancers 0 2
Kubectl also supports object count quota for all standard namespaced
resources using the syntax count/<resource>.<group>:
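For example (the quota name and limits are illustrative):
kubectl create quota test --hard=count/deployments.apps=2,count/replicasets.apps=4,count/pods=3,count/secrets=4 --namespace=myspace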
With this mechanism, operators are able to restrict usage of certain high
priority classes to a limited number of namespaces and not every namespace
will be able to consume these priority classes by default.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
configuration:
apiVersion: apiserver.config.k8s.io/v1
kind: ResourceQuotaConfiguration
limitedResources:
- resource: pods
matchScopes:
- scopeName: PriorityClass
operator: In
values: ["cluster-services"]
scopeSelector:
matchExpressions:
- scopeName: PriorityClass
operator: In
values: ["cluster-services"]
What's next
• See ResourceQuota design doc for more information.
• See a detailed example for how to use resource quota.
• Read Quota support for priority class design doc.
• See LimitedResources
Pod Security Policies
A PodSecurityPolicy controls the following aspects of a pod's specification; each control aspect is configured through the listed field names:
• Running of privileged containers: privileged
• Usage of host namespaces: hostPID, hostIPC
• Usage of host networking and ports: hostNetwork, hostPorts
• Usage of volume types: volumes
• Usage of the host filesystem: allowedHostPaths
• Allow specific FlexVolume drivers: allowedFlexVolumes
• Allocating an FSGroup that owns the pod's volumes: fsGroup
• Requiring the use of a read only root file system: readOnlyRootFilesystem
• The user and group IDs of the container: runAsUser, runAsGroup, supplementalGroups
• Restricting escalation to root privileges: allowPrivilegeEscalation, defaultAllowPrivilegeEscalation
• Linux capabilities: defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities
• The SELinux context of the container: seLinux
• The Allowed Proc Mount types for the container: allowedProcMountTypes
• The AppArmor profile used by containers: annotations
• The seccomp profile used by containers: annotations
• The sysctl profile used by containers: forbiddenSysctls, allowedUnsafeSysctls
Most Kubernetes pods are not created directly by users. Instead, they are
typically created indirectly as part of a Deployment, ReplicaSet, or other
templated controller via the controller manager. Granting the controller
access to the policy would grant access for all pods created by that
controller, so the preferred method for authorizing policies is to grant access
to the pod's service account (see example).
Via RBAC
RBAC is a standard Kubernetes authorization mode, and can easily be used
to authorize use of policies.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: <role name>
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames:
- <list of policies to authorize>
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: <binding name>
roleRef:
kind: ClusterRole
name: <role name>
apiGroup: rbac.authorization.k8s.io
subjects:
# Authorize specific service accounts:
- kind: ServiceAccount
name: <authorized service account name>
namespace: <authorized pod namespace>
# Authorize specific users (not recommended):
- kind: User
apiGroup: rbac.authorization.k8s.io
name: <authorized user name>
If a RoleBinding (not a ClusterRoleBinding) is used, it will only grant
usage for pods being run in the same namespace as the binding. This can be
paired with system groups to grant access to all pods run in the namespace:
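As a sketch, the subjects of such a binding could reference the built-in groups:
subjects:
# Authorize all service accounts in a namespace:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts
# Or equivalently, all authenticated users in a namespace:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:authenticated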
For more examples of RBAC bindings, see Role Binding Examples. For a
complete example of authorizing a PodSecurityPolicy, see below.
Troubleshooting
• The controller manager must be run against the secured API port and
must not have superuser permissions. See Controlling Access to the
Kubernetes API to learn about API server access controls.
If the controller manager connected through the trusted API port (also
known as the localhost listener), requests would bypass
authentication and authorization modules; all PodSecurityPolicy objects
would be allowed, and users would be able to grant themselves the ability
to create privileged containers.
Policy Order
In addition to restricting pod creation and update, pod security policies can
also be used to provide default values for many of the fields that they control.
When multiple policies are available, the pod security policy controller
selects policies according to the following criteria:
1. PodSecurityPolicies which allow the pod as-is, without changing defaults or mutating the pod, are preferred. The order of these non-mutating PodSecurityPolicies doesn't matter.
2. If the pod must be defaulted or mutated, the first PodSecurityPolicy (ordered by name) to allow the pod is selected.
Example
This example assumes you have a running cluster with the PodSecurityPolicy
admission controller enabled and you have cluster admin privileges.
Set up
Set up a namespace and a service account to use for this example. We'll
use this service account to mock a non-admin user.
To make it clear which user we're acting as and save some typing, create 2
aliases:
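For example, assuming (for illustration) a namespace named psp-example and a service account named fake-user:
alias kubectl-admin='kubectl -n psp-example'
alias kubectl-user='kubectl --as=system:serviceaccount:psp-example:fake-user -n psp-example'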
policy/example-psp.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: example
spec:
privileged: false # Don't allow privileged pods!
# The rest fills in some required fields.
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
runAsUser:
rule: RunAsAny
fsGroup:
rule: RunAsAny
volumes:
- '*'
Create the rolebinding to grant fake-user the use verb on the example
policy:
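A sketch of those commands, using the names assumed in this example (the role name psp:unprivileged is illustrative):
kubectl-admin create role psp:unprivileged \
    --verb=use \
    --resource=podsecuritypolicy \
    --resource-name=example
kubectl-admin create rolebinding fake-user:psp:unprivileged \
    --role=psp:unprivileged \
    --serviceaccount=psp-example:fake-user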
Note: This is not the recommended way! See the next section for
the preferred approach.
It works as expected! But any attempts to create a privileged pod should still
be denied:
What happened? We already bound the psp:unprivileged role for our fak
e-user, why are we getting the error Error creating: pods
"pause-7774d79b5-" is forbidden: no providers available to
validate pod request? The answer lies in the source - replicaset-
controller. Fake-user successfully created the deployment (which
successfully created a replicaset), but when the replicaset went to create the
pod it was not authorized to use the example podsecuritypolicy.
In order to fix this, bind the psp:unprivileged role to the pod's service
account instead. In this case (since we didn't specify it) the service account
is default:
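A sketch of that binding, using the same illustrative names:
kubectl-admin create rolebinding default:psp:unprivileged \
    --role=psp:unprivileged \
    --serviceaccount=psp-example:default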
Clean up
Delete the namespace to clean up most of the example resources:
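Assuming the illustrative namespace name used above:
kubectl-admin delete ns psp-example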
Example Policies
This is the least restrictive policy you can create, equivalent to not using the
pod security policy admission controller:
policy/privileged-psp.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: privileged
annotations:
seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
spec:
privileged: true
allowPrivilegeEscalation: true
allowedCapabilities:
- '*'
volumes:
- '*'
hostNetwork: true
hostPorts:
- min: 0
max: 65535
hostIPC: true
hostPID: true
runAsUser:
rule: 'RunAsAny'
seLinux:
rule: 'RunAsAny'
supplementalGroups:
rule: 'RunAsAny'
fsGroup:
rule: 'RunAsAny'
policy/restricted-psp.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
privileged: false
# Required to prevent escalations to root.
allowPrivilegeEscalation: false
  # This is redundant with non-root + disallow privilege escalation,
  # but we can provide it for defense in depth.
requiredDropCapabilities:
- ALL
# Allow core volume types.
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
  # Assume that persistentVolumes set up by the cluster admin are safe to use.
  - 'persistentVolumeClaim'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
# Require the container to run without root privileges.
rule: 'MustRunAsNonRoot'
  seLinux:
    # This policy assumes the nodes are using AppArmor rather than SELinux.
    rule: 'RunAsAny'
supplementalGroups:
rule: 'MustRunAs'
ranges:
# Forbid adding the root group.
- min: 1
max: 65535
fsGroup:
rule: 'MustRunAs'
ranges:
# Forbid adding the root group.
- min: 1
max: 65535
readOnlyRootFilesystem: false
Policy Reference
Privileged
Privileged - determines if any container in a pod can enable privileged
mode. By default a container is not allowed to access any devices on the
host, but a "privileged" container is given access to all devices on the host.
This allows the container nearly all the same access as processes running on
the host. This is useful for containers that want to use linux capabilities like
manipulating the network stack and accessing devices.
Host namespaces
HostPID - Controls whether the pod containers can share the host process
ID namespace. Note that when paired with ptrace this can be used to
escalate privileges outside of the container (ptrace is forbidden by default).
HostIPC - Controls whether the pod containers can share the host IPC
namespace.
HostNetwork - Controls whether the pod may use the node network
namespace. Doing so gives the pod access to the loopback device, services
listening on localhost, and could be used to snoop on network activity of
other pods on the same node.
The recommended minimum set of allowed volumes for new PSPs is:
• configMap
• downwardAPI
• emptyDir
• persistentVolumeClaim
• secret
• projected
AllowedHostPaths - This specifies a list of host paths that are allowed to be used by hostPath volumes. An empty list means there is no restriction on which host paths can be used. It is defined as a list of objects with a pathPrefix field, which allows hostPath volumes to mount paths that begin with an allowed prefix, and a readOnly field indicating the path must be mounted read-only. For example:
allowedHostPaths:
# This allows "/foo", "/foo/", "/foo/bar" etc., but
# disallows "/fool", "/etc/foo" etc.
# "/foo/../" is never valid.
- pathPrefix: "/foo"
readOnly: true # only allow read-only mounts
Warning: There are many ways a container with unrestricted access to the host filesystem can escalate privileges, including reading data from other containers and abusing the credentials of system services such as the kubelet. Writeable hostPath directory volumes allow containers to write to the filesystem in ways that let them traverse the host filesystem outside of the pathPrefix, so readOnly: true should be used on all allowedHostPaths to effectively limit access to the specified pathPrefix.
FlexVolume drivers
This specifies a list of FlexVolume drivers that are allowed to be used by
flexvolume. An empty list or nil means there is no restriction on the drivers.
Make sure the volumes field contains the flexVolume volume type; otherwise,
no FlexVolume driver is allowed.
For example:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: allow-flex-volumes
spec:
# ... other spec fields
volumes:
- flexVolume
allowedFlexVolumes:
- driver: example/lvm
- driver: example/cifs
Users and groups
RunAsUser - Controls which user ID the containers are run with.
Privilege Escalation
These options control the allowPrivilegeEscalation container option. This
bool directly controls whether the no_new_privs flag gets set on the
container process. This flag will prevent setuid binaries from changing the
effective user ID, and prevent files from enabling extra capabilities (e.g. it
will prevent the use of the ping tool). This behavior is required to effectively
enforce MustRunAsNonRoot.
Capabilities
Linux capabilities provide a finer grained breakdown of the privileges
traditionally associated with the superuser. Some of these capabilities can
be used to escalate privileges or for container breakout, and may be
restricted by the PodSecurityPolicy. For more details on Linux capabilities,
see capabilities(7).
SELinux
• MustRunAs - Requires seLinuxOptions to be configured. Uses seLinux
Options as the default. Validates against seLinuxOptions.
• RunAsAny - No default provided. Allows any seLinuxOptions to be
specified.
AllowedProcMountTypes
allowedProcMountTypes is a list of allowed ProcMountTypes. Empty or nil
indicates that only the DefaultProcMountType may be used.
AppArmor
Controlled via annotations on the PodSecurityPolicy. Refer to the AppArmor
documentation.
Seccomp
As of Kubernetes v1.19, you can use the seccompProfile field in the securi
tyContext of Pods or containers to control use of seccomp profiles. In prior
versions, seccomp was controlled by adding annotations to a Pod. The same
PodSecurityPolicies can be used with either version to enforce how these
fields or annotations are applied.
seccomp.security.alpha.kubernetes.io/defaultProfileName - Annotation
that specifies the default seccomp profile to apply to containers. Possible
values are unconfined, runtime/default, docker/default (a deprecated alias), or
localhost/<path>, which refers to a profile file on the node.
seccomp.security.alpha.kubernetes.io/allowedProfileNames -
Annotation that specifies which values are allowed for the pod seccomp
annotations. Specified as a comma-delimited list of allowed values. Possible
values are those listed above, plus * to allow all profiles. Absence of this
annotation means that the default cannot be changed.
Sysctl
By default, all safe sysctls are allowed.
Process ID Limits And Reservations
Kubernetes allows you to limit the number of process IDs (PIDs) that a Pod
can use. You can also reserve a number of allocatable PIDs for each node for
use by the operating system and daemons (rather than by Pods).
You can configure a kubelet to limit the number of PIDs a given Pod can
consume. For example, if your node's host OS is set to use a maximum of
262144 PIDs and you expect it to host fewer than 250 Pods, you can give each
Pod a budget of 1000 PIDs to prevent using up that node's overall number of
available PIDs. If the admin wants to overcommit PIDs, similar to CPU or
memory, they may do so as well with some additional risks. Either way, a
single Pod will not be able to bring the whole machine down. This kind of
resource limiting helps to prevent simple fork bombs from affecting
operation of an entire cluster.
Per-Pod PID limiting allows administrators to protect one Pod from another,
but does not ensure that all Pods scheduled onto that host are unable to
impact the node overall. Per-Pod limiting also does not protect the node
agents themselves from PID exhaustion.
You can also reserve an amount of PIDs for node overhead, separate from
the allocation to Pods. This is similar to how you can reserve CPU, memory,
or other resources for use by the operating system and other facilities
outside of Pods and their containers.
Caution: This means that the limit that applies to a Pod may be
different depending on where the Pod is scheduled. To make things
simple, it's easiest if all Nodes use the same PID resource limits
and reservations.
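A minimal sketch of the relevant kubelet configuration follows. It assumes a Kubernetes version in which the SupportPodPidsLimit and SupportNodePidsLimit features are available; the numbers are illustrative:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1000      # hard PID limit applied to each Pod on this node
systemReserved:
  pid: "1000"           # PIDs reserved for operating system daemons
kubeReserved:
  pid: "1000"           # PIDs reserved for Kubernetes node agents such as the kubelet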
PID limiting, per Pod and per Node, sets a hard limit. Once the limit is hit,
the workload will start experiencing failures when trying to get a new PID.
This may or may not lead to rescheduling of a Pod, depending on how the
workload reacts to these failures and how the liveness and readiness probes
are configured for the Pod. However, if the limits are set correctly, you can
guarantee that other Pods' workloads and system processes will not run out of
PIDs when one Pod is misbehaving.
What's next
• Refer to the PID Limiting enhancement document for more information.
• For historical context, read Process ID Limiting for Stability
Improvements in Kubernetes 1.14.
• Read Managing Resources for Containers.
• Learn how to Configure Out of Resource Handling.
Kubernetes Scheduler
Scheduling overview
A scheduler watches for newly created Pods that have no Node assigned.
For every Pod that the scheduler discovers, the scheduler becomes
responsible for finding the best Node for that Pod to run on. The scheduler
reaches this placement decision taking into account the scheduling
principles described below.
If you want to understand why Pods are placed onto a particular Node, or if
you're planning to implement a custom scheduler yourself, this page will
help you learn about scheduling.
kube-scheduler
kube-scheduler is the default scheduler for Kubernetes and runs as part of
the control plane. kube-scheduler is designed so that, if you want and need
to, you can write your own scheduling component and use that instead.
In a cluster, Nodes that meet the scheduling requirements for a Pod are
called feasible nodes. If none of the nodes are suitable, the pod remains
unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of functions
to score the feasible Nodes and picks a Node with the highest score among
the feasible ones to run the Pod. The scheduler then notifies the API server
about this decision in a process called binding.
Factors that need to be taken into account for scheduling decisions include
individual and collective resource requirements, hardware / software / policy
constraints, affinity and anti-affinity specifications, data locality, inter-
workload interference, and so on.
kube-scheduler selects a node for the pod in a 2-step operation:
1. Filtering
2. Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the
Pod. For example, the PodFitsResources filter checks whether a candidate
Node has enough available resource to meet a Pod's specific resource
requests. After this step, the node list contains any suitable Nodes; often,
there will be more than one. If the list is empty, that Pod isn't (yet)
schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the
most suitable Pod placement. The scheduler assigns a score to each Node
that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest
ranking. If there is more than one node with equal scores, kube-scheduler
selects one of these at random.
There are two supported ways to configure the filtering and scoring
behavior of the scheduler:
1. Scheduling Policies allow you to configure Predicates for filtering and Priorities for scoring.
2. Scheduling Profiles allow you to configure Plugins that implement different scheduling stages, including: QueueSort, Filter, Score, Bind, Reserve, Permit, and others. You can also configure the kube-scheduler to run different profiles.
What's next
• Read about scheduler performance tuning
• Read about Pod topology spread constraints
• Read the reference documentation for kube-scheduler
• Learn about configuring multiple schedulers
• Learn about topology management policies
• Learn about Pod Overhead
• Learn about scheduling of Pods that use volumes in:
◦ Volume Topology Support
◦ Storage Capacity Tracking
◦ Node-specific Volume Limits
Taints and Tolerations
Tolerations are applied to pods, and allow (but do not require) the pods to
schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled
onto inappropriate nodes. One or more taints are applied to a node; this
marks that the node should not accept any pods that do not tolerate the
taints.
Concepts
You add a taint to a node using kubectl taint. For example,
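kubectl taint nodes node1 key1=value1:NoSchedule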
places a taint on node node1. The taint has key key1, value value1, and taint
effect NoSchedule. This means that no pod will be able to schedule onto nod
e1 unless it has a matching toleration.
To remove the taint added by the command above, you can run:
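kubectl taint nodes node1 key1=value1:NoSchedule-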
You specify a toleration for a pod in the PodSpec. Both of the following
tolerations "match" the taint created by the kubectl taint line above, and
thus a pod with either toleration would be able to schedule onto node1:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
tolerations:
- key: "key1"
operator: "Exists"
effect: "NoSchedule"
Here's an example of a pod that uses a toleration:
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
tolerations:
- key: "example-key"
operator: "Exists"
effect: "NoSchedule"
A toleration "matches" a taint if the keys are the same and the effects are
the same, and:
Note:
An empty key with operator Exists matches all keys, values and
effects which means this will tolerate everything.
The above example used effect of NoSchedule. Alternatively, you can use ef
fect of PreferNoSchedule. This is a "preference" or "soft" version of NoSche
dule -- the system will try to avoid placing a pod that does not tolerate the
taint on the node, but it is not required. The third kind of effect is NoExecu
te, described later.
You can put multiple taints on the same node and multiple tolerations on the
same pod. The way Kubernetes processes multiple taints and tolerations is
like a filter: start with all of a node's taints, then ignore the ones for which
the pod has a matching toleration; the remaining un-ignored taints have the
indicated effects on the pod. In particular:
• if there is at least one un-ignored taint with effect NoSchedule, then Kubernetes will not schedule the pod onto that node
• if there is no un-ignored taint with effect NoSchedule but there is at least one un-ignored taint with effect PreferNoSchedule, then Kubernetes will try not to schedule the pod onto the node
• if there is at least one un-ignored taint with effect NoExecute, then the pod will be evicted from the node (if it is already running on it), and will not be scheduled onto the node (if it is not yet running on it)
For example, suppose you taint a node with key1=value1:NoSchedule,
key1=value1:NoExecute, and key2=value2:NoSchedule, and a pod has the
following two tolerations:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
In this case, the pod will not be able to schedule onto the node, because
there is no toleration matching the third taint. But it will be able to continue
running if it is already running on the node when the taint is added, because
the third taint is the only one of the three that is not tolerated by the pod.
Normally, if a taint with effect NoExecute is added to a node, then any pods
that do not tolerate the taint will be evicted immediately, and pods that do
tolerate the taint will never be evicted. However, a toleration with NoExecut
e effect can specify an optional tolerationSeconds field that dictates how
long the pod will stay bound to the node after the taint is added. For
example,
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
tolerationSeconds: 3600
means that if this pod is running and a matching taint is added to the node,
then the pod will stay bound to the node for 3600 seconds, and then be
evicted. If the taint is removed before that time, the pod will not be evicted.
Example Use Cases
Taints and tolerations are a flexible way to steer pods away from nodes or
evict pods that shouldn't be running. A few of the use cases are dedicated
nodes, nodes with special hardware, and taint based evictions (described below).
The NoExecute taint effect, mentioned above, affects pods that are already
running on the node as follows:
• pods that do not tolerate the taint are evicted immediately
• pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
• pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time
The node controller automatically taints a Node when certain conditions are
true. The following taints are built in:
• node.kubernetes.io/not-ready: Node is not ready.
• node.kubernetes.io/unreachable: Node is unreachable from the node controller.
• node.kubernetes.io/out-of-disk: Node becomes out of disk.
• node.kubernetes.io/memory-pressure: Node has memory pressure.
• node.kubernetes.io/disk-pressure: Node has disk pressure.
• node.kubernetes.io/network-unavailable: Node's network is unavailable.
• node.kubernetes.io/unschedulable: Node is unschedulable.
• node.cloudprovider.kubernetes.io/uninitialized: Set when the kubelet is started with an "external" cloud provider, to mark the node as unusable until a controller from the cloud-controller-manager initializes it, after which the kubelet removes this taint.
Note: The control plane limits the rate of adding new taints
to nodes. This rate limiting manages the number of evictions that
are triggered when many nodes become unreachable at once (for
example: if there is a network disruption).
You can specify tolerationSeconds for a Pod to define how long that Pod
stays bound to a failing or unresponsive Node.
For example, you might want to keep an application with a lot of local state
bound to node for a long time in the event of network partition, hoping that
the partition will recover and thus the pod eviction can be avoided. The
toleration you set for that Pod might look like:
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 6000
Note:
DaemonSet pods are created with NoExecute tolerations for the following
taints with no tolerationSeconds:
• node.kubernetes.io/unreachable
• node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these problems.
The DaemonSet controller also automatically adds the following NoSchedule
tolerations to all daemons, to prevent DaemonSets from breaking:
• node.kubernetes.io/memory-pressure
• node.kubernetes.io/disk-pressure
• node.kubernetes.io/out-of-disk (only for critical pods)
• node.kubernetes.io/unschedulable (1.10 or later)
• node.kubernetes.io/network-unavailable (host network only)
Adding these tolerations ensures backward compatibility. You can also add
arbitrary tolerations to DaemonSets.
What's next
• Read about out of resource handling and how you can configure it
• Read about pod priority
Assigning Pods to Nodes
nodeSelector
nodeSelector is the simplest recommended form of node selection
constraint. nodeSelector is a field of PodSpec. It specifies a map of key-
value pairs. For the pod to be eligible to run on a node, the node must have
each of the indicated key-value pairs as labels (it can have additional labels
as well). The most common usage is one key-value pair.
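For example, you can label a node that has locally attached SSD storage (the node name is illustrative):
kubectl label nodes worker-node-1 disktype=ssd
Then, given a basic pod configuration such as the following, you can constrain the pod to such nodes by adding a nodeSelector, as shown in the second manifest below.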
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
Then add a nodeSelector to it (pods/pod-nginx.yaml):
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
nodeSelector:
disktype: ssd
Node isolation/restriction
Adding labels to Node objects allows targeting pods to specific nodes or
groups of nodes. This can be used to ensure specific pods only run on nodes
with certain isolation, security, or regulatory properties. When using labels
for this purpose, choosing label keys that cannot be modified by the kubelet
process on the node is strongly recommended. This prevents a compromised
node from using its kubelet credential to set those labels on its own Node
object, and influencing the scheduler to schedule workloads to the
compromised node.
1. Ensure you are using the Node authorizer and have enabled the
NodeRestriction admission plugin.
2. Add labels under the node-restriction.kubernetes.io/ prefix to
your Node objects, and use those labels in your node selectors. For
example, example.com.node-restriction.kubernetes.io/fips=true
or example.com.node-restriction.kubernetes.io/pci-dss=true.
The affinity feature consists of two types of affinity, "node affinity" and
"inter-pod affinity/anti-affinity". Node affinity is like the existing nodeSelect
or (but with the first two benefits listed above), while inter-pod affinity/anti-
affinity constrains against pod labels rather than node labels, as described in
the third item listed above, in addition to having the first and second
properties listed above.
Node affinity
Node affinity is conceptually similar to nodeSelector -- it allows you to
constrain which nodes your pod is eligible to be scheduled on, based on
labels on the node.
pods/pod-with-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/e2e-az-name
operator: In
values:
- e2e-az1
- e2e-az2
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
containers:
- name: with-node-affinity
image: k8s.gcr.io/pause:2.0
This node affinity rule says the pod can only be placed on a node with a label
whose key is kubernetes.io/e2e-az-name and whose value is either e2e-
az1 or e2e-az2. In addition, among nodes that meet that criteria, nodes with
a label whose key is another-node-label-key and whose value is another-
node-label-value should be preferred.
You can see the operator In being used in the example. The new node
affinity syntax supports the following operators: In, NotIn, Exists, DoesNot
Exist, Gt, Lt. You can use NotIn and DoesNotExist to achieve node anti-
affinity behavior, or use node taints to repel pods from specific nodes.
If you remove or change the label of the node where the pod is scheduled,
the pod won't be removed. In other words, the affinity selection works only
at the time of scheduling the pod.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: foo-scheduler
pluginConfig:
- name: NodeAffinity
args:
addedAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: scheduler-profile
operator: In
values:
- foo
Since the addedAffinity is not visible to end users, its behavior might be
unexpected to them. We recommend using node labels that have a clear
correlation with the profile's scheduler name.
As with node affinity, there are currently two types of pod affinity and anti-
affinity, called requiredDuringSchedulingIgnoredDuringExecution and pr
eferredDuringSchedulingIgnoredDuringExecution which denote "hard"
vs. "soft" requirements. See the description in the node affinity section
earlier. An example of requiredDuringSchedulingIgnoredDuringExecution
affinity would be "co-locate the pods of service A and service B in the same
zone, since they communicate a lot with each other" and an example prefer
redDuringSchedulingIgnoredDuringExecution anti-affinity would be
"spread the pods from this service across zones" (a hard requirement
wouldn't make sense, since you probably have more pods than zones).
pods/pod-with-pod-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-pod-affinity
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S1
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
containers:
- name: with-pod-affinity
image: k8s.gcr.io/pause:2.0
The affinity on this pod defines one pod affinity rule and one pod anti-affinity
rule. In this example, the podAffinity is requiredDuringSchedulingIgnore
dDuringExecution while the podAntiAffinity is preferredDuringSchedul
ingIgnoredDuringExecution. The pod affinity rule says that the pod can be
scheduled onto a node only if that node is in the same zone as at least one
already-running pod that has a label with key "security" and value "S1".
(More precisely, the pod is eligible to run on node N if node N has a label
with key topology.kubernetes.io/zone and some value V such that there
is at least one node in the cluster with key topology.kubernetes.io/zone
and value V that is running a pod that has a label with key "security" and
value "S1".) The pod anti-affinity rule says that the pod cannot be scheduled
onto a node if that node is in the same zone as a pod with label having key
"security" and value "S2". See the design doc for many more examples of
pod affinity and anti-affinity, both the requiredDuringSchedulingIgnoredDu
ringExecution flavor and the preferredDuringSchedulingIgnoredDuringE
xecution flavor.
The legal operators for pod affinity and anti-affinity are In, NotIn, Exists, D
oesNotExist.
Inter-pod affinity and anti-affinity can be even more useful when they are
used with higher level collections such as ReplicaSets, StatefulSets,
Deployments, etc. One can easily configure that a set of workloads should be
co-located in the same defined topology, e.g., the same node.
Here is the yaml snippet of a simple redis deployment with three replicas
and selector label app=store. The deployment has PodAntiAffinity
configured to ensure the scheduler does not co-locate replicas on a single
node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-cache
spec:
selector:
matchLabels:
app: store
replicas: 3
template:
metadata:
labels:
app: store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: redis-server
image: redis:3.2-alpine
The yaml snippet below for the web-server deployment has both podAntiAffinity and podAffinity configured. This tells the scheduler that all its replicas are to be co-located with pods that have the selector label app=store, while also ensuring that no two web-server replicas land on a single node:
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
spec:
selector:
matchLabels:
app: web-store
replicas: 3
template:
metadata:
labels:
app: web-store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-store
topologyKey: "kubernetes.io/hostname"
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: web-app
image: nginx:1.16-alpine
If we create the above two deployments on a three node cluster, each node
ends up running one redis-server pod and one web-app pod: all three replicas
of the web-server are automatically co-located with the cache, as expected,
and no two replicas of either deployment share a node.
nodeName
nodeName is the simplest form of node selection constraint, but due to its
limitations it is typically not used. nodeName is a field of PodSpec. If it is non-
empty, the scheduler ignores the pod and the kubelet running on the named
node tries to run the pod. Thus, if nodeName is provided in the PodSpec, it
takes precedence over the above methods for node selection.
• If the named node does not exist, the pod will not be run, and in some
cases may be automatically deleted.
• If the named node does not have the resources to accommodate the
pod, the pod will fail and its reason will indicate why, for example
OutOfmemory or OutOfcpu.
• Node names in cloud environments are not always predictable or
stable.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx
nodeName: kube-01
What's next
Taints allow a Node to repel a set of Pods.
The design documents for node affinity and for inter-pod affinity/anti-affinity
contain extra background information about these features.
Once a Pod is assigned to a Node, the kubelet runs the Pod and allocates
node-local resources. The topology manager can take part in node-level
resource allocation decisions.
Pod Overhead
FEATURE STATE: Kubernetes v1.18 [beta]
When you run a Pod on a Node, the Pod itself takes an amount of system
resources. These resources are additional to the resources needed to run
the container(s) inside the Pod. Pod Overhead is a feature for accounting for
the resources consumed by the Pod infrastructure on top of the container
requests & limits.
Usage example
To use the PodOverhead feature, you need a RuntimeClass that defines the o
verhead field. As an example, you could use the following RuntimeClass
definition with a virtualizing container runtime that uses around 120MiB per
Pod for the virtual machine and the guest OS:
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
name: kata-fc
handler: kata-fc
overhead:
podFixed:
memory: "120Mi"
cpu: "250m"
Consider running the following workload, test-pod, which specifies the kata-fc RuntimeClass and therefore takes the overhead into account:
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
runtimeClassName: kata-fc
containers:
- name: busybox-ctr
image: busybox
stdin: true
tty: true
resources:
limits:
cpu: 500m
memory: 100Mi
- name: nginx-ctr
image: nginx
resources:
limits:
cpu: 1500m
memory: 100Mi
After the RuntimeClass admission controller, you can check the updated
PodSpec:
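kubectl get pod test-pod -o jsonpath='{.spec.overhead}'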
map[cpu:250m memory:120Mi]
When the kube-scheduler is deciding which node should run a new Pod, the
scheduler considers that Pod's overhead as well as the sum of container
requests for that Pod. For this example, the scheduler adds the requests and
the overhead, then looks for a node that has 2.25 CPU and 320 MiB of
memory available.
Once a Pod is scheduled to a node, the kubelet on that node creates a new
cgroup for the Pod. It is within this cgroup that the underlying container
runtime will create containers.
If the resource has a limit defined for each container (Guaranteed QoS or
Burstable QoS with limits defined), the kubelet will set an upper limit for the
pod cgroup associated with that resource (cpu.cfs_quota_us for CPU and
memory.limit_in_bytes for memory). This upper limit is based on the sum of the
container limits plus the overhead defined in the PodSpec.
For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set cpu.
shares based on the sum of container requests plus the overhead defined in
the PodSpec.
Looking at our example, verify the container requests for the workload:
kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resour
ces.limits}'
The total container requests are 2000m CPU and 200MiB of memory. Check the
requests for the workload against its node (for example with kubectl describe
node); the output shows that 2250m CPU and 320MiB of memory are requested,
which includes the PodOverhead.
On the node where the Pod is scheduled, you can also inspect the Pod's cgroup.
The cgroup path reported for the Pod includes its pause container; the Pod
level cgroup is one directory above:
"cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
Checking memory.limit_in_bytes for the Pod level cgroup confirms the 320 MiB
limit, expressed in bytes:
335544320
Observability
A kube_pod_overhead metric is available in kube-state-metrics to help
identify when PodOverhead is being utilized and to help observe stability of
workloads running with a defined Overhead. This functionality is not
available in the 1.9 release of kube-state-metrics, but is expected in a
following release. Users will need to build kube-state-metrics from source in
the meantime.
What's next
• RuntimeClass
• PodOverhead Design
Resource Bin Packing for Extended Resources
The kube-scheduler can be configured to favor bin packing of resources through the RequestedToCapacityRatioResourceAllocation priority function. For example, the following scheduler policy configuration gives higher weight to the extended resources intel.com/foo and intel.com/bar:
apiVersion: v1
kind: Policy
# ...
priorities:
# ...
- name: RequestedToCapacityRatioPriority
weight: 2
argument:
requestedToCapacityRatioArguments:
shape:
- utilization: 0
score: 0
- utilization: 100
score: 10
resources:
- name: intel.com/foo
weight: 3
- name: intel.com/bar
weight: 5
The shape parameter of requestedToCapacityRatioArguments defines the scoring curve. The following shape gives a node a score of 0 when resource utilization is 0% and a score of 10 when utilization is 100%, which enables bin packing behavior:
shape:
- utilization: 0
  score: 0
- utilization: 100
  score: 10
To favor least-requested nodes instead, reverse the scores:
shape:
- utilization: 0
  score: 100
- utilization: 100
  score: 0
The resources parameter is optional and defaults to:
resources:
- name: CPU
  weight: 1
- name: Memory
  weight: 1
It can be used to add extended resources as follows:
resources:
- name: intel.com/foo
  weight: 5
- name: CPU
  weight: 3
- name: Memory
  weight: 1
The weight parameter is optional and is set to 1 if not specified. The weight
cannot be set to a negative value.
The following example illustrates how node scores are calculated for two nodes.
Requested resources:
intel.com/foo : 2
Memory: 256MB
CPU: 2
Resource weights:
intel.com/foo : 5
Memory: 1
CPU: 3
FunctionShapePoint {{0, 0}, {100, 10}}
Node 1 spec:
Available:
intel.com/foo: 4
Memory: 1 GB
CPU: 8
Used:
intel.com/foo: 1
Memory: 256MB
CPU: 1
Node score:
intel.com/foo = resourceScoringFunction((2+1), 4)
              = (100 - ((4-3)*100/4))
              = 75                        # requested + used = 75% * available
              = rawScoringFunction(75)
              = 7                         # floor(75/10)
Memory        = resourceScoringFunction((256+256), 1024)
              = (100 - ((1024-512)*100/1024))
              = 50                        # requested + used = 50% * available
              = rawScoringFunction(50)
              = 5                         # floor(50/10)
CPU           = resourceScoringFunction((2+1), 8)
              = (100 - ((8-3)*100/8))
              = 37.5                      # requested + used = 37.5% * available
              = rawScoringFunction(37.5)
              = 3                         # floor(37.5/10)
NodeScore     = ((7 * 5) + (5 * 1) + (3 * 3)) / (5 + 1 + 3)
              = 5
Node 2 spec:
Available:
intel.com/foo: 8
Memory: 1GB
CPU: 8
Used:
intel.com/foo: 2
Memory: 512MB
CPU: 6
Node score:
intel.com/foo = resourceScoringFunction((2+2), 8)
              = (100 - ((8-4)*100/8))
              = 50
              = rawScoringFunction(50)
              = 5
Memory        = resourceScoringFunction((256+512), 1024)
              = (100 - ((1024-768)*100/1024))
              = 75
              = rawScoringFunction(75)
              = 7
CPU           = resourceScoringFunction((2+6), 8)
              = (100 - ((8-8)*100/8))
              = 100
              = rawScoringFunction(100)
              = 10
NodeScore     = ((5 * 5) + (7 * 1) + (10 * 3)) / (5 + 1 + 3)
              = 7
What's next
• Read more about the scheduling framework
• Read more about scheduler configuration
Eviction Policy
The kubelet proactively monitors for and prevents total starvation of a
compute resource. In those cases, the kubelet can reclaim the starved
resource by failing one or more Pods. When the kubelet fails a Pod, it
terminates all of its containers and transitions its PodPhase to Failed. If the
evicted Pod is managed by a Deployment, the Deployment creates another
Pod to be scheduled by Kubernetes.
What's next
• Learn how to configure out of resource handling with eviction signals
and thresholds.
Scheduling Framework
FEATURE STATE: Kubernetes v1.15 [alpha]
Each attempt to schedule one Pod is split into two phases, the scheduling
cycle and the binding cycle.
Scheduling cycles are run serially, while binding cycles may run
concurrently.
Extension points
The scheduling framework exposes the extension points described below
throughout the scheduling context of a Pod. In the scheduling framework,
"Filter" is equivalent to "Predicate" and "Scoring" is equivalent to "Priority
function" in the older scheduling policy terminology.
PreFilter
These plugins are used to pre-process info about the Pod, or to check certain
conditions that the cluster or the Pod must meet. If a PreFilter plugin
returns an error, the scheduling cycle is aborted.
Filter
These plugins are used to filter out nodes that cannot run the Pod. For each
node, the scheduler will call filter plugins in their configured order. If any
filter plugin marks the node as infeasible, the remaining plugins will not be
called for that node. Nodes may be evaluated concurrently.
PostFilter
These plugins are called after Filter phase, but only when no feasible nodes
were found for the pod. Plugins are called in their configured order. If any
postFilter plugin marks the node as Schedulable, the remaining plugins will
not be called. A typical PostFilter implementation is preemption, which tries
to make the pod schedulable by preempting other Pods.
PreScore
These plugins are used to perform "pre-scoring" work, which generates a
sharable state for Score plugins to use. If a PreScore plugin returns an
error, the scheduling cycle is aborted.
Score
These plugins are used to rank nodes that have passed the filtering phase.
The scheduler will call each scoring plugin for each node. There will be a
well defined range of integers representing the minimum and maximum
scores. After the NormalizeScore phase, the scheduler will combine node
scores from all plugins according to the configured plugin weights.
NormalizeScore
These plugins are used to modify scores before the scheduler computes a
final ranking of Nodes. A plugin that registers for this extension point will be
called with the Score results from the same plugin. This is called once per
plugin per scheduling cycle.
Reserve
A plugin that implements the Reserve extension has two methods, namely Re
serve and Unreserve, that back two informational scheduling phases called
Reserve and Unreserve, respectively. Plugins which maintain runtime state
(aka "stateful plugins") should use these phases to be notified by the
scheduler when resources on a node are being reserved and unreserved for
a given Pod.
The Reserve phase happens before the scheduler actually binds a Pod to its
designated node. It exists to prevent race conditions while the scheduler
waits for the bind to succeed. The Reserve method of each Reserve plugin
may succeed or fail; if one Reserve method call fails, subsequent plugins are
not executed and the Reserve phase is considered to have failed. If the Rese
rve method of all plugins succeed, the Reserve phase is considered to be
successful and the rest of the scheduling cycle and the binding cycle are
executed.
The Unreserve phase is triggered if the Reserve phase or a later phase fails.
When this happens, the Unreserve method of all Reserve plugins will be
executed in the reverse order of Reserve method calls. This phase exists to
clean up the state associated with the reserved Pod.
Permit
Permit plugins are invoked at the end of the scheduling cycle for each Pod, to
prevent or delay the binding to the candidate node. A permit plugin can do one
of three things:
1. approve
Once all Permit plugins approve a Pod, it is sent for binding.
2. deny
If any Permit plugin denies a Pod, it is returned to the scheduling
queue. This will trigger the Unreserve phase in Reserve plugins.
3. wait (with a timeout)
If a Permit plugin returns "wait", the Pod is kept in an internal "waiting"
Pods list until it is approved or the wait times out, at which point it is
denied, returned to the scheduling queue, and the Unreserve phase in Reserve
plugins is triggered.
Note: While any plugin can access the list of "waiting" Pods and
approve them (see FrameworkHandle), we expect only the permit
plugins to approve binding of reserved Pods that are in "waiting"
state. Once a Pod is approved, it is sent to the PreBind phase.
PreBind
These plugins are used to perform any work required before a Pod is bound.
For example, a pre-bind plugin may provision a network volume and mount
it on the target node before allowing the Pod to run there.
If any PreBind plugin returns an error, the Pod is rejected and returned to
the scheduling queue.
Bind
These plugins are used to bind a Pod to a Node. Bind plugins will not be
called until all PreBind plugins have completed. Each bind plugin is called in
the configured order. A bind plugin may choose whether or not to handle the
given Pod. If a bind plugin chooses to handle a Pod, the remaining bind
plugins are skipped.
PostBind
This is an informational extension point. Post-bind plugins are called after a
Pod is successfully bound. This is the end of a binding cycle, and can be used
to clean up associated resources.
Plugin API
There are two steps to the plugin API. First, plugins must register and get
configured, then they use the extension point interfaces. Extension point
interfaces have the following form.
// ...
Plugin configuration
You can enable or disable plugins in the scheduler configuration. If you are
using Kubernetes v1.18 or later, most scheduling plugins are in use and
enabled by default.
In addition to default plugins, you can also implement your own scheduling
plugins and get them configured along with default plugins. You can visit
scheduler-plugins for more details.
If you are using Kubernetes v1.18 or later, you can configure a set of plugins
as a scheduler profile and then define multiple profiles to fit various kinds of
workload. Learn more at multiple profiles.
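As a sketch of the configuration shape (the API group version depends on your Kubernetes release, and the plugin names here are illustrative placeholders rather than real plugins):
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: ExampleDefaultScorePlugin   # placeholder: a default score plugin to turn off
        enabled:
          - name: ExampleCustomScorePlugin    # placeholder: an additional score plugin to turn on
            weight: 5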
Scheduler Performance Tuning
Nodes in a cluster that meet the scheduling requirements of a Pod are called
feasible Nodes for the Pod. The scheduler finds feasible Nodes for a Pod and
then runs a set of functions to score the feasible Nodes, picking a Node with
the highest score among the feasible ones to run the Pod. The scheduler
then notifies the API server about this decision in a process called Binding.
This page explains performance tuning optimizations that are relevant for
large Kubernetes clusters.
You specify a threshold for how many nodes are enough, as a whole number
percentage of all the nodes in your cluster. The kube-scheduler converts this
into an integer number of nodes. During scheduling, if the kube-scheduler
has identified enough feasible nodes to exceed the configured percentage,
the kube-scheduler stops searching for more feasible nodes and moves on to
the scoring phase.
How the scheduler iterates over Nodes describes the process in detail.
Default threshold
If you don't specify a threshold, Kubernetes calculates a figure using a linear
formula that yields 50% for a 100-node cluster and yields 10% for a 5000-
node cluster. The lower bound for the automatic value is 5%.
If you want the scheduler to score all nodes in your cluster, set percentageO
fNodesToScore to 100.
Example
Below is an example configuration that sets percentageOfNodesToScore to
50%.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
provider: DefaultProvider
...
percentageOfNodesToScore: 50
Tuning percentageOfNodesToScore
percentageOfNodesToScore must be a value between 1 and 100 with the
default value being calculated based on the cluster size. There is also a
hardcoded minimum value of 50 nodes.
Note: In clusters with less than 50 feasible nodes, the scheduler still
checks all the nodes because there are not enough feasible nodes to stop
the scheduler's search early.
In order to give all the Nodes in a cluster a fair chance of being considered
for running Pods, the scheduler iterates over the nodes in a round robin
fashion. You can imagine that Nodes are in an array. The scheduler starts
from the start of the array and checks feasibility of the nodes until it finds
enough Nodes as specified by percentageOfNodesToScore. For the next Pod,
the scheduler continues from the point in the Node array that it stopped at
when checking feasibility of Nodes for the previous Pod.
If Nodes are in multiple zones, the scheduler iterates over Nodes in various
zones to ensure that Nodes from different zones are considered in the
feasibility checks. As an example, consider six nodes in two zones:
Zone 1: Node 1, Node 2, Node 3, Node 4
Zone 2: Node 5, Node 6
The scheduler evaluates the feasibility of the nodes in this order: Node 1,
Node 5, Node 2, Node 6, Node 3, Node 4.
Cluster Administration
Lower-level detail relevant to creating or administering a Kubernetes
cluster.
Planning a cluster
See the guides in Setup for examples of how to plan, set up, and configure
Kubernetes clusters. The solutions listed in this article are called distros.
Managing a cluster
• Learn how to manage nodes.
• Learn how to set up and manage the resource quota for shared
clusters.
Securing a cluster
• Certificates describes the steps to generate certificates using different
tool chains.
Certificates
When using client certificate authentication, you can generate certificates
manually through easyrsa, openssl or cfssl.
easyrsa
easyrsa can manually generate certificates for your cluster. After initializing a PKI and building a CA with easyrsa, generate the server certificate and key, listing the cluster's addresses as subject alternative names:
./easyrsa --subject-alt-name="IP:${MASTER_IP},"\
"IP:${MASTER_CLUSTER_IP},"\
"DNS:kubernetes,"\
"DNS:kubernetes.default,"\
"DNS:kubernetes.default.svc,"\
"DNS:kubernetes.default.svc.cluster,"\
"DNS:kubernetes.default.svc.cluster.local" \
--days=10000 \
build-server-full server nopass
5. Fill in and add the following parameters into the API server start
parameters:
--client-ca-file=/yourdirectory/ca.crt
--tls-cert-file=/yourdirectory/server.crt
--tls-private-key-file=/yourdirectory/server.key
openssl
openssl can manually generate certificates for your cluster. After generating a CA key and certificate and a server key, create a config file for generating a Certificate Signing Request (CSR), for example csr.conf:
[ req ]
default_bits = 2048
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn
[ dn ]
C = <country>
ST = <state>
L = <city>
O = <organization>
OU = <organization unit>
CN = <MASTER_IP>
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster
DNS.5 = kubernetes.default.svc.cluster.local
IP.1 = <MASTER_IP>
IP.2 = <MASTER_CLUSTER_IP>
[ v3_ext ]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints=CA:FALSE
keyUsage=keyEncipherment,dataEncipherment
extendedKeyUsage=serverAuth,clientAuth
subjectAltName=@alt_names
6. Generate the server certificate using the ca.key, ca.crt and server.csr:
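For example, assuming the CSR config created above was saved as csr.conf:
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -out server.crt -days 10000 \
    -extensions v3_ext -extfile csr.conf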
Finally, add the same parameters into the API server start parameters.
cfssl
cfssl is another tool for certificate generation.
1. Download, unpack and prepare the command line tools as shown below.
Note that you may need to adapt the sample commands based on the
hardware architecture and cfssl version you are using.
curl -L https://github.com/cloudflare/cfssl/releases/
download/v1.4.1/cfssl_1.4.1_linux_amd64 -o cfssl
chmod +x cfssl
curl -L https://github.com/cloudflare/cfssl/releases/
download/v1.4.1/cfssljson_1.4.1_linux_amd64 -o cfssljson
chmod +x cfssljson
curl -L https://github.com/cloudflare/cfssl/releases/
download/v1.4.1/cfssl-certinfo_1.4.1_linux_amd64 -o cfssl-
certinfo
chmod +x cfssl-certinfo
2. Create a directory to hold the artifacts and initialize cfssl:
mkdir cert
cd cert
../cfssl print-defaults config > config.json
../cfssl print-defaults csr > csr.json
3. Create a JSON config file for generating the CA file, for example, ca-
config.json:
{
"signing": {
"default": {
"expiry": "8760h"
},
"profiles": {
"kubernetes": {
"usages": [
"signing",
"key encipherment",
"server auth",
"client auth"
],
"expiry": "8760h"
}
}
}
}
4. Create a JSON config file for CA certificate signing request (CSR), for
example, ca-csr.json. Be sure to replace the values marked with
angle brackets with real values you want to use.
{
"CN": "kubernetes",
"key": {
"algo": "rsa",
"size": 2048
},
"names":[{
"C": "<country>",
"ST": "<state>",
"L": "<city>",
"O": "<organization>",
"OU": "<organization unit>"
}]
}
6. Create a JSON config file for generating keys and certificates for the
API server, for example, server-csr.json. Be sure to replace the
values in angle brackets with real values you want to use. The MASTER_C
LUSTER_IP is the service cluster IP for the API server as described in
previous subsection. The sample below also assumes that you are using
cluster.local as the default DNS domain name.
{
"CN": "kubernetes",
"hosts": [
"127.0.0.1",
"<MASTER_IP>",
"<MASTER_CLUSTER_IP>",
"kubernetes",
"kubernetes.default",
"kubernetes.default.svc",
"kubernetes.default.svc.cluster",
"kubernetes.default.svc.cluster.local"
],
"key": {
"algo": "rsa",
"size": 2048
},
"names": [{
"C": "<country>",
"ST": "<state>",
"L": "<city>",
"O": "<organization>",
"OU": "<organization unit>"
}]
}
7. Generate the key and certificate for the API server, which are by default
saved into file server-key.pem and server.pem respectively:
Certificates API
You can use the certificates.k8s.io API to provision x509 certificates to
use for authentication as documented here.
Managing Resources
You've deployed your application and exposed it via a service. Now what?
Kubernetes provides a number of tools to help you manage your application
deployment, including scaling and updating. Among the features that we will
discuss in more depth are configuration files and labels.
apiVersion: v1
kind: Service
metadata:
  name: my-nginx-svc
  labels:
    app: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
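The manifest above can be submitted with kubectl apply; for example, assuming it was saved locally as nginx-app.yaml (a hypothetical filename):
kubectl apply -f nginx-app.yaml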
service/my-nginx-svc created
deployment.apps/my-nginx created
The resources will be created in the order they appear in the file. Therefore,
it's best to specify the service first, since that will ensure the scheduler can
spread the pods associated with the service as they are created by the
controller(s), such as Deployment.
kubectl will read any files with suffixes .yaml, .yml, or .json.
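For example, you can point kubectl apply at a whole directory of manifests (./nginx/ here is a hypothetical local directory):
kubectl apply -f ./nginx/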
deployment.apps/my-nginx created
In the case of just two resources, it's also easy to specify both on the
command line using the resource/name syntax:
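For example, a sketch that deletes the Deployment and Service created above:
kubectl delete deployments/my-nginx services/my-nginx-svc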
For larger numbers of resources, you'll find it easier to specify the selector
(label query) specified using -l or --selector, to filter resources by their
labels:
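For example:
kubectl delete deployment,services -l app=nginx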
project/k8s/development
├── configmap
│  └── my-configmap.yaml
├── deployment
│  └── my-deployment.yaml
└── pvc
└── my-pvc.yaml
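For example, applying that directory recursively (the project/k8s/development path comes from the tree above):
kubectl apply -f project/k8s/development --recursive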
configmap/my-config created
deployment.apps/my-deployment created
persistentvolumeclaim/my-pvc created
The --recursive flag works with any operation that accepts the --filename,-f flag such as: kubectl {create,get,delete,describe,rollout} etc.
The --recursive flag also works when multiple -f arguments are provided:
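For example, a sketch assuming a sibling project/k8s/namespaces directory (a hypothetical path) alongside the tree above:
kubectl apply -f project/k8s/namespaces -f project/k8s/development --recursive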
namespace/development created
namespace/staging created
configmap/my-config created
deployment.apps/my-deployment created
persistentvolumeclaim/my-pvc created
For instance, different applications would use different values for the app
label, but a multi-tier application, such as the guestbook example, would
additionally need to distinguish each tier. The frontend could carry the
following labels:
labels:
  app: guestbook
  tier: frontend
while the Redis master and slave would have different tier labels, and
perhaps even an additional role label:
labels:
  app: guestbook
  tier: backend
  role: master
and
labels:
  app: guestbook
  tier: backend
  role: slave
The labels allow us to slice and dice our resources along any dimension
specified by a label:
kubectl apply -f examples/guestbook/all-in-one/guestbook-all-in-
one.yaml
kubectl get pods -Lapp -Ltier -Lrole
Canary deployments
Another scenario where multiple labels are needed is to distinguish
deployments of different releases or configurations of the same component.
It is common practice to deploy a canary of a new application release
(specified via image tag in the pod template) side by side with the previous
release so that the new release can receive live production traffic before
fully rolling it out.
For instance, you can use a track label to differentiate releases. The primary, stable release would have a track label with the value stable:
name: frontend
replicas: 3
...
labels:
  app: guestbook
  tier: frontend
  track: stable
...
image: gb-frontend:v3
and then you can create a new release of the guestbook frontend that carries the track label with a different value (i.e. canary), so that two sets of pods would not overlap:
name: frontend-canary
replicas: 1
...
labels:
  app: guestbook
  tier: frontend
  track: canary
...
image: gb-frontend:v4
The frontend service would span both sets of replicas by selecting the
common subset of their labels (i.e. omitting the track label), so that the
traffic will be redirected to both applications:
selector:
  app: guestbook
  tier: frontend
You can tweak the number of replicas of the stable and canary releases to
determine the ratio of each release that will receive live production traffic
(in this case, 3:1). Once you're confident, you can update the stable track to
the new application release and remove the canary one.
Updating labels
Sometimes existing pods and other resources need to be relabeled before
creating new resources. This can be done with kubectl label. For example,
if you want to label all your nginx pods as frontend tier, simply run:
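For example, something like:
kubectl label pods -l app=nginx tier=fe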
pod/my-nginx-2035384211-j5fhi labeled
pod/my-nginx-2035384211-u2c7e labeled
pod/my-nginx-2035384211-u3t6x labeled
This first filters all pods with the label "app=nginx", and then labels them
with the "tier=fe". To see the pods you just labeled, run:
This outputs all "app=nginx" pods, with an additional label column of pods'
tier (specified with -L or --label-columns).
Updating annotations
Sometimes you would want to attach annotations to resources. Annotations
are arbitrary non-identifying metadata for retrieval by API clients such as
tools, libraries, etc. This can be done with kubectl annotate. For example:
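A sketch, using <pod-name> as a placeholder for one of your pods; the second command shows the annotation in the pod's YAML, as in the excerpt below:
kubectl annotate pods <pod-name> description='my frontend running nginx'
kubectl get pods <pod-name> -o yaml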
apiVersion: v1
kind: pod
metadata:
  annotations:
    description: my frontend running nginx
...
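For example, you can scale the Deployment manually and then set up a HorizontalPodAutoscaler for it (the replica count and bounds here are illustrative):
kubectl scale deployment/my-nginx --replicas=1
kubectl autoscale deployment/my-nginx --min=1 --max=3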
deployment.apps/my-nginx scaled
horizontalpodautoscaler.autoscaling/my-nginx autoscaled
Now your nginx replicas will be scaled up and down as needed,
automatically.
For more information, please see kubectl scale, kubectl autoscale and
horizontal pod autoscaler document.
kubectl apply
It is suggested to maintain a set of configuration files in source control (see
configuration as code), so that they can be maintained and versioned along
with the code for the resources they configure. Then, you can use kubectl
apply to push your configuration changes to the cluster.
This command will compare the version of the configuration that you're
pushing with the previous version and apply the changes you've made,
without overwriting any automated changes to properties you haven't
specified.
All subsequent calls to kubectl apply, and other commands that modify the
configuration, such as kubectl replace and kubectl edit, will update the
annotation, allowing subsequent calls to kubectl apply to detect and
perform deletions using a three-way diff.
kubectl edit
Alternatively, you may also update resources with kubectl edit:
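For example:
kubectl edit deployment/my-nginx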
This is equivalent to first getting the resource, editing it in a text editor, and then applying the resource with the updated version:
kubectl get deployment my-nginx -o yaml > /tmp/nginx.yaml
vi /tmp/nginx.yaml
# do some edit, and then save the file
kubectl apply -f /tmp/nginx.yaml
rm /tmp/nginx.yaml
This allows you to do more significant changes more easily. Note that you
can specify the editor with your EDITOR or KUBE_EDITOR environment
variables.
kubectl patch
You can use kubectl patch to update API objects in place. This command
supports JSON patch, JSON merge patch, and strategic merge patch. See
Update API Objects in Place Using kubectl patch and kubectl patch.
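As a sketch, a strategic merge patch that changes the replica count of the Deployment used above:
kubectl patch deployment my-nginx -p '{"spec":{"replicas":2}}'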
Disruptive updates
In some cases, you may need to update resource fields that cannot be
updated once initialized, or you may just want to make a recursive change
immediately, such as to fix broken pods created by a Deployment. To change
such fields, use replace --force, which deletes and re-creates the
resource. In this case, you can simply modify your original configuration file:
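For example, assuming the original configuration file was saved locally as nginx-deployment.yaml (a hypothetical filename):
kubectl replace -f ./nginx-deployment.yaml --force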
deployment.apps/my-nginx deleted
deployment.apps/my-nginx replaced
We'll guide you through how to create and update applications with
Deployments.
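For example, say you were running version 1.14.2 of nginx; you could create the Deployment with:
kubectl create deployment my-nginx --image=nginx:1.14.2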
deployment.apps/my-nginx created
and scale it to 3 replicas (so the old and new revisions can coexist):
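For example:
kubectl scale deployment my-nginx --replicas=3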
deployment.apps/my-nginx scaled
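To trigger the progressive update described next, you then change the image in the pod template; one way to do that (the 1.16.1 tag is illustrative) is:
kubectl set image deployment/my-nginx nginx=nginx:1.16.1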
That's it! The Deployment will declaratively update the deployed nginx
application progressively behind the scene. It ensures that only a certain
number of old replicas may be down while they are being updated, and only
a certain number of new replicas may be created above the desired number
of pods. To learn more details about it, visit Deployment page.
What's next
• Learn about how to use kubectl for application introspection and
debugging.
• See Configuration Best Practices and Tips.
Cluster Networking
Networking is a central part of Kubernetes, but it can be challenging to understand exactly how it is expected to work. There are 4 distinct networking problems to address: highly-coupled container-to-container communications, Pod-to-Pod communications, Pod-to-Service communications, and external-to-Service communications. Kubernetes imposes the following fundamental requirements on any networking implementation:
• pods on a node can communicate with all pods on all nodes without NAT
• agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
Note: For those platforms that support Pods running in the host network
(e.g. Linux):
• pods in the host network of a node can communicate with all pods on
all nodes without NAT
This model is not only less complex overall, but it is principally compatible
with the desire for Kubernetes to enable low-friction porting of apps from
VMs to containers. If your job previously ran in a VM, your VM had an IP
and could talk to other VMs in your project. This is the same basic model.
It is possible to request ports on the Node itself which forward to your Pod
(called host ports), but this is a very niche operation. How that forwarding is
implemented is also a detail of the container runtime. The Pod itself is blind
to the existence or non-existence of host ports.
The following networking options are sorted alphabetically - the order does
not imply any preferential status.
ACI
Cisco Application Centric Infrastructure offers an integrated overlay and
underlay SDN solution that supports containers, virtual machines, and bare
metal servers. ACI provides container networking integration for ACI. An
overview of the integration is provided here.
Antrea
Project Antrea is an opensource Kubernetes networking solution intended to
be Kubernetes native. It leverages Open vSwitch as the networking data
plane. Open vSwitch is a high-performance programmable virtual switch
that supports both Linux and Windows. Open vSwitch enables Antrea to
implement Kubernetes Network Policies in a high-performance and efficient
manner. Thanks to the "programmable" characteristic of Open vSwitch,
Antrea is able to implement an extensive set of networking and security
features and services on top of Open vSwitch.
AOS from Apstra
The AOS Reference Design currently supports Layer-3 connected hosts that
eliminate legacy Layer-2 switching problems. These Layer-3 hosts can be
Linux servers (Debian, Ubuntu, CentOS) that create BGP neighbor
relationships directly with the top of rack switches (TORs). AOS automates
the routing adjacencies and then provides fine grained control over the
route health injections (RHI) that are common in a Kubernetes deployment.
AOS has a rich set of REST API endpoints that enable Kubernetes to quickly
change the network policy based on application requirements. Further
enhancements will integrate the AOS Graph model used for the network
design with the workload provisioning, enabling an end to end management
system for both private and public clouds.
Details on how the AOS system works can be accessed here: https://
www.apstra.com/products/how-it-works/
AWS VPC CNI for Kubernetes
Using this CNI plugin allows Kubernetes pods to have the same IP address inside the pod as they do on the VPC network. The CNI allocates AWS Elastic Network Interfaces (ENIs) to each Kubernetes node and uses the secondary IP range from each ENI for pods on the node. The CNI includes controls for pre-allocation of ENIs and IP addresses for fast pod startup times and enables large clusters of up to 2,000 nodes.
Additionally, the CNI can be run alongside Calico for network policy
enforcement. The AWS VPC CNI project is open source with documentation
on GitHub.
Azure CNI for Kubernetes
Azure CNI is an open source plugin that integrates Kubernetes Pods with an
Azure Virtual Network (also known as VNet) providing network performance
at par with VMs. Pods can connect to peered VNet and to on-premises over
Express Route or site-to-site VPN and are also directly reachable from these
networks. Pods can access Azure services, such as storage and SQL, that are
protected by Service Endpoints or Private Link. You can use VNet security
policies and routing to filter Pod traffic. The plugin assigns VNet IPs to Pods
by utilizing a pool of secondary IPs pre-configured on the Network Interface
of a Kubernetes node.
Big Cloud Fabric from Big Switch Networks
With the help of the Big Cloud Fabric's virtual pod multi-tenant architecture, container orchestration systems such as Kubernetes, RedHat OpenShift, Mesosphere DC/OS & Docker Swarm will be natively integrated alongside VM orchestration systems such as VMware, OpenStack & Nutanix. Customers will be able to securely inter-connect any number of these clusters and enable inter-tenant communication between them if needed.
Calico
Calico is an open source networking and network security solution for
containers, virtual machines, and native host-based workloads. Calico
supports multiple data planes including: a pure Linux eBPF dataplane, a
standard Linux networking dataplane, and a Windows HNS dataplane.
Calico provides a full networking stack but can also be used in conjunction
with cloud provider CNIs to provide network policy enforcement.
Cilium
Cilium is open source software for providing and transparently securing
network connectivity between application containers. Cilium is L7/HTTP
aware and can enforce network policies on L3-L7 using an identity based
security model that is decoupled from network addressing, and it can be
used in combination with other CNI plugins.
CNI-Genie from Huawei
CNI-Genie is a CNI plugin that enables Kubernetes to simultaneously have
access to different implementations of the Kubernetes network model in
runtime. This includes any implementation that runs as a CNI plugin, such
as Flannel, Calico, Romana, Weave-net.
cni-ipvlan-vpc-k8s
cni-ipvlan-vpc-k8s contains a set of CNI and IPAM plugins to provide a
simple, host-local, low latency, high throughput, and compliant networking
stack for Kubernetes within Amazon Virtual Private Cloud (VPC)
environments by making use of Amazon Elastic Network Interfaces (ENI)
and binding AWS-managed IPs into Pods using the Linux kernel's IPvlan
driver in L2 mode.
Coil
Coil is a CNI plugin designed for ease of integration, providing flexible
egress networking. Coil operates with a low overhead compared to bare
metal, and allows you to define arbitrary egress NAT gateways for external
networks.
Contiv
Contiv provides configurable networking (native l3 using BGP, overlay using
vxlan, classic l2, or Cisco-SDN/ACI) for various use cases. Contiv is all open
sourced.
Flannel
Flannel is a very simple overlay network that satisfies the Kubernetes
requirements. Many people have reported success with Flannel and
Kubernetes.
Docker will now allocate IPs from the cbr-cidr block. Containers can reach
each other and Nodes over the cbr0 bridge. Those IPs are all routable within
the GCE project network.
GCE itself does not know anything about these IPs, though, so it will not
NAT them for outbound internet traffic. To achieve that an iptables rule is
used to masquerade (aka SNAT - to make it seem as if packets came from
the Node itself) traffic that is bound for IPs outside the GCE project network
(10.0.0.0/8).
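A sketch of such a masquerade rule (the interface name and the CIDR here are illustrative):
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE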
Lastly IP forwarding is enabled in the kernel (so the kernel will process
packets for bridged containers):
sysctl net.ipv4.ip_forward=1
The result of all this is that all Pods can reach each other and can egress
traffic to the internet.
Jaguar
Jaguar is an open source solution for Kubernetes networking based on OpenDaylight. Jaguar provides an overlay network using vxlan, and the Jaguar CNIPlugin provides one IP address per pod.
k-vswitch
k-vswitch is a simple Kubernetes networking plugin based on Open vSwitch.
It leverages existing functionality in Open vSwitch to provide a robust
networking plugin that is easy-to-operate, performant and secure.
Knitter
Knitter is a network solution which supports multiple networking in
Kubernetes. It provides the ability of tenant management and network
management. Knitter includes a set of end-to-end NFV container networking
solutions besides multiple network planes, such as keeping IP address for
applications, IP address migration, etc.
Kube-OVN
Kube-OVN is an OVN-based kubernetes network fabric for enterprises. With
the help of OVN/OVS, it provides some advanced overlay network features
like subnet, QoS, static IP allocation, traffic mirroring, gateway, openflow-
based network policy and service proxy.
Kube-router
Kube-router is a purpose-built networking solution for Kubernetes that aims
to provide high performance and operational simplicity. Kube-router
provides a Linux LVS/IPVS-based service proxy, a Linux kernel forwarding-
based pod-to-pod networking solution with no overlays, and iptables/ipset-
based network policy enforcer.
L2 networks and linux bridging
If you have a "dumb" L2 network, such as a simple switch in a "bare-metal"
environment, you should be able to do something similar to the above GCE
setup. Note that these instructions have only been tried very casually - it
seems to work, but has not been thoroughly tested. If you use this technique
and perfect the process, please let us know.
Follow the "With Linux Bridge devices" section of this very nice tutorial from
Lars Kellogg-Stedman.
Multus (a Multi Network plugin)
Multus supports all reference plugins (e.g. Flannel, DHCP, Macvlan) that implement the CNI specification and 3rd party plugins (e.g. Calico, Weave, Cilium, Contiv). In addition, Multus supports SRIOV, DPDK, OVS-DPDK & VPP workloads in Kubernetes with both cloud native and NFV based applications in Kubernetes.
NSX-T
VMware NSX-T is a network virtualization and security platform. NSX-T can
provide network virtualization for a multi-cloud and multi-hypervisor
environment and is focused on emerging application frameworks and
architectures that have heterogeneous endpoints and technology stacks. In
addition to vSphere hypervisors, these environments include other
hypervisors such as KVM, containers, and bare metal.
OpenVSwitch
OpenVSwitch is a somewhat more mature but also complicated way to build
an overlay network. This is endorsed by several of the "Big Shops" for
networking.
Romana
Romana is an open source network and security automation solution that
lets you deploy Kubernetes without an overlay network. Romana supports
Kubernetes Network Policy to provide isolation across network namespaces.
What's next
The early design of the networking model and its rationale, and some future
plans are described in more detail in the networking design document.
Logging Architecture
Application logs can help you understand what is happening inside your
application. The logs are particularly useful for debugging problems and
monitoring cluster activity. Most modern applications have some kind of
logging mechanism; as such, most container engines are likewise designed
to support some kind of logging. The easiest and most embraced logging
method for containerized applications is to write to the standard output and
standard error streams.
debug/counter-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox
    args: [/bin/sh, -c,
            'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
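A sketch of creating this pod, assuming the manifest above is saved locally as counter-pod.yaml:
kubectl apply -f counter-pod.yaml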
pod/counter created
As an example, you can find detailed information about how kube-up.sh sets
up logging for COS image on GCP in the corresponding script.
When you run kubectl logs as in the basic logging example, the kubelet on
the node handles the request and reads directly from the log file, returning
the contents in the response.
Because the logging agent must run on every node, it's common to implement it as either a DaemonSet replica, a manifest pod, or a dedicated native process on the node. However, the latter two approaches are deprecated and highly discouraged.
Kubernetes doesn't specify a logging agent, but two optional logging agents
are packaged with the Kubernetes release: Stackdriver Logging for use with
Google Cloud Platform, and Elasticsearch. You can find more information
and instructions in the dedicated documents. Both use fluentd with custom
configuration as an agent on the node.
Using a sidecar container with the logging agent
You can use a sidecar container in one of the following ways:
By having your sidecar containers stream to their own stdout and stderr
streams, you can take advantage of the kubelet and the logging agent that
already run on each node. The sidecar containers read logs from a file, a
socket, or the journald. Each individual sidecar container prints log to its
own stdout or stderr stream.
This approach allows you to separate several log streams from different parts of your application, some of which can lack support for writing to stdout or stderr. The logic behind redirecting logs is minimal, so it's hardly a significant overhead. Additionally, because stdout and stderr are handled by the kubelet, you can use built-in tools like kubectl logs.
Consider the following example. A pod runs a single container, and the
container writes to two different log files, using two different formats.
Here's a configuration file for the Pod:
admin/logging/two-files-counter-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
      done
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  volumes:
  - name: varlog
    emptyDir: {}
It would be a mess to have log entries of different formats in the same log
stream, even if you managed to redirect both components to the stdout
stream of the container. Instead, you could introduce two sidecar containers.
Each sidecar container could tail a particular log file from a shared volume
and then redirect the logs to its own stdout stream.
Here's a configuration file for a pod that has two sidecar containers:
admin/logging/two-files-counter-pod-streaming-sidecar.yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
      done
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  - name: count-log-1
    image: busybox
    args: [/bin/sh, -c, 'tail -n+1 -f /var/log/1.log']
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  - name: count-log-2
    image: busybox
    args: [/bin/sh, -c, 'tail -n+1 -f /var/log/2.log']
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  volumes:
  - name: varlog
    emptyDir: {}
Now when you run this pod, you can access each log stream separately by
running the following commands:
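For example:
kubectl logs counter count-log-1
kubectl logs counter count-log-2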
The node-level agent installed in your cluster picks up those log streams
automatically without any further configuration. If you like, you can
configure the agent to parse log lines depending on the source container.
Note that despite low CPU and memory usage (on the order of a couple of millicores of CPU and a few megabytes of memory), writing logs to a file and then streaming them to stdout can double disk usage. If you have an application that writes to a single file, it's generally better to set /dev/stdout as the destination rather than implementing the streaming sidecar container approach.
Sidecar containers can also be used to rotate log files that cannot be rotated
by the application itself. An example of this approach is a small container
running logrotate periodically. However, it's recommended to use stdout
and stderr directly and leave rotation and retention policies to the kubelet.
If the node-level logging agent is not flexible enough for your situation, you
can create a sidecar container with a separate logging agent that you have
configured specifically to run with your application.
admin/logging/fluentd-sidecar-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluentd.conf: |
    <source>
      type tail
      format none
      path /var/log/1.log
      pos_file /var/log/1.log.pos
      tag count.format1
    </source>
    <source>
      type tail
      format none
      path /var/log/2.log
      pos_file /var/log/2.log.pos
      tag count.format2
    </source>
    <match **>
      type google_cloud
    </match>
The second file describes a pod that has a sidecar container running fluentd.
The pod mounts a volume where fluentd can pick up its configuration data.
admin/logging/two-files-counter-pod-agent-sidecar.yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
      done
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  - name: count-agent
    image: k8s.gcr.io/fluentd-gcp:1.30
    env:
    - name: FLUENTD_ARGS
      value: -c /etc/fluentd-config/fluentd.conf
    volumeMounts:
    - name: varlog
      mountPath: /var/log
    - name: config-volume
      mountPath: /etc/fluentd-config
  volumes:
  - name: varlog
    emptyDir: {}
  - name: config-volume
    configMap:
      name: fluentd-config
After some time you can find log messages in the Stackdriver interface. Remember that this is just an example; you can replace fluentd with any logging agent, reading from any source inside an application container.
Metrics in Kubernetes
In most cases, metrics are available on the /metrics endpoint of the HTTP server. For components that don't expose an endpoint by default, it can be enabled using the --bind-address flag. Examples of those components:
• kube-controller-manager
• kube-proxy
• kube-apiserver
• kube-scheduler
• kubelet
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
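With suitable permissions in place, you can fetch a component's metrics; for example, the kube-apiserver's own metrics can be read through kubectl's raw API access:
kubectl get --raw /metrics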
Metric lifecycle
Alpha metric → Stable metric → Deprecated metric → Hidden metric →
Deleted metric
Deprecated metrics are slated for deletion, but are still available for use.
These metrics include an annotation about the version in which they became
deprecated.
For example:
• Before deprecation
• After deprecation
Hidden metrics are no longer published for scraping, but are still available
for use. To use a hidden metric, please refer to the Show hidden metrics
section.
The flag can only take the previous minor version as its value. All metrics hidden in the previous minor version will be emitted if administrators set that version in show-hidden-metrics-for-version. Older versions are not allowed because this would violate the metrics deprecation policy.
For example, if you're upgrading from release 1.12 to 1.13, but still depend on a metric A deprecated in 1.12, you should set hidden metrics via the command line: --show-hidden-metrics=1.12, and remember to remove this metric dependency before upgrading to 1.14.
cloudprovider_gce_api_request_duration_seconds { request = "instance_list"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_insert"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_delete"}
cloudprovider_gce_api_request_duration_seconds { request = "attach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "detach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "list_disk"}
kube-scheduler metrics
FEATURE STATE: Kubernetes v1.20 [alpha]
The scheduler exposes optional metrics that report the requested resources and the desired limits of all running pods. These metrics can be used to build capacity planning dashboards, assess current or historical scheduling limits, quickly identify workloads that cannot schedule due to lack of resources, and compare actual usage to the pod's request.
• namespace
• pod name
• the node where the pod is scheduled or an empty string if not yet
scheduled
• priority
• the assigned scheduler for that pod
• the name of the resource (for example, cpu)
• the unit of the resource if known (for example, cores)
Once a pod reaches completion (has a restartPolicy of Never or OnFailure and is in the Succeeded or Failed pod phase, or has been deleted and all containers have a terminated state), the series is no longer reported since the scheduler is now free to schedule other pods to run. The two metrics are called kube_pod_resource_request and kube_pod_resource_limit.
What's next
• Read about the Prometheus text format for metrics
• See the list of stable Kubernetes metrics
• Read about the Kubernetes deprecation policy
System Logs
System component logs record events happening in the cluster, which can be very useful for debugging. You can configure log verbosity to see more or less detail. Logs can be as coarse-grained as showing errors within a component, or as fine-grained as showing step-by-step traces of events (like HTTP access logs, pod state changes, controller actions, or scheduler decisions).
Klog
klog is the Kubernetes logging library. klog generates log messages for the
Kubernetes system components.
For more information about klog configuration, see the Command line tool
reference.
Structured Logging
FEATURE STATE: Kubernetes v1.19 [alpha]
Structured logging introduces a uniform structure in log messages, allowing for programmatic extraction of information.
JSON log format
Warning:
JSON output does not support many standard klog flags. For a list of unsupported klog flags, see the Command line tool reference.
Not all logs are guaranteed to be written in JSON format (for
example, during process start). If you intend to parse logs, make
sure you can handle log lines that are not JSON as well.
{
"ts": 1580306777.04728,
"v": 4,
"msg": "Pod status updated",
"pod":{
"name": "nginx-1",
"namespace": "default"
},
"status": "ready"
}
List of components currently supporting JSON format:
• kube-controller-manager
• kube-apiserver
• kube-scheduler
• kubelet
Log sanitization
FEATURE STATE: Kubernetes v1.20 [alpha]
List of components currently supporting log sanitization:
• kube-controller-manager
• kube-apiserver
• kube-scheduler
• kubelet
Note: The Log sanitization filter does not prevent user workload
logs from leaking sensitive data.
Log location
There are two types of system components: those that run in a container and those that do not run in a container. For example:
• The Kubernetes scheduler and kube-proxy run in a container.
• The kubelet and container runtime, for example Docker, do not run in containers.
What's next
• Read about the Kubernetes Logging Architecture
• Read about Structured Logging
• Read about the Conventions for logging severity
External garbage collection tools are not recommended as these tools can
potentially break the behavior of kubelet by removing containers expected
to exist.
Image Collection
Kubernetes manages the lifecycle of all images through imageManager, with the cooperation of cadvisor.
The policy for garbage collecting images takes two factors into
consideration: HighThresholdPercent and LowThresholdPercent. Disk
usage above the high threshold will trigger garbage collection. The garbage
collection will delete least recently used images until the low threshold has
been met.
Container Collection
The policy for garbage collecting containers considers three user-defined variables. MinAge is the minimum age at which a container can be garbage collected. MaxPerPodContainer is the maximum number of dead containers every single pod (UID, container name) pair is allowed to have. MaxContainers is the maximum number of total dead containers. These variables can be individually disabled by setting MinAge to zero and setting MaxPerPodContainer and MaxContainers respectively to less than zero.
Containers that are not managed by kubelet are not subject to container
garbage collection.
User Configuration
You can adjust the following thresholds to tune image garbage collection
with the following kubelet flags:
You can customize the garbage collection policy through the following
kubelet flags:
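A sketch of those flags with illustrative values (the flag names below are the kubelet's image and container garbage collection flags; check the kubelet reference for your version for the authoritative list and defaults):
kubelet --image-gc-high-threshold=85 \
        --image-gc-low-threshold=80 \
        --minimum-container-ttl-duration=0 \
        --maximum-dead-containers-per-container=1 \
        --maximum-dead-containers=-1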
Deprecation
Some kubelet Garbage Collection features in this doc will be replaced by
kubelet eviction in the future.
Including:
What's next
See Configuring Out Of Resource Handling for more details.
Proxies in Kubernetes
This page explains proxies used with Kubernetes.
Proxies
There are several different proxies you may encounter when using
Kubernetes:
Requesting redirects
Proxies have replaced redirect capabilities. Redirects have been deprecated.
The API Priority and Fairness feature (APF) is an alternative that improves upon the aforementioned max-inflight limitations. APF classifies and isolates
requests in a more fine-grained way. It also introduces a limited amount of
queuing, so that no requests are rejected in cases of very brief bursts.
Requests are dispatched from queues using a fair queuing technique so that,
for example, a poorly-behaved controller need not starve others (even at the
same priority level).
kube-apiserver \
--feature-gates=APIPriorityAndFairness=false \
--runtime-config=flowcontrol.apiserver.k8s.io/v1beta1=false \
# …and other flags as usual
Alternatively, you can enable the v1alpha1 version of the API group with --runtime-config=flowcontrol.apiserver.k8s.io/v1alpha1=true.
Concepts
There are several distinct features involved in the API Priority and Fairness
feature. Incoming requests are classified by attributes of the request using
FlowSchemas, and assigned to priority levels. Priority levels add a degree of
isolation by maintaining separate concurrency limits, so that requests
assigned to different priority levels cannot starve each other. Within a
priority level, a fair-queuing algorithm prevents requests from different
flows from starving each other, and allows for requests to be queued to
prevent bursty traffic from causing failed requests when the average load is
acceptably low.
Priority Levels
Without APF enabled, overall concurrency in the API server is limited by the
kube-apiserver flags --max-requests-inflight and --max-mutating-
requests-inflight. With APF enabled, the concurrency limits defined by
these flags are summed and then the sum is divided up among a
configurable set of priority levels. Each incoming request is assigned to a
single priority level, and each priority level will only dispatch as many
concurrent requests as its configuration allows.
The default configuration, for example, includes separate priority levels for
leader-election requests, requests from built-in controllers, and requests
from Pods. This means that an ill-behaved Pod that floods the API server
with requests cannot prevent leader election or actions by the built-in
controllers from succeeding.
Queuing
Even within a priority level there may be a large number of distinct sources
of traffic. In an overload situation, it is valuable to prevent one stream of
requests from starving others (in particular, in the relatively common case of
a single buggy client flooding the kube-apiserver with requests, that buggy
client would ideally not have much measurable impact on other clients at
all). This is handled by use of a fair-queuing algorithm to process requests
that are assigned the same priority level. Each request is assigned to a flow,
identified by the name of the matching FlowSchema plus a flow
distinguisher — which is either the requesting user, the target resource's
namespace, or nothing — and the system attempts to give approximately
equal weight to requests in different flows of the same priority level.
After classifying a request into a flow, the API Priority and Fairness feature
then may assign the request to a queue. This assignment uses a technique
known as shuffle sharding, which makes relatively efficient use of queues to
insulate low-intensity flows from high-intensity flows.
The details of the queuing algorithm are tunable for each priority level, and
allow administrators to trade off memory use, fairness (the property that
independent flows will all make progress when total traffic exceeds
capacity), tolerance for bursty traffic, and the added latency induced by
queuing.
Exempt requests
Some requests are considered sufficiently important that they are not
subject to any of the limitations imposed by this feature. These exemptions
prevent an improperly-configured flow control configuration from totally
disabling an API server.
Defaults
The Priority and Fairness feature ships with a suggested configuration that
should suffice for experimentation; if your cluster is likely to experience
heavy load then you should consider what configuration will work best. The
suggested configuration groups requests into five priority classes:
• The system priority level is for requests from the system:nodes group,
i.e. Kubelets, which must be able to contact the API server in order for
workloads to be able to schedule on them.
• The workload-low priority level is for requests from any other service
account, which will typically include all requests from controllers
running in Pods.
• The special exempt priority level is used for requests that are not
subject to flow control at all: they will always be dispatched
immediately. The special exempt FlowSchema classifies all requests
from the system:masters group into this priority level. You may define
other FlowSchemas that direct other requests to this priority level, if
appropriate.
If you add the following additional FlowSchema, it exempts health check requests on the kube-apiserver (such as those from local kubelets) from rate limiting.
Caution: Making this change also allows any hostile party to then
send health-check requests that match this FlowSchema, at any
volume they like. If you have a web traffic filter or similar external
security mechanism to protect your cluster's API server from
general internet traffic, you can configure rules to block any health
check requests that originate from outside your cluster.
priority-and-fairness/health-for-strangers.yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: health-for-strangers
spec:
  matchingPrecedence: 1000
  priorityLevelConfiguration:
    name: exempt
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - "/healthz"
      - "/livez"
      - "/readyz"
      verbs:
      - "*"
    subjects:
    - kind: Group
      group:
        name: system:unauthenticated
Resources
The flow control API involves two kinds of resources.
PriorityLevelConfigurations define the available isolation classes, the share
of the available concurrency budget that each can handle, and allow for fine-
tuning queuing behavior. FlowSchemas are used to classify individual
inbound requests, matching each to a single PriorityLevelConfiguration.
There is also a v1alpha1 version of the same API group, and it has the same
Kinds with the same syntax and semantics.
PriorityLevelConfiguration
A PriorityLevelConfiguration represents a single isolation class. Each
PriorityLevelConfiguration has an independent limit on the number of
outstanding requests, and limitations on the number of queued requests.
The queuing configuration allows tuning the fair queuing algorithm for a
priority level. Details of the algorithm can be read in the enhancement
proposal, but in short:
FlowSchema
A FlowSchema matches some inbound requests and assigns them to a
priority level. Every inbound request is tested against every FlowSchema in
turn, starting with those with numerically lowest --- which we take to be the
logically highest --- matchingPrecedence and working onward. The first
match wins.
For the name field in subjects, and the verbs, apiGroups, resources, namesp
aces, and nonResourceURLs fields of resource and non-resource rules, the
wildcard * may be specified to match all values for the given field,
effectively removing it from consideration.
Observability
Metrics
Note: In versions of Kubernetes before v1.20, the labels flow_schema and priority_level were inconsistently named flowSchema and priorityLevel, respectively. If you're running Kubernetes versions v1.19 and earlier, you should refer to the documentation for your version.
When you enable the API Priority and Fairness feature, the kube-apiserver
exports additional metrics. Monitoring these can help you determine
whether your configuration is inappropriately throttling important traffic, or
find poorly-behaved workloads that may be harming system health.
• apiserver_flowcontrol_rejected_requests_total is a counter vector (cumulative since server start) of requests that were rejected, broken down by the labels flow_schema (indicating the one that matched the request), priority_level (indicating the one to which the request was assigned), and reason. The reason label will have one of the following values:
• apiserver_flowcontrol_dispatched_requests_total is a counter
vector (cumulative since server start) of requests that began executing,
broken down by the labels flow_schema (indicating the one that
matched the request) and priority_level (indicating the one to which
the request was assigned).
• apiserver_flowcontrol_read_vs_write_request_count_samples is a histogram vector of observations of the then-current number of requests, broken down by the labels phase (which takes on the values waiting and executing) and request_kind (which takes on the values mutating and readOnly). The observations are made periodically at a high rate.
• apiserver_flowcontrol_read_vs_write_request_count_watermarks is a histogram vector of high or low water marks of the number of requests broken down by the labels phase (which takes on the values waiting and executing) and request_kind (which takes on the values mutating and readOnly); the label mark takes on values high and low. The water marks are accumulated over windows bounded by the times when an observation was added to apiserver_flowcontrol_read_vs_write_request_count_samples. These water marks show the range of values that occurred between samples.
• apiserver_flowcontrol_current_executing_requests is a gauge vector holding the instantaneous number of executing (not waiting in a queue) requests, broken down by the labels priority_level and flow_schema.
• apiserver_flowcontrol_priority_level_request_count_samples is a histogram vector of observations of the then-current number of requests broken down by the labels phase (which takes on the values waiting and executing) and priority_level. Each histogram gets observations taken periodically, up through the last activity of the relevant sort. The observations are made at a high rate.
• apiserver_flowcontrol_priority_level_request_count_watermarks is a histogram vector of high or low water marks of the number of requests broken down by the labels phase (which takes on the values waiting and executing) and priority_level; the label mark takes on values high and low. The water marks are accumulated over windows bounded by the times when an observation was added to apiserver_flowcontrol_priority_level_request_count_samples. These water marks show the range of values that occurred between samples.
• apiserver_flowcontrol_request_queue_length_after_enqueue is a histogram vector of queue lengths for the queues, broken down by the labels priority_level and flow_schema, as sampled by the enqueued requests. Each request that gets queued contributes one sample to its histogram, reporting the length of the queue just after the request was added. Note that this produces different statistics than an unbiased survey would.
• apiserver_flowcontrol_request_concurrency_limit is a gauge
vector holding the computed concurrency limit (based on the API
server's total concurrency limit and PriorityLevelConfigurations'
concurrency shares), broken down by the label priority_level.
• apiserver_flowcontrol_request_wait_duration_seconds is a histogram vector of how long requests spent queued, broken down by the labels flow_schema (indicating which one matched the request), priority_level (indicating the one to which the request was assigned), and execute (indicating whether the request started executing).
• apiserver_flowcontrol_request_execution_seconds is a histogram
vector of how long requests took to actually execute, broken down by
the labels flow_schema (indicating which one matched the request) and
priority_level (indicating the one to which the request was
assigned).
Debug endpoints
When you enable the API Priority and Fairness feature, the kube-apiserver
serves the following additional paths at its HTTP[S] ports.
• /debug/api_priority_and_fairness/dump_priority_levels - a listing of all the priority levels and the current state of each. You can fetch it like this:
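For example, using kubectl's raw API access:
kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels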
You can get a more detailed listing with a command like this:
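A sketch of a more detailed query; the dump_requests path and its includeRequestDetails parameter are assumed to be available alongside the endpoint above:
kubectl get --raw '/debug/api_priority_and_fairness/dump_requests?includeRequestDetails=1'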
What's next
For background information on design details for API priority and fairness,
see the enhancement proposal. You can make suggestions and feature
requests via SIG API Machinery.
Installing Addons
Caution: This section links to third party projects that provide
functionality required by Kubernetes. The Kubernetes project
authors aren't responsible for these projects. This page follows
CNCF website guidelines by listing projects alphabetically. To add
a project to this list, read the content guide before submitting a
change.
This page lists some of the available add-ons and links to their respective
installation instructions.
Service Discovery
• CoreDNS is a flexible, extensible DNS server which can be installed as
the in-cluster DNS for pods.
Legacy Add-ons
There are several other add-ons documented in the deprecated cluster/
addons directory.
Extending Kubernetes
Different ways to change the behavior of your Kubernetes cluster.
Configuration
Configuration files and flags are documented in the Reference section of the
online documentation, under each binary:
• kubelet
• kube-apiserver
• kube-controller-manager
• kube-scheduler.
Extensions
Extensions are software components that extend and deeply integrate with
Kubernetes. They adapt it to support new types and new kinds of hardware.
Extension Patterns
Kubernetes is designed to be automated by writing client programs. Any
program that reads and/or writes to the Kubernetes API can provide useful
automation. Automation can run on the cluster or off it. By following the
guidance in this doc you can write highly available and robust automation.
Automation generally works with any Kubernetes cluster, including hosted
clusters and managed installations.
There is a specific pattern for writing client programs that work well with
Kubernetes called the Controller pattern. Controllers typically read an
object's .spec, possibly do things, and then update the object's .status.
Below is a diagram showing how the extension points interact with the
Kubernetes control plane.
Extension Points
If you are unsure where to start, this flowchart can help. Note that some
solutions may involve several types of extensions.
API Extensions
User-Defined Types
For more about Custom Resources, see the Custom Resources concept
guide.
The combination of a custom resource API and a control loop is called the
Operator pattern. The Operator pattern is used to manage specific, usually
stateful, applications. These custom APIs and control loops can also be used
to control other resources, such as storage or policies.
When you extend the Kubernetes API by adding custom resources, the added resources always fall into a new API Group. You cannot replace or change existing API groups. Adding an API does not directly let you affect the behavior of existing APIs (e.g. Pods), but API Access Extensions do.
Authentication
Authorization
Infrastructure Extensions
Storage Plugins
Flex Volumes allow users to mount volume types without built-in support by
having the Kubelet call a Binary Plugin to mount the volume.
Device Plugins
Device plugins allow a node to discover new Node resources (in addition to
the builtin ones like cpu and memory) via a Device Plugin.
Network Plugins
Scheduler Extensions
The scheduler is a special type of controller that watches pods, and assigns
pods to nodes. The default scheduler can be replaced entirely, while
continuing to use other Kubernetes components, or multiple schedulers can
run at the same time.
This is a significant undertaking, and almost all Kubernetes users find they
do not need to modify the scheduler.
The scheduler also supports a webhook that permits a webhook backend
(scheduler extension) to filter and prioritize the nodes chosen for a pod.
What's next
Overview
Customization approaches can be broadly divided into configuration, which
only involves changing flags, local configuration files, or API resources; and
extensions, which involve running additional programs or services. This
document is primarily about extensions.
Configuration
Configuration files and flags are documented in the Reference section of the
online documentation, under each binary:
• kubelet
• kube-apiserver
• kube-controller-manager
• kube-scheduler.
Extensions
Extensions are software components that extend and deeply integrate with
Kubernetes. They adapt it to support new types and new kinds of hardware.
Extension Patterns
Kubernetes is designed to be automated by writing client programs. Any
program that reads and/or writes to the Kubernetes API can provide useful
automation. Automation can run on the cluster or off it. By following the
guidance in this doc you can write highly available and robust automation.
Automation generally works with any Kubernetes cluster, including hosted
clusters and managed installations.
There is a specific pattern for writing client programs that work well with
Kubernetes called the Controller pattern. Controllers typically read an
object's .spec, possibly do things, and then update the object's .status.
Below is a diagram showing how the extension points interact with the
Kubernetes control plane.
Extension Points
This diagram shows the extension points in a Kubernetes system.
1. Users often interact with the Kubernetes API using kubectl. Kubectl
plugins extend the kubectl binary. They only affect the individual user's
local environment, and so cannot enforce site-wide policies.
2. The apiserver handles all requests. Several types of extension points in
the apiserver allow authenticating requests, or blocking them based on
their content, editing content, and handling deletion. These are
described in the API Access Extensions section.
3. The apiserver serves various kinds of resources. Built-in resource kinds,
like pods, are defined by the Kubernetes project and can't be changed.
You can also add resources that you define, or that other projects have
defined, called Custom Resources, as explained in the Custom
Resources section. Custom Resources are often used with API Access
Extensions.
4. The Kubernetes scheduler decides which nodes to place pods on. There
are several ways to extend scheduling. These are described in the
Scheduler Extensions section.
5. Much of the behavior of Kubernetes is implemented by programs called
Controllers which are clients of the API-Server. Controllers are often
used in conjunction with Custom Resources.
6. The kubelet runs on servers, and helps pods appear like virtual servers
with their own IPs on the cluster network. Network Plugins allow for
different implementations of pod networking.
7. The kubelet also mounts and unmounts volumes for containers. New
types of storage can be supported via Storage Plugins.
If you are unsure where to start, this flowchart can help. Note that some
solutions may involve several types of extensions.
API Extensions
User-Defined Types
Consider adding a Custom Resource to Kubernetes if you want to define new
controllers, application configuration objects or other declarative APIs, and
to manage them using Kubernetes tools, such as kubectl.
For more about Custom Resources, see the Custom Resources concept
guide.
Authentication
Authentication maps headers or certificates in all requests to a username for
the client making the request.
Authorization
Authorization determines whether specific users can read, write, and do other operations on API resources. It works at the level of whole resources -- it doesn't discriminate based on arbitrary object fields. If the built-in authorization options don't meet your needs, an Authorization webhook allows calling out to user-provided code to make an authorization decision.
Device Plugins
Device plugins let a node discover new node resources, in addition to the
built-in ones like cpu and memory.
Network Plugins
Different networking fabrics can be supported via node-level Network
Plugins.
Scheduler Extensions
The scheduler is a special type of controller that watches pods, and assigns
pods to nodes. The default scheduler can be replaced entirely, while
continuing to use other Kubernetes components, or multiple schedulers can
run at the same time.
This is a significant undertaking, and almost all Kubernetes users find they
do not need to modify the scheduler.
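If you do run an additional scheduler, a Pod can opt into it by name through the spec.schedulerName field. The following is a minimal sketch; my-custom-scheduler and the container image are hypothetical values, not part of this page:

apiVersion: v1
kind: Pod
metadata:
  name: pod-scheduled-by-second-scheduler
spec:
  # name of the alternative scheduler that should place this Pod (hypothetical)
  schedulerName: my-custom-scheduler
  containers:
    - name: app
      image: registry.example.com/app:1.0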
What's next
• Learn more about Custom Resources
• Learn about Dynamic admission control
• Learn more about Infrastructure extensions
◦ Network Plugins
◦ Device Plugins
• Learn about kubectl plugins
• Learn about the Operator pattern
Custom Resources
Custom resources are extensions of the Kubernetes API. This page discusses
when to add a custom resource to your Kubernetes cluster and when to use
a standalone service. It describes the two methods for adding custom
resources and how to choose between them.
Custom resources
A resource is an endpoint in the Kubernetes API that stores a collection of
API objects of a certain kind; for example, the built-in pods resource
contains a collection of Pod objects.
Custom controllers
On their own, custom resources simply let you store and retrieve structured
data. When you combine a custom resource with a custom controller,
custom resources provide a true declarative API.
A declarative API allows you to declare or specify the desired state of your
resource and tries to keep the current state of Kubernetes objects in sync
with the desired state. The controller interprets the structured data as a
record of the user's desired state, and continually maintains this state.
Declarative APIs
In a Declarative API, typically:
Imperative APIs are not declarative. Signs that your API might not be
declarative include:
• The client says "do this", and then gets a synchronous response back
when it is done.
• The client says "do this", and then gets an operation ID back, and has to
check a separate Operation object to determine completion of the
request.
• You talk about Remote Procedure Calls (RPCs).
• Directly storing large amounts of data; for example, > a few kB per
object, or > 1000s of objects.
• High bandwidth access (10s of requests per second sustained) needed.
• Store end-user data (such as images, PII, etc.) or other large-scale data
processed by applications.
• The natural operations on the objects are not CRUD-y.
• The API is not easily modeled as objects.
• You chose to represent pending operations with an operation ID or an
operation object.
Use a custom resource (CRD or aggregated API) if most of the following apply:
• You want to use Kubernetes client libraries and CLIs to create and
update the new resource.
• You want top-level support from kubectl; for example, kubectl get
my-object object-name.
• You want to build new automation that watches for updates on the new
object, and then CRUD other objects, or vice versa.
• You want to write automation that handles updates to the object.
• You want to use Kubernetes API conventions like .spec, .status, and .
metadata.
• You want the object to be an abstraction over a collection of controlled
resources, or a summarization of other resources.
Kubernetes provides two ways to add custom resources to your cluster:
CustomResourceDefinitions (CRDs) and API server aggregation. These two
options meet the needs of different users, so that neither ease of use nor
flexibility is compromised.
Aggregated APIs are subordinate API servers that sit behind the primary API
server, which acts as a proxy. This arrangement is called API Aggregation
(AA). To users, it simply appears that the Kubernetes API is extended.
CRDs allow users to create new types of resources without adding another
API server. You do not need to understand API Aggregation to use CRDs.
Regardless of how they are installed, the new resources are referred to as
Custom Resources to distinguish them from built-in Kubernetes resources
(like pods).
CustomResourceDefinitions
The CustomResourceDefinition API resource allows you to define custom
resources. Defining a CRD object creates a new custom resource with a
name and schema that you specify. The Kubernetes API serves and handles
the storage of your custom resource. The name of a CRD object must be a
valid DNS subdomain name.
This frees you from writing your own API server to handle the custom
resource, but the generic nature of the implementation means you have less
flexibility than with API server aggregation.
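For illustration, a minimal sketch of a CRD manifest is shown below; the group, names, and schema (stable.example.com, CronTab, cronSpec, and so on) are hypothetical placeholders rather than anything defined on this page:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # the name must match <plural>.<group>
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronSpec:
                  type: string
                replicas:
                  type: integer

Once such a CRD is created, the API server serves a new endpoint (here, /apis/stable.example.com/v1/namespaces/*/crontabs) and you can manage objects of the new kind with kubectl like any built-in resource.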
The comparison below lists, for each feature, a description and whether CRDs and aggregated APIs support it.

Validation
• Description: Help users prevent errors and allow you to evolve your API independently of your clients. These features are most useful when there are many clients who can't all update at the same time.
• CRDs: Yes. Most validation can be specified in the CRD using OpenAPI v3.0 validation. Any other validations are supported by addition of a Validating Webhook.
• Aggregated API: Yes, arbitrary validation checks.

Defaulting
• Description: See above.
• CRDs: Yes, either via the OpenAPI v3.0 validation default keyword (GA in 1.17), or via a Mutating Webhook (though this will not be run when reading from etcd for old objects).
• Aggregated API: Yes.

Multi-versioning
• Description: Allows serving the same object through two API versions. Can help ease API changes like renaming fields. Less important if you control your client versions.
• CRDs: Yes.
• Aggregated API: Yes.

Custom Storage
• Description: If you need storage with a different performance mode (for example, a time-series database instead of key-value store) or isolation for security (for example, encryption of sensitive information, etc.).
• CRDs: No.
• Aggregated API: Yes.

Custom Business Logic
• Description: Perform arbitrary checks or actions when creating, reading, updating or deleting an object.
• CRDs: Yes, using Webhooks.
• Aggregated API: Yes.

Scale Subresource
• Description: Allows systems like HorizontalPodAutoscaler and PodDisruptionBudget to interact with your new resource.
• CRDs: Yes.
• Aggregated API: Yes.

Status Subresource
• Description: Allows fine-grained access control where the user writes the spec section and the controller writes the status section. Allows incrementing the object Generation on custom resource data mutation (requires separate spec and status sections in the resource).
• CRDs: Yes.
• Aggregated API: Yes.

Other Subresources
• Description: Add operations other than CRUD, such as "logs" or "exec".
• CRDs: No.
• Aggregated API: Yes.

strategic-merge-patch
• Description: The new endpoints support PATCH with Content-Type: application/strategic-merge-patch+json. Useful for updating objects that may be modified both locally, and by the server. For more information, see "Update API Objects in Place Using kubectl patch".
• CRDs: No.
• Aggregated API: Yes.

Protocol Buffers
• Description: The new resource supports clients that want to use Protocol Buffers.
• CRDs: No.
• Aggregated API: Yes.

OpenAPI Schema
• Description: Is there an OpenAPI (swagger) schema for the types that can be dynamically fetched from the server? Is the user protected from misspelling field names by ensuring only allowed fields are set? Are types enforced (in other words, don't put an int in a string field)?
• CRDs: Yes, based on the OpenAPI v3.0 validation schema (GA in 1.16).
• Aggregated API: Yes.
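For instance, with CRDs the status and scale subresources are enabled per served version in the CRD manifest. The following excerpt is a sketch reusing the hypothetical crontabs example shown earlier on this page:

# Excerpt of a CustomResourceDefinition's spec (hypothetical crontabs example)
versions:
  - name: v1
    served: true
    storage: true
    subresources:
      # exposes the /status subresource so a controller can write status separately
      status: {}
      # exposes the /scale subresource so systems like the HorizontalPodAutoscaler can interact with the resource
      scale:
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas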
Common Features
When you create a custom resource, either via a CRD or an AA, you get
many features for your API, compared to implementing it outside the
Kubernetes platform.
Storage
Custom resources consume storage space in the same way that ConfigMaps
do. Creating too many custom resources may overload your API server's
storage space.
Aggregated API servers may use the same storage as the main API server, in
which case the same warning applies.
If you use RBAC for authorization, most RBAC roles will not grant access to
the new resources (except the cluster-admin role or any role created with
wildcard rules). You'll need to explicitly grant access to the new resources.
CRDs and Aggregated APIs often come bundled with new role definitions for
the types they add.
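For example, a sketch of a ClusterRole that grants full access to a hypothetical custom resource (the crontabs resource in the stable.example.com group from the sketch above) could look like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crontab-editor
rules:
  # grant access to the custom resource's API group and plural resource name
  - apiGroups: ["stable.example.com"]
    resources: ["crontabs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]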
Aggregated API servers may or may not use the same authentication,
authorization, and auditing as the primary API server.
When you add a custom resource, you can access it using:
• kubectl
• The kubernetes dynamic client.
• A REST client that you write.
• A client generated using Kubernetes client generation tools (generating
one is an advanced undertaking, but some projects may provide a client
along with the CRD or AA).
What's next
• Learn how to Extend the Kubernetes API with the aggregation layer.
Extending the Kubernetes API with the Aggregation Layer
The aggregation layer allows Kubernetes to be extended with additional
APIs, beyond what is offered by the core Kubernetes APIs.
The aggregation layer is different from Custom Resources, which are a way
to make the kube-apiserver recognise new kinds of object.
Aggregation layer
The aggregation layer runs in-process with the kube-apiserver. Until an
extension resource is registered, the aggregation layer will do nothing. To
register an API, you add an APIService object, which "claims" the URL path
in the Kubernetes API. At that point, the aggregation layer will proxy
anything sent to that API path (e.g. /apis/myextension.mycompany.io/v1/
…) to the registered APIService.
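As an illustrative sketch, an APIService that claims the example path above might look like the following; the Service name and namespace, the priorities, and the CA bundle are hypothetical values you would replace with your own:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  # the name must be <version>.<group> of the API being registered
  name: v1.myextension.mycompany.io
spec:
  group: myextension.mycompany.io
  version: v1
  # in-cluster Service that fronts the extension API server (hypothetical)
  service:
    name: myextension-apiserver
    namespace: myextension
  groupPriorityMinimum: 1000
  versionPriority: 15
  # caBundle: <base64-encoded CA certificate used to verify the extension server>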
Response latency
Extension API servers should have low latency networking to and from the
kube-apiserver. Discovery requests are required to round-trip from the kube-
apiserver in five seconds or less.
What's next
• To get the aggregator working in your environment, configure the
aggregation layer.
• Then, set up an extension api-server to work with the aggregation layer.
• Also, learn how to extend the Kubernetes API using Custom Resource
Definitions.
• Read the specification for APIService
Compute, Storage, and Networking Extensions
Network Plugins
Device Plugins
Network Plugins
Network plugins in Kubernetes come in a few flavors:
• CNI plugins: adhere to the Container Network Interface (CNI) specification, designed for interoperability.
• Kubenet plugin: implements basic cbr0 bridge networking using the bridge and host-local CNI plugins.
Installation
The kubelet has a single default network plugin, and a default network
common to the entire cluster. It probes for plugins when it starts up,
remembers what it finds, and executes the selected plugin at appropriate
times in the pod lifecycle (this is only true for Docker, as CRI manages its
own CNI plugins). There are two Kubelet command line parameters to keep
in mind when using plugins:
• cni-bin-dir: Kubelet probes this directory for plugins on startup.
• network-plugin: The network plugin to use from cni-bin-dir. It must match the name reported by a plugin probed from the plugin directory. For CNI plugins, this is cni.
CNI
The CNI plugin is selected by passing Kubelet the --network-plugin=cni
command-line option. Kubelet reads a file from --cni-conf-dir (default /
etc/cni/net.d) and uses the CNI configuration from that file to set up each
pod's network. The CNI configuration file must match the CNI specification,
and any required CNI plugins referenced by the configuration must be
present in --cni-bin-dir (default /opt/cni/bin).
If there are multiple CNI configuration files in the directory, the kubelet uses
the configuration file that comes first by name in lexicographic order.
Support hostPort
The CNI networking plugin supports hostPort. You can use the official
portmap plugin offered by the CNI plugin team or use your own plugin with
portMapping functionality. For example:
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "127.0.0.1",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}
Experimental Feature
The CNI networking plugin also supports pod ingress and egress traffic
shaping. You can use the official bandwidth plugin offered by the CNI plugin
team or use your own plugin with bandwidth control functionality.
If you want to enable traffic shaping support, you must add the bandwidth
plugin to your CNI configuration file (default /etc/cni/net.d) and ensure
that the binary is included in your CNI bin dir (default /opt/cni/bin).
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "127.0.0.1",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
You can then limit a Pod's bandwidth by adding the kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations to its metadata:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
...
kubenet
Kubenet is a very basic, simple network plugin, on Linux only. It does not, of
itself, implement more advanced features like cross-node networking or
network policy. It is typically used together with a cloud provider that sets
up routing rules for communication between nodes, or in single-node
environments.
Kubenet creates a Linux bridge named cbr0 and creates a veth pair for each
pod with the host end of each pair connected to cbr0. The pod end of the
pair is assigned an IP address allocated from a range assigned to the node
either through configuration or by the controller-manager. cbr0 is assigned
an MTU matching the smallest MTU of an enabled normal interface on the
host.
Where needed, you can specify the MTU explicitly with the network-
plugin-mtu kubelet option. For example, on AWS the eth0 MTU is typically
9001, so you might specify --network-plugin-mtu=9001. If you're using
IPSEC you might reduce it to allow for encapsulation overhead; for example:
--network-plugin-mtu=8873.
What's next
Device Plugins
Use the Kubernetes device plugin framework to implement plugins for
GPUs, NICs, FPGAs, InfiniBand, and similar resources that require vendor-
specific setup.
FEATURE STATE: Kubernetes v1.10 [beta]
service Registration {
    rpc Register(RegisterRequest) returns (Empty) {}
}
A device plugin can register itself with the kubelet through this gRPC
service. During the registration, the device plugin needs to send:
• the name of its Unix socket
• the Device Plugin API version against which it was built
• the ResourceName it wants to advertise, following the extended resource naming scheme vendor-domain/resourcetype (such as hardware-vendor.example/foo below)
Following a successful registration, the device plugin sends the kubelet the
list of devices it manages, and the kubelet is then in charge of advertising
those resources to the API server as part of the kubelet node status update.
For example, after a device plugin registers hardware-vendor.example/foo
with the kubelet and reports two healthy devices on a node, the node status
is updated to advertise that the node has 2 "Foo" devices installed and
available.
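As a sketch of what that advertisement looks like (the quantities are illustrative, and hardware-vendor.example/foo is the example resource name used above), the Node object's status would then include the device as an extended resource:

# Excerpt of a Node object's .status after the plugin reports two healthy devices
status:
  capacity:
    cpu: "4"
    memory: 16Gi
    hardware-vendor.example/foo: "2"
  allocatable:
    cpu: "4"
    memory: 16Gi
    hardware-vendor.example/foo: "2"

Users can then request the device in a container specification like any other resource, as the Pod manifest below shows.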
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          hardware-vendor.example/foo: 2
#
# This Pod needs 2 of the hardware-vendor.example/foo devices
# and can only schedule onto a Node that's able to satisfy
# that need.
#
# If the Node has more than 2 of those devices available, the
# remainder would be available for other Pods to use.
• The plugin starts a gRPC service, with a Unix socket under host path /
var/lib/kubelet/device-plugins/, that implements the following
interfaces:
service DevicePlugin {
    // GetDevicePluginOptions returns options to be communicated with Device Manager.
    rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
    // ListAndWatch streams the list of devices, re-sending it whenever device state changes.
    rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
    // Allocate is called during container creation so the plugin can prepare the devices.
    rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
    // (other RPCs omitted)
}
• The plugin registers itself with the kubelet through the Unix socket at
host path /var/lib/kubelet/device-plugins/kubelet.sock.
If you choose the DaemonSet approach, you can rely on Kubernetes to place
the device plugin's Pod onto nodes, to restart the daemon Pod after failure,
and to help automate upgrades.
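A minimal sketch of such a DaemonSet follows; the image name is hypothetical, and the Pod mounts the kubelet's device-plugin directory so the plugin can create its Unix socket there:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: example-device-plugin
  template:
    metadata:
      labels:
        name: example-device-plugin
    spec:
      containers:
        - name: device-plugin
          # hypothetical image; use your vendor's device plugin image
          image: registry.example.com/example-device-plugin:1.0
          volumeMounts:
            - name: device-plugin-dir
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin-dir
          hostPath:
            path: /var/lib/kubelet/device-plugins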
API compatibility
Kubernetes device plugin support is in beta. The API may change before
stabilization, in incompatible ways. As a project, Kubernetes recommends
that device plugin developers:
If you enable the DevicePlugins feature and run device plugins on nodes that
need to be upgraded to a Kubernetes release with a newer device plugin API
version, upgrade your device plugins to support both versions before
upgrading these nodes. Taking that approach will ensure the continuous
functioning of the device allocations during the upgrade.
message TopologyInfo {
    repeated NUMANode nodes = 1;
}

message NUMANode {
    int64 ID = 1;
}
Device Plugins that wish to leverage the Topology Manager can send back a
populated TopologyInfo struct as part of the device registration, along with
the device IDs and the health of the device. The device manager will then
use this information to consult with the Topology Manager and make
resource assignment decisions.
TopologyInfo supports a nodes field that is either nil (the default) or a list
of NUMA nodes. This lets the Device Plugin publish a device that can span
NUMA nodes.
What's next
• Learn about scheduling GPU resources using device plugins
• Learn about advertising extended resources on a node
• Read about using hardware acceleration for TLS ingress with
Kubernetes
• Learn about the Topology Manager
Operator pattern
Operators are software extensions to Kubernetes that make use of custom
resources to manage applications and their components. Operators follow
Kubernetes principles, notably the control loop.
Motivation
The Operator pattern aims to capture the key aim of a human operator who
is managing a service or set of services. Human operators who look after
specific applications and services have deep knowledge of how the system
ought to behave, how to deploy it, and how to react if there are problems.
Operators in Kubernetes
Kubernetes is designed for automation. Out of the box, you get lots of built-
in automation from the core of Kubernetes. You can use Kubernetes to
automate deploying and running workloads, and you can automate how
Kubernetes does that.
An example Operator
Some of the things that you can use an operator to automate include:
What might an Operator look like in more detail? Here's an example:
1. A custom resource named SampleDB, that you can configure into the
cluster.
2. A Deployment that makes sure a Pod is running that contains the
controller part of the operator.
3. A container image of the operator code.
4. Controller code that queries the control plane to find out what
SampleDB resources are configured.
5. The core of the Operator is code to tell the API server how to make
reality match the configured resources.
◦ If you add a new SampleDB, the operator sets up
PersistentVolumeClaims to provide durable database storage, a
StatefulSet to run SampleDB and a Job to handle initial
configuration.
◦ If you delete it, the Operator takes a snapshot, then makes sure
that the StatefulSet and Volumes are also removed.
6. The operator also manages regular database backups. For each
SampleDB resource, the operator determines when to create a Pod that
can connect to the database and take backups. These Pods would rely
on a ConfigMap and / or a Secret that has database connection details
and credentials.
7. Because the Operator aims to provide robust automation for the
resource it manages, there would be additional supporting code. For
this example, code checks to see if the database is running an old
version and, if so, creates Job objects that upgrade it for you.
Deploying Operators
The most common way to deploy an Operator is to add the Custom Resource
Definition and its associated Controller to your cluster. The Controller will
normally run outside of the control plane, much as you would run any
containerized application. For example, you can run the controller in your
cluster as a Deployment.
Using an Operator
Once you have an Operator deployed, you'd use it by adding, modifying or
deleting the kind of resource that the Operator uses. Following the above
example, you would set up a Deployment for the Operator itself, and then:
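For example, you might create an object of the custom kind. This SampleDB manifest is a hypothetical sketch whose API group, version, and fields are made up to match the example above:

apiVersion: example.com/v1
kind: SampleDB
metadata:
  name: example-database
spec:
  # hypothetical fields that the operator's controller would reconcile
  storageGB: 20
  backupSchedule: "0 2 * * *"

The Operator's controller notices the new object and sets up the PersistentVolumeClaims, StatefulSet, and backup Jobs described in the example above.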
…and that's it! The Operator will take care of applying the changes as well
as keeping the existing service in good shape.
You can also implement an Operator (that is, a Controller) using any
language / runtime that can act as a client for the Kubernetes API.
What's next
• Learn more about Custom Resources
• Find ready-made operators on OperatorHub.io to suit your use case
• Use existing tools to write your own operator, e.g.:
◦ using KUDO (Kubernetes Universal Declarative Operator)
◦ using kubebuilder
◦ using Metacontroller along with WebHooks that you implement
yourself
◦ using the Operator Framework
• Publish your operator for other people to use
• Read CoreOS' original article that introduced the Operator pattern
• Read an article from Google Cloud about best practices for building
Operators
Service Catalog
Service Catalog is an extension API that enables applications running in
Kubernetes clusters to easily use external managed software offerings, such
as a datastore service offered by a cloud provider.
It provides a way to list, provision, and bind with external Managed Services
from Service Brokers without needing detailed knowledge about how those
services are created or managed.
Using Service Catalog, a cluster operator can browse the list of managed
services offered by a service broker, provision an instance of a managed
service, and bind with it to make it available to an application in the
Kubernetes cluster.
For example, suppose an application developer wants to use message
queuing as part of an application running in a Kubernetes cluster, but does
not want to manage such a service themselves. If the cloud provider offers
message queuing as a managed service through its service broker, a cluster
operator can set up Service Catalog and use it to communicate with that
service broker to provision an instance of the message queuing service and
make it available to the application within the Kubernetes cluster. The
application developer therefore does not need to be concerned with the
implementation details or management of the message queue. The
application can simply use it as a service.
Architecture
Service Catalog uses the Open service broker API to communicate with
service brokers, acting as an intermediary for the Kubernetes API Server to
negotiate the initial provisioning and retrieve the credentials necessary for
the application to use a managed service.
[Diagram: Service Catalog architecture — an application in the Kubernetes cluster binds to a managed service offered by an external Service Broker; the connection credentials and service details are delivered to the application in a Kubernetes Secret.]
API Resources
Service Catalog installs the servicecatalog.k8s.io extension API, which
provides resources such as ClusterServiceBroker, ClusterServiceClass,
ClusterServicePlan, ServiceInstance, and ServiceBinding.
Authentication
Service Catalog supports these authentication methods when communicating
with service brokers:
• Basic (username/password)
• OAuth 2.0 Bearer Token
Usage
A cluster operator can use Service Catalog API Resources to provision
managed services and make them available within a Kubernetes cluster. The
steps involved are:
1. Listing the managed services and Service Plans available from a service
broker.
2. Provisioning a new instance of the managed service.
3. Binding to the managed service, which returns the connection
credentials.
4. Mapping the connection credentials into the application.
A cluster operator registers a broker by creating a ClusterServiceBroker resource that points at the broker's endpoint:

apiVersion: servicecatalog.k8s.io/v1beta1
kind: ClusterServiceBroker
metadata:
  name: cloud-broker
spec:
  # Points to the endpoint of a service broker. (This example is not a working URL.)
  url: https://servicebroker.somecloudprovider.com/v1alpha1/projects/service-catalog/brokers/default
  #####
  # Additional values can be added here, which may be used to communicate
  # with the service broker, such as bearer token info or a caBundle for TLS.
  #####
[Diagram: listing managed services — once a ClusterServiceBroker resource is created, Service Catalog requests the list of services and plans from the broker and stores them as ClusterServiceClass and ClusterServicePlan resources.]
A cluster operator can then get the list of available managed services with
kubectl get clusterserviceclasses, and view the available Service Plans with
kubectl get clusterserviceplans.
To provision a new instance of a managed service, a cluster operator creates a ServiceInstance resource:

apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceInstance
metadata:
  name: cloud-queue-instance
  namespace: cloud-apps
spec:
  # References one of the previously returned services
  clusterServiceClassExternalName: cloud-provider-service
  clusterServicePlanExternalName: service-plan-name
  #####
  # Additional parameters can be added here,
  # which may be used by the service broker.
  #####
[Diagram: provisioning — creating a ServiceInstance resource causes Service Catalog to ask the broker to provision an instance of the service; once complete, the ServiceInstance resource (visible via get serviceinstance) reports a READY condition.]
After a new instance has been provisioned, a cluster operator must bind to
the managed service to get the connection credentials and service account
details necessary for the application to use the service. This is done by
creating a ServiceBinding resource.
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceBinding
metadata:
  name: cloud-queue-binding
  namespace: cloud-apps
spec:
  instanceRef:
    name: cloud-queue-instance
  #####
  # Additional information can be added here, such as a secretName or
  # service account parameters, which may be used by the service broker.
  #####
[Diagram: binding — creating a ServiceBinding resource causes Service Catalog to request connection information from the broker and store the connection credentials and service account details in a Kubernetes Secret.]
The following example describes how to mount service account credentials
into the application: a key stored in a Secret named sa-key is mounted
through a volume named provider-cloud-key at /var/secrets/provider, and the
environment variable PROVIDER_APPLICATION_CREDENTIALS points at the
mounted key file.

...
spec:
  volumes:
    - name: provider-cloud-key
      secret:
        secretName: sa-key
  containers:
    ...
      volumeMounts:
        - name: provider-cloud-key
          mountPath: /var/secrets/provider
      env:
        - name: PROVIDER_APPLICATION_CREDENTIALS
          value: "/var/secrets/provider/key.json"
The following example describes how to map secret values into application
environment variables. In this example, the messaging queue topic name is
mapped from a secret named provider-queue-credentials with a key
named topic to the environment variable TOPIC.
...
  env:
    - name: "TOPIC"
      valueFrom:
        secretKeyRef:
          name: provider-queue-credentials
          key: topic
What's next
• If you are familiar with Helm Charts, install Service Catalog using Helm
into your Kubernetes cluster. Alternatively, you can install Service
Catalog using the SC tool.
• View sample service brokers.
• Explore the kubernetes-sigs/service-catalog project.
• View svc-cat.io.