Avoid common pitfalls and use best practices
Pods are the fundamental Kubernetes building block for your containers, and yet you hear that you shouldn't use Pods directly but rather through an abstraction such as a Deployment. Why is that, and what makes the difference?
If you deploy a Pod directly to your Kubernetes cluster, your container(s) will run, but nothing takes care of its lifecycle. If the node goes down, or its capacity is needed for other workloads, the Pod is lost forever.
That's the point where building blocks such as ReplicaSet and Deployment come into play. A ReplicaSet acts as a supervisor to the Pods it watches and recreates Pods that don't exist anymore. Deployments are an even higher abstraction: they create and manage ReplicaSets to enable the developer to use a declarative state instead of imperative commands (e.g. kubectl rolling-update). The real advantage is that Deployments automatically perform rolling updates and always ensure a given target state, instead of leaving you to deal with imperative changes.
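As a sketch (the name hello-world, the image path, and the tag are illustrative), a minimal Deployment that declaratively keeps three Pods running might look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 3            # the ReplicaSet created by this Deployment keeps 3 Pods alive
  selector:
    matchLabels:
      app: hello-world
  template:              # Pod template - the supervised Pods are created from this
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: registry.example.com/hello-world:1.0.0
```

Changing the image tag in this spec and re-applying it triggers a rolling update; kubectl rollout undo brings back the previous ReplicaSet.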
When creating Kubernetes Pods, it is for several reasons not advisable to use latest tags on container images.
The first reason is that you can't be 100% sure which exact version of your software you are running - let's dive a bit deeper. Once Kubernetes creates a Pod for you, it assigns an imagePullPolicy to it. By default this will be IfNotPresent, which means that the container runtime will only pull the image if it is not present on the node the Pod was assigned to. Once you use latest as an image tag, this default behavior switches to Always, resulting in the runtime pulling the image every time it starts a container using that image in a Pod.
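A sketch of how this looks in a Pod spec (the image path and tag are illustrative) - pinning an exact tag and stating the pull policy explicitly avoids the surprising Always default:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  - name: hello-world
    # a pinned, immutable tag instead of latest
    image: registry.example.com/hello-world:1.0.0
    # explicit policy: only pull if the image is missing on the node
    imagePullPolicy: IfNotPresent
```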
There are two really important reasons why this is a really bad thing to do:
- You lose control over which exact code is running in your system
- Rolling updates/rollbacks are not possible anymore
Let's dive deeper into this:
Imagine you have version A of your software, tag it with latest, test version A in a CI system, and start a Pod on Node 1. Then you tag version B of your software, again with latest, and Node 1 goes down, so your Pod is moved to Node 2 - before version B was tested in CI.
- Which version of your Software will be running? Version B
- Which version should be running? Version A
- Can you immediately switch back to the previous version? NO!
This simple scenario already shows some of the problems that will arise quite clearly, and it is only the tip of the iceberg! You should always be able to see which tested version of your software is currently running, when it was deployed, and which source code (e.g. commit IDs) your image was built from - and you should know how to switch to a previous version easily, ideally with the push of a button.
A container image built from source might be tagged GITSHA-d670460b4b4aece5915caf5c68d12f560a9fe3e4 to allow you to easily link the source code state to the generated artifact (which will hopefully be immutable). A container image built from a Java jar artifact might have a tag matching the Maven release version, e.g. MVN-3.10.6.RELEASE.
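A hypothetical CI tagging step along these lines (the registry path is a placeholder, and the fallback SHA is only there so the snippet runs outside a git checkout):

```shell
# Derive an immutable tag from the current commit; fall back to a fixed
# placeholder SHA when not running inside a git checkout.
GIT_SHA="$(git rev-parse HEAD 2>/dev/null || echo d670460b4b4aece5915caf5c68d12f560a9fe3e4)"
IMAGE_TAG="GITSHA-${GIT_SHA}"
echo "registry.example.com/myapp:${IMAGE_TAG}"
# The build/push steps would then use the same immutable reference:
#   docker build -t "registry.example.com/myapp:${IMAGE_TAG}" .
#   docker push "registry.example.com/myapp:${IMAGE_TAG}"
```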
Containers should not run as root. Create a dedicated unprivileged user in your image and switch to it:

```dockerfile
# Create a dedicated, unprivileged user and group for the application
RUN groupadd -r nodejs
RUN useradd -m -r -g nodejs nodejs
# Drop privileges: the container process runs as this user
USER nodejs
```
Enforce it!
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  # specification of the pod's containers
  # ...
  securityContext:
    runAsNonRoot: true
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - name: hello-world
    # specification of the pod's containers
    # ...
    securityContext:
      # readOnlyRootFilesystem is a container-level setting
      readOnlyRootFilesystem: true
```
(Good News: No need to do this in K8s 1.7)
- Use the “record” option for easier rollbacks

```shell
kubectl apply -f deployment.yaml --record
kubectl rollout history deployments my-deployment
```
- Use plenty of descriptive labels
- Use sidecar containers for proxies, watchers, etc.
- Don’t use sidecars for bootstrapping!
- Use init containers instead!
- Don’t use :latest or no tag (as above)
- Readiness and liveness probes are your friend

Readiness → Is the app ready to start serving traffic?
- Won’t be added to a service endpoint until it passes
- Required for a “production app” in my opinion

Liveness → Is the app still running?
- Default is “process is running”
- Possible that the process can be running but not working correctly
- Good to define, might not be 100% necessary

These can sometimes be the same endpoint, but not always.
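A sketch of both probes on a container (image path, endpoint paths, ports, and timings are all illustrative; in practice the two may or may not be the same endpoint):

```yaml
containers:
- name: hello-world
  image: registry.example.com/hello-world:1.0.0
  readinessProbe:          # gate: no traffic from the Service until this passes
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 5
  livenessProbe:           # restart the container if this starts failing
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 20
```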
We originally used the LoadBalancer Service type, which uses the AWS functionality in k8s to create an ELB and connect it to all nodes in the cluster. We found this had a few limitations:
- Creates an ELB for every single service – we quickly ran into ELB limits, and it’s also costly with hundreds of services.
- Each service ELB attaches to every single node in the cluster, which causes a large amount of unnecessary traffic from health checks. It increases as O(m*n), where m = # of services and n = # of nodes.
- Performs poorly in failure scenarios. If a single pod dies, the ELB might end up removing a significant portion of your cluster’s nodes because it doesn’t know where the unhealthy instance is.
In modern k8s I would advise against using LoadBalancer-type Services; instead, use NodePort-type Services with an ingress controller fronted by ELBs/NLBs.
Ingress is a generic resource that defines how you want your service accessed externally. A controller is created in the cluster to watch Ingress resources and then set up the external access. A benefit in AWS is that we could create a single shared ELB/NLB and handle virtual dispatch in our cluster instead of outside of it, allowing us to avoid the drawbacks described above.
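A sketch of this pattern (hostnames and service names are illustrative): a NodePort Service plus an Ingress rule that the in-cluster controller - itself fronted by a single shared ELB/NLB - uses for virtual dispatch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  type: NodePort           # reachable via a node port, no per-service ELB
  selector:
    app: hello-world
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world
spec:
  rules:
  - host: hello.example.com   # the ingress controller dispatches on this host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hello-world
            port:
              number: 80
```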
I would also avoid using ALBs. Their connection-draining behaviour was broken when we last tried them, a few months ago. This prevents zero-downtime deployments of the nginx component.
Many applications require multiple resources to be created, such as a Deployment and a Service. Management of multiple resources can be simplified by grouping them together in the same file (separated by --- in YAML). While this is convenient for initially applying them to the cluster, it is better to keep them separated, for several reasons:
- Generally, the Deployment will need to be altered far more frequently than the Service.
- You can gaplessly replace (delete and recreate) a Deployment, but not a Service. Combining them makes for accidental service gaps.
Most of our applications are JVM based. Running the hotspot JVM out of the box in a container is problematic: all the defaults, such as the number of GC threads and the sizing of memory pools, use the host system’s resources as a basis. This is not ideal, since the container is usually far more constrained than the host. In our case, we use m4.4xlarge nodes, which have 16 cores and 64Gi of RAM, while running pods that are usually limited to a couple of cores and a couple of Gi of RAM.
It took us a while to figure out the right balance of options to get the JVM to behave well:
- Set -Xmx to roughly half the size of the pod’s memory limit. There is a lot of memory overhead in hotspot. This value will depend on the application as well, so it takes some work to find the right amount of max heap to prevent an OOM kill.
- Use -XX:+AlwaysPreTouch so we can more quickly ensure our max heap setting is reasonable. Otherwise we might only get OOM killed under load.
Never limit CPU on a JVM container, unless you really don’t care about long pause times. A CPU limit will cause hard throttling of your application whenever garbage collection happens. The right choice is almost always burstable CPU: this lets the scheduler soft-clamp the application if it uses too much CPU relative to other applications on the node, while letting it use as much CPU as the machine has available.
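Putting these pieces together, a sketch of a JVM container spec under these assumptions (the values and the JAVA_OPTS variable name are illustrative; whether your image honors JAVA_OPTS depends on its entrypoint):

```yaml
containers:
- name: jvm-app
  image: registry.example.com/jvm-app:1.0.0
  env:
  - name: JAVA_OPTS
    # -Xmx at roughly half the memory limit; AlwaysPreTouch to surface a
    # bad max-heap setting immediately instead of only under load
    value: "-Xmx1g -XX:+AlwaysPreTouch"
  resources:
    requests:
      memory: 2Gi
      cpu: 500m            # request only - burstable CPU
    limits:
      memory: 2Gi          # no cpu limit: avoids hard throttling during GC
```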
Most of the applications, prior to k8s, relied on external load balancers to proxy client connections. This meant applications could be sloppy about graceful termination, since they could rely on the load balancer to gracefully drain connections. Most Java web frameworks do not shut down in a non-disruptive way. Instead, they behave similarly to nginx: in-flight requests are processed (although even that is/was bugged in many frameworks) and then client connections are dropped without any further ado.
Once we moved these applications inside of k8s, they became responsible for their own graceful termination. The solution, after much trial and error, was to add a filter with a conditional that gets activated at shutdown. When activated, it adds a Connection: close header to all responses and delays shutdown for a drain duration. As long as a request comes in during the drain time, the client connection gets closed thanks to the header. This was relatively straightforward to implement, although the implementation varied with the framework used. This solved all of our issues with errors on application deployment, giving us zero-downtime application deployments.
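A minimal, framework-agnostic sketch of that filter logic (class and method names are illustrative; a real implementation would hook into the framework’s filter chain and shutdown lifecycle, and the drain delay itself is omitted):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the drain-on-shutdown filter described above.
public class DrainFilter {
    private final AtomicBoolean draining = new AtomicBoolean(false);

    // Called from a shutdown hook (or a preStop handler) to start draining.
    public void startDraining() {
        draining.set(true);
    }

    // Applied to every response: while draining, ask the client to close its
    // keep-alive connection so no further requests land on this instance.
    public Map<String, String> apply(Map<String, String> responseHeaders) {
        if (draining.get()) {
            responseHeaders.put("Connection", "close");
        }
        return responseHeaders;
    }

    public static void main(String[] args) {
        DrainFilter filter = new DrainFilter();
        Map<String, String> headers = new HashMap<>();
        filter.apply(headers);
        System.out.println("before drain: " + headers);
        filter.startDraining();
        filter.apply(headers);
        System.out.println("during drain: " + headers);
    }
}
```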
Other alternatives include using things like linkerd or istio to proxy all connections between pods, much like having a load balancer between each pod. We experimented with these, but none was as simple or as reliable as having the application handle shutdown properly. Using an alternative RPC mechanism instead of HTTP/1.1 would probably take care of this as well.