Avoid common pitfalls and use best practices
Pods are the fundamental Kubernetes building block for your containers, and yet you hear that you shouldn't use Pods directly but rather through an abstraction such as a Deployment. Why is that, and what makes the difference?
If you deploy a Pod directly to your Kubernetes cluster, your container(s) will run, but nothing takes care of its lifecycle. If the node goes down, or its capacity is needed for other workloads, the Pod is lost forever.
That's the point where building blocks such as ReplicaSet and Deployment come into play. A ReplicaSet acts as a supervisor to the Pods it watches and recreates Pods that don't exist anymore. Deployments are an even higher abstraction: they create and manage ReplicaSets to enable the developer to use a declarative state instead of imperative commands (e.g. kubectl rolling-update). The real advantage is that Deployments automatically perform rolling updates and always ensure a given target state, instead of leaving you to deal with imperative changes.
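As a sketch (the name hello-world, the image path, and the tag are illustrative), a minimal Deployment that declaratively keeps three Pods running might look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 3            # the ReplicaSet created by this Deployment keeps 3 Pods alive
  selector:
    matchLabels:
      app: hello-world
  template:              # Pod template - the supervised Pods are created from this
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: registry.example.com/hello-world:1.0.0
```

Changing the image tag in this spec and re-applying it triggers a rolling update; kubectl rollout undo brings back the previous ReplicaSet.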
When creating Kubernetes Pods, it is for several reasons not advisable to use latest tags on container images.
The first reason is that you can't be 100% sure which exact version of your software you are running - let's dive a bit deeper. Once Kubernetes creates a Pod for you, it assigns an imagePullPolicy to it. By default this will be IfNotPresent, which means that the container runtime will only pull the image if it is not present on the node the Pod was assigned to. Once you use latest as an image tag, this default behavior switches to Always, resulting in the runtime pulling the image every time it starts a container using that image in a Pod.
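A sketch of how this looks in a Pod spec (the image path and tag are illustrative) - pinning an exact tag and stating the pull policy explicitly avoids the surprising Always default:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  - name: hello-world
    # a pinned, immutable tag instead of latest
    image: registry.example.com/hello-world:1.0.0
    # explicit policy: only pull if the image is missing on the node
    imagePullPolicy: IfNotPresent
```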
There are two really important reasons why this is a really bad thing to do:
- You lose control over which exact code is running in your system
- Rolling updates/rollbacks are not possible anymore
Let's dive deeper into this:
Imagine you have version A of your software, tag it with latest, test version A in a CI system, and start a Pod on Node 1. Then you tag version B of your software, again with latest, and Node 1 goes down, so your Pod is moved to Node 2 - before version B was tested in CI.
- Which version of your Software will be running? Version B
- Which version should be running? Version A
- Can you immediately switch back to the previous version? NO!
This simple scenario already shows some of the problems that will arise quite clearly, and it is only the tip of the iceberg! You should always be able to see which tested version of your software is currently running, when it was deployed, and which source code (e.g. commit IDs) your image was built from - and you should know how to switch to a previous version easily, ideally with the push of a button.
A container image built from source might be tagged GITSHA-d670460b4b4aece5915caf5c68d12f560a9fe3e4 to allow you to easily link the source code state to the generated artifact (which will hopefully be immutable). A container image built from a Java jar artifact might have a tag matching the Maven release version, e.g. MVN-3.10.6.RELEASE.
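A hypothetical CI tagging step along these lines (the registry path is a placeholder, and the fallback SHA is only there so the snippet runs outside a git checkout):

```shell
# Derive an immutable tag from the current commit; fall back to a fixed
# placeholder SHA when not running inside a git checkout.
GIT_SHA="$(git rev-parse HEAD 2>/dev/null || echo d670460b4b4aece5915caf5c68d12f560a9fe3e4)"
IMAGE_TAG="GITSHA-${GIT_SHA}"
echo "registry.example.com/myapp:${IMAGE_TAG}"
# The build/push steps would then use the same immutable reference:
#   docker build -t "registry.example.com/myapp:${IMAGE_TAG}" .
#   docker push "registry.example.com/myapp:${IMAGE_TAG}"
```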
Containers should not run as root. Create a dedicated unprivileged user in your image and switch to it:

```dockerfile
# Create a dedicated, unprivileged user and group for the application
RUN groupadd -r nodejs
RUN useradd -m -r -g nodejs nodejs
# Drop privileges: the container process runs as this user
USER nodejs
```
Enforce it!
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  # specification of the pod's containers
  # ...
  securityContext:
    runAsNonRoot: true
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - name: hello-world
    # specification of the pod's containers
    # ...
    securityContext:
      # readOnlyRootFilesystem is a container-level setting
      readOnlyRootFilesystem: true
```
(Good News: No need to do this in K8s 1.7)
- Use the “record” option for easier rollbacks

```shell
kubectl apply -f deployment.yaml --record
kubectl rollout history deployments my-deployment
```
- Use plenty of descriptive labels
- Use sidecar containers for proxies, watchers, etc.
- Don’t use sidecars for bootstrapping!
- Use init containers instead!
- Don’t use :latest or no tag (as above)
- Readiness and liveness probes are your friend

Readiness → Is the app ready to start serving traffic?
- Won’t be added to a service endpoint until it passes
- Required for a “production app” in my opinion

Liveness → Is the app still running?
- Default is “process is running”
- Possible that the process can be running but not working correctly
- Good to define, might not be 100% necessary

These can sometimes be the same endpoint, but not always.
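A sketch of both probes on a container (image path, endpoint paths, ports, and timings are all illustrative; in practice the two may or may not be the same endpoint):

```yaml
containers:
- name: hello-world
  image: registry.example.com/hello-world:1.0.0
  readinessProbe:          # gate: no traffic from the Service until this passes
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 5
  livenessProbe:           # restart the container if this starts failing
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 20
```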
We originally used the LoadBalancer Service type, which uses the AWS functionality in k8s to create an ELB and connect it to all nodes in the cluster. We found this had a few limitations:
- Creates an ELB for every single service – we quickly ran into ELB limits, and it’s also costly with hundreds of services.
- Each service ELB attaches to every single node in the cluster, which causes a large amount of unnecessary traffic from health checks. It increases as O(m*n), where m = # of services and n = # of nodes.
- Performs poorly in failure scenarios. If a single pod dies, the ELB might end up removing a significant portion of your cluster’s nodes because it doesn’t know where the unhealthy instance is.
In modern k8s I would advise against using LoadBalancer-type Services; instead, use NodePort-type Services with an ingress controller fronted by ELBs/NLBs.
Ingress is a generic resource that defines how you want your service accessed externally. A controller is created in the cluster to watch Ingress resources and then set up the external access. A benefit in AWS is that we could create a single shared ELB/NLB and handle virtual dispatch in our cluster instead of outside of it, allowing us to avoid the drawbacks described above.
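A sketch of this pattern (hostnames and service names are illustrative): a NodePort Service plus an Ingress rule that the in-cluster controller - itself fronted by a single shared ELB/NLB - uses for virtual dispatch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  type: NodePort           # reachable via a node port, no per-service ELB
  selector:
    app: hello-world
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world
spec:
  rules:
  - host: hello.example.com   # the ingress controller dispatches on this host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hello-world
            port:
              number: 80
```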
I would also avoid using ALBs. Their connection-draining behaviour was broken when we last tried them, a few months ago. This prevents zero-downtime deployments of the nginx component.
Many applications require multiple resources to be created, such as a Deployment and a Service. Management of multiple resources can be simplified by grouping them together in the same file (separated by --- in YAML). While this is convenient for initially applying them to the cluster, it is better to keep them separated, for several reasons:
- Generally, the Deployment will need to be altered far more frequently than the Service.
- You can gaplessly replace (delete and recreate) a Deployment, but not a Service. Combining them makes for accidental service gaps.
Most of our applications are JVM based. Running the hotspot JVM out of the box in a container is problematic: all the defaults, such as the number of GC threads and the sizing of memory pools, use the host system’s resources as a basis. This is not ideal, since the container is usually far more constrained than the host. In our case, we use m4.4xlarge nodes, which have 16 cores and 64Gi of RAM, while running pods that are usually limited to a couple of cores and a couple of Gi of RAM.
It took us a while to figure out the right balance of options to get the JVM to behave well:
- Set -Xmx to roughly half the size of the pod’s memory limit. There is a lot of memory overhead in hotspot. This value will depend on the application as well, so it takes some work to find the right amount of max heap to prevent an OOM kill.
- Use -XX:+AlwaysPreTouch so we can more quickly ensure our max heap setting is reasonable. Otherwise we might only get OOM killed under load.
Never limit CPU on a JVM container, unless you really don’t care about long pause times. A CPU limit will cause hard throttling of your application whenever garbage collection happens. The right choice is almost always burstable CPU: this lets the scheduler soft-clamp the application if it uses too much CPU relative to other applications on the node, while letting it use as much CPU as the machine has available.
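Putting these pieces together, a sketch of a JVM container spec under these assumptions (the values and the JAVA_OPTS variable name are illustrative; whether your image honors JAVA_OPTS depends on its entrypoint):

```yaml
containers:
- name: jvm-app
  image: registry.example.com/jvm-app:1.0.0
  env:
  - name: JAVA_OPTS
    # -Xmx at roughly half the memory limit; AlwaysPreTouch to surface a
    # bad max-heap setting immediately instead of only under load
    value: "-Xmx1g -XX:+AlwaysPreTouch"
  resources:
    requests:
      memory: 2Gi
      cpu: 500m            # request only - burstable CPU
    limits:
      memory: 2Gi          # no cpu limit: avoids hard throttling during GC
```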
Most of the applications, prior to k8s, relied on external load balancers to proxy client connections. This meant applications could be sloppy about graceful termination, since they could rely on the load balancer to gracefully drain connections. Most Java web frameworks do not shut down in a non-disruptive way. Instead, they behave similarly to nginx: in-flight requests are processed (although even that is/was bugged in many frameworks) and then client connections are dropped without any further ado.
Once we moved these applications inside of k8s, they became responsible for their own graceful termination. The solution, after much trial and error, was to add a filter with a conditional that gets activated at shutdown. When activated, it adds a Connection: close header to all responses and delays shutdown for a drain duration. As long as a request comes in during the drain time, the client connection gets closed thanks to the header. This was relatively straightforward to implement, although the implementation varied with the framework used. This solved all of our issues with errors on application deployment, giving us zero-downtime application deployments.
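A minimal, framework-agnostic sketch of that filter logic (class and method names are illustrative; a real implementation would hook into the framework’s filter chain and shutdown lifecycle, and the drain delay itself is omitted):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the drain-on-shutdown filter described above.
public class DrainFilter {
    private final AtomicBoolean draining = new AtomicBoolean(false);

    // Called from a shutdown hook (or a preStop handler) to start draining.
    public void startDraining() {
        draining.set(true);
    }

    // Applied to every response: while draining, ask the client to close its
    // keep-alive connection so no further requests land on this instance.
    public Map<String, String> apply(Map<String, String> responseHeaders) {
        if (draining.get()) {
            responseHeaders.put("Connection", "close");
        }
        return responseHeaders;
    }

    public static void main(String[] args) {
        DrainFilter filter = new DrainFilter();
        Map<String, String> headers = new HashMap<>();
        filter.apply(headers);
        System.out.println("before drain: " + headers);
        filter.startDraining();
        filter.apply(headers);
        System.out.println("during drain: " + headers);
    }
}
```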
Other alternatives include using things like linkerd or istio to proxy all connections between pods, much like having a load balancer between each pod. We experimented with these, but none was as simple or as reliable as having the application handle shutdown properly. Using an alternative RPC mechanism instead of HTTP/1.1 would probably take care of this as well.