Presented during the PointClickCare SaaSOps Learnathon 2024.
preface
This article assumes you've got some Kubernetes experience (in case you don't, check out my Kubernetes Quickstart Workshop to bring yourself to around a level 2).
After this read, you’ll be at a level 4 or 5, which is more than enough to be dangerous and make good decisions when planning and operating services in Kubernetes environments.
table of contents

- pod lifecycle (pending, running, succeeded, failed, unknown)
- probes and the questions they ask (startup, liveness, readiness)
- readiness gates
- container lifecycle hooks
pod lifecycle
Through your use of Kubernetes, you will hear two different terms that are often used interchangeably: pod phase and container state.
A pod phase is a Kubernetes construct – a field on the `PodStatus` object – used to denote at a high level where a pod is in its lifecycle. It is the result of evaluating several conditions, including container state (a construct used to denote where a pod's containers are in their lifecycle), along with other pod, node and cluster statuses.
The above diagram helps illustrate what these different lifecycle stages are, what happens during them, and how your workload would progress from one to another.
In a normal, healthy service, the top-most blue dashed line is followed – a pod starts off `pending`, where after a couple of activities it then enters `running`. Once in `running`, a pod would either remain there for the rest of its life or reach `succeeded` if its work is finite.
For containers with finite workloads that terminate abnormally or unsuccessfully, the pod instead becomes `failed`. This can be observed with the thinner red dashed line – a pod can enter this phase before, during, or after `running`.
Finally, we have a specific pod phase, `unknown`, which covers conditions that do not evaluate to a known `PodPhase`. While not often seen in the wild, it can be observed at any point in a pod's lifecycle, as identified by the dotted gray line.
pending
Pods start their life when the Kubernetes control plane accepts their definition – whether this is a pod manifest provided to the API, or a pod spec templated from some set (such as a `deployment`, `statefulset`, `daemonset`, `replicaset`, etc.).
It’s also during this phase that admission controllers kick in – this being extra code that intercepts Kubernetes API server requests to validate or mutate objects.
Once accepted, the Kubernetes control plane component `kube-scheduler` assigns the pod to a node.
The container runtime on the node then authenticates against the designated container image repository and pulls the container image(s) associated with the pod.
From here, pod sandboxing begins.
The container runtime extracts the container image to the node’s local file system and establishes some kernel isolation.
Some of the techniques used to do this include:
- control groups, also known as `cgroups`, which provide a low-level API to constrain CPU and memory.
- secure computing, known as `seccomp`, which sandboxes the privileges of a process. It does so by performing a once-in-its-lifetime transition to a state that prevents the process from making any system calls except a limited few (such as `exit`).
- `namespaces`, which, through partitioning of kernel resources, define what can be used by which process(es) and what cannot. Examples of resources managed under namespaces include process IDs, hostnames, user IDs, file names, some names associated with network access, and inter-process communication.
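As a rough sketch of where some of this surfaces in the Kubernetes API – the resource limits below are enforced on the node via cgroups, and the seccomp profile constrains the container's system calls. The names and values here are illustrative, not prescriptive:

```yaml
# Illustrative only: limits are enforced through cgroups, and the
# seccomp profile restricts which system calls the container may make.
apiVersion: v1
kind: Pod
metadata:
  name: isolated-app              # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault        # the container runtime's default seccomp filter
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:                   # ceilings applied through cgroups
          cpu: 500m
          memory: 512Mi
```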
Parallel to container sandboxing, network setup is also done.
Here, a cloud provider-specific implementation of the container network interface works together with the node's network manager to configure virtual pod network interface cards (NICs).
In addition to this, `kube-proxy` running on the node sets up routes to get *to* the containers within the pod – mostly required for the internal Kubernetes service load balancer. This is seen all the time in Azure Kubernetes Service (AKS) when a new pod gets created: the network plugin (such as Azure CNI) makes requests to a cloud provider-specific control plane component called the delegated network controller to reserve an IP for the pod within the Azure Virtual Network (VNet).
Another optional activity done during this phase is data and volume setup, where a storage plugin implementing the container storage interface (CSI) creates mounts directly on the node file system, which are then linked back to some path under the container file system.
as an operator, keep in mind
- This is normal behavior – don't get spooked over pods in `pending`.
- You should be concerned, though, if they've been pending for a while, in which case it could be due to a variety of things (several of which are illustrated in the sketch after this list), such as:
- Resource contention – do your nodes have enough CPU and memory to schedule your pod? What is the requested CPU and memory, and can it fit on any available nodes?
- Container image pull – is this successful? Is your container registry reachable? Does it require authentication, and is your token valid? Does the image being pulled even exist?
- Taints & Tolerations – Does your workload have a matching toleration to the taint of the nodes you want to run it on?
- Affinities – Do your pods define any `requiredDuringScheduling` pod or node affinities and/or anti-affinities that are preventing them from scheduling (keeping in mind, `affinity` is just a way of attracting pods to nodes relative to other nodes/pods in the cluster)? This also applies to zonal resources such as persistent volumes (PV) in Azure. Azure disks are zonal and can only be attached to VMs in a particular zone – which can prove troublesome if the PV required by your pod cannot be mounted on any of the available nodes because they're all in other zones.
- Other dependencies – If your workload depends on Kubernetes `secrets`, do the required secret objects exist? What about `configmaps`?
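To make those checks concrete, here's a hypothetical pod spec annotated with the fields behind most of them – all names, images and values are placeholders:

```yaml
# Hypothetical pod spec annotated with the fields behind the most
# common Pending causes. All names, images and values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: picky-pod
spec:
  imagePullSecrets:
    - name: acr-pull-secret       # must exist and hold valid registry credentials
  tolerations:
    - key: workload               # must match a taint on a candidate node
      operator: Equal
      value: batch
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["eastus-1"]   # zonal pinning, e.g. to follow an Azure disk PV
  containers:
    - name: app
      image: registry.example.com/app:1.0   # must exist and be reachable
      resources:
        requests:                 # must fit on some node's allocatable capacity
          cpu: "1"
          memory: 1Gi
```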
running
Before we get to the main application container, Kubernetes will start all `init` and `sidecar` containers first. Only once all `init` containers have started and completed successfully (the exception being `sidecar` containers, which continue to run) will the application containers and their entry point processes start.
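As a minimal sketch of that ordering, assuming a recent Kubernetes version (1.29+, where a sidecar is expressed as an init container with `restartPolicy: Always`) – names and images here are hypothetical:

```yaml
# Hypothetical sketch of the startup ordering described above.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-helpers
spec:
  initContainers:
    - name: run-migrations        # init: must exit 0 before the app starts
      image: registry.example.com/migrate:1.0
      command: ["/bin/sh", "-c", "./migrate --up"]
    - name: log-shipper           # sidecar: starts before the app, keeps running
      image: registry.example.com/shipper:1.0
      restartPolicy: Always
  containers:
    - name: app                   # starts only after the above
      image: registry.example.com/app:1.0
```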
It's also during this stage that containers in a `Ready` state are added to the Kubernetes service load balancer.
Running pods will remain running, unless the container processes within have a finite set of tasks to complete.
as an operator, keep in mind
For engineers and operators alike, this is the most desired state for your workload.
As stated before, unless the application has finite work to do (in which case the pod is best templated from a Kubernetes `job`), it will remain running forever.
This is also the first stage where we’ll begin using CPU, memory and storage from our nodes – an important consideration if you’re costing out a service’s overall run time for metered infrastructure billing.
Before you get too excited though, a running container does not necessarily mean the application within is ready. Even within the container, your application process(es) can be initializing, opening database connections and starting web APIs.
succeeded
For pods with containers that perform finite work, their final lifecycle stage is succeeded.
Often templated from a `job` definition, these pods are considered succeeded when all containers within terminate with exit code 0.
Terminated containers with exit code 0 are not restarted, and the succeeded pods don’t utilize any further compute.
It should be noted that the pod IP for a succeeded pod is purely informational – it identifies, for debugging purposes, the IP the pod had while it was running; whatever network plugin is used in that cluster has already released that IP for other workloads.
Storage is still consumed on the node for the image, and persistent volume claims remain, which results in lingering persistent volumes. In addition, the API object takes up space in the control plane etcd store for its definition, events and logs.
as an operator, keep in mind
If you're an engineer designing containers that perform finite work, or an SRE/operator maintaining them, keep in mind that Kubernetes `job` objects cannot be restarted.
Sure, you can delete the job altogether and re-create it from a manifest, but that's not a restart – the old job and its pods are lost.
If your job object is templated from a `cronjob`, then a new job can be manually created, and with it a pod.
By default, succeeded jobs and their associated pods remain in the Kubernetes API server – up to 3 occurrences when created from a `cronjob`. You can configure how many occurrences to keep (via `successfulJobsHistoryLimit`), along with automatic cleanup of finished jobs and their associated pods, by defining `.spec.ttlSecondsAfterFinished`.
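As an illustrative sketch, both retention knobs can be set on a `cronjob` – the schedule, names and values here are hypothetical:

```yaml
# Hypothetical cronjob showing both retention knobs mentioned above.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 3      # the default of 3, shown explicitly
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600  # auto-delete finished jobs after an hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:1.0
```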
failed
Very similar to the `succeeded` phase, a pod with this status contains container processes with finite work. To be in this state, all containers must have terminated, with at least one exiting with a non-zero exit code.
Assuming the workload is based on a job (with `restartPolicy: Never`), no containers are restarted on failure. Instead, the job will create another pod, so as to preserve the state of the failed ones while ensuring all container processes in the pod run sequentially and successfully – from start to end.
Similar to succeeded pods, failed pods do not utilize compute, have a purely informative IP and retain storage associated with persistent volume claims.
as an operator, keep in mind
From an operator's perspective, a pod in the `failed` phase isn't that different from a `succeeded` one.
Once again assuming your workload pod was based on some job definition, the job resource will ensure a new pod gets created until the required number of successful completions is reached – or until the `backoffLimit` is exhausted.
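A minimal, hypothetical job sketch of that behavior – failed pods are preserved for inspection and replaced with new ones, up to the backoff limit:

```yaml
# Hypothetical job: failed pods are kept and replaced, not restarted.
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot-task
spec:
  completions: 1
  backoffLimit: 4            # give up after 4 retries (new pods, not restarts)
  template:
    spec:
      restartPolicy: Never   # fail the pod rather than restart its container
      containers:
        - name: task
          image: registry.example.com/task:1.0
```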
Conditions that place pods into a failed phase are typically service related (e.g. bad application code), but can also be caused by bad infrastructure.
For example, node death or the abrupt loss of a node in a Kubernetes cluster will have the Kubernetes API server default all the pods previously running on that node to failed.
unknown
For these pods, the Kubernetes API server is unable to build a known pod phase when considering container status and conditions within the pod.
This can be caused by a variety of undocumented cases, including control plane failure – where some controller watchdog experiences an internal error and crashes, resulting in a loss of state for resources and/or the inability to reconcile those under its command.
More common would be node and/or `kubelet` process failure, which would impact all pods running on that node.
In such cases, a built-in controller automatically taints nodes housing `unknown`-status pods, and after a period of time forcefully evicts those pods so that they can schedule onto healthy nodes that do not carry this repulsive taint.
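By default, pods tolerate the resulting `node.kubernetes.io/unreachable` taint for 300 seconds before eviction. A hypothetical sketch of a workload that opts into a shorter grace period:

```yaml
# Hypothetical pod that shortens the default 300s grace before
# eviction from an unreachable node to 60s.
apiVersion: v1
kind: Pod
metadata:
  name: impatient-pod
spec:
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60
  containers:
    - name: app
      image: registry.example.com/app:1.0
```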
as an operator, keep in mind
For operators, this can be a fun one to investigate. It’s rarely seen, as most negative cases are caught by a container crash with attempted reconciliation by Kubernetes (which restarts the container).
Even from an infrastructure perspective – node failures are at least acknowledged by the API server and treated to systematic eviction, whether that's `kubelet` letting the control plane know that its underlying node is unhealthy, or the control plane's own health probe against `kubelet` failing.
Some scenarios that can result in brief situations where pods are in `unknown` status include:
- Sudden breaks in network connectivity between agent pool nodes and the Kubernetes control plane due to a change in network rules
- Sudden death of the `kubelet` process or container runtime services on a node, resulting in container processes being sent a `SIGKILL` from the kernel
- Abrupt node pressures, such as a large, unpredictable and lasting spike in CPU/memory usage, or the underlying file system running out of disk space
- Control plane failure
Pods in an unknown phase may still be running – there’s no guarantee even after reconciliation (i.e. deleting an ‘unknown’ phase pod) that the container runtime kills the container process and performs cleanup.
Pods in an unknown phase aren’t really recoverable – it’s recommended that you delete these pods.
If they’re templated out from some set, then the respective controller should realize it’s missing replicas and reconcile with new pods automatically.
probes and the questions they ask
Now that we’ve established the different pod phases – the good and the bad – let’s understand the mechanisms within Kubernetes used to self-heal from the bad ones.
At the highest level, a probe is some command execution, web request, TCP or gRPC connection periodically opened against the application process within a container, from `kubelet`, with the expectation that it completes successfully to denote a healthy service.
As of today, there are three different kinds of probes – each with a different purpose, and they can be used independently or together.
startup

`startup` probes are great for letting Kubernetes know when an application has started. They're run only during startup, and upon the first success they cease running altogether.
They also gate the execution of `liveness` and `readiness` probes, which is great since those test other conditions which may not be true during app start.
If this probe fails, then the container is restarted.
The most common use of startup probes is for applications with long and/or unpredictable startup times – such as those with complex initialization phases or dependency chains. Best practice for this probe is to keep the underlying check very lightweight and simple, since we're just testing for the successful start of your container's entry point process(es) (for example, a running web server rather than a running web application on that server).
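A hedged example of what that might look like – the endpoint and thresholds below are hypothetical, allowing up to five minutes (30 × 10s) for a slow starter:

```yaml
# Hypothetical startup probe: tolerate a long startup, checking only
# that the web server answers at all.
apiVersion: v1
kind: Pod
metadata:
  name: slow-starter
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz        # hypothetical lightweight endpoint
          port: 8080
        periodSeconds: 10
        failureThreshold: 30    # container is restarted if never successful
```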
liveness

`liveness` probes ensure the container entry point process(es) are still running. These are run after the `startup` probe, for the remainder of the pod lifecycle.
If this probe fails, then the container is restarted.
These are almost always implemented, as they're the most basic check to let Kubernetes know that your container processes are running – regardless of the state of those processes.
For this reason, it's best practice that the action for `liveness` probes is also lightweight and simple.
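A minimal sketch, with a hypothetical health endpoint:

```yaml
# Hypothetical liveness probe: a cheap "is the process alive" check.
apiVersion: v1
kind: Pod
metadata:
  name: live-app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      livenessProbe:
        httpGet:
          path: /healthz        # hypothetical endpoint; keep it trivial
          port: 8080
        periodSeconds: 10
        failureThreshold: 3     # ~30s of consecutive failures triggers a restart
```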
readiness

`readiness` probes answer a vital question – can my application service traffic? These are run after the `startup` probe, for the remainder of the pod lifecycle.
If the probe fails, rather than restarting the container process, the pod is removed from the internal Kubernetes service load balancer. This prevents it from receiving traffic. Once it’s healthy, the pod is added back.
These are commonly implemented for container services which actively serve traffic to others. Readiness checks are redundant for standalone containers that work in a silo with no dependents.
Due to the nature of this probe, it’s probably the last check your pod would clear before it’s ready to face the world. As such, ensure the test is a comprehensive one which evaluates service availability as if an end-user or end-service is accessing it.
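A sketch of such a probe, against a hypothetical endpoint that exercises real dependencies (database connections, downstream services, etc.):

```yaml
# Hypothetical readiness probe: a deeper check than liveness.
apiVersion: v1
kind: Pod
metadata:
  name: ready-app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      readinessProbe:
        httpGet:
          path: /ready          # hypothetical endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 3     # pod is pulled from the load balancer, not restarted
```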
readiness gates
While the previous section on probes showcases Kubernetes’ ability to interrogate a container process for its state, a container process can directly inject its state into its pod using some relatively new functionality called readiness gates.
Readiness gates are nothing more than a map of label/key format entries based on feedback signals collected under a `PodStatus` object, which are injected by the container process(es) themselves (AKA, your application).
To do so, the application would need to utilize some Kubernetes client library and, via API server requests, provide a ready condition. Using this, a pod is deemed ready when all its containers are ready AND all conditions specified under `readinessGates` are true.
The result of these can be observed under a pod's `status.conditions`.
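As a minimal sketch (the condition type and names here are hypothetical), a pod opts into a gate in its spec, and some controller – or the application itself, via a Kubernetes client – patches the matching condition into the pod's status:

```yaml
# Hypothetical readiness gate: the pod only becomes Ready once an
# external actor sets the "example.com/lb-registered" condition to
# True under status.conditions.
apiVersion: v1
kind: Pod
metadata:
  name: gated-app
spec:
  readinessGates:
    - conditionType: example.com/lb-registered   # hypothetical condition type
  containers:
    - name: app
      image: registry.example.com/app:1.0
```

Until that condition is patched in, the pod stays out of the service load balancer even if all its probes pass.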
While not conventional – since container applications should be designed independently of the infrastructure they run on – there are interesting use cases where an application's readiness is determined only by an external factor with no hook mechanisms that standard probes and `kubelet` can interrogate.
A few weird examples include:
- Rainwater analysis workload which should only be scheduled to run during a storm to process a non-stop stream of meteorological data from sensors
- Availability of a service running on an offline host such as a root CA where you may have complex, programmatic mechanisms to hook into
- Pod network readiness, where a pod collaborates with the infrastructure's network provider (such as a load balancer fronted by the CNI implementation) to determine when it has been allocated an IP
There are a few reference implementations floating around, namely AWS' load balancer controller, which ensures pods are registered to an ALB/NLB and ready to serve.
It's primarily meant to solve the problem of truly zero-downtime rolling deployments by eliminating the window in which the load balancer controller has yet to register new pods, along with making it cognizant of pods at the beginning and end of their lifecycle.
As a fun note, this functionality is not yet available for Azure's Application Gateway Ingress Controller (AGIC) – a known limitation that Azure documents as resulting in possible 502s. Essentially, it's not a technology ready for zero-downtime deployments (assuming no additional reverse proxy layer exists).
container lifecycle hooks
With all the fun mechanisms to interrogate and influence a pod lifecycle, let’s bring this article to a close by looking at the ways we can have our application hook into these lifecycle phases to do something cool.
Container lifecycle hooks are a way to make containers aware of the overall pod lifecycle.
Basically, we're letting the container process(es) know when the container is starting and when it's terminating.
The supported handler functions to date include:

- `exec`, which executes a command within the application container
- `http`, which executes a web request against your application process running within the container
- `sleep`, which pauses the container process
The different hooks available today only cover the start and end of a container, but can be quite useful.
Take the `PostStart` hook for example, which executes immediately after a container is created (note there's no guarantee it runs before the container's entry point). You may use this hook to patch some 3rd party library in a container image that you didn't build and don't maintain, but need to remediate ASAP due to some security vulnerability.
Another example, using the `PreStop` hook (which executes immediately before a container is terminated), would be to execute a web request that releases a hold on some application license.
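A hedged sketch combining both examples – the patch script and license-release endpoint are hypothetical, not real APIs:

```yaml
# Hypothetical sketch of both hooks described above.
apiVersion: v1
kind: Pod
metadata:
  name: hooked-app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      lifecycle:
        postStart:
          exec:                  # runs right after container creation
            command: ["/bin/sh", "-c", "/opt/patches/apply-cve-fix.sh"]
        preStop:
          httpGet:               # runs right before termination
            path: /license/release
            port: 8080
```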
Hooks are taken pretty seriously – Kubernetes guarantees their delivery (at least once). So much so that a failure of a hook is considered a container failure, resulting in the container being killed and restarted.
Thanks for coming to my TED Talk.