
You Down with PDB?

2024-03-22 7 min read Kubernetes Resiliency

P.D.B. how can I explain it
I’ll take you frame by frame it
To have y’all all jumpin’, shoutin’, sayin’ it
P is for Pod, D is for Disruption, reboot and watch it ripple
The last B, well, that’s super simple

~ An ode to Naughty by Nature’s track titled O.P.P. which Microsoft Copilot helped me write 😂

Overview

In this post, we’ll take a look at Kubernetes Pod Disruption Budgets (PDBs) and how they can be used to ensure that your applications remain available during planned disruptions.

In the world of Kubernetes, Pods are the smallest deployable units of computing that can be created and managed. They are the building blocks of Kubernetes applications and are created, scheduled, and managed by the Kubernetes control plane. When you deploy an application, you typically deploy it as a ReplicaSet via a Deployment manifest and run at least three replicas for high availability within a cluster. A PDB is a policy that specifies the minimum number of those Pods that must remain available at any given time during a disruption, giving you a way to protect your applications.
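To make that concrete, here’s a minimal sketch of what a PDB manifest could look like for Pods labeled app=nginx (the name and label here are just placeholders; later in the post we’ll create an equivalent PDB imperatively with kubectl):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2        # keep at least 2 Pods up during a voluntary disruption
  selector:
    matchLabels:
      app: nginx         # must match the labels on the Pods you want to protect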

What is a disruption, you ask? A disruption is a situation where a Pod is removed, either voluntarily or involuntarily. This could happen for a variety of reasons, such as a node being drained for maintenance or a node failing. Draining a node is considered a voluntary disruption, and PDBs will be respected by the Kubernetes Eviction Manager. However, PDBs won’t help much with involuntary disruptions such as underlying hardware failures, machine crashes, or other catastrophic events you may not have control over, and they also won’t stop a Pod that is deleted directly, since that bypasses the eviction process.
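One way to remember the distinction: only removals that go through the Eviction API (like a node drain) consult the PDB, while a direct delete skips it entirely. Both commands below are illustrative; the node and pod names are placeholders:

# A drain evicts pods through the Eviction API, so matching PDBs are honored
kubectl drain <node-name> --ignore-daemonsets

# A direct delete does not go through the Eviction API, so a PDB won't stop it
kubectl delete pod <pod-name>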

With that said, let’s see this in action. We’ll create a simple deployment and then apply a PDB to it. We’ll then simulate a disruption and see how the PDB protects our application.

Environment setup 🔨

For simple tests, I like to use KIND (Kubernetes in Docker) as it’s quick and easy to spin up a cluster. So go install Docker first, then follow the KIND installation instructions here.

With Docker and KIND installed, we can create a local Kubernetes cluster. We need to customize the cluster a bit to include 2 worker nodes, so we’ll use a simple KIND manifest file to spin up a multi-node cluster.

Create a file called kind.yaml and add the following:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker

Now, run this command to create the cluster:

kind create cluster --config kind.yaml

Run the following command to get the nodes in the cluster:

kubectl get nodes

You should see that we have a 3-node cluster up and running.

NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   34s   v1.29.2
kind-worker          Ready    <none>          9s    v1.29.2
kind-worker2         Ready    <none>          13s   v1.29.2

When deploying pods, Kubernetes will try to spread them across nodes, but for our test, we want all our pods on a single node. To ensure all pods get scheduled to kind-worker2, we’ll taint the kind-control-plane and kind-worker nodes to make them unavailable for pod scheduling.

Run the following commands to taint the nodes:

kubectl taint nodes kind-control-plane key1=value1:NoSchedule
kubectl taint nodes kind-worker key1=value1:NoSchedule
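If you want to double-check that the taint landed, you can describe the node and filter for the Taints line (the grep is just one convenient way to do it):

kubectl describe node kind-worker | grep Taints

You should see key1=value1:NoSchedule in the output.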

Now, let’s create a simple nginx deployment that runs three replicas:

kubectl create deployment nginx --image=nginx --replicas=3

After a minute or two, run the following command to see the pods running. You’ll see they’ve all been scheduled to the kind-worker2 node:

kubectl get pod -o wide

You should see output similar to the following:

NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
nginx-7854ff8877-lkkf7   1/1     Running   0          16s   10.244.1.3   kind-worker2   <none>           <none>
nginx-7854ff8877-mr6mf   1/1     Running   0          16s   10.244.1.4   kind-worker2   <none>           <none>
nginx-7854ff8877-pwxnn   1/1     Running   0          16s   10.244.1.2   kind-worker2   <none>           <none>

Make an oopsie 😨

Now, let’s say as a cluster admin you need to perform maintenance on the kind-worker2 node. So you decide to drain it.

Run the following command to drain the worker node:

kubectl drain kind-worker2 --ignore-daemonsets

You should see output similar to the following:

node/kind-worker2 cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-tw6xc, kube-system/kube-proxy-whblx
evicting pod default/nginx-7854ff8877-pwxnn
evicting pod default/nginx-7854ff8877-lkkf7
evicting pod default/nginx-7854ff8877-mr6mf
pod/nginx-7854ff8877-lkkf7 evicted
pod/nginx-7854ff8877-mr6mf evicted
pod/nginx-7854ff8877-pwxnn evicted
node/kind-worker2 drained

Oopsie! The node was cordoned and all 3 nginx pods were evicted. You forgot to remove the taint on the kind-worker node, so there’s nowhere in the cluster for these pods to be rescheduled.

Let’s check the damage. Run the following command to see the pods running:

kubectl get po -o wide

You should see output similar to the following:

NAME                     READY   STATUS    RESTARTS   AGE    IP       NODE     NOMINATED NODE   READINESS GATES
nginx-7854ff8877-gdqtx   0/1     Pending   0          114s   <none>   <none>   <none>           <none>
nginx-7854ff8877-jdrg5   0/1     Pending   0          114s   <none>   <none>   <none>           <none>
nginx-7854ff8877-szlkz   0/1     Pending   0          114s   <none>   <none>   <none>           <none>

Shoot! We’ve rendered our application useless 😭

PDB to the rescue 🛟

If we had a PDB in place, the Kubernetes Eviction Manager would have respected it and not evicted all three nginx pods.

Let’s unwind this situation and try it again, but this time with a PDB in place.

Run the following command to uncordon the kind-worker2 node:

kubectl uncordon kind-worker2

Within a minute or two, the pods should be rescheduled to the kind-worker2 node.

Run the following command to see the pods running:

kubectl get po -o wide

Alright, we’re back in action 😮‍💨

Now, create a PDB for the deployment and set the minimum available pods to 2. This is the budget that we want to enforce.

See, the last B was super simple 😂

kubectl create pdb nginx-pdb --selector=app=nginx --min-available=2

This will ensure that at least 2 pods are always available during a disruption.
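Before draining again, it’s worth confirming the PDB sees your pods. The output below is illustrative; your AGE will differ:

kubectl get pdb nginx-pdb

NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   2               N/A               1                     15s

ALLOWED DISRUPTIONS of 1 means the eviction manager can currently evict one pod without dropping below the budget.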

Drain the kind-worker2 node again.

kubectl drain kind-worker2 --ignore-daemonsets

You should see output similar to the following:

node/kind-worker2 cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-tw6xc, kube-system/kube-proxy-whblx
evicting pod default/nginx-7854ff8877-szlkz
evicting pod default/nginx-7854ff8877-gdqtx
evicting pod default/nginx-7854ff8877-jdrg5
error when evicting pods/"nginx-7854ff8877-jdrg5" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"nginx-7854ff8877-gdqtx" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/nginx-7854ff8877-szlkz evicted

The message “Cannot evict pod as it would violate the pod’s disruption budget” will continue to scroll by until evicting the remaining pods would no longer violate the PDB. Only one pod was evicted because the PDB requires that at least 2 replicas be running.

Open a new terminal window and run the following command to remove the taint from kind-worker so that pods can be scheduled on it:

kubectl taint nodes kind-worker key1:NoSchedule-

If you run the following command, you can see replica counts going down and back up. This tells us pods are being shuffled around between nodes.

kubectl get deploy -w

Hit Ctrl+C to stop the watch, then run the following command to see the pods running:

kubectl get pods -o wide

You should see all pods have been rescheduled on the kind-worker node:

NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
nginx-7854ff8877-9bq4j   1/1     Running   0          74s   10.244.2.4   kind-worker   <none>           <none>
nginx-7854ff8877-bbn6z   1/1     Running   0          5m    10.244.2.2   kind-worker   <none>           <none>
nginx-7854ff8877-pr96l   1/1     Running   0          80s   10.244.2.3   kind-worker   <none>           <none>

Let’s break down what happened here.

  1. The kind-worker2 node was cordoned to prevent new pods from being scheduled on it.
  2. The three pods running on kind-worker2 went into the eviction process.
  3. The first pod was evicted successfully since the PDB was satisfied with 2 pods running.
  4. The second and third pods could not be evicted because doing so would violate the PDB, so the eviction manager retries them again after a few seconds.
  5. As the eviction manager waits, new replicas are created on the kind-worker node.
  6. Once the PDB is satisfied, the second and third pods are evicted from the kind-worker2 node and the drain process completes successfully (you can watch this play out with the command shown below).
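From the PDB’s point of view, this shows up as the ALLOWED DISRUPTIONS count dropping to 0 while only two pods are healthy, then climbing back to 1 once a replacement pod is Running on kind-worker. You can see it by keeping a watch on the PDB in another terminal during the drain; the output below is illustrative:

kubectl get pdb nginx-pdb -w

NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   2               N/A               0                     5m
nginx-pdb   2               N/A               1                     6m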

Conclusion

Pod Disruption Budgets are a great way to protect your applications from planned disruptions and are especially useful when you’re performing maintenance on your nodes, such as node upgrades, kernel upgrades, etc.

In our example, rather than evicting the second and third pods right away, the eviction manager waited for the ReplicaSet to create a new pod on an available node, then completed the eviction on the cordoned node.

Hopefully you now have a better understanding of how PDBs work and will use them as a “best practice” to protect your applications during that next maintenance window.

Now click the image below and go listen to some Naughty by Nature 🥳

Naughty By Nature