Tanzu Kubernetes Grid: Stateful architecture data volume considerations

August 17, 2020 By Corey Dinkens

While planning the architecture for our Tanzu Kubernetes Grid (TKG) deployment, one of the tests I wanted to perform was node + pod scaling to see if any issues arose.

The application in question is Django/Python/JS/Bootstrap-based, and I was trying to achieve the following goals:

  • Migrate the application with as few architecture changes as possible, continuing to utilize the on-prem resources that still have a couple of years' worth of equipment and licensing to run out.
  • Prepare the application for CI/CD onto cloud-native resources with infrastructure as code (IaC).
  • Utilize NFS RWX mounts to share media between the Django container and the NGINX sidecar (a rough sketch of this layout follows the list).
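
For context, here is a rough sketch of the pod layout I had in mind; the names, images, and mount paths below are hypothetical placeholders rather than our actual manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: appname-frontend                            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: appname
      tier: frontend
  template:
    metadata:
      labels:
        app: appname
        tier: frontend
    spec:
      containers:
      - name: django
        image: registry.example.com/appname:latest  # placeholder image
        volumeMounts:
        - name: media
          mountPath: /app/media                     # Django writes uploaded media here
      - name: nginx                                 # sidecar serving static/media files
        image: nginx:1.19
        volumeMounts:
        - name: media
          mountPath: /usr/share/nginx/html/media    # NGINX serves the same files
      volumes:
      - name: media
        persistentVolumeClaim:
          claimName: media-pvc                      # the NFS-backed RWX claim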

As I would soon find out, this is not necessarily a truly scalable on-prem solution with the native Kubernetes/TKG toolset, due to the volume Multi-Attach issue.


TL;DR: Kubernetes only allows a ReadWriteOnce volume to be attached to a single node at a time, and a known Kubernetes bug can prevent a PV from being forcefully detached from its node after the 6-minute timeout, causing Multi-Attach headaches. Some references:

The commands to remove stuck volumes are near the bottom of this post.


Brief outline of the steps taken to test and troubleshoot:

Scale the worker nodes to start:

tkg scale cluster cluster_name --worker-machine-count 2 --namespace=development

Next, delete the frontend pods so the ReplicaSet recreates them and reschedules them across the nodes:

kubectl delete pod -l app=appname,tier=frontend

Check on pods:

kubectl get pods

I noticed the second replica was stuck in Pending; upon further investigation with describe, I saw the following error under Events:

kubectl describe pod { podname }
Events:
  Type     Reason                 Age                 From                     Message
  ----     ------                 ----                ----                     -------
  Normal   Scheduled              43m                 default-scheduler        Successfully assigned nginx-x2hs2 to kubew05
  Warning  FailedAttachVolume     43m                 attachdetach-controller  Multi-Attach error for volume "pvc-0a5eb91b-3720-11e8-8d2b-000c29f8a512" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulMountVolume  43m                 kubelet, kubew05         MountVolume.SetUp succeeded for volume "default-token-scwwd"
  Warning  FailedMount            51s (x19 over 41m)  kubelet, kubew05         Unable to mount volumes for pod "nginx-x2hs2_project(c0e45e49-3721-11e8-8d2b-000c29f8a512)": timeout expired waiting for volumes to attach/mount for pod "project"/"nginx-x2hs2". list of unattached/unmounted volumes=[html]

Primary issue:

Warning FailedAttachVolume Multi-Attach error for volume “{pvc-GUID}” Volume is already exclusively attached to one node and can’t be attached to another


Only a single node can attach a ReadWriteOnce PV/PVC at a time. Essentially, you cannot have a deployment with two replicas mounting the same volume if those replicas are hosted on different nodes. This is obviously going to be an issue if I am looking to scale up and rely on NFS to share files between pods and nodes.
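
A quick way to confirm what you are working with is to check the access modes on the claim and on the PV from the error above (the claim name is a placeholder); if these come back as ["ReadWriteOnce"], only one node will ever be able to attach the volume:

kubectl get pvc { pvcname } -o jsonpath='{.spec.accessModes}'
kubectl get pv pvc-0a5eb91b-3720-11e8-8d2b-000c29f8a512 -o jsonpath='{.spec.accessModes}'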

In an effort to recover things to how they were, I tried to mark the newly created node as unschedulable and delete the pod from that node:

kubectl taint nodes node1 key=value:NoSchedule
kubectl delete pod { podname }
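
For what it's worth, cordoning the node is the more common way to mark it unschedulable; either approach only keeps new pods off the node and does nothing about the volume still attached to it:

kubectl cordon node1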

This did not help. I checked the pods/PVs/PVCs, and still nothing obvious. After further research, I discovered that I had overlooked the Kubernetes volumeattachments object; it tracks the attach/detach (publish/unpublish) operations performed against a CSI endpoint (i.e., a pod's storage request).

Check the volumeattachments:

kubectl get volumeattachments
NAME                                                                   ATTACHER                 PV                                         NODE                    ATTACHED   AGE
csi-9f7704015b456f146ce8c6c3bd80a5ec6cc55f4f5bfb90c61c250d0b050a283c   openebs-csi.openebs.io   pvc-b39248ab-5a99-439b-ad6f-780aae30626c   csi-node2.mayalabs.io   true       66m

Well… Kubernetes thinks the PVCs are still attached to a non-existent node, and since we have passed the 6-minute timeout, they will not be detached automatically. The bugs linked at the beginning and end of this post go into more detail as to why a volume may not be detached from a node automatically (potential data corruption/loss being the main concern).
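
A quick sanity check is to list the nodes and compare them against the NODE column in the volumeattachments output; in my case, the node the attachment pointed at was no longer in the cluster:

kubectl get nodes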

Recovery

How does one recover from this situation? In my case, the easiest approach was to remove the finalizers from the volumeattachments.

There are a few ways to do this:

kubectl edit volumeattachment csi-xxxxxxxxx

Locate the finalizer and remove (or comment out) the entry (warning: if you edit from Windows, take care with line endings and tabs):

finalizers:
# - external-attacher/csi-vsphere-vmware-com

Alternatively, remove the finalizers from all volumeattachments at once:

kubectl get volumeattachments | tail -n+2 | awk '{print $1}' | xargs -I{} kubectl patch volumeattachments {} --type='merge' -p '{"metadata":{"finalizers": null}}'
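
To clear a single stuck attachment non-interactively instead, apply the same merge patch to one volumeattachment by name (taken from kubectl get volumeattachments):

kubectl patch volumeattachment csi-xxxxxxxxx --type='merge' -p '{"metadata":{"finalizers":null}}'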

Remove the finalizers from all PVCs:

kubectl get pvc | tail -n+2 | awk '{print $1}' | xargs -I{} kubectl patch pvc {} --type='merge' -p '{"metadata":{"finalizers": null}}'

Remove the finalizers from a single PV:

kubectl patch pv pvc-*** --type='merge' -p '{"metadata":{"finalizers":null}}'

Upon verifying the volumeattachments, I could see that they were removed once saved without the finalizers. Back in business.

Final lesson learned

Based on the observations above, I believe that to properly scale our Django application, we will need to migrate the media to object-based storage such as S3 or MinIO, or use an on-prem CSI provider that supports RWX PVCs, such as Portworx. Otherwise, we would likely need an extra system just to synchronize the media stores between nodes/volumes, significantly increasing our failure domains.
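
For reference, here is a rough sketch of what the RWX claim could look like with such a provider; the storage class name below is a hypothetical placeholder that depends entirely on the CSI driver in use:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-pvc                    # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                  # multiple nodes can attach/mount the volume
  storageClassName: rwx-shared-sc    # hypothetical class backed by an RWX-capable CSI driver
  resources:
    requests:
      storage: 10Gi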

I believe these types of situations serve as an important reminder that testing and verification should always be a part of your design and deployment process.