Tanzu Kubernetes Grid: Stateful architecture data volume considerations
August 17, 2020

While planning the architecture for our Tanzu Kubernetes Grid (TKG) deployment, one of the tests I wanted to perform was node + pod scaling to see if any issues arose.
The application in question is Django/Python/JS/Bootstrap based, and I was trying to achieve the following goals:
- Migrate the application with as few architecture changes as possible, continuing to utilize the on-prem resources that still have a couple of years' worth of equipment + licensing left to run out.
- Prepare the application for CI/CD onto cloud-native resources with IaC
- Utilize NFS RWX mounts to share media between the Django and NGINX sidecar containers (see the sketch after this list).
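For context, the shared-media setup looks roughly like the sketch below: a single PVC mounted by both the Django container and the NGINX sidecar within one pod. All names, images, and paths here are illustrative assumptions, not the actual manifests.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: appname-frontend                           # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: appname
      tier: frontend
  template:
    metadata:
      labels:
        app: appname
        tier: frontend
    spec:
      containers:
      - name: django
        image: registry.example.com/appname:latest # illustrative image
        volumeMounts:
        - name: media
          mountPath: /app/media                    # Django writes uploaded media here
      - name: nginx
        image: nginx:1.19
        volumeMounts:
        - name: media
          mountPath: /usr/share/nginx/html/media   # NGINX serves the same files
          readOnly: true
      volumes:
      - name: media
        persistentVolumeClaim:
          claimName: media-pvc                     # illustrative claim name

This works fine with one replica on one node; the trouble starts once replicas spread across nodes, as described below.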
As I would soon find out, this is not necessarily a scalable on-prem solution with the native Kubernetes/TKG toolset, due to the node Multi-Attach issue.
TL;DR: Kubernetes does not allow multiple nodes to mount a ReadWriteOnce (RWO) volume at the same time. A Kubernetes bug also exists where a PV is not forcefully detached from its node after the 6-minute timeout, causing Multi-Attach headaches. Some references:
- https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/221
- https://github.com/kubernetes/kubernetes/issues/65392
- https://cormachogan.com/2019/06/18/kubernetes-storage-on-vsphere-101-failure-scenarios/
Commands to remove stuck volumes are near the bottom of this post.
Brief outline of the steps taken to test and troubleshoot:
Scale the worker nodes to start:
tkg scale cluster cluster_name --worker-machine-count 2 --namespace=development
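Once the scale operation completes, the new worker should register and report Ready. A quick sanity check (standard kubectl, assuming the kubeconfig context points at the workload cluster):

kubectl get nodes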
Next, delete the pods managed by the ReplicaSet so they get recreated and rebalanced across the nodes:
kubectl delete pod -l app=appname,tier=frontend
Check on pods:
kubectl get pods
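Adding -o wide shows which node each replica landed on, which becomes relevant for the Multi-Attach error below:

kubectl get pods -o wide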
I noticed the second replica was stuck in Pending; upon further investigation with describe, I saw the following error under Events:
kubectl describe pod { podname }
Events:
  Type     Reason                 Age                 From                     Message
  ----     ------                 ----                ----                     -------
  Normal   Scheduled              43m                 default-scheduler        Successfully assigned nginx-x2hs2 to kubew05
  Warning  FailedAttachVolume     43m                 attachdetach-controller  Multi-Attach error for volume "pvc-0a5eb91b-3720-11e8-8d2b-000c29f8a512" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulMountVolume  43m                 kubelet, kubew05         MountVolume.SetUp succeeded for volume "default-token-scwwd"
  Warning  FailedMount            51s (x19 over 41m)  kubelet, kubew05         Unable to mount volumes for pod "nginx-x2hs2_project(c0e45e49-3721-11e8-8d2b-000c29f8a512)": timeout expired waiting for volumes to attach/mount for pod "project"/"nginx-x2hs2". list of unattached/unmounted volumes=[html]
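With many pods, filtering events by reason can surface these failures faster; a quick check, run against the relevant namespace:

kubectl get events --field-selector reason=FailedAttachVolume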
Primary issue:
Warning FailedAttachVolume Multi-Attach error for volume “{pvc-GUID}” Volume is already exclusively attached to one node and can’t be attached to another
Only a single node can attach a ReadWriteOnce PV/PVC at a time. Essentially, you cannot have a Deployment with two replicas mounting the same RWO-backed volume if those replicas are hosted on different nodes. This is obviously going to be an issue if I am looking to scale up and rely on a shared volume to pass media between pods and nodes.
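For reference, the access mode is declared on the claim. vSphere CSI block volumes are provisioned as ReadWriteOnce, so a claim along these lines (names are illustrative assumptions) can only ever be attached to one node at a time:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-pvc                  # illustrative name
spec:
  accessModes:
  - ReadWriteOnce                  # only one node may attach this volume
  resources:
    requests:
      storage: 10Gi
  storageClassName: vsphere-csi-sc # assumed StorageClass name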
In an effort to recover things to how they were, I tried to mark the newly created node as unschedulable and delete the pod from that node:
kubectl taint nodes node1 key=value:NoSchedule
kubectl delete pod { podname }
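In hindsight, kubectl cordon is the more direct way to mark a node unschedulable without inventing a taint key, and it is easy to reverse:

kubectl cordon node1
kubectl uncordon node1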
This did not help. I checked the pods/PVs/PVCs, and still nothing obvious. After further research, I discovered that I had overlooked the Kubernetes volumeattachments object; it is responsible for publishing/unpublishing operations against a CSI endpoint (i.e. a storage request from a pod).
Check the volumeattachments:
kubectl get volumeattachments
NAME                                                                   ATTACHER                 PV                                         NODE                    ATTACHED   AGE
csi-9f7704015b456f146ce8c6c3bd80a5ec6cc55f4f5bfb90c61c250d0b050a283c   openebs-csi.openebs.io   pvc-b39248ab-5a99-439b-ad6f-780aae30626c   csi-node2.mayalabs.io   true       66m
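A custom-columns query makes the PV-to-node mapping a little easier to eyeball (field paths follow the storage.k8s.io/v1 VolumeAttachment schema):

kubectl get volumeattachments -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached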
Well… Kubernetes thinks the PVCs are attached to a non-existent node, and since we have passed the 6-minute timeout, the volumes will not be detached automatically. The bugs linked at the beginning and end of this post go into more detail as to why a volume may not detach from a node automatically (potential data corruption/loss is the main one).
Recovery
How does one recover from this situation? In my case, the easiest fix was to remove the finalizers from the volumeattachments.
There are a few ways to do this:
kubectl edit volumeattachment csi-xxxxxxxxx
Locate and remove (or comment out) the finalizer (warning: if you edit from Windows, watch the line endings and tabs):
"finalizers": [ #"external-attacher/csi-vsphere-vmware-com" ],
Alternatively, remove the finalizers from all volumeattachments at once:
kubectl get volumeattachments | tail -n+2 | awk '{print $1}' | xargs -I{} kubectl patch volumeattachments {} --type='merge' -p '{"metadata":{"finalizers": null}}'
Remove finalizers from all PVCs:
kubectl get pvc | tail -n+2 | awk '{print $1}' | xargs -I{} kubectl patch pvc {} --type='merge' -p '{"metadata":{"finalizers": null}}'
Remove finalizers from a single PV:
kubectl patch pv pvc-*** --type='merge' -p '{"metadata":{"finalizers":null}}'
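Whichever route you take, verification is cheap; once the finalizers are gone, the volumeattachments should disappear and the stuck pod should be able to mount its volume:

kubectl get volumeattachments
kubectl get pods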
Upon verifying the volumeattachments, I can see that they are removed as soon as they are saved without the finalizers. Back in business.
Final lesson learned
Based on the observations above, I believe that to properly scale our Django application we will need to migrate to object-based storage such as S3 or MinIO, or use an on-prem CSI provider that supports RWX PVCs, such as Portworx. Otherwise, an unnecessary system to synchronize the media stores between nodes/volumes would likely be required, significantly increasing our failure domains.
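For the RWX route, the change at the claim level is just the access mode plus a StorageClass backed by a provisioner that actually supports it; the class name below is an assumption, not something TKG provides out of the box:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-shared-pvc           # illustrative name
spec:
  accessModes:
  - ReadWriteMany                  # requires an RWX-capable provisioner (e.g. Portworx or an NFS-based CSI driver)
  resources:
    requests:
      storage: 10Gi
  storageClassName: portworx-rwx   # assumed StorageClass name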
I believe these types of situations serve as an important reminder that testing and verification should always be a part of your design and deployment process.
Prior documentation on this surprisingly long-standing 'issue':
- https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/221
- https://github.com/kubernetes/kubernetes/issues/65392