Node Failure Handling with Longhorn
When a Kubernetes node fails with the CSI driver installed (everything below is based on Kubernetes v1.12 with the default setup):
kubectl get nodes will report NotReady for the failed node.
The pods on the NotReady node will change to either Unknown or NodeLost.
Because the old pod gets stuck in the Terminating state and the attached Longhorn volumes cannot be released or reused, the new pod will get stuck in the ContainerCreating state. That is why users need to decide whether it is safe to force delete the pod.
Kubernetes evicts the unreachable pods, which leaves them in the Terminating state. See pod eviction timeout for details.
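The inspection and force-delete decision described above can be sketched with standard kubectl commands. This is a minimal sketch, not a Longhorn-provided tool: the function names pods_on_node and force_delete_pod are hypothetical, and the node, namespace, and pod names are placeholders to replace with values from your cluster.

```shell
#!/bin/sh
# Hypothetical helper names; all arguments are placeholders for your own cluster.

# List every pod scheduled on the given node, across all namespaces.
# Usage: pods_on_node <node-name>
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Force delete one pod without waiting for confirmation from the kubelet.
# Only run this after confirming that the node is really down, because the
# volume may otherwise still be in use by the old pod.
# Usage: force_delete_pod <namespace> <pod-name>
force_delete_pod() {
  kubectl delete pod "$2" --namespace "$1" --grace-period=0 --force
}
```

For example, pods_on_node worker-1 shows which pods are stuck on a failed node named worker-1, and force_delete_pod default web-0 removes one of them once you have decided that doing so is safe.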
Then, if the failed node recovers later, Kubernetes will restart those terminating pods, detach the volumes, wait for the old VolumeAttachment cleanup, and reuse (re-attach and re-mount) the volumes. These steps typically take 1 to 7 minutes.
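One way to observe this recovery is to check node readiness and the VolumeAttachment objects, then wait for the pod to become Ready. The sketch below assumes the pod keeps its name after recovery. The function names are hypothetical, and the 600-second timeout is an arbitrary choice that sits above the 1 to 7 minute range mentioned above.

```shell
#!/bin/sh
# Hypothetical helper names; namespace and pod arguments are placeholders.

# Print node readiness and the VolumeAttachment objects known to the cluster.
show_recovery_state() {
  kubectl get nodes
  kubectl get volumeattachments
}

# Block until the given pod reports Ready, or give up after 600 seconds.
# Usage: wait_for_pod_ready <namespace> <pod-name>
wait_for_pod_ready() {
  kubectl wait --for=condition=Ready "pod/$2" --namespace "$1" --timeout=600s
}
```

Running show_recovery_state a few times while the node comes back shows the old VolumeAttachment being cleaned up before the volume is attached again.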
In this case, the detach and re-attach operations are already part of the Kubernetes recovery procedure. Hence no extra operation is needed, and the Longhorn volumes will be available after the above steps.
© 2019-2024 Longhorn Authors | Documentation Distributed under CC-BY-4.0