Troubleshooting: Handling Persistent Replica Failures via Node or Disk Isolation
November 25, 2025
Applies to all Longhorn versions.
A Longhorn Replica enters a Failed state due to environmental instability. When the user deletes the failed replica to trigger a rebuild, the scheduler may place the new replica on the same problematic node or disk. This results in a Rebuild Loop, with each new replica failing immediately on the unstable node or disk.
Common scenarios causing this behavior include:
- Intermittent disk or filesystem I/O errors on the underlying device (visible in dmesg).
- Node-level instability that causes the longhorn-instance-manager pod to crash.
Longhorn's default behavior upon replica deletion is to rebuild it immediately to satisfy the numberOfReplicas requirement.
If the problematic node remains in the Kubernetes Ready state and its disk shows sufficient free space, the Longhorn replica scheduler may choose it again for the new replica. Because the scheduler has no visibility into the underlying instability, it can repeatedly select the same unstable node or disk unless the user intervenes.
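To confirm where the failed replica and its replacements are being scheduled, you can inspect the replica custom resources directly. The commands below are a sketch; the volume name pvc-example and the longhornvolume label selector are illustrative assumptions.

```bash
# List the replicas of the affected volume and the node and disk each one
# was scheduled to (replace pvc-example with your volume name).
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=pvc-example -o wide

# Inspect a single replica in detail, including its current state and
# the disk it was placed on.
kubectl -n longhorn-system get replicas.longhorn.io <replica-name> -o yaml
```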
The standard recovery procedure requires a strategy of Isolate then Rebuild. This forces the scheduler to bypass the problematic node or disk and place the new replica on a different, healthy node or disk.
Phase 1: Isolate the Unstable Node or Disk
The goal is to prevent the Longhorn scheduler from placing new replicas on the compromised node or disk.
Open the Longhorn UI.
Navigate to the Nodes tab.
Locate the node hosting the failed replica and select Edit node and disks.
Unschedule the node or disk (a kubectl alternative is sketched after these steps):
- To isolate the entire node, set Scheduling to Disable in the Node Scheduling box.
- To isolate only the disk, set Scheduling to Disable in that disk's Scheduling box.
Note:
- Do not enable Eviction Requested at this stage; the goal is simply to stop new placement.
- Choose only one of these options depending on whether the entire node or only the disk is unstable.
Click Save.
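If you prefer to work from the CLI, the same isolation can be applied by editing the Longhorn Node custom resource. This is a sketch assuming a node named worker-2; the disk name under spec.disks is specific to your cluster, so inspect the resource first.

```bash
# Disable replica scheduling for the entire node.
kubectl -n longhorn-system patch nodes.longhorn.io worker-2 \
  --type merge -p '{"spec":{"allowScheduling":false}}'

# To disable only one disk instead, find its name under spec.disks
# and set that disk's allowScheduling field to false.
kubectl -n longhorn-system get nodes.longhorn.io worker-2 -o yaml
kubectl -n longhorn-system patch nodes.longhorn.io worker-2 \
  --type merge -p '{"spec":{"disks":{"<disk-name>":{"allowScheduling":false}}}}'
```

Either way, leave Eviction Requested (evictionRequested) unset, as noted above.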
Now that the unstable path is blocked, remove the failed replica.
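The failed replica is normally deleted from the volume's detail page in the Longhorn UI, but it can also be located and removed from the CLI by deleting its custom resource. This is a sketch under the same illustrative names as above.

```bash
# Find the replicas of the volume and identify the one in a failed state.
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=pvc-example

# Delete the failed replica object; Longhorn will schedule a replacement.
kubectl -n longhorn-system delete replicas.longhorn.io <failed-replica-name>
```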
Upon detecting an insufficient replica count, Longhorn triggers the replica scheduler to scan the cluster. Because the original disk is now unschedulable, the scheduler selects a different node or disk. The new replica then begins rebuilding from a healthy source, eventually returning the Volume status to Healthy.
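You can follow the rebuild from the CLI by watching the volume resource until its robustness reports healthy again (the volume name is illustrative):

```bash
# Watch the volume while the new replica rebuilds; robustness should
# return to healthy once the rebuild completes.
kubectl -n longhorn-system get volumes.longhorn.io pvc-example -w
```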
This mitigation strategy also applies to Backing Images when minNumberOfCopies is configured.
Similar to volume replicas, if a disk failure prevents a Backing Image from syncing, the Backing Image Manager may repeatedly attempt to re-download the file to the same problematic disk to satisfy the minimum copy requirement.
By performing Phase 1: Isolate the Unstable Node or Disk, you force the system to bypass the compromised node or disk. Longhorn will then automatically select a different, healthy node or disk to fulfill the minNumberOfCopies requirement, effectively breaking the loop.
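To check how many copies a given Backing Image is required to keep, you can inspect its custom resource. The image name parrot-image is illustrative, and the spec.minNumberOfCopies path is an assumption to verify against your Longhorn version.

```bash
# List backing images in the cluster.
kubectl -n longhorn-system get backingimages.longhorn.io

# Read the minimum-copy requirement on a specific backing image.
kubectl -n longhorn-system get backingimages.longhorn.io parrot-image \
  -o jsonpath='{.spec.minNumberOfCopies}{"\n"}'
```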