Troubleshooting: Upgrading volume engine is stuck in deadlock
Phan Le | January 3, 2022
This issue can occur when users upgrade Longhorn from version <= v1.1.1 to a newer version.
Upgrading a Longhorn system involves two steps: first upgrade the Longhorn manager to the latest version, then use the upgraded Longhorn manager to upgrade the Longhorn engine to the latest version. During the second step (upgrading the Longhorn engine), some volumes may get stuck in the engine upgrade. Volume attachment or detachment may also fail to finish (e.g., Longhorn volumes stay stuck in the attaching or detaching state).
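As a rough way to spot the second symptom, the sketch below filters a list of volume names and states down to the volumes that are currently attaching or detaching. The function name transitional_volumes is hypothetical, and the commented kubectl invocation assumes that the Longhorn Volume custom resource exposes its state at .status.state. A volume that appears here is only a candidate: it is stuck only if it stays in that state across repeated checks.

```shell
#!/bin/sh
# transitional_volumes (hypothetical helper): reads "NAME STATE" lines on stdin
# and prints the names of volumes whose state is attaching or detaching.
transitional_volumes() {
    awk '$2 == "attaching" || $2 == "detaching" { print $1 }'
}

# Example against a live cluster (assumes kubectl access to the
# longhorn-system namespace and that the state lives at .status.state):
# kubectl -n longhorn-system get volumes.longhorn.io --no-headers \
#   -o custom-columns=NAME:.metadata.name,STATE:.status.state | transitional_volumes
```

Running the pipeline a few minutes apart and comparing the output shows which volumes are not making progress.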
There is a bug in Longhorn versions <= v1.1.1 that leads to a deadlock in the instance manager pods. See more details at https://github.com/longhorn/longhorn/issues/2697. When you upgrade Longhorn from version <= v1.1.1 to a newer version, you may hit this bug in a cluster with a few hundred volumes.
This bug is fixed in Longhorn v1.1.2 and later. If you plan to upgrade Longhorn to a version >= v1.1.2, follow these steps to avoid the bug:
kubectl drain --pod-selector='!longhorn.io/component,app!=csi-attacher,app!=csi-provisioner,app!=csi-snapshotter,app!=csi-resizer,app!=longhorn-driver-deployer,app!=longhorn-ui' <NODE-NAME> --ignore-daemonsets
longhorn-system namespace. Let’s call it
# Install grpcurl
apt-get update
apt-get install -y wget
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.8.0/grpcurl_1.8.0_linux_x86_64.tar.gz
tar -zxvf grpcurl_1.8.0_linux_x86_64.tar.gz
mv grpcurl /usr/local/bin/

# Call instance manager gRPC APIs
wget https://raw.githubusercontent.com/longhorn/longhorn-instance-manager/master/pkg/rpc/rpc.proto
wget https://raw.githubusercontent.com/grpc/grpc/master/src/proto/grpc/health/v1/health.proto

# Check the health of the gRPC server on the instance manager instance-manager-e-f386c595
grpcurl -d '' -plaintext -import-path ./ -proto health.proto <INSTANCE-MANAGER-IP>:8500 grpc.health.v1.Health/Check
# Server returns "status": "SERVING"

grpcurl -d '' -plaintext -import-path ./ -proto rpc.proto <INSTANCE-MANAGER-IP>:8500 ProcessManagerService/ProcessList
# If the server never returns a response, this is a stuck instance manager pod
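Because a stuck instance manager never answers the ProcessList call, running that call by hand means waiting indefinitely. The sketch below wraps any command in a time limit and reports whether it answered. It is not part of the original procedure: the function name probe_with_timeout and the 10-second limit in the example are hypothetical choices, and the exit status 124 for a timed-out command assumes the GNU coreutils timeout utility.

```shell
#!/bin/sh
# probe_with_timeout SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND under GNU coreutils `timeout` and prints one word:
#   responsive  the command finished successfully within the limit
#   stuck       the command was killed because it exceeded the limit
#   error       the command failed for some other reason
probe_with_timeout() {
    limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "responsive"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit is reached
        echo "stuck"
    else
        echo "error"
    fi
}

# Example, reusing the ProcessList call with an assumed 10-second limit:
# probe_with_timeout 10 grpcurl -d '' -plaintext -import-path ./ -proto rpc.proto \
#   <INSTANCE-MANAGER-IP>:8500 ProcessManagerService/ProcessList
```

A result of "error" usually points to a different problem, such as a wrong IP address or a missing proto file, so it should not be read as a deadlock.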
© 2019-2023 Longhorn Authors | Documentation Distributed under CC-BY-4.0