Troubleshooting: Upgrading volume engine is stuck in deadlock
January 3, 2022
This happens when users upgrade Longhorn from version <= v1.1.1 to a newer version.
Upgrading a Longhorn system involves two steps: first upgrade Longhorn manager to the latest version, then use the upgraded Longhorn manager to upgrade the Longhorn engine to the latest version. During the second step (upgrading the Longhorn engine), some volumes may get stuck in the engine upgrade. You may also see that volume attachment or detachment never finishes (e.g., Longhorn volumes are stuck in the detaching or attaching state).
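To spot volumes that are stuck in a transitional state, it can help to list every volume with its current state. The following is a minimal sketch, not an official procedure. It assumes the Longhorn Volume custom resource exposes its state in `.status.state`; the function names `filter_transitional` and `list_transitional_volumes` are hypothetical helpers introduced here for illustration.

```shell
# Print the names of volumes whose state is "attaching" or "detaching".
# Reads whitespace-separated "NAME STATE" lines on stdin.
filter_transitional() {
  awk '$2 == "attaching" || $2 == "detaching" { print $1 }'
}

# Query the cluster and keep only volumes in a transitional state.
# Assumption: the Volume custom resource reports its state in .status.state.
list_transitional_volumes() {
  kubectl -n longhorn-system get volumes.longhorn.io --no-headers \
    -o custom-columns=NAME:.metadata.name,STATE:.status.state | filter_transitional
}
```

Running `list_transitional_volumes` a few minutes apart gives a rough signal: a volume that appears in both runs has probably stopped making progress, although a slow but healthy attach or detach would look the same.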
There is a bug in Longhorn versions <= v1.1.1 that leads to a deadlock in the instance manager pods. See more details at https://github.com/longhorn/longhorn/issues/2697. When you upgrade Longhorn from a version <= v1.1.1 to a newer version, you may hit this bug in a cluster with a few hundred volumes.
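Before upgrading, you may want to confirm whether the running version falls in the affected range. The sketch below compares a version string against v1.1.1. It assumes plain release tags such as `v1.1.1` (no pre-release suffixes or image digests), a `sort` that supports `-V` (GNU coreutils), and that the manager runs as a DaemonSet named `longhorn-manager` whose first container image carries the version tag. The function names are hypothetical helpers, not part of Longhorn.

```shell
# Extract the tag from an image reference, e.g.
# "longhornio/longhorn-manager:v1.1.1" -> "v1.1.1".
image_tag() {
  echo "${1##*:}"
}

# Succeed (exit 0) when the given version is <= v1.1.1.
# The version is affected when v1.1.1 is the higher of the two in version order.
is_affected_version() {
  v="${1#v}"
  highest=$(printf '%s\n%s\n' "$v" "1.1.1" | sort -V | tail -n 1)
  [ "$highest" = "1.1.1" ]
}

# Assumption: the version can be read from the longhorn-manager DaemonSet image tag.
current_manager_version() {
  image_tag "$(kubectl -n longhorn-system get daemonset longhorn-manager \
    -o jsonpath='{.spec.template.spec.containers[0].image}')"
}
```

For example, `is_affected_version "$(current_manager_version)" && echo "affected"` prints `affected` only when the detected tag is v1.1.1 or older.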
The bug is fixed in Longhorn versions >= v1.1.2. If you are planning to upgrade Longhorn to a version >= v1.1.2, you can follow these steps to avoid the bug:
kubectl drain --pod-selector='!longhorn.io/component,app!=csi-attacher,app!=csi-provisioner,app!=csi-snapshotter,app!=csi-resizer,app!=longhorn-driver-deployer,app!=longhorn-ui' <NODE-NAME> --ignore-daemonsets
Find the IP of the instance-manager-e-xxxxxxxx pods inside the longhorn-system namespace. Let’s call it INSTANCE-MANAGER-IP.
Exec into a longhorn-manager-xxxxx pod inside the longhorn-system namespace.
# Install grpcurl
apt-get update
apt-get install -y wget
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.8.0/grpcurl_1.8.0_linux_x86_64.tar.gz
tar -zxvf grpcurl_1.8.0_linux_x86_64.tar.gz
mv grpcurl /usr/local/bin/
# Call instance manager gRPC APIs
wget https://raw.githubusercontent.com/longhorn/longhorn-instance-manager/master/pkg/rpc/rpc.proto
wget https://raw.githubusercontent.com/grpc/grpc/master/src/proto/grpc/health/v1/health.proto
# check the health of grpc server on the instance manager instance-manager-e-f386c595
grpcurl -d '' -plaintext -import-path ./ -proto health.proto <INSTANCE-MANAGER-IP>:8500 grpc.health.v1.Health/Check
# Server returns "status": "SERVING"
grpcurl -d '' -plaintext -import-path ./ -proto rpc.proto <INSTANCE-MANAGER-IP>:8500 ProcessManagerService/ProcessList
# If the server never returns a response, this is a stuck instance manager pod
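Because a stuck instance manager never answers, the ProcessList call above can hang indefinitely. One way to make the check scriptable is to wrap it in a timeout and probe several instance managers in turn. The following is a minimal sketch under these assumptions: GNU `timeout` is available in the pod (it exits with status 124 when the command times out), `rpc.proto` is in the current directory, and the 10-second default is an arbitrary choice. The names `classify_status`, `probe_instance_manager`, `probe_all`, and `PROBE_TIMEOUT` are hypothetical and introduced only for illustration.

```shell
# Map the exit status of a timeout-wrapped call to a short label.
# GNU timeout exits with 124 when the wrapped command timed out.
classify_status() {
  case "$1" in
    0)   echo "ok" ;;
    124) echo "stuck" ;;
    *)   echo "error" ;;
  esac
}

# Probe one instance manager; $1 is its pod IP.
# Uses the same ProcessList call as above, limited to PROBE_TIMEOUT seconds.
probe_instance_manager() {
  timeout "${PROBE_TIMEOUT:-10}" grpcurl -d '' -plaintext -import-path ./ \
    -proto rpc.proto "$1:8500" ProcessManagerService/ProcessList >/dev/null 2>&1
  classify_status "$?"
}

# Probe every IP passed as an argument and print "IP label" per line.
probe_all() {
  for ip in "$@"; do
    echo "$ip $(probe_instance_manager "$ip")"
  done
}
```

A call such as `probe_all 10.42.1.15 10.42.2.20` (example addresses) prints one line per instance manager. The pod IPs can be collected beforehand from the IP column of `kubectl -n longhorn-system get pods -o wide`. A `stuck` label only means no response arrived within the timeout, so a heavily loaded but healthy instance manager may need a larger `PROBE_TIMEOUT` before you treat it as deadlocked.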