Troubleshooting: fstrim doesn't work on old kernel

October 6, 2023

Applicable versions

  • Longhorn version >= v1.4.0
  • Node kernel version <= v4.12

Symptoms

When running a filesystem trim (either from the Longhorn UI or manually with the fstrim command on the host), you may hit an error similar to:

unable to trim filesystem for volume pvc-e381424a-4866-447a-a75c-3096036f7846: cannot find volume pvc-e381424a-4866-447a-a75c-3096036f7846 mount info on host: failed to execute: nsenter [--mount=/host/proc/20357/ns/mnt --net=/host/proc/20357/ns/net fstrim /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/d2df1133f3440486ddec39370380eeed7a3c71499981d63fc80e43c7ca9f4c9e/globalmount], output , stderr fstrim: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/d2df1133f3440486ddec39370380eeed7a3c71499981d63fc80e43c7ca9f4c9e/globalmount: FITRIM ioctl failed: Input/output error\n: exit status 255

or

unable to trim filesystem for volume pvc-e381424a-4866-447a-a75c-3096036f7846: cannot find volume pvc-e381424a-4866-447a-a75c-3096036f7846 mount info on host: failed to execute: nsenter [--mount=/host/proc/20357/ns/mnt --net=/host/proc/20357/ns/net fstrim /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/d2df1133f3440486ddec39370380eeed7a3c71499981d63fc80e43c7ca9f4c9e/globalmount], output , stderr fstrim: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/d2df1133f3440486ddec39370380eeed7a3c71499981d63fc80e43c7ca9f4c9e/globalmount: the discard operation is not supported : exit status 1

Details

This is caused by a bug in the SCSI driver of older kernels, which sets the wrong provisioning mode for SCSI devices that advertise both UNMAP and WRITE SAME capabilities, as detailed in this kernel patch: https://github.com/torvalds/linux/commit/bcd069bb250acf6088b60d189ab3ec3ae8dd11a5
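
To check whether a device advertises both capabilities, you can inspect its Logical Block Provisioning VPD page. A minimal sketch, assuming the sg3_utils package is installed and /dev/sda is the SCSI device backing the Longhorn volume (both are assumptions for illustration):

    # Dump the Logical Block Provisioning VPD page.
    # LBPU=1 means UNMAP is supported; LBPWS=1 means WRITE SAME (16) is supported.
    sg_vpd --page=lbp /dev/sda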

Troubleshooting steps

Case 1: Volume is using old engine image or running in the old instance manager pod

  1. Double-check on the volume detail page in the Longhorn UI that the volume's engine image is >= v1.4.0. Only engine images >= v1.4.0 support the filesystem trim feature.
  2. If the volume's engine image was recently live upgraded from a version < v1.4.0 to >= v1.4.0, make sure the volume is detached and reattached first to activate the trim feature for the Longhorn volume. You can do this by manually scaling the workload deployment that uses the volume down and then up (see the sketch after this list). See more details at https://longhorn.io/docs/1.4.0/volumes-and-nodes/trim-filesystem/#prerequisites
  3. After the above steps, run the filesystem trim again. If the problem persists, move on to the next case below.
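
A minimal sketch of the detach/reattach step in item 2, assuming the volume is consumed by a Deployment named my-workload in the default namespace (both names are hypothetical):

    # Scale the workload to zero so its pod terminates and Longhorn detaches the volume.
    kubectl scale deployment my-workload --replicas=0
    # Wait until the volume shows as Detached in the Longhorn UI, then scale back up.
    kubectl scale deployment my-workload --replicas=1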

Case 2: Old node kernel version

  1. Find the node that the volume is currently attached to and SSH into that node.
  2. Find the major and minor numbers of the Longhorn block device with ls -l /dev/longhorn/<volume-name>. For example:
    [root@phan-v147-cloudera-pool1-f1dec634-cm59v ~]# ls -l /dev/longhorn/testvol
    brw-rw----. 1 root root 8, 0 Oct  6 20:24 /dev/longhorn/testvol
    

    In this case, the major:minor numbers are 8:0.

  3. Find the corresponding SCSI device's address with lsscsi -d. For example:
    [root@phan-v147-cloudera-pool1-f1dec634-cm59v ~]# lsscsi -d
    [3:0:0:0]    storage IET      Controller       0001  -
    [3:0:0:1]    disk    IET      VIRTUAL-DISK     0001  /dev/sda [8:0]
    

    In this case, the corresponding SCSI device's address is [3:0:0:1] because it has the same major:minor numbers, 8:0.

  4. Find the provisioning mode of the SCSI device with find /sys/ -name provisioning_mode -exec grep -H . {} + | sort. For example:
    [root@phan-v147-cloudera-pool1-f1dec634-cm59v ~]# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
    /sys/devices/platform/host3/session1/target3:0:0/3:0:0:1/scsi_disk/3:0:0:1/provisioning_mode:writesame_16
    

    In this case, looking at the device with address 3:0:0:1, its provisioning mode is writesame_16 (in some cases it may be disabled) rather than the correct value, unmap.

  5. Check the kernel version with uname -r. If the kernel version is < v4.12, you have hit this issue. The sketch below combines these checks.
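
A minimal sketch combining steps 2-5, assuming lsscsi is installed and the volume is named testvol (a hypothetical name):

    VOL=testvol                                      # hypothetical volume name
    DEV=/dev/longhorn/$VOL
    # Major:minor numbers of the Longhorn block device, e.g. 8:0
    MAJMIN=$(lsblk -dno MAJ:MIN "$DEV" | tr -d ' ')
    # The SCSI device whose major:minor numbers match
    lsscsi -d | grep -F "[$MAJMIN]"
    # Provisioning mode of every SCSI disk; affected nodes show writesame_16 or disabled
    find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
    # Kernel version; < 4.12 means this issue applies
    uname -r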

Solution

Upgrade the kernel to a recommended version as listed on the best practices page: https://longhorn.io/docs/1.5.1/best-practices/#operating-system. At the time of writing, the recommended kernel version is >= v5.8.
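
After upgrading, you can verify the fix. A minimal sketch, assuming the volume's filesystem is mounted at /mnt/testvol (a hypothetical mount point):

    # The provisioning mode of the backing SCSI device should now read unmap.
    find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
    # fstrim should now print the amount of trimmed space instead of failing.
    fstrim -v /mnt/testvol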
