
OBD devices are not always removed on umount #395

Open
bwjoh opened this issue Sep 26, 2024 · 3 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@bwjoh

bwjoh commented Sep 26, 2024

/kind bug

What happened?
Ran the following job on a cluster with aws-fsx-csi-driver:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: mount-stress-
spec:
  parallelism: 1
  completions: 100
  ttlSecondsAfterFinished: 10
  template:
    spec:
      containers:
      - name: busybox-mount
        image: busybox
        imagePullPolicy: IfNotPresent
        command: ['sh', '-c', 'echo "Test Job Start" && sleep 15 && echo "Test Job End" && exit 0']
        resources:
          limits:
            memory: "2048Mi"
            cpu: "500m"
          requests:
            memory: "2048Mi"
            cpu: "500m"
        volumeMounts:
          - mountPath: /mnt/fsx/test
            name: fsx-mount
      restartPolicy: Never
      volumes:
        - name: fsx-mount
          persistentVolumeClaim:
            claimName: lustre-test
  backoffLimit: 4

OBD devices created when mounting the file system were removed on unmount only ~59% of the time. Monitored using lctl dl | wc -l.

There is some documentation about monitoring these devices (relevant because the Lustre client has a limit of 8192 OBD devices): https://aws.amazon.com/blogs/storage/best-practices-for-monitoring-amazon-fsx-for-lustre-clients-and-file-systems/
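
For reference, a minimal way to watch the device count on a node while the job churns through mounts (a sketch assuming lctl from the Lustre client utilities is on the PATH and the loop is run as root) is:

# Print the OBD device count every 5 seconds; it should drop back to 0
# once all pods have terminated and their volumes are unmounted.
while true; do
  echo "$(date -Is) OBD devices: $(lctl dl | wc -l)"
  sleep 5
done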

What you expected to happen?
After running the above job lctl dl | wc -l would show 0.

How to reproduce it (as minimally and precisely as possible)?
I have not reproduced this with a generic AWS AMI, only with a customized Ubuntu 20 AMI. It is using the Lustre client, version 1.12.8.

On the host instance I haven't been able to reproduce this behaviour with mount and umount directly. There are no obvious errors in syslog on the host when devices are not removed, and the fsx-csi-driver logs related to unmounting all report success.
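
For context, the manual check on the host was roughly the following loop (the file system DNS name, mount name, and mount point are placeholders, not our real values):

# Rough shape of the manual host-level mount/umount test; the DNS name and
# mount name below are placeholders for the real FSx for Lustre file system.
for i in $(seq 1 100); do
  sudo mount -t lustre -o relatime,flock \
    fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/mountname /mnt/fsx/test
  sleep 1
  sudo umount /mnt/fsx/test
  lctl dl | wc -l   # stayed at 0 after each umount in this manual test
done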

I am not sure if this is an issue with Lustre client version, something specific to the CSI workflow, or something else.

Anything else we need to know?:
This has been problematic because we have workflows with short-lived pods, so nodes have to be recycled frequently to avoid hitting the Lustre client's 8192-device limit.

This may also be related to some memory issues we have had on nodes: /proc/vmallocinfo ends up with many cfs_hash_buckets_realloc entries (which appear to come from the Lustre client) for the leftover devices. We have not found a way to remove these leftover devices other than recycling the nodes.
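
If it helps anyone checking the same thing, a quick way to count those entries (run as root on the node) is:

# Count vmalloc allocations attributed to Lustre's cfs_hash_buckets_realloc
grep -c cfs_hash_buckets_realloc /proc/vmallocinfo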

Any confirmation if others are hitting this issue, or guidance on how to avoid this would be appreciated!

Environment

  • Kubernetes version (use kubectl version): 1.30.3
  • Driver version: v1.2.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 26, 2024
@jacobwolfaws
Contributor

I have not reproduced this with a generic AWS AMI, only with a customized Ubunutu 20 AMI. It is using Lustre client with version 1.12.8.

Was this custom AMI built using an FSx for Lustre vended 2.12.8 Lustre client?

@bwjoh
Author

bwjoh commented Sep 27, 2024

Realized my initial post is a bit unclear: I haven't tried to reproduce this issue with a generic AWS AMI; I have only tested with a custom AMI.

The AMI we are using has the Lustre client installed following https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2024