Pods stuck in Terminating or Init:0/2 state after file share went down and came back up #693

Closed
achaikaJH opened this issue Nov 20, 2023 · 5 comments · Fixed by #694

@achaikaJH

What happened:
I'm using the csi-smb driver to connect to Windows DFS file shares from multiple pods and clusters in Azure AKS v1.26.3. The same share path may be assigned to different pods, and the PV source sometimes varies but ultimately resolves to the same DFS root. For example:
PV1:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-samba-auth-1
  namespace: ns-uat
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - dir_mode=0700
  - file_mode=0600
  - vers=3.1.1
  - uid=1001
  - gid=1001
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: samba-api-id-1 # make sure it's a unique id in the cluster
    volumeAttributes:
      source: "//dfsroot.com/dfs/Apps/BUSINESS/cr/UAT/001"
    nodeStageSecretRef:
      name: sambacreds
      namespace: ns-uat

PV2:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-samba-auth-2
  namespace: ns-uat
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - dir_mode=0700
  - file_mode=0600
  - vers=3.1.1
  - uid=1001
  - gid=1001
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: samba-api-id-2 # make sure it's a unique id in the cluster
    volumeAttributes:
      source: "//dfsroot.com/dfs/Apps/BUS/cr/UAT/002"
    nodeStageSecretRef:
      name: sambacreds
      namespace: ns-uat

PV3:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-samba-auth-3
  namespace: ns-uat
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - dir_mode=0700
  - file_mode=0600
  - vers=3.1.1
  - uid=1001
  - gid=1001
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: samba-api-id-3 # make sure it's a unique id in the cluster
    volumeAttributes:
      source: "//dfsroot.com/dfs/Apps"
    nodeStageSecretRef:
      name: sambacreds
      namespace: ns-uat
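
To make the overlap between these PVs concrete, here is a minimal sketch that groups PV sources by their DFS root. It assumes the root is the first two path components (`//server/namespace`); the helper names `dfs_root` and `group_by_root` are hypothetical and not part of the driver.

```python
from collections import defaultdict


def dfs_root(source: str) -> str:
    """Return //server/share for an SMB source such as //server/share/dir/sub."""
    # Accept both forward and backward slashes, then drop empty components.
    parts = [p for p in source.replace("\\", "/").split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a //server/share path: {source!r}")
    return "//" + "/".join(parts[:2])


def group_by_root(pv_sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each DFS root to the names of the PVs whose source lives under it."""
    groups = defaultdict(list)
    for pv_name, source in pv_sources.items():
        groups[dfs_root(source)].append(pv_name)
    return dict(groups)


# Sources copied from the three PV manifests above.
pv_sources = {
    "pv-samba-auth-1": "//dfsroot.com/dfs/Apps/BUSINESS/cr/UAT/001",
    "pv-samba-auth-2": "//dfsroot.com/dfs/Apps/BUS/cr/UAT/002",
    "pv-samba-auth-3": "//dfsroot.com/dfs/Apps",
}
print(group_by_root(pv_sources))
```

All three PVs land in the same group, which is why an outage of one DFS server can affect every one of these mounts on a node at once.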

Last week the DFS server went down for several minutes and then came back up, but the PVs in the clusters never recovered. If I try to restart a pod, it goes into the Terminating state and stays there forever. Meanwhile, the new pod goes into the Init:0/2 state and is also stuck there.

Events from pod which stuck in "Terminating" state:

Warning  FailedMount      14m (x67 over 144m)   kubelet       MountVolume.SetUp failed for volume "pv-samba-auth" : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/smb.csi.k8s.io/csi.sock: connect: resource temporarily unavailable"
Warning  FailedMount      4m4s (x59 over 124m)  kubelet       (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[smb-auth], unattached volumes=[azure-keyvault-env smb-cps istiod-ca-cert workload-socket workload-certs istio-podinfo credential-socket istio-token kube-api-access-8njmw ns-auth-cmp istio-data istio-envoy akv2k8s-client-cert[]: timed out waiting for the condition

Events from pod which stuck in "Init:0/2" state:

Normal   Killing          2m45s (x12 over 5h38m)  kubelet       Stopping container gwam-digtl-cps-auth
Warning  FailedKillPod    2m45s (x7 over 5h7m)    kubelet       error killing pod: [failed to "KillContainer" for "gwam-digtl-auth" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "218df537-b65c-4404-8fe2-b64007de2eb4" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

Logs from csi-smb-node, smb container:

2023-11-17T03:06:31.455545769Z I1117 03:06:31.455394       1 utils.go:77] GRPC request: {"target_path":"/var/lib/kubelet/pods/bc9577ae-7ee7-46a3-bd26-6c1a23fe2205/volumes/kubernetes.io~csi/pv-samba-qiss/mount","volume_id":"samba-qiss-api-id"}
2023-11-17T03:06:31.455687171Z I1117 03:06:31.455626       1 nodeserver.go:101] NodeUnpublishVolume: unmounting volume samba-qiss-api-id on /var/lib/kubelet/pods/bc9577ae-7ee7-46a3-bd26-6c1a23fe2205/volumes/kubernetes.io~csi/pv-samba-qiss/mount
2023-11-17T03:06:31.460369524Z I1117 03:06:31.460243       1 mount_helper_common.go:107] "/var/lib/kubelet/pods/bc9577ae-7ee7-46a3-bd26-6c1a23fe2205/volumes/kubernetes.io~csi/pv-samba-qiss/mount" is a mountpoint, unmounting
2023-11-17T03:06:31.460409024Z I1117 03:06:31.460267       1 mount_linux.go:362] Unmounting /var/lib/kubelet/pods/bc9577ae-7ee7-46a3-bd26-6c1a23fe2205/volumes/kubernetes.io~csi/pv-samba-qiss/mount
2023-11-17T03:06:31.468479816Z W1117 03:06:31.468361       1 mount_helper_common.go:142] Warning: "/var/lib/kubelet/pods/bc9577ae-7ee7-46a3-bd26-6c1a23fe2205/volumes/kubernetes.io~csi/pv-samba-qiss/mount" is not a mountpoint, deleting
2023-11-17T03:06:31.468502416Z I1117 03:06:31.468428       1 nodeserver.go:106] NodeUnpublishVolume: unmount volume samba-qiss-api-id on /var/lib/kubelet/pods/bc9577ae-7ee7-46a3-bd26-6c1a23fe2205/volumes/kubernetes.io~csi/pv-samba-qiss/mount successfully
2023-11-17T03:06:31.468517716Z I1117 03:06:31.468451       1 utils.go:83] GRPC response: {}
2023-11-17T03:06:31.564370204Z I1117 03:06:31.564174       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnstageVolume
2023-11-17T03:06:31.564404404Z I1117 03:06:31.564199       1 utils.go:77] GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/3f504e5b61365b11fed26fe2e3b876ba8054116f167d798be61d3f62efad391f/globalmount","volume_id":"samba-qiss-api-id"}
2023-11-17T03:06:31.564410504Z I1117 03:06:31.564251       1 nodeserver.go:260] NodeUnstageVolume: CleanupMountPoint on /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/3f504e5b61365b11fed26fe2e3b876ba8054116f167d798be61d3f62efad391f/globalmount with volume samba-qiss-api-id
2023-11-17T03:06:31.564425904Z I1117 03:06:31.564348       1 mount_helper_common.go:107] "/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/3f504e5b61365b11fed26fe2e3b876ba8054116f167d798be61d3f62efad391f/globalmount" is a mountpoint, unmounting
2023-11-17T03:06:31.564493305Z I1117 03:06:31.564368       1 mount_linux.go:362] Unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/3f504e5b61365b11fed26fe2e3b876ba8054116f167d798be61d3f62efad391f/globalmount
2023-11-17T03:06:31.811025302Z W1117 03:06:31.810886       1 mount_helper_common.go:142] Warning: "/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/3f504e5b61365b11fed26fe2e3b876ba8054116f167d798be61d3f62efad391f/globalmount" is not a mountpoint, deleting
2023-11-17T03:06:31.811081203Z I1117 03:06:31.810989       1 nodeserver.go:269] NodeUnstageVolume: unmount volume samba-qiss-api-id on /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/3f504e5b61365b11fed26fe2e3b876ba8054116f167d798be61d3f62efad391f/globalmount successfully
2023-11-17T03:06:31.811108603Z I1117 03:06:31.811017       1 utils.go:83] GRPC response: {}
2023-11-17T04:44:49.242559854Z E1117 04:44:49.242235       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource temporarily unavailable
2023-11-17T04:44:49.242623354Z E1117 04:44:49.242248       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource temporarily unavailable
2023-11-17T04:47:30.382783755Z E1117 04:47:30.382613       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T04:49:39.104482522Z E1117 04:49:39.104284       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T04:51:52.168571655Z E1117 04:51:52.168396       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T04:54:05.518087612Z E1117 04:54:05.517850       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T04:56:14.379160439Z E1117 04:56:14.378933       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T04:57:18.765714965Z E1117 04:57:18.765508       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T04:59:27.523550193Z E1117 04:59:27.523384       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T05:00:36.169539579Z E1117 05:00:36.169320       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T05:02:45.884105200Z E1117 05:02:45.883918       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T05:04:54.606457300Z E1117 05:04:54.606260       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T05:07:03.366882842Z E1117 05:07:03.366690       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T05:09:12.116175454Z E1117 05:09:12.115917       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided
2023-11-17T05:10:54.777048023Z E1117 05:10:54.776873       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:12:50.535454951Z E1117 05:12:50.535261       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:14:18.994920948Z E1117 05:14:18.994748       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:15:36.243986459Z E1117 05:15:36.243822       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:17:30.465978422Z E1117 05:17:30.465803       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:19:24.667088627Z E1117 05:19:24.666893       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:20:31.938649081Z E1117 05:20:31.938454       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:21:57.881548821Z E1117 05:21:57.881279       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:23:19.138080236Z E1117 05:23:19.137893       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:25:11.871897246Z E1117 05:25:11.871757       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:26:43.540557696Z E1117 05:26:43.540382       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:28:29.595137464Z E1117 05:28:29.594902       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error
2023-11-17T05:30:39.560338048Z E1117 05:30:39.560119       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: resource deadlock avoided
2023-11-17T05:32:48.293963316Z E1117 05:32:48.293670       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: resource deadlock avoided
2023-11-17T05:34:57.074814512Z E1117 05:34:57.074630       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: resource deadlock avoided
2023-11-17T15:01:43.553489378Z I1117 15:01:43.553317       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2023-11-17T15:01:43.554253587Z I1117 15:01:43.553954       1 utils.go:77] GRPC request: {"target_path":"/var/lib/kubelet/pods/b7f1a55b-8344-4793-8569-203d857ce5a2/volumes/kubernetes.io~csi/pv-samba-pumkt-esgdata-azure/mount","volume_id":"samba-pumkt-esgdata-azure-id"}
2023-11-17T15:01:43.554398188Z I1117 15:01:43.554316       1 nodeserver.go:101] NodeUnpublishVolume: unmounting volume samba-pumkt-esgdata-azure-id on /var/lib/kubelet/pods/b7f1a55b-8344-4793-8569-203d857ce5a2/volumes/kubernetes.io~csi/pv-samba-pumkt-esgdata-azure/mount
2023-11-17T15:01:43.559413846Z I1117 15:01:43.559046       1 mount_helper_common.go:107] "/var/lib/kubelet/pods/b7f1a55b-8344-4793-8569-203d857ce5a2/volumes/kubernetes.io~csi/pv-samba-pumkt-esgdata-azure/mount" is a mountpoint, unmounting
2023-11-17T15:01:43.559443246Z I1117 15:01:43.559084       1 mount_linux.go:362] Unmounting /var/lib/kubelet/pods/b7f1a55b-8344-4793-8569-203d857ce5a2/volumes/kubernetes.io~csi/pv-samba-pumkt-esgdata-azure/mount
2023-11-17T15:01:43.567699840Z W1117 15:01:43.567562       1 mount_helper_common.go:142] Warning: "/var/lib/kubelet/pods/b7f1a55b-8344-4793-8569-203d857ce5a2/volumes/kubernetes.io~csi/pv-samba-pumkt-esgdata-azure/mount" is not a mountpoint, deleting
2023-11-17T15:01:43.567815642Z I1117 15:01:43.567651       1 nodeserver.go:106] NodeUnpublishVolume: unmount volume samba-pumkt-esgdata-azure-id on /var/lib/kubelet/pods/b7f1a55b-8344-4793-8569-203d857ce5a2/volumes/kubernetes.io~csi/pv-samba-pumkt-esgdata-azure/mount successfully
2023-11-17T15:01:43.567837342Z I1117 15:01:43.567674       1 utils.go:83] GRPC response: {}
2023-11-17T15:01:43.659187386Z I1117 15:01:43.659037       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnstageVolume
2023-11-17T15:01:43.659214986Z I1117 15:01:43.659066       1 utils.go:77] GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/7a4eca7d8c9df9b9e0413e1b5bbd97576d39606a4a243623c1a731615626e6b5/globalmount","volume_id":"samba-pumkt-esgdata-azure-id"}
2023-11-17T15:01:43.659219886Z I1117 15:01:43.659129       1 nodeserver.go:260] NodeUnstageVolume: CleanupMountPoint on /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/7a4eca7d8c9df9b9e0413e1b5bbd97576d39606a4a243623c1a731615626e6b5/globalmount with volume samba-pumkt-esgdata-azure-id
2023-11-17T15:01:43.659302787Z I1117 15:01:43.659232       1 mount_helper_common.go:107] "/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/7a4eca7d8c9df9b9e0413e1b5bbd97576d39606a4a243623c1a731615626e6b5/globalmount" is a mountpoint, unmounting
2023-11-17T15:01:43.659314087Z I1117 15:01:43.659251       1 mount_linux.go:362] Unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/7a4eca7d8c9df9b9e0413e1b5bbd97576d39606a4a243623c1a731615626e6b5/globalmount
2023-11-17T15:01:43.706232923Z W1117 15:01:43.706058       1 mount_helper_common.go:142] Warning: "/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/7a4eca7d8c9df9b9e0413e1b5bbd97576d39606a4a243623c1a731615626e6b5/globalmount" is not a mountpoint, deleting
2023-11-17T15:01:43.706294724Z I1117 15:01:43.706182       1 nodeserver.go:269] NodeUnstageVolume: unmount volume samba-pumkt-esgdata-azure-id on /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/7a4eca7d8c9df9b9e0413e1b5bbd97576d39606a4a243623c1a731615626e6b5/globalmount successfully
2023-11-17T15:01:43.706319424Z I1117 15:01:43.706204       1 utils.go:83] GRPC response: {}

I noticed that there have been no new events since Friday.
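
For anyone triaging a similar log, here is a minimal sketch that tallies the `GRPC error` lines by the errno message they end with. The three messages are the ones visible in the excerpt above; the function name `tally_grpc_errors` is hypothetical, and the sample lines are copied from the log.

```python
import re
from collections import Counter

# Errno strings that appear at the end of GRPC error lines in the excerpt.
ERRNO_MESSAGES = (
    "resource temporarily unavailable",
    "resource deadlock avoided",
    "input/output error",
)

GRPC_ERROR = re.compile(r"GRPC error: (?P<desc>.*)$")


def tally_grpc_errors(lines):
    """Count GRPC error lines, keyed by the errno message they end with."""
    counts = Counter()
    for line in lines:
        match = GRPC_ERROR.search(line)
        if not match:
            continue  # informational line, not an error
        desc = match.group("desc").rstrip()
        for message in ERRNO_MESSAGES:
            if desc.endswith(message):
                counts[message] += 1
                break
        else:
            counts["other"] += 1
    return counts


sample = [
    "2023-11-17T03:06:31.468517716Z I1117 03:06:31.468451       1 utils.go:83] GRPC response: {}",
    "2023-11-17T04:44:49.242559854Z E1117 04:44:49.242235       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource temporarily unavailable",
    "2023-11-17T04:47:30.382783755Z E1117 04:47:30.382613       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to get metrics: failed to get FsInfo due to error resource deadlock avoided",
    "2023-11-17T05:10:54.777048023Z E1117 05:10:54.776873       1 utils.go:81] GRPC error: rpc error: code = Internal desc = failed to stat file /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: lstat /var/lib/kubelet/pods/218df537-b65c-4404-8fe2-b64007de2eb4/volumes/kubernetes.io~csi/pv-samba-cps/mount: input/output error",
]
for message, count in tally_grpc_errors(sample).items():
    print(f"{count:3d}  {message}")
```

Feeding it the full container log (for example, lines read from a saved log file) shows at a glance when the errors shift from one errno to another.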

One workaround I found: if I cordon the node and force delete the pod, it starts on another node, but only if that node did not have this share mounted previously.
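
The workaround above can be sketched as two `kubectl` invocations. This is only an illustration: the node and pod names below are placeholders rather than values from this cluster, and the helper `workaround_commands` is hypothetical. It builds the commands and prints them instead of running them.

```python
import shlex


def workaround_commands(node: str, pod: str, namespace: str) -> list[list[str]]:
    """Build the kubectl commands for the cordon + force-delete workaround."""
    return [
        # Stop new pods from being scheduled onto the affected node.
        ["kubectl", "cordon", node],
        # Remove the pod object without waiting for the kubelet to confirm termination.
        ["kubectl", "delete", "pod", pod, "--namespace", namespace,
         "--grace-period=0", "--force"],
    ]


# Placeholder names; substitute the real node, pod, and namespace.
for cmd in workaround_commands("example-node", "example-pod", "ns-uat"):
    print(shlex.join(cmd))
```

Force deletion only removes the pod from the API server; it does not clean up a stale mount on the node, which is consistent with the replacement pod starting only on a node that never had the share mounted.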

What you expected to happen:

I expected the SMB connection to recover once the file share was available again.

How to reproduce it:

  1. Create several PVs with a Windows DFS share as the source.
  2. Shut the DFS server down.
  3. Bring the DFS server back up.
  4. Try to restart a pod.
  5. Observe the old pod stuck in "Terminating" and the new one stuck in Init:0/2.

Anything else we need to know?:

Environment:

  • CSI Driver version: v1.11.0
  • Kubernetes version (use kubectl version): 1.26.3
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.2 LTS
  • Kernel (e.g. uname -a): 5.15.0-1041-azure
  • Install tools: helm
  • Others: Azure AKS
@andyzhangx
Member

This should be related to this bug: kubernetes/kubernetes#121851. Force deleting the pod could work around it.
I will port the fix from kubernetes/kubernetes#121851 to the CSI driver later.

@achaikaJH
Author

@andyzhangx
Thanks for the reply! Force deleting the pod didn't resolve the issue in my case, but that could be because a lot of pods have the same share mounted as different PVs. So I guess that if I force delete all of them, at some point all mounts will be released, which would allow the PVs to be remounted.
Any chance this fix will be available in the next release of the driver?
Thank you!

@andyzhangx
Member

@achaikaJH there is already a PR to fix it: #694
Can you try this image first to check whether it fixes the issue? andyzhangx/smb-csi:v1.14.0

@achaikaJH
Author

@andyzhangx
Thank you! Will try on my sandbox and post an update.

@achaikaJH
Author

achaikaJH commented Dec 14, 2023

Sorry it took me so long to respond.
I tried this image and it definitely makes things better:

  • pods no longer get stuck in "Terminating" forever
  • pods that do not catch the file system exception go into CrashLoopBackOff but recover after the shared resource is back
  • pods that catch the file system exception keep running until the resource is back up

However, recovery is not immediate. There is a gap of about 10 minutes between the time the resource is actually back up and the time the pods' mounts recover.
The 10 minutes is approximate, because my pods were probing the file system every 3 minutes. I repeated this several times and the gap stayed about the same.
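
A file system probe of the kind mentioned above could look roughly like the sketch below. This is not the probe used in these pods; it is a hypothetical helper, `probe_mount`, that stats a path in a worker thread so that a hung SMB mount shows up as a timeout instead of blocking the caller.

```python
import errno
import os
import threading


def probe_mount(path: str, timeout: float = 5.0) -> str:
    """Return "ok", an errno name such as "EIO", or "timeout" for a mount path."""
    result = {}

    def _stat():
        try:
            os.stat(path)
            result["status"] = "ok"
        except OSError as exc:
            # Translate the numeric errno into its symbolic name where known.
            result["status"] = errno.errorcode.get(exc.errno, str(exc.errno))

    # Daemon thread: a stat stuck on a dead share must not keep the caller waiting.
    worker = threading.Thread(target=_stat, daemon=True)
    worker.start()
    worker.join(timeout)
    return result.get("status", "timeout")


print(probe_mount(os.getcwd()))
```

Logging the result with a timestamp at a shorter interval than 3 minutes would narrow down the actual recovery delay. Note that the worker thread itself may remain blocked for as long as the kernel call hangs; the sketch only keeps the caller responsive.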

Thank you for your help with this issue!
