Support Data Repository Associations for Lustre 2.12 or newer filesystems (e.g. PERSISTENT_2 deployment type)
#368
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: everpeace
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from d7260f4 to c268728.
/retest pull-aws-fsx-csi-driver-e2e
@everpeace: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-aws-fsx-csi-driver-e2e
Force-pushed from c268728 to bf8d4ac.
pkg/cloud/cloud.go (outdated)
@@ -62,6 +64,10 @@ var (
 	// disks are found with the same volume name.
 	ErrMultiFileSystems = errors.New("Multiple filesystems with same ID")
 
+	// ErrMultiAssociations is an error that is returned when multiple
+	// associations are found with the same volume name.
If I understand correctly, this would be if there are multiple DRAs with the same association id, not multiple associations with the same volume.
Thanks, fixed in bba7484.
Force-pushed from bf8d4ac to 04efd39.
Force-pushed from 96129a4 to 3120d26.
@jacobwolfaws Thank you for the quick review. I addressed your feedback. PTAL 🙇
@@ -65,6 +65,7 @@ controller:
   - effect: NoExecute
     operator: Exists
     tolerationSeconds: 300
+  provisionerTimeout: 5m
Should we make the default provisioner timeout longer in the helm chart? It often takes more time to prepare an FSx filesystem when it has data repository associations.
A single FSx for Lustre filesystem can have up to 8 data repository associations.
In my experience, it usually takes around 7-10 minutes for a single data repository association to become available, even for an empty S3 bucket.
Moreover, setting up data repository associations on a given filesystem appears to happen sequentially.
So I think 90 min = 10 min x 8 (data repository associations) + 5 min (FSx filesystem) + <buffer> would be safe, because the current CreateVolume operation is synchronous and is not safe when a timeout happens.
What do you think?
I think keeping the default timer the same + clearly documenting the need to change the timeout if using DRAs would be the correct move. This ensures consistent behavior for users who aren't using DRAs. Extending it is a one way door (because reducing the timeout would break compatibility for users who are using a large number of DRAs).
> Extending it is a one way door (because reducing the timeout would break compatibility for users who are using a large number of DRAs)

It makes sense.

> keeping the default timer the same + clearly documenting the need to change the timeout if using DRAs would be the correct move

OK, let me add the documentation.
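As a minimal sketch of the kind of override that documentation might describe (assuming the `controller.provisionerTimeout` key from the values.yaml diff above; the 90m figure is just the estimate from the earlier comment, not a value this PR sets):

```yaml
# Hypothetical Helm values override for clusters that use Data Repository Associations.
# 90m follows the rough estimate above: ~10 min x 8 DRAs + ~5 min for the filesystem + buffer.
controller:
  provisionerTimeout: 90m
```

It could be applied with `helm upgrade --install ... -f <override-file>` or the equivalent `--set controller.provisionerTimeout=90m`.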
I'm not sure about the information for users; it seems like users using DRAs will still be fine in most cases:
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file#csi-error-and-timeout-handling
The CreateVolume call will time out, and subsequent calls will be retried with exponential backoff. It's only in the case of a large number of DRAs that this becomes an issue.
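For context on where that timeout lives, here is a sketch of the csi-provisioner sidecar args, assuming the chart wires `controller.provisionerTimeout` into the external-provisioner's `--timeout` flag (that mapping is my assumption; the flag names come from the external-provisioner docs, and the values are examples only):

```yaml
# Hypothetical excerpt of the controller Deployment's csi-provisioner sidecar.
- name: csi-provisioner
  image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.2  # tag is illustrative
  args:
    - --csi-address=$(ADDRESS)
    - --timeout=5m                # per-call CreateVolume/DeleteVolume deadline
    - --retry-interval-start=1s   # first retry delay; doubles after each failure...
    - --retry-interval-max=5m     # ...up to this cap (the exponential backoff mentioned above)
  env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi.sock
```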
/test pull-aws-fsx-csi-driver-e2e
Force-pushed from 063cd00 to 4d05480.
…sing Data Repository Associations
@@ -0,0 +1 @@
+CONTROLLER_PROVISIONER_TIMEOUT=5m
What's the value of creating a separate file for this vs. putting it in the values.yaml:
https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/charts/aws-fsx-csi-driver/values.yaml#L42-L67
This file is only for the kustomize manifests; values.yaml is dedicated to the helm chart. I understand this driver supports both kustomize and helm.
In kustomize, injecting a parameter when building manifests requires a bit of a hack. This env file is needed so that kustomize users can change the timeout value. I also updated install.md as below:
https://github.com/everpeace/aws-fsx-csi-driver/blob/suppor-dra/docs/install.md#deploy-driver
# To set the CSI controller's provisioner timeout,
# please follow these instructions:
$ cd $(mktemp -d)
$ kustomize init
$ kustomize edit add resource "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.1"
$ kustomize edit add configmap fsx-csi-controller --from-literal=CONTROLLER_PROVISIONER_TIMEOUT=30m --behavior=merge
$ kubectl apply -k .
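For reference, a sketch of the kustomization.yaml those commands would leave in the temp directory (the layout follows standard `kustomize edit` output; only the timeout literal differs from the defaults):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.1
configMapGenerator:
  - name: fsx-csi-controller
    behavior: merge  # overlay the base ConfigMap instead of creating a new one
    literals:
      - CONTROLLER_PROVISIONER_TIMEOUT=30m
```

`kubectl apply -k .` then renders the overlay with the merged ConfigMap value.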
I think we should avoid hacks when possible and this seems like an avoidable instance. If users want to configure their kustomize templates, they can download them, configure them, and deploy them freely. We should follow precedent in terms of implementation, which is to put it in the values.yaml.
 	// target file system values
-	PollCheckTimeout = 10 * time.Minute
+	PollCheckTimeout = 15 * time.Minute
If provisionerTimeout < PollCheckTimeout, the provisionerTimeout will always kill the CreateVolume call before the PollCheckTimeout is hit. I don't think incrementing this should make a difference.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
This has been up for a while, sorry. Going to freeze this PR for now; it seems like there are some open comments and design decisions to be made here.
/lifecycle frozen
@jacobwolfaws: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Seems like I can't freeze a PR :(
/remove-lifecycle rotten
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Is this a bug fix or adding new feature?
new feature
fixes #367
What is this PR about? / Why do we need it?
This PR supports Data Repository Associations (API reference) for Lustre 2.12 or newer filesystems (e.g. PERSISTENT_2 deployment type) like below:
What testing is done?