diff --git a/keps/prod-readiness/sig-storage/3314.yaml b/keps/prod-readiness/sig-storage/3314.yaml
new file mode 100644
index 00000000000..02c23e823fb
--- /dev/null
+++ b/keps/prod-readiness/sig-storage/3314.yaml
@@ -0,0 +1,6 @@
+# The KEP must have an approver from the
+# "prod-readiness-approvers" group
+# of http://git.k8s.io/enhancements/OWNERS_ALIASES
+kep-number: 3314
+alpha:
+  approver: "@johnbelamaric"
diff --git a/keps/sig-storage/3314-csi-volume-snapshot-delta/README.md b/keps/sig-storage/3314-csi-volume-snapshot-delta/README.md
new file mode 100644
index 00000000000..2ca63792270
--- /dev/null
+++ b/keps/sig-storage/3314-csi-volume-snapshot-delta/README.md
@@ -0,0 +1,1480 @@
+
+# KEP-3314: Changed Block Tracking With CSI VolumeSnapshotDelta
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Risks and Mitigations](#risks-and-mitigations)
+    - [Kubernetes API Server - flow control and response latency](#kubernetes-api-server---flow-control-and-response-latency)
+    - [Aggregated API Server - denial of service](#aggregated-api-server---denial-of-service)
+    - [APIService Resource - CA bundle expiry](#apiservice-resource---ca-bundle-expiry)
+- [Design Details](#design-details)
+  - [CBT Request/Response Flow](#cbt-requestresponse-flow)
+  - [High Availability Mode](#high-availability-mode)
+  - [API Specification](#api-specification)
+    - [VolumeSnapshotDelta Resource](#volumesnapshotdelta-resource)
+    - [VolumeSnapshotDeltaServiceConfiguration Resource](#volumesnapshotdeltaserviceconfiguration-resource)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Design details are appropriately documented
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [x] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+Changed block tracking (CBT) techniques have been used by backup systems to
+efficiently back up large amounts of data in block volumes. They identify
+block-level changes between an arbitrary pair of snapshots of the same block
+volume, and selectively capture only what has changed between the two
+checkpoints. This type of differential backup is much more efficient than
+backing up the entire volume. This KEP proposes a new CSI API that can be used
+to identify the list of changed blocks between a pair of CSI volume snapshots.
+
+## Motivation
+
+Efficient and reliable backup of data is intrinsic to production-grade backup
+systems. Since the majority of the data in a volume does not change between
+backups, being able to identify and back up only what has changed can
+drastically reduce backup bottlenecks and streamline the user's data protection
+workflows.
+
+Many storage providers already have the ability to detect block-level changes
+for efficient data backup and restoration. This KEP proposes a design to extend
+the Kubernetes CSI framework to utilize these CBT features to bring efficient,
+cloud-native data protection to Kubernetes users.
+
+### Goals
+
+* Provide a secure, idiomatic CSI API to efficiently identify changes between
+an arbitrary pair of CSI volume snapshots of the same block volume.
+* The API can efficiently and reliably relay large amounts of changed block data
+from the storage provider back to the user, without exhausting cluster resources
+or introducing flaky resource spikes and leaks.
+* The blast radius of component failure should be sufficiently isolated from the
+rest of the cluster.
+* This API remains an optional component of the CSI framework. Storage providers
+can opt in to expose their CBT functionality to Kubernetes via this new API.
+* Provide CBT support for both block mode and file system mode (backed by
+block volumes) persistent volumes.
+
+### Non-Goals
+
+* Retrieval of the actual data blocks is outside the scope of this KEP. The
+proposed API returns only the metadata of the changed blocks.
+* Changed-list support for file storage systems is slated for a future KEP.
+
+## Proposal
+
+This KEP introduces two new custom resources called `VolumeSnapshotDelta` and
+`VolumeSnapshotDeltaServiceConfiguration` to the CSI framework.
+
+The `VolumeSnapshotDelta` resource abstracts away the details around interacting
+with the storage providers' CBT endpoints. Essentially, this new API allows a
+Kubernetes user to say,
+
+> Find all the data blocks that have changed between these two snapshots.
+
+**Note that the proposed API is used to retrieve the changed blocks metadata
+only. Retrieval of the actual data blocks is out of the scope of this KEP.**
+
+The new component that serves the `VolumeSnapshotDelta` API must be able to
+handle large amounts of data returned by the storage providers, without
+contending with or starving the rest of the cluster. Specifically, the CBT
+datapath should not burden the Kubernetes etcd with heavy IOPS.
+
+The `VolumeSnapshotDelta` API is owned and handled by an [aggregated API
+server][0]. This aggregation extension mechanism provides more control over the
+registry and storage implementation than a [custom resource controller][11].
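With aggregation, clients reach the `VolumeSnapshotDelta` API through the ordinary Kubernetes REST path of its group version, and the Kubernetes API server proxies that path to the aggregated API server. The sketch below shows how such a namespaced collection path is composed. It assumes the `cbt.storage.k8s.io/v1alpha1` group version and the plural resource name `volumesnapshotdeltas` used elsewhere in this KEP. The helper name is hypothetical.

```go
package main

import "path"

// Group, version and plural resource name used throughout this KEP.
const (
	cbtGroup    = "cbt.storage.k8s.io"
	cbtVersion  = "v1alpha1"
	cbtResource = "volumesnapshotdeltas"
)

// volumeSnapshotDeltaPath is a hypothetical helper returning the namespaced
// collection path that a client POSTs a VolumeSnapshotDelta to. The Kubernetes
// API server proxies requests under /apis/<group>/<version>/ to the aggregated
// API server registered for that group version.
func volumeSnapshotDeltaPath(namespace string) string {
	return path.Join("/apis", cbtGroup, cbtVersion, "namespaces", namespace, cbtResource)
}
```

For the `default` namespace this yields `/apis/cbt.storage.k8s.io/v1alpha1/namespaces/default/volumesnapshotdeltas`. Clients therefore use the same discovery, authentication and RBAC machinery as for built-in resources.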
+
+The Kubernetes API server and [metrics server][12] are the two main
+implementations that inspired this aggregated extension design, recognizing that
+the CBT payloads are no larger (in size and bandwidth) than the metrics or [pod
+logs][13] served by these two API servers.
+
+The `VolumeSnapshotDeltaServiceConfiguration` resource is implemented as a
+[Custom Resource Definition (CRD)][2], owned and handled by a new custom
+resource controller. The aggregated API server uses this resource to discover
+CSI drivers that support CBT functionality. This new resource is introduced,
+and preferred over the existing `CSIDriver` resource, because the latter is used
+mainly for volume properties needed by the kubelet at the node level. Also,
+since CBT is an out-of-tree feature, it makes sense not to couple it to the
+in-tree `CSIDriver` resource.
+
+### Risks and Mitigations
+
+#### Kubernetes API Server - flow control and response latency
+
+The proposed aggregation extension model relies on the Kubernetes API server to
+proxy all the CBT requests to the aggregated API server. To protect the
+Kubernetes API server from being overwhelmed by the CBT payloads, flow control
+policies enforced by Kubernetes [API Priority and Fairness][4] will be bundled
+with the CBT deployment manifest, with the default priority level set to
+[`workload-low`][5].
+
+To ensure [low latency between the Kubernetes API server and the aggregated API
+server][6]:
+
+* The aggregated API server will enforce overridable pagination behaviour, such
+as limiting the response payload to 256 CBT entries
+* The deployment manifests will include configurable pod affinity, taint and
+toleration properties to provide control over scheduling and placement of the
+aggregated API server
+
+The latency incurred by discovery requests is expected to be insignificant,
+as the aggregated API server exposes only one API, to serve the
+`VolumeSnapshotDelta` resource.
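The overridable pagination limit described above could be enforced by clamping the `limit` requested in a `VolumeSnapshotDelta` spec to a server-side maximum. The sketch below assumes three things: a hypothetical `maxEntries` setting that operators can override, a default of 256, and that a zero `limit` means "use the server maximum".

```go
package main

// defaultMaxEntries mirrors the example cap of 256 CBT entries per response.
const defaultMaxEntries uint64 = 256

// effectiveLimit is a hypothetical helper returning the number of CBT entries
// the aggregated API server would request for one page. A zero request means
// "no preference"; anything above the server-side cap is reduced to the cap.
// A zero cap (no operator override) falls back to the default.
func effectiveLimit(requested, maxEntries uint64) uint64 {
	if maxEntries == 0 {
		maxEntries = defaultMaxEntries
	}
	if requested == 0 || requested > maxEntries {
		return maxEntries
	}
	return requested
}
```

Clamping oversized requests, rather than rejecting them, keeps existing clients working when an operator lowers the cap. Clients still learn the applied value from `status.limit`.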
+
+#### Aggregated API Server - denial of service
+
+The CBT aggregated API server must be protected from a series of continuous
+expensive requests, e.g., from rogue retries or malicious DoS attempts targeting
+the storage provider's backend CBT endpoints.
+
+To mitigate this, the aggregated API server can enforce server-side rate
+limiting using constructs found in the `k8s.io/apiserver/pkg/util/flowcontrol`
+package and its sub-packages.
+
+A caching layer to serve subsequent requests with the same input parameters can
+also be considered. Given the complexity associated with caching, more design
+consideration will be needed to determine the cache invalidation period,
+persistence mechanism, cache key scheme, etc.
+
+#### APIService Resource - CA bundle expiry
+
+The [`APIService` resource][8] provides information on the in-cluster target
+`Service` resource that fronts the aggregated API server. The `spec.caBundle`
+property defines the PEM bundle needed to establish TLS trust between the
+Kubernetes API server and the service. If the bundle expires, communication
+between the Kubernetes API server and the aggregated API server can be disrupted.
+
+This KEP deems bundling certificate management tools with the CBT components to
+be out of scope. Users should incorporate the management of this CA bundle into
+their overall cluster PKI strategy.
+
+## Design Details
+
+The proposed design involves extending CSI with the `VolumeSnapshotDelta` and
+`VolumeSnapshotDeltaServiceConfiguration` custom resources. Storage providers
+can opt in to support this feature by implementing the
+`LIST_BLOCK_SNAPSHOT_DELTAS` capability in their CSI drivers.
+
+The `VolumeSnapshotDelta` resource is a namespace-scoped resource. It must be
+created in the same namespace as the base and target CSI `VolumeSnapshot`s. On
+the other hand, the `VolumeSnapshotDeltaServiceConfiguration` resource will be
+implemented as a cluster-scoped resource.
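`VolumeSnapshotDelta` is namespace-scoped and must live alongside both snapshots it compares. As a result, the aggregated API server can resolve both snapshot names in the request's own namespace. The sketch below illustrates that lookup rule. A hypothetical in-memory index keyed by namespace and name stands in for a real `VolumeSnapshot` lister, and the function name is also illustrative.

```go
package main

import "fmt"

// snapshotKey identifies a namespaced VolumeSnapshot.
type snapshotKey struct {
	Namespace, Name string
}

// validateSnapshotRefs checks that the target snapshot (required) and the
// optional base snapshot both exist in the VolumeSnapshotDelta's namespace.
// existing is a hypothetical stand-in for a VolumeSnapshot lister.
func validateSnapshotRefs(namespace, from, to string, existing map[snapshotKey]bool) error {
	if to == "" {
		return fmt.Errorf("spec.toVolumeSnapshotName is required")
	}
	names := []string{to}
	if from != "" {
		names = append(names, from)
	}
	for _, name := range names {
		// Lookups never leave the request's namespace, so a snapshot in
		// another namespace is treated as missing.
		if !existing[snapshotKey{namespace, name}] {
			return fmt.Errorf("VolumeSnapshot %s/%s not found", namespace, name)
		}
	}
	return nil
}
```

Scoping every lookup to the request namespace means that RBAC on `volumesnapshotdeltas` in a namespace cannot be used to probe snapshots in other namespaces.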
+
+### CBT Request/Response Flow
+
+CSI CBT is made up of three components:
+
+* An aggregated API server to serve CBT requests, initiated by the creation of
+`VolumeSnapshotDelta` resources
+* A CRD controller that watches and reconciles
+`VolumeSnapshotDeltaServiceConfiguration` resources
+* A [CSI CBT sidecar][9] that accepts CBT requests from the CBT aggregated API
+server and converts them into CSI RPC calls
+
+![CBT Step 1](./img/cbt-step-01.png)
+
+When CSI CBT is deployed on a Kubernetes cluster, the `cbt-aggapi` aggregated
+API server registers itself with the `kube-aggregator` to claim the URL path
+of the `cbt.storage.k8s.io/v1alpha1` group version, as defined in the
+`v1alpha1.cbt.storage.k8s.io` `APIService` resource.
+
+The `svc-cfg` CRD controller starts a reconciler to watch and reconcile
+`VolumeSnapshotDeltaServiceConfiguration` resources. When installing CSI CBT,
+the installation artifacts (Helm charts, Kustomize manifests, etc.) will include
+the YAML manifests of the `VolumeSnapshotDeltaServiceConfiguration` as well as
+the `Service` resource that it references.
+
+A CBT-enabled CSI driver needs to embed the `cbt-svc` sidecar in its pod to be
+able to accept CBT requests from `cbt-aggapi` and convert them into CSI RPC calls.
+
+With CSI CBT in place, a user can initiate the CBT workflow by creating a
+new `VolumeSnapshotDelta` resource:
+
+![CBT Step 2](./img/cbt-step-02.png)
+
+Following the [storage implementation pattern][3] of the `authorization.k8s.io`
+group, a `VolumeSnapshotDelta` resource is treated as a "virtual resource" (like
+`SubjectAccessReview`): it is created without being persisted in the
+Kubernetes etcd.
+
+The `cbt-aggapi` depends on the Kubernetes API server to authenticate and
+authorize the new request. For more information on how this delegation works,
+see the aggregated API server authentication flow [documentation][10].
+
+To fulfill the CBT request associated with this new `VolumeSnapshotDelta`
+resource, the `cbt-aggapi` will need to retrieve:
+
+* The `VolumeSnapshot` resources referenced by the `spec.fromVolumeSnapshotName`
+and `spec.toVolumeSnapshotName` properties,
+* The bound `VolumeSnapshotContent` resources referenced by the
+`status.boundVolumeSnapshotContentName` properties of the `VolumeSnapshot`
+resources,
+* The `VolumeSnapshotDeltaServiceConfiguration` resource matching the
+`spec.driver` of the bound `VolumeSnapshotContent` resources
+
+![CBT Step 3](./img/cbt-step-03.png)
+
+If any of these resources don't exist in the cluster, the `cbt-aggapi` service
+fails the request.
+
+The snapshot handles (i.e. snapshot IDs) from the `VolumeSnapshotContent`
+resources, along with the pagination parameters found in the
+`VolumeSnapshotDelta` resource, are sent to the `cbt-svc` sidecar in the CSI
+driver as a JSON payload over HTTP:
+
+![CBT Step 4](./img/cbt-step-04.png)
+
+The `VolumeSnapshotDeltaServiceConfiguration` resource has the endpoint
+information of the `Service` resource that fronts the `cbt-svc` sidecar in its
+`clientConfig` property.
+
+The `cbt-svc` sidecar then issues a gRPC call to the storage provider's
+`csi-plugin` container over a local Unix domain socket.
+
+The `csi-plugin` is responsible for invoking the storage provider's
+backend CBT endpoints to fulfill the request. It also takes care of
+authentication with the storage provider's backend, freeing CSI CBT from this
+concern.
+
+The CBT entries are then returned to the user as a JSON payload, through the
+`cbt-svc` sidecar and then the `cbt-aggapi` aggregated API server, without
+persisting any of them in the Kubernetes etcd.
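The lookup chain and the hand-off to the sidecar can be sketched as follows. This KEP does not specify the JSON body sent to `cbt-svc`, so the `sidecarRequest` field names below are assumptions. The snapshot types are also assumptions: they are simplified views that keep only `status.boundVolumeSnapshotContentName`, `spec.driver` and `status.snapshotHandle`. In-memory maps keyed by name stand in for namespaced listers, and the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical views of the objects cbt-aggapi reads.
type volumeSnapshot struct {
	BoundContentName string // status.boundVolumeSnapshotContentName
}

type volumeSnapshotContent struct {
	Driver         string // spec.driver
	SnapshotHandle string // status.snapshotHandle
}

// sidecarRequest is an assumed JSON body for the cbt-svc sidecar.
type sidecarRequest struct {
	FromSnapshotID string `json:"fromSnapshotId,omitempty"`
	ToSnapshotID   string `json:"toSnapshotId"`
	Limit          uint64 `json:"limit"`
	OffsetBytes    uint64 `json:"offsetBytes"`
}

// resolveContent follows VolumeSnapshot -> bound VolumeSnapshotContent.
func resolveContent(name string, snaps map[string]volumeSnapshot,
	contents map[string]volumeSnapshotContent) (volumeSnapshotContent, error) {
	vs, ok := snaps[name]
	if !ok {
		return volumeSnapshotContent{}, fmt.Errorf("VolumeSnapshot %q not found", name)
	}
	vsc, ok := contents[vs.BoundContentName]
	if !ok {
		return volumeSnapshotContent{}, fmt.Errorf("VolumeSnapshotContent %q not found", vs.BoundContentName)
	}
	return vsc, nil
}

// buildSidecarRequest resolves both snapshots, checks that a service
// configuration exists for the driver, and encodes the request body.
func buildSidecarRequest(from, to string, limit, offset uint64,
	snaps map[string]volumeSnapshot, contents map[string]volumeSnapshotContent,
	svcConfigs map[string]bool) ([]byte, error) {
	target, err := resolveContent(to, snaps, contents)
	if err != nil {
		return nil, err
	}
	req := sidecarRequest{ToSnapshotID: target.SnapshotHandle, Limit: limit, OffsetBytes: offset}
	if from != "" {
		base, err := resolveContent(from, snaps, contents)
		if err != nil {
			return nil, err
		}
		req.FromSnapshotID = base.SnapshotHandle
	}
	if !svcConfigs[target.Driver] {
		return nil, fmt.Errorf("no VolumeSnapshotDeltaServiceConfiguration for driver %q", target.Driver)
	}
	return json.Marshal(req)
}
```

Any missing link in the chain fails the whole request before anything reaches the CSI driver. This matches the failure behaviour described above.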
+
+The CBT entries are appended to the `status` of the original
+`VolumeSnapshotDelta` resource:
+
+```json
+{
+  "kind": "VolumeSnapshotDelta",
+  "apiVersion": "cbt.storage.k8s.io/v1alpha1",
+  "metadata": {
+    "name": "test-delta",
+    "namespace": "default",
+    "creationTimestamp": null
+  },
+  "spec": {
+    "fromVolumeSnapshotName": "vol-snap-base",
+    "toVolumeSnapshotName": "vol-snap-target",
+    "limit": 256,
+    "offsetBytes": 0
+  },
+  "status": {
+    "limit": 256,
+    "offsetBytes": 0,
+    "continue": 3,
+    "changedBlocks": [
+      {
+        "offsetBytes": 0,
+        "blockSizeBytes": 524288,
+        "dataToken": {
+          "token": "ieEEQ9Bj7E6XR",
+          "issuanceTime": "2022-07-13T03:19:30Z",
+          "ttl": "3h0m0s"
+        }
+      },
+      {
+        "offsetBytes": 524288,
+        "blockSizeBytes": 524288,
+        "dataToken": {
+          "token": "widvSdPYZCyLB",
+          "issuanceTime": "2022-07-13T03:19:30Z",
+          "ttl": "3h0m0s"
+        }
+      },
+      {
+        "offsetBytes": 1048576,
+        "blockSizeBytes": 524288,
+        "dataToken": {
+          "token": "VtSebH83xYzvB",
+          "issuanceTime": "2022-07-13T03:19:30Z",
+          "ttl": "3h0m0s"
+        }
+      }
+    ]
+  }
+}
+```
+
+Any pagination parameters from the storage provider needed to fetch additional
+data will also be included in the response payload to the user. The user is
+responsible for coordinating subsequent paginated requests, including managing
+the pagination session and recovering from interruptions and partitions.
+
+### High Availability Mode
+
+To ensure high availability, CSI CBT can be scaled to run multiple replicas of
+the `cbt-aggapi` aggregated API server.
+
+In setups where there may be multiple replicas of the CSI driver, an
+active/passive leader election process will be used to elect a single leader
+instance of the `cbt-svc` sidecar, while idling the other non-leader instances.
+Non-leader instances will voluntarily fail their readiness probes to remove
+themselves from the `Service`'s request path.
+
+### API Specification
+
+#### VolumeSnapshotDelta Resource
+
+This section describes the specification of the `VolumeSnapshotDelta` resource:
+
+```go
+// VolumeSnapshotDelta represents a VolumeSnapshotDelta resource.
+type VolumeSnapshotDelta struct {
+	metav1.TypeMeta `json:",inline"`
+
+	// Standard object's metadata.
+	// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
+	// +optional
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	// spec defines the desired characteristics of a snapshot delta requested by a user.
+	// Required.
+	Spec VolumeSnapshotDeltaSpec `json:"spec"`
+
+	// status represents the current information of a snapshot delta.
+	// +optional
+	Status VolumeSnapshotDeltaStatus `json:"status,omitempty"`
+}
+
+// VolumeSnapshotDeltaSpec is the spec of a VolumeSnapshotDelta resource.
+type VolumeSnapshotDeltaSpec struct {
+	// The name of the base CSI volume snapshot to use for comparison.
+	// If not specified, return the delta between the first snapshot
+	// since volume creation and the target snapshot.
+	// +optional
+	FromVolumeSnapshotName string `json:"fromVolumeSnapshotName,omitempty"`
+
+	// The name of the target CSI volume snapshot to use for comparison.
+	// Required.
+	ToVolumeSnapshotName string `json:"toVolumeSnapshotName"`
+
+	// VolumeSnapshotDeltaSecretRef is a reference to the secret object containing
+	// sensitive information to pass to the CSI driver to complete the CSI
+	// calls for VolumeSnapshotDelta.
+	// This field is optional, and may be empty if no secret is required. If the
+	// secret object contains more than one secret, all secrets are passed.
+	// +optional
+	VolumeSnapshotDeltaSecretRef *SecretReference `json:"volumeSnapshotDeltaSecretRef,omitempty"`
+
+	// Defines the maximum number of entries to return in the response.
+	Limit uint64 `json:"limit"`
+
+	// Defines the start of the block index (in bytes).
+	OffsetBytes uint64 `json:"offsetBytes"`
+}
+
+// VolumeSnapshotDeltaStatus is the status for a VolumeSnapshotDelta resource.
+type VolumeSnapshotDeltaStatus struct {
+	// The list of CBT entries.
+	ChangedBlocks []*ChangedBlock `json:"changedBlocks,omitempty"`
+
+	// Captures any error encountered.
+	Error string `json:"error,omitempty"`
+
+	// A very brief description to communicate the current state of the CBT
+	// operation.
+	State VolumeSnapshotDeltaState `json:"state"`
+
+	// The limit defined in the request.
+	Limit uint64 `json:"limit"`
+
+	// The offset (in bytes) defined in the request.
+	OffsetBytes uint64 `json:"offsetBytes"`
+
+	// The starting block index of the next request.
+	Continue uint64 `json:"continue"`
+}
+
+// ChangedBlock represents a CBT entry returned by the storage provider.
+type ChangedBlock struct {
+	// OffsetBytes defines the start of the block index in the response.
+	OffsetBytes uint64 `json:"offsetBytes"`
+
+	// The size of the block in bytes.
+	BlockSizeBytes uint64 `json:"blockSizeBytes"`
+
+	// The optional token used to retrieve the actual data block at the given
+	// offset.
+	DataToken *DataToken `json:"dataToken,omitempty"`
+}
+
+// DataToken carries the token used to retrieve the actual data block at a
+// given offset.
+type DataToken struct {
+	// The token to use to retrieve the data block.
+	Token string `json:"token"`
+
+	// Time when the token was issued.
+	IssuanceTime metav1.Time `json:"issuanceTime"`
+
+	// The TTL of the token. The token expires at IssuanceTime + TTL.
+	TTL metav1.Duration `json:"ttl"`
+}
+
+type VolumeSnapshotDeltaState string
+
+const (
+	// Successfully retrieved chunks of CBT entries starting at offset, and ending
+	// at offset + limit, with no more data left.
+	Completed VolumeSnapshotDeltaState = "completed"
+
+	// Similar to `Completed`, but with more data available.
+	Continue VolumeSnapshotDeltaState = "continue"
+
+	// Something went wrong while retrieving the CBT entries.
+	// `status.error` should have the error message.
+	Failed VolumeSnapshotDeltaState = "failed"
+)
+
+// VolumeSnapshotDeltaList is a list of VolumeSnapshotDelta resources.
+type VolumeSnapshotDeltaList struct {
+	metav1.TypeMeta `json:",inline"`
+
+	// +optional
+	metav1.ListMeta `json:"metadata"`
+
+	// List of VolumeSnapshotDeltas.
+	Items []VolumeSnapshotDelta `json:"items"`
+}
+```
+
+The corresponding CSI RPC and message definitions are as follows:
+
+```proto
+syntax = "proto3";
+
+import "google/protobuf/timestamp.proto";
+import "google/protobuf/duration.proto";
+
+// To be added to the Controller service.
+service Controller {
+  // Existing Controller RPCs are omitted.
+
+  rpc ListSnapshotDeltas(ListSnapshotDeltasRequest)
+    returns (ListSnapshotDeltasResponse) {
+      option (alpha_method) = true;
+    }
+}
+
+// List the deltas between two snapshots on the storage system,
+// regardless of how they were created.
+message ListSnapshotDeltasRequest {
+  option (alpha_message) = true;
+
+  // The ID of the base snapshot handle to use for comparison. If
+  // not specified, return all changed blocks up to the target
+  // specified by `to_snapshot_id`. This field is OPTIONAL.
+  string from_snapshot_id = 1;
+
+  // The ID of the target snapshot handle to use for comparison. If
+  // not specified, an error is returned. This field is REQUIRED.
+  string to_snapshot_id = 2;
+
+  // Secrets required by the plugin to complete the list snapshot
+  // deltas request.
+  // This field is OPTIONAL. Refer to the `Secrets Requirements`
+  // section on how to use this field.
+  map<string, string> secrets = 3 [(csi_secret) = true];
+
+  // If specified (non-zero value), the Plugin MUST NOT return more
+  // entries than this number in the response. If the actual number of
+  // entries is more than this number, the Plugin MUST set `next_token`
+  // in the response which can be used to get the next page of entries
+  // in the subsequent `ListSnapshotDeltas` call. This field is
+  // OPTIONAL. If not specified (zero value), it means there is no
+  // restriction on the number of entries that can be returned.
+  // The value of this field MUST NOT be negative.
+  int32 max_entries = 4;
+
+  // A token to specify where to start paginating. Set this field to
+  // `next_token` returned by a previous `ListSnapshotDeltas` call to
+  // get the next page of entries. This field is OPTIONAL.
+  // An empty string is equal to an unspecified field value.
+  string starting_token = 5;
+}
+
+message ListSnapshotDeltasResponse {
+  option (alpha_message) = true;
+
+  // The volume size in bytes. This field is OPTIONAL.
+  uint64 volume_size_bytes = 1;
+
+  // This token allows you to get the next page of entries for
+  // `ListSnapshotDeltas` request. If the number of entries is larger
+  // than `max_entries`, use the `next_token` as a value for the
+  // `starting_token` field in the next `ListSnapshotDeltas` request.
+  // This field is OPTIONAL.
+  // An empty string is equal to an unspecified field value.
+  string next_token = 2;
+
+  // Changed block deltas between the base and target snapshots. An
+  // empty list means there is no difference between the two. Leave
+  // unspecified if the volume isn't of block type. This field is
+  // OPTIONAL.
+  repeated BlockSnapshotChangedBlock changed_blocks = 3;
+}
+
+message BlockSnapshotChangedBlock {
+  option (alpha_message) = true;
+
+  // The block logical offset on the volume. This field is REQUIRED.
+  uint64 offset_bytes = 1;
+
+  // The size of the block in bytes. This field is REQUIRED.
+  uint64 block_size_bytes = 2;
+
+  // The token and other information needed to retrieve the actual
+  // data block at the given offset. If the provider doesn't support
+  // token-based data block retrieval, this should be left
+  // unspecified. This field is OPTIONAL.
+  BlockSnapshotChangedBlockToken token = 3;
+}
+
+message BlockSnapshotChangedBlockToken {
+  option (alpha_message) = true;
+
+  // The token to use to retrieve the actual data block at the given
+  // offset. This field is REQUIRED.
+  string token = 1;
+
+  // Timestamp when the token is issued. This field is REQUIRED.
+  .google.protobuf.Timestamp issuance_time = 2;
+
+  // The TTL of the token in seconds. The expiry time is calculated by
+  // adding this value to the time of issuance. This field is
+  // REQUIRED.
+  int32 ttl_seconds = 3;
+}
+```
+
+#### VolumeSnapshotDeltaServiceConfiguration Resource
+
+This section describes the specification of the
+`VolumeSnapshotDeltaServiceConfiguration` resource:
+
+```go
+// VolumeSnapshotDeltaServiceConfiguration represents the CBT service
+// configuration resource.
+type VolumeSnapshotDeltaServiceConfiguration struct {
+	metav1.TypeMeta `json:",inline"`
+
+	// Standard object's metadata.
+	// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
+	// +optional
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	// Name of the CSI driver.
+	DriverName string `json:"driverName"`
+
+	// Service endpoint configuration of the CBT CSI driver. The CBT aggregated
+	// API server sends the CBT request to this service.
+	ClientConfig ClientConfig `json:"clientConfig"`
+}
+
+// ClientConfig contains the service endpoint configuration of the CBT CSI
+// driver. The CBT aggregated API server sends the CBT request to this service.
+type ClientConfig struct {
+	// `url` gives the location of the CBT-enabled CSI driver, in standard URL
+	// form (`scheme://host:port/path`). Exactly one of `url` or `service`
+	// must be specified.
+	//
+	// The `host` should not refer to a service running in the cluster; use
+	// the `service` field instead. The host might be resolved via external
+	// DNS in some API servers (e.g., `kube-apiserver` cannot resolve
+	// in-cluster DNS as that would be a layering violation). `host` may
+	// also be an IP address.
+	//
+	// The scheme must be "https"; the URL must begin with "https://".
+	// +optional
+	URL *string `json:"url,omitempty"`
+
+	// `service` is a reference to the service of the CSI driver. Exactly one of
+	// `service` or `url` must be specified.
+	//
+	// If the CBT-enabled CSI driver is running within the cluster, then you should use `service`.
+	// +optional
+	Service *ServiceReference `json:"service,omitempty"`
+
+	// `caBundle` is a PEM encoded CA bundle which will be used to validate the
+	// CSI driver's server certificate.
+	// +optional
+	CABundle []byte `json:"caBundle,omitempty"`
+}
+
+// ServiceReference holds a reference to the Service resource that fronts the
+// CSI driver.
+type ServiceReference struct {
+	// `namespace` is the namespace of the service.
+	// Required
+	Namespace string `json:"namespace"`
+
+	// `name` is the name of the service.
+	// Required
+	Name string `json:"name"`
+
+	// `path` is an optional URL path which will be sent in any request to
+	// this service.
+	// +optional
+	Path *string `json:"path,omitempty"`
+
+	// If specified, the port on the service that is hosting the CBT sidecar.
+	// Defaults to 443.
+	// `port` should be a valid port number (1-65535, inclusive).
+	// +optional
+	Port *int32 `json:"port,omitempty"`
+}
+```
+
+### Test Plan
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+All unit tests will be included in the out-of-tree CSI repositories, with no
+impact on the test coverage of the core packages.
+
+##### Integration tests
+
+The integration tests will cover the lifecycle of the `VolumeSnapshotDelta` and
+`VolumeSnapshotDeltaServiceConfiguration` resources. Test cases will be included
+to ensure that the `VolumeSnapshotDelta` resource supports only the `CREATE`
+operation, while the typical CRUD operations of a CRD work with the
+`VolumeSnapshotDeltaServiceConfiguration` resource.
+
+The validation logic of the `VolumeSnapshotDelta` resource will also be covered,
+to ensure the aggregated API server serves the CBT request only if the required
+`VolumeSnapshot` and `VolumeSnapshotDeltaServiceConfiguration` resources exist.
+If they don't, the request fails.
+
+The integration test setup will require the following fixtures:
+
+* The `VolumeSnapshot` [controller][16]
+* The CSI [host path driver][14]
+* A mock handler injected into the `cbt-aggapi` aggregated API server to return
+mock CBT entries
+
+##### e2e tests
+
+The e2e tests will extend the integration tests to run on an actual Kubernetes
+cluster, set up with `kubetest` per the SIG Testing [e2e tests
+documentation][15].
+
+A sample client will be used to create a `VolumeSnapshotDelta` resource to
+initiate the CBT request to the aggregated API server. The aggregated API server
+discovers the mock CSI driver using its `VolumeSnapshotDeltaServiceConfiguration`
+resource. The CBT request is then forwarded to the mock CSI driver, where the CSI
+gRPC invocation happens. The mock response payload is then returned to the
+sample client for verification.
+
+The e2e test setup will require the following fixtures:
+
+* The `VolumeSnapshot` [controller][16]
+* The CSI [host path driver][14]
+* The sample client
+* The mock CSI driver
+
+### Graduation Criteria
+
+#### Alpha
+
+- Approved specification of the `VolumeSnapshotDelta` and
+`VolumeSnapshotDeltaServiceConfiguration` custom resources
+- Approved specification of the CSI CBT gRPC services and messages
+- Can create a `VolumeSnapshotDelta` resource and return the CBT payload to the user
+- Can discover opt-in CSI drivers using the
+`VolumeSnapshotDeltaServiceConfiguration` resources
+- Initial e2e tests completed and enabled
+
+Since this is an out-of-tree CSI component, no feature flag is required.
+ +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [x] Other + - Describe the mechanism: The new components will be implemented as part of +the out-of-tree CSI framework. Storage providers can embed the CBT sidecar +component in their CSI drivers, if they choose to support this feature. Users +will also need to install the CBT aggregated API server. + - Will enabling / disabling the feature require downtime of the control + plane? No. + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). No. + +###### Does enabling the feature change any default behavior? + + + +No. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Users disable the CBT feature by uninstalling the CBT aggregated API servers +from their cluster. No feature gate is involved. + +###### What happens if we reenable the feature if it was previously rolled back? + +The CBT feature can be re-enabled by re-installing the CSI CBT components on +their cluster. There will be no unintended side-effects because resources from +the previous installation would have been deleted during the previous +uninstallation. Also, the `VolumeSnapshotDelta` resources are not persisted in +etcd, per the design proposal. + +###### Are there any tests for feature enablement/disablement? + + + +No. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? 
Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +The CBT aggregated API server will interact with the APIs associated +with these GVRs: + +```yaml +- apiGroups: ["cbt.storage.k8s.io"] + resources: ["volumesnapshotdeltas"] + verbs: ["create"] +- apiGroups: ["cbt.storage.k8s.io"] + resources: ["volumesnapshotdeltaserviceconfigurations"] + verbs: ["get", "list", "watch"] +- apiGroups: ["snapshot.storage.k8s.io"] + resources: ["volumesnapshotcontents", "volumesnapshots", "volumesnapshotclasses"] + verbs: ["get", "list", "watch"] +``` + +###### Will enabling / using this feature result in introducing new API types? + + + +The `VolumeSnapshotDelta` and `VolumeSnapshotDeltaServiceConfiguration` custom +resources will be added to the CSI `cbt.storage.k8s.io` group.
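To make the access pattern of the RBAC rules listed under "new API calls" concrete, the sketch below transcribes them into Go and checks a few requests against them. It is a hypothetical stand-in, not the `rbac/v1` types or the real authorizer: `policyRule`, `allows`, and `cbtRules` are names invented for this illustration, wildcards are deliberately not handled, and it assumes the plural resource name `volumesnapshotdeltaserviceconfigurations`, since RBAC rules match on the plural resource name. The point it shows is that the aggregated API server only needs `create` on `volumesnapshotdeltas` and read-only access everywhere else.

```go
package main

import "fmt"

// policyRule mirrors the shape of the RBAC rules above (apiGroups, resources,
// verbs). It is a hypothetical stand-in, not the rbac/v1 PolicyRule type.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether s is an exact member of list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// allows reports whether any rule grants verb on group/resource.
// Wildcards ("*") are intentionally not supported in this sketch.
func allows(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

// cbtRules transcribes the rules listed above, using plural resource names.
var cbtRules = []policyRule{
	{APIGroups: []string{"cbt.storage.k8s.io"}, Resources: []string{"volumesnapshotdeltas"}, Verbs: []string{"create"}},
	{APIGroups: []string{"cbt.storage.k8s.io"}, Resources: []string{"volumesnapshotdeltaserviceconfigurations"}, Verbs: []string{"get", "list", "watch"}},
	{APIGroups: []string{"snapshot.storage.k8s.io"}, Resources: []string{"volumesnapshotcontents", "volumesnapshots", "volumesnapshotclasses"}, Verbs: []string{"get", "list", "watch"}},
}

func main() {
	// Read access to snapshots is granted.
	fmt.Println(allows(cbtRules, "snapshot.storage.k8s.io", "volumesnapshots", "watch"))
	// Mutating snapshots is not granted.
	fmt.Println(allows(cbtRules, "snapshot.storage.k8s.io", "volumesnapshots", "delete"))
}
```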
+ +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +Not by the CSI CBT components. All external calls to storage provider endpoints will be handled +by the provider's CSI drivers. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +No. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +The Kubernetes API server proxies the CBT payloads between the user and the +aggregated API server. See the [Risks and Mitigations](#risks-and-mitigations) +section on using flow control to protect the Kubernetes API server. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +The previous alternative design, which involved generating and returning a callback +endpoint to the caller, has been superseded by the aggregation extension +mechanism described in this KEP. The aggregation extension design provides a +tighter integration with the Kubernetes API server, enabling the re-use of the +existing Kubernetes machinery of GVR and GVK binding, URL registration and +delegated authentication and authorization. + +Another alternative, which allows users to provide a callback URL to which the +CBT response payload is sent, has also been rejected due to concerns about the lack +of control over the authenticity of the remote URLs.
+ +Some consideration was given to associating the CBT requests with the [`LIST` +verb][17] to match the "collection of CBT entries" semantics, without introducing +the `VolumeSnapshotDelta` resource. Since this approach doesn't provide the kind +of structured request that the [`CREATE` verb][18] does, it is rejected. This +KEP proposes the introduction of the `VolumeSnapshotDelta` custom resource to +allow for cleaner encapsulation and extension of the supported CBT request +parameters. + +Another alternative, which involves implementing the CBT entry response as a +subresource of the `VolumeSnapshotDelta` resource, was also discussed. In this +scenario, a CBT request is initiated through the creation of a +`VolumeSnapshotDelta` resource. A second `GET` request can then be issued to +retrieve the list of CBT entries. A non-empty response list will be appended to +the `VolumeSnapshotDelta` resource's `blocks` subresource. For this to work, the +`VolumeSnapshotDelta` resource must first be persisted in the K8s etcd before +the `GET` request can be issued. Depending on the setup of K8s, the CBT +aggregated API server may not have direct access to, or sufficient permissions to +write to, the K8s etcd. Hence, this approach is also rejected.
+ +## Infrastructure Needed (Optional) + + + +[0]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/ +[1]: https://kubernetes-csi.github.io/docs/sidecar-containers.html +[2]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ +[3]: https://github.com/kubernetes/kubernetes/blob/cb057985ce2c1366eb7bf6adbcaa8af63a212bb8/pkg/registry/authorization/subjectaccessreview/rest.go#L55-L83 +[4]: https://kubernetes.io/docs/concepts/cluster-administration/flow-control/ +[5]: https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#suggested-configuration-objects +[6]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/#response-latency +[7]: https://en.wikipedia.org/wiki/Thundering_herd_problem +[8]: https://kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/#register-apiservice-objects +[9]: https://kubernetes-csi.github.io/docs/sidecar-containers.html +[10]: https://kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/#authentication-flow +[11]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#custom-controllers +[12]: https://github.com/kubernetes-sigs/metrics-server +[13]: https://kubernetes.io/docs/concepts/cluster-administration/logging/ +[14]: https://github.com/kubernetes-csi/csi-driver-host-path +[15]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#building-kubernetes-and-running-the-tests +[16]: https://github.com/kubernetes-csi/external-snapshotter +[17]: https://kubernetes.io/docs/reference/using-api/api-concepts/#collections +[18]: https://kubernetes.io/docs/reference/using-api/api-concepts/#api-verbs diff --git a/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-01.png b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-01.png new file mode 100644 index 00000000000..8ce096bfd1c Binary files 
/dev/null and b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-01.png differ diff --git a/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-02.png b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-02.png new file mode 100644 index 00000000000..e509411b0b7 Binary files /dev/null and b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-02.png differ diff --git a/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-03.png b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-03.png new file mode 100644 index 00000000000..972d987c1f9 Binary files /dev/null and b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-03.png differ diff --git a/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-04.png b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-04.png new file mode 100644 index 00000000000..b5a8c7e10e4 Binary files /dev/null and b/keps/sig-storage/3314-csi-volume-snapshot-delta/img/cbt-step-04.png differ diff --git a/keps/sig-storage/3314-csi-volume-snapshot-delta/kep.yaml b/keps/sig-storage/3314-csi-volume-snapshot-delta/kep.yaml new file mode 100644 index 00000000000..a249c407dc8 --- /dev/null +++ b/keps/sig-storage/3314-csi-volume-snapshot-delta/kep.yaml @@ -0,0 +1,49 @@ +title: KEP Template +kep-number: 3314 +authors: + - "@ihcsim" + - "@PrasadG193" + - "@phuongatemc" +owning-sig: sig-storage +participating-sigs: + - sig-storage +status: implementable +creation-date: 2022-06-07 +reviewers: + - "@xing-yang" + - "@bswartz" +approvers: + - "@xing-yang" + - "@msau42" + - "@thockin" + +##### WARNING !!! ###### +# prr-approvers has been moved to its own location +# You should create your own in keps/prod-readiness +# Please make a copy of keps/prod-readiness/template/nnnn.yaml +# to keps/prod-readiness/sig-xxxxx/00000.yaml (replace with kep number) +#prr-approvers: + +# The target maturity stage in the current dev cycle for this KEP. 
+stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.27" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.27" + beta: "" + stable: "" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: +- name: # not needed for this out-of-tree CSI component + components: +disable-supported: true + +# The following PRR answers are required at beta release +metrics: