Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

global context proposal #47

Open
wants to merge 30 commits into
base: main
Choose a base branch
from
Open
Changes from 8 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
72c766f
add resource cache proposal
JimBugwadia Aug 8, 2023
04a81fd
add PR link
JimBugwadia Aug 8, 2023
f2ea256
add file!
JimBugwadia Aug 8, 2023
d7930a3
typo
JimBugwadia Aug 8, 2023
d92715b
updates from contrib mtg
JimBugwadia Aug 10, 2023
07ff5c7
Merge branch 'kyverno:main' into main
JimBugwadia Sep 19, 2023
24e963a
add implementation choices
JimBugwadia Sep 25, 2023
5a365c6
address comments
JimBugwadia Sep 25, 2023
261d5ac
feat: update resource-cache KDP
vishal-chdhry Nov 21, 2023
fc02a52
fix: remove outdated drawbacks
vishal-chdhry Nov 21, 2023
e311cc8
fix: typos
vishal-chdhry Nov 21, 2023
20a2ef1
fix: updates
vishal-chdhry Nov 22, 2023
eb4d88e
Update proposals/resource_cache.md
realshuting Dec 5, 2023
7974765
Update proposals/resource_cache.md
realshuting Dec 5, 2023
934bd51
Update proposals/resource_cache.md
realshuting Dec 5, 2023
4a255d2
fix: add failure
vishal-chdhry Jan 30, 2024
8166b4e
fix: nit
vishal-chdhry Jan 30, 2024
dea1fa5
Merge pull request #1 from vishal-chdhry/resource-cache-kdp-update
realshuting Jan 30, 2024
b2e0e27
fix: update proposal
vishal-chdhry Jan 31, 2024
34821ca
Update proposals/resource_cache.md
vishal-chdhry Jan 31, 2024
01b492f
Update proposals/resource_cache.md
vishal-chdhry Jan 31, 2024
ecd1808
Update proposals/resource_cache.md
vishal-chdhry Jan 31, 2024
ed685c1
Update proposals/resource_cache.md
vishal-chdhry Jan 31, 2024
354da16
fix: add more details
vishal-chdhry Jan 31, 2024
204b736
Update proposals/resource_cache.md
vishal-chdhry Feb 1, 2024
bcc5afd
Update proposals/resource_cache.md
vishal-chdhry Feb 1, 2024
2e06d81
Update proposals/resource_cache.md
vishal-chdhry Feb 1, 2024
f6aa8ee
fix: updates
vishal-chdhry Feb 1, 2024
ae4f3f8
Update proposals/resource_cache.md
JimBugwadia Feb 1, 2024
e2dfc10
Merge pull request #2 from vishal-chdhry/JimBugwadia/main
JimBugwadia Feb 1, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
271 changes: 271 additions & 0 deletions proposals/resource_cache.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,271 @@
# Meta
[meta]: #meta
- Name: Resource Cache
- Start Date: Aug 7, 2023
- Update Date: Sep 24th, 2023
- Author(s): @JimBugwadia

# Table of Contents
[table-of-contents]: #table-of-contents
- [Meta](#meta)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Definitions](#definitions)
- [Motivation](#motivation)
- [Proposal](#proposal)
- [Implementation](#implementation)
- [Migration (OPTIONAL)](#migration-optional)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Prior Art](#prior-art)
- [Unresolved Questions](#unresolved-questions)
- [CRD Changes (OPTIONAL)](#crd-changes-optional)

# Overview
[overview]: #overview

Optional caching of any Kubernetes resource.

# Motivation
[motivation]: #motivation

From: https://github.com/kyverno/kyverno/issues/4459

> In some cases to properly validate a resource we need to examine other resources. Particularly for Ingress and Istio Gateways/VirtualServices we need to look at all the other Ingress/virtualservices or services in the cluster. At large scale we are finding that Kyverno struggles to handle repeatedly pulling thousands of resources using the context apiCall variables. On a cluster with around 4,000 Service objects and 3,000 Ingresses we found that a policy validating Istio VirtualService destinations against Services (ensuring the target exists) was taking > 10s for all measurements (p50, p95 and p99), and the webhook timeout was exceeded. Another policy that validates an Ingress doesn't duplicate a hostname from another Ingress had p95 execution times over 5 seconds. During this time the controllers were at/below requested values for cpu/memory and otherwise had no other indicator of performance problems.

# Proposal

There are two parts to this feature:
1. Allow users to manage which resources should be cached
2. Allow policy rules to reference cached resource data

Users can manage which resources to cache using the same mechanism that is currently used for ConfigMap resources i.e. adding a label `cache.kyverno.io/enabled: "true"` to the resource.

In the Kyverno policy a new `context` entry `resourceCache` will be added. Here is an example:

```yaml
context:
- name: ingresses
resourceCache:
group: "apis/networking.k8s.io"
version: "v1"
kind: "ingresses"
namespace: apps
jmesPath: "ingresses | items[].spec.rules[].host"
```

This allows policy authors to declare what resource should be cached.

The `group` and `version` are optional. If not specified, the preferred versions should be used.

An optional `namespace` can be used to only cache resources in the namespace, rather than across all namespaces which is the behavior is a namespace is not specified.
JimBugwadia marked this conversation as resolved.
Show resolved Hide resolved

The JMESPath is also optional and is applied to add a resulting subset of the resource data to the rule context.

Note that Kyverno will only cache matching resources that have the label: `cache.kyverno.io/enabled: "true"`.

# Implementation
JimBugwadia marked this conversation as resolved.
Show resolved Hide resolved

There are multiple ways to implement this feature:

## Limited to known types

With this option, Kyverno will be able to cache types defined in the Kubernetes client set but will not be able to cache other custom resources.

As policies are created or modified, Kyverno will attempt to initialize informers for any `resourceCache` declaration. In case an informer cannot be initialized, an error will be returned.

During rule execution, Kyverno will add the resource data to the rule context.

## Support for any resource type

With this option, Kyverno will not be able to use informers but instead use dynamic watches and mantain its own cache.

This will be more involved, but will allow caching any custom resource. This approach can also allow finer grained filters, in addition to labels, for what should be cached.

As with the informers based implementation, during rule execution, Kyverno will add the resource data to the rule context.

## Other requirements

### Metrics

It would be useful to add cache metrics for observability and troubleshooting.
Copy link
Collaborator

@MarcelMue MarcelMue Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially cache size would be very important. With this current proposal I see the risk of Kyverno Policies being created which have a large influence on overall cache size.


### Failure handling

The assumption is that the cache is kept up-to-date. We will need to think about any potential race conditions, especially during startup, and how to handle scenarios where the cache is not populated.


### Failure

## Link to the Implementation PR

TBD

# Migration (OPTIONAL)

There is no automated migration.

However, to leverage resource caching, users can convert API calls to the new `resourceCache` declaration.

Here are some API calls from sample policies along with the correspinding `resourceCache` declarations:

https://kyverno.io/policies/other/e-l/ensure-production-matches-staging/ensure-production-matches-staging/

```yaml
context:
- name: deployment_count
apiCall:
urlPath: "/apis/apps/v1/namespaces/staging/deployments"
jmesPath: "items[?metadata.name=='{{ request.object.metadata.name }}'] || `[]` | length(@)"
```

can be converted to:

```yaml
context:
- name: deployment_count
resourceCache:
group: "apps"
version: "v1"
kind: "deployments"
namespace: "staging"
jmesPath: "items[?metadata.name=='{{ request.object.metadata.name }}'] || `[]` | length(@)"
```

https://kyverno.io/policies/linkerd/require-linkerd-server/require-linkerd-server/

```yaml
context:
- name: server_count
apiCall:
urlPath: "/apis/policy.linkerd.io/v1beta1/namespaces/{{request.namespace}}/servers"
```

can be converted to:

```yaml
context:
- name: server_count
resourceCache:
group: "policy.linkerd.io"
version: "v1beta1"
kind: "servers"
namespace: "{{request.namespace}}"
```

https://kyverno.io/policies/linkerd/check-linkerd-authorizationpolicy/check-linkerd-authorizationpolicy/

```yaml
context:
- name: servers
apiCall:
urlPath: "/apis/policy.linkerd.io/v1beta1/namespaces/{{request.namespace}}/servers"
jmesPath: "items[].metadata.name || `[]`"
- name: httproutes
apiCall:
urlPath: "/apis/policy.linkerd.io/v1beta1/namespaces/{{request.namespace}}/httproutes"
jmesPath: "items[].metadata.name || `[]`"
```

can be converted to:

```yaml
context:
- name: servers
resourceCache:
group: "policy.linkerd.io"
version: "v1beta1"
kind: "servers"
namespace: "{{request.namespace}}"
jmesPath: "items[].metadata.name || `[]"
```


https://kyverno.io/policies/istio/require-authorizationpolicy/require-authorizationpolicy/

```yaml
- name: allauthorizationpolicies
apiCall:
urlPath: "/apis/security.istio.io/v1beta1/authorizationpolicies"
jmesPath: "items[].metadata.namespace"

```

can be converted to:

```yaml
context:
- name: allauthorizationpolicies
resourceCache:
group: "security.istio.io"
version: "v1beta1"
kind: "authorizationpolicies"
jmesPath: "items[].metadata.namespace"
```


https://kyverno.io/policies/other/rec-req/require-netpol/require-netpol/


```yaml
- name: policies_count
apiCall:
urlPath: "/apis/networking.k8s.io/v1/namespaces/{{request.namespace}}/networkpolicies"
jmesPath: "items[?label_match(spec.podSelector.matchLabels, `{{request.object.spec.template.metadata.labels}}`)] | length(@)"
```

can be converted to:

```yaml
context:
- name: policies_count
resourceCache:
group: "networking.k8s.io"
version: "v1"
kind: "networkpolicies"
namespace: "{{request.namespace}}"
jmesPath: "items[?label_match(spec.podSelector.matchLabels, `{{request.object.spec.template.metadata.labels}}`)] | length(@)"
```

# Drawbacks

API calls do not leverage caching by default.

If needed, we can add a separate caching mechanism for API calls in the future.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating drawbacks of this approach

Suggested change
API calls do not leverage caching by default.
If needed, we can add a separate caching mechanism for API calls in the future.
1. API calls do not leverage caching by default. If needed, we can add a separate caching mechanism for API calls in the future.
2. Using `cache.kyverno.io/enabled: "true"` can cause issues when users forgets to add them to the resource.
1. It is easy to miss a resource when adding the label. `apiCall` will return all the resources of the given type while `resourceCache` will return only those resources that have the label. In case of 1-10k resources.
2. Users might not want to have a kyverno specific label in all their resources across all namespaces.
3. For the usecase mentioned in [motivation](#motivation), we need to add resource to the cache when the policies is applied, otherwise, when the resource is applied for the first time, it will fail because of the timeout like it currently does. This will take away the abilities to have substitutions in `resourceCache` (e.g. `namespace: "{{request.namespace}}"`) and the `resourceCache` field will have to be static.



# Alternatives

## Add caching to API calls

To cache resources, the `apiCall` context entry can be extended to add a `cache` field:

```yaml
context:
- name: hosts
apiCall:
urlPath: "/apis/networking.k8s.io/v1/ingresses"
jmesPath: "items[].spec.rules[].host"
cache: true
```

However, coupling resource caching to API calls could be confusing for users.

# Prior Art

N/A

# Limitations

1. This only addresses Kubernetes resource caching. Resources from other API calls are not cached.

# Open Items

1. We may not be able to use a static Kubernetes client for all types, as the client set can include custom types, and dynamic clients may be resource intensive. More research is needed to determine the best way to manage informers. The alternative is to use watches.
2. Typically informers are initialized on startup. This feature may require adding / deleting informers after startup.
3. All admission controller replicas will need to cache data. For background controller and reports controller the leader will need to cache data.


# CRD Changes (OPTIONAL)

Yes. A new context entry `resourceCache` will be added. The CRD will be defined in the implementation stage.