Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Regression in 1.6.0 Prevents Usage of Custom Registries and/or imagePullSecrets #664

Open
Tabrnicus opened this issue Jan 9, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@Tabrnicus
Copy link

Note: Make sure to check out known issues (https://akv2k8s.io/troubleshooting/known-issues/) before submitting

Note: Tangentially related to #495, through #631.

Components and versions

[ ] Controller, version: x.x.x (docker image tag)
[X] Env-Injector (webhook), version: 1.6.0 (docker image tag)
[ ] Other

Describe the bug

A regression in version 1.6.0 of the Env-Injector prevents the use of imagePullSecrets as valid registry authentication when using authType="azureCloudConfig". More specifically, the logic present in the current 1.6.0 version (covered below in detail) assumes that both the Key Vault and container registry must be using MSIs as the authentication method. My use case involves utilizing a third-party container registry, and not an ACR. As a result, I cannot set up any form of MSI to our container registry, which is why we use imagePullSecrets. My understanding from reading through the documentation and code from 1.5.0 and below is that this use case should be supported.

After investigating, I've found that the changes introduced by #631 break the previous logic that allowed imagePullSecrets. The function getContainerRegistryRemoteOptions() in registry.go (a) uses authType to decide how the container registry authentication should be handled, and (b) assumes that an ACR will be used when authType="azureCloudConfig":

func getContainerRegistryRemoteOptions(ctx context.Context, client kubernetes.Interface, container containerInfo, authType string, opt ImageRegistryOptions, r credentialprovider.CredentialProvider) ([]remote.Option, error) {
ref, err := name.ParseReference(container.Image)
if err != nil {
return nil, fmt.Errorf("failed to parse image reference: %w", err)
}
registry := ref.Context().Registry.Name()
klog.InfoS("using registry", "imageRegistry", registry)
authChain := new(authn.Keychain)
switch authType {
case "azureCloudConfig":
klog.InfoS("using cloudConfig for registry authentication", "config.authType", authType)
dockerConfigEntry, err := r.GetAcrCredentials(container.Image)
if err != nil {
return nil, fmt.Errorf("cannot fetch acr credentials: %w", err)
}
if dockerConfigEntry.Username != "" {
sec := []corev1.Secret{ //{
*dockerCfgSecretType.Create(container.Namespace, "secret", registry, authn.AuthConfig{
Username: dockerConfigEntry.Username, Password: dockerConfigEntry.Password,
}),
}
*authChain, err = k8schain.NewFromPullSecrets(
ctx,
sec,
)
if err != nil {
return nil, err
}
} else {
*authChain, err = k8schain.New(
ctx,
client,
k8schain.Options{
Namespace: container.Namespace,
ServiceAccountName: container.ServiceAccountName},
)
if err != nil {
return nil, err
}
}
default:
klog.InfoS("using imagePullSecrets for registry authentication", "config.authType", authType)
*authChain, err = k8schain.New(
ctx,
client,
k8schain.Options{
Namespace: container.Namespace,
ServiceAccountName: container.ServiceAccountName,
ImagePullSecrets: container.ImagePullSecrets,
},
)
if err != nil {
return nil, err
}
}

For context, the previous approach used a getImageConfig() function whose only branch includes imagePullSecrets (for clarity, this same logic is preserved in the default case statement in 1.6.0):

func getImageConfig(ctx context.Context, client kubernetes.Interface, container containerInfo, opt ImageRegistryOptions) (*v1.Config, error) {
authChain, err := k8schain.New(
ctx,
client,
k8schain.Options{
Namespace: container.Namespace,
ServiceAccountName: container.ServiceAccountName,
ImagePullSecrets: container.ImagePullSecrets,
},
)
if err != nil {
return nil, err
}
options := []remote.Option{
remote.WithAuthFromKeychain(authChain),
}
if opt.SkipVerify {
tr := &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // nolint:gosec
}
options = append(options, remote.WithTransport(tr))
}
ref, err := name.ParseReference(container.Image)
if err != nil {
return nil, fmt.Errorf("failed to parse image reference: %w", err)
}
descriptor, err := remote.Get(ref, options...)
if err != nil {
return nil, fmt.Errorf("cannot fetch image descriptor: %w", err)
}
image, err := descriptor.Image()
if err != nil {
return nil, fmt.Errorf("cannot convert image descriptor to v1.Image: %w", err)
}
configFile, err := image.ConfigFile()
if err != nil {
return nil, fmt.Errorf("cannot extract config file of image: %w", err)
}
return &configFile.Config, nil
}

This seems to be a restrictive approach, for several reasons:

  1. authType seems to be documented as an option to specify key vault authentication only. I haven't seen this option mention container registries (Chart README.md, Chart env-injector-deployment.yaml). Aside from the issues being discussed here, this alone is cause for potential confusion.
  2. Making the container registry authentication option tied to the Key Vault authentication option is unnecessarily restrictive. Aside from the current use case, another use case that is excluded is that of custom authentication (i.e. authType="environment") with ACR MSI authentication. The newly added code branch in registry.go will never be taken because authType does not equal azureCloudConfig.

To Reproduce

Important

For ease of testing within the context of this project, these replication steps will include creating an ACR. To be clear, the two use cases (ACR without MSI, and third party registry entirely) fail the same way and for the purposes of this issue can be regarded as equivalent.

Note

I've replaced all the instances of identifiable names with their <pseudo> equivalents. Example: <registryName>.azurecr.io

  1. Provision an AKS Cluster
    • Kubernetes Version: 1.26.10
    • Tier: Free
    • AAD Pod Identity: Disabled
    • Authentication and Authorization: Local accounts with Kubernetes RBAC
    • Networking: Azure CNI, Calico
  2. Provision a Key Vault
    • With access to the cluster via an access policy on the agentpool MSI
    • Add a secret to this vault to be accessed later by a workload pod
      • For the purposes of these steps, name it admin-password
  3. Provision a container registry. We'll be using token based authentication, not MSIs.
    • This can be done using an ACR, following the instructions to create a token.
    • Generate a new token:
      • Name: mytoken
      • Scope Map: _repositories_push_metadata_write
      • After creation, generate a password. Save the password for later and copy the docker login command for the next step.
  4. Push an image to the registry (we'll pull this image again later):
    docker login -u mytoken -p <password> <registryName>
    docker pull alpine:latest
    docker tag alpine:latest <registryName>/alpine-test:latest
    docker push <registryName>/alpine-test:latest
    docker logout <registryName>
  5. Deploy akv2k8s at application version 1.6.0 (Helm chart version 2.6.0)
    helm upgrade --install akv2k8s spv-charts/akv2k8s --namespace akv2k8s --version 2.6.0 --set global.logLevel=trace
  6. Create / modify a namespace to have the proper azure-key-vault-env-injection annotation
    • I picked default, for ease.
  7. Create an AzureKeyVaultSecret in that namespace
    apiVersion: spv.no/v2beta1
    kind: AzureKeyVaultSecret
    metadata:
      name: admin-password
    spec:
      vault:
        name: <keyVaultName>
        object:
          name: admin-password
          type: secret
  8. Create a Kubernetes Secret with the token information
kubectl create secret docker-registry regcred --docker-server="<registryName>.azurecr.io" --docker-username="mytoken" --docker-password="<pwFromStep3>"
  1. Deploy a workload pod that uses the pull secret and requests the secret variable with admin-password@azurekeyvault
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: alpine-deployment
      labels:
        app: alpine
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: alpine
      template:
        metadata:
          labels:
            app: alpine
        spec:
          containers:
          - name: alpine
            image: <registryName>.azurecr.io/alpine-test:latest
            imagePullPolicy: Always  # This is here to eliminate node cache from the variables in this test
            env:
              - name: "MY_SECRET"
                value: "admin-password@azurekeyvault"
          imagePullSecrets:
            - name: regcred
  2. Notice that Pods are not deploying. Describe the ReplicaSet to show the unauthorized message:
    $ kubectl describe rs alpine-deployment-<hash>
    
    Events:
      Type     Reason        Age              From                   Message
      ----     ------        ----             ----                   -------
      Warning  FailedCreate  13s              replicaset-controller  Error creating: Internal error occurred: failed calling webhook "pods.env-injector.admission.spv.no": failed to call webhook: an error on the server ("{\"response\":{\"uid\":\"4946a53f-76c1-4f90-82ab-ab0f76272a79\",\"allowed\":false,\"status\":{\"metadata\":{},\"status\":\"Failure\",\"message\":\"failed to get auto cmd, error: cannot fetch image descriptor: GET https://<registryName>.azurecr.io/oauth2/token?scope=repository%3Aalpine-test%3Apull\\u0026service=<registryName>.azurecr.io: UNAUTHORIZED: authentication required, visit https://aka.ms/acr/authorization for more information.\"}}}") has prevented the request from succeeding

Expected behavior

Based on the above replication steps, I would expect to see the env-injector use the imagePullSecrets in the deployment specification to pull the image from the registry. Instead, it tries to pull it using invalid/anonymous credentials.

As a point of comparison, installing akv2k8s at application version 1.5.0 (chart version 2.5.0) and reinstalling the deployment will show the intended behavior. Of course, the container is nonfunctional (it crash loops because it's default Alpine), which is beside the point.

Logs

Full ReplicaSet describe log:

Name:           alpine-deployment-bbd99dbf5
Namespace:      default
Selector:       app=alpine,pod-template-hash=bbd99dbf5
Labels:         app=alpine
                pod-template-hash=bbd99dbf5
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/alpine-deployment
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=alpine
           pod-template-hash=bbd99dbf5
  Containers:
   alpine:
    Image:      <registry>.azurecr.io/alpine-test:latest
    Port:       <none>
    Host Port:  <none>
    Environment:
      MY_SECRET:  admin-password@azurekeyvault
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age               From                   Message
  ----     ------        ----              ----                   -------
  Warning  FailedCreate  16s               replicaset-controller  Error creating: Internal error occurred: failed calling webhook "pods.env-injector.admission.spv.no": failed to call webhook: an error on the server ("{\"response\":{\"uid\":\"e0ac9a89-737b-4a68-b348-4d4017c09858\",\"allowed\":false,\"status\":{\"metadata\":{},\"status\":\"Failure\",\"message\":\"failed to get auto cmd, error: cannot fetch image descriptor: GET https://<registry>.azurecr.io/oauth2/token?scope=repository%3Aalpine-test%3Apull\\u0026service=<registry>.azurecr.io: UNAUTHORIZED: authentication required, visit https://aka.ms/acr/authorization for more information.\"}}}") has prevented the request from succeeding
# <snip> (the same event is repeated about 20 times)

env-injector logs at time of failure:

2024/01/09 06:10:51 [DEBUG] reviewing request 67fe5091-2a8f-4239-a599-d2236f22a837, named: default/
I0109 06:10:51.012438       1 main.go:139] "found pod to mutate" pod="default/"
I0109 06:10:51.012454       1 pod.go:295] "creating client certificate to use with auth service" default/="(MISSING)"
I0109 06:10:51.012462       1 clientCert.go:25] "creating x509 key pair for ca cert and key"
I0109 06:10:51.012752       1 clientCert.go:32] "parse certificate"
I0109 06:10:51.012772       1 clientCert.go:38] "generating client key"
I0109 06:10:51.029883       1 clientCert.go:44] "generating serial number"
I0109 06:10:51.029909       1 clientCert.go:66] "crating x509 certificate"
I0109 06:10:51.032023       1 pod.go:303] "create authentication service secret" default/="(MISSING)"
I0109 06:10:51.045707       1 round_trippers.go:553] POST https://10.0.0.1:443/api/v1/namespaces/default/secrets 409 Conflict in 13 milliseconds
I0109 06:10:51.051601       1 round_trippers.go:553] PUT https://10.0.0.1:443/api/v1/namespaces/default/secrets/akv2k8s-alpine-deployment 200 OK in 5 milliseconds
I0109 06:10:51.052088       1 pod.go:317] "mutate init-containers" default/="(MISSING)"
I0109 06:10:51.052106       1 pod.go:323] "mutate containers" default/="(MISSING)"
I0109 06:10:51.052113       1 pod.go:138] "found container to mutate" container="default/alpine"
I0109 06:10:51.052119       1 pod.go:141] "checking for env vars to inject" container="default/alpine"
I0109 06:10:51.052145       1 pod.go:144] "found env var to inject" env="admin-password@azurekeyvault" container="default/alpine"
I0109 06:10:51.052170       1 registry.go:30] "getting container command for container" container="default/alpine"
I0109 06:10:51.052232       1 registry.go:36] "no cmd override in kubernetes for container, checking docker image configuration for entrypoint and cmd" image="<registryName>.azurecr.io/alpine-test:latest" container="default/alpine"
I0109 06:10:51.052256       1 registry.go:130] "using registry" imageRegistry="<registryName>.azurecr.io"
I0109 06:10:51.052279       1 registry.go:135] "using cloudConfig for registry authentication" config.authType="azureCloudConfig"
I0109 06:10:51.052319       1 acr.go:78] using managed identity for acr credentials
I0109 06:10:51.052349       1 provider.go:285] "azure: using managed identity extension to retrieve access token" id="076f5645-0759-4528-b014-7cf8ac51f90f"
I0109 06:10:51.052359       1 provider.go:292] "azure: using managed identity extension to retrieve access token" id="076f5645-0759-4528-b014-7cf8ac51f90f"
I0109 06:10:51.063450       1 acr.go:184] "discovering auth redirects" url="<registryName>.azurecr.io"
I0109 06:10:51.096246       1 acr.go:190] exchanging an acr refresh_token
I0109 06:10:51.424071       1 round_trippers.go:553] GET https://10.0.0.1:443/api/v1/namespaces/default/serviceaccounts/default 200 OK in 3 milliseconds
E0109 06:10:51.657377       1 main.go:162] "failed to mutate" err="failed to get auto cmd, error: cannot fetch image descriptor: GET https://<registryName>.azurecr.io/oauth2/token?scope=repository%3Aalpine-test%3Apull&service=<registryName>.azurecr.io: UNAUTHORIZED: authentication required, visit https://aka.ms/acr/authorization for more information." pod="default/"
2024/01/09 06:10:51 [ERROR] admission webhook error: failed to get auto cmd, error: cannot fetch image descriptor: GET https://<registryName>.azurecr.io/oauth2/token?scope=repository%3Aalpine-test%3Apull&service=<registryName>.azurecr.io: UNAUTHORIZED: authentication required, visit https://aka.ms/acr/authorization for more information.

Note the two new (as of #631) INFO logs near the bottom, as it helps to track code execution.

  • using registry... shows execution is in getContainerRegistryRemoteOptions(), before the switch statement
  • using cloudConfig for registry authentication... shows execution is in the first branch of the switch statement

Additional context

In terms of potential direction for resolution, I'd probably be most in favor of the following:

  • Adding separate configuration options (probably in the form of a global and overrides just like the Key Vault option) that configures whether MSI should be used for the container registry.
  • Use the newly-added code from Use registry authentication for azure sp credentials, when authType i… #631 only in situations where:
    • the previously mentioned config option is appropriately configured (e.g. azureCloudConfig); and
    • An ACR is used / detected (implying that third party registries will fall back to using imagePullSecrets, regardless of the option's value)
  • The option should default to using imagePullSecrets, to avoid making this a breaking change. Otherwise, the breaking change should be communicated.
@Tabrnicus Tabrnicus added the bug Something isn't working label Jan 9, 2024
@Speeddymon
Copy link

@181192 please take a look - seems this one requires some additional attention.

CC: @abhilashjoseph as the author of the PR which broke this functionality.

@flo-kn
Copy link

flo-kn commented Aug 13, 2024

Hi everyone 👋
any news this in the meantime?

I just bumped into same issue trying pull container images from a third-party registry using imagePullSecrets.
Would be super happy for any pointers, directions, or workarround ideas.

@pwelp
Copy link

pwelp commented Nov 25, 2024

Hey,

I am also running into this. I am a little confused though. I've got 3 pods all from the same registry in the same namespace using the same pullsecret, but I am only having issues pulling 1 of them.

@Benly-walter
Copy link

Benly-walter commented Jan 15, 2025

Ran into the same error today while attempting helm version upgrade (chart version 2.5.1 to 2.6.0)
Any updates on this?

@Speeddymon
Copy link

Due to lack of timely updates and response from the Dev team both here and in their own Slack, I've moved our projects over to External Secrets Operator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants