
Reconcile error: network endpoint group in a specific zone not found #50

Open
glerchundi opened this issue Oct 22, 2021 · 4 comments

@glerchundi
Contributor

We deployed autoneg in one of our clusters running GKE Autopilot. When a workload is scheduled into only some of the zones rather than all of them, the NEGs are not created in every zone.

That means that autoneg will fail and stop reconciling.

The relevant part of the failing log:

2021-10-22T17:55:52.092Z	INFO	controllers.Service	Applying intended status	{"service": "envoy/envoy", "status": {"backend_services":{"8000":{"myproduct":{"name":"myproduct","max_connections_per_endpoint":1000}}},"network_endpoint_groups":{"8000":"k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b","8001":"k8s1-4fd3dc4c-envoy-envoy-8001-7508741b"},"zones":["europe-west1-b","europe-west1-c","europe-west1-d"]}}
2021-10-22T17:55:52.762Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "service", "request": "envoy/envoy", "error": "googleapi: Error 404: The resource 'projects/myproduct-dev/zones/europe-west1-c/networkEndpointGroups/k8s1-4fd3dc4c-envoy-envoy-8000-44f9746b' was not found, notFound"}

The question is: should autoneg tolerate network endpoint groups that are missing in some, but not all, of the available zones?
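
To make the question concrete, here is a minimal sketch of what "tolerating" could look like, assuming the google.golang.org/api/compute/v1 client. This is not autoneg's actual code and the helper name is made up; the idea is just to skip zones whose NEG returns 404 instead of failing the whole reconcile.

package controllers

import (
	"errors"
	"net/http"

	compute "google.golang.org/api/compute/v1"
	"google.golang.org/api/googleapi"
)

// backendsForNEG is a hypothetical helper: it builds the backend list for one
// NEG name, skipping zones where the NEG does not exist yet (HTTP 404).
func backendsForNEG(svc *compute.Service, project, negName string, zones []string) ([]*compute.Backend, error) {
	var backends []*compute.Backend
	for _, zone := range zones {
		neg, err := svc.NetworkEndpointGroups.Get(project, zone, negName).Do()
		if err != nil {
			var gerr *googleapi.Error
			if errors.As(err, &gerr) && gerr.Code == http.StatusNotFound {
				// The NEG hasn't been created in this zone (yet); skip it and
				// let a later reconcile pick it up once it exists.
				continue
			}
			return nil, err
		}
		backends = append(backends, &compute.Backend{
			Group:              neg.SelfLink,
			BalancingMode:      "RATE",
			MaxRatePerEndpoint: 100, // placeholder; autoneg takes this from the annotation
		})
	}
	return backends, nil
}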

@rosmo
Collaborator

rosmo commented Oct 25, 2021

Hey @glerchundi, how have you configured the workload? With anti-affinity for a specific zone? It's a bit interesting; generally I've seen NEGs created in all zones regardless of whether a workload is running there, but I suppose this might be an optimization in GKE.

@soellman
Contributor

I think I've seen this before. My hypothesis is that the GKE NEG controller adds the annotation with the NEG names before they're actually created, and thus autoneg may fail when adding those not-yet-created NEGs to the backend service.

Question: does this eventually get reconciled, or does it get stuck in a bad state?
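
For context on the retry side, a rough sketch of how this could be handled in the reconciler (made-up names, not autoneg's actual code): when Reconcile returns an error, controller-runtime requeues the request with backoff, so it should keep retrying until the NEG appears; alternatively the 404 could be swallowed and requeued explicitly.

package controllers

import (
	"context"
	"errors"
	"net/http"
	"time"

	"google.golang.org/api/googleapi"
	ctrl "sigs.k8s.io/controller-runtime"
)

type ServiceReconciler struct{}

// applyIntendedStatus stands in for the real work: adding the NEGs named in
// the neg-status annotation to the configured backend service.
func (r *ServiceReconciler) applyIntendedStatus(ctx context.Context, req ctrl.Request) error {
	return nil
}

func isNEGNotFound(err error) bool {
	var gerr *googleapi.Error
	return errors.As(err, &gerr) && gerr.Code == http.StatusNotFound
}

func (r *ServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.applyIntendedStatus(ctx, req); err != nil {
		if isNEGNotFound(err) {
			// A NEG named in the annotation doesn't exist in that zone yet;
			// requeue quietly and try again once the GKE NEG controller has
			// created it.
			return ctrl.Result{RequeueAfter: time.Minute}, nil
		}
		// Any other error is surfaced; controller-runtime requeues it with
		// backoff, which is why the error in the log above keeps repeating.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}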

@glerchundi
Contributor Author

Thanks @rosmo & @soellman for your replies!

The workload is configured with a preferred anti-affinity on zones, but depending on the number of zones GKE Autopilot has created, or the scheduling decisions it makes (placing all the pods in the same zone, for example), there can end up being fewer NEGs than available zones.

This eventually fixes itself: killing or upgrading the pods, or anything else that causes them to be scheduled into a different zone, triggers the creation of the missing NEGs.

At the same time, referencing those NEGs from a backend service prevents them from being deleted, although I don't know whether they would ever be deleted if they weren't referenced.

Hope this helps to understand the reasoning behind it!

@Xander-Polishchuk

Hey everyone,

We're observing the same issue with our GKE deployment (VPC-native Autopilot cluster); however, in our case the deployment is simple and doesn't have any anti-affinity configuration: just a plain Deployment with a standalone NEG and an external global load balancer.

I was able to reproduce it with a deployment as simple as this:

apiVersion: v1
kind: Namespace
metadata:
  name: echo
  labels:
    app.kubernetes.io/name: echo
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{"name": "echo"}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"echo","max_rate_per_endpoint":100}]}}'
  namespace: echo
spec:
  type: ClusterIP
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 8080
      name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  namespace: echo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: ealen/echo-server:0.7.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: PORT
              value: '8080'

In that case GKE annotates the service as:

Annotations:       
  cloud.google.com/neg: {"exposed_ports": {"80":{"name": "echo"}}}
  cloud.google.com/neg-status: {"network_endpoint_groups":{"80":"echo"},"zones":["europe-west1-b","europe-west1-c","europe-west1-d"]}
  controller.autoneg.dev/neg: {"backend_services":{"80":[{"name":"echo","max_rate_per_endpoint":100}]}}
  controller.autoneg.dev/neg-status: {"backend_services":{"80":{"echo":{"name":"echo","max_rate_per_endpoint":100}}},"network_endpoint_groups":{"80":"echo"},"zones":["europe-w...

However, in reality the groups are only created in two zones: europe-west1-b and europe-west1-c.
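
For anyone who wants to double-check the mismatch between the neg-status annotation and the zones where the NEG really exists, something along these lines should work with the Compute API (sketch only; the project ID is a placeholder and "echo" is the NEG name from the manifest above):

package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	project := "my-project" // replace with your project ID
	negName := "echo"       // the NEG name from the cloud.google.com/neg annotation

	// The aggregated list groups NEGs by zone; print only the zones where the
	// NEG actually exists and compare with the "zones" field in neg-status.
	err = svc.NetworkEndpointGroups.AggregatedList(project).Pages(ctx,
		func(page *compute.NetworkEndpointGroupAggregatedList) error {
			for scope, scoped := range page.Items {
				for _, neg := range scoped.NetworkEndpointGroups {
					if neg.Name == negName {
						fmt.Printf("%s exists in %s\n", neg.Name, scope)
					}
				}
			}
			return nil
		})
	if err != nil {
		log.Fatal(err)
	}
}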
