Reconcile error: network endpoint group in a specific zone not found #50
Comments
Hey @glerchundi, how have you configured the workload? With anti-affinity for a specific zone? It's a bit interesting; generally I've seen NEGs created in all zones regardless of whether a workload is running there, but I suppose this might be an optimization in GKE.
I think I've seen this before. My hypothesis is that the GKE NEG controller adds the annotation with the NEG names before they're actually created, and thus autoneg may fail when adding those not-yet-created NEGs to the backend service. Question: does this eventually get reconciled, or does it get stuck in a bad state?
Thanks @rosmo & @soellman for your replies! The workload is configured with a preferred anti affinity on zones but depending on the number of zones that GKE Autopilot has created or the scheduling decisions it took (placing all the pods in the same zone, for example) there could be a possibility to have less number of negs than available zones. This is eventually fixed as the process of killing, upgrading or whatever reason that makes those pods to be scheduled in different zone will trigger the creation of those missing negs. At the same time the use of those negs in a backend services prevents them from deletion. Although I don't know if this would ever happen if there aren't. Hope helps understanding the reasoning behind! |
Hey everyone, we're observing the same issue with our GKE deployment (a VPC-native Autopilot cluster); however, in our case the deployment is simple and doesn't have any anti-affinity configuration, only a plain Deployment with a standalone NEG and an external global LB. I was able to reproduce it with a deployment as simple as this:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: echo
  labels:
    app.kubernetes.io/name: echo
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80":{"name": "echo"}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"80":[{"name":"echo","max_rate_per_endpoint":100}]}}'
  namespace: echo
spec:
  type: ClusterIP
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 8080
      name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  namespace: echo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: ealen/echo-server:0.7.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: PORT
              value: '8080'
```

In that case, GKE annotates the Service as:

```
Annotations:
  cloud.google.com/neg: {"exposed_ports": {"80":{"name": "echo"}}}
  cloud.google.com/neg-status: {"network_endpoint_groups":{"80":"echo"},"zones":["europe-west1-b","europe-west1-c","europe-west1-d"]}
  controller.autoneg.dev/neg: {"backend_services":{"80":[{"name":"echo","max_rate_per_endpoint":100}]}}
  controller.autoneg.dev/neg-status: {"backend_services":{"80":{"echo":{"name":"echo","max_rate_per_endpoint":100}}},"network_endpoint_groups":{"80":"echo"},"zones":["europe-w...
```

However, in reality the groups were only created in two zones.
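To make the mismatch concrete, here is a hypothetical reconstruction (which zones actually hold the NEG is assumed for illustration): the `neg-status` annotation advertises three zones, while the group exists in only two of them, so the lookup in the third zone fails.

```yaml
# Hypothetical reconstruction of the mismatch; zone availability assumed.
# What the GKE NEG controller writes on the Service:
metadata:
  annotations:
    cloud.google.com/neg-status: |
      {"network_endpoint_groups":{"80":"echo"},
       "zones":["europe-west1-b","europe-west1-c","europe-west1-d"]}
# What actually exists in Compute Engine at reconcile time:
#   NEG "echo" in europe-west1-b  -> exists
#   NEG "echo" in europe-west1-c  -> missing (autoneg fails here)
#   NEG "echo" in europe-west1-d  -> exists
```

autoneg appears to trust the zones listed in the annotation when attaching NEGs to the backend service, so a single missing group is enough to abort the reconcile, which matches the error in the issue title.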
We deployed autoneg in one of our clusters running GKE Autopilot. When a workload running on it is scheduled to only some of the available zones, the NEGs are not created in all zones. That means that autoneg will fail and stop reconciling. The actual failing log part:

The question is: should autoneg tolerate missing network endpoint groups in some, but not all, of the available zones?