AWX web/task pod not launching correctly - "Waiting for database migrations..." and cannot execute awx-manage commands "connection is bad: Name or service not known" #1636

Open
3 tasks done
containerckf opened this issue Nov 14, 2023 · 8 comments

Comments

@containerckf

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

When installing a recent version of the AWX Operator (2.7.1, and others) on a v1.28 EKS cluster, AWX does not initialize correctly.

The web pod gets stuck in a loop trying to complete the database migration, even though the deployment is fresh. Running 'awx-manage' commands from the pod fails with "connection is bad: Name or service not known" errors.

All the pods are in Running state, but the web pod target shows as Unhealthy and when trying to access the ALB endpoint, the AWX interface does not come up. This was verified by port forwarding the web pod and hitting the IP directly, confirming the ALB was routing correctly.

AWX Operator version

2.7.1

AWX version

23.3.1

Kubernetes platform

kubernetes

Kubernetes/Platform version

1.28

Modifications

no

Steps to reproduce

Installation is performed with a "kustomization" file and an ingress YAML, as outlined here.

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

generatorOptions:
  disableNameSuffixHash: true

secretGenerator:
  - name: awx-postgres-configuration
    type: Opaque
    literals:
      - host=awx-postgres
      - port=5432
      - database=awx
      - username=awx
      - password=Ansible123!
      - type=managed

  - name: awx-admin-password
    type: Opaque
    literals:
      - password=Ansible123!

resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.7.1
  - awx-ingress.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.7.1

# Specify a custom namespace in which to install AWX
namespace: awx

awx-ingress.yaml

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:

  admin_user: admin
  admin_password_secret: awx-admin-password

  ingress_type: ingress
  ingress_path: "/"
  ingress_path_type: Prefix
  hostname: awx.dev.compucom.io
  ingress_annotations: |
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.redirect: "{\"Type\": \"redirect\", \"RedirectConfig\": {\"Protocol\": \"HTTPS\", \"Port\": \"443\", \"StatusCode\": \"HTTP_301\"}}"
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/XXX
    alb.ingress.kubernetes.io/load-balancer-attributes: "idle_timeout.timeout_seconds=360"

  postgres_configuration_secret: awx-postgres-configuration

These files are deployed with the command:

$ kubectl apply -k .

Expected results

Database to initialize / web service to become ready (connect to AWX service via ALB to target pod running in EKS)

Actual results

  1. The AWX web container never starts properly. The following is seen in the logs:
kubectl -n awx logs awx-web-7b9777b649-5sw8g
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
[wait-for-migrations] Waiting 0.5 seconds before next attempt
[wait-for-migrations] Attempt 2 of 30
[wait-for-migrations] Waiting 1 seconds before next attempt
[wait-for-migrations] Attempt 3 of 30
...
"playbook task failed"
{"level":"error","ts":"2023-09-19T12:21:02Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"3fbdfdb3-552a-47bd-a28a-f1dc6c1d9d9e","error":"event runner on failed","stacktrace"
  2. The target (the AWX web pod) shows as Unhealthy in the AWS Console. The pod is Running but cannot "initialize" the database; there is no existing database to migrate, because these are fresh installs every time.

Additional information

  1. A similar issue here mentioned an awx-manage command as a solution. We are able to exec into the container, but awx-manage migrate --noinput yields the following output:
django.db.utils.OperationalError: connection is bad: Name or service not known
psycopg.OperationalError: connection is bad: Name or service not known
  2. We performed the steps from a related GitHub issue (setting the psql awx user password), but this did not resolve anything.

Why can't AWX initialize correctly? We also verified that the packages named above were present. What could be inhibiting the connection?

Operator Logs

No response

@containerckf changed the title on Nov 14, 2023
from: AWX web/task pod not launching correctly - "Waiting for database migrations..." | cannot execute awx-manage commands either "psycopg.OperationalError: connection is bad: Name or service not known"
to: AWX web/task pod not launching correctly - "Waiting for database migrations..." and cannot execute awx-manage commands "connection is bad: Name or service not known"
@fosterseth
Member

django.db.utils.OperationalError: connection is bad: Name or service not known

It seems you have connectivity issues to your DB, so you may need to take some debugging steps to see why the connections are failing.

It looks like you are setting up an internal database (running as a pod inside the same cluster as awx-task/web). Is that right? If so, you shouldn't need to set postgres_configuration_secret. Does it work without setting the postgres configuration?
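
One possible debugging step is to check whether the database host name resolves at all from inside the web or task container. "Name or service not known" is typically the resolver's message for a failed name lookup, so the failure would happen before any Postgres connection is attempted. The following is a minimal sketch, not part of AWX; `can_resolve` is a hypothetical helper name, and 5432 is the port from the secret in this issue.

```python
import socket


def can_resolve(host: str, port: int = 5432) -> bool:
    """Return True if `host` resolves, False if name resolution fails.

    Hypothetical helper for illustration only; it performs a DNS lookup
    and does not open a connection to the database.
    """
    try:
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror as exc:
        # A failed lookup is reported here, e.g. "Name or service not known".
        print(f"cannot resolve {host!r}: {exc}")
        return False
```

For example, calling `can_resolve("awx-postgres")` from a Python shell in the pod would test the host value used in the secret above. If it returns False, the host name in the secret does not match any Service that cluster DNS knows about.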

@sasvari-attila-bosch

I am experiencing a very similar issue when installing AWX with operator version 2.18.0, although my setup is somewhat different.

I have my Postgres in Azure behind a VNet. I mount a custom /etc/resolv.conf through ee_extra_volume_mounts, task_extra_volume_mounts, init_container_extra_volume_mounts, etc. From there (e.g. awx-web, awx-task) my Azure Postgres Flexi server is reachable.

However, the AWX migration Job is not configured to use it, and its pods apparently can't resolve the address of my Postgres (django.db.utils.OperationalError: [Errno -2] Name or service not known).

Is there a way to configure the migration Job to use Azure's DNS resolver?

@vpelagatti

@sasvari-attila-bosch, have you solved this issue? I'm facing the same problem.

@sasvari-attila-bosch

@vpelagatti, I wasn't able to resolve it using version 2.18.0, and I haven't tried with 2.19.*.

@kcjones91

Having the same issue in 2.19. In the past, I used to see the postgres configuration mounted to the task and web pods. This time I do not. I see redis secrets mounted instead?

@malovata

I have the same problem. Were you able to resolve it?

@mihaipuha

I am encountering the same issue.
#1966 - This may be the same problem

@mihaipuha

On my side, this was caused by a mistake in the generated secret ('*-postgres-configuration'). The host in it was awx-postgres-13 instead of awx-postgres-15, and because of this the init-database initContainer was throwing "Name or service not known".
You can log into the container and check the /etc/tower/conf.d/credentials.py file, which is used by awx-check, which in turn is used by the wait-for-migrations script.
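
The check described above can be sketched in a few lines of Python. This is a rough illustration rather than an AWX tool: it assumes credentials.py follows the usual Django settings layout (a `DATABASES` dict with a `default` entry containing `HOST`), which is not confirmed in this thread, and the function names are hypothetical.

```python
def database_host(settings_source: str) -> str:
    """Return DATABASES['default']['HOST'] from Django-style settings source.

    Assumption: the settings text defines DATABASES in the standard
    Django layout. The source is executed, so only use it on trusted files.
    """
    namespace: dict = {}
    exec(settings_source, namespace)
    return namespace["DATABASES"]["default"]["HOST"]


def database_host_from_file(
    path: str = "/etc/tower/conf.d/credentials.py",
) -> str:
    """Read the settings file at `path` and return its configured DB host."""
    with open(path, encoding="utf-8") as handle:
        return database_host(handle.read())
```

Running `database_host_from_file()` inside the container would print the host the pod is actually configured with, which can then be compared with the name of the Postgres Service in the namespace (for example awx-postgres-13 versus awx-postgres-15, as in this comment).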
