From e7b191a313fc927721f49d92dc5315b6043e8775 Mon Sep 17 00:00:00 2001
From: Robert Bailey
Date: Tue, 30 Apr 2024 14:30:40 -0700
Subject: [PATCH] Cherry-pick #635 to release-1.1 branch (#637)

Update RAG to use Autopilot by default (#635)

Remove DNS troubleshooting information, as this has been patched.

Co-authored-by: artemvmin
---
 applications/rag/README.md        | 36 +++++++++++++------------------
 applications/rag/metadata.yaml    |  4 ++--
 applications/rag/variables.tf     |  2 +-
 applications/rag/workloads.tfvars |  2 +-
 4 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/applications/rag/README.md b/applications/rag/README.md
index 70ee33b97..d6e2612db 100644
--- a/applications/rag/README.md
+++ b/applications/rag/README.md
@@ -31,20 +31,17 @@ Install the following on your computer:
 
 ### Bring your own cluster (optional)
 
-By default, this tutorial creates a Standard cluster on your behalf. We highly recommend following the default settings.
+By default, this tutorial creates a cluster on your behalf. We highly recommend following the default settings.
 
 If you prefer to manage your own cluster, set `create_cluster = false` in the [Installation section](#installation). Creating a long-running cluster may be better for development, allowing you to iterate on Terraform components without recreating the cluster every time.
 
-Use the provided infrastructue module to create a cluster:
-
-1. `cd ai-on-gke/infrastructure`
-
-2. Edit `platform.tfvars` to set your project ID, location and cluster name. The other fields are optional. Ensure you create an L4 nodepool as this tutorial requires it.
-
-3. Run `terraform init`
-
-4. Run `terraform apply --var-file workloads.tfvars`
+Use gcloud to create a GKE Autopilot cluster. Note that RAG requires the latest Autopilot features, available on the latest versions of 1.28 and 1.29.
+```
+gcloud container clusters create-auto rag-cluster \
+  --location us-central1 \
+  --cluster-version 1.28
+```
 
 ### Bring your own VPC (optional)
 
 By default, this tutorial creates a new network on your behalf with [Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect) already enabled. We highly recommend following the default settings.
@@ -64,10 +61,11 @@ This section sets up the RAG infrastructure in your GCP project using Terraform.
 
 1. `cd ai-on-gke/applications/rag`
 
 2. Edit `workloads.tfvars` to set your project ID, location, cluster name, and GCS bucket name. Ensure the `gcs_bucket` name is globally unique (add a random suffix). Optionally, make the following changes:
-    * (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
     * (Recommended) [Enable authenticated access](#configure-authenticated-access-via-iap) for JupyterHub, frontend chat and Ray dashboard services.
-    * (Not recommended) Set `create_cluster = false` if you bring your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled.
-    * (Not recommended) Set `create_network = false` if you bring your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
+    * (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
+    * (Optional) Set `autopilot_cluster = false` to deploy using GKE Standard.
+    * (Optional) Set `create_cluster = false` if you are bringing your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled. You can simplify setup by following the Terraform instructions in [`infrastructure/README.md`](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/infrastructure/README.md).
+    * (Optional) Set `create_network = false` if you are bringing your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
 
 3. Run `terraform init`
@@ -193,17 +191,13 @@ Connect to the GKE cluster:
 gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
 ```
 
-1. Troubleshoot JupyterHub job failures:
-    - If the JupyterHub job fails to start the proxy with error code 599, it is likely an known issue with Cloud DNS, which occurs when a cluster is quickly deleted and recreated with the same name.
-    - Recreate the cluster with a different name or wait several minutes after running `terraform destroy` before running `terraform apply`.
-
-2. Troubleshoot Ray job failures:
+1. Troubleshoot Ray job failures:
     - If the Ray actors fail to be scheduled, it could be due to a stockout or quota issue.
         - Run `kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=kuberay`. There should be a Ray head and Ray worker pod in `Running` state. If your ray pods aren't running, it's likely due to quota or stockout issues. Check that your project and selected `cluster_location` have L4 GPU capacity.
     - Often, retrying the Ray job submission (the last cell of the notebook) helps.
     - The Ray job may take 15-20 minutes to run the first time due to environment setup.
 
-3. Troubleshoot IAP login issues:
+2. Troubleshoot IAP login issues:
     - Verify the cert is Active:
         - For JupyterHub `kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'`
         - For the frontend: `kubectl get managedcertificates frontend-managed-cert -n rag --output jsonpath='{.status.domainStatus[0].status}'`
@@ -213,7 +207,7 @@ gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_L
     - Org error:
        - The [OAuth Consent Screen](https://developers.google.com/workspace/guides/configure-oauth-consent#configure_oauth_consent) has `User type` set to `Internal` by default, which means principals external to the org your project is in cannot log in. To add external principals, change `User type` to `External`.
 
-4. Troubleshoot `terraform apply` failures:
+3. Troubleshoot `terraform apply` failures:
     - Inference server (`mistral`) fails to deploy:
         - This usually indicates a stockout/quota issue. Verify your project and chosen `cluster_location` have L4 capacity.
     - GCS bucket already exists:
@@ -221,6 +215,6 @@ gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_L
     - Cloud SQL instance already exists:
        - Ensure the `cloudsql_instance` name doesn't already exist in your project.
 
-5. Troubleshoot `terraform destroy` failures:
+4. Troubleshoot `terraform destroy` failures:
     - Network deletion issue:
        - `terraform destroy` fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
diff --git a/applications/rag/metadata.yaml b/applications/rag/metadata.yaml
index c79af0b46..5b240ad12 100644
--- a/applications/rag/metadata.yaml
+++ b/applications/rag/metadata.yaml
@@ -28,8 +28,8 @@ spec:
       varType: string
       defaultValue: "created-by=gke-ai-quick-start-solutions,ai.gke.io=rag"
     - name: autopilot_cluster
-      varType: string
-      defaultValue: false
+      varType: bool
+      defaultValue: true
     - name: iap_consent_info
       description: Configure the OAuth Consent Screen for your project. Ensure User type is set to Internal. Note that by default, only users within your organization can be allowlisted. To add external users, change the User type to External after the application is deployed.
       varType: bool
diff --git a/applications/rag/variables.tf b/applications/rag/variables.tf
index 1d57d27b7..b9bc5ba8e 100644
--- a/applications/rag/variables.tf
+++ b/applications/rag/variables.tf
@@ -319,7 +319,7 @@ variable "private_cluster" {
 
 variable "autopilot_cluster" {
   type    = bool
-  default = false
+  default = true
 }
 
 variable "cloudsql_instance" {
diff --git a/applications/rag/workloads.tfvars b/applications/rag/workloads.tfvars
index a0218fb7c..243597e3e 100644
--- a/applications/rag/workloads.tfvars
+++ b/applications/rag/workloads.tfvars
@@ -20,7 +20,7 @@ subnetwork_cidr = "10.100.0.0/16"
 create_cluster   = true # Creates a GKE cluster in the specified network.
 cluster_name     = ""
 cluster_location = "us-central1"
-autopilot_cluster = false
+autopilot_cluster = true
 private_cluster   = false
 
 ## GKE environment variables