MLCOMPUTE-2001 | Cleanup spark related logs from setup_tron_namespace #3979
Conversation
  default_spark_pool = "batch"
- valid_clusters = ["spark-pnw-prod", "pnw-devc"]
+ valid_clusters = ["pnw-devc-spark", "pnw-prod-spark"]
To align the default values with the recent puppet changes for pool validation rules
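For illustration, a minimal sketch of how these defaults might feed a pool/cluster validation check. The constant values mirror the diff above; the function and error handling are assumed, not taken from the PR:

```python
# Hypothetical sketch only: the values come from the config above, the rest is assumed.
DEFAULT_SPARK_POOL = "batch"
VALID_CLUSTERS = ["pnw-devc-spark", "pnw-prod-spark"]

def validate_spark_cluster(cluster: str, pool: str = DEFAULT_SPARK_POOL) -> None:
    """Reject jobs that target a cluster outside the allowed list."""
    if cluster not in VALID_CLUSTERS:
        raise ValueError(
            f"Cluster {cluster!r} (pool {pool!r}) is not one of {VALID_CLUSTERS}"
        )
```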
@@ -224,7 +224,7 @@ def main():
             # since we need to print out what failed in either case
             failed.append(service)

-    if args.bulk_config_fetch:
+    if args.dry_run and args.bulk_config_fetch:
I guess we should also skip Tron API calls here in dry-run mode?
good catch!
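A short sketch of the dry-run pattern being discussed. Only the flag names come from the diff; the helper and the Tron client call are hypothetical:

```python
import logging

log = logging.getLogger(__name__)

def sync_namespace(client, service: str, new_config: dict, dry_run: bool) -> None:
    """Hypothetical helper: only mutate Tron when not in dry-run mode."""
    if dry_run:
        log.info("Dry run: skipping Tron API update for %s", service)
        return
    client.update_namespace(service, new_config)  # hypothetical Tron client call
```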
] = spark_tools.SPARK_DNS_POD_TEMPLATE
spark_conf[
    "spark.kubernetes.executor.podTemplateFile"
] = spark_tools.SPARK_DNS_POD_TEMPLATE
Fix: we should set/overwrite the pod template file (for k8s DNS mode, etc.) whether or not a pod template has already been set.
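A sketch of the before/after behaviour this fix describes. The `setdefault` "before" is assumed from the comment, and the template path is made up; only the Spark config keys and the unconditional assignment match the diff:

```python
SPARK_DNS_POD_TEMPLATE = "/etc/spark/dns_pod_template.yaml"  # assumed path

def apply_dns_pod_template(spark_conf: dict) -> dict:
    # Before (assumed): spark_conf.setdefault(...) only set the template when
    # the user had not provided one. After: overwrite unconditionally so the
    # DNS-aware template always wins.
    spark_conf["spark.kubernetes.driver.podTemplateFile"] = SPARK_DNS_POD_TEMPLATE
    spark_conf["spark.kubernetes.executor.podTemplateFile"] = SPARK_DNS_POD_TEMPLATE
    return spark_conf
```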
def auto_add_timeout_for_spark_job(
    cmd: str, timeout_job_runtime: str, silent: bool = False
) -> str:
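For context, a rough sketch of what a helper with this signature plausibly does. The body below is inferred, not the actual implementation; only the signature comes from the PR:

```python
import logging

log = logging.getLogger(__name__)

def auto_add_timeout_for_spark_job(
    cmd: str, timeout_job_runtime: str, silent: bool = False
) -> str:
    """Sketch only: prefix a spark-submit command with coreutils `timeout`."""
    if cmd.startswith("spark-submit") and not cmd.startswith("timeout"):
        cmd = f"timeout {timeout_job_runtime} {cmd}"
        if not silent:  # the flag this PR threads through to mute noisy logs
            log.info("Added timeout to Spark command: %s", cmd)
    return cmd
```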
fwiw: for spark-drivers-on-k8s we should use the tron max_runtime config directly and get rid of this code entirely :)
Thanks for bringing this up. I think we can move this into our custom spark-submit wrapper later, so it can be more easily monitored from the Spark side, managed by the Spark configuration service and auto-tuner, and kept consistent across the different types of Spark deployments (Tron, ad-hoc, Jupyter). (The current implementation also respects the max_runtime tronfig from the caller side.)
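For reference, a hypothetical tronfig snippet showing the max_runtime approach suggested above. Only the max_runtime field name comes from this thread; the job name, schedule, and values are illustrative:

```yaml
# Hypothetical tronfig sketch; values are made up for illustration.
jobs:
  example_spark_job:
    schedule: "daily 04:00:00"
    max_runtime: 12h          # Tron bounds the job runtime directly
    actions:
      run:
        command: spark-submit example_job.py
```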
Includes the changes from Yelp/service_configuration_lib#151, adds an option to mute some logs when calling functions from setup_tron_namespace, and fixes some issues.
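A minimal sketch of the log-muting option described here. The function name and config shape are hypothetical; only the `silent` flag pattern appears in the diffs above:

```python
import logging

log = logging.getLogger(__name__)

def build_spark_conf(job_config: dict, silent: bool = False) -> dict:
    """Hypothetical example: callers like setup_tron_namespace pass silent=True."""
    spark_conf = dict(job_config.get("spark_args", {}))
    if not silent:
        log.info("Resolved Spark conf: %s", spark_conf)
    return spark_conf
```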