MLCOMPUTE-2001 | Cleanup spark related logs from setup_tron_namespace #3979
Conversation
  default_spark_pool = "batch"
- valid_clusters = ["spark-pnw-prod", "pnw-devc"]
+ valid_clusters = ["pnw-devc-spark", "pnw-prod-spark"]
To align the default values with the recent puppet changes for pool validation rules
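For illustration, a minimal sketch of how these defaults might feed a pool/cluster validation check. The constant values mirror the diff above; the function and error handling are assumed, not taken from the PR:

```python
# Hypothetical sketch only: the values come from the config above, the rest is assumed.
DEFAULT_SPARK_POOL = "batch"
VALID_CLUSTERS = ["pnw-devc-spark", "pnw-prod-spark"]

def validate_spark_cluster(cluster: str, pool: str = DEFAULT_SPARK_POOL) -> None:
    """Reject jobs that target a cluster outside the allowed list."""
    if cluster not in VALID_CLUSTERS:
        raise ValueError(
            f"Cluster {cluster!r} (pool {pool!r}) is not one of {VALID_CLUSTERS}"
        )
```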
@@ -224,7 +224,7 @@ def main():
             # since we need to print out what failed in either case
             failed.append(service)

-    if args.bulk_config_fetch:
+    if args.dry_run and args.bulk_config_fetch:
I guess we should also skip Tron API calls here in dry-run mode?
good catch!
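A short sketch of the dry-run pattern being discussed. Only the flag names come from the diff; the helper and the Tron client call are hypothetical:

```python
import logging

log = logging.getLogger(__name__)

def sync_namespace(client, service: str, new_config: dict, dry_run: bool) -> None:
    """Hypothetical helper: only mutate Tron when not in dry-run mode."""
    if dry_run:
        log.info("Dry run: skipping Tron API update for %s", service)
        return
    client.update_namespace(service, new_config)  # hypothetical Tron client call
```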
] = spark_tools.SPARK_DNS_POD_TEMPLATE
spark_conf[
    "spark.kubernetes.executor.podTemplateFile"
] = spark_tools.SPARK_DNS_POD_TEMPLATE
Fix: we should set/overwrite the pod template file (for k8s DNS mode, etc.) whether or not a pod template has already been set.
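A sketch of the before/after behaviour this fix describes. The `setdefault` "before" is assumed from the comment, and the template path is made up; only the Spark config keys and the unconditional assignment match the diff:

```python
SPARK_DNS_POD_TEMPLATE = "/etc/spark/dns_pod_template.yaml"  # assumed path

def apply_dns_pod_template(spark_conf: dict) -> dict:
    # Before (assumed): spark_conf.setdefault(...) only set the template when
    # the user had not provided one. After: overwrite unconditionally so the
    # DNS-aware template always wins.
    spark_conf["spark.kubernetes.driver.podTemplateFile"] = SPARK_DNS_POD_TEMPLATE
    spark_conf["spark.kubernetes.executor.podTemplateFile"] = SPARK_DNS_POD_TEMPLATE
    return spark_conf
```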
def auto_add_timeout_for_spark_job(
    cmd: str, timeout_job_runtime: str, silent: bool = False
) -> str:
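For context, a rough sketch of what a helper with this signature plausibly does. The body below is inferred, not the actual implementation; only the signature comes from the PR:

```python
import logging

log = logging.getLogger(__name__)

def auto_add_timeout_for_spark_job(
    cmd: str, timeout_job_runtime: str, silent: bool = False
) -> str:
    """Sketch only: prefix a spark-submit command with coreutils `timeout`."""
    if cmd.startswith("spark-submit") and not cmd.startswith("timeout"):
        cmd = f"timeout {timeout_job_runtime} {cmd}"
        if not silent:  # the flag this PR threads through to mute noisy logs
            log.info("Added timeout to Spark command: %s", cmd)
    return cmd
```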
fwiw: for spark-drivers-on-k8s we should use the tron max_runtime config directly and get rid of this code entirely :)
Thanks for bringing this up. I think we can move this into our custom spark-submit wrapper later, so it can be more easily monitored from the Spark side, managed by the Spark configuration service and auto-tuner, and kept consistent across the different types of Spark deployments (Tron, ad-hoc, Jupyter). (The current implementation also respects the max_runtime tronfig from the caller side.)
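For reference, a hypothetical tronfig snippet showing the max_runtime approach suggested above. Only the max_runtime field name comes from this thread; the job name, schedule, and values are illustrative:

```yaml
# Hypothetical tronfig sketch; values are made up for illustration.
jobs:
  example_spark_job:
    schedule: "daily 04:00:00"
    max_runtime: 12h          # Tron bounds the job runtime directly
    actions:
      run:
        command: spark-submit example_job.py
```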
Includes the changes from Yelp/service_configuration_lib#151, adds an option to mute some logs when calling functions from setup_tron_namespace, and fixes some issues.
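A minimal sketch of the log-muting option described here. The function name and config shape are hypothetical; only the `silent` flag pattern appears in the diffs above:

```python
import logging

log = logging.getLogger(__name__)

def build_spark_conf(job_config: dict, silent: bool = False) -> dict:
    """Hypothetical example: callers like setup_tron_namespace pass silent=True."""
    spark_conf = dict(job_config.get("spark_args", {}))
    if not silent:
        log.info("Resolved Spark conf: %s", spark_conf)
    return spark_conf
```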