Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DNM] Add kubernetes backend #589

Closed

Conversation

lilyminium
Copy link
Contributor

@lilyminium lilyminium commented Dec 4, 2024

Add trial Dask Kubernetes addition. Do not merge -- this is just a demo PR for easy diffing.

cc @mattwthompson -- thanks so much for having a look at this 🙏

Comment on lines +246 to +260
if len(args) >= 2:
# schema is the second argument
# awful temporary terribad hack
schema_json = args[1]
if (
'".allow_gpu_platforms": true' in schema_json
or "energy_minimisation" in schema_json
):
resources["GPU"] = 0.5
resources["notGPU"] = 0
else:
resources["GPU"] = 0
resources["notGPU"] = 1
kwargs["resources"] = resources
logger.info(f"Annotating resources: {resources}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly hacky hard-coding that I did to pick up OpenMM minimization and simulation jobs. It's probably not the best solution for a number of reasons.

An alternative solution for identifying which protocols to divert to gpu/cpu workers include modifying somewhere upstream with direct access to the protocol (I think specifically https://github.com/openforcefield/openff-evaluator/blob/main/openff/evaluator/workflow/protocols.py#L1036-L1045) to either: a) pick up the Protocol by type, which could be hardcoded in a list, or b) add the allow_gpu_platforms attribute to the EnergyMinimisation protocol and look for that.

An alternative solution for choosing which workers to divert the protocol to is keeping the actual worker IDs in a dict somewhere and specifying them using the workers kwarg of Client.submit (https://distributed.dask.org/en/stable/api.html#distributed.Client.submit). This avoids hardcoding the very quickly-named custom resources of GPU and notGPU.

However a pro of this custom resources solution is it's fairly isolated to just the DaskKubernetesBackend and shouldn't modify the existing behaviour with HPC workers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the resources being used per worker -- they can be arbitrarily named but I think need to be numerical. I was setting GPU=0.5 in the protocols and GPU=1 on the worker to try to get two protocols to execute at once on the same GPU worker, but not sure it was more efficient.

Comment on lines +445 to +447
self._cluster = KubeCluster(
namespace=self._namespace, custom_cluster_spec=spec, **self._cluster_kwargs
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do: a PVC would need to be started already for the connected Dask cluster to work. This can be done via the Kubernetes API, although I currently do it separately to the DaskKubernetesBackend. Should it be migrated into this class to autostart the pvc?

self._namespace = namespace
self._annotate_resources = annotate_resources
self._image = image
self._other_resources = other_resources
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is currently only GPU/CPU we could also just skip this and set self._cpu_resources_per_worker=... etc.

Comment on lines +65 to +70
secret = KubernetesSecret(
name="openeye-license",
secret_name="oe-license-feb-2024",
mount_path="/secrets/oe_license.txt",
sub_path="oe_license.txt",
)
Copy link
Contributor Author

@lilyminium lilyminium Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very narrow view of a secret, which can be done in a few ways but for an OpenEye license as a file is easiest... this takes a previously-configured k8s secret called oe-license-feb-2024 and mounts it at the /secrets/oe_license.txt path to make it available

Copy link

codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 71.48438% with 73 lines in your changes missing coverage. Please review.

Project coverage is 87.28%. Comparing base (e406ec9) to head (b83bc2f).
Report is 11 commits behind head on main.

Additional details and impacted files

Copy link
Member

@mattwthompson mattwthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have a couple of paragraphs of prose (don't care where, could be in docs or just a comment here) going through the basics of the design and how to use it on a K8 cluster. Otherwise LGTM & please let me know when you consider it complete enough to merge!

@lilyminium
Copy link
Contributor Author

lilyminium commented Jan 15, 2025

Added documentation (previewable here: https://openff-docs--589.org.readthedocs.build/projects/evaluator/en/589/backends/daskkubernetesbackend.html) and an example run in a directory, which may (?) be worth linking to in the docs itself. I'm not sure why the links to the new classes in the rendered docs aren't working, unless it's because they're new in a branch.

Also added documentation in docs/examples/kubernetes-run.

@lilyminium lilyminium marked this pull request as ready for review January 15, 2025 13:34
@mattwthompson mattwthompson mentioned this pull request Jan 15, 2025
1 task
@mattwthompson
Copy link
Member

Merged as #597 just to get keep things green. Awesome work @lilyminium!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants