-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DNM] Add kubernetes backend #589
[DNM] Add kubernetes backend #589
Conversation
for more information, see https://pre-commit.ci
if len(args) >= 2: | ||
# schema is the second argument | ||
# awful temporary terribad hack | ||
schema_json = args[1] | ||
if ( | ||
'".allow_gpu_platforms": true' in schema_json | ||
or "energy_minimisation" in schema_json | ||
): | ||
resources["GPU"] = 0.5 | ||
resources["notGPU"] = 0 | ||
else: | ||
resources["GPU"] = 0 | ||
resources["notGPU"] = 1 | ||
kwargs["resources"] = resources | ||
logger.info(f"Annotating resources: {resources}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fairly hacky hard-coding that I did to pick up OpenMM minimization and simulation jobs. It's probably not the best solution for a number of reasons.
An alternative solution for identifying which protocols to divert to gpu/cpu workers include modifying somewhere upstream with direct access to the protocol (I think specifically https://github.com/openforcefield/openff-evaluator/blob/main/openff/evaluator/workflow/protocols.py#L1036-L1045) to either: a) pick up the Protocol by type, which could be hardcoded in a list, or b) add the allow_gpu_platforms
attribute to the EnergyMinimisation protocol and look for that.
An alternative solution for choosing which workers to divert the protocol to is keeping the actual worker IDs in a dict somewhere and specifying them using the workers
kwarg of Client.submit
(https://distributed.dask.org/en/stable/api.html#distributed.Client.submit). This avoids hardcoding the very quickly-named custom resources of GPU
and notGPU
.
However a pro of this custom resources solution is it's fairly isolated to just the DaskKubernetesBackend and shouldn't modify the existing behaviour with HPC workers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the resources being used per worker -- they can be arbitrarily named but I think need to be numerical. I was setting GPU=0.5
in the protocols and GPU=1
on the worker to try to get two protocols to execute at once on the same GPU worker, but not sure it was more efficient.
self._cluster = KubeCluster( | ||
namespace=self._namespace, custom_cluster_spec=spec, **self._cluster_kwargs | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To do: a PVC would need to be started already for the connected Dask cluster to work. This can be done via the Kubernetes API, although I currently do it separately to the DaskKubernetesBackend. Should it be migrated into this class to autostart the pvc?
self._namespace = namespace | ||
self._annotate_resources = annotate_resources | ||
self._image = image | ||
self._other_resources = other_resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is currently only GPU/CPU we could also just skip this and set self._cpu_resources_per_worker=...
etc.
secret = KubernetesSecret( | ||
name="openeye-license", | ||
secret_name="oe-license-feb-2024", | ||
mount_path="/secrets/oe_license.txt", | ||
sub_path="oe_license.txt", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very narrow view of a secret, which can be done in a few ways but for an OpenEye license as a file is easiest... this takes a previously-configured k8s secret called oe-license-feb-2024
and mounts it at the /secrets/oe_license.txt
path to make it available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to have a couple of paragraphs of prose (don't care where, could be in docs or just a comment here) going through the basics of the design and how to use it on a K8 cluster. Otherwise LGTM & please let me know when you consider it complete enough to merge!
Added documentation (previewable here: https://openff-docs--589.org.readthedocs.build/projects/evaluator/en/589/backends/daskkubernetesbackend.html) and an example run in a directory, which may (?) be worth linking to in the docs itself. I'm not sure why the links to the new classes in the rendered docs aren't working, unless it's because they're new in a branch. Also added documentation in |
Merged as #597 just to get keep things green. Awesome work @lilyminium! |
Add trial Dask Kubernetes addition. Do not merge -- this is just a demo PR for easy diffing.
cc @mattwthompson -- thanks so much for having a look at this 🙏