Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement task restart policies #280

Open
wants to merge 55 commits into
base: main
Choose a base branch
from

Conversation

ianmkenney
Copy link
Member

closes #277

* Test: test_add_task_restart_policy_patterns
* Test: test_get_task_restart_policy_patterns
* Test: test_remove_task_restart_policy_patterns
* Test: test_clear_task_restart_policy_patterns
* Test: test_task_resolve_restarts
* TaskRestartPattern
* TaskRestartPolicy
* TaskHistory
* Removed TaskRestartPolicy and TaskHistory
* Added Traceback
* TaskReturnPattern: Confirm that the input pattern is a string type and that it is not empty.
* Traceback: Confirm that the input is a list of strings and that none of them are empty.
@ianmkenney ianmkenney force-pushed the feature/iss-277-restart-policy branch from 7e82f54 to 6a167f1 Compare July 18, 2024 20:46
ianmkenney and others added 24 commits July 22, 2024 14:24
Similar to `TaskHub`s, the `TaskRestartPattern` needs additonal hashed
data to uniquely identify it as a Neo4j node (via the gufe key). The
unit tests have been updated to reflect this change.
`statestore` methods have been added to modify the database state:

* add_task_restart_patterns
* remove_task_restart_patterns
* get_task_restart_patterns

Tests were added for each method in the integration tests for the
statestore.
The `add_task_restart_patterns` method now establishes the APPLIES
relationship between the each new pattern and all Tasks ACTIONED
on the corresponding TaskHub.

Added testing for creation of the APPLIES relationship, asserting
the number of created connections over multiple TaskHubs and Tasks.

Further subdivided the test classes.

Additionally added a `set_task_restart_patterns_max_retries` method
for updating the max_retries of a TaskRestartPattern.
"actioning" a Task on a TaskHub with preexisting TaskRestartPatterns
created the APPLIES relationship between them with a num_retries value
of 0. This behavior is tested in the test_action_task function in
the statestore.
When an actioned Task is canceled and also has an APPLIES relationship
with a TaskRestartPattern, APPLIES is removed between the two
nodes.

Removed org, project, and campaign fields since they are not
necessary for the APPLIES relationship.
Setting an actioned Task status to the following statuses now removes
the APPLIES relationship from attached TaskRestartPatterns:

* complete
* invalid
* deleted

NOTE: tests have not been added for this yet
Confirming that changing the status of an actioned Task to any of
the following removes the APPLIES relationship:

* complete
* invalid
* deleted
New statestore method placeholders:
  - add_task_traceback
  - resolve_task_restarts

The compute api will add a Task Traceback and resolve restarts for
returned failed Tasks.

When a list of restart patterns are added, restarts are resolved.
* Renamed add_task_traceback to add_protocol_dag_result_ref_traceback
* Added tests for add_protocol_dag_result_ref_traceback
Implemented half of the resolve_task_restarts test
With this decorator, if a transaction isn't passed as a keyword arg, one
is automatically created (and closed). This allows a chaining behavior
where many method calls share a single transaction object.
* Removed custom tokenization
* Implemented _defaults to allow default tokenization to work
cancel_map has been changed from a defaultdict to a base dict and
instead using the dict.get method to return None. Additionally added a
set of all task/taskhub pairs that is later used to determine what
should be canceled.

I've also added grouping on taskhubs so the number of calls to
cancel_tasks is minimized.
…olicy_resolve_restarts

Restart policy: resolve restarts
@pep8speaks
Copy link

pep8speaks commented Sep 19, 2024

Hello @ianmkenney! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 962:80: E501 line too long (81 > 79 characters)
Line 963:80: E501 line too long (83 > 79 characters)
Line 975:80: E501 line too long (81 > 79 characters)
Line 1004:80: E501 line too long (81 > 79 characters)

Line 153:80: E501 line too long (92 > 79 characters)
Line 157:80: E501 line too long (94 > 79 characters)
Line 165:80: E501 line too long (87 > 79 characters)
Line 174:80: E501 line too long (81 > 79 characters)
Line 209:80: E501 line too long (84 > 79 characters)
Line 217:80: E501 line too long (90 > 79 characters)
Line 218:80: E501 line too long (85 > 79 characters)

Line 1261:80: E501 line too long (87 > 79 characters)
Line 1266:80: E501 line too long (92 > 79 characters)
Line 1280:80: E501 line too long (82 > 79 characters)
Line 1472:80: E501 line too long (97 > 79 characters)
Line 1475:80: E501 line too long (117 > 79 characters)
Line 1490:80: E501 line too long (94 > 79 characters)
Line 1667:80: E501 line too long (92 > 79 characters)
Line 1668:80: E501 line too long (123 > 79 characters)
Line 1674:80: E501 line too long (87 > 79 characters)
Line 1681:80: E501 line too long (81 > 79 characters)
Line 2488:80: E501 line too long (98 > 79 characters)
Line 2504:80: E501 line too long (82 > 79 characters)
Line 2856:5: E266 too many leading '#' for block comment
Line 2862:80: E501 line too long (105 > 79 characters)
Line 2878:9: E266 too many leading '#' for block comment
Line 2889:80: E501 line too long (98 > 79 characters)
Line 2915:80: E501 line too long (82 > 79 characters)
Line 2917:80: E501 line too long (88 > 79 characters)
Line 2942:80: E501 line too long (83 > 79 characters)
Line 2948:80: E501 line too long (84 > 79 characters)
Line 2952:80: E501 line too long (99 > 79 characters)
Line 2957:80: E501 line too long (81 > 79 characters)
Line 2968:80: E501 line too long (99 > 79 characters)
Line 2980:80: E501 line too long (84 > 79 characters)
Line 2987:80: E501 line too long (105 > 79 characters)
Line 3009:80: E501 line too long (87 > 79 characters)
Line 3012:80: E501 line too long (84 > 79 characters)
Line 3016:80: E501 line too long (145 > 79 characters)
Line 3019:80: E501 line too long (117 > 79 characters)
Line 3037:80: E501 line too long (82 > 79 characters)
Line 3043:80: E501 line too long (87 > 79 characters)
Line 3059:80: E501 line too long (125 > 79 characters)
Line 3075:80: E501 line too long (81 > 79 characters)
Line 3080:80: E501 line too long (82 > 79 characters)
Line 3086:80: E501 line too long (85 > 79 characters)
Line 3087:80: E501 line too long (140 > 79 characters)
Line 3099:80: E501 line too long (80 > 79 characters)
Line 3107:80: E501 line too long (145 > 79 characters)

Line 192:80: E501 line too long (83 > 79 characters)
Line 208:80: E501 line too long (85 > 79 characters)
Line 209:80: E501 line too long (83 > 79 characters)

Line 1110:80: E501 line too long (89 > 79 characters)
Line 1112:9: E266 too many leading '#' for block comment
Line 1112:80: E501 line too long (91 > 79 characters)
Line 1113:9: E266 too many leading '#' for block comment
Line 1117:80: E501 line too long (121 > 79 characters)
Line 1118:80: E501 line too long (81 > 79 characters)
Line 1123:9: E266 too many leading '#' for block comment
Line 1138:80: E501 line too long (122 > 79 characters)
Line 1286:80: E501 line too long (82 > 79 characters)
Line 1309:80: E501 line too long (123 > 79 characters)
Line 1314:80: E501 line too long (85 > 79 characters)
Line 1324:80: E501 line too long (85 > 79 characters)
Line 1982:80: E501 line too long (87 > 79 characters)
Line 1998:80: E501 line too long (88 > 79 characters)
Line 2007:80: E501 line too long (106 > 79 characters)
Line 2011:80: E501 line too long (81 > 79 characters)
Line 2015:80: E501 line too long (87 > 79 characters)
Line 2017:5: E266 too many leading '#' for block comment
Line 2022:80: E501 line too long (82 > 79 characters)
Line 2028:80: E501 line too long (87 > 79 characters)
Line 2032:80: E501 line too long (84 > 79 characters)
Line 2035:80: E501 line too long (157 > 79 characters)
Line 2071:80: E501 line too long (81 > 79 characters)
Line 2079:80: E501 line too long (80 > 79 characters)
Line 2084:80: E501 line too long (87 > 79 characters)
Line 2103:80: E501 line too long (113 > 79 characters)
Line 2115:80: E501 line too long (84 > 79 characters)
Line 2121:80: E501 line too long (84 > 79 characters)
Line 2128:13: E266 too many leading '#' for block comment
Line 2128:80: E501 line too long (82 > 79 characters)
Line 2129:13: E266 too many leading '#' for block comment
Line 2131:80: E501 line too long (114 > 79 characters)
Line 2137:13: E266 too many leading '#' for block comment
Line 2142:80: E501 line too long (97 > 79 characters)
Line 2146:80: E501 line too long (80 > 79 characters)
Line 2148:80: E501 line too long (84 > 79 characters)
Line 2160:80: E501 line too long (80 > 79 characters)
Line 2180:80: E501 line too long (82 > 79 characters)
Line 2183:80: E501 line too long (93 > 79 characters)
Line 2193:80: E501 line too long (82 > 79 characters)
Line 2195:80: E501 line too long (91 > 79 characters)
Line 2197:80: E501 line too long (83 > 79 characters)
Line 2202:80: E501 line too long (82 > 79 characters)
Line 2206:80: E501 line too long (82 > 79 characters)
Line 2212:80: E501 line too long (81 > 79 characters)
Line 2217:80: E501 line too long (81 > 79 characters)
Line 2238:80: E501 line too long (82 > 79 characters)
Line 2256:80: E501 line too long (87 > 79 characters)
Line 2263:80: E501 line too long (81 > 79 characters)
Line 2271:80: E501 line too long (80 > 79 characters)
Line 2291:80: E501 line too long (82 > 79 characters)
Line 2294:80: E501 line too long (82 > 79 characters)
Line 2316:80: E501 line too long (88 > 79 characters)
Line 2317:80: E501 line too long (86 > 79 characters)
Line 2324:80: E501 line too long (83 > 79 characters)
Line 2342:80: E501 line too long (86 > 79 characters)
Line 2347:80: E501 line too long (84 > 79 characters)
Line 2350:80: E501 line too long (92 > 79 characters)
Line 2353:80: E501 line too long (88 > 79 characters)
Line 2354:80: E501 line too long (106 > 79 characters)
Line 2355:80: E501 line too long (125 > 79 characters)
Line 2358:80: E501 line too long (120 > 79 characters)
Line 2359:80: E501 line too long (120 > 79 characters)
Line 2362:80: E501 line too long (89 > 79 characters)
Line 2378:80: E501 line too long (103 > 79 characters)
Line 2405:80: E501 line too long (148 > 79 characters)

Line 24:80: E501 line too long (83 > 79 characters)
Line 40:80: E501 line too long (83 > 79 characters)
Line 102:80: E501 line too long (88 > 79 characters)

Line 51:80: E501 line too long (81 > 79 characters)
Line 62:80: E501 line too long (81 > 79 characters)
Line 133:80: E501 line too long (86 > 79 characters)
Line 137:80: E501 line too long (82 > 79 characters)
Line 143:80: E501 line too long (87 > 79 characters)
Line 149:80: E501 line too long (85 > 79 characters)
Line 153:80: E501 line too long (84 > 79 characters)
Line 171:80: E501 line too long (86 > 79 characters)
Line 201:80: E501 line too long (83 > 79 characters)
Line 203:80: E501 line too long (84 > 79 characters)

Comment last updated at 2024-09-25 15:23:32 UTC

The addition of source_keys and failure_keys was not included in the
unit tests so all initializations of Tracebacks failed. I've added
default values for the test class.
* add_task_restart_patterns
* remove_task_restart_patterns
* get_task_restart_patterns
* set_task_restart_patterns_max_retries

Additionally, I added the get_taskhubs method to Neo4jStore since
get_taskhub will only get the taskhub for a single network at a
time. It might make sense to replace the old method with this new one.
@ianmkenney ianmkenney force-pushed the feature/iss-277-restart-policy branch from 1071369 to f03417c Compare October 8, 2024 17:26
Copy link

codecov bot commented Oct 8, 2024

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@dotsdl dotsdl marked this pull request as ready for review October 9, 2024 21:32
@dotsdl dotsdl self-requested a review October 25, 2024 23:01
Copy link
Member

@dotsdl dotsdl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this impressive feature @ianmkenney! I have a few notes, and I've made some modifications where it was obvious to me what to do.

Can you address the notes and fix any broken tests? After that, I think we should be good to merge!


// only proceed for cases where task is not already actioned on hub
// and where the task is either in 'waiting', 'running', or 'error' status
WITH th, an, task
WHERE NOT (th)-[:ACTIONS]->(task)
AND task.status IN ['{TaskStatusEnum.waiting.value}', '{TaskStatusEnum.running.value}', '{TaskStatusEnum.error.value}']
AND task.status IN [$waiting, $running, $error]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer!

@@ -1411,30 +1461,51 @@ def action_tasks(
# so we can properly return `None` if needed
task_map = {str(task): None for task in tasks}

q = f"""
query_safe_task_list = [str(task) for task in tasks if task]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the trailing if task? Unclear under what conditions this would apply.

return ScopedKey.from_str(node["_scoped_key"])

transform_function = _node_to_gufe if return_gufe else _node_to_scoped_key
transform_results = defaultdict(None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is a defaultdict instead of just a dict? Where it gets used below, you can use transform_results.get(...) to safely ask for the value for a given key, and None will be given if there is no matching key. Using a defaultdict doesn't give any obvious behavior advantage here.

Comment on lines +197 to +201
# TODO: should this also compare taskhub scoped keys?
def __eq__(self, other):
if not isinstance(other, self.__class__):
return False
return self.pattern == other.pattern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...where do we use equality for this model? That would inform to me what the most reasonable approach is here.

max_retries=max_retries,
)

# TODO: validation of taskhubs variable, will fail in weird ways if not enforced
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this if it's a likely failure mode?

Comment on lines +112 to +115
try:
dict_trp.pop(":version:")
except KeyError:
raise AssertionError("expected to find :version:")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines +183 to +190
assert tb_dict.pop("__qualname__") == "Tracebacks"
assert tb_dict.pop("__module__") == "alchemiscale.storage.models"

# light test of the version key
try:
tb_dict.pop(":version:")
except KeyError:
raise AssertionError("expected to find :version:")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines +1196 to +1198
applies_count = n4js.execute_query(
query, taskhub_scoped_key=str(taskhub_sk)
).records[0]["applies_count"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like applies_count gets used after this?


@pytest.mark.xfail(raises=NotImplementedError)
def test_task_actioning_applies_relationship(self):
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we intending to fill this in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see we check this behavior in test_action_task.


@pytest.mark.xfail(raises=NotImplementedError)
def test_task_deaction_applies_relationship(self):
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we intending to fill this in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see we check this behavior in test_cancel_task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add user-settable server-side Task restart policy, per-AlchemicalNetwork
3 participants