Implement task restart policies #280

ianmkenney · 2024-07-16T15:25:33Z

closes #277

* Test: test_add_task_restart_policy_patterns
* Test: test_get_task_restart_policy_patterns
* Test: test_remove_task_restart_policy_patterns
* Test: test_clear_task_restart_policy_patterns
* Test: test_task_resolve_restarts

* TaskRestartPattern
* TaskRestartPolicy
* TaskHistory

* Removed TaskRestartPolicy and TaskHistory
* Added Traceback

* TaskRestartPattern: Confirm that the input pattern is a string type and that it is not empty.
* Traceback: Confirm that the input is a list of strings and that none of them are empty.

Similar to `TaskHub`s, the `TaskRestartPattern` needs additional hashed data to uniquely identify it as a Neo4j node (via the gufe key). The unit tests have been updated to reflect this change.

`statestore` methods have been added to modify the database state:

* add_task_restart_patterns
* remove_task_restart_patterns
* get_task_restart_patterns

Tests were added for each method in the integration tests for the statestore.

The `add_task_restart_patterns` method now establishes the APPLIES relationship between each new pattern and all Tasks ACTIONED on the corresponding TaskHub. Added testing for creation of the APPLIES relationship, asserting the number of created connections over multiple TaskHubs and Tasks. Further subdivided the test classes. Additionally added a `set_task_restart_patterns_max_retries` method for updating the max_retries of a TaskRestartPattern.

"Actioning" a Task on a TaskHub with preexisting TaskRestartPatterns now creates the APPLIES relationship between them with a num_retries value of 0. This behavior is tested in the test_action_task function in the statestore tests.

When an actioned Task is canceled and also has an APPLIES relationship with a TaskRestartPattern, APPLIES is removed between the two nodes. Removed org, project, and campaign fields since they are not necessary for the APPLIES relationship.

Setting an actioned Task status to any of the following statuses now removes the APPLIES relationship from attached TaskRestartPatterns:

* complete
* invalid
* deleted

NOTE: tests have not been added for this yet

Confirming that changing the status of an actioned Task to any of the following removes the APPLIES relationship:

* complete
* invalid
* deleted

New statestore method placeholders:

- add_task_traceback
- resolve_task_restarts

The compute api will add a Task Traceback and resolve restarts for returned failed Tasks. When a list of restart patterns is added, restarts are resolved.

* Renamed add_task_traceback to add_protocol_dag_result_ref_traceback
* Added tests for add_protocol_dag_result_ref_traceback

Implemented half of the resolve_task_restarts test

With this decorator, if a transaction isn't passed as a keyword arg, one is automatically created (and closed). This allows a chaining behavior where many method calls share a single transaction object.

* Removed custom tokenization
* Implemented _defaults to allow default tokenization to work

cancel_map has been changed from a defaultdict to a plain dict, and lookups now use the dict.get method, which returns None for missing keys. Additionally added a set of all task/taskhub pairs that is later used to determine what should be canceled. I've also added grouping on taskhubs so the number of calls to cancel_tasks is minimized.
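The grouping step described above could look roughly like the following sketch. It is not the code from this PR: the pair layout `(task, taskhub)`, the helper names, and the `cancel_tasks(tasks, taskhub)` callable are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_taskhub(pairs):
    """Group (task, taskhub) pairs so each taskhub appears only once."""
    grouped = defaultdict(list)
    # Sorting only makes the resulting order deterministic for this example.
    for task, taskhub in sorted(pairs):
        grouped[taskhub].append(task)
    return dict(grouped)


def cancel_all(pairs, cancel_tasks):
    """Issue one cancel call per taskhub instead of one per pair.

    `cancel_tasks` is a hypothetical callable taking (tasks, taskhub).
    """
    for taskhub, tasks in group_by_taskhub(pairs).items():
        cancel_tasks(tasks, taskhub)


# Record the calls instead of touching a database.
calls = []
to_cancel = {("task-1", "hub-a"), ("task-2", "hub-a"), ("task-3", "hub-b")}
cancel_all(to_cancel, lambda tasks, taskhub: calls.append((taskhub, tasks)))
```

Three pairs spread over two taskhubs result in two calls rather than three.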

…olicy_resolve_restarts Restart policy: resolve restarts

pep8speaks · 2024-09-19T22:04:41Z

Hello @ianmkenney! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file alchemiscale/interface/api.py:

Line 962:80: E501 line too long (81 > 79 characters)
Line 963:80: E501 line too long (83 > 79 characters)
Line 975:80: E501 line too long (81 > 79 characters)
Line 1004:80: E501 line too long (81 > 79 characters)

In the file alchemiscale/storage/models.py:

Line 153:80: E501 line too long (92 > 79 characters)
Line 157:80: E501 line too long (94 > 79 characters)
Line 165:80: E501 line too long (87 > 79 characters)
Line 174:80: E501 line too long (81 > 79 characters)
Line 209:80: E501 line too long (84 > 79 characters)
Line 217:80: E501 line too long (90 > 79 characters)
Line 218:80: E501 line too long (85 > 79 characters)

In the file alchemiscale/storage/statestore.py:

Line 1261:80: E501 line too long (87 > 79 characters)
Line 1266:80: E501 line too long (92 > 79 characters)
Line 1280:80: E501 line too long (82 > 79 characters)
Line 1472:80: E501 line too long (97 > 79 characters)
Line 1475:80: E501 line too long (117 > 79 characters)
Line 1490:80: E501 line too long (94 > 79 characters)
Line 1667:80: E501 line too long (92 > 79 characters)
Line 1668:80: E501 line too long (123 > 79 characters)
Line 1674:80: E501 line too long (87 > 79 characters)
Line 1681:80: E501 line too long (81 > 79 characters)
Line 2488:80: E501 line too long (98 > 79 characters)
Line 2504:80: E501 line too long (82 > 79 characters)
Line 2856:5: E266 too many leading '#' for block comment
Line 2862:80: E501 line too long (105 > 79 characters)
Line 2878:9: E266 too many leading '#' for block comment
Line 2889:80: E501 line too long (98 > 79 characters)
Line 2915:80: E501 line too long (82 > 79 characters)
Line 2917:80: E501 line too long (88 > 79 characters)
Line 2942:80: E501 line too long (83 > 79 characters)
Line 2948:80: E501 line too long (84 > 79 characters)
Line 2952:80: E501 line too long (99 > 79 characters)
Line 2957:80: E501 line too long (81 > 79 characters)
Line 2968:80: E501 line too long (99 > 79 characters)
Line 2980:80: E501 line too long (84 > 79 characters)
Line 2987:80: E501 line too long (105 > 79 characters)
Line 3009:80: E501 line too long (87 > 79 characters)
Line 3012:80: E501 line too long (84 > 79 characters)
Line 3016:80: E501 line too long (145 > 79 characters)
Line 3019:80: E501 line too long (117 > 79 characters)
Line 3037:80: E501 line too long (82 > 79 characters)
Line 3043:80: E501 line too long (87 > 79 characters)
Line 3059:80: E501 line too long (125 > 79 characters)
Line 3075:80: E501 line too long (81 > 79 characters)
Line 3080:80: E501 line too long (82 > 79 characters)
Line 3086:80: E501 line too long (85 > 79 characters)
Line 3087:80: E501 line too long (140 > 79 characters)
Line 3099:80: E501 line too long (80 > 79 characters)
Line 3107:80: E501 line too long (145 > 79 characters)

In the file alchemiscale/tests/integration/conftest.py:

Line 192:80: E501 line too long (83 > 79 characters)
Line 208:80: E501 line too long (85 > 79 characters)
Line 209:80: E501 line too long (83 > 79 characters)

In the file alchemiscale/tests/integration/storage/test_statestore.py:

Line 1110:80: E501 line too long (89 > 79 characters)
Line 1112:9: E266 too many leading '#' for block comment
Line 1112:80: E501 line too long (91 > 79 characters)
Line 1113:9: E266 too many leading '#' for block comment
Line 1117:80: E501 line too long (121 > 79 characters)
Line 1118:80: E501 line too long (81 > 79 characters)
Line 1123:9: E266 too many leading '#' for block comment
Line 1138:80: E501 line too long (122 > 79 characters)
Line 1286:80: E501 line too long (82 > 79 characters)
Line 1309:80: E501 line too long (123 > 79 characters)
Line 1314:80: E501 line too long (85 > 79 characters)
Line 1324:80: E501 line too long (85 > 79 characters)
Line 1982:80: E501 line too long (87 > 79 characters)
Line 1998:80: E501 line too long (88 > 79 characters)
Line 2007:80: E501 line too long (106 > 79 characters)
Line 2011:80: E501 line too long (81 > 79 characters)
Line 2015:80: E501 line too long (87 > 79 characters)
Line 2017:5: E266 too many leading '#' for block comment
Line 2022:80: E501 line too long (82 > 79 characters)
Line 2028:80: E501 line too long (87 > 79 characters)
Line 2032:80: E501 line too long (84 > 79 characters)
Line 2035:80: E501 line too long (157 > 79 characters)
Line 2071:80: E501 line too long (81 > 79 characters)
Line 2079:80: E501 line too long (80 > 79 characters)
Line 2084:80: E501 line too long (87 > 79 characters)
Line 2103:80: E501 line too long (113 > 79 characters)
Line 2115:80: E501 line too long (84 > 79 characters)
Line 2121:80: E501 line too long (84 > 79 characters)
Line 2128:13: E266 too many leading '#' for block comment
Line 2128:80: E501 line too long (82 > 79 characters)
Line 2129:13: E266 too many leading '#' for block comment
Line 2131:80: E501 line too long (114 > 79 characters)
Line 2137:13: E266 too many leading '#' for block comment
Line 2142:80: E501 line too long (97 > 79 characters)
Line 2146:80: E501 line too long (80 > 79 characters)
Line 2148:80: E501 line too long (84 > 79 characters)
Line 2160:80: E501 line too long (80 > 79 characters)
Line 2180:80: E501 line too long (82 > 79 characters)
Line 2183:80: E501 line too long (93 > 79 characters)
Line 2193:80: E501 line too long (82 > 79 characters)
Line 2195:80: E501 line too long (91 > 79 characters)
Line 2197:80: E501 line too long (83 > 79 characters)
Line 2202:80: E501 line too long (82 > 79 characters)
Line 2206:80: E501 line too long (82 > 79 characters)
Line 2212:80: E501 line too long (81 > 79 characters)
Line 2217:80: E501 line too long (81 > 79 characters)
Line 2238:80: E501 line too long (82 > 79 characters)
Line 2256:80: E501 line too long (87 > 79 characters)
Line 2263:80: E501 line too long (81 > 79 characters)
Line 2271:80: E501 line too long (80 > 79 characters)
Line 2291:80: E501 line too long (82 > 79 characters)
Line 2294:80: E501 line too long (82 > 79 characters)
Line 2316:80: E501 line too long (88 > 79 characters)
Line 2317:80: E501 line too long (86 > 79 characters)
Line 2324:80: E501 line too long (83 > 79 characters)
Line 2342:80: E501 line too long (86 > 79 characters)
Line 2347:80: E501 line too long (84 > 79 characters)
Line 2350:80: E501 line too long (92 > 79 characters)
Line 2353:80: E501 line too long (88 > 79 characters)
Line 2354:80: E501 line too long (106 > 79 characters)
Line 2355:80: E501 line too long (125 > 79 characters)
Line 2358:80: E501 line too long (120 > 79 characters)
Line 2359:80: E501 line too long (120 > 79 characters)
Line 2362:80: E501 line too long (89 > 79 characters)
Line 2378:80: E501 line too long (103 > 79 characters)
Line 2405:80: E501 line too long (148 > 79 characters)

In the file alchemiscale/tests/integration/storage/utils.py:

Line 24:80: E501 line too long (83 > 79 characters)
Line 40:80: E501 line too long (83 > 79 characters)
Line 102:80: E501 line too long (88 > 79 characters)

In the file alchemiscale/tests/unit/test_storage_models.py:

Line 51:80: E501 line too long (81 > 79 characters)
Line 62:80: E501 line too long (81 > 79 characters)
Line 133:80: E501 line too long (86 > 79 characters)
Line 137:80: E501 line too long (82 > 79 characters)
Line 143:80: E501 line too long (87 > 79 characters)
Line 149:80: E501 line too long (85 > 79 characters)
Line 153:80: E501 line too long (84 > 79 characters)
Line 171:80: E501 line too long (86 > 79 characters)
Line 201:80: E501 line too long (83 > 79 characters)
Line 203:80: E501 line too long (84 > 79 characters)

Comment last updated at 2024-09-25 15:23:32 UTC

The addition of source_keys and failure_keys was not included in the unit tests so all initializations of Tracebacks failed. I've added default values for the test class.

* add_task_restart_patterns
* remove_task_restart_patterns
* get_task_restart_patterns
* set_task_restart_patterns_max_retries

Additionally, I added the get_taskhubs method to Neo4jStore since get_taskhub will only get the taskhub for a single network at a time. It might make sense to replace the old method with this new one.

codecov · 2024-10-08T17:44:49Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

…t pattern endpoints

dotsdl

Thanks for this impressive feature @ianmkenney! I have a few notes, and I've made some modifications where it was obvious to me what to do.

Can you address the notes and fix any broken tests? After that, I think we should be good to merge!

dotsdl · 2024-10-26T01:41:56Z

alchemiscale/storage/statestore.py


        // only proceed for cases where task is not already actioned on hub
        // and where the task is either in 'waiting', 'running', or 'error' status
        WITH th, an, task
        WHERE NOT (th)-[:ACTIONS]->(task)
-          AND task.status IN ['{TaskStatusEnum.waiting.value}', '{TaskStatusEnum.running.value}', '{TaskStatusEnum.error.value}']
+          AND task.status IN [$waiting, $running, $error]


Much nicer!
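As a rough illustration of the change in this hunk, the status values can be supplied as query parameters instead of being interpolated into the Cypher text with an f-string. The `TaskStatusEnum` below is a stand-in with only the three members named in the diff, and the fragment is copied from the hunk rather than being a complete query.

```python
import re
from enum import Enum


class TaskStatusEnum(Enum):
    # Stand-in for the project's enum; the values are assumptions.
    waiting = "waiting"
    running = "running"
    error = "error"


# Fragment from the diff; it is not a complete Cypher query on its own.
QUERY_FRAGMENT = """
WITH th, an, task
WHERE NOT (th)-[:ACTIONS]->(task)
  AND task.status IN [$waiting, $running, $error]
"""

# Values travel separately from the query text, keyed by placeholder name.
params = {
    "waiting": TaskStatusEnum.waiting.value,
    "running": TaskStatusEnum.running.value,
    "error": TaskStatusEnum.error.value,
}

# Sanity check: every $placeholder in the text has a matching parameter.
placeholders = set(re.findall(r"\$(\w+)", QUERY_FRAGMENT))
missing = placeholders - set(params)
```

The parameters would then be passed alongside the query text, in the same keyword style as the `n4js.execute_query(query, taskhub_scoped_key=...)` call that appears in the tests of this PR.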

dotsdl · 2024-10-26T01:53:07Z

alchemiscale/storage/statestore.py

@@ -1411,30 +1461,51 @@ def action_tasks(
        # so we can properly return `None` if needed
        task_map = {str(task): None for task in tasks}

-        q = f"""
+        query_safe_task_list = [str(task) for task in tasks if task]


Why the trailing `if task`? Unclear under what conditions this would apply.
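To make the question concrete, here is a small sketch of what the trailing filter does. Plain strings stand in for scoped keys, and the `None` entry is hypothetical; the diff does not show whether a falsy entry can actually reach this point.

```python
# Hypothetical input; the None entry is only here to show the filter's effect.
tasks = ["Task-abc", None, "Task-def"]

# As in the diff: every entry becomes a key, including str(None) == "None".
task_map = {str(task): None for task in tasks}

# The trailing `if task` drops falsy entries before they reach the query.
query_safe_task_list = [str(task) for task in tasks if task]
```

If `tasks` can never contain a falsy entry, the filter is a no-op and the two collections hold the same keys.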

dotsdl · 2024-12-03T06:12:13Z

alchemiscale/storage/statestore.py

+            return ScopedKey.from_str(node["_scoped_key"])
+
+        transform_function = _node_to_gufe if return_gufe else _node_to_scoped_key
+        transform_results = defaultdict(None)


Is there a reason this is a defaultdict instead of just a dict? Where it gets used below, you can use transform_results.get(...) to safely ask for the value for a given key, and None will be given if there is no matching key. Using a defaultdict doesn't give any obvious behavior advantage here.
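The point can be checked with a short standalone sketch: a `defaultdict` created with `None` as its `default_factory` has no factory at all, so indexing a missing key raises `KeyError` just as a plain dict would. The variable names are illustrative only.

```python
from collections import defaultdict

with_none_factory = defaultdict(None)  # default_factory is None
plain = {}

# Indexing a missing key raises KeyError; no default is created.
try:
    with_none_factory["missing"]
    raised = False
except KeyError:
    raised = True

# dict.get returns None for a missing key and does not insert it.
looked_up = plain.get("missing")
```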

dotsdl · 2025-01-03T04:25:39Z

alchemiscale/storage/models.py

+    # TODO: should this also compare taskhub scoped keys?
+    def __eq__(self, other):
+        if not isinstance(other, self.__class__):
+            return False
+        return self.pattern == other.pattern


Hmmm...where do we use equality for this model? That would tell me what the most reasonable approach is here.
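The trade-off behind the TODO can be sketched with two standalone classes. They are not the real model: the attribute name `taskhub_scoped_key` and both class names are assumptions for illustration.

```python
class _PatternBase:
    def __init__(self, pattern, taskhub_scoped_key):
        self.pattern = pattern
        self.taskhub_scoped_key = taskhub_scoped_key


class PatternOnlyEq(_PatternBase):
    """Equality as in the diff: only the pattern string is compared."""

    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return False
        return self.pattern == other.pattern


class PatternAndHubEq(_PatternBase):
    """Alternative raised by the TODO: also compare the taskhub key."""

    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return False
        return (self.pattern, self.taskhub_scoped_key) == (
            other.pattern,
            other.taskhub_scoped_key,
        )


# Same pattern string attached to two different taskhubs.
same_pattern_only = PatternOnlyEq("Error.*", "hub-a") == PatternOnlyEq("Error.*", "hub-b")
same_pattern_and_hub = PatternAndHubEq("Error.*", "hub-a") == PatternAndHubEq("Error.*", "hub-b")
```

Under the first definition the two objects compare equal even though they belong to different taskhubs. Which behavior is appropriate depends on where equality is used, which is the question raised in the comment.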

dotsdl · 2025-01-03T04:37:07Z

alchemiscale/storage/statestore.py

+            max_retries=max_retries,
+        )
+
+    # TODO: validation of taskhubs variable, will fail in weird ways if not enforced


Can you add this if it's a likely failure mode?

dotsdl · 2025-01-03T19:19:41Z

alchemiscale/tests/unit/test_storage_models.py

+        try:
+            dict_trp.pop(":version:")
+        except KeyError:
+            raise AssertionError("expected to find :version:")


dotsdl · 2025-01-03T19:23:56Z

alchemiscale/tests/unit/test_storage_models.py

+        assert tb_dict.pop("__qualname__") == "Tracebacks"
+        assert tb_dict.pop("__module__") == "alchemiscale.storage.models"
+
+        # light test of the version key
+        try:
+            tb_dict.pop(":version:")
+        except KeyError:
+            raise AssertionError("expected to find :version:")


dotsdl · 2025-01-16T05:48:29Z

alchemiscale/tests/integration/storage/test_statestore.py

+        applies_count = n4js.execute_query(
+            query, taskhub_scoped_key=str(taskhub_sk)
+        ).records[0]["applies_count"]


Doesn't look like applies_count gets used after this?

dotsdl · 2025-01-17T01:17:01Z

alchemiscale/tests/integration/storage/test_statestore.py

+
+        @pytest.mark.xfail(raises=NotImplementedError)
+        def test_task_actioning_applies_relationship(self):
+            raise NotImplementedError


Are we intending to fill this in?

Ah, I see we check this behavior in test_action_task.

dotsdl · 2025-01-17T01:17:10Z

alchemiscale/tests/integration/storage/test_statestore.py

+
+        @pytest.mark.xfail(raises=NotImplementedError)
+        def test_task_deaction_applies_relationship(self):
+            raise NotImplementedError


Are we intending to fill this in?

Ah, I see we check this behavior in test_cancel_task.

ianmkenney added 5 commits July 16, 2024 08:15

Added placeholder tests for proposed methods

7f752b3

* Test: test_add_task_restart_policy_patterns
* Test: test_get_task_restart_policy_patterns
* Test: test_remove_task_restart_policy_patterns
* Test: test_clear_task_restart_policy_patterns
* Test: test_task_resolve_restarts

Added models for new node types

dd8f0e9

* TaskRestartPattern
* TaskRestartPolicy
* TaskHistory

Updated new GufeTokenizable models in statestore

da17e45

* Removed TaskRestartPolicy and TaskHistory
* Added Traceback

Added placeholder unit tests for new models

b7f63d4

Added validation and unit tests for storage models

6a167f1

* TaskRestartPattern: Confirm that the input pattern is a string type and that it is not empty.
* Traceback: Confirm that the input is a list of strings and that none of them are empty.
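The two checks described in this commit could be written along these lines. This is a minimal sketch, not the models' actual code: the function names and the choice of `ValueError` are assumptions.

```python
def validate_pattern(pattern):
    """Accept only a non-empty string (hypothetical helper)."""
    if not isinstance(pattern, str) or pattern == "":
        raise ValueError("`pattern` must be a non-empty string")
    return pattern


def validate_tracebacks(tracebacks):
    """Accept only a list whose entries are all non-empty strings."""
    if not isinstance(tracebacks, list):
        raise ValueError("`tracebacks` must be a list of strings")
    if not all(isinstance(tb, str) and tb != "" for tb in tracebacks):
        raise ValueError("`tracebacks` must contain only non-empty strings")
    return tracebacks


validated_pattern = validate_pattern("Error.*")
validated_tracebacks = validate_tracebacks(["Traceback (most recent call last): ..."])
```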

ianmkenney force-pushed the feature/iss-277-restart-policy branch from 7e82f54 to 6a167f1 on July 18, 2024 20:46

ianmkenney and others added 24 commits July 22, 2024 14:24

Added taskhub_sk to TaskRestartPattern

a10e235

Similar to `TaskHub`s, the `TaskRestartPattern` needs additional hashed data to uniquely identify it as a Neo4j node (via the gufe key). The unit tests have been updated to reflect this change.

Added statestore methods for restart patterns

b99d8ef

`statestore` methods have been added to modify the database state:

* add_task_restart_patterns
* remove_task_restart_patterns
* get_task_restart_patterns

Tests were added for each method in the integration tests for the statestore.

Establish APPLIES when actioning a Task

988155f

"Actioning" a Task on a TaskHub with preexisting TaskRestartPatterns now creates the APPLIES relationship between them with a num_retries value of 0. This behavior is tested in the test_action_task function in the statestore tests.

Task status changes affect APPLIES relationship

510ae66

Setting an actioned Task status to any of the following statuses now removes the APPLIES relationship from attached TaskRestartPatterns:

* complete
* invalid
* deleted

NOTE: tests have not been added for this yet

Tests for Task status change on APPLIES

2310fd5

Confirming that changing the status of an actioned Task to any of the following removes the APPLIES relationship:

* complete
* invalid
* deleted

Added method (unimplemented) calls for restarts

ea2851f

New statestore method placeholders:

- add_task_traceback
- resolve_task_restarts

The compute api will add a Task Traceback and resolve restarts for returned failed Tasks. When a list of restart patterns is added, restarts are resolved.

Implemented add_protocol_dag_result_ref_traceback

8e011be

* Renamed add_task_traceback to add_protocol_dag_result_ref_traceback
* Added tests for add_protocol_dag_result_ref_traceback

Started implementation of restart resolution

4f07dde

Tracebacks now include key data from its source units

78c4551

Built out custom fixture for testing restart policies

7acc003

Implemented half of the resolve_task_restarts test

Added the chainable decorator to Neo4jStore

03d9fa1

With this decorator, if a transaction isn't passed as a keyword arg, one is automatically created (and closed). This allows a chaining behavior where many method calls share a single transaction object.
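A decorator with the behavior described in this commit might be sketched as follows. It is an illustration under stated assumptions rather than the `Neo4jStore` implementation: `FakeTransaction`, `Store`, `begin_transaction`, and the keyword name `tx` are all hypothetical, and `update_wrapper` is used because a later commit in this PR mentions it.

```python
from functools import update_wrapper


class FakeTransaction:
    """Hypothetical stand-in for a database transaction."""

    def __init__(self):
        self.closed = False
        self.log = []

    def run(self, statement):
        self.log.append(statement)

    def close(self):
        self.closed = True


def chainable(func):
    """Create (and close) a transaction when the caller passes no `tx`."""

    def inner(self, *args, **kwargs):
        if kwargs.get("tx") is not None:
            # The caller owns the transaction: reuse it and leave it open.
            return func(self, *args, **kwargs)
        tx = self.begin_transaction()
        try:
            kwargs["tx"] = tx
            return func(self, *args, **kwargs)
        finally:
            tx.close()

    # Copy the name and docstring of the wrapped method onto the wrapper.
    update_wrapper(inner, func)
    return inner


class Store:
    def __init__(self):
        self.transactions = []

    def begin_transaction(self):
        tx = FakeTransaction()
        self.transactions.append(tx)
        return tx

    @chainable
    def add_pattern(self, pattern, tx=None):
        tx.run(f"add {pattern}")

    @chainable
    def add_patterns(self, patterns, tx=None):
        # Nested calls pass the same transaction along, so they chain.
        for pattern in patterns:
            self.add_pattern(pattern, tx=tx)


store = Store()
store.add_patterns(["a", "b"])
```

In this sketch the outer call opens one transaction, both inner calls reuse it, and it is closed once when the outer call returns.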

Resolve task restarts now sets all remaining tasks to waiting

aad97e3

Corrected resolution logic

a655dc7

Extracted complexity out of test_resolve_task_restarts

5bb6700

resolve restart of tasks with no tracebacks

fe4b87b

Replaced many maps with a for loop

8a6f980

Small changes from review

93eb5f5

Chainable now uses the update_wrapper function

0900f39

Updated Traceback class

c8ddafc

* Removed custom tokenization
* Implemented _defaults to allow default tokenization to work

Renamed Traceback to Tracebacks

2a59499

Fixed query for deleting the APPLIES relationship

645b2e4

dotsdl mentioned this pull request Sep 13, 2024

Restart policy: resolve restarts #286

Merged

Merge pull request #286 from OpenFreeEnergy/feature/iss-277-restart-p…

cf0e961

…olicy_resolve_restarts Restart policy: resolve restarts

ianmkenney added 7 commits September 24, 2024 15:54

Fix for Tracebacks unit tests

6066796

The addition of source_keys and failure_keys was not included in the unit tests so all initializations of Tracebacks failed. I've added default values for the test class.

Added untested client method for task restart policies

cea16bc

Added testing for client methods dealing with restart policies

a4da776

get_taskhub calls get_taskhubs

fdc25a7

Updated docstrings

51194ff

Merge branch 'main' into feature/iss-277-restart-policy

f03417c

ianmkenney force-pushed the feature/iss-277-restart-policy branch from 1071369 to f03417c on October 8, 2024 17:26

dotsdl marked this pull request as ready for review October 9, 2024 21:32

ianmkenney and others added 4 commits October 21, 2024 16:09

Added docstrings to client methods

977c896

Added Task restart patterns to user guide

2d2d8f6

Link to python classes and methods in restart pattern section

d7dcd5c

Merge branch 'main' into feature/iss-277-restart-policy

006e689

dotsdl self-requested a review October 25, 2024 23:01

dotsdl added 10 commits December 2, 2024 22:00

Merge branch 'main' into feature/iss-277-restart-policy

d331cc4

statestore edits from review

c468b43

Tracebacks model doc fix

bb5dbcd

Consistency fix to TaskRestartPattern._defaults

3776c7a

Docstring updates to client; token validation to interface api restar…

b4865fd

…t pattern endpoints

Merge branch 'main' into feature/iss-277-restart-policy

555ba62

Review edits

893a790

Edits from review

2787527

Black!

7ba1b4f

User guide fixes, consistency edits

0220e00

dotsdl requested changes Jan 20, 2025

Cypher fix

ae584eb
