-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DO NOT MERGE] Refactor Engine and Runner #235
Conversation
Just back in town. Thank you for working on this PR. |
why all the tests are keeping failing... |
What should we do to verify the branch before merging it into the main branch? @tylergu |
@Spedoske This PR has merge conflicts due to the Monkey patch. Can you fix it? |
It's discussed in the Slack channel. The current plan is, Do speak up if you have better ideas. |
@Spedoske , since this AE thing is going on and we are running on a tight deadline, we want to build the AE branch on top of the current main head for stability. I wish to hold this PR until the AE deadline (7/28), because we may have fix patches to apply on top of the current main head for the AE. |
Yes, I agree to the schedule. In addition, as the pull request is no longer a draft, I am now running the kafka operator #230, which should be finished in 36 hours. |
Sounds good. It's a big task if we consider inspection. But I'm fine if the goal is only to port.
Take your time! |
039832c
to
ca4d04c
Compare
@tianyin We have met and discussed possible ways to merge this PR, but did not reach a concrete plan. The challenge here is that the commit history is not friendly to split the PR. The different features and refactors are intermingled together and depends on each other. So it would be a lot of effort for just splitting the PR. I came up with two potential plans:
|
I can't agree more! I really think we need many small PRs rather than a humongous PR which nobody knows how to merge. In industry, those PRs will never be merged due to the risk. #1 is really the key but I don't know how we can ensure the working condition -- that's one problem we will be facing again and again. My suggestion is to abandon this PR and cherrypick some code to open new small PRs. Let me know your thoughts @Spedoske |
Migrated to issue #309 |
What is not fully implemented in the pull request
checker.py
as the entrypointWhat is included in the pull request.
Run Result
In the main branch, the
CheckerSet
(a set of checkers) will return aRunResult
which contains the check result. TheRunResult
will result in different engine's behaviors, for example:acto/acto/common.py
Lines 101 to 116 in e4cd473
I find we must write some code for each result (e.g.
PassResult
,InvalidInputResult
) and each result from a oracle (e.g.crash_result
,health_result
), which is a pain when implementing a new checker.I also find it's hard to attach runtime information on the result. For example, we cannot attach any fields besides
msg
.acto/acto/common.py
Lines 282 to 291 in e4cd473
Currently, I the
CheckerSet
(a set of checkers) will return a list ofOracleResult
. It has methodsmeans_ok(self)->bool
,means_revert(self)->bool
,means_flush(self)->bool
andmeans_terminate(self)->bool
to decide what state it is in.acto/acto/checker/checker.py
Lines 33 to 59 in 2aa1977
OracleControlFlow
is howOracleResult
will result in the next step of the trial.acto/acto/checker/checker.py
Lines 12 to 16 in 2aa1977
After the refactoring, the logic of runner regarding to
OracleResult
become very clear.acto/acto/runner/trial.py
Lines 138 to 171 in 2aa1977
Recovery test
Move the code in
recovery->check_state_equality
into a checker.acto/acto/engine.py
Lines 562 to 572 in e4cd473
Snapshot
Add more fields in snapshot.
Deploy
Use health checker to check the cluster is converged.
Engine
Adapting engine.py to the RayRunner
Decouple task assignment over the workers in
TestPlan
.acto/acto/engine_new.py
Lines 210 to 235 in 2aa1977
Kubernetes cluster information collector
Move all code collecting Kubenetes cluster information to one class
RayRunner
redesign a cleaner runner interface
use ray to implement runner, allowing run trial on multiple machines.
The core run loop is very clear.
acto/acto/runner/ray_runner.py
Lines 27 to 41 in 2aa1977
Snapshot collector
Snapshot collector is a interface used in the ray runner. The definition is
Callable[['Runner', Trial, dict], Snapshot]
, which means it receives aRay Runner
, aTrial
and a system input, and it will produce aSnapshot
.acto/acto/runner/snapshot_collector.py
Lines 42 to 62 in 2aa1977
Reimplement wait for system converge using
asyncio
, to make the function standalone(not in a class)acto/acto/runner/snapshot_collector.py
Lines 74 to 101 in 2aa1977
Trial
TrialInputIterator
is a class that handles thetest case revert
andtest case setup
.Trial
is a class that handles the internal state of a trial and interact with a runnerTrial
to get system inputSnapshot
fromsnapshot_collector
to the Trial.