[DO NOT MERGE] Refactor Engine and Runner #235

Spedoske · 2023-07-11T21:10:15Z

What is not fully implemented in the pull request

implement image cache
test multiple runners
migrate post diff test
migrate reproduce
migrate checker.py as the entrypoint
perform a full run, including post diff test and reproduce
Add documentation about how to create a custom oracle

What is included in the pull request.

Run Result

On the main branch, the CheckerSet (a set of checkers) returns a RunResult that contains the check results. The engine then behaves differently depending on the RunResult, for example:

acto/acto/common.py

Lines 101 to 116 in e4cd473

    
    def is_pass(self) -> bool:
        if not isinstance(self.crash_result, PassResult) and self.crash_result is not None:
            return False
        elif not isinstance(self.health_result, PassResult) and self.health_result is not None:
            return False
        elif not isinstance(self.custom_result, PassResult) and self.custom_result is not None:
            return False

        if isinstance(self.state_result, PassResult):
            return True
        else:
            if actoConfig.alarms.invalid_input and isinstance(
                    self.log_result, InvalidInputResult):
                return True
            else:
                return False

I find that we must write code for every result type (e.g. PassResult, InvalidInputResult) and for every result coming from an oracle (e.g. crash_result, health_result), which is a pain when implementing a new checker.
I also find it hard to attach runtime information to a result. For example, we cannot attach any fields besides msg.

acto/acto/common.py

Lines 282 to 291 in e4cd473

    
    class UnhealthyResult(ErrorResult):

        def __init__(self, oracle: Oracle, msg: str) -> None:
            super().__init__(oracle, msg)

        def to_dict(self):
            return {'oracle': self.oracle, 'message': self.message}

        def from_dict(d: dict):
            return UnhealthyResult(d['oracle'], d['message'])

In this PR, the CheckerSet (a set of checkers) returns a list of OracleResult. Each OracleResult has the methods means_ok(self)->bool, means_revert(self)->bool, means_flush(self)->bool and means_terminate(self)->bool, which decide which control-flow state it represents.

acto/acto/checker/checker.py

Lines 33 to 59 in 2aa1977

    
    @dataclass
    class OracleResult(Exception):
        message: str = OracleControlFlow.ok
        exception: Optional[Exception] = None
        emit_by: str = '<None>'

        def means(self, control_flow: OracleControlFlow):
            method_name = f'means_{control_flow.name}'
            return getattr(self, method_name)()

        def set_emitter(self, oracle: 'Checker'):
            self.emit_by = oracle.name

        @means_first(lambda self: not self.means_terminate())
        def means_ok(self):
            return self.message == OracleControlFlow.ok

        @staticmethod
        def means_flush():
            return False

        @means_first(lambda self: not self.means_terminate())
        def means_revert(self):
            return self.message != OracleControlFlow.ok

        def means_terminate(self):
            return self.exception is not None
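
To illustrate how runtime information could be attached under this design, here is a minimal, self-contained sketch. It re-declares simplified stand-ins for OracleControlFlow and OracleResult; the means_first decorator is not shown in this PR description, so it is replaced here by explicit checks under the assumption that it evaluates the precondition first and returns False when it fails. UnhealthyPodResult and its pods field are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class OracleControlFlow(str, Enum):
    # Simplified stand-in for the StrEnum in checker.py.
    ok = 'ok'
    flush = 'flush'
    revert = 'revert'
    terminate = 'terminate'


@dataclass
class OracleResult(Exception):
    # Simplified stand-in: means_first is replaced by explicit checks (assumption).
    message: str = OracleControlFlow.ok.value
    exception: Optional[Exception] = None
    emit_by: str = '<None>'

    def means(self, control_flow: OracleControlFlow) -> bool:
        # Dispatch to means_ok / means_flush / means_revert / means_terminate.
        return getattr(self, f'means_{control_flow.name}')()

    def means_terminate(self) -> bool:
        return self.exception is not None

    def means_ok(self) -> bool:
        return not self.means_terminate() and self.message == OracleControlFlow.ok.value

    def means_flush(self) -> bool:
        return False

    def means_revert(self) -> bool:
        return not self.means_terminate() and self.message != OracleControlFlow.ok.value


@dataclass
class UnhealthyPodResult(OracleResult):
    # Hypothetical subclass: extra runtime information is just another dataclass field.
    pods: List[str] = field(default_factory=list)


result = UnhealthyPodResult(message='pod not ready', pods=['kafka-0'])
```

In this sketch a new result type needs no changes to the runner: the inherited means_* methods already classify it as a revert, and the extra pods field travels with the result.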

OracleControlFlow describes how an OracleResult determines the next step of the trial.

acto/acto/checker/checker.py

Lines 12 to 16 in 2aa1977

    
    class OracleControlFlow(StrEnum):
        ok = auto()
        flush = auto()
        revert = auto()
        terminate = auto()

After the refactoring, the runner's logic for handling OracleResult becomes very clear.

acto/acto/runner/trial.py

Lines 138 to 171 in 2aa1977

    
    # If all the checkers return ok, we can move on to the next generation
    if all(map(lambda x: x.means(OracleControlFlow.ok), result)):
        self.state = 'normal'
        return

    # If any of the checkers return terminate, we terminate the trial
    if any(map(lambda x: x.means(OracleControlFlow.terminate), result)):
        self.state = 'terminated'
        return

    if any(map(lambda x: x.means(OracleControlFlow.revert), result)):
        # If the result indicates our input is invalid, we need to first run revert to
        # go back to previous system state, then construct a new input without the
        # responsible testcase and re-apply
        if self.state == 'recovering':
            # if we enter a recovering state twice, we abort the trial
            # because it means we cannot recover from the failure
            self.state = 'terminated'
        else:
            # if current state is normal, we revert the last applied test case
            self.state = 'recovering'
            snapshot.set_snapshot_before_applied_input(prev_snapshot)
            self.next_input.revert()
        return

    if any(map(lambda x: x.means(OracleControlFlow.flush), result)):
        if self.state == 'recovering':
            # TODO: what if we encounter a flush when we are recovering?
            self.state = 'terminated'
        else:
            self.next_input.flush()
        return

    assert False, "unreachable"
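
The branches above amount to a small state machine. As a rough, self-contained sketch of just the state transitions: the function name next_state and the plain-string verdicts are assumptions made for illustration, whereas the real code calls means(...) on OracleResult objects and also triggers revert() or flush() on the input iterator.

```python
from typing import Iterable


def next_state(state: str, verdicts: Iterable[str]) -> str:
    """Return the next trial state given one verdict per checker.

    Each verdict is one of 'ok', 'terminate', 'revert' or 'flush'.
    """
    verdicts = list(verdicts)
    if all(v == 'ok' for v in verdicts):
        return 'normal'
    if any(v == 'terminate' for v in verdicts):
        return 'terminated'
    if any(v == 'revert' for v in verdicts):
        # A second failure while already recovering aborts the trial.
        return 'terminated' if state == 'recovering' else 'recovering'
    if any(v == 'flush' for v in verdicts):
        # A flush keeps the current state unless we are already recovering.
        return 'terminated' if state == 'recovering' else state
    raise AssertionError('unreachable')
```

The order of the checks matters: terminate takes priority over revert, and revert over flush, mirroring the order of the if statements in the excerpt.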

Recovery test

Move the code in recovery->check_state_equality into a checker.

acto/acto/engine.py

Lines 562 to 572 in e4cd473

    
    def run_recovery(self, runner: Runner, checker: CheckerSet, generation: int) -> OracleResult:
        '''Runs the recovery test case after an error is reported'''
        logger = get_thread_logger(with_prefix=True)
        RECOVERY_SNAPSHOT = -2  # the immediate snapshot before the error

        logger.debug('Running recovery')
        recovery_input = self.snapshots[RECOVERY_SNAPSHOT].input
        snapshot, err = runner.run(recovery_input, generation=-1)
        result = check_state_equality(snapshot, self.snapshots[RECOVERY_SNAPSHOT])

        return result

Snapshot

Add more fields in snapshot.

Deploy

Use the health checker to check that the cluster has converged.

Engine

Adapting engine.py to the RayRunner
Decouple task assignment over the workers in TestPlan.

acto/acto/engine_new.py

Lines 210 to 235 in 2aa1977

    
    def run_test_plan(test_case_list: List[Tuple[List[str], TestCase]]):
        active_runner_count = 0
        while active_runner_count != 0 and len(test_case_list) != 0:
            def task(runner: Runner, test_cases: List[Tuple[List[str], TestCase]]) -> Trial:
                collector = with_context(CollectorContext(
                    namespace=self.context['namespace'],
                    crd_meta_info=self.context['crd'],
                ), snapshot_collector)

                collector = self.deploy.chain_with(drop_first_parameter(collector))
                assert isinstance(self.input_model.get_root_schema(), ObjectSchema)
                iterator = TrialInputIterator(iter(test_cases), self.input_model.get_root_schema(), self.input_model.get_seed_input())

                trial = Trial(iterator, self.checkers, num_mutation=self.num_of_mutations)
                runner.run(trial, collector)
                return trial

            while self.runners.has_free() and len(test_case_list) != 0:
                test_cases, test_case_list = test_case_list[:self.num_of_mutations], test_case_list[self.num_of_mutations:]
                self.runners.submit(task, test_cases)
                active_runner_count += 1

            trial: Trial = self.runners.get_next_unordered()
            active_runner_count -= 1
            for test_case in trial.next_input.next_testcase:
                test_case_list.append(test_case)
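
The scheduling idea in this loop (hand out batches while a runner is free, wait for any trial to finish, re-queue its leftover test cases) can be sketched without Ray. The version below is only an illustration built on concurrent.futures; run_trial, demo_trial and the parameter names are hypothetical. As far as the excerpt shows, the outer loop starts with active_runner_count = 0 and an `and` condition, so its body would not be entered; the sketch uses `or` so that it keeps going while work is either queued or still running.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, List, Tuple


def run_test_plan(test_cases: List[str],
                  run_trial: Callable[[List[str]], Tuple[List[str], List[str]]],
                  num_workers: int = 2,
                  batch_size: int = 3) -> List[str]:
    """Dispatch batches of test cases until none are queued or running.

    run_trial(batch) returns (finished, leftover); leftover cases are re-queued.
    """
    queue = list(test_cases)
    finished: List[str] = []
    active = set()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # 'or', not 'and': loop while work is queued OR a trial is still running.
        while active or queue:
            # Hand out batches while a worker is free.
            while len(active) < num_workers and queue:
                batch, queue = queue[:batch_size], queue[batch_size:]
                active.add(pool.submit(run_trial, batch))
            # Block until at least one trial completes.
            done, active = wait(active, return_when=FIRST_COMPLETED)
            for future in done:
                ok, leftover = future.result()
                finished.extend(ok)
                queue.extend(leftover)
    return finished


def demo_trial(batch: List[str]) -> Tuple[List[str], List[str]]:
    # Hypothetical trial: completes the first test case and returns the rest.
    return batch[:1], batch[1:]
```

Calling run_test_plan(['t0', 't1', 't2', 't3'], demo_trial) eventually returns all four test cases, in an order that depends on which trial finishes first.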

Kubernetes cluster information collector

Move all code that collects Kubernetes cluster information into one class.

RayRunner

Redesign a cleaner runner interface.
Use Ray to implement the runner, allowing trials to run on multiple machines.

The core run loop is very clear.

acto/acto/runner/ray_runner.py

Lines 27 to 41 in 2aa1977

    
    def run(self, trial: Trial, snapshot_collector: Callable[['Runner', Trial, dict], Snapshot]) -> Trial:
        self.cluster_ok_event.wait()
        for system_input in trial:
            snapshot = None
            error = None
            try:
                snapshot = snapshot_collector(self, trial, system_input)
            except Exception as e:
                error = e
                # TODO: do not use print
                print(traceback.format_exc())
            trial.send_snapshot(snapshot, error)
        self.cluster_ok_event.clear()
        threading.Thread(target=self.__reset_cluster_and_set_available).start()
        return trial
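
The handoff around cluster_ok_event is the part of this loop that is easiest to miss: the runner blocks until its cluster is ready, serves one trial, then resets the cluster on a background thread so the caller gets the trial back immediately. A minimal sketch of that pattern follows; SketchRunner, reset_cluster and collect are hypothetical names, and the real runner passes a Trial rather than a plain list of inputs.

```python
import threading
from typing import Callable, Iterable, List, Optional, Tuple


class SketchRunner:
    """Hypothetical runner: serves one trial, then resets its cluster in the background."""

    def __init__(self, reset_cluster: Callable[[], None]):
        self._reset_cluster = reset_cluster
        self.cluster_ok_event = threading.Event()
        self.cluster_ok_event.set()  # assumption: the cluster starts out ready
        self._reset_thread: Optional[threading.Thread] = None

    def run(self, inputs: Iterable[dict],
            collect: Callable[[dict], dict]) -> List[Tuple[Optional[dict], Optional[Exception]]]:
        self.cluster_ok_event.wait()  # block until the previous reset has finished
        results = []
        for system_input in inputs:
            snapshot, error = None, None
            try:
                snapshot = collect(system_input)
            except Exception as e:  # report the error alongside the missing snapshot
                error = e
            results.append((snapshot, error))
        # Mark the cluster as dirty and clean it up without blocking the caller.
        self.cluster_ok_event.clear()
        self._reset_thread = threading.Thread(target=self._reset_and_set_available)
        self._reset_thread.start()
        return results

    def _reset_and_set_available(self) -> None:
        self._reset_cluster()
        self.cluster_ok_event.set()
```

A second call to run() on the same runner simply waits on the event until the background reset has completed.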

Snapshot collector

The snapshot collector is an interface used by the Ray runner. Its type is Callable[['Runner', Trial, dict], Snapshot], which means it receives a Ray runner, a Trial and a system input, and produces a Snapshot.

acto/acto/runner/snapshot_collector.py

Lines 42 to 62 in 2aa1977

    
    def snapshot_collector(ctx: CollectorContext, runner: Runner, trial: Trial, system_input: dict, ignore_cli_error=False) -> Snapshot:
        cli_result = runner.kubectl_client.apply(system_input, namespace=ctx.namespace)
        if cli_result.returncode != 0 and not ignore_cli_error:
            logging.error(f'Failed to apply system input to namespace {ctx.namespace}.\n{system_input}')
            raise RuntimeError(f'Failed to apply system input to namespace {ctx.namespace}.\n{system_input}')
        cli_result = {
            "stdout": "" if cli_result.stdout is None else cli_result.stdout.strip(),
            "stderr": "" if cli_result.stderr is None else cli_result.stderr.strip(),
        }
        asyncio.run(wait_for_system_converge(ctx.kubectl_collector, ctx.timeout))
        system_state = ctx.kubectl_collector.collect_system_state()
        operator_log = ctx.kubectl_collector.collect_operator_log()
        events = ctx.kubectl_collector.collect_events()
        not_ready_pods_logs = ctx.kubectl_collector.collect_not_ready_pods_logs()
        return Snapshot(input=system_input, system_state=system_state,
                        operator_log=operator_log, cli_result=cli_result,
                        events=events, not_ready_pods_logs=not_ready_pods_logs,
                        generation=trial.generation, trial_state=trial.state)
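
Note that snapshot_collector takes a CollectorContext as its first parameter, while the runner expects a three-argument callable. The engine code bridges this with with_context, whose definition is not shown in this description. The sketch below assumes it simply binds the context as the first argument, which functools.partial can do; the reduced CollectorContext, the toy collector body and the 'acto-ns' namespace are placeholders for illustration only.

```python
from dataclasses import dataclass
from functools import partial
from typing import Any, Callable, Dict


@dataclass
class CollectorContext:
    # Reduced stand-in; the real context carries more fields (e.g. crd_meta_info).
    namespace: str


def with_context(ctx: CollectorContext, fn: Callable[..., Any]) -> Callable[..., Any]:
    """Assumed behaviour: bind ctx as the first positional argument of fn."""
    return partial(fn, ctx)


def snapshot_collector(ctx: CollectorContext, runner: Any, trial: Any,
                       system_input: Dict[str, Any]) -> Dict[str, Any]:
    # Toy body: the real collector applies the input with kubectl and gathers state.
    return {'namespace': ctx.namespace, 'input': system_input}


# The bound callable now has the (runner, trial, system_input) shape the runner expects.
collector = with_context(CollectorContext(namespace='acto-ns'), snapshot_collector)
```

Keeping the context out of the runner-facing signature means a custom collector can carry whatever configuration it needs without the runner interface changing.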

Reimplement waiting for system convergence using asyncio, to make the function standalone (not in a class).

acto/acto/runner/snapshot_collector.py

Lines 74 to 101 in 2aa1977

    
    async def wait_until_no_future_events(core_api: kubernetes.client.CoreV1Api, timeout: int):
        """
        Wait until no events are generated for the given namespace for the given timeout.
        @param core_api: kubernetes api client, CoreV1Api
        @param timeout: timeout in seconds
        @return:
        """
        while True:
            events: CoreV1EventList = core_api.list_event_for_all_namespaces()
            events_last_time: List[datetime] = [extract_event_time(event) for event in events.items]
            events_last_time = list(filter(None, events_last_time))
            if not events_last_time:
                return True
            max_time: datetime = max(events_last_time)
            # check how much time has passed since the last event
            time_since_last_event = datetime.now(tz=max_time.tzinfo) - max_time
            if time_since_last_event.total_seconds() > timeout:
                return True
            await asyncio.sleep((timedelta(seconds=timeout) - time_since_last_event).total_seconds())


    async def wait_for_system_converge(collector: Collector, no_events_threshold: int, hard_timeout=480):
        futures = [
            asyncio.create_task(wait_until_no_future_events(collector.coreV1Api, no_events_threshold)),
            asyncio.create_task(asyncio.sleep(hard_timeout, result=False))
        ]
        await asyncio.wait(futures, return_when=asyncio.FIRST_COMPLETED)
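
The pattern here is a race between a "quiet period" check and a hard timeout. The following self-contained sketch shows the same race without a Kubernetes client: last_event_time is a hypothetical callable standing in for the event query, and, unlike the excerpt, the sketch cancels the losing task and returns whether the system converged. Those two additions are my own assumptions about what a caller might want, not something taken from the PR.

```python
import asyncio
import time
from typing import Callable


async def wait_until_quiet(last_event_time: Callable[[], float], quiet_seconds: float) -> bool:
    """Return True once no event has been seen for quiet_seconds."""
    while True:
        idle = time.monotonic() - last_event_time()
        if idle >= quiet_seconds:
            return True
        # Sleep only for the remainder of the quiet window, then re-check.
        await asyncio.sleep(quiet_seconds - idle)


async def wait_for_converge(last_event_time: Callable[[], float],
                            quiet_seconds: float, hard_timeout: float) -> bool:
    """Race the quiet-period check against a hard timeout; True means converged."""
    quiet = asyncio.create_task(wait_until_quiet(last_event_time, quiet_seconds))
    deadline = asyncio.create_task(asyncio.sleep(hard_timeout, result=False))
    done, pending = await asyncio.wait({quiet, deadline}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop the loser so it does not linger
    # If both finish together, prefer the positive result.
    return any(task.result() for task in done)
```

From synchronous code this can be driven with asyncio.run(wait_for_converge(...)), the same way snapshot_collector calls wait_for_system_converge.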

Trial

TrialInputIterator is a class that handles test case revert and test case setup.
Trial is a class that handles the internal state of a trial and interacts with a runner.

A runner iterates over a Trial to get system inputs.
A runner then sends the Snapshot from snapshot_collector to the Trial.
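
The pull-an-input, push-a-snapshot protocol between the runner and the Trial can be sketched as follows. MiniTrial and drive are hypothetical, reduced stand-ins: the real Trial runs its checkers inside send_snapshot and drives a TrialInputIterator, while this sketch only terminates on errors.

```python
from typing import Iterator, List, Optional


class MiniTrial:
    """Hypothetical, reduced Trial: yields inputs and consumes snapshots."""

    def __init__(self, inputs: List[dict]):
        self._inputs: Iterator[dict] = iter(inputs)
        self.state = 'normal'
        self.snapshots: List[dict] = []

    def __iter__(self) -> 'MiniTrial':
        return self

    def __next__(self) -> dict:
        if self.state == 'terminated':
            raise StopIteration
        return next(self._inputs)  # StopIteration ends the trial when inputs run out

    def send_snapshot(self, snapshot: Optional[dict], error: Optional[Exception]) -> None:
        # The real Trial runs its checkers here; this sketch only reacts to errors.
        if error is not None or snapshot is None:
            self.state = 'terminated'
        else:
            self.snapshots.append(snapshot)


def drive(trial: MiniTrial) -> MiniTrial:
    """Runner side of the protocol: pull an input, push back a snapshot."""
    for system_input in trial:
        if system_input.get('fail'):
            trial.send_snapshot(None, RuntimeError('apply failed'))
        else:
            trial.send_snapshot({'input': system_input}, None)
    return trial
```

Because the Trial decides in send_snapshot whether iteration continues, the runner never needs to know about checkers or trial states.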

github-actions · 2023-07-11T21:11:24Z

Coverage Report

File	Stmts	Miss	Cover	Missing
acto
__main__.py	86	86	0%	1–173
common.py	354	96	73%	97–98, 102–116, 120, 122, 124, 129, 132, 136, 138, 140, 142, 144, 148, 153–162, 203, 236, 249, 291, 315–316, 325, 327, 330, 334–343, 363–366, 415, 418, 432, 460–461, 498–503, 505–507, 514–517, 521, 537–538, 544–555, 614–623, 627, 634–639
deploy.py	152	121	20%	36–39, 42–58, 62, 65–75, 83–113, 116–123, 129–151, 154–167, 179–202, 209–218, 226–244
engine.py	526	526	0%	1–906
oracle_handle.py	24	13	46%	15–18, 26–27, 39–46, 54
reproduce.py	110	110	0%	1–210
serialization.py	30	16	47%	19–27, 33–39
snapshot.py	25	2	92%	17, 34
acto/checker
checker.py	13	2	85%	12, 19
checker_set.py	57	12	79%	41–42, 67–78
test_checker.py	162	152	6%	13–246
acto/checker/impl
crash.py	30	2	93%	14, 16
health.py	49	3	94%	36, 65, 81
kubectl_cli.py	29	4	86%	32–33, 44–46
operator_log.py	24	2	92%	24, 32
state.py	227	52	77%	55, 63–64, 69–93, 104–105, 133, 140, 147–149, 166, 265, 288, 317, 319–323, 330–338
state_compare.py	76	3	96%	62, 73, 80
state_condition.py	39	13	67%	18, 24, 29–37, 47, 53
acto/input
get_matched_schemas.py	54	22	59%	12, 47–51, 55–74
input.py	588	469	20%	29–36, 73, 91, 111, 122–125, 128, 158–170, 173–184, 188, 192, 199, 205, 209–388, 409–423, 431–436, 440–453, 462, 478, 481–488, 500, 502, 504–506, 509–522, 527, 535–546, 549–555, 563–590, 594–874, 897–910, 914–961
testcase.py	55	20	64%	40–50, 53, 56, 59, 62–66, 95–96, 100, 103, 109, 116, 119
testplan.py	183	137	25%	14–24, 27–29, 32, 35–42, 45, 48–67, 70–74, 78, 82, 85–91, 99–107, 110–121, 124, 127, 130, 133, 136–138, 141–151, 154–164, 167, 174–185, 188–194, 200, 203–219, 222, 225, 231, 239–244, 247, 250, 253, 259–260, 263–269, 272, 275, 278
value_with_schema.py	337	218	35%	16, 20, 24, 28, 32, 36, 40, 54, 58, 61–67, 70–76, 86–114, 117–124, 128–137, 141–149, 152–156, 159, 162, 166, 180, 183, 186–192, 195–201, 211–230, 233–240, 243, 247–256, 260–272, 275–279, 282, 285, 289, 301, 307, 312, 314, 316, 320, 324, 329–333, 337–340, 349–359, 362–369, 373–376, 380–385, 388–391, 405, 408–412, 416, 421–428, 431–434, 437–439, 442–445, 448–451, 462, 465, 468, 485–498
valuegenerator.py	620	389	37%	20, 24, 28, 32, 43, 47, 51, 55, 76–86, 90–104, 107, 110, 114, 117, 121–130, 133–139, 142, 145, 148, 151, 154, 157–167, 194–199, 202–213, 216, 219, 223, 226–231, 234–237, 240, 243–248, 251–254, 257, 260, 263, 266, 269, 273–282, 285, 288, 291, 294–304, 331–343, 346–347, 350, 353, 357, 360–365, 368–371, 374, 377–382, 385–388, 391, 394, 397, 400, 403, 406, 409–419, 450–482, 485–496, 499–502, 505–508, 512–520, 523, 526, 529, 532, 535, 538–548, 573–589, 592–609, 612, 615, 619–621, 624–628, 631–632, 635–644, 647–653, 656–657, 660–670, 673, 676, 679, 682, 685, 688–698, 710–711, 714–724, 727–730, 733–736, 740, 748, 751–752, 755–765, 768–771, 774–777, 781, 794–797, 800–814, 817, 820, 824, 827–832, 835, 838, 841–846, 849, 852, 855, 858, 861–871, 881, 884, 887, 890, 894, 913, 919, 931, 933, 936–940, 943–946, 951, 955, 957, 961–962
acto/input/known_schemas
base.py	53	13	75%	17–18, 28, 37, 46–47, 56–57, 66, 75, 84–85, 93
cronjob_schemas.py	76	33	57%	13, 16–19, 22, 25, 36–39, 42–47, 50, 53, 59, 62–65, 68, 71, 82, 85–90, 93, 96, 113, 117–119, 131, 137, 140
deployment_schemas.py	59	25	58%	16, 22–27, 30–32, 35, 38, 54–57, 65–67, 70, 78–81, 91, 94
known_schema.py	75	39	48%	28, 31–34, 37, 43, 46–48, 51, 54, 81–84, 102–113, 117–135
pod_disruption_budget_schemas.py	56	22	61%	14–17, 21, 25–27, 30, 41–44, 48, 54, 57, 68–71, 81, 84
pod_schemas.py	797	271	66%	16–19, 23, 28, 32, 40–43, 47, 51–53, 61–64, 68, 73, 83, 92, 151, 156, 160, 167, 171, 178, 182, 238, 242, 247, 251, 294, 298, 303, 307, 335, 338, 341, 344, 347, 350, 353, 356, 359, 362, 365, 368, 393, 397, 400–405, 408, 414, 417, 420, 428, 431, 434, 437, 445, 453, 481, 488–494, 503, 507, 539, 543, 572, 575, 578, 581, 584, 587, 590, 619, 623, 626–631, 634, 642, 645, 648, 664–667, 670–672, 675, 684, 690, 693–696, 699, 702, 713–714, 717–719, 722, 725, 736, 739, 742, 745–748, 752, 756–758, 761–765, 768, 774, 777, 780, 783, 786, 789, 792, 795, 798, 801, 804, 807, 810, 813, 816, 840, 845, 849–857, 860, 866, 869, 872, 875, 878, 881, 884, 887, 890, 893, 896, 899, 902, 905, 908, 929, 933–935, 938–943, 946, 959, 964, 974, 980, 986, 989, 992, 1032, 1036–1038, 1041, 1054, 1059, 1065, 1068–1071, 1074, 1077, 1088–1091, 1094–1096, 1102, 1108, 1111–1114, 1117, 1120, 1133–1136, 1139–1141, 1147, 1153, 1156–1159, 1162, 1165, 1176–1179, 1182–1184, 1190, 1196, 1199–1202, 1205, 1212, 1215–1217, 1223, 1238, 1243, 1247, 1274, 1278, 1291, 1296, 1302, 1305, 1308, 1314, 1317, 1320–1322, 1325, 1342, 1345, 1348, 1374, 1378–1381, 1384, 1402, 1408, 1411, 1414, 1421, 1427, 1469, 1473
resource_schemas.py	149	50	66%	15–19, 22, 25, 28–32, 35, 38, 45, 49, 54, 57–59, 62, 76, 78, 82, 96, 102–105, 108, 111, 121, 134, 147, 154, 159–162, 165, 173–176, 179, 187, 194, 199, 213, 231, 236
service_schemas.py	178	67	62%	13, 16, 19, 25–30, 33, 36, 42, 45–48, 51, 54, 64–67, 70–73, 79, 85, 88–91, 94, 101–102, 105–108, 111, 114, 127, 142, 180, 184, 208, 214, 217, 220, 228–231, 235, 240, 244–246, 249, 257–258, 261, 264, 277–280, 284, 288–290, 293
statefulset_schemas.py	186	61	67%	15–18, 21, 31–36, 42, 49–53, 56–58, 61–63, 66–70, 73–75, 78–80, 83, 90–93, 99–102, 105, 132, 149, 154, 158, 164, 167–170, 173, 176, 190–193, 196–201, 207, 225, 230, 234, 262, 266, 290
storage_schemas.py	179	79	56%	13, 16, 25–30, 33, 36, 42, 48–53, 59, 67–70, 74, 79–80, 83, 89, 92–95, 98, 104–105, 108–111, 114–116, 122, 130–131, 135, 140, 145–148, 154–155, 158, 164, 181–184, 193, 197, 203–205, 211, 214, 228–231, 244, 250–252, 255, 258, 266–269, 273, 277–279, 282
acto/kubectl_client
kubectl.py	23	18	22%	8–14, 23–29, 37–44
acto/kubernetes_engine
base.py	48	30	38%	12, 16, 20, 24, 28, 31–49, 56–70
k3d.py	85	85	0%	1–139
kind.py	88	71	19%	18, 22–51, 57, 66–99, 102–114, 117–130, 137–151
minikube.py	3	3	0%	1–5
acto/monkey_patch
monkey_patch.py	79	24	70%	9, 18, 36, 39, 41, 45–56, 76, 90–95, 106
acto/parse_log
parse_log.py	78	19	76%	71–74, 84–86, 89–91, 96–98, 116–124
acto/post_process
post_diff_test.py	380	223	41%	39, 53–54, 159, 169, 173, 178–187, 222, 224, 226–230, 236–253, 261–269, 272–294, 301–310, 313–360, 371–375, 395, 403–433, 436–453, 457–505, 518, 522, 528–529, 531–552, 562–598
post_process.py	104	22	79%	51, 55, 67, 77–88, 100, 103, 138–142, 146, 150, 158, 162
test_post_process.py	28	1	96%	53
acto/runner
runner.py	286	258	10%	24–44, 72–117, 120–129, 132–163, 170–197, 204–227, 230–236, 239–258, 261–266, 277–282, 296–301, 314–390, 395–399, 405–415, 426–449, 454–487
acto/schema
anyof.py	44	20	55%	26, 30–38, 41, 44, 47–49, 52, 55–60
array.py	70	29	59%	31, 43–52, 57, 59–62, 71–78, 81–83, 86–88, 91, 94, 97, 103
base.py	102	57	44%	13–15, 18–20, 23, 26–37, 40, 43, 46–48, 51–61, 64–74, 77, 80–86, 95, 100, 105, 110, 114, 118, 142, 145–149
boolean.py	27	10	63%	16, 20, 23, 26, 29–32, 35, 38
integer.py	26	9	65%	17, 21–23, 26, 29, 32, 35, 38
number.py	30	11	63%	30–32, 35–37, 40, 43, 46, 49, 52
object.py	117	48	59%	44, 46, 51, 66–75, 80, 83, 94–109, 112–120, 123–126, 129, 132, 148, 151–158, 168
oneof.py	44	30	32%	13–19, 22, 25–27, 30–38, 41, 44, 47–49, 52, 55–60
opaque.py	17	6	65%	13, 16, 19, 22, 25, 28
schema.py	42	8	81%	21, 25, 31–34, 49–50
string.py	27	9	67%	25, 29–31, 34, 37, 40, 43, 46
acto/utils
__init__.py	14	1	93%	12
acto_timer.py	31	22	29%	10–15, 19, 22–33, 38–40, 44–47
config.py	34	1	97%	63
error_handler.py	43	33	23%	13–30, 38–52, 56–74
k8s_helper.py	64	51	20%	21–27, 39–45, 57–61, 73–79, 83–91, 95–102, 106–127
preprocess.py	69	59	14%	16–62, 74–123, 129–174
process_with_except.py	9	9	0%	1–13
thread_logger.py	15	3	80%	9, 18, 28
TOTAL	8010	4300	46%

Tests	Skipped	Failures	Errors	Time
124	0 💤	0 ❌	0 🔥	58.187s ⏱️

tianyin · 2023-07-14T01:04:27Z

Just back in town. Thank you for working on this PR.

Signed-off-by: Tyler Gu <[email protected]>

tianyin · 2023-07-19T23:36:05Z

Why do all the tests keep failing...

Spedoske · 2023-07-22T11:02:23Z

What should we do to verify the branch before merging it into the main branch? @tylergu

tianyin · 2023-07-22T16:35:30Z

@Spedoske This PR has merge conflicts due to the Monkey patch. Can you fix it?

tianyin · 2023-07-22T16:37:35Z

> What should we do to verify the branch

It's discussed in the Slack channel. The current plan is,
https://uiuc-sysnet.slack.com/archives/C034V55LUN8/p1689824741722279

Do speak up if you have better ideas.

tylergu · 2023-07-22T18:33:46Z

@Spedoske , since this AE thing is going on and we are running on a tight deadline, we want to build the AE branch on top of the current main head for stability. I wish to hold this PR until the AE deadline (7/28), because we may have fix patches to apply on top of the current main head for the AE.
After the AE, I will review all the changes in this PR and merge it if the regression tests we get from the AE all pass.

Spedoske · 2023-07-23T06:33:42Z

> @Spedoske, since this AE thing is going on and we are running on a tight deadline, we want to build the AE branch on top of the current main head for stability. I wish to hold this PR until the AE deadline (7/28), because we may have fix patches to apply on top of the current main head for the AE. After the AE, I will review all the changes in this PR and merge it if the regression tests we get from the AE all pass.

Yes, I agree with the schedule.

In addition, as the pull request is no longer a draft, I am now running the kafka operator #230, which should be finished in 36 hours.

tianyin · 2023-07-23T17:33:23Z

> I am now running the kafka operator #230,

Sounds good. It's a big task if we consider inspection. But I'm fine if the goal is only to port.

> which should be finished in 36 hours.

Take your time!

tianyin · 2023-08-13T16:02:16Z

@Spedoske @tylergu have you guys met and discussed the merge plan? What is the conclusion?

tylergu · 2023-08-13T16:26:08Z

@tianyin We have met and discussed possible ways to merge this PR, but did not reach a concrete plan.
It is clear to me that we should not just directly merge this PR. This PR contains a lot of different refactoring changes and many different new features. Some of the features are clearly needed and can be merged safely (the extensible checker interface). Some of the changes need further discussion (we should avoid refactoring every month; if we do refactor, we should do it right). The ideal solution is to split all the changes into smaller PRs and merge them incrementally. We can merge the changes that are clearly needed first.

The challenge here is that the commit history does not lend itself to splitting the PR. The different features and refactors are intermingled and depend on each other, so it would take a lot of effort just to split the PR.

I came up with two potential plans:

1. We rewrite the PR history, and split the PR into smaller PRs. We need to make sure after each small PR, Acto should be in a working condition.
2. We abandon this PR's history. We summarize the features and refactors implemented in this PR and discuss what to merge and what not to merge, and manually merge the parts in this PR that we want.

tianyin · 2023-08-13T16:47:05Z

I can't agree more!

I really think we need many small PRs rather than a humongous PR which nobody knows how to merge. In industry, those PRs will never be merged due to the risk.

#1 is really the key but I don't know how we can ensure the working condition -- that's one problem we will be facing again and again.

My suggestion is to abandon this PR and cherry-pick some code to open new small PRs.

Let me know your thoughts @Spedoske

tylergu · 2024-01-17T21:24:06Z

Migrated to issue #309

Spedoske and others added 2 commits July 10, 2023 05:29

refactor engine and runner

c86554d

make it possible to generate context.json

2aa1977

Spedoske marked this pull request as draft July 11, 2023 21:10

Spedoske mentioned this pull request Jul 11, 2023

Action Plan for Spedoske before Jul 10 #232

Closed

3 tasks

KashunCheng added 2 commits July 11, 2023 23:23

fix run trials on one ray runner

61a477d

fix run trails on multiple ray runners

780d993

tianyin mentioned this pull request Jul 14, 2023

The Runner Design #236

Closed

add a thread implementation for runner

4bea2fe

Spedoske mentioned this pull request Jul 15, 2023

Action Plan for Spedoske before Jul 21 #237

Closed

3 tasks

KashunCheng and others added 5 commits July 16, 2023 00:44

add ansible deployment scripts

759163b

fix wrong image hash

80a0081

Add comments

41dc8da

Signed-off-by: Tyler Gu <[email protected]>

fix post diff test

5affdea

fix monkey patch

d7c974c

KashunCheng added 4 commits July 19, 2023 21:28

fix unittests

777a77d

Fix for checker, reproduce and post_diff as entrypoints

a850ccd

Fix unittests

0422e7f

Change cr to start 3 kafka instances by default

17abef2

Spedoske self-assigned this Jul 22, 2023

Spedoske marked this pull request as ready for review July 22, 2023 11:01

Fix logic error about iterators

ca4d04c

KashunCheng added 7 commits July 23, 2023 19:24

Add checker implementation guide and redesign checker interface

6ff5921

fix compatibility for python 3.8

4fe1d27

fix bugs for checker

5a06ab9

implement coverage collection

760c343

Fix multiprocess in post process, ray dashboard and simplify runner

89c1168

Fix unittest

f8a82f2

Bug fixes for process runner

039832c

Spedoske force-pushed the refactor-prealpha branch from 039832c to ca4d04c Compare August 10, 2023 16:15

Spedoske changed the title ~~Refactor Engine and Runner~~ [DO NOT MERGE] Refactor Engine and Runner Aug 10, 2023

Fix ray trials

452b8ea

Spedoske mentioned this pull request Aug 13, 2023

PR feature table #260

Closed

KashunCheng added this to the Merge the big refactor milestone Aug 29, 2023

tylergu mentioned this pull request Jan 17, 2024

Test Runner and KubernetesEngine New Design #309

Open

tylergu closed this Jan 17, 2024

tylergu deleted the refactor-prealpha branch February 29, 2024 22:46


	def is_pass(self) -> bool:
	if not isinstance(self.crash_result, PassResult) and self.crash_result is not None:
	return False
	elif not isinstance(self.health_result, PassResult) and self.health_result is not None:
	return False
	elif not isinstance(self.custom_result, PassResult) and self.custom_result is not None:
	return False

	if isinstance(self.state_result, PassResult):
	return True
	else:
	if actoConfig.alarms.invalid_input and isinstance(
	self.log_result, InvalidInputResult):
	return True
	else:
	return False

	class UnhealthyResult(ErrorResult):

	def __init__(self, oracle: Oracle, msg: str) -> None:
	super().__init__(oracle, msg)

	def to_dict(self):
	return {'oracle': self.oracle, 'message': self.message}

	def from_dict(d: dict):
	return UnhealthyResult(d['oracle'], d['message'])

	@dataclass
	class OracleResult(Exception):
	message: str = OracleControlFlow.ok
	exception: Optional[Exception] = None
	emit_by: str = '<None>'

	def means(self, control_flow: OracleControlFlow):
	method_name = f'means_{control_flow.name}'
	return getattr(self, method_name)()

	def set_emitter(self, oracle: 'Checker'):
	self.emit_by = oracle.name

	@means_first(lambda self: not self.means_terminate())
	def means_ok(self):
	return self.message == OracleControlFlow.ok

	@staticmethod
	def means_flush():
	return False

	@means_first(lambda self: not self.means_terminate())
	def means_revert(self):
	return self.message != OracleControlFlow.ok

	def means_terminate(self):
	return self.exception is not None

	class OracleControlFlow(StrEnum):
	ok = auto()
	flush = auto()
	revert = auto()
	terminate = auto()

# If all the checkers return ok, we can move on to the next generation
if all(map(lambda x: x.means(OracleControlFlow.ok), result)):
    self.state = 'normal'
    return

# If any of the checkers return terminate, we terminate the trial
if any(map(lambda x: x.means(OracleControlFlow.terminate), result)):
    self.state = 'terminated'
    return

if any(map(lambda x: x.means(OracleControlFlow.revert), result)):
    # If the result indicates our input is invalid, we need to first run revert to
    # go back to previous system state, then construct a new input without the
    # responsible testcase and re-apply
    if self.state == 'recovering':
        # if we enter a recovering state twice, we abort the trial
        # because it means we cannot recover from the failure
        self.state = 'terminated'
    else:
        # if current state is normal, we revert the last applied test case
        self.state = 'recovering'
        snapshot.set_snapshot_before_applied_input(prev_snapshot)
        self.next_input.revert()
    return

if any(map(lambda x: x.means(OracleControlFlow.flush), result)):
    if self.state == 'recovering':
        # TODO: what if we encounter a flush when we are recovering?
        self.state = 'terminated'
    else:
        self.next_input.flush()
    return

assert False, "unreachable"
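The priority order of these checks (all ok, then terminate, then revert, then flush) can be read as a small state machine. Below is a simplified sketch that models each checker's verdict as a plain string and returns the next trial state together with the action to take. The function name `next_state` and the action labels are hypothetical and only illustrate the ordering, not the PR's actual interface.

```python
from typing import Iterable, Tuple


def next_state(state: str, verdicts: Iterable[str]) -> Tuple[str, str]:
    """Return (new_state, action) for the current state and checker verdicts.

    Each verdict is one of 'ok', 'terminate', 'revert', 'flush'.
    """
    verdicts = list(verdicts)

    # Every checker passed: continue with the next generation
    if all(v == 'ok' for v in verdicts):
        return 'normal', 'continue'

    # Termination outranks revert and flush
    if 'terminate' in verdicts:
        return 'terminated', 'stop'

    if 'revert' in verdicts:
        # A second consecutive revert means recovery failed
        if state == 'recovering':
            return 'terminated', 'stop'
        return 'recovering', 'revert'

    if 'flush' in verdicts:
        if state == 'recovering':
            return 'terminated', 'stop'
        return state, 'flush'

    raise AssertionError('unreachable')
```

For example, `next_state('normal', ['ok', 'revert'])` yields `('recovering', 'revert')`, while the same verdicts in the `'recovering'` state end the trial.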

def run_recovery(self, runner: Runner, checker: CheckerSet, generation: int) -> OracleResult:
    '''Runs the recovery test case after an error is reported'''
    logger = get_thread_logger(with_prefix=True)
    RECOVERY_SNAPSHOT = -2  # the immediate snapshot before the error

    logger.debug('Running recovery')
    recovery_input = self.snapshots[RECOVERY_SNAPSHOT].input
    snapshot, err = runner.run(recovery_input, generation=-1)
    result = check_state_equality(snapshot, self.snapshots[RECOVERY_SNAPSHOT])

    return result

def run_test_plan(test_case_list: List[Tuple[List[str], TestCase]]):
    active_runner_count = 0
    # Keep looping while any runner is still busy or any test case is still queued.
    # (With `and`, the loop would never start because active_runner_count begins at 0.)
    while active_runner_count != 0 or len(test_case_list) != 0:
        def task(runner: Runner, test_cases: List[Tuple[List[str], TestCase]]) -> Trial:
            collector = with_context(CollectorContext(
                namespace=self.context['namespace'],
                crd_meta_info=self.context['crd'],
            ), snapshot_collector)

            collector = self.deploy.chain_with(drop_first_parameter(collector))
            assert isinstance(self.input_model.get_root_schema(), ObjectSchema)
            iterator = TrialInputIterator(iter(test_cases), self.input_model.get_root_schema(), self.input_model.get_seed_input())

            trial = Trial(iterator, self.checkers, num_mutation=self.num_of_mutations)
            runner.run(trial, collector)
            return trial

        # Hand out batches of test cases to every free runner
        while self.runners.has_free() and len(test_case_list) != 0:
            test_cases, test_case_list = test_case_list[:self.num_of_mutations], test_case_list[self.num_of_mutations:]
            self.runners.submit(task, test_cases)
            active_runner_count += 1

        # Wait for one trial to finish and requeue its leftover test cases
        trial: Trial = self.runners.get_next_unordered()
        active_runner_count -= 1
        for test_case in trial.next_input.next_testcase:
            test_case_list.append(test_case)
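The scheduling pattern here is: hand out batches while a worker is free, then block until one batch finishes, and repeat until nothing is queued and nothing is running. A self-contained sketch of the same pattern using the standard library's `concurrent.futures` is shown below. It is an analogy only: the PR uses its own runner pool (`has_free`, `submit`, `get_next_unordered`), and `run_in_batches` and its parameters are hypothetical names.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Iterable, List


def run_in_batches(items: Iterable, batch_size: int, max_workers: int,
                   worker: Callable[[list], object]) -> List[object]:
    remaining = list(items)
    pending = set()   # futures that are currently running
    results = []

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Continue while work is running OR work is still queued
        while pending or remaining:
            # Fill every free worker slot with the next batch
            while len(pending) < max_workers and remaining:
                batch, remaining = remaining[:batch_size], remaining[batch_size:]
                pending.add(pool.submit(worker, batch))

            # Block until at least one batch finishes, in completion order
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(future.result() for future in done)

    return results
```

A worker that wants to requeue leftover items, as the trial does with its unused test cases, could return them and have the loop append them to `remaining`.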

def run(self, trial: Trial, snapshot_collector: Callable[['Runner', Trial, dict], Snapshot]) -> Trial:
    self.cluster_ok_event.wait()
    for system_input in trial:
        snapshot = None
        error = None
        try:
            snapshot = snapshot_collector(self, trial, system_input)
        except Exception as e:
            error = e
            # TODO: do not use print
            print(traceback.format_exc())
        trial.send_snapshot(snapshot, error)
    self.cluster_ok_event.clear()
    threading.Thread(target=self.__reset_cluster_and_set_available).start()
    return trial
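The runner gates each trial on an event: it waits until the cluster is marked ready, runs, clears the event, and resets the cluster in a background thread that sets the event again. A minimal standalone sketch of that gating with `threading.Event` follows. `GatedRunner` and the `reset` callback are hypothetical stand-ins for the runner and its cluster reset.

```python
import threading
from typing import Callable


class GatedRunner:
    def __init__(self, reset: Callable[[], None]) -> None:
        self._reset = reset
        self.ready = threading.Event()
        self.ready.set()  # assume the environment starts out usable

    def run(self, job: Callable[[], object]) -> object:
        # Block until the previous reset (if any) has finished
        self.ready.wait()
        try:
            return job()
        finally:
            # Mark the environment dirty and clean it up in the background
            self.ready.clear()
            threading.Thread(target=self._reset_and_release).start()

    def _reset_and_release(self) -> None:
        self._reset()
        self.ready.set()
```

The caller gets its result back immediately, and only the next `run` call pays for the reset if it arrives before the reset has finished.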

def snapshot_collector(ctx: CollectorContext, runner: Runner, trial: Trial, system_input: dict, ignore_cli_error=False) -> Snapshot:
    cli_result = runner.kubectl_client.apply(system_input, namespace=ctx.namespace)
    if cli_result.returncode != 0 and not ignore_cli_error:
        logging.error(f'Failed to apply system input to namespace {ctx.namespace}.\n{system_input}')
        raise RuntimeError(f'Failed to apply system input to namespace {ctx.namespace}.\n{system_input}')

    cli_result = {
        "stdout": "" if cli_result.stdout is None else cli_result.stdout.strip(),
        "stderr": "" if cli_result.stderr is None else cli_result.stderr.strip(),
    }
    asyncio.run(wait_for_system_converge(ctx.kubectl_collector, ctx.timeout))

    system_state = ctx.kubectl_collector.collect_system_state()
    operator_log = ctx.kubectl_collector.collect_operator_log()
    events = ctx.kubectl_collector.collect_events()
    not_ready_pods_logs = ctx.kubectl_collector.collect_not_ready_pods_logs()

    return Snapshot(input=system_input, system_state=system_state,
                    operator_log=operator_log, cli_result=cli_result,
                    events=events, not_ready_pods_logs=not_ready_pods_logs,
                    generation=trial.generation, trial_state=trial.state)

async def wait_until_no_future_events(core_api: kubernetes.client.CoreV1Api, timeout: int):
    """
    Wait until no events have been generated in any namespace for `timeout` seconds.
    @param core_api: kubernetes api client, CoreV1Api
    @param timeout: required quiet period in seconds
    @return: True once the quiet period has elapsed
    """
    while True:
        events: CoreV1EventList = core_api.list_event_for_all_namespaces()
        events_last_time: List[datetime] = [extract_event_time(event) for event in events.items]
        events_last_time = list(filter(None, events_last_time))
        if not events_last_time:
            return True
        max_time: datetime = max(events_last_time)
        # check how much time has passed since the last event
        time_since_last_event = datetime.now(tz=max_time.tzinfo) - max_time
        if time_since_last_event.total_seconds() > timeout:
            return True
        # sleep for the remainder of the quiet period, then check again
        await asyncio.sleep((timedelta(seconds=timeout) - time_since_last_event).total_seconds())


async def wait_for_system_converge(collector: Collector, no_events_threshold: int, hard_timeout=480):
    futures = [
        asyncio.create_task(wait_until_no_future_events(collector.coreV1Api, no_events_threshold)),
        asyncio.create_task(asyncio.sleep(hard_timeout, result=False))
    ]

    await asyncio.wait(futures, return_when=asyncio.FIRST_COMPLETED)
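The convergence wait races a "quiet period" coroutine against a hard timeout and continues as soon as either finishes. As a hedged variation on that idea, the sketch below also cancels the losing task and reports which side won, which the snippet above does not do. `race_with_timeout` and `finishes_after` are hypothetical names used only for this illustration.

```python
import asyncio
from typing import Awaitable


async def race_with_timeout(waiter: Awaitable[bool], hard_timeout: float) -> bool:
    """Return True if `waiter` finished first, False if the timeout won."""
    tasks = [
        asyncio.ensure_future(waiter),
        # asyncio.sleep returns `result` when the delay elapses
        asyncio.ensure_future(asyncio.sleep(hard_timeout, result=False)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

    # Cancel the loser so it does not keep running in the background
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    # If both finished in the same loop iteration, treat it as a timeout
    return all(task.result() for task in done)


async def finishes_after(delay: float) -> bool:
    # Stand-in for a convergence check that succeeds after `delay` seconds
    await asyncio.sleep(delay)
    return True
```

Returning the outcome lets the caller record whether a snapshot was taken after real convergence or after the hard timeout.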
