[NodeKiller] Change rounding of a number of selected nodes #1107
/assign @mm4tt
clusterloader2/pkg/chaos/nodes.go
Outdated
nodesToFail := nodes[:0]
klog.Infof("%s: %d nodes available, expecting to fail %f nodes", k, len(nodes), expectedNodesToFail)
for _, node := range nodes {
	if rand.Float64() < k.config.FailureRate {
This is already randomized, we call rand.Shuffle in line 99
They are randomized, but the size is constant.
And if we expect fewer than one node to fail (as in the load test, with a failure rate of 0.01 and fewer than 100 eligible nodes, since some of them are running Prometheus), no node is scheduled to be killed.
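To make the arithmetic concrete, here is a minimal sketch. It assumes the previous code converted the expected count to an integer by truncation; `selectCountTruncated` is a hypothetical helper for illustration, not a function from nodes.go.

```go
package main

import "fmt"

// selectCountTruncated is a hypothetical stand-in for the old selection size.
// It ASSUMES the count was computed by truncating failureRate * eligible.
func selectCountTruncated(failureRate float64, eligible int) int {
	// Converting a float64 to int in Go drops the fractional part.
	return int(failureRate * float64(eligible))
}

func main() {
	const failureRate = 0.01
	for _, eligible := range []int{90, 99, 250} {
		fmt.Printf("eligible=%d expected=%.2f selected=%d\n",
			eligible, failureRate*float64(eligible),
			selectCountTruncated(failureRate, eligible))
	}
}
```

Under that assumption, any cluster with fewer than 100 eligible nodes selects zero nodes at a failure rate of 0.01, regardless of how the list is shuffled.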
I see, good debugging!
But wouldn't it be better to simply set numNodes to 1 if it's 0?
What I don't like about your solution is that, while it will kill FailureRate of the nodes on average, we'll have test runs where no nodes were killed and runs where 1+ nodes were killed. That is quite likely to make the test flaky: most likely it won't fail, but it can cause high variance in many metrics.
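The variance concern can be quantified with a rough sketch. With an independent draw per node, the chance that a run kills no node at all is (1 - FailureRate)^n. The helper name `probNoNodeSelected` is hypothetical, and the node count of 90 is only an illustrative value consistent with "less than 100 eligible nodes".

```go
package main

import (
	"fmt"
	"math"
)

// probNoNodeSelected returns the probability that independent per-node draws
// with the given failure rate select no node out of `eligible` nodes.
func probNoNodeSelected(failureRate float64, eligible int) float64 {
	return math.Pow(1-failureRate, float64(eligible))
}

func main() {
	// Illustrative values only: 90 eligible nodes, failure rate 0.01.
	p := probNoNodeSelected(0.01, 90)
	fmt.Printf("P(no node killed in one round) = %.3f\n", p)
}
```

For these illustrative values the probability is roughly 0.4, so a large share of rounds would kill nothing while others kill one or more nodes.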
Sure, changed that to get more stable tests. Also increased the minimal interval in the load test to prevent overkilling.
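A minimal sketch of the "at least one node" approach suggested in review follows. The helper `selectCountAtLeastOne` is hypothetical, truncation is assumed for the base count, and the merged change may differ in detail.

```go
package main

import "fmt"

// selectCountAtLeastOne is a hypothetical helper: it computes a deterministic
// selection size and bumps it to 1 when truncation would give 0.
func selectCountAtLeastOne(failureRate float64, eligible int) int {
	if eligible == 0 {
		// Nothing to select; returning 1 here would make nodes[:1] panic.
		return 0
	}
	numNodes := int(failureRate * float64(eligible)) // assumed truncation
	if numNodes == 0 {
		numNodes = 1
	}
	return numNodes
}

func main() {
	for _, eligible := range []int{0, 90, 250} {
		fmt.Printf("eligible=%d selected=%d\n",
			eligible, selectCountAtLeastOne(0.01, eligible))
	}
}
```

Because the count is fixed for a given cluster size, every run kills the same number of nodes, which avoids the run-to-run variance of per-node random draws.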
clusterloader2/pkg/chaos/nodes.go
Outdated
@@ -124,7 +129,7 @@ func (k *NodeKiller) kill(nodes []v1.Node) {
time.Sleep(time.Duration(k.config.SimulatedDowntime))

klog.Infof("%s: Rebooting %q to repair the node", k, node.Name)
err = util.SSH("sudo reboot", &node, nil)
err = util.SSH("nohup sudo reboot +1s > /dev/null 2> /dev/null < /dev/null &", &node, nil)
It could be a different PR (or at least commit).
Also, could you explain how it works? Ideally, in a comment. Personally, I don't understand the +1s part
Sure, moved to PR #1109 with extensive comment.
clusterloader2/pkg/chaos/nodes.go
Outdated
return nodes[:numNodes], nil
expectedNodesToFail := k.config.FailureRate * float64(len(nodes))
nodesToFail := nodes[:0]
klog.Infof("%s: %d nodes available, expecting to fail %f nodes", k, len(nodes), expectedNodesToFail)
+1 for adding more logs (I'd even add more, e.g. listing exactly which nodes were returned here)
This information is available from 'kill' function in line 117, but I can duplicate that.
28e99a1 to 9f57ad5
Thanks, your diagnosis makes sense, but what worries me is that we still have no logs from nodes.go file in the presubmit run. Any idea?
The presubmit job does not have the chaos monkey override enabled, so this code is never invoked in that job. In fact, only a handful of jobs have it enabled: release jobs, a few experimental jobs, and ci-kubernetes-e2e-gci-gce-scalability.
9f57ad5 to ceca314
ceca314 to 3c22abd
/lgtm
3c22abd to 04fcca9
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jprzychodzen, mm4tt
Fixes #1005
Ref. #1005