
[NodeKiller] Change rounding of a number of selected nodes #1107

Merged (1 commit) Mar 5, 2020

Conversation

jprzychodzen
Contributor

Ref. #1005

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 4, 2020
@jprzychodzen
Contributor Author

/assign @mm4tt

nodesToFail := nodes[:0]
klog.Infof("%s: %d nodes available, expecting to fail %f nodes", k, len(nodes), expectedNodesToFail)
for _, node := range nodes {
	if rand.Float64() < k.config.FailureRate {
Contributor
This is already randomized; we call rand.Shuffle on line 99.

Contributor Author

They are randomized, but the size is constant.

And if we expect fewer than one node to fail (as in the load test, with a failure rate of 0.01 and fewer than 100 eligible nodes, since some of them are running Prometheus), no node is scheduled.

Contributor

I see, good debugging!
But wouldn't it be better to simply set numNodes to 1 if it's 0?
What I don't like about your solution is that, while it will kill FailureRate of the nodes on average, we'll have test runs where no nodes are killed and runs where 1+ nodes are killed. It's quite likely to make the test flaky: most likely it won't fail outright, but it can cause large variance in many metrics.

Contributor Author

Sure, changed that to get more stable tests. Also increased the minimal interval in the load test to prevent over-killing.
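The rounding change under discussion can be sketched as follows. This is a hypothetical, minimal reconstruction, not the actual perf-tests code: the helper name `numNodesToFail` and its signature are assumptions, and it assumes the final approach rounds the expected count up rather than truncating it.

```go
package main

import (
	"fmt"
	"math"
)

// numNodesToFail is a hypothetical sketch of the rounding change discussed in
// this thread (the real code lives in perf-tests' NodeKiller). Truncating
// failureRate*len(nodes) to an int yields 0 whenever fewer than 1/failureRate
// nodes are eligible; rounding up instead always selects a deterministic,
// nonzero number of nodes, avoiding the run-to-run variance of a per-node
// coin flip.
func numNodesToFail(numNodes int, failureRate float64) int {
	return int(math.Ceil(failureRate * float64(numNodes)))
}

func main() {
	// With failure rate 0.01 and only 80 eligible nodes, truncation would
	// give 0, but rounding up gives 1, so at least one node is killed.
	fmt.Println(numNodesToFail(80, 0.01))
	fmt.Println(numNodesToFail(500, 0.01))
}
```

Combined with the existing rand.Shuffle on line 99, this keeps node selection random while making the number of killed nodes stable across runs, which addresses the flakiness concern above.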

@@ -124,7 +129,7 @@ func (k *NodeKiller) kill(nodes []v1.Node) {
time.Sleep(time.Duration(k.config.SimulatedDowntime))

klog.Infof("%s: Rebooting %q to repair the node", k, node.Name)
-	err = util.SSH("sudo reboot", &node, nil)
+	err = util.SSH("nohup sudo reboot +1s > /dev/null 2> /dev/null < /dev/null &", &node, nil)
Contributor

This could be a separate PR (or at least a separate commit).
Also, could you explain how it works, ideally in a comment? Personally, I don't understand the +1s part.

Contributor Author

Sure, moved to PR #1109 with an extensive comment.
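The shell wrapping in the diff above can be annotated piece by piece. This sketch only covers the detachment pattern (nohup, redirects, backgrounding), which is standard POSIX shell behavior; the "+1s" argument itself is explained in PR #1109 and is not interpreted here. The helper name `rebootCommand` is purely illustrative.

```go
package main

import "fmt"

// rebootCommand rebuilds the command string from the diff, with each part of
// the detachment pattern explained. Together these let util.SSH return
// cleanly instead of hanging or reporting an error when the SSH connection
// drops mid-reboot.
func rebootCommand() string {
	return "nohup " + // keep the command alive when the SSH session hangs up (SIGHUP)
		"sudo reboot +1s " + // the +1s argument is documented in PR #1109
		"> /dev/null 2> /dev/null " + // detach stdout/stderr so sshd can close the channel
		"< /dev/null " + // detach stdin as well
		"&" // background the command so the remote shell exits immediately
}

func main() {
	fmt.Println(rebootCommand())
}
```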

return nodes[:numNodes], nil
expectedNodesToFail := k.config.FailureRate * float64(len(nodes))
nodesToFail := nodes[:0]
klog.Infof("%s: %d nodes available, expecting to fail %f nodes", k, len(nodes), expectedNodesToFail)
Contributor

+1 for adding more logs (I'd even add more, e.g. listing exactly which nodes were returned here)

Contributor Author

This information is available from the 'kill' function on line 117, but I can duplicate that.
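The extra logging the reviewer asked for (listing exactly which nodes were selected, not just the expected count) could look like the sketch below. This is an assumption-laden illustration: `Node` stands in for v1.Node, and `selectedNodesMessage` is a hypothetical helper; the real code would pass the formatted string to klog.Infof.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a stand-in for v1.Node, carrying only the field we log.
type Node struct{ Name string }

// selectedNodesMessage builds a log line naming each selected node, so a
// test-run log shows exactly which nodes were chosen to fail.
func selectedNodesMessage(nodesToFail []Node) string {
	names := make([]string, 0, len(nodesToFail))
	for _, node := range nodesToFail {
		names = append(names, node.Name)
	}
	return fmt.Sprintf("NodeKiller: selected %d nodes to fail: %s",
		len(names), strings.Join(names, ", "))
}

func main() {
	fmt.Println(selectedNodesMessage([]Node{{"node-1"}, {"node-7"}}))
}
```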

@mm4tt mm4tt (Contributor) left a comment

Thanks, your diagnosis makes sense, but what worries me is that we still have no logs from nodes.go file in the presubmit run. Any idea?


@jprzychodzen (Contributor Author) commented Mar 5, 2020

The presubmit job does not have the chaos monkey override enabled, so this code is never invoked in that job.

And actually, only a handful of jobs have this enabled: release jobs, a few experimental jobs, and ci-kubernetes-e2e-gci-gce-scalability.

@jprzychodzen jprzychodzen changed the title [NodeKiller] Add randomization on node killing [NodeKiller] Change rounding of a number of selected nodes Mar 5, 2020
@mm4tt (Contributor) commented Mar 5, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 5, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2020
@mm4tt (Contributor) commented Mar 5, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2020
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jprzychodzen, mm4tt

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jprzychodzen (Contributor Author)

Fixes #1005

@k8s-ci-robot k8s-ci-robot merged commit d818473 into kubernetes:master Mar 5, 2020
@jprzychodzen jprzychodzen deleted the node-killer branch March 11, 2020 09:12