Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add the ability to remove nodes older than a certain age even when the node group has reached the min #240

Merged
merged 3 commits into from
Jun 6, 2024

Conversation

awprice
Copy link
Member

@awprice awprice commented May 27, 2024

  • adds max_node_age field to node group configuration
  • max_node_age by default is disabled, and needs to be a positive, greater than zero duration to enable
  • when the node group has reached the minimum and there are nodes older than the max_node_age, the scale delta will be set to +1
  • this will then force escalator to replace the oldest node

fixes #239

}

// We don't want to attempt to rotate nodes that have reached the max age if we haven't reached the minimum node
// count, as the scaling down of the node group will remove the oldest first anyway.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still an edge case here which is a node group that somehow has very consistent load such that it always maintains X number of nodes (e.g. always 2) and thus never gets cycled. I think this is very unlikely with the kinds of workloads that run on Escalator, but worth calling out in case others think it's worth considering here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought of this edge case whilst writing this feature - I guess it's a been an edge case in Escalator since it's existence, and won't be fixed or made worse by this feature being added.

Ideally if there is a consistent base load, the mininum nodes in the ASG should really be configured to a value that meets the base load, then this feature can be used to ensure the nodes are cycled out.

I can check if this edge case is documented in Escalator's documentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check if this edge case is documented in Escalator's documentation

This isn't documented anywhere - I've raised #241


// We don't want to attempt to rotate nodes that have reached the max age if we haven't reached the minimum node
// count, as the scaling down of the node group will remove the oldest first anyway.
if len(untaintedNodes) != nodeGroup.Opts.MinNodes || len(untaintedNodes) == 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be <= to catch weird edge cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I think we want to keep len(untaintedNodes) != nodeGroup.Opts.MinNodes, as we're only wanting this feature to come into play when we're at the minimum node level.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was using <= is more defensive in case we (or the cloud provider) have a bug elsewhere in the code that would allow the number of nodes present to be less than the number of MinNodes. In that case, we would actually start checking their age and potentially mess with the scale delta here, which complicates getting back to the MinNodes regardless.

All of this said, it's an unlikely edge case and I'm fine with this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was using <= is more defensive in case we (or the cloud provider) have a bug elsewhere in the code that would allow the number of nodes present to be less than the number of MinNodes.

Exceeding the max and being under the min is already caught by Escalator here -

if len(allNodes) < nodeGroup.Opts.MinNodes {
err = errors.New("node count less than the minimum")
log.WithField("nodegroup", nodegroup).Warningf(
"Node count of %v less than minimum of %v",
len(allNodes),
nodeGroup.Opts.MinNodes,
)
return 0, err
}
if len(allNodes) > nodeGroup.Opts.MaxNodes {
err = errors.New("node count larger than the maximum")
log.WithField("nodegroup", nodegroup).Warningf(
"Node count of %v larger than maximum of %v",
len(allNodes),
nodeGroup.Opts.MaxNodes,
)
return 0, err
}

},
listerOptions: ListerOptions{},
},
1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this case, the behaviour is that we scale up one node, it would be replaced and then we would do another scale up to replace the other older node once we made it back to our minimum node count again. Is that correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - that's the desired behaviour

nil,
},
{
"max_node_age enabled, some nodes are tainted",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a test case where we're at the MinNodes but some are tainted? I don't think that should occur in regular operation though, just throwing this out there in case it makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look into adding this test case - it doesn't make sense to add to this feature's tests, as Escalator should get its self out of this scenario prior to the maxNodeAge code running (it will untaint the tainted nodes in an attempt to get the untainted nodes equal to the min nodes).

Again, this should be tested elsewhere already.

@@ -471,6 +476,31 @@ func TestValidateNodeGroup(t *testing.T) {
HardDeleteGracePeriod: "1h10m",
ScaleUpCoolDownPeriod: "55m",
TaintEffect: "",
MaxNodeAge: "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Piggy backing this particular edge case onto the case where there's an empty TaintEffect means that us also checking that we're parsing an empty MaxNodeAge is kinda lost. Either make a new table test case here or add that we're checking this field in the name of the table test case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added a test case for this

args args
expectedNodeDelta int
err error
}{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a test for num nodes == maxNodes, and all expired, with scale up enabled

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you worried about here? Are we expecting delta to be 0 because we're at max, or to be positive but unable to go further due to max? Either way I think that particular case isn't related to this feature

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah just the very unlikely case of nodes at max for a long time, and this min of +1 node delta to cycle them trying to push above maxASG.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually perfectly fine for node delta to be +1 when at the max ASG size - we have logic later on when trying to increase the ASG in the cloud provider above the max that will reject the scale up.

I don't think this test case is really relevant for this specific feature and the case of node delta exceeding the ASG max should be tested elsewhere.

Ideally we're really testing if nodeDelta is either +1 or 0 given the criteria for the feature and assuming whatever Escalator does with that nodeDelta is appropriately tested later on

if c.scaleOnMaxNodeAge(nodeGroup, untaintedNodes) {
log.WithField("nodegroup", nodegroup).
Info("Setting scale to minimum of 1 to rotate out a node older than the max node age")
nodesDelta = int(math.Max(float64(nodesDelta), 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this also protect against nodesDelta exceeding ASGMax? Or is it protected against in normal scale functions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat related to the other conversation, but I don't think this is relevant for the tests of this feature - this should already be covered by the general scaling tests that check that if nodeDelta would breach the ASG max, that Escalator prevents it.

@mwhittington21 mwhittington21 self-requested a review May 28, 2024 23:14
mwhittington21
mwhittington21 previously approved these changes May 28, 2024
Copy link
Collaborator

@mwhittington21 mwhittington21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Jacobious52
Jacobious52 previously approved these changes May 29, 2024
…e node group has reached the min

- adds max_node_age field to node group configuration
- max_node_age by default is disabled, and needs to be a positive, greater than zero duration to enable
- when the node group has reached the minimum and there are nodes older than the max_node_age, the scale delta will be set to +1
- this will then force escalator to replace the oldest node

fixes #239

Signed-off-by: Alex Price <[email protected]>
@awprice awprice dismissed stale reviews from Jacobious52 and mwhittington21 via f83a85b May 30, 2024 01:57
@awprice awprice force-pushed the aprice/issue-239 branch from 045ee4e to f83a85b Compare May 30, 2024 01:57
Jacobious52
Jacobious52 previously approved these changes May 30, 2024
@mwhittington21 mwhittington21 self-requested a review May 30, 2024 04:49
mwhittington21
mwhittington21 previously approved these changes May 30, 2024
…o rotate nodes when at the ASG min, but there is a tainted node

Signed-off-by: Alex Price <[email protected]>
@awprice awprice dismissed stale reviews from mwhittington21 and Jacobious52 via 6e227c5 May 30, 2024 06:07
mwhittington21
mwhittington21 previously approved these changes May 30, 2024
Signed-off-by: Alex Price <[email protected]>
@awprice awprice merged commit 679092c into master Jun 6, 2024
3 checks passed
@awprice awprice deleted the aprice/issue-239 branch June 6, 2024 05:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Add ability to remove nodes older than a certain amount of time
3 participants