Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated the EKS CloudFormation node pool template to the latest official available state #3300

Merged
merged 19 commits into from
Nov 18, 2020

Conversation

pregnor
Copy link
Member

@pregnor pregnor commented Nov 9, 2020

Q A
Bug fix? no
New feature? no
API breaks? no
Deprecations? no
Related tickets resolves #3257
License Apache 2.0

What's in this PR?

Updated the EKS CloudFormation node pool template to the latest available version.
Due to the number of changes, see the change list at the additional notes.

Why?

To keep the corresponding CloudFormation template up to date with the latest official recommended one.

Additional context

The changes are well separated in their dedicated commits so we can easily revert any of those if we wish to do so for any reason. The PR is advised to be reviewed by commit because the changes had been introduced iteratively and it's much easier to follow them that way then comparing the original state to the result.

Detailed change list:

  1. Synced the formatting of the template to the source one to minimize the number of changes to merge starting from the next update. (quotation marks, line breaks, high level attribute ordering, low level attribute ordering, indentation)
  2. Synced parameter descriptions to the source descriptions where one was available.
  3. Corrected the description of NodeAutoScalingGroupMaxBatchSize.
  4. Synced parameters and resources available in the source, commented out unused resources (no actual changes to existing resources, no new resources, some new, so far not used parameters).
  5. Synced parameter types to the source (typedefed string type).
  6. Synced default values to the source (the ones changed are not used, because all values are explicitly provided on creation in the current implementation).
  7. According to the source, add the unused NodeAutoScalingGroupDesiredCapacity parameter. Previously such attribute was not present in the official template describing the size of the autoscaling group at the moment (neither current/actual, nor desired).
  8. According to the source, added a new bool parameter DisableIMDSv1 to be able to disable the version 1 instance metadata service. Currently it is not in use, but the corresponding NodeLaunchTemplate is ready to use it in case we would like to turn it on.
  9. Synced the node instance types to the ones available in the source (some new instance types are available).
  10. According to the source, added a new output to the stack template outputting the template's NodeGroup (other outputs are commented out, but this seemed to be a relevant result as the node group is created by the stack initialized from this template).
  11. According to the source added the MinInstancesInService attribute to the NodeGroup's UpdatePolicy and introduced a corresponding optional parameter NodeAutoScalingGroupMinInstancesInService (defaults to 0) which controls the minimum number of serving (functional) nodes at any moment during a node autoscaling group update. The implementation has been updated to set this to updateOptions.MaxSurge. Also created Revise EKS node pool surge node logic now that NASGMinInstancesInService is available in CF templates #3299 to revise the custom surge logic implementation we have in relation to this change. I deliberately wanted to separate the template update from the potential implementation changes branching off of the possibilities added by the update. I think this way it is easier to keep the integrity and functionality intact.

Checklist

  • Implementation tested (with at least one cloud provider)
  • Code meets the Developer Guide
  • OpenAPI and Postman files updated (if needed)
  • User guide and development docs updated (if needed)
  • Related Helm chart(s) updated (if needed)

To Do

  • Manual test.
  • Enterprise PR updating the UpdateNodeGroupActivity input parameters and usage.
  • Making sure the OSS version is released together with the enterprise version to avoid any potential activity input type mismatch (should be able to handle non-existing key and parse it into zero value, but it's best to make sure). (Is there any alternative than releasing a version from both after both PRs are merged?)

@pregnor pregnor added cloud/aws distribution/eks Amazon EKS kind/tech-debt area/node-pool Kubernetes cluster node pools labels Nov 9, 2020
@pregnor pregnor self-assigned this Nov 9, 2020
@pregnor pregnor changed the title Updated the EKS CloudFormaiton node pool template to the latest official available state Updated the EKS CloudFormation node pool template to the latest official available state Nov 10, 2020
@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch 10 times, most recently from fbefb25 to a43d1e8 Compare November 12, 2020 08:57
@pregnor
Copy link
Member Author

pregnor commented Nov 12, 2020

I decided to use InitSize in place of DesiredCapacity, because the node pool update process would potentially try to use the previous value of the DesiredCapacity parameter even in case of updating an old stack to a new stack and fail with no previous value available because the previous template did not contain such parameter (even with default value).

Also thought about trying to define the DesiredCapacity referencing the InitSize, but it is not allowed.

We still have possible workarounds for the future, but all of those go beyond the original intent of this PR (updating the template to the latest official state), so instead I decided avoiding to introduce the DesiredCapacity altogether and instead commenting the template accordingly.

I'm more than happy to resolve this InitSize/DesiredCapacity issue in another template with any of the discussed workarounds or migration solution, but I think it is a priority to integrate these updates ASAP.

@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch 2 times, most recently from 295ebe9 to 08d1dfb Compare November 13, 2020 04:26
@pregnor
Copy link
Member Author

pregnor commented Nov 13, 2020

Fixed MinInstancesInService usage for spot instances (must be 0) and value deriving (updateOptions.MaxSurge).

@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch from 08d1dfb to 85c027c Compare November 13, 2020 04:30
@pregnor
Copy link
Member Author

pregnor commented Nov 13, 2020

Created #3309 for further efforts in this direction.

@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch from 85c027c to 1ab6cdd Compare November 13, 2020 05:12
@pregnor
Copy link
Member Author

pregnor commented Nov 13, 2020

Fixed MinInstanceInServices during EKS.NPAPI.UpdateNP to always specify an explicit value instead of using the previous value when the UpdateNodeGroupActivity received a 0 MinInstanceInServices value.

This high level means defaulting to no surge / no ensured service instances behavior in case no surge is specified in the request instead of the previous "use the last used surge / ensured service instance value" behavior.

Context: the original behavior would have been considerably good in case we wanted to support a set once, use always afterwards use-case, but it would have meant that the surge / ensured instances could not have been reduced to 0 once it was not zero before. I think the explicit "we are using what you specified to this update request" is a much cleaner approach, even if requires a bit more complexity (per request surge setting) from the user.

@pregnor
Copy link
Member Author

pregnor commented Nov 13, 2020

After checking the current enterprise implementation I have determined it requires no change in input parameter which is a delight, because currently on OSS Pipeline side we are using 2 different input types for the EKS and PKE/AWS node pool update activities, but currently these are using a common input type on the enterprise Pipeline side which would be problematic due to the differing fields in the two if we had to update the enterprise side as well thus a smaller refactor around the type would be required on the enterprise side.

Regarding this case we have the luxury of Cadence passing input types around in a weak representation (marshaling with default values/allowed non-existing keys).

(And the enterprise workflow doing the maxSurge based node pool update locally, only updating the desired capacity at the node group.)

@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch from 5215c2c to 8d6fd80 Compare November 17, 2020 12:38
@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch 5 times, most recently from af73331 to 8cb378c Compare November 17, 2020 17:24
@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch from 8cb378c to e535a4c Compare November 18, 2020 08:21
Supposing we are disallowing fallback
deliberately.
We are using preinitialized NodeInstanceRoleID
which can also be specified by the end user.
We are using preinitialized node security group,
possibly specified by the end user.
Added unused NodeAutoScalingGroupDesiredCapacity
according to source. It cannot replace InitSize,
because we are postponing migration implementation
and backward compatibility issues.
Added an option for disabling instance metadata
service v1. It is turned off by default,
everything is supposed to work as it has before.
@pregnor pregnor force-pushed the 3257/update-node-group-cf-template branch from e535a4c to 0fdbe23 Compare November 18, 2020 08:27
@pregnor pregnor merged commit 7485e39 into master Nov 18, 2020
@pregnor pregnor deleted the 3257/update-node-group-cf-template branch November 18, 2020 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update EKS node pool CF template to the latest version
3 participants