Updated the EKS CloudFormation node pool template to the latest official available state #3300

pregnor · 2020-11-09T15:01:09Z

Q	A
Bug fix?	no
New feature?	no
API breaks?	no
Deprecations?	no
Related tickets	resolves #3257
License	Apache 2.0

What's in this PR?

Updated the EKS CloudFormation node pool template to the latest available version.
Due to the number of changes, see the change list at the additional notes.

Why?

To keep the corresponding CloudFormation template up to date with the latest official recommended one.

Additional context

The changes are well separated in their dedicated commits so we can easily revert any of those if we wish to do so for any reason. The PR is advised to be reviewed by commit because the changes had been introduced iteratively and it's much easier to follow them that way then comparing the original state to the result.

Detailed change list:

Synced the formatting of the template to the source one to minimize the number of changes to merge starting from the next update. (quotation marks, line breaks, high level attribute ordering, low level attribute ordering, indentation)
Synced parameter descriptions to the source descriptions where one was available.
Corrected the description of NodeAutoScalingGroupMaxBatchSize.
Synced parameters and resources available in the source, commented out unused resources (no actual changes to existing resources, no new resources, some new, so far not used parameters).
Synced parameter types to the source (typedefed string type).
Synced default values to the source (the ones changed are not used, because all values are explicitly provided on creation in the current implementation).
According to the source, add the unused NodeAutoScalingGroupDesiredCapacity parameter. Previously such attribute was not present in the official template describing the size of the autoscaling group at the moment (neither current/actual, nor desired).
According to the source, added a new bool parameter DisableIMDSv1 to be able to disable the version 1 instance metadata service. Currently it is not in use, but the corresponding NodeLaunchTemplate is ready to use it in case we would like to turn it on.
Synced the node instance types to the ones available in the source (some new instance types are available).
According to the source, added a new output to the stack template outputting the template's NodeGroup (other outputs are commented out, but this seemed to be a relevant result as the node group is created by the stack initialized from this template).
According to the source added the MinInstancesInService attribute to the NodeGroup's UpdatePolicy and introduced a corresponding optional parameter NodeAutoScalingGroupMinInstancesInService (defaults to 0) which controls the minimum number of serving (functional) nodes at any moment during a node autoscaling group update. The implementation has been updated to set this to updateOptions.MaxSurge. Also created Revise EKS node pool surge node logic now that NASGMinInstancesInService is available in CF templates #3299 to revise the custom surge logic implementation we have in relation to this change. I deliberately wanted to separate the template update from the potential implementation changes branching off of the possibilities added by the update. I think this way it is easier to keep the integrity and functionality intact.

Checklist

Implementation tested (with at least one cloud provider)
Code meets the Developer Guide
~~OpenAPI and Postman files updated (if needed)~~
~~User guide and development docs updated (if needed)~~
~~Related Helm chart(s) updated (if needed)~~

To Do

Manual test.
~~Enterprise PR updating the UpdateNodeGroupActivity input parameters and usage.~~
Making sure the OSS version is released together with the enterprise version to avoid any potential activity input type mismatch (should be able to handle non-existing key and parse it into zero value, but it's best to make sure). (Is there any alternative than releasing a version from both after both PRs are merged?)

pregnor · 2020-11-12T09:44:27Z

I decided to use InitSize in place of DesiredCapacity, because the node pool update process would potentially try to use the previous value of the DesiredCapacity parameter even in case of updating an old stack to a new stack and fail with no previous value available because the previous template did not contain such parameter (even with default value).

Also thought about trying to define the DesiredCapacity referencing the InitSize, but it is not allowed.

We still have possible workarounds for the future, but all of those go beyond the original intent of this PR (updating the template to the latest official state), so instead I decided avoiding to introduce the DesiredCapacity altogether and instead commenting the template accordingly.

I'm more than happy to resolve this InitSize/DesiredCapacity issue in another template with any of the discussed workarounds or migration solution, but I think it is a priority to integrate these updates ASAP.

pregnor · 2020-11-13T04:28:20Z

Fixed MinInstancesInService usage for spot instances (must be 0) and value deriving (updateOptions.MaxSurge).

pregnor · 2020-11-13T04:53:47Z

Created #3309 for further efforts in this direction.

pregnor · 2020-11-13T05:21:31Z

Fixed MinInstanceInServices during EKS.NPAPI.UpdateNP to always specify an explicit value instead of using the previous value when the UpdateNodeGroupActivity received a 0 MinInstanceInServices value.

This high level means defaulting to no surge / no ensured service instances behavior in case no surge is specified in the request instead of the previous "use the last used surge / ensured service instance value" behavior.

Context: the original behavior would have been considerably good in case we wanted to support a set once, use always afterwards use-case, but it would have meant that the surge / ensured instances could not have been reduced to 0 once it was not zero before. I think the explicit "we are using what you specified to this update request" is a much cleaner approach, even if requires a bit more complexity (per request surge setting) from the user.

pregnor · 2020-11-13T06:47:49Z

After checking the current enterprise implementation I have determined it requires no change in input parameter which is a delight, because currently on OSS Pipeline side we are using 2 different input types for the EKS and PKE/AWS node pool update activities, but currently these are using a common input type on the enterprise Pipeline side which would be problematic due to the differing fields in the two if we had to update the enterprise side as well thus a smaller refactor around the type would be required on the enterprise side.

Regarding this case we have the luxury of Cadence passing input types around in a weak representation (marshaling with default values/allowed non-existing keys).

(And the enterprise workflow doing the maxSurge based node pool update locally, only updating the desired capacity at the node group.)

Supposing we are disallowing fallback deliberately.

We are using preinitialized NodeInstanceRoleID which can also be specified by the end user.

We are using preinitialized node security group, possibly specified by the end user.

Added unused NodeAutoScalingGroupDesiredCapacity according to source. It cannot replace InitSize, because we are postponing migration implementation and backward compatibility issues.

Added an option for disabling instance metadata service v1. It is turned off by default, everything is supposed to work as it has before.

pregnor added cloud/aws distribution/eks Amazon EKS kind/tech-debt area/node-pool Kubernetes cluster node pools labels Nov 9, 2020

pregnor requested review from bonifaido, orymate and sagikazarmark November 9, 2020 15:01

pregnor self-assigned this Nov 9, 2020

orymate approved these changes Nov 10, 2020

View reviewed changes

pregnor changed the title ~~Updated the EKS CloudFormaiton node pool template to the latest official available state~~ Updated the EKS CloudFormation node pool template to the latest official available state Nov 10, 2020

pregnor force-pushed the 3257/update-node-group-cf-template branch 10 times, most recently from fbefb25 to a43d1e8 Compare November 12, 2020 08:57

bonifaido approved these changes Nov 12, 2020

View reviewed changes

pregnor force-pushed the 3257/update-node-group-cf-template branch 2 times, most recently from 295ebe9 to 08d1dfb Compare November 13, 2020 04:26

pregnor force-pushed the 3257/update-node-group-cf-template branch from 08d1dfb to 85c027c Compare November 13, 2020 04:30

pregnor force-pushed the 3257/update-node-group-cf-template branch from 85c027c to 1ab6cdd Compare November 13, 2020 05:12

pregnor force-pushed the 3257/update-node-group-cf-template branch from 5215c2c to 8d6fd80 Compare November 17, 2020 12:38

pregnor mentioned this pull request Nov 17, 2020

fix(template/EKS/NP): fixed parameter metadata #3321

Merged

pregnor force-pushed the 3257/update-node-group-cf-template branch 5 times, most recently from af73331 to 8cb378c Compare November 17, 2020 17:24

pregnor added 2 commits November 18, 2020 09:21

style(tmpl/EKS/np): synced quotes to source

fe2fba4

style(tmpl/EKS/np): synced line breaks to source

480609c

pregnor force-pushed the 3257/update-node-group-cf-template branch from 8cb378c to e535a4c Compare November 18, 2020 08:21

pregnor added 17 commits November 18, 2020 09:23

style(tmpl/EKS/np): synced group order to source

41b1b3a

style(tmpl/EKS/np): synced field order to source

dd372ff

style(tmpl/EKS/np): synced indentation to source

ba81abe

docs(tmpl/EKS/np): synced descriptions to source

24909af

docs(tmpl/EKS/np): updated descriptions

edbe70d

chore(tmpl/EKS/np): synced unused imageID fallback

a7b6b19

Supposing we are disallowing fallback deliberately.

chore(tmpl/EKS/np): synced unused NodeInstanceRole

54c20f2

We are using preinitialized NodeInstanceRoleID which can also be specified by the end user.

chore(tmpl/EKS/np): synced unused NSecurityGroup

9bfb928

We are using preinitialized node security group, possibly specified by the end user.

chore(tmpl/EKS/np): synced unused NSG dependencies

53f5d3c

chore(tmpl/EKS/np): synced unused NSG in-/egresses

349af62

chore(tmpl/EKS/np): synced unused NASGDesiredCap.

f700d8b

Added unused NodeAutoScalingGroupDesiredCapacity according to source. It cannot replace InitSize, because we are postponing migration implementation and backward compatibility issues.

fix(tmpl/EKS/np): synced types to source

339e5fd

feat(tmpl/EKS/np): added template version 2.0.0

0059993

feat(tmpl/EKS/np): added DisableIMDSv1 option

293908f

Added an option for disabling instance metadata service v1. It is turned off by default, everything is supposed to work as it has before.

feat(tmpl/EKS/np): added NASGMinInstancesInService

d4d40b0

feat(tmpl/EKS/np): synced NodeInstanceTypes to src

b1e2a58

feat(tmpl/EKS/np): synced outputs to source

0fdbe23

pregnor force-pushed the 3257/update-node-group-cf-template branch from e535a4c to 0fdbe23 Compare November 18, 2020 08:27

pregnor merged commit 7485e39 into master Nov 18, 2020

pregnor deleted the 3257/update-node-group-cf-template branch November 18, 2020 09:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated the EKS CloudFormation node pool template to the latest official available state #3300

Updated the EKS CloudFormation node pool template to the latest official available state #3300

pregnor commented Nov 9, 2020 •

edited

Loading

pregnor commented Nov 12, 2020

pregnor commented Nov 13, 2020

pregnor commented Nov 13, 2020

pregnor commented Nov 13, 2020

pregnor commented Nov 13, 2020 •

edited

Loading

Updated the EKS CloudFormation node pool template to the latest official available state #3300

Updated the EKS CloudFormation node pool template to the latest official available state #3300

Conversation

pregnor commented Nov 9, 2020 • edited Loading

What's in this PR?

Why?

Additional context

Checklist

To Do

pregnor commented Nov 12, 2020

pregnor commented Nov 13, 2020

pregnor commented Nov 13, 2020

pregnor commented Nov 13, 2020

pregnor commented Nov 13, 2020 • edited Loading

pregnor commented Nov 9, 2020 •

edited

Loading

pregnor commented Nov 13, 2020 •

edited

Loading