OCPBUGS-46380: StaticPodOperatorStatus validation should reject downgrades and concurrent node rollouts #2123


benluddy
Contributor

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 11, 2024
Contributor

openshift-ci bot commented Dec 11, 2024

Hello @benluddy! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

Contributor

openshift-ci bot commented Dec 11, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 11, 2024
@deads2k
Contributor

deads2k commented Dec 11, 2024

/test all

@@ -252,6 +252,7 @@ type StaticPodOperatorStatus struct {
// +listType=map
// +listMapKey=nodeName
// +optional
// +kubebuilder:validation:XValidation:rule="size(self.filter(status, status.?targetRevision.orValue(0) != 0)) <= 1",message="no more than 1 node status may have a nonzero targetRevision"
Contributor

missing integration test
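
For readers less familiar with CEL, the rule on the node status list above can be read roughly as the Go sketch below. This is an illustration only: the helper name atMostOneRollout and the sample node names are hypothetical, the struct copies just the fields the rule reads, and the real check is the CEL expression evaluated by the API server, not Go code.

```go
package main

import "fmt"

// NodeStatus copies only the fields that the rule inspects.
type NodeStatus struct {
	NodeName        string
	CurrentRevision int32
	TargetRevision  int32
}

// atMostOneRollout (hypothetical name) mirrors the CEL rule: count the
// entries whose targetRevision is nonzero and require that count to be <= 1.
// An absent targetRevision is treated as 0, like orValue(0) in the rule.
func atMostOneRollout(statuses []NodeStatus) bool {
	inProgress := 0
	for _, s := range statuses {
		if s.TargetRevision != 0 {
			inProgress++
		}
	}
	return inProgress <= 1
}

func main() {
	// Sample data is made up for illustration.
	oneRollout := []NodeStatus{
		{NodeName: "master-0", CurrentRevision: 6, TargetRevision: 7},
		{NodeName: "master-1", CurrentRevision: 6},
	}
	twoRollouts := []NodeStatus{
		{NodeName: "master-0", CurrentRevision: 6, TargetRevision: 7},
		{NodeName: "master-1", CurrentRevision: 6, TargetRevision: 7},
	}
	fmt.Println(atMostOneRollout(oneRollout), atMostOneRollout(twoRollouts)) // true false
}
```

An integration test for the rule would presumably exercise the same two cases: one status with a nonzero targetRevision is accepted, two are rejected.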

@benluddy
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn

Contributor

openshift-ci bot commented Dec 11, 2024

@benluddy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92294ac0-b80d-11ef-8f69-be8d4af6e02e-0

@benluddy benluddy force-pushed the validate-static-pod-operator-nodestatus-max-1-node-rollout branch 2 times, most recently from 0473ef3 to 8b6f854 Compare December 12, 2024 15:17
@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 12, 2024
@benluddy
Contributor Author

/payload-aggregated periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn 3

Contributor

openshift-ci bot commented Dec 12, 2024

@benluddy: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@benluddy
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn 3

Contributor

openshift-ci bot commented Dec 12, 2024

@benluddy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7f63f180-b89c-11ef-9416-0d3097789c7a-0

@benluddy
Contributor Author

I caught the static pod installer controller trying to decrease currentRevision and failing validation here in https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-api-2123-ci-4.19-e2e-gcp-ovn/1867227820942430208.

@benluddy benluddy changed the title WIP: Validate static pod operator nodestatus max 1 node rollout OCPBUGS-46380: StaticPodOperatorStatus validation should reject downgrades and concurrent node rollouts Dec 12, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 12, 2024
@openshift-ci-robot

@benluddy: This pull request references Jira Issue OCPBUGS-46380, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Dec 12, 2024
These node status states and transitions always indicate a bug in one of the static pod controllers.
@benluddy benluddy force-pushed the validate-static-pod-operator-nodestatus-max-1-node-rollout branch from 8b6f854 to f4a5275 Compare December 12, 2024 21:52
@benluddy benluddy marked this pull request as ready for review December 12, 2024 21:52
@benluddy
Contributor Author

/jira refresh

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 12, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Dec 12, 2024
@openshift-ci-robot

@benluddy: This pull request references Jira Issue OCPBUGS-46380, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

/jira refresh


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Dec 12, 2024
@openshift-ci openshift-ci bot requested a review from wangke19 December 12, 2024 21:52
@benluddy
Contributor Author

/assign @deads2k

type NodeStatus struct {
// nodeName is the name of the node
// +required
NodeName string `json:"nodeName"`

// currentRevision is the generation of the most recently successful deployment
// +kubebuilder:validation:XValidation:rule="self >= oldSelf",message="must only increase"
CurrentRevision int32 `json:"currentRevision"`
// targetRevision is the generation of the deployment we're trying to apply
TargetRevision int32 `json:"targetRevision,omitempty"`
Contributor

In the future we'll make this only increase too
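
As a rough Go equivalent of the currentRevision transition rule quoted above, the check amounts to comparing the old and new values on update. This is a sketch only: validateCurrentRevisionUpdate is a hypothetical name, and the actual validation is the CEL expression self >= oldSelf, which as a transition rule is evaluated on updates and not on create.

```go
package main

import "fmt"

// validateCurrentRevisionUpdate (hypothetical name) mirrors the CEL
// transition rule self >= oldSelf: an update may keep or raise
// currentRevision, but never lower it.
func validateCurrentRevisionUpdate(oldRevision, newRevision int32) error {
	if newRevision < oldRevision {
		return fmt.Errorf("currentRevision: must only increase (old %d, new %d)", oldRevision, newRevision)
	}
	return nil
}

func main() {
	// Sample revisions are made up for illustration.
	fmt.Println(validateCurrentRevisionUpdate(6, 7)) // accepted: increase
	fmt.Println(validateCurrentRevisionUpdate(6, 6)) // accepted: unchanged
	fmt.Println(validateCurrentRevisionUpdate(7, 6)) // rejected: decrease
}
```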

@deads2k
Contributor

deads2k commented Dec 13, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2024
Contributor

openshift-ci bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2024
Contributor

openshift-ci bot commented Dec 13, 2024

@benluddy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | f4a5275 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-azure | f4a5275 | link | false | /test e2e-azure |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit f5726c3 into openshift:master Dec 13, 2024
19 of 21 checks passed
@openshift-ci-robot

@benluddy: Jira Issue OCPBUGS-46380: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-46380 has been moved to the MODIFIED state.


@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-config-api
This PR has been included in build ose-cluster-config-api-container-v4.19.0-202412130637.p0.gf5726c3.assembly.stream.el9.
All builds following this will include this PR.

type NodeStatus struct {
// nodeName is the name of the node
// +required
NodeName string `json:"nodeName"`

// currentRevision is the generation of the most recently successful deployment
// +kubebuilder:validation:XValidation:rule="self >= oldSelf",message="must only increase"
Contributor

Oh, this is not true. There is fallback logic in SNO that might revert the CurrentRevision if the new revision fails to install. I think we should revert this validation rule; otherwise, it might break an SNO cluster.

Yes, this reminds me of an earlier bug https://bugzilla.redhat.com/show_bug.cgi?id=1985997.

Contributor Author

I'm not really here, but would it be compatible to instead validate that self.currentRevision == oldSelf.targetRevision?
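
To make the suggestion above concrete, here is a minimal Go sketch of what such a check could look like. It illustrates the idea only and is not part of the change in this PR. The function name is hypothetical, and the sketch assumes (this is not stated in the thread) that the comparison would only apply when currentRevision actually changes, since a literal reading would also reject updates that leave currentRevision untouched.

```go
package main

import "fmt"

// NodeStatus copies only the fields the suggested check would read.
type NodeStatus struct {
	CurrentRevision int32
	TargetRevision  int32
}

// currentMatchesOldTarget (hypothetical name) sketches the suggested rule:
// a changed currentRevision must equal the targetRevision the node had
// before the update. Unchanged values pass; that part is an assumption.
func currentMatchesOldTarget(oldStatus, newStatus NodeStatus) bool {
	if newStatus.CurrentRevision == oldStatus.CurrentRevision {
		return true
	}
	return newStatus.CurrentRevision == oldStatus.TargetRevision
}

func main() {
	// Sample revisions are made up for illustration.
	before := NodeStatus{CurrentRevision: 6, TargetRevision: 7}
	fmt.Println(currentMatchesOldTarget(before, NodeStatus{CurrentRevision: 7}))
	fmt.Println(currentMatchesOldTarget(before, NodeStatus{CurrentRevision: 5}))
}
```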

@p0lyn0mial The fallback logic is described clearly in https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/startup-monitor.md. CurrentRevision won't decrease. When it detects problems with the new revision, the startup-monitor copies the pod manifest from the /etc/kubernetes/static-pods/last-known-good link into /etc/kubernetes. If the link does not exist, it uses the previous revision instead, and if there is no previous revision (as in bootstrapping), it does nothing.
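
The fallback order described in the comment above can be sketched as a small decision function. This is a simplified illustration under stated assumptions: the type, constants, and function name are hypothetical, the two booleans stand in for whatever filesystem checks the startup-monitor actually performs, and none of this is taken from the startup-monitor source.

```go
package main

import "fmt"

// fallbackSource (hypothetical type) names where a fallback pod manifest
// would be copied from.
type fallbackSource string

const (
	fromLastKnownGood    fallbackSource = "last-known-good link"
	fromPreviousRevision fallbackSource = "previous revision"
	noFallback           fallbackSource = "none"
)

// chooseFallbackSource (hypothetical name) follows the order in the comment:
// prefer the last-known-good link, then the previous revision, and do
// nothing when neither exists (for example during bootstrapping).
func chooseFallbackSource(lastKnownGoodExists, previousRevisionExists bool) fallbackSource {
	switch {
	case lastKnownGoodExists:
		return fromLastKnownGood
	case previousRevisionExists:
		return fromPreviousRevision
	default:
		return noFallback
	}
}

func main() {
	fmt.Println(chooseFallbackSource(true, true))
	fmt.Println(chooseFallbackSource(false, true))
	fmt.Println(chooseFallbackSource(false, false))
}
```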

@benluddy
Contributor Author

/cherry-pick release-4.18

@openshift-cherrypick-robot

@benluddy: new pull request created: #2152

In response to this:

/cherry-pick release-4.18

