[SURE-7342] Re-installation loop of managed services #1703

Closed
1 task done
dvarrazzo opened this issue Aug 9, 2023 · 8 comments

Comments

@dvarrazzo

dvarrazzo commented Aug 9, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We have been experimenting with a Rancher installation.

We have 6 gitrepos currently controlling 6 clusters (not 1:1). The gitrepos watch 3 repos, one of them on several branches.

GITREPO            REPO/BRANCH    CLUSTERS
on-prem            on-prem        1
retailer-test      retailer       1
shared-deploying   shared/test1   0
shared-test        shared/alpha   2
shared-staging     shared/beta    1
shared-prod        shared/master  1

Since we added the cluster on shared-prod, the three clusters on shared-staging and shared-test have entered a loop in which all the managed Helm charts are re-installed. As a consequence, services get restarted and the system is unstable.

I can't find any event in the system showing why the re-installation happens. The GitRepo resources don't change. Because the restart happens on the 3 clusters at the same time, I tend to think that something on the fleet manager triggers it.

How can we diagnose what is causing the problem and stop the issue?

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Architecture: Linux
- Rancher Version: 2.7.2
- Fleet Version: 0.6.0
- Provider: AKS

Logs

No response

Anything else?

No response

@dvarrazzo
Author

Further details:

  • Pausing a cluster has no effect: when its time arrives, Helm charts are redeployed nonetheless.
  • Checking the logs of the deployments in the cattle-system and cattle-fleet-system namespaces, I see no message about the re-installation event, only the ones related to the restarts once they are in progress.
  • I have managed to stabilise the cluster I care about the most (staging) by scaling the fleet-agent deployment to 0, so I can keep investigating what's going on in the test clusters.

I am continuing the investigation, focusing on the manager cluster. If you can suggest any process or resource to monitor, it would be welcome.
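
For readers wanting to apply the same stopgap, scaling the agent down could look roughly like the sketch below. It assumes the agent runs as a Deployment named fleet-agent in the cattle-fleet-system namespace of the downstream cluster (which may not hold for every Fleet version) and that the current kubectl context points at that cluster. The helper name set_fleet_agent_replicas is made up for illustration.

```shell
# Scale the Fleet agent on the downstream cluster the current kubectl
# context points at.
# Assumption: the agent is a Deployment called "fleet-agent" in the
# "cattle-fleet-system" namespace; check this on your own cluster first.
set_fleet_agent_replicas() {
  local replicas="$1"
  kubectl --namespace cattle-fleet-system scale deployment fleet-agent --replicas="$replicas"
}

# Usage (left commented so that sourcing this file changes nothing):
# set_fleet_agent_replicas 0   # stop the agent, so bundles are no longer applied
# set_fleet_agent_replicas 1   # restore the agent
```

While the agent is scaled to 0 the cluster receives no updates from Fleet at all, so this is a way to buy time rather than a fix.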

@dvarrazzo
Author

We have upgraded to Rancher v2.7.6, which we installed on a new manager:

  • we created gitrepos shared-staging and shared-prod
  • we attached a cluster ks-shared-staging configured to deploy shared-staging on it
  • we attached a cluster ks-shared configured to deploy shared-prod on it

As soon as we attached the ks-shared cluster, the bundles on the ks-shared-staging cluster were immediately redeployed, as if the cluster were being installed for the first time (not just upgraded: helm list shows revision 1 and a fresh version). This happens every 15 minutes.

Any feedback, before we abandon the project of moving to Rancher altogether?

@weyfonk
Contributor

weyfonk commented Oct 18, 2023

Hi @dvarrazzo, this looks like another instance of #1245, with redeployments happening every 15 minutes on clusters named with a common prefix.
We fixed this in Fleet v0.8.0, which should be available in Rancher v2.7.6.

Any chance you could update your Fleet install to rule this out as the cause?
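
A quick way to check whether a set of cluster names falls under this pattern is sketched below. It takes one reading of "common prefix", namely that one cluster name is a strict prefix of another, as with ks-shared and ks-shared-staging earlier in this thread. That reading is an assumption, the helper name prefix_pairs is made up, and the code does not reproduce Fleet's own matching logic.

```shell
# Print every pair of names in which the first is a strict prefix of the second.
# Assumption: "named with a common prefix" is read as "one name is a prefix of
# another"; this is an illustration, not Fleet's actual logic.
prefix_pairs() {
  local a b
  for a in "$@"; do
    for b in "$@"; do
      case "$b" in
        # "$a" is matched literally, and ?* requires at least one more character
        "$a"?*) printf '%s is a prefix of %s\n' "$a" "$b" ;;
      esac
    done
  done
}

# Usage with the names from this thread:
# prefix_pairs ks-shared ks-shared-staging
#
# Or with names read from the manager cluster (assumes the Fleet Cluster
# resources there carry the names you care about):
# prefix_pairs $(kubectl get clusters.fleet.cattle.io -A -o jsonpath='{.items[*].metadata.name}')
```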

@dvarrazzo
Author

I looked for similar issues, but didn't find any. Thank you for the reference.

What can I say... Wow. This huge bug has been open since January and across several releases. And no, I cannot confirm that: Rancher 2.7.6, installed on 2023-10-02, came with Fleet 0.7.1, which is affected by the bug.

I will discuss this with my team, but at the moment we are on course to abandon our attempt to use Fleet altogether. As you may understand, our level of trust in the project is pretty low right now.

@weyfonk
Contributor

weyfonk commented Oct 19, 2023

We know that it is currently unpredictable which Fleet version gets installed with a given Rancher version, and we are taking steps to fix that in the next Rancher release.
That is why I asked if you could update your Fleet install to 0.8.0 to resolve your issue (more info here on how to do it).
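
Since the Fleet version cannot be inferred from the Rancher version, it may help to read it from the installed chart. The sketch below assumes the chart string has the shape fleet-102.2.0+up0.8.0, as reported elsewhere in this thread, with the upstream Fleet version after "+up". The helper name fleet_version_from_chart is made up for illustration.

```shell
# Extract the upstream Fleet version from a Rancher-packaged chart string.
# Assumption: the string looks like "fleet-102.2.0+up0.8.0"; if it contains
# no "+up" marker, it is printed unchanged.
fleet_version_from_chart() {
  # Remove the longest leading part that ends in "+up"
  printf '%s\n' "${1##*+up}"
}

# Usage on the manager cluster (needs helm and jq; left commented on purpose):
# helm ls -n cattle-fleet-system -o json | jq -r '.[] | .chart' |
#   while read -r chart; do
#     printf '%s -> %s\n' "$chart" "$(fleet_version_from_chart "$chart")"
#   done
```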

@nepomucen

nepomucen commented Nov 2, 2023

We hit the same issue, and Fleet version 0.8.0 doesn't appear to solve it. In our case the upgrade to Rancher 2.7.6 took place on 2023-10-30. The upgrade automatically brought in Fleet as chart fleet-102.2.0+up0.8.0 (as I understand it, the Fleet version has not been pinned to Rancher 2.7 releases so far).

As a result, the fleet-agent keeps upgrading every minute on downstream RKE2 clusters:

helm ls -n cattle-fleet-system -o json | jq '.[] | "\(.name)   \(.revision)    \(.updated)"'
"fleet-agent-staging-xyz-001   1771    2023-11-02 11:57:54.070257445 +0000 UTC"
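
One possible way to watch for this kind of loop is to take two snapshots of release revisions a few minutes apart and compare them. The sketch below is only illustrative: the helper name changed_releases and the snapshot file paths are made up, and the commented usage reuses the helm and jq invocation shown above, adding jq's -r flag for raw output.

```shell
# Compare two snapshots of "release revision" pairs (one pair per line) and
# print the releases whose revision differs between them.
# $1: older snapshot file, $2: newer snapshot file
changed_releases() {
  awk 'NR == FNR { old[$1] = $2; next }           # first file: remember revisions
       ($1 in old) && old[$1] != $2 {             # second file: report differences
         print $1, old[$1], "->", $2
       }' "$1" "$2"
}

# Usage on a downstream cluster (needs helm and jq; left commented on purpose):
# snapshot() {
#   helm ls -n cattle-fleet-system -o json | jq -r '.[] | "\(.name) \(.revision)"'
# }
# snapshot > /tmp/revisions.old
# sleep 300
# snapshot > /tmp/revisions.new
# changed_releases /tmp/revisions.old /tmp/revisions.new
```

A revision that climbs between snapshots points to repeated upgrades, while one that drops back to 1 points to a re-installation like the one described earlier in the thread.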

kkaempf added this to the 2024-Q1-2.8x milestone Dec 6, 2023
kkaempf changed the title from "Re-installation loop of managed services" to "[SURE-7342] Re-installation loop of managed services" Dec 8, 2023
kkaempf added the JIRA Must shout label Dec 8, 2023
manno moved this from 🆕 New to 📋 Backlog in Fleet Jan 10, 2024
weyfonk moved this from 📋 Backlog to 🏗 In progress in Fleet Jan 12, 2024
weyfonk self-assigned this Jan 12, 2024
@kkaempf
Collaborator

kkaempf commented Jan 15, 2024

@nepomucen, @dvarrazzo - we cannot reproduce this with Rancher 2.8.0 (Fleet 0.9.0). If you can, please help us with a reproducer.

weyfonk moved this from 🏗 In progress to Blocked in Fleet Jan 15, 2024
weyfonk removed their assignment Jan 15, 2024
@kkaempf
Collaborator

kkaempf commented Feb 26, 2024

Closing for lack of response.

We believe this is fixed in Rancher 2.8.

kkaempf closed this as not planned Feb 26, 2024
github-project-automation bot moved this from Blocked to ✅ Done in Fleet Feb 26, 2024
zube bot closed this as completed Feb 26, 2024