[SURE-7342] Re-installation loop of managed services #1703

dvarrazzo · 2023-08-09T08:24:32Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

We have been experimenting with a Rancher installation.

We have 6 gitrepos currently controlling 6 clusters (not 1:1). The gitrepos watch 3 repos, one of them on several branches.

GITREPO            REPO/BRANCH    CLUSTERS
on-prem            on-prem        1
retailer-test      retailer       1
shared-deploying   shared/test1   0
shared-test        shared/alpha   2
shared-staging     shared/beta    1
shared-prod        shared/master  1

Since we added the cluster on shared-prod, the three clusters on shared-staging and shared-test have entered a loop in which all the managed Helm charts are re-installed. As a consequence, services get restarted and the system is unstable.

I can't find any event in the system showing why the re-installation happens. The GitRepo resources don't change. Because the restart happens on the 3 clusters at the same time, I tend to think something on the fleet manager is triggering it.

How can we diagnose what is causing the problem and stop the issue?

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Architecture: Linux
- Rancher Version: 2.7.2
- Fleet Version: 0.6.0
- Provider: AKS

Logs

No response

Anything else?

No response


dvarrazzo · 2023-08-09T13:36:53Z

Further details:

Pausing a cluster has no effect: when its time arrives, Helm charts are redeployed nonetheless.
Checking the logs of the deployments in the cattle-system and cattle-fleet-system namespaces, I see no message about the re-installation event, only the ones related to the restarts once they are in progress.
I have managed to stabilise the cluster I care about the most (staging) by scaling the fleet-agent deployment to 0, and can keep investigating what's going on in the test clusters.

I am continuing the investigation, focusing on the manager cluster. If you can suggest any process or resource to monitor, it would be welcome.

dvarrazzo · 2023-10-18T07:57:08Z

We have upgraded to Rancher v2.7.6, which we installed on a new manager:

we created gitrepos shared-staging and shared-prod
we attached a cluster ks-shared-staging configured to deploy shared-staging on it
we attached a cluster ks-shared configured to deploy shared-prod on it

As soon as we attached the ks-shared cluster, the bundles on the ks-shared-staging cluster were immediately redeployed, as if the cluster were being installed for the first time (not just upgraded: helm list shows revision 1 and a fresh version). This happens every 15 minutes.
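
One way to tell a reinstall from an upgrade is to compare two snapshots of `helm list -A -o json` taken some minutes apart: an upgrade increments a release's revision, while a reinstall resets it to 1. The following is only a minimal sketch under that assumption. It relies on the `name`, `namespace` and `revision` fields of Helm's JSON output; the function name and the sample releases are hypothetical and not taken from the affected clusters.

```python
import json


def find_reinstalled(before_json: str, after_json: str) -> list[str]:
    """Return 'namespace/name' for releases whose revision went down."""

    def revisions(raw: str) -> dict:
        # Helm prints the revision as a string, so convert it to int.
        return {
            (r["namespace"], r["name"]): int(r["revision"])
            for r in json.loads(raw)
        }

    before = revisions(before_json)
    after = revisions(after_json)
    return sorted(
        f"{ns}/{name}"
        for (ns, name), rev in after.items()
        if (ns, name) in before and rev < before[(ns, name)]
    )


# Hypothetical snapshots: app-a was reinstalled, app-b was upgraded.
snapshot_before = json.dumps([
    {"name": "app-a", "namespace": "default", "revision": "4"},
    {"name": "app-b", "namespace": "default", "revision": "2"},
])
snapshot_after = json.dumps([
    {"name": "app-a", "namespace": "default", "revision": "1"},
    {"name": "app-b", "namespace": "default", "revision": "3"},
])

print(find_reinstalled(snapshot_before, snapshot_after))
```

A release that was already at revision 1 in the first snapshot would not be caught this way; comparing the `updated` field as well would cover that case.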

Any feedback, before we abandon the project of moving to Rancher altogether?

weyfonk · 2023-10-18T11:16:36Z

Hi @dvarrazzo, this looks like another instance of #1245, with redeployments happening every 15 minutes on clusters named with a common prefix.
We fixed this in Fleet v0.8.0, which should be available in Rancher v2.7.6.
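
To check whether a setup matches that pattern, a short script can list cluster names where one name is a prefix of another. This is a sketch only: it assumes that "common prefix" means one cluster name being a leading substring of another (as with ks-shared and ks-shared-staging in this thread), which may not be the exact condition that triggers #1245. The function name and the third cluster name are hypothetical.

```python
from itertools import permutations


def prefix_collisions(cluster_names):
    """Return (shorter, longer) pairs where one name is a prefix of another."""
    names = sorted(set(cluster_names))
    return [
        (short, long)
        for short, long in permutations(names, 2)
        if long.startswith(short)
    ]


# The first two names come from this thread; "ks-retailer" is made up.
clusters = ["ks-shared-staging", "ks-shared", "ks-retailer"]
print(prefix_collisions(clusters))
```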

Any chance you could update your Fleet install to rule this out as the issue?

dvarrazzo · 2023-10-18T19:12:48Z

I looked for similar issues, but didn't find any. Thank you for the reference.

What can I say... Wow. This huge bug has been open since January and across several releases. And no, not confirmed: rancher 2.7.6, installed on 2023-10-02, installed fleet 0.7.1, affected by the bug.

I will discuss with my team, but at the moment we are on course to abandon our attempt to use Fleet altogether. Our level of trust in the project is pretty low right now, as you may understand.

weyfonk · 2023-10-19T07:21:37Z

We know there is currently unpredictability on Fleet versions installed with a given Rancher version, and are taking steps to fix that in the next Rancher release.
That is why I asked if you could update your Fleet install to 0.8.0 to resolve your issue (more info here on how to do it).

nepomucen · 2023-11-02T12:18:11Z

We hit the same issue, and Fleet 0.8.0 doesn't appear to solve it. In our case the upgrade to Rancher 2.7.6 took place on 2023-10-30. The upgrade automatically brought in Fleet as chart version fleet-102.2.0+up0.8.0 (as I understand it, the Fleet version has not been pinned to the Rancher 2.7 release so far).

As a result, the fleet-agent keeps upgrading every minute on downstream RKE2 clusters:

helm ls -n cattle-fleet-system -o json | jq '.[] | "\(.name)   \(.revision)    \(.updated)"'
"fleet-agent-staging-xyz-001   1771    2023-11-02 11:57:54.070257445 +0000 UTC"

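The same check can be scripted over the JSON that `helm ls -o json` prints, flagging any release whose revision count looks abnormally high. This is a rough sketch: the function name is hypothetical, and the threshold of 50 revisions is an arbitrary assumption to be tuned per environment. The sample record reuses the release shown above.

```python
import json


def churning_releases(helm_json: str, max_revision: int = 50):
    """Return (name, revision) for releases above the revision threshold."""
    flagged = []
    for release in json.loads(helm_json):
        # Helm prints the revision as a string, so convert it to int.
        revision = int(release["revision"])
        if revision > max_revision:
            flagged.append((release["name"], revision))
    return flagged


# Sample built from the output above; real input would come from
# `helm ls -n cattle-fleet-system -o json`.
sample = json.dumps([
    {
        "name": "fleet-agent-staging-xyz-001",
        "revision": "1771",
        "updated": "2023-11-02 11:57:54.070257445 +0000 UTC",
    }
])

print(churning_releases(sample))
```
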
kkaempf · 2024-01-15T16:07:27Z

@nepomucen, @dvarrazzo - we cannot reproduce this with Rancher 2.8.0 (Fleet 0.9.0). If you can, please help us with a reproducer.

kkaempf · 2024-02-26T16:20:51Z

Closing for lack of response.

We believe this is fixed in Rancher 2.8.

dvarrazzo added [zube]: To Triage kind/bug labels Aug 9, 2023

rancherbot added this to Fleet Aug 9, 2023

github-actions bot added the team/fleet label Aug 9, 2023

github-project-automation bot moved this to 🆕 New in Fleet Aug 9, 2023

kkaempf added this to the 2024-Q1-2.8x milestone Dec 6, 2023

kkaempf changed the title ~~Re-installation loop of managed services~~ [SURE-7342] Re-installation loop of managed services Dec 8, 2023

kkaempf added the JIRA Must shout label Dec 8, 2023

manno moved this from 🆕 New to 📋 Backlog in Fleet Jan 10, 2024

weyfonk moved this from 📋 Backlog to 🏗 In progress in Fleet Jan 12, 2024

weyfonk self-assigned this Jan 12, 2024

weyfonk moved this from 🏗 In progress to Blocked in Fleet Jan 15, 2024

weyfonk removed their assignment Jan 15, 2024

kkaempf added the status/cannot-reproduce label Jan 24, 2024

kkaempf closed this as not planned Feb 26, 2024

github-project-automation bot moved this from Blocked to ✅ Done in Fleet Feb 26, 2024

zube bot closed this as completed Feb 26, 2024

zube bot added [zube]: Done and removed [zube]: To Triage labels Feb 26, 2024
