MEN-7900: Fix Mender client getting stuck after failure in sync state #1724

lluiscampos · 2025-01-10T12:45:42Z

From the code point of view, the issue was that the re-scheduling of new polls for updates or inventory was done from the states after the sync state, so unless the state machine reached that point, the new polls would not be scheduled.

Fix by creating two new states between idle and sync that only do the re-scheduling. Note that the timer(s) have now been moved to the context object so that they can be accessed from multiple states (namely, the update polling and submit inventory states, which need to manipulate the timer for exponential back-off retries).
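To illustrate the ordering change described above, here is a minimal sketch of one pass through a simplified state machine. The names `State`, `Context::polls_scheduled` and `RunOnce` are hypothetical and not taken from the Mender code base; the counter merely stands in for the timer that now lives in the context.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model; these names are not from the Mender sources.
enum class State { Idle, ScheduleNextPoll, Sync, Poll };

struct Context {
	int polls_scheduled = 0; // stands in for the timer kept in the context
};

// Runs one trigger through the machine and returns the states visited.
// `sync_fails` simulates an error in the sync state, for example a failing
// Sync_Enter state script.
inline std::vector<State> RunOnce(Context &ctx, bool sync_fails) {
	std::vector<State> visited {State::Idle};

	// Re-scheduling happens before Sync, so a Sync failure cannot skip it.
	visited.push_back(State::ScheduleNextPoll);
	++ctx.polls_scheduled;

	visited.push_back(State::Sync);
	if (!sync_fails) {
		visited.push_back(State::Poll);
	}

	visited.push_back(State::Idle);
	assert(visited.back() == State::Idle); // every path returns to idle
	return visited;
}
```

With this ordering, a failing sync still leaves one poll scheduled, so the machine would be woken up again instead of staying idle.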

Changelog: Fix issue where any error in Sync state (triggered for example with an error in Sync_Enter or Sync_Leave state scripts) leaves the client stuck in idle state forever and no new polls for update nor submit of inventory would be attempted again.

TODO:

Modify the integration tests so that this bug would have been caught (check that it fails with master and passes with this PR)
Consider duplicating the classes (discussion below)
Consider refactoring the backoff (discussion below)

Signed-off-by: Lluis Campos <[email protected]>

mender-test-bot · 2025-01-10T12:45:53Z

Merging these commits will result in the following changelog entries:

Changelogs

mender (MEN-7900-client-stuck-after-sync-error)

New changes in mender since master:

Bug Fixes

Fix issue where any error in Sync state (triggered for
example with an error in Sync_Enter or Sync_Leave state scripts) leaves
the client stuck in idle state forever and no new polls for update nor
submit of inventory would be attempted again.
(MEN-7900)

mender-test-bot · 2025-01-10T12:46:06Z

@lluiscampos, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".

my commands and options

You can trigger a pipeline on multiple prs with:

mentioning me and start pipeline --pr mender/127 --pr mender-connect/255

You can start a fast pipeline, disabling full integration tests with:

mentioning me and start pipeline --fast

You can trigger GitHub->GitLab branch sync with:

mentioning me and sync

You can cherry pick to a given branch or branches with:

mentioning me and:

 cherry-pick to:
 * 1.0.x
 * 2.0.x

lluiscampos · 2025-01-10T12:52:13Z

src/mender-update/daemon/states.cpp

+}
+
+void ScheduleNextPollState::OnEnter(Context &ctx, sm::EventPoster<StateEvent> &poster) {
+	log::Debug("Scheduling the next check in: " + to_string(interval_) + " seconds");


As you can see, I reused the same class ScheduleNextPollState for both poll for updates and submit inventory, by passing by reference which of the timers and interval to use.

The only disadvantage that I see is the log messages here being more generic: "next check" instead of "next inventory submission" for example.

You could add a const string &action argument to the constructor and use the action description here.

Yes, thank you!

I actually thought about this just as I logged off on Friday... and forgot to comment to myself today 😅

This also partially answers the other question, although there we might want to keep the retry contained in the class(es) that can actually use it rather than in the shared context.

Yeah, the action argument would be nice
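A minimal sketch of the suggestion in this thread, assuming the constructor gains a `const string &action` argument. The class below is a simplified stand-in: it takes the interval by value instead of the timer and interval references used in the PR, and the hypothetical `LogMessage()` returns the text rather than calling `log::Debug`.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the state class discussed above.
class ScheduleNextPollState {
public:
	ScheduleNextPollState(const std::string &action, int interval_seconds) :
		action_ {action},
		interval_ {interval_seconds} {
		assert(interval_seconds > 0); // a non-positive interval makes no sense here
	}

	// Builds the debug message with the action description filled in.
	std::string LogMessage() const {
		return "Scheduling the next " + action_ + " in: " + std::to_string(interval_)
			   + " seconds";
	}

private:
	std::string action_;
	int interval_;
};
```

Two instances, one constructed with "deployment check" and one with "inventory submission", would then produce distinct log lines while still sharing the class.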

lluiscampos · 2025-01-10T12:55:38Z

src/mender-update/daemon/states.cpp

-		http::ExponentialBackoff(chrono::seconds(retry_interval_seconds), retry_count),
-		event_loop} {
+PollForDeploymentState::PollForDeploymentState(int retry_interval_seconds, int retry_count) :
+	backoff_ {chrono::seconds(retry_interval_seconds), retry_count} {
 }

 void PollForDeploymentState::HandlePollingError(Context &ctx, sm::EventPoster<StateEvent> &poster) {


This HandlePollingError is still in two places, in here and in SubmitInventoryState.

We could move it by having the backoff objects also in the context. What do you think? ~~Again the main disadvantage would be the logging, which in this case is info and error level so it is more relevant to get the more context for the (human) user reading the logs.~~

This HandlePollingError is still in two places, in here and in SubmitInventoryState.

I think that's fine. These two cases can potentially differ even more in the future.
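For readers unfamiliar with the retry logic that each state keeps in its own `backoff_` member, here is a rough sketch of an exponential back-off with a retry limit. `SimpleBackoff` is hypothetical and is not the `http::ExponentialBackoff` class from the diff; the starting interval of one second, the doubling, and the meaning of the two constructor arguments are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical sketch; not the http::ExponentialBackoff class used in the PR.
class SimpleBackoff {
public:
	SimpleBackoff(std::chrono::seconds max_interval, int max_tries) :
		max_interval_ {max_interval},
		max_tries_ {max_tries} {
		assert(max_interval.count() > 0 && max_tries > 0);
	}

	// Writes the next wait into `out` and returns true, or returns false
	// once all tries are used up.
	bool NextInterval(std::chrono::seconds &out) {
		if (tries_ >= max_tries_) {
			return false;
		}
		++tries_;
		out = std::min(current_, max_interval_);
		std::chrono::seconds doubled = current_ * 2;
		current_ = std::min(doubled, max_interval_);
		return true;
	}

	// Called after a successful request so the next failure starts over.
	void Reset() {
		tries_ = 0;
		current_ = std::chrono::seconds {1};
	}

private:
	std::chrono::seconds max_interval_;
	int max_tries_;
	std::chrono::seconds current_ {1};
	int tries_ {0};
};
```

Keeping one such object per state, as agreed above, lets the polling and inventory error handlers diverge later without touching shared context.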

vpodzime · 2025-01-13T10:45:07Z

src/mender-update/daemon/states.cpp

-			}
-		});
-
-	DoSubmitInventory(ctx, poster);


Where is this happening now?

In OnEnter. The previous OnEnter only scheduled the next interval and called DoSubmitInventory.

I see. I was looking for this function call somewhere. 👍

danielskinstad

As was pointed out, having the action description as an arg and using that when logging would be nice; other than that it looks good to me 🚀

lluiscampos requested review from vpodzime, jo-lund and danielskinstad January 10, 2025 12:56

vpodzime reviewed Jan 13, 2025

danielskinstad approved these changes Jan 13, 2025
