Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MEN-7900: Fix Mender client getting stuck after failure in sync state #1724

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lluiscampos
Copy link
Contributor

@lluiscampos lluiscampos commented Jan 10, 2025

From the code point, the issue was that the re-scheduling of new polls for updates or inventory were done from the states after the sync state, so unless the state machine reached that point the new polls would not be scheduled.

Fix by creating two new states, that just do the re-scheduling, between idle and sync. Note that the timer(s) have now been moved to the context object so that it can be accessed from multiple states (namely, update polling and submit inventory states which would need to manipulate the timer for exponential back-off retries.

Changelog: Fix issue where any error in Sync state (triggered for example with an error in Sync_Enter or Sync_Leave state scripts) leaves the client stuck in idle state forever and no new polls for update nor submit of inventory would be attempted again.

TODO:

  • Modify the integration tests so that this bug would have been caught (check that i fails with master and passes with this PR)
  • Consider duplicating the classes (discussion below)
  • Consider refactor the backoff (discussion below)

From the code point, the issue was that the re-scheduling of new polls
for updates or inventory were done from the states _after_ the sync
state, so unless the state machine reached that point the new polls
would not be scheduled.

Fix by creating two new states, that just do the re-scheduling, between
idle and sync. Note that the timer(s) have now been moved to the context
object so that it can be accessed from multiple states (namely, update
polling and submit inventory states which would need to manipulate the
timer for exponential back-off retries.

Ticket: MEN-7900

Changelog: Fix issue where any error in Sync state (triggered for
example with an error in Sync_Enter or Sync_Leave state scripts) leaves
the client stuck in idle state forever and no new polls for update nor
submit of inventory would be attempted again.

Signed-off-by: Lluis Campos <[email protected]>
@mender-test-bot
Copy link

mender-test-bot commented Jan 10, 2025

Merging these commits will result in the following changelog entries:

Changelogs

mender (MEN-7900-client-stuck-after-sync-error)

New changes in mender since master:

Bug Fixes
  • Fix issue where any error in Sync state (triggered for
    example with an error in Sync_Enter or Sync_Leave state scripts) leaves
    the client stuck in idle state forever and no new polls for update nor
    submit of inventory would be attempted again.
    (MEN-7900)

@mender-test-bot
Copy link

@lluiscampos, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".


my commands and options

You can trigger a pipeline on multiple prs with:

  • mentioning me and start pipeline --pr mender/127 --pr mender-connect/255

You can start a fast pipeline, disabling full integration tests with:

  • mentioning me and start pipeline --fast

You can trigger GitHub->GitLab branch sync with:

  • mentioning me and sync

You can cherry pick to a given branch or branches with:

  • mentioning me and:
 cherry-pick to:
 * 1.0.x
 * 2.0.x

}

void ScheduleNextPollState::OnEnter(Context &ctx, sm::EventPoster<StateEvent> &poster) {
log::Debug("Scheduling the next check in: " + to_string(interval_) + " seconds");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see, I reused the same class ScheduleNextPollState for both poll for updates and submit inventory, by passing by reference which of the timers and interval to use.

The only disadvantage that I see is the log messages here being more generic: "next check" instead of "next inventory submission" for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add a const string &action argument to the constructor and use the action description here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you!

I actually thought about this just as I logged of on Friday... and forgot to comment to myself today 😅

This also partially responds to the other question, although there we might want to keep the retry contained in the class(es) than can actually use it instead of in the shared context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the action argument would be nice

http::ExponentialBackoff(chrono::seconds(retry_interval_seconds), retry_count),
event_loop} {
PollForDeploymentState::PollForDeploymentState(int retry_interval_seconds, int retry_count) :
backoff_ {chrono::seconds(retry_interval_seconds), retry_count} {
}

void PollForDeploymentState::HandlePollingError(Context &ctx, sm::EventPoster<StateEvent> &poster) {
Copy link
Contributor Author

@lluiscampos lluiscampos Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This HandlePollingError is still in two places, in here and in SubmitInventoryState.

We could move it by having the backoff objects also in the context. What do you think? Again the main disadvantage would be the logging, which in this case is info and error level so it is more relevant to get the more context for the (human) user reading the logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This HandlePollingError is still in two places, in here and in SubmitInventoryState.

I think that's fine. These two cases can potentially differ even more in the future.

}
});

DoSubmitInventory(ctx, poster);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this happening now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in OnEnter. The previous OnEnter only scheduled the next interval and called DoSubmitInventory

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I was looking for this function call somewhere. 👍

Copy link
Contributor

@danielskinstad danielskinstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As was pointed out, having the action description as an arg and using that when logging would be nice; other than that it looks good to me 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants