-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MEN-7900: Fix Mender client getting stuck after failure in sync state #1724
base: master
Are you sure you want to change the base?
MEN-7900: Fix Mender client getting stuck after failure in sync state #1724
Conversation
From the code point, the issue was that the re-scheduling of new polls for updates or inventory were done from the states _after_ the sync state, so unless the state machine reached that point the new polls would not be scheduled. Fix by creating two new states, that just do the re-scheduling, between idle and sync. Note that the timer(s) have now been moved to the context object so that it can be accessed from multiple states (namely, update polling and submit inventory states which would need to manipulate the timer for exponential back-off retries. Ticket: MEN-7900 Changelog: Fix issue where any error in Sync state (triggered for example with an error in Sync_Enter or Sync_Leave state scripts) leaves the client stuck in idle state forever and no new polls for update nor submit of inventory would be attempted again. Signed-off-by: Lluis Campos <[email protected]>
Merging these commits will result in the following changelog entries: Changelogsmender (MEN-7900-client-stuck-after-sync-error)New changes in mender since master: Bug Fixes
|
@lluiscampos, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline". my commands and optionsYou can trigger a pipeline on multiple prs with:
You can start a fast pipeline, disabling full integration tests with:
You can trigger GitHub->GitLab branch sync with:
You can cherry pick to a given branch or branches with:
|
} | ||
|
||
void ScheduleNextPollState::OnEnter(Context &ctx, sm::EventPoster<StateEvent> &poster) { | ||
log::Debug("Scheduling the next check in: " + to_string(interval_) + " seconds"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see, I reused the same class ScheduleNextPollState
for both poll for updates and submit inventory, by passing by reference which of the timers and interval to use.
The only disadvantage that I see is the log messages here being more generic: "next check" instead of "next inventory submission" for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add a const string &action
argument to the constructor and use the action description here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you!
I actually thought about this just as I logged of on Friday... and forgot to comment to myself today 😅
This also partially responds to the other question, although there we might want to keep the retry contained in the class(es) than can actually use it instead of in the shared context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the action argument would be nice
http::ExponentialBackoff(chrono::seconds(retry_interval_seconds), retry_count), | ||
event_loop} { | ||
PollForDeploymentState::PollForDeploymentState(int retry_interval_seconds, int retry_count) : | ||
backoff_ {chrono::seconds(retry_interval_seconds), retry_count} { | ||
} | ||
|
||
void PollForDeploymentState::HandlePollingError(Context &ctx, sm::EventPoster<StateEvent> &poster) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This HandlePollingError
is still in two places, in here and in SubmitInventoryState
.
We could move it by having the backoff
objects also in the context. What do you think? Again the main disadvantage would be the logging, which in this case is info and error level so it is more relevant to get the more context for the (human) user reading the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This
HandlePollingError
is still in two places, in here and inSubmitInventoryState
.
I think that's fine. These two cases can potentially differ even more in the future.
} | ||
}); | ||
|
||
DoSubmitInventory(ctx, poster); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this happening now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in OnEnter
. The previous OnEnter
only scheduled the next interval and called DoSubmitInventory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I was looking for this function call somewhere. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As was pointed out, having the action description as an arg and using that when logging would be nice; other than that it looks good to me 🚀
From the code point, the issue was that the re-scheduling of new polls for updates or inventory were done from the states after the sync state, so unless the state machine reached that point the new polls would not be scheduled.
Fix by creating two new states, that just do the re-scheduling, between idle and sync. Note that the timer(s) have now been moved to the context object so that it can be accessed from multiple states (namely, update polling and submit inventory states which would need to manipulate the timer for exponential back-off retries.
Changelog: Fix issue where any error in Sync state (triggered for example with an error in Sync_Enter or Sync_Leave state scripts) leaves the client stuck in idle state forever and no new polls for update nor submit of inventory would be attempted again.
TODO: