-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sync only the target filesystem at the end of mover tasks #1466
Conversation
Hi @evandeaubl. Thanks for your PR. I'm waiting for a backube member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@evandeaubl thanks for this great analysis and your PR! I do feel like we've heard about the occasional problem with running sync, hope this will address most of them. I agree it makes sense to only run the sync against the mounted volume. I've just pushed a small change with updates to some of the scripts. |
@tesshuflower Thanks for the script cleanup, and catching that one Also, I've been testing this change over the past few days on my homelab cluster doing some pretty large transfers under the same conditions as described. The change has been working great, and hasn't caused any regressions that I've observed. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1466 +/- ##
=====================================
Coverage 66.2% 66.2%
=====================================
Files 57 57
Lines 7474 7474
=====================================
Hits 4950 4950
Misses 2236 2236
Partials 288 288 |
@evandeaubl Sorry for the delay on this one - there were some issues with automated tests that were breaking all the e2e tests from github actions. Unfortunately this means we'll have to rebase this PR to pickup the latest updates in main. I can do this, but would require force-pushing and that will re-write your history in your local fork/main. You can either rebase yourself, or let me know if you don't mind and I can do it. Alternatively I can also pull in your changes into a new PR if you prefer - just let me know! |
@tesshuflower No problem with you rebasing it. It doesn't disrupt anything on my end, and I don't think I'd have spare cycles to look at this until towards the end of the week. Thank you! |
Signed-off-by: Evan Deaubl <[email protected]>
Signed-off-by: Tesshu Flower <[email protected]>
Thanks - done! |
Quality Gate passedIssues Measures |
/ok-to-test |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: evandeaubl, tesshuflower The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Describe what this PR does
This PR fixes a bug I observed in a cluster where I had many rsync-tls privileged volsync jobs running against Longhorn volumes, and at least one was doing a large transfer. After a short time, the load average on the node with the large transfer destination mover would shoot really high (around 15), and the node started having problems at the kubelet level (pods failing to start with CreateContainerError was the easiest to observe condition)
Digging deeper, I saw that the other destination mover pods on that node (not related to the large transfer) were hanging at the ends of their transfers, when they should have been exiting and recycled by the operator.
Comparing the logs to the script showed that the only thing that could have been hanging was the
sync
command at the end of the script. Execing into the container and running a manual sync hanged in the same way, confirming that was the issue. If I instead limited thesync
to just the target filesystem usingsync -f /data
, the hang did not occur. Building a custom rsync-tls mover with thesync
command limited to the one filesystem resolved this issue.My understanding of how sync works in containers makes me think the general
sync
was trying to sync much more than just what was in the container itself, and combined with an active large transfer was causing things to bind up.Is there anything that requires special attention?
I am only running rsync-tls at the moment, but there were
sync
commands in the rclone and rsync-ssh movers that could exhibit the same problem that I fixed as well, but did not test.Related issues:
None