Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync only the target filesystem at the end of mover tasks #1466

Merged
merged 2 commits into from
Dec 11, 2024

Conversation

evandeaubl
Copy link
Contributor

Describe what this PR does
This PR fixes a bug I observed in a cluster where I had many rsync-tls privileged volsync jobs running against Longhorn volumes, and at least one was doing a large transfer. After a short time, the load average on the node with the large transfer destination mover would shoot really high (around 15), and the node started having problems at the kubelet level (pods failing to start with CreateContainerError was the easiest to observe condition)

Digging deeper, I saw that the other destination mover pods on that node (not related to the large transfer) were hanging at the ends of their transfers, when they should have been exiting and recycled by the operator.

Comparing the logs to the script showed that the only thing that could have been hanging was the sync command at the end of the script. Execing into the container and running a manual sync hanged in the same way, confirming that was the issue. If I instead limited the sync to just the target filesystem using sync -f /data, the hang did not occur. Building a custom rsync-tls mover with the sync command limited to the one filesystem resolved this issue.

My understanding of how sync works in containers makes me think the general sync was trying to sync much more than just what was in the container itself, and combined with an active large transfer was causing things to bind up.

Is there anything that requires special attention?
I am only running rsync-tls at the moment, but there were sync commands in the rclone and rsync-ssh movers that could exhibit the same problem that I fixed as well, but did not test.

Related issues:
None

Copy link
Contributor

openshift-ci bot commented Nov 23, 2024

Hi @evandeaubl. Thanks for your PR.

I'm waiting for a backube member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tesshuflower
Copy link
Contributor

@evandeaubl thanks for this great analysis and your PR! I do feel like we've heard about the occasional problem with running sync, hope this will address most of them. I agree it makes sense to only run the sync against the mounted volume.

I've just pushed a small change with updates to some of the scripts.

@evandeaubl
Copy link
Contributor Author

@tesshuflower Thanks for the script cleanup, and catching that one sync I missed. Easy to miss finding a sync in a project named volsync. 😄

Also, I've been testing this change over the past few days on my homelab cluster doing some pretty large transfers under the same conditions as described. The change has been working great, and hasn't caused any regressions that I've observed.

Copy link

codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.2%. Comparing base (0f8c557) to head (58fec6f).
Report is 3 commits behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #1466   +/-   ##
=====================================
  Coverage   66.2%   66.2%           
=====================================
  Files         57      57           
  Lines       7474    7474           
=====================================
  Hits        4950    4950           
  Misses      2236    2236           
  Partials     288     288           

@tesshuflower
Copy link
Contributor

@evandeaubl Sorry for the delay on this one - there were some issues with automated tests that were breaking all the e2e tests from github actions. Unfortunately this means we'll have to rebase this PR to pickup the latest updates in main.

I can do this, but would require force-pushing and that will re-write your history in your local fork/main. You can either rebase yourself, or let me know if you don't mind and I can do it. Alternatively I can also pull in your changes into a new PR if you prefer - just let me know!

@evandeaubl
Copy link
Contributor Author

@tesshuflower No problem with you rebasing it. It doesn't disrupt anything on my end, and I don't think I'd have spare cycles to look at this until towards the end of the week. Thank you!

@tesshuflower
Copy link
Contributor

@tesshuflower No problem with you rebasing it. It doesn't disrupt anything on my end, and I don't think I'd have spare cycles to look at this until towards the end of the week. Thank you!

Thanks - done!

Copy link

sonarqubecloud bot commented Dec 9, 2024

@tesshuflower
Copy link
Contributor

/ok-to-test

@tesshuflower
Copy link
Contributor

/approve
/lgtm

Copy link
Contributor

openshift-ci bot commented Dec 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evandeaubl, tesshuflower

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit e3e8da7 into backube:main Dec 11, 2024
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants