Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Problematic new solution] Fix race condition (also a regression of the PR 19139) #19256

Closed
wants to merge 1 commit into from

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented Jan 22, 2025

Fix #19172

Note this solution has a potential race condition, because the grandson routine (created in serve ) call wg.Add(1) async, so it may happen after wg.Wait().

Original solution: #19221

cc @serathius @fuweid

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

… the errc

Also added commment to clarify the etcd shutting down workflow

Signed-off-by: Benjamin Wang <[email protected]>
@ahrtr ahrtr force-pushed the race-20250117-new branch from 0517c82 to d12c2cc Compare January 22, 2025 13:11
Copy link

codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 74.07407% with 14 lines in your changes missing coverage. Please review.

Project coverage is 68.80%. Comparing base (c9045d6) to head (d12c2cc).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
server/embed/etcd.go 62.06% 11 Missing ⚠️
server/embed/serve.go 88.00% 1 Missing and 2 partials ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
server/embed/serve.go 59.52% <88.00%> (+1.72%) ⬆️
server/embed/etcd.go 75.51% <62.06%> (-0.36%) ⬇️

... and 12 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19256      +/-   ##
==========================================
- Coverage   68.85%   68.80%   -0.05%     
==========================================
  Files         420      420              
  Lines       35655    35675      +20     
==========================================
- Hits        24551    24547       -4     
- Misses       9683     9705      +22     
- Partials     1421     1423       +2     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9045d6...d12c2cc. Read the comment docs.

@k8s-ci-robot
Copy link

@ahrtr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-integration-1-cpu-amd64 d12c2cc link true /test pull-etcd-integration-1-cpu-amd64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@serathius
Copy link
Member

Note this solution has a potential race condition, because the grandson routine (created in serve ) call wg.Add(1) async, so it may happen after wg.Wait().

But serve is also called in errHandler, so Wait will not exit until the serve exits. I think this pattern is composable.

@ahrtr ahrtr changed the title [New solution] Fix race condition (also a regression of the PR 19139) [Problematic new solution] Fix race condition (also a regression of the PR 19139) Jan 22, 2025
@ahrtr
Copy link
Member Author

ahrtr commented Jan 22, 2025

But serve is also called in errHandler, so Wait will not exit until the serve exits.

It should be not correct. sync.WaitGroup.Add(1) is only safe iff it's guaranteed to be called before sync.WaitGroup.Wait(). Once it's async, then it isn't guaranteed any more.

We have two correct solutions:

@serathius
Copy link
Member

Not sure I understand, are you saying that the following code is not concurrency safe?

https://go.dev/play/p/-y3KuHTE2WU

@ahrtr
Copy link
Member Author

ahrtr commented Jan 22, 2025

Not sure I understand, are you saying that the following code is not concurrency safe?

https://go.dev/play/p/-y3KuHTE2WU

You are right. It's safe both in your example and this PR. Calling WaitGroup.Add(1) is safe as long as the counter is greater than 0 when it being called.

https://github.com/golang/go/blob/194de8fbfaf4c3ed54e1a3c1b14fc67a830b8d95/src/sync/waitgroup.go#L36-L39

But I still prefer #19257 to this one.

@k8s-ci-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Race condition when closing the embedded etcd
3 participants