Fix race condition (also a regression of the PR 19139) #19221

Merged
merged 3 commits into from
Jan 22, 2025

Member

@ahrtr ahrtr commented Jan 17, 2025

Fix #19172

Please review this PR commit by commit.

Two high-level thoughts:

  • There are multiple levels of goroutines. The grandparent (StartEtcd) creates multiple child goroutines (client listeners, peer listeners, and metrics listeners). The client listeners create some grandson goroutines (see the first commit). Each goroutine should manage only its immediate children.
  • For sync.WaitGroup, we should always call wg.Add and wg.Wait in the same goroutine.
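To illustrate the two points above, here is a minimal sketch, not the code in this PR. It assumes a simplified model in which each level owns a separate sync.WaitGroup for its immediate children. The names `startAll` and `runListener` are hypothetical, and the real etcd code may track goroutines differently.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runListener stands in for a child goroutine (for example a client listener).
// It starts its own workers (the "grandson" goroutines) and is the only
// goroutine that calls Add and Wait on its WaitGroup.
func runListener(name string, workers int, out chan<- string) {
	var wg sync.WaitGroup // owned by this goroutine only
	for i := 0; i < workers; i++ {
		wg.Add(1) // Add happens in the same goroutine that later calls Wait
		go func(id int) {
			defer wg.Done()
			out <- fmt.Sprintf("%s/worker-%d", name, id)
		}(i)
	}
	wg.Wait() // returns only after every grandson has exited
}

// startAll stands in for the grandparent. It tracks only the listeners it
// started; it never touches the WaitGroup that belongs to a listener.
func startAll(listeners []string, workersPer int) []string {
	// Buffered so that no worker blocks on send in this sketch.
	out := make(chan string, len(listeners)*workersPer)
	var wg sync.WaitGroup
	for _, name := range listeners {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			runListener(n, workersPer, out)
		}(name)
	}
	wg.Wait()
	close(out) // safe: every sender has exited by now

	var got []string
	for s := range out {
		got = append(got, s)
	}
	sort.Strings(got)
	return got
}

func main() {
	for _, s := range startAll([]string{"client", "peer"}, 2) {
		fmt.Println(s)
	}
}
```

Because every Add call is made by the goroutine that later calls Wait, no Add can race with a Wait that is already in progress.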

cc @serathius @fuweid @ivanvc @jmhbnz @joshuazh-x

codecov bot commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 77.27273% with 5 lines in your changes missing coverage. Please review.

Project coverage is 68.81%. Comparing base (0dcd015) to head (201568a).
Report is 17 commits behind head on main.

Files with missing lines    Patch %    Lines
server/embed/serve.go       81.25%     1 Missing and 2 partials ⚠️
server/embed/etcd.go        66.66%     2 Missing ⚠️

Additional details and impacted files

Files with missing lines    Coverage Δ
server/embed/etcd.go        76.19% <66.66%> (+0.33%) ⬆️
server/embed/serve.go       59.38% <81.25%> (+1.58%) ⬆️

... and 21 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19221      +/-   ##
==========================================
- Coverage   68.82%   68.81%   -0.02%     
==========================================
  Files         420      420              
  Lines       35649    35664      +15     
==========================================
+ Hits        24536    24541       +5     
- Misses       9692     9697       +5     
- Partials     1421     1426       +5     

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ahrtr ahrtr force-pushed the race-20250117 branch 2 times, most recently from c76bbeb to b1e5ebc Compare January 17, 2025 19:29
Member Author

ahrtr commented Jan 17, 2025

@fuweid @ivanvc @jmhbnz @serathius

This PR fixes a regression caused by #19139, so let's get it merged and backported to 3.5, and probably 3.4. We need to get it included in 3.5.18.

Member Author

ahrtr commented Jan 17, 2025

/test pull-etcd-integration-1-cpu-arm64

server/embed/etcd.go (outdated, resolved)
@serathius
Member

This is hard to review without loading a lot of context, and it's not the first time we have had problems with shutdown. I think the problem is the lack of a high-level vision of the server's shutdown protocol: what subroutines should do to follow it, and why everything works together.

@ahrtr could you add a comment describing the shutdown protocol you have in mind? It should make it easier to review and be useful for the future.

Member Author

ahrtr commented Jan 20, 2025

@ahrtr could you add a comment describing the shutdown protocol you have in mind? It should make it easier to review and be useful for the future.

Done. Please see the last commit. cc @fuweid @ivanvc @jmhbnz @serathius

server/embed/etcd.go (outdated, resolved)
Member

@fuweid fuweid left a comment

LGTM

Member Author

ahrtr commented Jan 22, 2025

cc @serathius do you have any further comment?

There are three commits in this PR. Note that the third commit only adds comments. The first and second commits are straightforward; please see the description of this PR.

// after all these sub goroutines exit (checked via `wg`). Writers
// should avoid writing after `stopc` is closed by selecting on
// reading from `stopc`.
errc chan error
Member

@serathius serathius Jan 22, 2025

Using wg to drain errc is a new concept introduced in the previous PR. It looks like we previously depended on correctly predicting the needed capacity:

e.errc = make(chan error, len(e.Peers)+len(e.Clients)+2*len(e.sctxs))

The issues possibly stem from this logic being outdated; I think we can now have as many as 4 writers to errc per sctx.

If the wg setup is correct, then we should be safe setting the capacity of the errc channel to 1.
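The capacity argument can be sketched as follows. This is a minimal illustration rather than etcd's actual code: the `server` type, `report`, and `runAndStop` are hypothetical names, and the shutdown order (close stopc, wait on wg, then close errc) is an assumption based on the code comment excerpt quoted above. It starts more writers than errc can buffer and still shuts down without blocking.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a hypothetical stand-in with the three fields discussed here.
type server struct {
	wg    sync.WaitGroup
	stopc chan struct{}
	errc  chan error
}

// report sends err to errc unless shutdown has started. Selecting on stopc
// means a writer can never block forever, whatever the capacity of errc is.
func (s *server) report(err error) {
	select {
	case <-s.stopc:
		return // shutdown already started: do not write
	default:
	}
	select {
	case <-s.stopc: // shutdown started while waiting to send
	case s.errc <- err:
	}
}

// runAndStop starts `writers` goroutines that all report an error, reads the
// first error, and then shuts down. It returns that first error and the
// number of values left in errc after shutdown.
func runAndStop(writers int) (first error, leftover int) {
	s := &server{
		stopc: make(chan struct{}),
		errc:  make(chan error, 1), // capacity 1, fewer than the writers
	}
	for i := 0; i < writers; i++ {
		s.wg.Add(1)
		go func(id int) {
			defer s.wg.Done()
			s.report(fmt.Errorf("writer %d failed", id))
		}(i)
	}

	first = <-s.errc // some writer gets through while stopc is still open
	close(s.stopc)   // unblocks every writer still waiting to send
	s.wg.Wait()      // all writers have exited
	close(s.errc)    // safe: no writer can send anymore

	for range s.errc {
		leftover++ // at most one value can remain in the buffer
	}
	return first, leftover
}

func main() {
	first, leftover := runAndStop(8)
	fmt.Println("first error:", first)
	fmt.Println("values left in errc:", leftover)
}
```

Which writer is read first depends on scheduling, so the printed error varies between runs. The point is only that shutdown completes even though errc holds a single value.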

Member Author

I admit that the calculation of the errc capacity might not be accurate, but in practice it's already good enough. Also it's a separate topic.

@@ -774,7 +805,9 @@ func (e *Etcd) serveClients() {

// start client servers in each goroutine
for _, sctx := range e.sctxs {
e.wg.Add(1)
Member

@serathius serathius Jan 22, 2025

I would recommend wrapping the logic into a function to schedule a subroutine:

func (e *Etcd) runSubroutine(subroutine func() error) {
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		err := subroutine()
		if err != nil {
			e.GetLogger().Error("setting up serving from embedded etcd failed.", zap.Error(err))
		}
		// Skip the send entirely if shutdown has already started.
		select {
		case <-e.stopc:
			return
		default:
		}
		// Otherwise report err, but give up if shutdown starts while blocked.
		select {
		case <-e.stopc:
		case e.errc <- err:
		}
	}()
}

Member

If you plan to backport the PR, this should make it much safer to replace the errHandler calls, as it would prevent us from missing any caller.

Member Author

On the contrary, I tend not to do the refactoring, because we need to backport the PR to 3.5.

  • From the top level, it's clear that StartEtcd only creates three kinds of subroutines that need to send errors back to errc, so I don't see that we miss any call for now. If you think we might miss one in the future when we (especially new contributors) make new changes, then let's discuss it separately and do any refactoring on main only.

    etcd/server/embed/etcd.go

    Lines 274 to 278 in c9045d6

    e.servePeers()
    e.serveClients()
    if err = e.serveMetrics(); err != nil {

  • Also note that, as mentioned in the description of this PR, errHandler is also used to track the grandson goroutines. Refactoring as you suggested may make it even harder to backport the PR.

Member Author

Please see #19256

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, serathius

Member Author

ahrtr commented Jan 22, 2025

@fuweid @serathius then let's merge this PR and backport it to 3.5. Afterwards, we can continue to do the refactoring in main only in #19257. We can have more discussion under that PR.

Please let me know your thoughts before I merge this PR. Thanks.

@serathius
Member

#19139 was backported to v3.4, so we also need to backport there.

Member Author

ahrtr commented Jan 22, 2025

#19139 Was backported to v3.4 so we also need to backport there.

#19139 was only partially backported to 3.4, and we have already confirmed that 3.4 doesn't have this regression because the code differs significantly. So this PR only needs to be backported to 3.5.

Member

fuweid commented Jan 22, 2025

let's merge this PR and backport it to 3.5. Afterwards, we can continue to do the refactoring in main only in #19257. We can have more discussion under that PR.

+1. Agree

@ahrtr ahrtr merged commit 43431bd into etcd-io:main Jan 22, 2025
35 checks passed
@ahrtr ahrtr deleted the race-20250117 branch January 22, 2025 14:45
Development

Successfully merging this pull request may close these issues.

Race condition when closing the embedded etcd
4 participants