Fix race condition (also a regression of the PR 19139) #19221

ahrtr · 2025-01-17T15:17:26Z

Fix #19172

Please review this PR commit by commit.

Two high-level thoughts:

There are multiple levels of goroutines. The grandparent (StartEtcd) creates multiple child goroutines (client listeners, peer listeners and metrics listeners). The client listeners create some grandson goroutines (see the first commit). Each goroutine should manage only its immediate children.
For sync.WaitGroup, we should always call wg.Add and wg.Wait in the same goroutine.
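
The two points above can be sketched in isolation. This is a minimal illustration, not etcd's code: `parent`, `child`, and the `finished` counter are hypothetical names. Each level owns a `sync.WaitGroup` for its immediate children only, and `wg.Add` is called from the same goroutine that later calls `wg.Wait`, before the new goroutine starts.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// finished counts completed grandson goroutines (illustrative only).
var finished atomic.Int32

// child starts its own goroutines and waits for them before returning,
// so the parent never needs to know that they exist.
func child(grandsons int) {
	var wg sync.WaitGroup
	for i := 0; i < grandsons; i++ {
		wg.Add(1) // Add runs in the goroutine that will call Wait.
		go func() {
			defer wg.Done()
			finished.Add(1)
		}()
	}
	wg.Wait()
}

// parent manages only its immediate children.
func parent(children, grandsons int) {
	var wg sync.WaitGroup
	for i := 0; i < children; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			child(grandsons)
		}()
	}
	wg.Wait() // Returns only after every child, and thus every grandson, is done.
}

func main() {
	parent(3, 2)
	fmt.Println("grandsons finished:", finished.Load())
}
```

Calling `wg.Add` inside the spawned goroutine instead would let `wg.Wait` observe a zero counter and return before the goroutine has registered itself.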

cc @serathius @fuweid @ivanvc @jmhbnz @joshuazh-x

codecov · 2025-01-17T16:01:10Z

Codecov Report

Attention: Patch coverage is 77.27273% with 5 lines in your changes missing coverage. Please review.

Project coverage is 68.81%. Comparing base (0dcd015) to head (201568a).
Report is 17 commits behind head on main.

Files with missing lines	Patch %	Lines
server/embed/serve.go	81.25%	1 Missing and 2 partials ⚠️
server/embed/etcd.go	66.66%	2 Missing ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
server/embed/etcd.go	`76.19% <66.66%> (+0.33%)`	⬆️
server/embed/serve.go	`59.38% <81.25%> (+1.58%)`	⬆️

... and 21 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19221      +/-   ##
==========================================
- Coverage   68.82%   68.81%   -0.02%     
==========================================
  Files         420      420              
  Lines       35649    35664      +15     
==========================================
+ Hits        24536    24541       +5     
- Misses       9692     9697       +5     
- Partials     1421     1426       +5

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0dcd015...201568a. Read the comment docs.

ahrtr · 2025-01-17T20:01:06Z

@fuweid @ivanvc @jmhbnz @serathius

This PR fixes a regression caused by #19139, so let's get it merged and backported to 3.5 and probably 3.4. We need to get it included in 3.5.18.

ahrtr · 2025-01-17T21:30:03Z

/test pull-etcd-integration-1-cpu-arm64

server/embed/etcd.go

serathius · 2025-01-20T17:54:49Z

This is hard to review without loading a lot of context, and it's not the first time we have had problems with shutdown. I think the problem is the lack of a high-level vision of the shutdown protocol for the server: what subroutines should do to follow it, and why everything works together.

@ahrtr could you add a comment describing the shutdown protocol you have in mind? It should make it easier to review and be useful for the future.

ahrtr · 2025-01-20T18:27:19Z

@ahrtr could you add a comment describing the shutdown protocol you have in mind? It should make it easier to review and be useful for the future.

Done. Please see the last commit. cc @fuweid @ivanvc @jmhbnz @serathius

fuweid

LGTM

ahrtr · 2025-01-22T10:08:59Z

cc @serathius do you have any further comment?

There are three commits in this PR. Note that the third commit only adds comments. The first and second commits are straightforward; please see the description of this PR.

serathius · 2025-01-22T10:21:23Z

server/embed/etcd.go

+	// after all these sub goroutines exit (checked via `wg`). Writers
+	// should avoid writing after `stopc` is closed by selecting on
+	// reading from `stopc`.
+	errc chan error


Using wg to drain errc is a new concept introduced in the previous PR. It looks like we previously depended on correctly predicting the needed capacity:

etcd/server/embed/etcd.go

Line 257 in c9045d6

e.errc = make(chan error, len(e.Peers)+len(e.Clients)+2*len(e.sctxs))

Possibly the issues stem from the fact that this logic got outdated; I think we can now have as many as 4 writers to errc per sctx.

If the wg setup is correct, then we should be safe setting the capacity of the errc channel to 1.

I admit that the calculation of the errc capacity might not be accurate, but in practice it's already good enough. Also it's a separate topic.

serathius · 2025-01-22T10:49:13Z

server/embed/etcd.go

@@ -774,7 +805,9 @@ func (e *Etcd) serveClients() {

 	// start client servers in each goroutine
 	for _, sctx := range e.sctxs {
+		e.wg.Add(1)


I would recommend wrapping the logic into a function to schedule a subroutine:

func (e *Etcd) runSubroutine(subroutine func() error) {
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		err := subroutine()
		if err != nil {
			e.GetLogger().Error("setting up serving from embedded etcd failed.", zap.Error(err))
		}
		select {
		case <-e.stopc:
			return
		default:
		}
		select {
		case <-e.stopc:
		case e.errc <- err:
		}
	}()
}

If you plan to backport the PR it should make it much safer to replace errHandler calls as it would prevent us from missing any caller.

On the contrary, I tend not to do the refactoring because we need to backport the PR to 3.5.

From the top level, it's clear that StartEtcd creates only three kinds of subroutines that need to send errors back to errc, so I don't see any call that we miss for now. If you think we might miss one in the future when we (especially new contributors) make new changes, then let's discuss it separately and do whatever refactoring is needed on main only.

etcd/server/embed/etcd.go

Lines 274 to 278 in c9045d6

e.servePeers()

e.serveClients()

if err = e.serveMetrics(); err != nil {

Also note that as mentioned in the description of this PR, the errHandler is also used to track the grandson goroutine. Making the refactoring as you suggested may make it even harder to backport the PR.

Please see #19256

k8s-ci-robot · 2025-01-22T13:38:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahrtr,serathius]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahrtr · 2025-01-22T14:31:06Z

@fuweid @serathius then let's merge this PR and backport it to 3.5. Afterwards, we can continue to do the refactoring in main only in #19257. We can have more discussion under that PR.

Please let me know your thoughts before I merge this PR. thx

serathius · 2025-01-22T14:34:39Z

#19139 Was backported to v3.4 so we also need to backport there.

ahrtr · 2025-01-22T14:39:18Z

#19139 Was backported to v3.4 so we also need to backport there.

#19139 was only partially backported to 3.4, and it has already been confirmed that 3.4 doesn't have this regression because the code differs significantly. So this PR only needs to be backported to 3.5.

fuweid · 2025-01-22T14:40:32Z

let's merge this PR and backport it to 3.5. Afterwards, we can continue to do the refactoring in main only in #19257. We can have more discussion under that PR.

+1. Agree

k8s-ci-robot added approved size/S labels Jan 17, 2025

ahrtr force-pushed the race-20250117 branch from 8c456ea to 0a5de14 Compare January 17, 2025 15:49

k8s-ci-robot added size/M and removed size/S labels Jan 17, 2025

ahrtr force-pushed the race-20250117 branch 2 times, most recently from c76bbeb to b1e5ebc Compare January 17, 2025 19:29

k8s-ci-robot added the area/testing label Jan 17, 2025

ahrtr mentioned this pull request Jan 18, 2025

Avoid racing in closing Etcd.errc channel #19205

Closed

ahrtr force-pushed the race-20250117 branch from 1dcdfc1 to 3fa96c8 Compare January 18, 2025 10:20

ahrtr marked this pull request as draft January 19, 2025 10:08

k8s-ci-robot added the do-not-merge/work-in-progress label Jan 19, 2025

ahrtr marked this pull request as ready for review January 19, 2025 10:27

k8s-ci-robot removed the do-not-merge/work-in-progress label Jan 19, 2025

ahrtr force-pushed the race-20250117 branch from 3fa96c8 to 86ce681 Compare January 19, 2025 10:27

ahrtr mentioned this pull request Jan 19, 2025

Fix race condition (also a regression of the PR 19139) #19231

Closed

fuweid reviewed Jan 19, 2025

ahrtr force-pushed the race-20250117 branch from 86ce681 to 32fb3ee Compare January 20, 2025 09:04

ahrtr added the backport/v3.5 label Jan 20, 2025

ahrtr requested review from ivanvc, serathius, fuweid and jmhbnz January 20, 2025 14:54

ahrtr mentioned this pull request Jan 20, 2025

Fix passing compaction-batch-limit to etcd v3.4 and v3.5 #19218

Merged

ahrtr force-pushed the race-20250117 branch from 32fb3ee to c238ce9 Compare January 20, 2025 18:25

ahrtr mentioned this pull request Jan 21, 2025

Elevate etcd team permissions for release window kubernetes/org#5360

Merged

serathius reviewed Jan 21, 2025

ahrtr added 2 commits January 21, 2025 10:52

Enhance method (*serveCtx) serve to wait for all goroutines to complete before it returns

3527b3b

Signed-off-by: Benjamin Wang <[email protected]>

Ensure all goroutines created by StartEtcd to complete before closing the errc

86a3170

Signed-off-by: Benjamin Wang <[email protected]>

ahrtr force-pushed the race-20250117 branch from f112964 to f604dd8 Compare January 21, 2025 10:59

k8s-ci-robot added size/L and removed size/M labels Jan 21, 2025

ahrtr force-pushed the race-20250117 branch from f604dd8 to 7489977 Compare January 21, 2025 11:57

serathius reviewed Jan 21, 2025

ahrtr force-pushed the race-20250117 branch from 7489977 to e89357b Compare January 21, 2025 12:46

serathius reviewed Jan 21, 2025

serathius reviewed Jan 21, 2025

add commment to clarify the etcd shutting down workflow

201568a

Signed-off-by: Benjamin Wang <[email protected]>

ahrtr force-pushed the race-20250117 branch from e89357b to 201568a Compare January 21, 2025 15:46

k8s-ci-robot added size/M and removed size/L labels Jan 21, 2025

fuweid approved these changes Jan 21, 2025

serathius reviewed Jan 22, 2025

ahrtr mentioned this pull request Jan 22, 2025

[Problematic new solution] Fix race condition (also a regression of the PR 19139) #19256

Open

serathius approved these changes Jan 22, 2025

ahrtr mentioned this pull request Jan 22, 2025

[New solution] Fix race condition (also a regression of the PR 19139) #19257

Open

ahrtr merged commit 43431bd into etcd-io:main Jan 22, 2025
35 checks passed

ahrtr deleted the race-20250117 branch January 22, 2025 14:45
