Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
138233: roachtest: Deflake multitenant upgrade test r=rimadeodhar a=rimadeodhar

The multitenant upgrade test enforces different test scenarios while upgrading tenants in a mixed version state. The test enforces the following cases:
1. Start storage cluster with binary version: x, cluster version: x
2. Create some tenants with binary version: x and ensure they can connect to the cluster and run a workload.
3. Using the mixed version test framework, upgrade the storage cluster with binary version: x+1, cluster version: x. In this mixed version state, create remaining tenants with binary version: x and run a workload.
4. Finalize the storage cluster. At this point, the storage cluster has binary version: x+1 and cluster version: x+1
5. Upgrade tenants with binary version: x+1 and confirm tenants can connect to the storage cluster and run a workload.

In #131847, the test was rewritten using the new mixed version test framework. However, this change exposed this test to a scenario that can cause this test to fail at step 3 above. The MVT framework also runs the mixed version test (i.e. with the tenant at the older binary version) when the cluster is in the finalizing stage. This scenario is run with a [prefixed probability](https://github.com/cockroachdb/cockroach/blob/b29d6f670b94492730eec233051cbe38a57b1ae9/pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go#L633). However, if we attempt to start the tenants with the previous version (i.e. the version the cluster is being upgraded from) when the cluster is being finalized, the tenants rightfully fail to connect which the test incorrectly interprets as a failure. As a result, we would see this test fail occassionally since the test was updated to use the new MVT. This PR modifies the test to ensure that in the finalizing state, we start the tenants with the right version.

Epic: none
Fixes: #136447
Release note: None

138977: roachtest: fix WaitForReady bugs r=kyle-a-wong a=kyle-a-wong

WaitForReady was not properly handling when the deadline of a timeout context was reached while waiting for the cluster to be ready. Previously, hitting the timeout would result in the roachtest running for 10 minutes until the whole test timed out. Now, WaitForReady respects when the timeutil.RunWithTimeout context times out.

Updated WaitForReady's checkReady func to use RoachtestHTTPClient's http client when making the call to GET /health?ready=1 instead of RoachtestHTTPClient's wrapper method. This wrapper method attempts to authenticate the user if no session id exists, but this endpoint doesn't require authentication so it doesnt need to use the wrapper method.

Fixes: #138143
Resolves: #136128
Epic: None
Release note: None

Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: Kyle Wong <[email protected]>
  • Loading branch information
3 people committed Jan 13, 2025
3 parents 304ccc7 + 2de665c + 0275f1f commit 2e25969
Show file tree
Hide file tree
Showing 2 changed files with 17 additions and 5 deletions.
6 changes: 3 additions & 3 deletions pkg/cmd/roachtest/roachtestutil/utils.go
Original file line number Diff line number Diff line change
Expand Up @@ -189,7 +189,7 @@ func WaitForReady(
) {
client := DefaultHTTPClient(c, t.L())
checkReady := func(ctx context.Context, url string) error {
resp, err := client.Get(ctx, url)
resp, err := client.client.Get(ctx, url)
if err != nil {
return err
}
Expand All @@ -212,13 +212,13 @@ func WaitForReady(
for i, adminAddr := range adminAddrs {
url := fmt.Sprintf(`https://%s/health?ready=1`, adminAddr)

for err := checkReady(ctx, url); err != nil; err = checkReady(ctx, url) {
for err := checkReady(ctx, url); err != nil && ctx.Err() == nil; err = checkReady(ctx, url) {
t.L().Printf("n%d not ready, retrying: %s", nodes[i], err)
time.Sleep(time.Second)
}
t.L().Printf("n%d is ready", nodes[i])
}
return nil
return ctx.Err()
},
))
}
Expand Down
16 changes: 14 additions & 2 deletions pkg/cmd/roachtest/tests/multitenant_upgrade.go
Original file line number Diff line number Diff line change
Expand Up @@ -221,8 +221,20 @@ func runMultitenantUpgrade(ctx context.Context, t test.Test, c cluster.Cluster)
func(ctx context.Context, l *logger.Logger, rng *rand.Rand, h *mixedversion.Helper) error {
for _, tenant := range tenants {
if !tenant.running {
if err := startTenant(ctx, l, tenant, h.Context().FromVersion, true); err != nil {
return err
if h.IsFinalizing() {
// If the upgrading service is finalizing, we need to stage the tenants with the upgraded
// binary to allow the tenant to start successfully.
if err := startTenant(ctx, l, tenant, h.Context().ToVersion, true); err != nil {
return err
}
} else {
// For all other upgrade stages, we can stage the tenant with the previous binary version i.e.
// the version from which the system tenant is being upgraded. This tests the scenario that
// in a mixed version state, tenants on the previous version can continue to connect
// to the cluster.
if err := startTenant(ctx, l, tenant, h.Context().FromVersion, true); err != nil {
return err
}
}
}
}
Expand Down

0 comments on commit 2e25969

Please sign in to comment.