Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: add metric tracking tables watched #119391

Conversation

rharding6373
Copy link
Collaborator

@rharding6373 rharding6373 commented Feb 20, 2024

This PR adds a new metric, changefeed.watched_targets, which tracks the number of unique tables watched by all changefeeds. If a table is tracked in multiple feeds, it is only counted once.

Epic: none
Fixes: #118379

Release note (enterprise change): Adds a new metric, changefeed.watched_targets, which tracks the number of unique tables watched by all changefeeds. Note that a table remains watched even if a changefeed is paused, and only becomes unwatched if the changefeed is altered to remove the table (the changefeed must be resumed), is canceled, or fails.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rharding6373 rharding6373 force-pushed the 20240205_metric_tables_watches_118379 branch 2 times, most recently from 576e86e to 90f9e7a Compare March 5, 2024 05:26
@rharding6373 rharding6373 force-pushed the 20240205_metric_tables_watches_118379 branch from 90f9e7a to fb01b11 Compare March 5, 2024 05:40
@rharding6373 rharding6373 marked this pull request as ready for review March 5, 2024 05:40
@rharding6373 rharding6373 requested a review from a team as a code owner March 5, 2024 05:40
@rharding6373 rharding6373 requested review from jayshrivastava and removed request for a team March 5, 2024 05:40
@rharding6373 rharding6373 force-pushed the 20240205_metric_tables_watches_118379 branch from fb01b11 to 23bd1be Compare March 5, 2024 21:18
Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @amruss, @jayshrivastava, @rharding6373, and @wenyihu6)


-- commits line 12 at r1:
nit: maybe we should say "number of unique tables"


pkg/ccl/changefeedccl/changefeed_stmt.go line 313 at r1 (raw file):

		logChangefeedCreateTelemetry(ctx, jr, changefeedStmt.Select != nil)
		err = func() error {

Maybe move this code into a named helper function? Especially given that it's more or less repeated verbatim below


pkg/ccl/changefeedccl/changefeed_stmt.go line 1497 at r1 (raw file):

	}

	return func() error {

nit: maybe put this in an if err := ...; err != nil { ... } and still have the return nil at the end of the function, would make it a bit easier to insert things after this function (if needed) in the future


pkg/ccl/changefeedccl/metrics.go line 165 at r1 (raw file):

		resolved    map[int64]hlc.Timestamp
		checkpoint  map[int64]hlc.Timestamp
		targets     map[int]int64

Could you add comments documenting what these fields store?


pkg/ccl/changefeedccl/metrics.go line 229 at r1 (raw file):

// WatchedTargets metrics if this is the first target with this id.
// Requires m.mu to be locked.
func (m *sliMetrics) incrTargetLocked(id int) {

nit: maybe rename to incTargetLocked to match the Gauge API


pkg/ccl/changefeedccl/metrics.go line 240 at r1 (raw file):

// WatchedTargets metric if this was the last target with this id.
// Requires m.mu to be locked.
func (m *sliMetrics) decrTargetLocked(id int) {

nit: maybe rename to decTargetLocked to match the Gauge API


pkg/ccl/changefeedccl/metrics.go line 886 at r1 (raw file):

	metaChangefeedWatchedTargets := metric.Metadata{
		Name:        "changefeed.watched_targets",
		Help:        "Targets watched by all feeds",

nit: maybe we should clarify that it's the number of "unique targets"


pkg/ccl/changefeedccl/helpers_test.go line 1379 at r1 (raw file):

}

func waitForChangefeedRunning(t *testing.T, sli *sliMetrics, targetCount int64) {

nit: maybe rename to something like waitForChangefeedRunningCount


pkg/ccl/changefeedccl/metrics_test.go line 23 at r1 (raw file):

)

func TestRunningAndTables(t *testing.T) {

nit: maybe rename to TestRunningCountAndWatchedTargets since those are the only two metrics being tested?


pkg/ccl/changefeedccl/metrics_test.go line 96 at r1 (raw file):

		require.EqualValues(t, 2, sliMetric.WatchedTargets.Value())

		// The running count should be decremented when the job is paused, but the

Should we maybe add an assertion for the running count being 0?

Copy link
Contributor

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for adding these metrics! They will be really useful for escalations.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @amruss, @jayshrivastava, and @rharding6373)


-- commits line 13 at r1:
nit: i think we only remove the watched targets if the job has failed or been canceled. Could you add a note here clarifying how they change and they won't be removed if they are paused?


pkg/ccl/changefeedccl/changefeed_stmt.go line 316 at r1 (raw file):

			metrics := p.ExecCfg().JobRegistry.MetricsStruct().Changefeed.(*Metrics)
			pl := sj.Payload()
			cf := pl.GetDetails().(*jobspb.Payload_Changefeed).Changefeed

It seems there's a potential for pl.GetDetails() to return nil, which could lead to a nil pointer dereference when accessing Changefeed here. But I notice that other parts of the code also directly access its fields, so maybe this is not possible.


pkg/ccl/changefeedccl/changefeed_stmt.go line 317 at r1 (raw file):

			pl := sj.Payload()
			cf := pl.GetDetails().(*jobspb.Payload_Changefeed).Changefeed
			scope := cf.Opts[changefeedbase.OptMetricsScope]

Do you know how this metrics would work if we add a metrics label for different changefeeds? Will sli.WatchedTargets still be correct?


pkg/ccl/changefeedccl/changefeed_stmt.go line 1156 at r1 (raw file):

		})
		sli.updateLastTargets(jobID, targetSet)
		return nil

nit: this part of code dealing with updating metrics for starting, resuming, and canceling jobs seems quite repetitive. Is there a straightforward way to modularize it?


pkg/ccl/changefeedccl/metrics.go line 197 at r1 (raw file):

}

func (m *sliMetrics) updateLastTargets(id jobspb.JobID, newTargets intsets.Fast) {

nit: could you add a comment for this function?


pkg/ccl/changefeedccl/metrics.go line 207 at r1 (raw file):

		}
		addedTargets = newTargets.Difference(oldTargets)
		removedTargets = oldTargets.Difference(newTargets)

For my education, what's the time efficiency to call Difference or Equals on intsets.Fast?


pkg/ccl/changefeedccl/metrics_test.go line 164 at r1 (raw file):

		sqlDB.Exec(t, `CREATE TABLE bar (a INT PRIMARY KEY, b STRING)`)
		require.NoError(t, testfeed.Resume())
		require.EqualValues(t, 2, sliMetric.WatchedTargets.Value())

Nice tests!

@rharding6373 rharding6373 force-pushed the 20240205_metric_tables_watches_118379 branch from 23bd1be to 9a3a1c3 Compare March 8, 2024 04:54
Copy link
Collaborator Author

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a major change to move this metric from sliMetrics into the main Metrics, since we want to capture the # of unique tables across all changefeeds regardless of the metric label the changefeed uses. The result is a tiny bit simpler. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @amruss, @andyyang890, @jayshrivastava, and @wenyihu6)


-- commits line 12 at r1:

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe we should say "number of unique tables"

Done


-- commits line 13 at r1:

Previously, wenyihu6 (Wenyi Hu) wrote…

nit: i think we only remove the watched targets if the job has failed or been canceled. Could you add a note here clarifying how they change and they won't be removed if they are paused?

Done


pkg/ccl/changefeedccl/changefeed_stmt.go line 313 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Maybe move this code into a named helper function? Especially given that it's more or less repeated verbatim below

Good suggestion. Done.


pkg/ccl/changefeedccl/changefeed_stmt.go line 316 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

It seems there's a potential for pl.GetDetails() to return nil, which could lead to a nil pointer dereference when accessing Changefeed here. But I notice that other parts of the code also directly access its fields, so maybe this is not possible.

The payload is always set when the job is adopted. I don't think it can be nil.


pkg/ccl/changefeedccl/changefeed_stmt.go line 317 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Do you know how this metrics would work if we add a metrics label for different changefeeds? Will sli.WatchedTargets still be correct?

This is a good point, and one thing I realized since writing this is that due to the way metrics labels work, we would double-count tables watched by changefeeds under different labels 😬 I made a major change to fix this and added some test coverage. PTAL


pkg/ccl/changefeedccl/changefeed_stmt.go line 1156 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

nit: this part of code dealing with updating metrics for starting, resuming, and canceling jobs seems quite repetitive. Is there a straightforward way to modularize it?

Done


pkg/ccl/changefeedccl/changefeed_stmt.go line 1497 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe put this in an if err := ...; err != nil { ... } and still have the return nil at the end of the function, would make it a bit easier to insert things after this function (if needed) in the future

With the new changes, this is no longer an issue.


pkg/ccl/changefeedccl/metrics.go line 165 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Could you add comments documenting what these fields store?

Done


pkg/ccl/changefeedccl/metrics.go line 197 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

nit: could you add a comment for this function?

Done


pkg/ccl/changefeedccl/metrics.go line 207 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

For my education, what's the time efficiency to call Difference or Equals on intsets.Fast?

Worst case implementation is a linked list of bitmaps, so Equals would be O(N). However, since table IDs are usually within a relatively small (100s) range, in practice I expect in most cases it would fit in a couple of bitmaps with only 1-3 linked list nodes.

Difference makes a copy of one of the sets first, but is also O(N).


pkg/ccl/changefeedccl/metrics.go line 229 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe rename to incTargetLocked to match the Gauge API

I made the name more similar (while still using locked naming convention). What do you think?


pkg/ccl/changefeedccl/metrics.go line 240 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe rename to decTargetLocked to match the Gauge API

see above


pkg/ccl/changefeedccl/metrics.go line 886 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe we should clarify that it's the number of "unique targets"

Done


pkg/ccl/changefeedccl/helpers_test.go line 1379 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe rename to something like waitForChangefeedRunningCount

I did some refactoring here and made this more generic. Wdyt?


pkg/ccl/changefeedccl/metrics_test.go line 23 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe rename to TestRunningCountAndWatchedTargets since those are the only two metrics being tested?

Done


pkg/ccl/changefeedccl/metrics_test.go line 96 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Should we maybe add an assertion for the running count being 0?

Good idea. Done.


pkg/ccl/changefeedccl/metrics_test.go line 164 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Nice tests!

Thanks!

This PR adds a new metric, changefeed.watched_targets, which tracks the
number of unique tables watched by all changefeeds. If a table is
tracked in multiple feeds, it is only counted once.

Epic: none
Fixes: cockroachdb#118379

Release note (enterprise change): Adds a new metric,
changefeed.watched_targets, which tracks the number of unique tables
watched by all changefeeds. Note that a table remains watched even if a
changefeed is paused, and only becomes unwatched if the changefeed is
altered to remove the table (the changefeed must be resumed), is
canceled, or fails.
@rharding6373 rharding6373 force-pushed the 20240205_metric_tables_watches_118379 branch from 9a3a1c3 to 46c2422 Compare March 8, 2024 18:54
Copy link
Contributor

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! Thanks for working on this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @amruss, @andyyang890, @rharding6373, and @wenyihu6)


pkg/ccl/changefeedccl/metrics.go line 1136 at r2 (raw file):

// WatchedTargets metrics if this is the first target with this id.
// Requires m.mu to be locked.
func (m *Metrics) incLocked(id int) {

It's probably better to name these with a specific name ex. incWatchedTargetsLocked.


pkg/ccl/changefeedccl/metrics.go line 1136 at r2 (raw file):

// WatchedTargets metrics if this is the first target with this id.
// Requires m.mu to be locked.
func (m *Metrics) incLocked(id int) {

nit: use targetID instead of id.

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small nits but otherwise :lgtm:

Reviewed 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @amruss, @rharding6373, and @wenyihu6)


pkg/ccl/changefeedccl/metrics.go line 165 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Done

Thanks for adding comments!


pkg/ccl/changefeedccl/metrics.go line 1083 at r2 (raw file):

		// targets maps table IDs to the number of occurrences of the table ID in
		// changefeeds.
		targets map[int]int64

nit: do we maybe want to call this something more descriptive liketargetCounts? if so, we can rename the functions below like incTargetCountLocked and decTargetCountLocked


pkg/ccl/changefeedccl/metrics.go line 1133 at r2 (raw file):

}

// incrTarget increments the count with the target ID and increments the

nit: function name in comment needs to be updated


pkg/ccl/changefeedccl/metrics.go line 1144 at r2 (raw file):

}

// decrTarget decrements the count with the target ID and decrements the

nit: function name in comment needs to be updated


pkg/ccl/changefeedccl/helpers_test.go line 1379 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

I did some refactoring here and made this more generic. Wdyt?

Nice!

Copy link
Contributor

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: mod some nit comments others already made

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @amruss and @rharding6373)


pkg/ccl/changefeedccl/metrics.go line 207 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Worst case implementation is a linked list of bitmaps, so Equals would be O(N). However, since table IDs are usually within a relatively small (100s) range, in practice I expect in most cases it would fit in a couple of bitmaps with only 1-3 linked list nodes.

Difference makes a copy of one of the sets first, but is also O(N).

Thanks for the explanation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

changefeedccl: add metric tables_watched
5 participants