fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000) #39160

Open: 1 commit to merge into base branch 2.5

weiliu1031
Contributor

issue: #39001
pr: #39000
Background:
Segment Load Version: Each segment load request assigns a timestamp as its version. When multiple copies of a segment are loaded on different QueryNodes, the leader checker uses this version to identify the latest copy and updates the routing table in the leader view to point to it.

Delegator Router Version: When a delegator builds a route to a QueryNode that has loaded a segment, it also records the segment's version.

Router Table Update Logic: If the leader checker detects that the version of a segment in the routing table does not match the version in the worker, it updates the routing table to point to the QueryNode with the latest version. Additionally, it updates the segment's load version in the QueryNode during this process.
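
The version-based check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `leader_checker.go`: the types `SegmentCopy` and `RouteEntry` and the functions `latestCopy` and `needsSyncByVersion` are hypothetical names introduced only for this example.

```go
package main

// SegmentCopy is a hypothetical record of one loaded copy of a segment.
type SegmentCopy struct {
	NodeID  int64 // QueryNode that holds this copy
	Version int64 // timestamp assigned by the load request
}

// RouteEntry is a hypothetical routing-table entry kept by a delegator.
type RouteEntry struct {
	NodeID  int64 // QueryNode the route points to
	Version int64 // segment version recorded when the route was built
}

// latestCopy returns the copy with the highest load version.
// The second result is false when no copy exists.
func latestCopy(copies []SegmentCopy) (SegmentCopy, bool) {
	if len(copies) == 0 {
		return SegmentCopy{}, false
	}
	latest := copies[0]
	for _, c := range copies[1:] {
		if c.Version > latest.Version {
			latest = c
		}
	}
	return latest, true
}

// needsSyncByVersion reports whether the version recorded in the route
// differs from the version of the latest copy on the workers.
func needsSyncByVersion(route RouteEntry, copies []SegmentCopy) bool {
	latest, ok := latestCopy(copies)
	if !ok {
		return false
	}
	return route.Version != latest.Version
}
```

Under this rule, any change to a segment's load version makes every route that recorded the previous version look stale, even when the route already points at the right QueryNode.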

Issue:
When a channel is undergoing load balancing, the leader checker may sync the routing table to a new delegator. This sync operation modifies the segment's load version, which invalidates the routing in the old delegator. Subsequently, the leader checker updates the routing table in the old delegator, breaking the routing in the new delegator. This cycle continues, causing repeated updates and inconsistencies.
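
A toy simulation may make the cycle easier to see. It assumes one segment on one QueryNode, two delegators (old and new) for the same channel, and a sync that rewrites the load version on every run, as described above. The types `worker` and `delegator`, the function `countSyncTasks`, and the starting version 100 are all hypothetical.

```go
package main

// worker models a QueryNode holding one segment; delegator models the
// route version one delegator recorded for that segment. Both are
// hypothetical stand-ins, not Milvus types.
type worker struct{ loadVersion int64 }
type delegator struct{ routeVersion int64 }

// countSyncTasks runs the version-based check for the given number of
// checker rounds and returns how many sync tasks were generated.
func countSyncTasks(rounds int) int {
	w := &worker{loadVersion: 100}
	oldD := &delegator{routeVersion: 100} // already in sync with the worker
	newD := &delegator{routeVersion: 0}   // created by load balancing, no route yet
	clock := int64(100)
	tasks := 0
	for i := 0; i < rounds; i++ {
		for _, d := range []*delegator{newD, oldD} {
			if d.routeVersion == w.loadVersion {
				continue // route looks up to date
			}
			// The sync assigns a fresh load version to the worker, which
			// makes the other delegator's recorded version stale.
			clock++
			w.loadVersion = clock
			d.routeVersion = clock
			tasks++
		}
	}
	return tasks
}
```

In this model every round produces two tasks, so `countSyncTasks(5)` returns 10 and the count keeps growing with the number of rounds instead of settling.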

Fix:
This PR introduces two changes to address the issue:

  1. Use NodeID to verify whether the delegator's routing table needs an update, avoiding unnecessary modifications.
  2. Ensure compatibility by recording the load version of the latest segment copy as the version in the routing table.

These changes resolve the cyclic updates and prevent the leader checker from generating excessive duplicate tasks, ensuring routing stability across delegators during load balancing.
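
As a rough sketch of why a NodeID comparison breaks the cycle, the toy model below has two delegators routing to the same QueryNode. All names (`SegmentCopy`, `RouteEntry`, `needsSyncByNode`, `countSyncTasksByNode`, node ID 7, the `-1` sentinel) are hypothetical. The sketch deliberately keeps a version bump on every sync to show that the NodeID check alone is enough to converge; whether the actual patch still rewrites the load version is not stated in this description.

```go
package main

// SegmentCopy and RouteEntry are hypothetical stand-ins for a loaded
// segment copy and a delegator's routing-table entry.
type SegmentCopy struct {
	NodeID  int64
	Version int64
}

type RouteEntry struct {
	NodeID  int64
	Version int64
}

// needsSyncByNode reports whether the route points at a different
// QueryNode than the latest copy. Version differences are ignored.
func needsSyncByNode(route RouteEntry, latest SegmentCopy) bool {
	return route.NodeID != latest.NodeID
}

// countSyncTasksByNode runs the NodeID-based check for the given number
// of checker rounds and returns how many sync tasks were generated.
func countSyncTasksByNode(rounds int) int {
	latest := SegmentCopy{NodeID: 7, Version: 100}
	oldRoute := RouteEntry{NodeID: 7, Version: 100}
	newRoute := RouteEntry{NodeID: -1} // -1: no route yet (hypothetical sentinel)
	tasks := 0
	for i := 0; i < rounds; i++ {
		for _, r := range []*RouteEntry{&newRoute, &oldRoute} {
			if !needsSyncByNode(*r, latest) {
				continue
			}
			// Assumption: the sync still changes the load version.
			latest.Version++
			// Record the latest copy's node and load version in the route.
			*r = RouteEntry{NodeID: latest.NodeID, Version: latest.Version}
			tasks++
		}
	}
	return tasks
}
```

Here only the new delegator is ever synced: the old delegator's recorded version becomes stale, but its route already points at node 7, so no further task is generated.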

Signed-off-by: Wei Liu <[email protected]>
@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weiliu1031
To complete the pull request process, please assign liliu-z after the PR has been reviewed.
You can assign the PR to them by writing /assign @liliu-z in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot added the size/S Denotes a PR that changes 10-29 lines. label Jan 10, 2025
@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Jan 10, 2025
Contributor

mergify bot commented Jan 10, 2025

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

codecov bot commented Jan 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.03%. Comparing base (6b127d4) to head (8e5dc8a).
Report is 1 commits behind head on 2.5.

Additional details and impacted files

@@             Coverage Diff             @@
##              2.5   #39160       +/-   ##
===========================================
+ Coverage   69.54%   81.03%   +11.49%     
===========================================
  Files         294     1389     +1095     
  Lines       26462   196351   +169889     
===========================================
+ Hits        18403   159122   +140719     
- Misses       8059    31618    +23559     
- Partials        0     5611     +5611     

Components    Coverage Δ
Client        78.26% <ø> (∅)
Core          69.55% <ø> (+<0.01%) ⬆️
Go            82.99% <100.00%> (∅)

Files with missing lines                            Coverage Δ
internal/querycoordv2/checkers/leader_checker.go    96.91% <100.00%> (ø)

... and 1095 files with indirect coverage changes
