feat: support query aggregation(#36380) #39177
base: master
Conversation
Signed-off-by: MrPresent-Han <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: MrPresent-Han. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Review comments, round 1
FieldID() int64
OriginalName() string
}
OriginalName is the user's output field name. Not every field that needs to be bucketed is also returned: in `select a, sum(c) from collection group by a, b`, the original output fields are `a, sum(c)`, but for correct reduction the proxy must receive three bucketed columns `a, b, sum(c)`, in that order. We keep the original name so we can finally project `a, sum(c)` out of the reduced result, as sketched below.
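A minimal, hypothetical Go sketch of that projection step; the type and function names are illustrative, not the actual Milvus API:

```go
package main

import "fmt"

// AggField mirrors the interface fragment above: each bucketed column
// carries its field ID and the user-facing output name.
type AggField interface {
	FieldID() int64
	OriginalName() string
}

type column struct {
	id   int64
	name string
	vals []int64
}

func (c column) FieldID() int64       { return c.id }
func (c column) OriginalName() string { return c.name }

// projectOriginal keeps only the columns whose OriginalName appears in
// the user's output field list, preserving the requested order.
func projectOriginal(reduced []column, outputFields []string) []column {
	byName := make(map[string]column, len(reduced))
	for _, c := range reduced {
		byName[c.OriginalName()] = c
	}
	out := make([]column, 0, len(outputFields))
	for _, name := range outputFields {
		if c, ok := byName[name]; ok {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// The proxy received three bucketed columns a, b, sum(c); the user
	// only asked for a and sum(c).
	reduced := []column{
		{1, "a", []int64{1, 2}},
		{2, "b", []int64{10, 20}}, // grouping column, not returned
		{3, "sum(c)", []int64{7, 9}},
	}
	for _, c := range projectOriginal(reduced, []string{"a", "sum(c)"}) {
		fmt.Println(c.name, c.vals)
	}
}
```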
return nil, fmt.Errorf("invalid Aggregation operator %d", pb.Op)
}
}
In the go-layer, the aggregation reduction is built from three components (see the sketch after this list):
- Bucket: holds all rows with identical hash values
- Row: holds all columns' values for one group-by line, e.g. `a_val, b_val, sum(c_val)`
- Entry: one column value in one row, e.g. one value for `a_val`
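A minimal Go sketch of these three components; the exact field layouts are assumptions for illustration:

```go
package agg

// Entry is one column value in one row, e.g. one value for a_val.
type Entry struct {
	Val any
}

// Row holds all columns' values for one group-by line,
// e.g. (a_val, b_val, sum(c_val)).
type Row struct {
	Entries []Entry
}

// Bucket collects all rows that share one hash value; colliding rows
// with different group keys can coexist here and are disambiguated
// by comparing keys row by row.
type Bucket struct {
	Hash uint64
	Rows []Row
}
```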
target.val = new.val
return nil
}
// ensure the value type outside
Inside the segcore executing framework, sum and count are both int64, so there is no type risk here.
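A hedged Go sketch of why the assignment above needs no type checks; names are illustrative:

```go
package agg

type aggEntry struct {
	val int64 // a partial sum or a partial count coming from segcore
}

// overwrite mirrors the assignment in the diff: the caller has already
// ensured the value type, and both sum and count are plain int64, so
// no type switch or conversion is needed here.
func overwrite(target, incoming *aggEntry) {
	target.val = incoming.val
}
```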
}

const NONE int = -1
If hash keys collide, we have to iterate over all rows inside one bucket to check for a real match (see the sketch below).
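A sketch of that collision scan under an assumed data layout; only the `NONE` sentinel comes from the diff above:

```go
package agg

const NONE int = -1 // mirrors the sentinel in the diff above

type bucket struct {
	rows [][]int64 // each row stores the raw group-by key values
}

// findRow returns the index of the row whose key matches exactly, or
// NONE when the bucket holds only hash-colliding rows with other keys.
func (b *bucket) findRow(key []int64) int {
	for i, row := range b.rows {
		match := len(row) == len(key)
		for j := 0; match && j < len(key); j++ {
			match = row[j] == key[j]
		}
		if match {
			return i
		}
	}
	return NONE
}
```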
hasher hash.Hash64
buffer []byte
}
This buffer is used for hash computation and is fixed-size, with one buffer per column, so there is no memory risk here. A sketch follows.
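An illustrative Go sketch of the per-column hash buffer, using the standard library's hash/fnv as a stand-in for the actual hasher:

```go
package agg

import (
	"encoding/binary"
	"hash"
	"hash/fnv"
)

type columnHasher struct {
	hasher hash.Hash64
	buffer []byte // fixed-size scratch space, one buffer per column
}

func newColumnHasher() *columnHasher {
	return &columnHasher{hasher: fnv.New64a(), buffer: make([]byte, 8)}
}

// hashInt64 serializes one value into the reusable buffer, then feeds
// it to the hasher; the buffer never grows, hence no memory risk.
func (c *columnHasher) hashInt64(v int64) uint64 {
	binary.LittleEndian.PutUint64(c.buffer, uint64(v))
	c.hasher.Reset()
	c.hasher.Write(c.buffer)
	return c.hasher.Sum64()
}
```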
append(const ColumnVector& other) {
    values_->FillFieldData(other.GetRawData(), other.size());
}
In the iterative computing framework, result vectors are returned batch by batch, so each batch result must be appended onto the final returned result. This method involves a memory copy and may need to be optimized (see the sketch below).
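A conceptual Go rendering of that batch-append; the real code is the C++ `append` above, this only illustrates the copy semantics:

```go
package agg

type columnVector struct {
	values []int64
}

// appendBatch copies one batch's values onto the accumulated result,
// mirroring the FillFieldData copy in the C++ snippet above; this is
// the memory copy flagged as a candidate for optimization.
func (c *columnVector) appendBatch(other *columnVector) {
	c.values = append(c.values, other.values...)
}
```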
@@ -78,11 +80,21 @@ DriverFactory::CreateDriver(std::unique_ptr<DriverContext> ctx,
                   plannode)) {
    operators.push_back(std::make_unique<PhyVectorSearchNode>(
        id, ctx.get(), vectorsearchnode));
} else if (auto groupbynode =
               std::dynamic_pointer_cast<const plan::GroupByNode>(
We split groupby into separate search_group_by and query_group_by operators.
                   plannode)) {
    operators.push_back(
        std::make_unique<PhyGroupByNode>(id, ctx.get(), groupbynode));
        std::make_unique<PhyProjectNode>(id, ctx.get(), projectNode));
}
For query_group_by with a filter expression, the pipeline is:
agg_operator ---> project_operator ---> filterbits_operator ---> mvcc_operator
There are no changes to the existing framework or its existing operators; a sketch of the chain follows.
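A sketch of that operator chain; the `Operator` interface and the wiring are assumptions, only the order comes from the comment above:

```go
package pipeline

// Operator is a stand-in for the C++ Phy* operators.
type Operator interface {
	Name() string
}

type namedOp string

func (n namedOp) Name() string { return string(n) }

// queryGroupByPipeline lists the operators in the order given in the
// comment above; no existing operator is modified, the agg and project
// operators are simply chained in front of the existing ones.
func queryGroupByPipeline() []Operator {
	return []Operator{
		namedOp("agg_operator"),
		namedOp("project_operator"),
		namedOp("filterbits_operator"),
		namedOp("mvcc_operator"),
	}
}
```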
@@ -135,6 +147,17 @@ Driver::Run(std::shared_ptr<Driver> self) {
    }
}
Operators are initialized before the pipeline is launched; the agg_operator needs this, and it matches what Velox does (sketched below).
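A minimal sketch of the initialize-before-run pattern; the interface and method names are hypothetical:

```go
package pipeline

// initOperator is a hypothetical operator contract: Initialize runs
// once before any batch is pulled, as in Velox.
type initOperator interface {
	Initialize() error
	GetOutput() (batch []int64, finished bool)
}

// runDriver initializes every operator up front, then drives the last
// operator until it reports completion; the agg operator relies on
// this early Initialize call.
func runDriver(ops []initOperator) error {
	if len(ops) == 0 {
		return nil
	}
	for _, op := range ops {
		if err := op.Initialize(); err != nil {
			return err
		}
	}
	for {
		if _, finished := ops[len(ops)-1].GetOutput(); finished {
			return nil
		}
	}
}
```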
TargetBitmapView active_views(activeRows);
populateLookupRows(active_views, lookup.rows_);
}
Only kInsert is used, because bucket operations in the aggregation process never need to delete existing entries (see the sketch below).
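A sketch of that insert-only lookup; `kInsert` mirrors the operation named above, everything else is illustrative:

```go
package agg

type lookupOp int

const kInsert lookupOp = iota // the only operation aggregation needs

type groupTable struct {
	groups map[uint64][]int64 // hash -> accumulated aggregate values
}

// apply supports only kInsert: group entries are created or updated in
// place, and never deleted while the aggregation runs.
func (t *groupTable) apply(op lookupOp, h uint64, partial []int64) {
	if op != kInsert {
		panic("aggregation buckets use kInsert only")
	}
	if cur, ok := t.groups[h]; ok {
		for i := range cur {
			cur[i] += partial[i]
		}
		return
	}
	t.groups[h] = append([]int64(nil), partial...)
}
```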
related: #36380
Support the query aggregation feature for Milvus.
Functional features:
select a, b from collection group by a, b
select a, b, sum(c), count(d) from collection group by a, b
code changes: