
feat: support query aggregation (#36380) #39177

Open · wants to merge 1 commit into master

Conversation

MrPresent-Han
Contributor

related: #36380
This PR supports the query aggregation feature for Milvus.

Functional features:

  1. group by on multiple scalar fields, e.g. `select a, b from collection group by a, b`
  2. aggregation combined with group by, e.g. `select a, b, sum(c), count(d) from collection group by a, b`
  3. the original `count(*)` path is replaced by an aggregation with `count`
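The semantics of the grouped-aggregation feature above can be illustrated with a minimal, self-contained Go sketch (the types and function here are hypothetical, not milvus code) that computes the equivalent of `select a, b, sum(c), count(d) from collection group by a, b` over in-memory rows:

```go
package main

import "fmt"

// Row is a hypothetical flat record with two group-by fields and two
// aggregated fields.
type Row struct {
	A, B string
	C, D int64
}

// GroupKey mirrors "group by a, b".
type GroupKey struct{ A, B string }

// Agg accumulates sum(c) and count(d) for one group.
type Agg struct{ SumC, CountD int64 }

// aggregate computes "select a, b, sum(c), count(d) ... group by a, b".
func aggregate(rows []Row) map[GroupKey]Agg {
	out := make(map[GroupKey]Agg)
	for _, r := range rows {
		k := GroupKey{r.A, r.B}
		a := out[k]
		a.SumC += r.C
		a.CountD++
		out[k] = a
	}
	return out
}

func main() {
	rows := []Row{{"x", "y", 1, 0}, {"x", "y", 2, 0}, {"x", "z", 5, 0}}
	fmt.Println(aggregate(rows))
}
```

The real implementation hashes the group-by columns into buckets instead of using a Go map, but the input/output contract is the same.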

Code changes:

  1. added a Project operator to retrieve scalar field values in the execution framework
  2. implemented a PhyAggregation operator modeled on Velox, bucketing with SIMD operations

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MrPresent-Han
To complete the pull request process, please assign tedxu after the PR has been reviewed.
You can assign the PR to them by writing /assign @tedxu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify bot added the dco-passed (DCO check passed.) and kind/feature (Issues related to feature request from users) labels — Jan 12, 2025

codecov bot commented Jan 12, 2025

Codecov Report

Attention: Patch coverage is 72.33070% with 666 lines in your changes missing coverage. Please review.

Project coverage is 81.03%. Comparing base (a8a6564) to head (bd3f412).

Files with missing lines Patch % Lines
internal/agg/aggregate.go 63.16% 172 Missing and 21 partials ⚠️
internal/core/src/query/PlanProto.cpp 4.81% 79 Missing ⚠️
internal/core/src/segcore/SegmentGrowingImpl.cpp 0.00% 44 Missing ⚠️
internal/core/src/query/ExecPlanNodeVisitor.cpp 58.06% 39 Missing ⚠️
...rnal/core/src/segcore/ChunkedSegmentSealedImpl.cpp 0.00% 34 Missing ⚠️
internal/proxy/task_query.go 64.78% 21 Missing and 4 partials ⚠️
internal/core/src/exec/HashTable.h 67.64% 22 Missing ⚠️
internal/proxy/util.go 42.42% 17 Missing and 2 partials ⚠️
internal/core/src/common/FieldData.cpp 68.00% 16 Missing ⚠️
internal/util/typeutil/hash.go 50.00% 15 Missing ⚠️
... and 38 more
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #39177       +/-   ##
===========================================
+ Coverage   69.64%   81.03%   +11.39%     
===========================================
  Files         296     1426     +1130     
  Lines       26633   199546   +172913     
===========================================
+ Hits        18548   161703   +143155     
- Misses       8085    32196    +24111     
- Partials        0     5647     +5647     
Components Coverage Δ
Client 79.53% <ø> (∅)
Core 69.90% <76.23%> (+0.25%) ⬆️
Go 83.00% <63.62%> (∅)
Files with missing lines Coverage Δ
internal/core/src/common/FieldData.h 100.00% <ø> (ø)
internal/core/src/common/Types.h 33.80% <100.00%> (+4.72%) ⬆️
internal/core/src/exec/Driver.cpp 81.72% <100.00%> (+0.32%) ⬆️
internal/core/src/exec/Driver.h 50.00% <ø> (ø)
internal/core/src/exec/VectorHasher.h 100.00% <100.00%> (ø)
internal/core/src/exec/expression/Utils.h 96.82% <100.00%> (+0.33%) ⬆️
internal/core/src/exec/operator/MvccNode.cpp 100.00% <100.00%> (+6.25%) ⬆️
internal/core/src/exec/operator/Operator.cpp 100.00% <100.00%> (ø)
internal/core/src/exec/operator/Operator.h 71.79% <100.00%> (+6.08%) ⬆️
internal/core/src/exec/operator/ProjectNode.cpp 100.00% <100.00%> (ø)
... and 76 more

... and 1084 files with indirect coverage changes

@mergify mergify bot added the ci-passed label Jan 12, 2025
@MrPresent-Han (Contributor, Author) left a comment

review comments, round 1

FieldID() int64
OriginalName() string
}


OriginalName is the user's output field name. Not every field that needs to be bucketed also needs to be returned. For example, in 'select a, sum(c) from collection group by a, b', the original output fields are 'a, sum(c)', but the proxy must receive three bucketed columns 'a, b, sum(c)', in that order, for correct reduction. We therefore keep the original name so that 'a, sum(c)' can finally be projected out of the reduced result.
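The final projection step described above can be sketched in Go (names and shapes here are illustrative, not milvus's actual types): the reduce phase yields columns `[a, b, sum(c)]`, and the remembered original names select `[a, sum(c)]` for the user.

```go
package main

import "fmt"

// projectOutput keeps only the user-requested columns from the reduced
// result. For "select a, sum(c) ... group by a, b" the reduce step
// produces columns [a, b, sum(c)] in group-by order; the final step
// projects [a, sum(c)] using the remembered original output names.
func projectOutput(reducedNames []string, reduced [][]string, wanted []string) [][]string {
	// map each wanted name to its column index in the reduced result
	idx := make([]int, 0, len(wanted))
	for _, w := range wanted {
		for i, n := range reducedNames {
			if n == w {
				idx = append(idx, i)
				break
			}
		}
	}
	out := make([][]string, len(reduced))
	for r, row := range reduced {
		proj := make([]string, len(idx))
		for j, i := range idx {
			proj[j] = row[i]
		}
		out[r] = proj
	}
	return out
}

func main() {
	reduced := [][]string{{"a1", "b1", "10"}, {"a1", "b2", "7"}}
	fmt.Println(projectOutput([]string{"a", "b", "sum(c)"}, reduced, []string{"a", "sum(c)"}))
}
```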

return nil, fmt.Errorf("invalid Aggregation operator %d", pb.Op)
}
}


In the Go layer, the aggregation reduction uses three components:

  1. Bucket: all rows with identical hash values
  2. Row: all columns' values for one group-by line, e.g. a_val, b_val, sum(c_val)
  3. Entry: one column value in one row, e.g. one value of a_val
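A minimal Go sketch of the three components above and of merging two rows of the same group (the type names follow the comment; the actual structs in the PR differ):

```go
package main

import "fmt"

// Entry is one column value in one row.
type Entry struct{ Val int64 }

// AggRow is one group-by line, e.g. a_val, b_val, sum(c_val).
type AggRow struct{ Entries []Entry }

// Bucket holds all rows whose group-by keys share one hash value.
type Bucket struct {
	Hash uint64
	Rows []AggRow
}

// mergeSum merges a peer row into an existing row by summing the
// aggregate entry at position aggIdx; the group-by entries are
// identical by construction and stay untouched.
func mergeSum(dst *AggRow, src AggRow, aggIdx int) {
	dst.Entries[aggIdx].Val += src.Entries[aggIdx].Val
}

func main() {
	b := Bucket{Hash: 42, Rows: []AggRow{{Entries: []Entry{{1}, {2}, {10}}}}}
	mergeSum(&b.Rows[0], AggRow{Entries: []Entry{{1}, {2}, {5}}}, 2)
	fmt.Println(b.Rows[0].Entries[2].Val)
}
```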

target.val = new.val
return nil
}
// ensure the value type outside

Inside the segcore execution framework, sum and count are both int64, so there is no type risk here.

}

const NONE int = -1


On a hash-key collision, we have to iterate over all rows inside the bucket to check for a match.
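The probe-on-collision behavior can be sketched as follows (hypothetical types; the PR's hash table uses SIMD bucketing rather than a Go map): distinct keys may share a hash, so every row in the bucket is compared field by field before inserting a new row.

```go
package main

import "fmt"

// groupRow is a hypothetical group-by row kept inside a bucket.
type groupRow struct {
	keys []int64 // group-by column values
	sum  int64   // running aggregate
}

func equalKeys(a, b []int64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// upsert adds v into the row matching keys k in the bucket for hash h,
// iterating all rows in the bucket because hashes can collide.
func upsert(buckets map[uint64][]*groupRow, h uint64, k []int64, v int64) {
	for _, r := range buckets[h] {
		if equalKeys(r.keys, k) { // same hash: confirm the keys really match
			r.sum += v
			return
		}
	}
	buckets[h] = append(buckets[h], &groupRow{keys: append([]int64(nil), k...), sum: v})
}

func main() {
	buckets := map[uint64][]*groupRow{}
	// pretend both distinct keys hash to 7 to force a collision
	upsert(buckets, 7, []int64{1, 2}, 10)
	upsert(buckets, 7, []int64{3, 4}, 5)
	upsert(buckets, 7, []int64{1, 2}, 1)
	fmt.Println(len(buckets[7]), buckets[7][0].sum)
}
```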

hasher hash.Hash64
buffer []byte
}


This buffer is used for hash computation and has a fixed size, with one buffer per column, so there is no memory risk here.
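A sketch of the pattern: one `hash.Hash64` plus one fixed-size scratch buffer per column, reused for every value so no per-row allocation occurs (the constructor and method names here are hypothetical; the PR uses its own hasher type).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash"
	"hash/fnv"
)

// columnHasher holds one hasher and one fixed-size scratch buffer,
// reused across all values of a single column.
type columnHasher struct {
	hasher hash.Hash64
	buffer []byte // fixed size: 8 bytes fits one int64 value
}

func newColumnHasher() *columnHasher {
	return &columnHasher{hasher: fnv.New64a(), buffer: make([]byte, 8)}
}

// hashInt64 serializes v into the reused buffer and hashes it.
func (c *columnHasher) hashInt64(v int64) uint64 {
	binary.LittleEndian.PutUint64(c.buffer, uint64(v))
	c.hasher.Reset()
	c.hasher.Write(c.buffer)
	return c.hasher.Sum64()
}

func main() {
	h := newColumnHasher()
	fmt.Println(h.hashInt64(42) == h.hashInt64(42), h.hashInt64(1) == h.hashInt64(2))
}
```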

append(const ColumnVector& other) {
values_->FillFieldData(other.GetRawData(), other.size());
}


In the iterative computing framework, result vectors are returned batch by batch, so each batch has to be appended to the final returned result. This method involves a memory copy and may need to be optimized.
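The batch-append pattern, in a trivial Go sketch (illustrative only): each per-batch slice is copied onto the tail of the accumulated result, which is the copy cost flagged above.

```go
package main

import "fmt"

// appendBatch copies one batch of column values onto the tail of the
// final result, analogous to ColumnVector::append above; the copy is
// the cost noted as a future optimization target.
func appendBatch(final []int64, batch []int64) []int64 {
	return append(final, batch...)
}

func main() {
	var final []int64
	for _, batch := range [][]int64{{1, 2}, {3}, {4, 5}} {
		final = appendBatch(final, batch)
	}
	fmt.Println(final)
}
```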

@@ -78,11 +80,21 @@ DriverFactory::CreateDriver(std::unique_ptr<DriverContext> ctx,
plannode)) {
operators.push_back(std::make_unique<PhyVectorSearchNode>(
id, ctx.get(), vectorsearchnode));
} else if (auto groupbynode =
std::dynamic_pointer_cast<const plan::GroupByNode>(

We differentiate group-by into search_group_by and query_group_by operators.

plannode)) {
operators.push_back(
std::make_unique<PhyGroupByNode>(id, ctx.get(), groupbynode));
std::make_unique<PhyProjectNode>(id, ctx.get(), projectNode));
}

For query_group_by with a filter expression, the pipeline is:
agg_operator ---> project_operator ---> filterbits_operator ---> mvcc_operator
No changes to the framework's existing operators are required.

@@ -135,6 +147,17 @@ Driver::Run(std::shared_ptr<Driver> self) {
}
}


Operators are initialized before the pipeline is launched; this is needed by agg_operator and matches what Velox does.
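The init-before-run pattern, sketched in Go (interface and names are hypothetical, standing in for the C++ Driver changes): every operator's initialization completes before the batch loop starts, so an operator like aggregation can set up its hash table up front.

```go
package main

import "fmt"

// Operator is a hypothetical pipeline operator with an explicit
// initialization phase that runs before any batch is processed.
type Operator interface {
	Init() error
	Name() string
}

type aggOperator struct{ initialized bool }

func (a *aggOperator) Init() error  { a.initialized = true; return nil }
func (a *aggOperator) Name() string { return "agg" }

// runPipeline initializes every operator up front (the behavior the
// comment describes), then would loop over batches; here it only
// reports the initialization order.
func runPipeline(ops []Operator) ([]string, error) {
	order := make([]string, 0, len(ops))
	for _, op := range ops {
		if err := op.Init(); err != nil {
			return nil, err
		}
		order = append(order, op.Name())
	}
	return order, nil
}

func main() {
	order, _ := runPipeline([]Operator{&aggOperator{}})
	fmt.Println(order)
}
```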

TargetBitmapView active_views(activeRows);
populateLookupRows(active_views, lookup.rows_);
}


Only kInsert is used, since bucket operations in the aggregation process never need to delete existing entries.

Labels
area/compilation, area/internal-api, area/test, ci-passed, dco-passed (DCO check passed.), kind/feature (Issues related to feature request from users), sig/testing, size/XXL (Denotes a PR that changes 1000+ lines.), test/integration (integration test)
2 participants