-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vecstore: implement GetFullVectors for the persistent store #139050
base: master
Are you sure you want to change the base?
Conversation
d310aa6
to
324aebb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 1 of 8 files at r3, 1 of 10 files at r4, 9 of 9 files at r5, 8 of 8 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mw5h)
pkg/sql/vecindex/vecstore/persistent_store.go
line 30 at r6 (raw file):
codec keys.SQLCodec table catalog.TableDescriptor index catalog.Index
It doesn't have to happen here, but we'll need to be checking for staleness of the descriptor in each txn, similar to what we do for cached plans:
cockroach/pkg/sql/opt/metadata.go
Line 392 in 2729738
func (md *Metadata) CheckDependencies( |
Here's how I had the VectorManager
implementation fetch the descriptor in the draft:
cockroach/pkg/sql/vecindex/manager.go
Line 125 in e6b7dc6
tableDesc, err = txn.Descriptors().ByIDWithoutLeased(txn.KV()).Get().Table(ctx, tableID) |
Note that I have that TODO asking about whether we can use leased descriptors; I think we can and should for the sake of performance.
pkg/sql/vecindex/vecstore/persistent_store.go
line 69 at r6 (raw file):
func (ps *PersistentStore) Init() error { keycols := ps.index.CollectKeyColumnIDs()
[nit] Probably cleaner to do this:
// The indexed vector column is the last one in the index.
vectorColID := ps.index.GetKeyColumnID(ps.index.NumKeyColumns()-1)
ps.colIdxMap.Set(vectorColID, 0)
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 51 at r6 (raw file):
tableDesc := desctestutils.TestingGetPublicTableDescriptor(kvDB, codec, "defaultdb", "t") keycols := tableDesc.GetPrimaryIndex().CollectPrimaryStoredColumnIDs()
nit: for clarity, keyCols
-> storedCols
(or something similar)
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 54 at r6 (raw file):
var col catalog.Column var err error for colID, more := keycols.Next(0); more; colID, more = keycols.Next(colID) {
Instead of this loop, you could use catalog.MustFindColumnByName(tableDesc, "v")
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 69 at r6 (raw file):
KeyColumnNames: []string{col.GetName()}, KeyColumnDirections: []catenumpb.IndexColumn_Direction{catenumpb.IndexColumn_ASC}, StoreColumnIDs: []descpb.ColumnID{tableDesc.GetPrimaryIndex().GetKeyColumnID(0)},
Might not really matter for a fake index, but the primary key columns should live in KeySuffixColumnIDs
, and StoreColumnIDs
should never be set for vector indexes.
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 71 at r6 (raw file):
StoreColumnIDs: []descpb.ColumnID{tableDesc.GetPrimaryIndex().GetKeyColumnID(0)}, StoreColumnNames: []string{tableDesc.GetPrimaryIndex().GetKeyColumnName(0)}, Version: descpb.EmptyArraysInInvertedIndexesVersion,
LatestIndexDescriptorVersion
?
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 78 at r6 (raw file):
quantizer := quantize.NewUnQuantizer(2) store := NewPersistentStore(kvDB, quantizer, codec, tableDesc, index) pk1 := encoding.EncodeVarintAscending(encoding.EncodeUvarintAscending([]byte{}, 11), 0)
I think we use EncodeVarintAscending
for integer key columns, so we should use that. On the other hand, the family ID (which is the 0
, right?) should be encoded using MakeFamilyKey
(though it won't make a difference in this particular case).
pkg/sql/vecindex/vecstore/store_test.go
line 37 at r6 (raw file):
defer commitTransaction(ctx, t, store, txn) // Include primary keys that are cannot be found.
nit: are cannot be found
-> cannot be found
pkg/sql/vecindex/vecstore/persistent_txn.go
line 419 at r6 (raw file):
// that specify a partition ID are ignored. The values are returned in-line in // the refs slice. If getFullVectorsFromPK discovers references to partition IDs // in the refs list, it returns true, false otherwise.
nit: looks like the last sentence is stale.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 461 at r6 (raw file):
psTxn.tmpSpans, psTxn.tmpSpanIDs, rowinfra.GetDefaultBatchBytesLimit(false /* forceProductionValue */),
In the future, we should find a way to plumb *eval.Context
to the store.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 477 at r6 (raw file):
break } refs[refIdx].Vector = data[0].(*tree.DPGVector).T
We should probably check for data[0] == nil
and data[0] == tree.DNull
just in case. There's also a tree.MustBeDPGVector
helper function, and tree.AsDPGVector
depending on whether you want to assert the type.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 499 at r6 (raw file):
} if err := psTxn.kv.Run(ctx, b); err != nil {
Should we short-circuit here if numPKLookups == len(refs)
?
pkg/sql/vecindex/vecstore/in_memory_store_test.go
line 40 at r6 (raw file):
defer commitTransaction(ctx, t, store, txn) store.InsertVector([]byte{11}, testVectors[0])
nit: []byte{11}
-> testPKs[0]
?
similar for below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mw5h)
pkg/sql/vecindex/vecstore/persistent_store.go
line 68 at r6 (raw file):
} func (ps *PersistentStore) Init() error {
It's a bit strange to have both a New
constructor and an Init
method. Usually the only reason to have an Init
method is if you need to initialize an embedded struct or maybe b/c there are initialization steps that are not possible at the time New
is called. Unless that's the case here, the usual GoLang pattern is to instead change NewPersistentStore
to return an error, e.g. (*PersistentStore, error)
. That way, there's no way to forget that you need to call Init
after calling NewPersistentStore
.
pkg/sql/vecindex/vecstore/persistent_store.go
line 83 at r6 (raw file):
} return rowenc.InitIndexFetchSpec(&ps.fetchSpec, ps.codec, ps.table, ps.table.GetPrimaryIndex(), []descpb.ColumnID{vectorColumnID})
NIT: line length
pkg/sql/vecindex/vecstore/persistent_txn.go
line 436 at r6 (raw file):
} key := make(roachpb.Key, len(psTxn.store.pkPrefix)+len(ref.Key.PrimaryKey))
In theory, could we reuse space for these keys? For example, could we have a "key byte arena" or similar where we allocate keys sequentially? Does KV potentially hold onto the memory?
I'm not suggesting we do any of that now, I'm more wondering for the upcoming Go "regions" feature that supports a form of arena allocation. If we could reuse key byte memory, we could at least add a comment here to remind us that's possible.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 450 at r6 (raw file):
Alloc: &alloc, Spec: &psTxn.store.fetchSpec, SpansCanOverlap: true,
I'm reading this comment from the code:
// SpansCanOverlap indicates whether the spans in a given batch can overlap
// with one another. If it is true, spans that correspond to the same row must
// share the same span ID, since the span IDs are used to determine when a new
// row is being processed. In practice, this means that span IDs must be
// passed in when SpansCanOverlap is true.
Are we OK here in the case where we have duplicate primary keys in the refs
slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)
pkg/sql/vecindex/vecstore/persistent_txn.go
line 436 at r6 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
In theory, could we reuse space for these keys? For example, could we have a "key byte arena" or similar where we allocate keys sequentially? Does KV potentially hold onto the memory?
I'm not suggesting we do any of that now, I'm more wondering for the upcoming Go "regions" feature that supports a form of arena allocation. If we could reuse key byte memory, we could at least add a comment here to remind us that's possible.
The fetcher takes ownership of the spans slice while it's processing, but the spans slice can be reused after the fetcher is closed. I'm less clear whether the individual keys are reusable, except that they do go into a KV request, so I'm guessing the answer is no.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 450 at r6 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
I'm reading this comment from the code:
// SpansCanOverlap indicates whether the spans in a given batch can overlap // with one another. If it is true, spans that correspond to the same row must // share the same span ID, since the span IDs are used to determine when a new // row is being processed. In practice, this means that span IDs must be // passed in when SpansCanOverlap is true.
Are we OK here in the case where we have duplicate primary keys in the
refs
slice?
I had questions about that as well, which is why I put duplicate PKs in the common test. Making that work is why I had to plumb the spanIDs slice into DecodeInto.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 461 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
In the future, we should find a way to plumb
*eval.Context
to the store.
It doesn't need to be the store...it could just be the persistentStoreTxn. Eventually I'll have to make a constructor for that to use with the execution engine, so that seems like a good place for that to live.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @mw5h)
pkg/sql/vecindex/vecstore/persistent_txn.go
line 436 at r6 (raw file):
Previously, mw5h (Matt White) wrote…
The fetcher takes ownership of the spans slice while it's processing, but the spans slice can be reused after the fetcher is closed. I'm less clear whether the individual keys are reusable, except that they do go into a KV request, so I'm guessing the answer is no.
KV can hold on to that memory IIRC from when I implemented the vectorized index join.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 450 at r6 (raw file):
Previously, mw5h (Matt White) wrote…
I had questions about that as well, which is why I put duplicate PKs in the common test. Making that work is why I had to plumb the spanIDs slice into DecodeInto.
That's a confusing comment, which is unfortunate given that I wrote it... I think it's referring to the case when we split a single-row span into multiple spans each covering a different column family. I don't think this is actually necessary though, so I might produce a cleanup PR in a bit.
Anyway, I think what you're doing here works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)
pkg/sql/vecindex/vecstore/persistent_store.go
line 30 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
It doesn't have to happen here, but we'll need to be checking for staleness of the descriptor in each txn, similar to what we do for cached plans:
cockroach/pkg/sql/opt/metadata.go
Line 392 in 2729738
func (md *Metadata) CheckDependencies( Here's how I had the
VectorManager
implementation fetch the descriptor in the draft:cockroach/pkg/sql/vecindex/manager.go
Line 125 in e6b7dc6
tableDesc, err = txn.Descriptors().ByIDWithoutLeased(txn.KV()).Get().Table(ctx, tableID)
Note that I have that TODO asking about whether we can use leased descriptors; I think we can and should for the sake of performance.
Maybe fetcher args don't belong in the store. I was just trying to reduce the cost of initializing that struct.
pkg/sql/vecindex/vecstore/persistent_store.go
line 68 at r6 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
It's a bit strange to have both a
New
constructor and anInit
method. Usually the only reason to have anInit
method is if you need to initialize an embedded struct or maybe b/c there are initialization steps that are not possible at the timeNew
is called. Unless that's the case here, the usual GoLang pattern is to instead changeNewPersistentStore
to return an error, e.g.(*PersistentStore, error)
. That way, there's no way to forget that you need to callInit
after callingNewPersistentStore
.
Done.
pkg/sql/vecindex/vecstore/persistent_store.go
line 69 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit] Probably cleaner to do this:
// The indexed vector column is the last one in the index. vectorColID := ps.index.GetKeyColumnID(ps.index.NumKeyColumns()-1) ps.colIdxMap.Set(vectorColID, 0)
Done.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 436 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
KV can hold on to that memory IIRC from when I implemented the vectorized index join.
Done.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 450 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
That's a confusing comment, which is unfortunate given that I wrote it... I think it's referring to the case when we split a single-row span into multiple spans each covering a different column family. I don't think this is actually necessary though, so I might produce a cleanup PR in a bit.
Anyway, I think what you're doing here works.
Done.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 477 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
We should probably check for
data[0] == nil
anddata[0] == tree.DNull
just in case. There's also atree.MustBeDPGVector
helper function, andtree.AsDPGVector
depending on whether you want to assert the type.
Done.
pkg/sql/vecindex/vecstore/persistent_txn.go
line 499 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Should we short-circuit here if
numPKLookups == len(refs)
?
Done.
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 54 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Instead of this loop, you could use
catalog.MustFindColumnByName(tableDesc, "v")
Done.
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 69 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Might not really matter for a fake index, but the primary key columns should live in
KeySuffixColumnIDs
, andStoreColumnIDs
should never be set for vector indexes.
Done.
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 71 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
LatestIndexDescriptorVersion
?
Done.
pkg/sql/vecindex/vecstore/persistent_store_test.go
line 78 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
I think we use
EncodeVarintAscending
for integer key columns, so we should use that. On the other hand, the family ID (which is the0
, right?) should be encoded usingMakeFamilyKey
(though it won't make a difference in this particular case).
Done.
Previously, the row fetcher only supported a user provided spanIDs slice for the NextRow method. This patch expands that support to NextRowDecodedInto, which is needed by the persistent vector store because the GetFullVectors interface does not guarantee the requested PKs will be in order. Epic: CRDB-42943
Previously, we had a core offering that did not include the vector data type. We no longer have that offering, so this patch removes the requirement for an enterprise license to create a table with vector data. Epic: CRDB-42943
This patch introduces a KV implementation of GetFullVectors. This necessitated a rework of the PersistentStore to be more aware of the overall shape of the table, so the store now takes a codec, table and index and builds its own prefixes. To support testing, a new factory function is added to catalog.Index to wrap an index descriptor for use in testing. The unit test for the persistent store now creates a real table and a fake vector index to test the store. GetFullVectors tests have been moved from the in memory store test to the common test. Epic: CRDB-42943
324aebb
to
534540c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 17 files at r7, 9 of 9 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball)
vecstore: implement GetFullVectors in the persistent store
This patch introduces a KV implementation of GetFullVectors. This
necessitated a rework of the PersistentStore to be more aware of the
overall shape of the table, so the store now takes a codec, table and
index and builds its own prefixes.
To support testing, a new factory function is added to catalog.Index to
wrap an index descriptor for use in testing. The unit test for the
persistent store now creates a real table and a fake vector index to
est the store. GetFullVectors tests have been moved from the in memory
store test to the common test.
colinfo: remove enterprise licensing requirement from vector data type
Previously, we had a core offering that did not include the vector data
ype. We no longer have that offering, so this patch removes the
requirement for an enterprise license to create a table with vector
data.
row: support explicit spanIDs in fetcher.NextRowDecodedInto
Previously, the row fetcher only supported a user provided spanIDs slice
for the NextRow method. This patch expands that support to
NextRowDecodedInto, which is needed by the persistent vector store
because the GetFullVectors interface does not guarantee the requested
PKs will be in order.
Epic: CRDB-42943