Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vecstore: implement GetFullVectors for the persistent store #139050

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

mw5h
Copy link
Contributor

@mw5h mw5h commented Jan 14, 2025

vecstore: implement GetFullVectors in the persistent store

This patch introduces a KV implementation of GetFullVectors. This
necessitated a rework of the PersistentStore to be more aware of the
overall shape of the table, so the store now takes a codec, table and
index and builds its own prefixes.

To support testing, a new factory function is added to catalog.Index to
wrap an index descriptor for use in testing. The unit test for the
persistent store now creates a real table and a fake vector index to
est the store. GetFullVectors tests have been moved from the in memory
store test to the common test.

colinfo: remove enterprise licensing requirement from vector data type

Previously, we had a core offering that did not include the vector data
ype. We no longer have that offering, so this patch removes the
requirement for an enterprise license to create a table with vector
data.

row: support explicit spanIDs in fetcher.NextRowDecodedInto

Previously, the row fetcher only supported a user provided spanIDs slice
for the NextRow method. This patch expands that support to
NextRowDecodedInto, which is needed by the persistent vector store
because the GetFullVectors interface does not guarantee the requested
PKs will be in order.

Epic: CRDB-42943

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@mw5h mw5h marked this pull request as ready for review January 14, 2025 17:48
@mw5h mw5h requested review from a team as code owners January 14, 2025 17:48
@mw5h mw5h requested review from DrewKimball and removed request for a team January 14, 2025 17:48
@mw5h mw5h force-pushed the vecstore-getfullvectors branch 2 times, most recently from d310aa6 to 324aebb Compare January 14, 2025 19:48
Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 1 of 8 files at r3, 1 of 10 files at r4, 9 of 9 files at r5, 8 of 8 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mw5h)


pkg/sql/vecindex/vecstore/persistent_store.go line 30 at r6 (raw file):

	codec keys.SQLCodec
	table catalog.TableDescriptor
	index catalog.Index

It doesn't have to happen here, but we'll need to be checking for staleness of the descriptor in each txn, similar to what we do for cached plans:

func (md *Metadata) CheckDependencies(

Here's how I had the VectorManager implementation fetch the descriptor in the draft:

tableDesc, err = txn.Descriptors().ByIDWithoutLeased(txn.KV()).Get().Table(ctx, tableID)

Note that I have that TODO asking about whether we can use leased descriptors; I think we can and should for the sake of performance.


pkg/sql/vecindex/vecstore/persistent_store.go line 69 at r6 (raw file):

func (ps *PersistentStore) Init() error {
	keycols := ps.index.CollectKeyColumnIDs()

[nit] Probably cleaner to do this:

// The indexed vector column is the last one in the index.
vectorColID := ps.index.GetKeyColumnID(ps.index.NumKeyColumns()-1)
ps.colIdxMap.Set(vectorColID, 0)

pkg/sql/vecindex/vecstore/persistent_store_test.go line 51 at r6 (raw file):

	tableDesc := desctestutils.TestingGetPublicTableDescriptor(kvDB, codec, "defaultdb", "t")

	keycols := tableDesc.GetPrimaryIndex().CollectPrimaryStoredColumnIDs()

nit: for clarity, keyCols -> storedCols (or something similar)


pkg/sql/vecindex/vecstore/persistent_store_test.go line 54 at r6 (raw file):

	var col catalog.Column
	var err error
	for colID, more := keycols.Next(0); more; colID, more = keycols.Next(colID) {

Instead of this loop, you could use catalog.MustFindColumnByName(tableDesc, "v")


pkg/sql/vecindex/vecstore/persistent_store_test.go line 69 at r6 (raw file):

		KeyColumnNames:      []string{col.GetName()},
		KeyColumnDirections: []catenumpb.IndexColumn_Direction{catenumpb.IndexColumn_ASC},
		StoreColumnIDs:      []descpb.ColumnID{tableDesc.GetPrimaryIndex().GetKeyColumnID(0)},

Might not really matter for a fake index, but the primary key columns should live in KeySuffixColumnIDs, and StoreColumnIDs should never be set for vector indexes.


pkg/sql/vecindex/vecstore/persistent_store_test.go line 71 at r6 (raw file):

		StoreColumnIDs:      []descpb.ColumnID{tableDesc.GetPrimaryIndex().GetKeyColumnID(0)},
		StoreColumnNames:    []string{tableDesc.GetPrimaryIndex().GetKeyColumnName(0)},
		Version:             descpb.EmptyArraysInInvertedIndexesVersion,

LatestIndexDescriptorVersion?


pkg/sql/vecindex/vecstore/persistent_store_test.go line 78 at r6 (raw file):

	quantizer := quantize.NewUnQuantizer(2)
	store := NewPersistentStore(kvDB, quantizer, codec, tableDesc, index)
	pk1 := encoding.EncodeVarintAscending(encoding.EncodeUvarintAscending([]byte{}, 11), 0)

I think we use EncodeVarintAscending for integer key columns, so we should use that. On the other hand, the family ID (which is the 0, right?) should be encoded using MakeFamilyKey (though it won't make a difference in this particular case).


pkg/sql/vecindex/vecstore/store_test.go line 37 at r6 (raw file):

		defer commitTransaction(ctx, t, store, txn)

		// Include primary keys that are cannot be found.

nit: are cannot be found -> cannot be found


pkg/sql/vecindex/vecstore/persistent_txn.go line 419 at r6 (raw file):

// that specify a partition ID are ignored. The values are returned in-line in
// the refs slice. If getFullVectorsFromPK discovers references to partition IDs
// in the refs list, it returns true, false otherwise.

nit: looks like the last sentence is stale.


pkg/sql/vecindex/vecstore/persistent_txn.go line 461 at r6 (raw file):

			psTxn.tmpSpans,
			psTxn.tmpSpanIDs,
			rowinfra.GetDefaultBatchBytesLimit(false /* forceProductionValue */),

In the future, we should find a way to plumb *eval.Context to the store.


pkg/sql/vecindex/vecstore/persistent_txn.go line 477 at r6 (raw file):

				break
			}
			refs[refIdx].Vector = data[0].(*tree.DPGVector).T

We should probably check for data[0] == nil and data[0] == tree.DNull just in case. There's also a tree.MustBeDPGVector helper function, and tree.AsDPGVector depending on whether you want to assert the type.


pkg/sql/vecindex/vecstore/persistent_txn.go line 499 at r6 (raw file):

	}

	if err := psTxn.kv.Run(ctx, b); err != nil {

Should we short-circuit here if numPKLookups == len(refs)?


pkg/sql/vecindex/vecstore/in_memory_store_test.go line 40 at r6 (raw file):

		defer commitTransaction(ctx, t, store, txn)

		store.InsertVector([]byte{11}, testVectors[0])

nit: []byte{11} -> testPKs[0]?

similar for below.

Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mw5h)


pkg/sql/vecindex/vecstore/persistent_store.go line 68 at r6 (raw file):

}

func (ps *PersistentStore) Init() error {

It's a bit strange to have both a New constructor and an Init method. Usually the only reason to have an Init method is if you need to initialize an embedded struct or maybe b/c there are initialization steps that are not possible at the time New is called. Unless that's the case here, the usual GoLang pattern is to instead change NewPersistentStore to return an error, e.g. (*PersistentStore, error). That way, there's no way to forget that you need to call Init after calling NewPersistentStore.


pkg/sql/vecindex/vecstore/persistent_store.go line 83 at r6 (raw file):

	}

	return rowenc.InitIndexFetchSpec(&ps.fetchSpec, ps.codec, ps.table, ps.table.GetPrimaryIndex(), []descpb.ColumnID{vectorColumnID})

NIT: line length


pkg/sql/vecindex/vecstore/persistent_txn.go line 436 at r6 (raw file):

		}

		key := make(roachpb.Key, len(psTxn.store.pkPrefix)+len(ref.Key.PrimaryKey))

In theory, could we reuse space for these keys? For example, could we have a "key byte arena" or similar where we allocate keys sequentially? Does KV potentially hold onto the memory?

I'm not suggesting we do any of that now, I'm more wondering for the upcoming Go "regions" feature that supports a form of arena allocation. If we could reuse key byte memory, we could at least add a comment here to remind us that's possible.


pkg/sql/vecindex/vecstore/persistent_txn.go line 450 at r6 (raw file):

			Alloc:           &alloc,
			Spec:            &psTxn.store.fetchSpec,
			SpansCanOverlap: true,

I'm reading this comment from the code:

	// SpansCanOverlap indicates whether the spans in a given batch can overlap
	// with one another. If it is true, spans that correspond to the same row must
	// share the same span ID, since the span IDs are used to determine when a new
	// row is being processed. In practice, this means that span IDs must be
	// passed in when SpansCanOverlap is true.

Are we OK here in the case where we have duplicate primary keys in the refs slice?

Copy link
Contributor Author

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)


pkg/sql/vecindex/vecstore/persistent_txn.go line 436 at r6 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

In theory, could we reuse space for these keys? For example, could we have a "key byte arena" or similar where we allocate keys sequentially? Does KV potentially hold onto the memory?

I'm not suggesting we do any of that now, I'm more wondering for the upcoming Go "regions" feature that supports a form of arena allocation. If we could reuse key byte memory, we could at least add a comment here to remind us that's possible.

The fetcher takes ownership of the spans slice while it's processing, but the spans slice can be reused after the fetcher is closed. I'm less clear whether the individual keys are reusable, except that they do go into a KV request, so I'm guessing the answer is no.


pkg/sql/vecindex/vecstore/persistent_txn.go line 450 at r6 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I'm reading this comment from the code:

	// SpansCanOverlap indicates whether the spans in a given batch can overlap
	// with one another. If it is true, spans that correspond to the same row must
	// share the same span ID, since the span IDs are used to determine when a new
	// row is being processed. In practice, this means that span IDs must be
	// passed in when SpansCanOverlap is true.

Are we OK here in the case where we have duplicate primary keys in the refs slice?

I had questions about that as well, which is why I put duplicate PKs in the common test. Making that work is why I had to plumb the spanIDs slice into DecodeInto.


pkg/sql/vecindex/vecstore/persistent_txn.go line 461 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

In the future, we should find a way to plumb *eval.Context to the store.

It doesn't need to be the store...it could just be the persistentStoreTxn. Eventually I'll have to make a constructor for that to use with the execution engine, so that seems like a good place for that to live.

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @mw5h)


pkg/sql/vecindex/vecstore/persistent_txn.go line 436 at r6 (raw file):

Previously, mw5h (Matt White) wrote…

The fetcher takes ownership of the spans slice while it's processing, but the spans slice can be reused after the fetcher is closed. I'm less clear whether the individual keys are reusable, except that they do go into a KV request, so I'm guessing the answer is no.

KV can hold on to that memory IIRC from when I implemented the vectorized index join.


pkg/sql/vecindex/vecstore/persistent_txn.go line 450 at r6 (raw file):

Previously, mw5h (Matt White) wrote…

I had questions about that as well, which is why I put duplicate PKs in the common test. Making that work is why I had to plumb the spanIDs slice into DecodeInto.

That's a confusing comment, which is unfortunate given that I wrote it... I think it's referring to the case when we split a single-row span into multiple spans each covering a different column family. I don't think this is actually necessary though, so I might produce a cleanup PR in a bit.

Anyway, I think what you're doing here works.

Copy link
Contributor Author

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)


pkg/sql/vecindex/vecstore/persistent_store.go line 30 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

It doesn't have to happen here, but we'll need to be checking for staleness of the descriptor in each txn, similar to what we do for cached plans:

func (md *Metadata) CheckDependencies(

Here's how I had the VectorManager implementation fetch the descriptor in the draft:

tableDesc, err = txn.Descriptors().ByIDWithoutLeased(txn.KV()).Get().Table(ctx, tableID)

Note that I have that TODO asking about whether we can use leased descriptors; I think we can and should for the sake of performance.

Maybe fetcher args don't belong in the store. I was just trying to reduce the cost of initializing that struct.


pkg/sql/vecindex/vecstore/persistent_store.go line 68 at r6 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

It's a bit strange to have both a New constructor and an Init method. Usually the only reason to have an Init method is if you need to initialize an embedded struct or maybe b/c there are initialization steps that are not possible at the time New is called. Unless that's the case here, the usual GoLang pattern is to instead change NewPersistentStore to return an error, e.g. (*PersistentStore, error). That way, there's no way to forget that you need to call Init after calling NewPersistentStore.

Done.


pkg/sql/vecindex/vecstore/persistent_store.go line 69 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] Probably cleaner to do this:

// The indexed vector column is the last one in the index.
vectorColID := ps.index.GetKeyColumnID(ps.index.NumKeyColumns()-1)
ps.colIdxMap.Set(vectorColID, 0)

Done.


pkg/sql/vecindex/vecstore/persistent_txn.go line 436 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

KV can hold on to that memory IIRC from when I implemented the vectorized index join.

Done.


pkg/sql/vecindex/vecstore/persistent_txn.go line 450 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

That's a confusing comment, which is unfortunate given that I wrote it... I think it's referring to the case when we split a single-row span into multiple spans each covering a different column family. I don't think this is actually necessary though, so I might produce a cleanup PR in a bit.

Anyway, I think what you're doing here works.

Done.


pkg/sql/vecindex/vecstore/persistent_txn.go line 477 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

We should probably check for data[0] == nil and data[0] == tree.DNull just in case. There's also a tree.MustBeDPGVector helper function, and tree.AsDPGVector depending on whether you want to assert the type.

Done.


pkg/sql/vecindex/vecstore/persistent_txn.go line 499 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Should we short-circuit here if numPKLookups == len(refs)?

Done.


pkg/sql/vecindex/vecstore/persistent_store_test.go line 54 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Instead of this loop, you could use catalog.MustFindColumnByName(tableDesc, "v")

Done.


pkg/sql/vecindex/vecstore/persistent_store_test.go line 69 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Might not really matter for a fake index, but the primary key columns should live in KeySuffixColumnIDs, and StoreColumnIDs should never be set for vector indexes.

Done.


pkg/sql/vecindex/vecstore/persistent_store_test.go line 71 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

LatestIndexDescriptorVersion?

Done.


pkg/sql/vecindex/vecstore/persistent_store_test.go line 78 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think we use EncodeVarintAscending for integer key columns, so we should use that. On the other hand, the family ID (which is the 0, right?) should be encoded using MakeFamilyKey (though it won't make a difference in this particular case).

Done.

mw5h added 3 commits January 14, 2025 17:47
Previously, the row fetcher only supported a user provided spanIDs slice
for the NextRow method. This patch expands that support to
NextRowDecodedInto, which is needed by the persistent vector store
because the GetFullVectors interface does not guarantee the requested
PKs will be in order.

Epic: CRDB-42943
Previously, we had a core offering that did not include the vector data
type. We no longer have that offering, so this patch removes the
requirement for an enterprise license to create a table with vector
data.

Epic: CRDB-42943
This patch introduces a KV implementation of GetFullVectors. This
necessitated a rework of the PersistentStore to be more aware of the
overall shape of the table, so the store now takes a codec, table and
index and builds its own prefixes.

To support testing, a new factory function is added to catalog.Index to
wrap an index descriptor for use in testing. The unit test for the
persistent store now creates a real table and a fake vector index to
test the store. GetFullVectors tests have been moved from the in memory
store test to the common test.

Epic: CRDB-42943
@mw5h mw5h force-pushed the vecstore-getfullvectors branch from 324aebb to 534540c Compare January 15, 2025 01:47
Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nicely done!

Reviewed 1 of 17 files at r7, 9 of 9 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants