chore(dataobj): data object encoding and decoding #15676

Open · wants to merge 1 commit into base: main
Conversation

@rfratto rfratto commented Jan 9, 2025

This commit introduces the encoding package with utilities for writing and reading a data object.

This initial commit includes a single section called "streams". The streams section holds a list of streams for which logs are available in the data object file. It does not hold the logs themselves, only each stream's labels together with an ID.
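
As a rough illustration, an entry in the streams section might look like the sketch below; the field names and the labels type are assumptions for illustration, not the schema defined in this PR:

```go
import "github.com/prometheus/prometheus/model/labels"

// streamEntry is a hypothetical shape for one row in the streams section:
// the stream's label set plus an ID, with no log lines attached.
type streamEntry struct {
	ID     int64
	Labels labels.Labels // e.g. {app="api", env="prod"}
}
```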

Encoding

Encoding presents a hierarchical API to match the file structure:

  1. Callers open an encoder
  2. Callers open a streams section from the encoder
  3. Callers open a column from the streams section
  4. Callers append a page into the column

Child elements of the hierarchy have a Commit method to flush their written data and metadata to their parent.

Each element of the hierarchy exposes its current MetadataSize. Callers should use MetadataSize to control the size of an element. For example, if Encoder.MetadataSize goes past a limit, callers should stop appending new sections to the file and flush the file to disk.

To support discarding data after reaching a size limit, each child element of the hierarchy also has a Discard method.
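
A minimal sketch of how a caller might drive this hierarchy follows. The interface shapes and the OpenStreams, OpenColumn, and AppendPage names are assumptions made for illustration; only Commit, Discard, and MetadataSize are named in this description:

```go
// element captures the operations every node in the hierarchy exposes.
type element interface {
	MetadataSize() int
	Commit() error
	Discard() error
}

// ColumnEncoder, StreamsEncoder, and Encoder are hypothetical shapes for the
// hierarchy; the real package defines its own types.
type ColumnEncoder interface {
	element
	AppendPage(page []byte) error
}

type StreamsEncoder interface {
	element
	OpenColumn() (ColumnEncoder, error)
}

type Encoder interface {
	element
	OpenStreams() (StreamsEncoder, error)
}

// writeStreamsColumn follows steps 1-4: open the streams section, open a
// column, append pages, then commit bottom-up so each child flushes its data
// and metadata to its parent. It reports whether the caller should stop
// appending sections and flush the file to disk.
func writeStreamsColumn(enc Encoder, pages [][]byte, maxMetadata int) (flush bool, err error) {
	streams, err := enc.OpenStreams()
	if err != nil {
		return false, err
	}
	col, err := streams.OpenColumn()
	if err != nil {
		return false, err
	}

	for _, p := range pages {
		if err := col.AppendPage(p); err != nil {
			_ = col.Discard() // drop the partially written column
			_ = streams.Discard()
			return false, err
		}
	}

	if err := col.Commit(); err != nil {
		return false, err
	}
	if err := streams.Commit(); err != nil {
		return false, err
	}

	// MetadataSize signals when the top-level element has grown too large.
	return enc.MetadataSize() > maxMetadata, nil
}
```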

Decoding

Decoding provides a separate Decoder interface for each section, keeping the per-section APIs cleanly separated.

The initial Decoder implementation reads from an io.ReadSeeker; later implementations will include object storage and caching.

The Decoder interfaces are designed for batch reading, so that callers can retrieve multiple columns or pages at once. Implementations can use this to reduce the number of roundtrips (such as retrieving multiple cache keys in a single cache request).
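
A hypothetical sketch of what a batch-oriented decoder interface could look like; the method and type names below are assumptions, only the batch-reading design comes from this description:

```go
import "context"

// ColumnDesc and PageData stand in for the real column and page metadata
// types defined by the package.
type ColumnDesc struct{ Index int }
type PageData []byte

// StreamsDecoder takes and returns slices so an implementation can satisfy a
// request in a single roundtrip, for example one multi-key cache lookup or
// one ranged read from object storage.
type StreamsDecoder interface {
	// Columns lists every column in the streams section.
	Columns(ctx context.Context) ([]ColumnDesc, error)
	// Pages returns the pages for all requested columns at once.
	Pages(ctx context.Context, columns []ColumnDesc) ([][]PageData, error)
}
```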

encoding.StreamsDataset converts a StreamDecoder into a dataset.Dataset, allowing callers to use the existing dataset utility functions without downloading an entire dataset.

@rfratto rfratto requested a review from a team as a code owner January 9, 2025 21:23

```go
var protoBufferPool = sync.Pool{
	New: func() any {
		return new(proto.Buffer)
	},
}
```
rfratto (Member Author):
@cyriltovena I initially set proto.Buffer.SetDeterministic here to get deterministic encoding of protobufs, but I think there's a bug in gogo/protobuf that prevents it from working.

Either way, I think our encoding is already deterministic as long as we never include map types in our protobuf. I'll have some tests for that once we have the final pieces that tie everything together.
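
For context, a sketch of how the pooled buffer could be used around a marshal call; the helper below is hypothetical (only protoBufferPool itself appears in the diff above), and SetDeterministic is left disabled per the suspected bug:

```go
import (
	"sync"

	"github.com/gogo/protobuf/proto"
)

var protoBufferPool = sync.Pool{
	New: func() any {
		return new(proto.Buffer)
	},
}

// marshalWithPool reuses a proto.Buffer from the pool for each marshal call.
func marshalWithPool(msg proto.Message) ([]byte, error) {
	buf := protoBufferPool.Get().(*proto.Buffer)
	defer protoBufferPool.Put(buf)
	buf.Reset()

	// buf.SetDeterministic(true) would request deterministic output; it is
	// left disabled here because of the suspected gogo/protobuf bug above.

	if err := buf.Marshal(msg); err != nil {
		return nil, err
	}

	// Copy the result out before the buffer returns to the pool.
	out := make([]byte, len(buf.Bytes()))
	copy(out, buf.Bytes())
	return out, nil
}
```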
