chore(dataobj): data object encoding and decoding #15676

Open · wants to merge 1 commit into base: main
Conversation

@rfratto rfratto commented Jan 9, 2025

This commit introduces the encoding package with utilities for writing and reading a data object.

This initial commit includes a single section called "streams". The streams section holds a list of streams for which logs are available in the data object file. It does not hold the logs themselves, only each stream's labels together with an ID.
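
As a rough illustration, an entry in the streams section might look like the sketch below; the field names and the labels type are assumptions for illustration, not the schema defined in this PR:

```go
import "github.com/prometheus/prometheus/model/labels"

// streamEntry is a hypothetical shape for one row in the streams section:
// the stream's label set plus an ID, with no log lines attached.
type streamEntry struct {
	ID     int64
	Labels labels.Labels // e.g. {app="api", env="prod"}
}
```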

Encoding

Encoding presents a hierarchical API to match the file structure:

  1. Callers open an encoder
  2. Callers open a streams section from the encoder
  3. Callers open a column from the streams section
  4. Callers append a page into the column

Child elements of the hierarchy have a Commit method to flush their written data and metadata to their parent.

Each element of the hierarchy exposes its current MetadataSize. Callers should use MetadataSize to control the size of an element. For example, if Encoder.MetadataSize goes past a limit, callers should stop appending new sections to the file and flush the file to disk.

To support discarding data after reaching a size limit, each child element of the hierarchy also has a Discard method.
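
A minimal sketch of how a caller might drive this hierarchy follows. The interface shapes and the OpenStreams, OpenColumn, and AppendPage names are assumptions made for illustration; only Commit, Discard, and MetadataSize are named in this description:

```go
// element captures the operations every node in the hierarchy exposes.
type element interface {
	MetadataSize() int
	Commit() error
	Discard() error
}

// ColumnEncoder, StreamsEncoder, and Encoder are hypothetical shapes for the
// hierarchy; the real package defines its own types.
type ColumnEncoder interface {
	element
	AppendPage(page []byte) error
}

type StreamsEncoder interface {
	element
	OpenColumn() (ColumnEncoder, error)
}

type Encoder interface {
	element
	OpenStreams() (StreamsEncoder, error)
}

// writeStreamsColumn follows steps 1-4: open the streams section, open a
// column, append pages, then commit bottom-up so each child flushes its data
// and metadata to its parent. It reports whether the caller should stop
// appending sections and flush the file to disk.
func writeStreamsColumn(enc Encoder, pages [][]byte, maxMetadata int) (flush bool, err error) {
	streams, err := enc.OpenStreams()
	if err != nil {
		return false, err
	}
	col, err := streams.OpenColumn()
	if err != nil {
		return false, err
	}

	for _, p := range pages {
		if err := col.AppendPage(p); err != nil {
			_ = col.Discard() // drop the partially written column
			_ = streams.Discard()
			return false, err
		}
	}

	if err := col.Commit(); err != nil {
		return false, err
	}
	if err := streams.Commit(); err != nil {
		return false, err
	}

	// MetadataSize signals when the top-level element has grown too large.
	return enc.MetadataSize() > maxMetadata, nil
}
```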

Decoding

Decoding provides a separate Decoder interface for each section, keeping the per-section APIs cleanly separated.

The initial Decoder implementation reads from an io.ReadSeeker; later implementations will include object storage and caching.

The Decoder interfaces are designed for batch reading, so that callers can retrieve multiple columns or pages at once. Implementations can use this to reduce the number of roundtrips (such as retrieving multiple cache keys in a single cache request).
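
A hypothetical sketch of what a batch-oriented decoder interface could look like; the method and type names below are assumptions, only the batch-reading design comes from this description:

```go
import "context"

// ColumnDesc and PageData stand in for the real column and page metadata
// types defined by the package.
type ColumnDesc struct{ Index int }
type PageData []byte

// StreamsDecoder takes and returns slices so an implementation can satisfy a
// request in a single roundtrip, for example one multi-key cache lookup or
// one ranged read from object storage.
type StreamsDecoder interface {
	// Columns lists every column in the streams section.
	Columns(ctx context.Context) ([]ColumnDesc, error)
	// Pages returns the pages for all requested columns at once.
	Pages(ctx context.Context, columns []ColumnDesc) ([][]PageData, error)
}
```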

encoding.StreamsDataset converts a StreamDecoder into a dataset.Dataset, allowing callers to use the existing dataset utility functions without downloading an entire dataset.

@rfratto rfratto requested a review from a team as a code owner January 9, 2025 21:23

```go
var protoBufferPool = sync.Pool{
	New: func() any {
		return new(proto.Buffer)
	},
}
```
rfratto (Member Author):
@cyriltovena I initially set proto.Buffer.SetDeterministic here to get deterministic encoding of protobufs, but I think there's a bug in gogo/protobuf that prevents it from working.

Either way, I think our encoding is already deterministic as long as we never include map types in our protobuf. I'll have some tests for that once we have the final pieces that tie everything together.
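
For context, a sketch of how the pooled buffer could be used around a marshal call; the helper below is hypothetical (only protoBufferPool itself appears in the diff above), and SetDeterministic is left disabled per the suspected bug:

```go
import (
	"sync"

	"github.com/gogo/protobuf/proto"
)

var protoBufferPool = sync.Pool{
	New: func() any {
		return new(proto.Buffer)
	},
}

// marshalWithPool reuses a proto.Buffer from the pool for each marshal call.
func marshalWithPool(msg proto.Message) ([]byte, error) {
	buf := protoBufferPool.Get().(*proto.Buffer)
	defer protoBufferPool.Put(buf)
	buf.Reset()

	// buf.SetDeterministic(true) would request deterministic output; it is
	// left disabled here because of the suspected gogo/protobuf bug above.

	if err := buf.Marshal(msg); err != nil {
		return nil, err
	}

	// Copy the result out before the buffer returns to the pool.
	out := make([]byte, len(buf.Bytes()))
	copy(out, buf.Bytes())
	return out, nil
}
```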
