chore(dataobj): initial commit of value encoding #15606
Merged
This commit introduces the dataobj package with initial utilities for encoding and decoding individual values within a "dataset." A dataset is the generic representation of columnar data, with a dataset.Value being one value in one page in one column. A dataset is one type of structure that will exist within a "data object."
This initial implementation includes two encodings:
A follow-up commit will introduce bitmap encoding for efficiently bitpacking unsigned integers.
My initial prototype of dataobj used generics rather than the dataset.Value wrapper. However, generics made it difficult to write utilities that operate on multiple columns. While dataset.Value is slightly less type safe, it is significantly easier to work with within the scope of a dataset.
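To illustrate the trade-off, here is a minimal sketch of a tagged value wrapper. It is not the actual dataset.Value implementation: the `Value` struct, its fields, the `ValueType` constants, and `CountByType` are hypothetical names chosen for this example. It only shows why a single non-generic value type lets one utility accept columns of different types.

```go
package main

import "fmt"

// ValueType identifies which kind of data a Value holds.
// Hypothetical: the real package may define different types.
type ValueType uint8

const (
	ValueTypeUnspecified ValueType = iota
	ValueTypeInt64
	ValueTypeString
)

// Value is a hypothetical stand-in for dataset.Value: one value in one
// page in one column, tagged with its type at runtime.
type Value struct {
	typ ValueType
	num int64
	str string
}

// Int64Value wraps an int64 in a Value.
func Int64Value(v int64) Value { return Value{typ: ValueTypeInt64, num: v} }

// StringValue wraps a string in a Value.
func StringValue(v string) Value { return Value{typ: ValueTypeString, str: v} }

// Type reports the kind of data stored in v.
func (v Value) Type() ValueType { return v.typ }

// Int64 returns the int64 held by v. The type check happens at runtime,
// which is the type safety given up compared to generics.
func (v Value) Int64() int64 {
	if v.typ != ValueTypeInt64 {
		panic(fmt.Sprintf("value has type %d, not int64", v.typ))
	}
	return v.num
}

// CountByType accepts any mix of columns in one call. With a generic
// Column[T], a single variadic parameter could not hold both a
// Column[int64] and a Column[string].
func CountByType(columns ...[]Value) map[ValueType]int {
	counts := make(map[ValueType]int)
	for _, col := range columns {
		for _, v := range col {
			counts[v.Type()]++
		}
	}
	return counts
}
```

Calling `CountByType(intColumn, stringColumn)` works without any type parameters, at the cost of accessors like `Int64` panicking if called on the wrong type.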
The encoding and decoding of values is implemented to support streaming as much as possible: individual values can be encoded and passed immediately to a compression block. Streaming values minimizes the number of rows that need to be stored in memory at once on both the write path and the read path. This contrasts with the design of parquet-go, which primarily intends for an entire page of values to be buffered in memory prior to encoding and compression. The streaming approach accepts slightly slower performance in exchange for memory efficiency.
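The streaming idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `streamEncoder`, `streamDecoder`, and `roundTrip` are hypothetical names, varint encoding from the standard library stands in for whichever encodings dataobj uses, and gzip stands in for the compression block. The point is that each value is written to the compressor as soon as it is encoded, so only a small fixed scratch buffer is held by the encoder.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"io"
)

// streamEncoder (hypothetical) encodes one value at a time and hands the
// bytes straight to the underlying writer, without buffering a page.
type streamEncoder struct {
	w   io.Writer
	buf [binary.MaxVarintLen64]byte // scratch space for a single value
}

// Encode writes v as a varint to the underlying writer.
func (e *streamEncoder) Encode(v int64) error {
	n := binary.PutVarint(e.buf[:], v)
	_, err := e.w.Write(e.buf[:n])
	return err
}

// streamDecoder (hypothetical) reads one value at a time.
type streamDecoder struct {
	r io.ByteReader
}

// Decode reads the next varint. It returns io.EOF when no values remain.
func (d *streamDecoder) Decode() (int64, error) {
	return binary.ReadVarint(d.r)
}

// roundTrip streams values through the encoder into gzip, then streams
// them back out through the decoder.
func roundTrip(values []int64) ([]int64, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	enc := &streamEncoder{w: zw}
	for _, v := range values {
		if err := enc.Encode(v); err != nil {
			return nil, err
		}
	}
	// Close flushes any data still held by the compressor.
	if err := zw.Close(); err != nil {
		return nil, err
	}

	zr, err := gzip.NewReader(&compressed)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	// bufio.Reader provides the io.ByteReader that ReadVarint needs.
	dec := &streamDecoder{r: bufio.NewReader(zr)}
	var out []int64
	for {
		v, err := dec.Decode()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, v)
	}
}
```

A page-buffered design would instead collect all values in a slice, encode the slice in one pass, and then compress the result, which keeps the whole page in memory at once.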
This PR is based on the prototyping work done in my dataobj and dataobj-combined branches.