[QST] Should byte_array_view in parquet reader/writer change #11408
Comments
I got some ideas, but they depend on a few points I'm not sure about yet:
Aiming to avoid code duplication:
I'd rather keep
Didn't know about libcu++ potential involvement.
What is your question?
Should `byte_array_view` change to a different implementation method, or even go away completely?
Motivation
When reviewing the `byte_array_view` PR, it was brought up in review comments that things could be done differently and possibly better. This issue is an attempt to bring this design out into the light and get some discourse going so we can build it the best way possible. Jake was, rightfully, concerned about the cognitive overload of having another object type that has to be understood, no matter how minimal the type turns out to be.
Backstory and origin
The original thought was that it would be nice to leverage the existing templates in the statistics code to get elements and compute max/min just like everything else. This meant that `.element` on a column would be able to return a type that represents a `list<uint8>`. This is almost identical to a string column, so the thought was to have something analogous to `string_view` that could be used. This was quickly dismissed because not every list column is made up of this type, and it felt like we were forcing something. All string columns are lists of chars, but not all list columns are lists of bytes.
Requirements
The requirements in the statistics code are the ability to get an element from a table, compare elements, and compose an element from a pointer and a length. The statistics code goes to great lengths to type-erase the statistics blobs so they can be easily consumed at a large scale on the GPU, and then reconstructs them later. It also uses `thrust::min` and `cub::BlockReduce` to process them, so comparison operators are needed.
Slippery issues to understand
We can't use the same statistics types as strings because `string_view::max()` is actually not the same as a max byte or a max `byte_array_view`. The distinction between all of them is subtle but important.
- `string_view` header. No UTF8 string can have a higher value, so comparisons work even though it isn't an infinitely long character string as one would initially think.
- `0xff, 0x05` is less than `0xff, 0x15` and `0xff` is less than `0x00, 0x00`.
- `byte_array_view` is defined conceptually as an infinite array of 0xff. This isn't possible to statically define for comparison like the `string_view` class, so some magic values were used: a nullptr and max length. These then have to be explicitly compared later in the comparison function to achieve the proper results.
Lots of places required special handling for `byte_array_view`, and this could potentially get worse with the different possible solutions. The goal, of course, is to make these areas as clean as possible, so I thought it would be good to point some of them out here.
- `list_view`, which can be returned from `.element` calls on a list column. This didn't end up being a great solution, but I can't remember the details.
Possible solutions
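For reference while reading the options, here is a minimal host-side sketch of the sentinel handling described above: an element composed from a pointer and a length, a "max" value encoded as a nullptr plus the maximum length, and a comparison function that checks for that sentinel explicitly before touching any bytes. The names `byte_view`, `max_value`, `is_max`, and `compare` are hypothetical, and the ordering (byte-wise lexicographic, with the shorter array first on a tie) is an assumption made for illustration. This is not the cudf implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical stand-in for byte_array_view: a non-owning pointer + length.
struct byte_view {
  std::uint8_t const* data;
  std::size_t size;

  // Sentinel for "conceptually infinite array of 0xff":
  // a null pointer paired with the largest representable length.
  static byte_view max_value() {
    return byte_view{nullptr, std::numeric_limits<std::size_t>::max()};
  }

  bool is_max() const {
    return data == nullptr && size == std::numeric_limits<std::size_t>::max();
  }
};

// Three-way compare: negative if a < b, zero if equal, positive if a > b.
inline int compare(byte_view const& a, byte_view const& b) {
  // The sentinel is handled first and is never dereferenced.
  if (a.is_max() || b.is_max()) {
    return static_cast<int>(a.is_max()) - static_cast<int>(b.is_max());
  }
  // Assumed ordering: compare the common prefix as unsigned bytes...
  std::size_t const n = std::min(a.size, b.size);
  int const c = (n == 0) ? 0 : std::memcmp(a.data, b.data, n);
  if (c != 0) { return c; }
  // ...then let the shorter array sort first.
  return static_cast<int>(a.size > b.size) - static_cast<int>(a.size < b.size);
}

// Operators so that generic min/max and reduction code can use the type.
inline bool operator<(byte_view const& a, byte_view const& b) { return compare(a, b) < 0; }
inline bool operator==(byte_view const& a, byte_view const& b) { return compare(a, b) == 0; }
```

The point of the sketch is that the sentinel check lives inside the comparison itself, which is the kind of special handling the text above refers to.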
- `device_span` directly. This requires passing comparison functions to cub and thrust for the calculations, but is completely doable. This was attempted, potentially poorly, with not great looking results.
- `device_span` inside, vs inheriting from `device_span` either publicly or privately. There isn't a great answer here to argue against inheritance. I originally thought that this would be a very small subset of `device_span` and I didn't want to muddy the waters with all the accessors and iterators, but after further inspection, I don't see anything that I would want to remove from `device_span`, so this would be a viable path. It does still have the issue of cognitive overload from yet another type someone encounters.
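To make the trade-off of using `device_span` directly a little more concrete, here is a rough host-side sketch. It assumes a hypothetical `span_bytes` struct in place of `device_span`, and `std::minmax_element` in place of the cub and thrust reductions. The span type defines no operators of its own, so every call that orders elements has to be handed the comparator explicitly. The names `span_bytes`, `bytes_less`, and `min_max` are illustrative only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for device_span<uint8_t const>: no comparison operators.
struct span_bytes {
  std::uint8_t const* data;
  std::size_t size;
};

// Free-standing comparator that must accompany every min/max/reduce call.
struct bytes_less {
  bool operator()(span_bytes const& a, span_bytes const& b) const {
    // Unsigned byte-wise lexicographic order (an assumption for this sketch).
    return std::lexicographical_compare(a.data, a.data + a.size,
                                        b.data, b.data + b.size);
  }
};

// Returns {min, max} of a non-empty collection of byte spans.
inline std::pair<span_bytes, span_bytes> min_max(std::vector<span_bytes> const& values) {
  auto const result = std::minmax_element(values.begin(), values.end(), bytes_less{});
  return {*result.first, *result.second};
}
```

With a wrapper type that carries its own `operator<`, the `bytes_less{}` argument disappears from each call site, at the cost of the extra type that readers then have to learn.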