-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Add variable bit-width keys and improved key order for Parquet dict pages #13995
Comments
Looking at the files, the BIT_PACKED is a bit of a red herring.
The file metadata claims the repetition level data is BIT_PACKED, but the max repetition level is 0, so there is no data to encode. I'm surprised to see BIT_PACKED listed since it's been deprecated for at least a decade now. 😉 Digging deeper, the CPU pages are 7509 bytes for the first 26 pages, and then 10009 bytes for the remaining pages of the column chunk. It seems that the encoder is using a variable bit width for the dictionary encoding. 3 bits for 20000 values would be 7500 bytes, 4 bits for 20000 values is 10000 bytes. (Actually, there are runs of 8 values in the data, so it's really where the dictionary switches from <= 16 bits to >= 17 bits that the bump in page size occurs). libcudf uses a fixed bit width for the entire column chunk based on the number of dictionary keys present. I think it would be a lot of work to use variable bit widths in cudf. You can try limiting the rowgroup size to 400k rows. That might keep the dictionaries in the |
@etseidl thank you for studying this file. Would you please help me understand your observations a bit better? I thought that each column chunk has one dictionary page, and this dictionary is used for all of the pages in the column chunk. How could the pages in a column chunk switch between 16-bit and 17-bit dictionaries?
Would you please share a bit more about this feature idea? How would the encoder choose the bit width for each page? |
@GregoryKimball, it has to do with the size of the dictionary when the page is encoded. Because parquet-mr is more stream based, it will keep the current dictionary in RAM, and add keys as it goes. So say that the first page has just under 64k distinct entries; in this case the maximum key size will be 16 bits, and the page will be RLE encoded with that bit width. Now while encoding the second page, the number of distinct entries exceeds 64k; the RLE encoder will now use 17 bits. cuDF, on the other hand, computes the dictionary up front, and then uses the total number of entries to determine the bit width to use for all pages. cuDF could modify the dictionary page-encoder (not dictionary-page encoder 😄) to first find the largest dictionary key for the given page, and use that value to determine how many bits to use when doing the RLE encoding. I'm not sure what that would do to pre-computed page sizes and how the encoded values are stuffed into the column buffer. This could get expensive to compute, but some users might prefer the smallest file possible and be willing to make that trade. |
Thank you @etseidl, I appreciate the explanation. If we added this dynamic bit width,
Good points. It seems like it would take more work upfront to re-compute the page sizes depending on the max key.
Certainly some users put a huge premium on file size. This also comes up a lot with the nvCOMP team where there are often runtime/filesize tradeoffs. |
I did a quick test and found that each page winds up with a wide range of keys due to the parallel nature of the dictionary construction. This will need more thought. |
Thank you @etseidl for testing this. Do you think we could do better by sorting the keys descending based on the number of occurrences in the column chunk? (and then using dynamic bit width for the pages) If sorting on number of occurrences doesn't work well, then I suppose we would be stuck with a more difficult optimization such as filling the first 2^16 keys to cover the most number of pages. Does this sound right to you? |
Looks like this issue has continued interest from the community. It seems like the next steps to consider would be:
|
Thanks @abellina for sharing the example files with a +7% file size increase between CPU and GPU writers. Are there any details you can share about the data that led to the reported +30% file size increase? |
Original title:
"[FEA] research enabling BIT_PACKED encoding for columns in parquet writer"
We have a user report of larger sizes for parquet encoded files via the GPU as opposed to Spark CPU. With their sample data, I can get a 30% increase in the GPU file size vs the CPU. I have been able to produce a single row group and the same number of files, so I am down to column encodings. The types of columns are all INT64 nullable columns.
It looks like one of the differences between the two files is that in cuDF columns are not using the BIT_PACKED encoding:
CPU (
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
):GPU (
ENC:PLAIN_DICTIONARY,RLE
):Discussing with @nvdbaranec, he suggested that BIT_PACKED could be enough of a reason for the difference. I have generated two files with my own mock data (just sequences of longs) and encoded it with the CPU and the GPU. I have placed two of the generated file in this zip file:
bit_packed_example.zip
I would appreciate any comments. If you want me to try a small change in cuDF and rebuild/retest, I am happy to do so.
The text was updated successfully, but these errors were encountered: