[BUG] String columns written with fastparquet seem to be read incorrectly via CUDF's Parquet reader #14258
Comments
Looking at a hex dump of the file, it seems the page data is padded with an extra 8 zero-valued bytes. The string length calculation is likely just using the data length minus ... TBH I'd consider this a bug on the write side... I'll have to check the spec to see if these padding bytes are forbidden or this is just a gray area. Looking at fastparquet, it seems the padding was added to get some tests to pass. See this commit (writer.py, lines 452 and 484). I wonder if it's worth bringing up with the fastparquet devs; adding garbage padding to byte array columns should not be necessary. The relevant line in the current fastparquet is here
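To make the padding observation concrete, here is a minimal sketch of how trailing bytes could be spotted in a PLAIN-encoded BYTE_ARRAY page, where each value is a 4-byte little-endian length followed by that many bytes. The helper names and sample values are made up for illustration; this is neither fastparquet nor libcudf code, and it assumes an uncompressed page holding only non-null values.

```python
import struct

def encode_plain_byte_array(values):
    """PLAIN-encode BYTE_ARRAY values: 4-byte little-endian length, then the bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)

def trailing_bytes(page_data, num_values):
    """Walk num_values length-prefixed entries and return whatever is left over."""
    pos = 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<I", page_data, pos)
        pos += 4 + length
    return page_data[pos:]

# Hypothetical sample values, not taken from the attached file.
values = [b"alpha", b"be", b"gamma"]
clean = encode_plain_byte_array(values)
padded = clean + b"\x00" * 8  # mimics the 8 zero bytes seen in the hex dump

print(len(trailing_bytes(clean, len(values))))   # 0
print(len(trailing_bytes(padded, len(values))))  # 8
```

Anything returned by `trailing_bytes` is data that the page size accounts for but that no value's length prefix claims.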
Just FYI, here's a profile showing the impact of having to do the string size calculation the hard way. The profile shows reading 50M lines from a large Parquet file containing PLAIN-encoded strings. The top profile is traversing the encoded string data, summing string lengths as it goes. Due to the structure of the data, this cannot be parallelized, so a single thread per page is doing this operation. The bottom profile uses the page data size from the header to calculate string sizes. The call to
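The two strategies being profiled can be sketched roughly as follows. This is a simplified CPU model in Python, not the libcudf kernels; the function names are hypothetical, and it assumes an uncompressed PLAIN-encoded page in which every value is non-null.

```python
import struct

def chars_by_walking(page_data, num_values):
    """Slow path: each length prefix must be read to find the next one, so this is serial."""
    total, pos = 0, 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<I", page_data, pos)
        total += length
        pos += 4 + length
    return total

def chars_from_page_size(page_size, num_values):
    """Fast path: page size from the header minus one 4-byte length prefix per value."""
    return page_size - 4 * num_values

# Hypothetical sample values for illustration.
values = [b"spark", b"rapids", b"cudf"]
clean = b"".join(struct.pack("<I", len(v)) + v for v in values)
padded = clean + b"\x00" * 8  # page with 8 bytes of trailing padding

print(chars_by_walking(clean, 3), chars_from_page_size(len(clean), 3))    # 15 15
print(chars_by_walking(padded, 3), chars_from_page_size(len(padded), 3))  # 15 23
```

The fast path is only correct when the page holds nothing but length prefixes and string bytes, which is why padding counted in the page size throws it off by exactly the padding length.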
Thank you for the analysis, @etseidl. This has me curious. The main reason I considered this might be something we should address is that the Spark Parquet reader and the parquet tools seem to read the file correctly.
Yep, because they're reading a page at a time in batches, so they don't need to worry about exact total sizes; they just read the length of each string as they consume it. Even libcudf as of last year would have read that file OK because the string reads were done in two passes. Now that we have the single-pass read, we need to rely on accurate metadata to get the string data copied into the correct places in the column buffer. As usual, the Parquet spec is silent on this. The closest I could find to an answer is in the section on data pages, where it is stated that the ... Edit: Actually, I missed this: "For data pages, the 3 pieces of information are encoded back to back, after the page header. No padding is allowed in the data page." Anyway, so as to not kill performance for all, would it be acceptable to add an option to do the more expensive string size calculation when necessary?
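A rough model of why a value-at-a-time reader is unaffected while a reader that trusts the header size is not. This is a deliberately simplified sketch, not how libcudf or parquet-mr are actually implemented: the function names are hypothetical, and the assumption that the final string ends at the end of the header-sized character buffer is made purely for illustration.

```python
import struct

def read_sequential(page_data, num_values):
    """Consume one length-prefixed value at a time; trailing bytes are never touched."""
    out, pos = [], 0
    for _ in range(num_values):
        (n,) = struct.unpack_from("<I", page_data, pos)
        out.append(page_data[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return out

def read_with_header_size(page_data, num_values):
    """Size the chars buffer from the page size, then let the last offset be its end."""
    total = len(page_data) - 4 * num_values   # assumed: header size minus length prefixes
    chars = bytearray(total)                  # zero-initialised character buffer
    offsets, src, dst = [], 0, 0
    for _ in range(num_values):
        (n,) = struct.unpack_from("<I", page_data, src)
        chars[dst : dst + n] = page_data[src + 4 : src + 4 + n]
        offsets.append(dst)
        src += 4 + n
        dst += n
    offsets.append(total)                     # final offset comes from the metadata
    return [bytes(chars[offsets[i] : offsets[i + 1]]) for i in range(num_values)]

# Hypothetical sample values; the padded page has 8 trailing zero bytes.
values = [b"a", b"bc", b"def"]
padded = b"".join(struct.pack("<I", len(v)) + v for v in values) + b"\x00" * 8

print(read_sequential(padded, 3))        # [b'a', b'bc', b'def']
print(read_with_header_size(padded, 3))  # last entry is b'def' plus 8 NUL bytes
```

In this model only the last row is affected, and it gains exactly as many NUL bytes as the page was padded with.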
Thank you @etseidl and @mythrocks for studying this anomaly. If we did have a reader option to pre-compute sizes, would Spark-RAPIDS have to always set this option to make sure we are correctly avoiding this behavior in the fastparquet writer? Do you think there could be any sensible postprocessing options to trim the null characters? TBH I didn't even know null characters could exist.
Yes...or we could turn it around and make the slow pre-compute the default, and enable the faster version on demand. But if fastparquet fixes their writer (I don't see why they wouldn't, they did not do the padding for V2 pages, for instance), then the cudf reader would be slower by default for no reason. A brittle option would be to check the
Well, you'd wind up with a column buffer with holes in it.
They're just byte arrays, with an annotation to interpret as UTF8 strings, so null chars are just fine. The trouble here is fastparquet is adding this padding and counting it in the data size, even though it's not strictly data.
Description
This was uncovered in Spark tests that compare Parquet read/write compatibility with fastparquet.
The last row of a String column written with fastparquet seems to be interpreted by CUDF as having more null characters at the end than expected.
Repro
I'll spare the Scala/Spark details in this bug. Here is a zipped Parquet file that seems to be read differently in CUDF.
From NVIDIA/spark-rapids#9387:
It would be good to check with the CUDF native Parquet reader, and compare against the results from parquet-mr.