[BUG] Malformed fixed length byte array Parquet file loads corrupted data instead of error #14104
Comments
Signed-off-by: Jason Lowe <[email protected]>
…#9235) Signed-off-by: Jason Lowe <[email protected]>
These are like puzzles 😅 So the issue with this file is that the schema says the |
Thank you @etseidl for looking into this. Of your proposals I prefer:
I don't mind doing more work if we are going to crash anyway. What do you think is the simplest check to implement?
@GregoryKimball I think the simplest would be to walk through the schema in some fashion, find the max definition level for each column, and then check the ColumnIndex for each column chunk of that column to see whether the num_nulls field is consistent with the max definition level (i.e., if max_def == 0 and num_nulls > 0, then error). This would be doable on the host without digging into the page data, but it requires that column indexes are present (which they are for this file). The next option would be to do the same thing but walk the page headers in the file to get the null counts; that would require V2 data page headers. The only surefire way is to detect the buffer overrun when decoding the values (which is what parquet-mr and arrow seem to do), but as I've said, erroring out of the kernel when that is detected and communicating the error to the host is an issue.
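The host-side check described in this comment could look roughly like the sketch below. It is plain Python for illustration only, not libcudf code: `ColumnChunkInfo`, its fields, and `check_null_counts` are hypothetical names, and the sketch assumes the max definition level and the per-page null counts have already been read from the file metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunkInfo:
    """Hypothetical summary of one column chunk, built from file metadata."""
    column_name: str
    max_def_level: int                     # 0 means the column can never be null
    page_null_counts: Optional[List[int]]  # per-page null counts, or None if no column index


def check_null_counts(chunks: List[ColumnChunkInfo]) -> None:
    """Raise ValueError if a column that cannot hold nulls reports nulls."""
    for chunk in chunks:
        if chunk.page_null_counts is None:
            # No column index present: this check cannot say anything here.
            continue
        num_nulls = sum(chunk.page_null_counts)
        if chunk.max_def_level == 0 and num_nulls > 0:
            raise ValueError(
                f"column '{chunk.column_name}' has max definition level 0 "
                f"but its column index reports {num_nulls} nulls"
            )
```

As the comment notes, a check like this only helps when column indexes are present; chunks without one are skipped rather than rejected.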
The decode kernel does not detect the error, |
#14167 is taking the first step toward solving this case. We will also need to update the decode kernel to detect this error.
Describe the bug
Using libcudf to load a Parquet file that is malformed "succeeds" by producing a table with some corrupted rows rather than returning an error as expected. Spark 3.5, parquet-mr 1.13.1, and pyarrow 13 all produce unexpected EOF errors when trying to load the same file.
Steps/Code to reproduce bug
Load https://github.com/apache/parquet-testing/blob/master/data/fixed_length_byte_array.parquet using libcudf. Note that it will produce a table with 1000 rows with no nulls, and some of the rows have a list of bytes longer than 4 entries. According to the docs for the file, the data is supposed to be a single column with a fixed-length byte array of size 4, yet some rows load with more than four bytes, some with no bytes.
Expected behavior
libcudf should return an error when trying to load the file rather than producing corrupted rows.
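For illustration, the behaviour of the other readers (an unexpected EOF instead of corrupted rows) amounts to a bounds check while slicing fixed-width values out of a page buffer. The sketch below is plain Python, not libcudf or parquet-mr code; `decode_fixed_len_byte_array` is a hypothetical helper, and it assumes PLAIN encoding, where fixed-length values are stored back to back with no length prefix.

```python
from typing import List


def decode_fixed_len_byte_array(buf: bytes, num_values: int, type_length: int) -> List[bytes]:
    """Slice num_values fixed-width values out of buf, failing on a short buffer."""
    if type_length <= 0:
        raise ValueError("type_length must be positive")
    needed = num_values * type_length
    if needed > len(buf):
        # The page claims more values than the buffer holds: report it
        # instead of reading past the end or returning odd-sized rows.
        raise EOFError(
            f"need {needed} bytes for {num_values} values of width "
            f"{type_length}, but only {len(buf)} are available"
        )
    return [buf[i:i + type_length] for i in range(0, needed, type_length)]
```

With a width of 4, `decode_fixed_len_byte_array(b"abcdefgh", 2, 4)` returns `[b"abcd", b"efgh"]`, while the same call on a 7-byte buffer raises `EOFError`.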