Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Malformed fixed length byte array Parquet file loads corrupted data instead of error #14104

Open
jlowe opened this issue Sep 13, 2023 · 5 comments
Labels
1 - On Deck To be worked on next bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Copy link
Contributor

jlowe commented Sep 13, 2023

Describe the bug
Using libcudf to load a Parquet file that is malformed "succeeds" by producing a table with some corrupted rows rather than returning an error as expected. Spark 3.5, parquet-mr 1.13.1, and pyarrow 13 all produce unexpected EOF errors when trying to load the same file.

Steps/Code to reproduce bug
Load https://github.com/apache/parquet-testing/blob/master/data/fixed_length_byte_array.parquet using libcudf. Note that it will produce a table with 1000 rows with no nulls, and some of the rows have a list of bytes longer than 4 entries. According to the docs for the file, the data is supposed to be a single column with a fixed-length byte array of size 4, yet some rows load with more than four bytes, some with no bytes.

Expected behavior
libcudf should return an error when trying to load the file rather than producing corrupted rows.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS cuIO cuIO issue labels Sep 13, 2023
jlowe added a commit to jlowe/spark-rapids that referenced this issue Sep 13, 2023
jlowe added a commit to NVIDIA/spark-rapids that referenced this issue Sep 13, 2023
@etseidl
Copy link
Contributor

etseidl commented Sep 14, 2023

These are like puzzles 😅 So the issue with this file is that the schema says the flba_field is required, but the column index indicates there are nulls. Because the schema says the field is required, there is no definition level data either. I suppose this one could be detected by doing a sanity check during page header parsing (make sure the uncompressed size makes sense for how many values should be present), or even after reading the file metadata (schema says required but metadata says nulls are present). The page reader should also exit (but does not...it trusts the value counts and doesn't currently detect buffer overruns) when it reaches the end of the page data, but as with other errors that occur on the device, it's hard to communicate back to the host that some exception occurred.

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Sep 18, 2023

Thank you @etseidl for looking into this. Of your proposals I prefer:

after reading the file metadata (schema says required but metadata says nulls are present)

I don't mind doing more work if we are going to crash anyways. What do you think is the simplest check to implement?
(FYI @PointKernel)

@etseidl
Copy link
Contributor

etseidl commented Sep 18, 2023

@GregoryKimball I think the simplest would be to walk through the schema in some fashion, find the max definition level for each column, and then check the ColumnIndex for for each column chunk for that column and see if the num_nulls field is consistent with the max definition level (i.e, if max_def == 0 and num_nulls > 0 then error). This would be doable on the host without digging into the page data. But this requires that column indexes are present (which they are for this file). The next option would be to do the same thing, but instead walk the page headers in the file to get the null counts, but that would require V2 data page headers.

The only surefire way is to detect the buffer overun when decoding the values (which is what parquet-mr and arrow seem to do), but as I've said, erroring out of the kernel when that is detected and communicating the error to the host is an issue.

@vuule
Copy link
Contributor

vuule commented Sep 20, 2023

The decode kernel does not detect the error, page_state_s.error flag stays at zero when reading the linked file.
If this flag was raised, we could use it to communicate the error to the host. I think the overhead of such solution would be acceptable (4 byte D2H copy, w/o errors; plus atomicOr when an error occurs).

@GregoryKimball
Copy link
Contributor

#14167 is taking the first step to solving this case. We will also need to update the decode kernel to detect this error.

@GregoryKimball GregoryKimball added 1 - On Deck To be worked on next and removed Needs Triage Need team to review and classify labels Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - On Deck To be worked on next bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

4 participants