Temporary fix Parquet metadata with empty value string being ignored from writing #14026
/ok to test
/ok to test
This looks okay to me. I'm just not sure whether we want to do the workaround in the Spark plugin instead of doing it here.
I think we should fix it in cudf instead, because there may be more metadata in the form of
A permanent fix for the problem should be tracked by #14024. After that is closed, we can revert this.
/merge
When writing Parquet files, Spark needs to write key-value string pairs into the files' metadata. Sometimes the value is an empty string. Such an empty string is currently dropped when the file is written, so other applications (such as Spark) that read the value interpret it as `null` instead of the empty string in the original input, as described in #14024. This is wrong and led to data corruption when I tested it.

This PR intentionally replaces an empty value string with a single space character to work around the bug. This is a temporary fix while a better fix is being worked on.
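The substitution described above can be sketched in a few lines. This is only an illustration of the idea in plain Python, not the code changed by this PR: the function name `patch_empty_metadata_values` and the example keys are hypothetical, and the real change lives in the writer itself.

```python
from typing import Dict


def patch_empty_metadata_values(metadata: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the key-value metadata in which every empty-string
    value is replaced by a single space, so the value is not dropped on write.

    Hypothetical helper for illustration only; not part of any library API.
    """
    return {key: (" " if value == "" else value) for key, value in metadata.items()}


# Hypothetical metadata: one normal value and one empty value.
example = {"writer.version": "1.0", "comment": ""}
patched = patch_empty_metadata_values(example)
```

Note that a reader then sees a single space rather than an empty string, so the round trip is still not exact. That is presumably why the change is described as temporary until #14024 is resolved.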