Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temporary fix Parquet metadata with empty value string being ignored from writing #14026

Merged
merged 5 commits into from
Sep 6, 2023

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Aug 31, 2023

When writing to Parquet files, Spark needs to write pairs of key-value strings into files' metadata. Sometimes the value strings are just an empty string. Such empty string is ignored from writing into the file, causing other applications (such as Spark) to read the value and interpret it as a null instead of an empty string as in the original input, as described in #14024. This is wrong and led to data corruption as I tested.

This PR intentionally modifies the empty value string into a space character to workaround the bug. This is a temporary fix while waiting for a better fix to be worked on.

@ttnghia ttnghia added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change Reliability labels Aug 31, 2023
@ttnghia ttnghia requested a review from revans2 August 31, 2023 20:34
@ttnghia ttnghia self-assigned this Aug 31, 2023
@ttnghia ttnghia requested a review from a team as a code owner August 31, 2023 20:34
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 31, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ttnghia ttnghia changed the title Fix Parquet metadata with empty value string is ignored from writing Fix Parquet metadata with empty value string being ignored from writing Aug 31, 2023
@ttnghia
Copy link
Contributor Author

ttnghia commented Aug 31, 2023

/ok to test

@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 5, 2023

/ok to test

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay to me. Just not sure if we want to do the work around in the Spark plugin instead of doing it here?

@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 6, 2023

I think we should fix in cudf instead, because there may be more metadata in the form of {"key_str", ""} pairs being written in the future. Currently we are having problem with a few known keys such as legacyDateTime and legacyINT96 thus if we only fix for these keys in the plugin then I'm not sure if we still miss some other and the problem may show up again.

A permanent fix for the problem should be tracked by #14024. After that is closed, we can revert this.

@ttnghia ttnghia changed the title Fix Parquet metadata with empty value string being ignored from writing Temporary fix Parquet metadata with empty value string being ignored from writing Sep 6, 2023
@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 6, 2023

/merge

@rapids-bot rapids-bot bot merged commit 609f894 into rapidsai:branch-23.10 Sep 6, 2023
@ttnghia ttnghia deleted the fix_parquet_meta branch September 6, 2023 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue Java Affects Java cuDF API. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants