-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] CompactProtocolFieldWriter
does not write empty value string in key-value pair
#14024
Comments
The |
@etseidl I assume that the key-value pair is optional, but is it okay to include the key without the value? |
According to the thrift idl, yes. struct KeyValue {
1: required string key
2: optional string value
} Also optional in struct FileMetaData {
...
/** Optional key/value metadata **/
5: optional list<KeyValue> key_value_metadata
} |
That's interesting. Then such "optional" should mean the user can specify it as an optional, instead of the current behavior (the library ignores writing empty string without the user's knowledge). So #14000 should be the way to go instead. |
Yes, I think making more use of
|
Yeah, using |
…from writing (#14026) When writing to Parquet files, Spark needs to write pairs of key-value strings into files' metadata. Sometimes the value strings are just an empty string. Such empty string is ignored from writing into the file, causing other applications (such as Spark) to read the value and interpret it as a `null` instead of an empty string as in the original input, as described in #14024. This is wrong and led to data corruption as I tested. This PR intentionally modifies the empty value string into a space character to workaround the bug. This is a temporary fix while waiting for a better fix to be worked on. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #14026
Refactors the current `CompactProtocolReader` used to parse parquet file metadata. The main goal of the refactor is to allow easier use of `std::optional` fields in the thrift structs to prevent situations as in #14024 where an optional field is an empty string. The writer cannot distinguish between present-but-empty and not-present, so chooses the latter when writing the field. This PR adds a `ParquetFieldOptional` functor that can wrap the other field functors, obviating the need to write a new optional functor for each type. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) URL: #14097
This issue has been at least partially addressed. We need to make another pass over |
During debugging a Parquet issue, I found that the Parquet metadata is not correctly written by cudf. In particular, in a round trip test, a pair of metadata
{"key_str", ""}
is written then read as{"key_str", null}
in Spark. This is due to theCompactProtocolFieldWriter
does not write empty value string:We should write all value strings even for empty value.
However, I'm not sure if this is the expected behavior of compact protocol?
The text was updated successfully, but these errors were encountered: