If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
I implemented some tests based on this and saw that CUDF is able to parse the data, but it is not returning the same types as Spark does, nor does it return what I would expect the examples to show.
SPECIAL_ARRAY_LIST_TEST.parquet has a footer schema of:
message spark {
  required group my_list (LIST) {
    repeated group array {
      required int32 item;
    }
  }
}
When I parse the data with CUDF I get back a table with types like Table<LIST<INT32>>, but Spark expects the data to look like Table<LIST<STRUCT<INT32>>>.
Pandas appears to do the same thing, but I am not enough of a pandas expert to be 100% sure that it is the same thing.
The other file is essentially the same, but it is using the _tuple special case instead of array.
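To make the expected result concrete, here is a minimal sketch of the quoted backward-compatibility rule applied to the schema above. It is not CUDF's or Spark's actual code: the `Node` class, `type_string`, and `list_element_type` are hypothetical names made up for illustration, and the fallback branches for schemas that do not match the quoted rule are my assumption about how the other list layouts resolve.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for one Parquet schema node."""
    name: str
    repetition: str                       # "required", "optional", or "repeated"
    physical_type: Optional[str] = None   # set only for primitive leaves
    children: List["Node"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.physical_type is None


def type_string(node: Node) -> str:
    """Render a node as INT32 or STRUCT<...> in the notation used in this report."""
    if not node.is_group:
        return node.physical_type.upper()
    return "STRUCT<" + ",".join(type_string(c) for c in node.children) + ">"


def list_element_type(list_group: Node) -> str:
    """Resolve the element type of a LIST-annotated group."""
    repeated = list_group.children[0]
    single_field_group = repeated.is_group and len(repeated.children) == 1
    legacy_name = repeated.name in ("array", list_group.name + "_tuple")
    if single_field_group and legacy_name:
        # Quoted rule: the repeated group itself is the element type.
        return type_string(repeated)
    if single_field_group:
        # Assumption: standard three-level layout, the single child is the element.
        return type_string(repeated.children[0])
    # Assumption: a primitive or multi-field repeated field is itself the element.
    return type_string(repeated)


# The footer schema of SPECIAL_ARRAY_LIST_TEST.parquet
my_list = Node("my_list", "required", children=[
    Node("array", "repeated", children=[Node("item", "required", "int32")]),
])
print("LIST<" + list_element_type(my_list) + ">")  # LIST<STRUCT<INT32>>
```

Renaming the repeated group from `array` to `my_list_tuple` takes the same branch, which is why both attached files should come back as a list of structs.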
Steps/Code to reproduce bug
Try to read the attached files in CUDF and see if they match the desired types/schema.
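A rough sketch of that check, assuming `cudf` and `pyarrow` are both installed; the helper name `reported_types` is made up for this example, and no expected output is shown because it depends on the library versions in use.

```python
def reported_types(path):
    """Return (Arrow schema, cuDF dtypes) for one Parquet file.

    Hypothetical helper for this report; imports are local so the
    function can be defined on a machine without a GPU.
    """
    import pyarrow.parquet as pq
    import cudf

    arrow_schema = pq.read_schema(path)            # schema as pyarrow interprets the footer
    cudf_dtypes = cudf.read_parquet(path).dtypes   # per-column dtypes reported by cuDF
    return arrow_schema, cudf_dtypes
```

Calling `reported_types("SPECIAL_ARRAY_LIST_TEST.parquet")` and printing both results should show whether `my_list` comes back as a list of INT32 or a list of structs.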
Expected behavior
They should match, but it looks like they do not.
Additional context
This is probably not super critical because it is an odd corner case that is not likely to be very common, but technically it is returning the wrong data.
Describe the bug
The parquet specification at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, when talking about backwards compatibility in lists, gives the rule quoted at the top of this report, along with examples for each special case.
In files.zip there are two parquet files.