[BUG] Special case Parquet LIST names appear to be ignored #12043

revans2 · 2022-11-01T21:11:13Z

Describe the bug
The Parquet specification at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, in its discussion of backward compatibility for lists, says:

If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.

The examples given for these are:

// List<OneTuple<String>> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated group array {
    required binary str (UTF8);
  };
}

// List<OneTuple<String>> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated group my_list_tuple {
    required binary str (UTF8);
  };
}
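The quoted rule can be sketched as a small predicate. This is only an illustration of the rule as quoted above, not code from CUDF or Spark; the function names `is_element_group` and `expected_type` are hypothetical, and the fall-through branch assumes the standard three-level LIST layout, where the repeated group is just a wrapper around a single element field.

```python
def is_element_group(list_name: str, repeated_name: str, num_fields: int) -> bool:
    """Return True when the repeated group is itself the list element type.

    Covers only the rule quoted above: a repeated group with exactly one
    field that is named "array" or "<list name>_tuple".
    """
    return num_fields == 1 and repeated_name in ("array", f"{list_name}_tuple")


def expected_type(list_name: str, repeated_name: str, child_type: str) -> str:
    """Describe the expected column type for a LIST whose repeated group has one field."""
    if is_element_group(list_name, repeated_name, num_fields=1):
        # Special-case names: the group is the element, so it stays a struct.
        return f"LIST<STRUCT<{child_type}>>"
    # Assumption: otherwise this is the standard three-level layout,
    # so the single child field is the element.
    return f"LIST<{child_type}>"


print(expected_type("my_list", "array", "INT32"))
print(expected_type("my_list", "my_list_tuple", "INT32"))
print(expected_type("my_list", "list", "INT32"))
```

Under this reading, both special-case names yield a list of single-field structs, while an ordinary wrapper name such as `list` yields a list of the child type directly.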

I implemented some tests based on this and saw that CUDF is able to parse the data, but it does not return the same types as Spark does, nor does it return what I would expect from the examples.

In files.zip there are two parquet files.

SPECIAL_ARRAY_LIST_TEST.parquet has a footer schema of

message spark {
  required group my_list (LIST) {
    repeated group array {
      required int32 item;
    }
  }
}

When I parse the data with CUDF I get back a table with types like Table<LIST<INT32>>, but Spark expects the data to look like Table<LIST<STRUCT<INT32>>>.

Pandas appears to do the same thing as Spark, but I am not enough of a pandas expert to be 100% sure that it is the same thing.

>>> pd.read_parquet("SPECIAL_ARRAY_LIST_TEST.parquet")
                      my_list
0  [{'item': 0}, {'item': 1}]
>>> pd.read_parquet("SPECIAL_ARRAY_LIST_TEST.parquet").info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   my_list  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes

The other file is essentially the same, but it uses the _tuple special case instead of array.

Steps/Code to reproduce bug
Try to read the attached files in CUDF and see if they match the desired types/schema.

Expected behavior
The types CUDF returns should match what Spark returns (a LIST of STRUCT for these files), but it looks like they do not.

Additional context
This is probably not super critical because it is an odd corner case that is not likely to be very common, but technically it is returning the wrong data.


revans2 added the bug, Needs Triage, and Spark labels Nov 1, 2022

revans2 mentioned this issue Nov 1, 2022

[BUG] the special "array" name and "_tuple" suffix is not supported for parquet reads NVIDIA/spark-rapids#6968

Open

revans2 changed the title ~~[BUG] Special case LIST names appear to be ignored~~ [BUG] Special case Parquet LIST names appear to be ignored Nov 1, 2022

vuule added the cuIO label and removed the Needs Triage label Nov 8, 2022

GregoryKimball added this to the Parquet continuous improvement milestone Nov 19, 2022

GregoryKimball added this to libcudf Jan 6, 2023

GregoryKimball added the libcudf label Apr 2, 2023

GregoryKimball removed this from libcudf Oct 26, 2023
