[BUG] Special case Parquet LIST names appear to be ignored #12043

Open
revans2 opened this issue Nov 1, 2022 · 0 comments
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

revans2 (Contributor) commented Nov 1, 2022

Describe the bug
The Parquet specification at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, in its discussion of backward compatibility for lists, says:

If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.

The examples given for these are:

// List<OneTuple<String>> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated group array {
    required binary str (UTF8);
  };
}

// List<OneTuple<String>> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated group my_list_tuple {
    required binary str (UTF8);
  };
}
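
To make the quoted rule concrete, here is a minimal sketch of the name check it describes. The function name and arguments are hypothetical, chosen only for illustration; this is not code from CUDF, Spark, or any Parquet library, and it covers only the one-field case quoted above.

```python
def repeated_group_is_element(list_name: str, repeated_name: str, num_fields: int) -> bool:
    """Return True when a one-field repeated group is itself the element type.

    Sketch of the quoted rule only: the repeated group must be named `array`
    or `<list_name>_tuple`. Groups with a different field count fall under
    other rules of the specification and are rejected here.
    """
    if num_fields != 1:
        raise ValueError("this sketch only covers repeated groups with exactly one field")
    return repeated_name == "array" or repeated_name == f"{list_name}_tuple"


# Both spec examples name the repeated group in a way that triggers the rule,
# so the element should be a one-field struct rather than the inner field.
print(repeated_group_is_element("my_list", "array", 1))
print(repeated_group_is_element("my_list", "my_list_tuple", 1))
```

When the check is true, the reader is expected to surface the repeated group as a required struct element instead of unwrapping it to its single field.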

I implemented some tests based on this and saw that CUDF is able to parse the data, but it does not return the same types as Spark does, nor does it return what I would expect from the examples.

In files.zip there are two parquet files.

SPECIAL_ARRAY_LIST_TEST.parquet has a footer schema of

message spark {
  required group my_list (LIST) {
    repeated group array {
      required int32 item;
    }
  }
}

When I parse the data with CUDF I get back a table with types like Table<LIST<INT32>>, but Spark expects the data to look like Table<LIST<STRUCT<INT32>>>.

Pandas appears to do the same thing as Spark, but I am not enough of a pandas expert to be 100% sure.

>>> pd.read_parquet("SPECIAL_ARRAY_LIST_TEST.parquet")
                      my_list
0  [{'item': 0}, {'item': 1}]
>>> pd.read_parquet("SPECIAL_ARRAY_LIST_TEST.parquet").info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   my_list  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes
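
Because the column dtype is reported only as object, the struct shape has to be confirmed from the cell contents. The following is a rough sketch of such a check; the helper name is hypothetical, and the flat value is only an assumed picture of what a LIST<INT32> result would hold.

```python
def cells_hold_structs(column) -> bool:
    """Return True when every element of every list cell is a dict (a struct row).

    `column` is any iterable of list-like cells, e.g. df["my_list"].
    An empty column trivially returns True.
    """
    return all(isinstance(elem, dict) for cell in column for elem in cell)


# The single row printed by pandas above: a list of one-field structs.
struct_column = [[{"item": 0}, {"item": 1}]]
# Assumed shape if the struct level were dropped (LIST<INT32>).
flat_column = [[0, 1]]

print(cells_hold_structs(struct_column))
print(cells_hold_structs(flat_column))
```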

The other file is essentially the same, but it is using the _tuple special case instead of array.

Steps/Code to reproduce bug
Try to read the attached files in CUDF and see if they match the desired types/schema.
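
One possible way to script that comparison is sketched below. The helper is hypothetical and duck-typed: it assumes the list dtype exposes `element_type` and that a struct dtype exposes a non-empty `fields` mapping, which is my understanding of `cudf.ListDtype` and `cudf.StructDtype` but should be verified against the installed version. The CUDF call itself is left commented out because it needs a GPU and the attached file; stand-in objects are used so the sketch runs anywhere.

```python
from types import SimpleNamespace


def list_element_is_struct(list_dtype) -> bool:
    """Return True when the list's element type looks like a struct.

    Assumption: `list_dtype.element_type` exists, and struct dtypes carry a
    non-empty `fields` mapping while scalar dtypes have none (or None).
    """
    elem = list_dtype.element_type
    return bool(getattr(elem, "fields", None))


# Stand-ins for the two shapes discussed in this issue (not real cudf dtypes).
expected = SimpleNamespace(element_type=SimpleNamespace(fields={"item": "int32"}))
observed = SimpleNamespace(element_type=SimpleNamespace(fields=None))

print(list_element_is_struct(expected))
print(list_element_is_struct(observed))

# With CUDF available, the same check would be applied roughly like this:
# import cudf
# df = cudf.read_parquet("SPECIAL_ARRAY_LIST_TEST.parquet")
# print(df["my_list"].dtype, list_element_is_struct(df["my_list"].dtype))
```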

Expected behavior
They should match, but it looks like they do not.

Additional context
This is probably not super critical because it is an odd corner case that is not likely to be very common, but technically it is returning the wrong data.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Nov 1, 2022
@revans2 revans2 changed the title [BUG] Special case LIST names appear to be ignored [BUG] Special case Parquet LIST names appear to be ignored Nov 1, 2022
@vuule vuule added cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Nov 8, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023