[FEA] Parquet reader code cleanup, re: nested columns vs columns with lists. #11793

Open
nvdbaranec opened this issue Sep 27, 2022 · 0 comments
Labels
0 - Backlog (In queue waiting for assignment), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code), proposal (Change current process or code)

nvdbaranec commented Sep 27, 2022

In the parquet reader there are two similar-sounding but distinct pieces of terminology:

  • Nested columns. This has the same meaning as in cudf: any column involving structs or lists at any level.
  • Nested hierarchies. These involve only columns (or parts of columns) that contain lists, which Parquet represents via repetition levels.

The overlap causes confusion and bugs, because a single (cudf) output column can contain both nested and non-nested hierarchies. For example:

         A (struct)
       /   \
      B     C (list)
            |
            D (int)

This single output column contains two separate input column hierarchies: A->B and A->C->D. A->B does not contain repetition data and is therefore not a nested hierarchy. A->C->D does contain repetition data and is a nested hierarchy. However, both are nested in the cudf sense (more than one level deep).
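To make the distinction concrete, here is a minimal sketch in plain Python. It is not the libcudf reader code: `Node`, `leaf_paths`, `is_nested_column`, `max_repetition_level`, and `classify` are hypothetical names invented for illustration, and B is assumed to be an int because the diagram does not give its type. The sketch also assumes the usual Parquet list encoding, where each list on a path contributes one repetition level.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical schema node: kind is "struct", "list", or a leaf type."""
    name: str
    kind: str
    children: List["Node"] = field(default_factory=list)


def leaf_paths(node: Node, prefix: Tuple[Node, ...] = ()) -> Iterator[Tuple[Node, ...]]:
    """Yield one root-to-leaf path per input (Parquet) column hierarchy."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)


def is_nested_column(root: Node) -> bool:
    """Nested in the cudf sense: the output column involves structs or lists."""
    return root.kind in ("struct", "list")


def max_repetition_level(path: Tuple[Node, ...]) -> int:
    """Count the lists on the path (assumed: one repetition level per list)."""
    return sum(1 for n in path if n.kind == "list")


def classify(root: Node) -> Dict[str, bool]:
    """Map each hierarchy (e.g. "A->C->D") to whether it is a nested hierarchy."""
    return {
        "->".join(n.name for n in path): max_repetition_level(path) > 0
        for path in leaf_paths(root)
    }


# The example column from the diagram above (B's type is an assumption).
column_a = Node("A", "struct", [
    Node("B", "int"),
    Node("C", "list", [Node("D", "int")]),
])

print(is_nested_column(column_a))  # property of the whole output column
print(classify(column_a))          # property of each input hierarchy
```

The point of the sketch is that the two questions are asked at different granularities: "nested column" is a property of the output column as a whole, while "nested hierarchy" is a property of each root-to-leaf path within it.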

We handle these two fundamental situations differently during the decoding process, so confusing the two concepts can easily cause bugs.

It would be great to do a comprehensive cleanup pass so that the code keeps these two concepts clearly distinct.

@nvdbaranec nvdbaranec added feature request New feature or request Needs Triage Need team to review and classify cuIO cuIO issue labels Sep 27, 2022
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment proposal Change current process or code code quality and removed feature request New feature or request Needs Triage Need team to review and classify labels Oct 21, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023