[FEA] Parquet reader code cleanup, re: nested columns vs columns with lists. #11793
Labels: 0 - Backlog, cuIO, libcudf, proposal
In the parquet reader there are two similar-sounding but distinct pieces of terminology:
This causes confusion and bugs for a couple of reasons. A given (cudf) output column can contain both nested and non-nested hierarchies. For example:
This single output column contains two separate input column hierarchies: A->B and A->C->D. A->B does not contain repetition data and is therefore not a nested hierarchy in the reader's sense. A->C->D does contain repetition data and does constitute a nested hierarchy. However, both are nested in the cudf sense (more than one level deep).
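To make the distinction concrete, here is a minimal Python sketch of the two predicates applied to the hierarchies A->B and A->C->D. It is not libcudf code: `SchemaNode`, `is_nested_cudf_sense`, `has_repetition_data` and the other names are hypothetical, and the schema shape (A as a struct, C as the list-like node that carries repetition) is an assumption based on the description above. It relies only on the general Parquet rule that a leaf's maximum repetition level equals the number of repeated nodes on its path.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class SchemaNode:
    """Hypothetical schema node for illustration; not a libcudf type."""
    name: str
    # True if this node adds a Parquet repetition level (i.e. it is a list).
    repeated: bool = False
    children: List["SchemaNode"] = field(default_factory=list)


Path = Tuple[SchemaNode, ...]


def leaf_paths(node: SchemaNode, prefix: Path = ()) -> Iterator[Path]:
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)


def is_nested_cudf_sense(path: Path) -> bool:
    """'Nested' in the cudf sense: the hierarchy is more than one level deep."""
    return len(path) > 1


def max_repetition_level(path: Path) -> int:
    """Number of repeated (list) nodes along the path."""
    return sum(1 for node in path if node.repeated)


def has_repetition_data(path: Path) -> bool:
    """'Nested' in the reader's sense: the hierarchy carries repetition data."""
    return max_repetition_level(path) > 0


def classify(root: SchemaNode) -> Dict[str, Tuple[bool, bool]]:
    """Map each leaf path to (nested in cudf sense, has repetition data)."""
    return {
        "->".join(node.name for node in path): (
            is_nested_cudf_sense(path),
            has_repetition_data(path),
        )
        for path in leaf_paths(root)
    }


# Assumed shape: A is a struct with a plain child B and a list child C
# whose element is D.
schema = SchemaNode("A", children=[
    SchemaNode("B"),
    SchemaNode("C", repeated=True, children=[SchemaNode("D")]),
])

summary = classify(schema)
for name, (cudf_nested, has_rep) in summary.items():
    print(f"{name}: cudf-nested={cudf_nested}, has-repetition={has_rep}")
```

Keeping the two questions as separately named predicates, rather than a single "is nested" flag, is one way the decoding paths could avoid mixing them up.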
We handle these two fundamental situations differently during the decoding process, so confusing the two concepts can easily cause bugs.
It would be great to do a pass that cleans this up in a comprehensive way.