Compute and use the initial string offset when building `nested` large string cols with chunked parquet reader #17702

mhaseeb123 · 2025-01-09T05:07:56Z

Description

This PR computes the `str_offset` needed to build correct offsets columns for nested large strings columns in the chunked Parquet reader. The problem arises when `chunk_read_limit` is small enough that a single subpass produces multiple output table chunks.
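As a rough host-side illustration of why the starting offset matters (a sketch only, not the libcudf implementation; `sizes_to_offsets_sketch` and its signature are hypothetical): when an output chunk does not begin at the start of the string data, its offsets can be built as a running sum of string sizes seeded with an initial offset, and the chunk's byte count is then the last offset minus that seed.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; this is NOT the libcudf API.
// Builds an offsets vector (size + 1 entries) from per-string sizes, starting
// the running sum at `initial_offset` instead of zero. Returns the offsets and
// the number of bytes covered, i.e. the last offset minus the initial offset.
std::pair<std::vector<std::int64_t>, std::int64_t> sizes_to_offsets_sketch(
  std::vector<std::int32_t> const& sizes, std::int64_t initial_offset)
{
  std::vector<std::int64_t> offsets;
  offsets.reserve(sizes.size() + 1);

  std::int64_t running = initial_offset;
  offsets.push_back(running);
  for (auto const size : sizes) {
    running += size;  // 64-bit accumulation, as large strings can exceed the int32 range
    offsets.push_back(running);
  }
  return {offsets, offsets.back() - initial_offset};
}
```

Seeding the sum with zero for every chunk would only be correct for the first chunk of a subpass; later chunks need the offset at which their first string starts.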

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-01-09T05:08:01Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mhaseeb123 · 2025-01-14T00:36:40Z

CC: @etseidl would love your review here as well if possible!

cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh

mhaseeb123 · 2025-01-14T00:43:49Z

cpp/tests/large_strings/merge_tests.cpp

@@ -51,6 +51,9 @@ TEST_F(MergeTest, MergeLargeStrings)
    CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(c, input);
  }

+  // Unset the LIBCUDF_LARGE_STRINGS_THRESHOLD if already set.


Setting the env var to a smaller number may cause this test to fail (and rightfully so), so I'm unsetting it locally here.

cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh

cpp/src/io/parquet/page_delta_decode.cu

mhaseeb123 · 2025-01-15T01:31:54Z

cpp/src/io/parquet/page_string_utils.cuh

+ * atomically update the initial string offset to be used during large string column construction
+ */
+template <int block_size>
+__device__ void compute_string_offsets(page_state_s* const state,


Comments about the name of this utility function are welcome! 🙂
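One way to picture the atomic update mentioned in the docstring above is the host-side sketch below. It assumes (the excerpt does not confirm this) that the update keeps the smallest string offset seen across pages. The real function is a CUDA device function; the names `atomic_fetch_min_sketch` and `initial_string_offset_sketch` are hypothetical and use `std::atomic` purely as an analogue.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: atomically lower `target` to `candidate` if it is smaller.
// Implemented with a compare-exchange loop so concurrent callers cannot lose updates.
inline void atomic_fetch_min_sketch(std::atomic<std::uint64_t>& target, std::uint64_t candidate)
{
  std::uint64_t current = target.load(std::memory_order_relaxed);
  while (candidate < current &&
         !target.compare_exchange_weak(current, candidate, std::memory_order_relaxed)) {
    // On failure, `current` is refreshed with the latest value and the
    // loop condition re-checks whether `candidate` is still smaller.
  }
}

// Assumption for illustration: each page reports the offset of its first string,
// and the initial string offset of the chunk is the smallest of those values.
inline std::uint64_t initial_string_offset_sketch(
  std::vector<std::uint64_t> const& page_str_offsets)
{
  std::atomic<std::uint64_t> initial{std::numeric_limits<std::uint64_t>::max()};
  for (auto const offset : page_str_offsets) {
    // Called sequentially here; the atomic keeps it safe if pages ran concurrently.
    atomic_fetch_min_sketch(initial, offset);
  }
  return initial.load();
}
```

Starting from the maximum representable value means any page offset replaces it, so the order in which pages report does not affect the result.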

Compute and use str_offset for large nested string cols.

1ce570c

github-actions bot added the libcudf label Jan 9, 2025

github-actions bot assigned mhaseeb123 Jan 9, 2025

mhaseeb123 added the 5 - DO NOT MERGE, 2 - In Progress, bug, cuIO, and non-breaking labels Jan 9, 2025

mhaseeb123 added 4 commits January 10, 2025 01:12

Clean up, add docstrings

e73432d

Fix copyright year

a146cd6

Fix comment

9f4ede3

Revert comment

edaff09

mhaseeb123 requested a review from nvdbaranec January 10, 2025 03:07

Merge branch 'branch-25.02' into fix/str_offset-nested-large-str-cols

2a279de

mhaseeb123 removed the 5 - DO NOT MERGE label Jan 10, 2025

Minor optimization. Sync stream only for str_offsets vector

0d2317a

mhaseeb123 changed the title ~~🚧 Compute and use the str_offset when reading nested large string cols with chunked Parquet reader.~~ Compute and use the str_offset when reading nested large string cols with chunked Parquet reader. Jan 14, 2025

mhaseeb123 requested a review from vuule January 14, 2025 00:35

mhaseeb123 marked this pull request as ready for review January 14, 2025 00:36

mhaseeb123 requested a review from a team as a code owner January 14, 2025 00:36

Remove leftover cout

9622620

mhaseeb123 requested a review from davidwendt January 14, 2025 00:38

mhaseeb123 commented Jan 14, 2025

cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh (outdated, resolved)

mhaseeb123 requested a review from PointKernel January 14, 2025 00:41

mhaseeb123 commented Jan 14, 2025

mhaseeb123 added the 3 - Ready for Review label and removed the 2 - In Progress label Jan 14, 2025

Remove code duplication with utility function

38652c0

mhaseeb123 added 2 commits January 14, 2025 02:06

fix copyright header

a57ccab

Remove explicit inline and simplify branch

3cffe1c

mhaseeb123 changed the title ~~Compute and use the str_offset when reading nested large string cols with chunked Parquet reader.~~ Compute and use the initial string offset when building nested large string cols with chunked parquet reader Jan 14, 2025

Refactor offset computing to avoid ambiguous use of util function.

29c1754

PointKernel reviewed Jan 14, 2025

cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh (outdated, resolved)

cpp/src/io/parquet/page_delta_decode.cu (outdated, resolved)

mhaseeb123 added 3 commits January 15, 2025 00:22

Change initial_offset type to int64 and subtract from last_elem

28835c3

Reuse code with a util function

46ba4ab

Merge branch 'branch-25.02' into fix/str_offset-nested-large-str-cols

ebea0cd

mhaseeb123 commented Jan 15, 2025

mhaseeb123 added 2 commits January 15, 2025 01:34

Minor optimization. Make const ptr to const page_state

73ced83

Merge branch 'branch-25.02' into fix/str_offset-nested-large-str-cols

b0acb4c

PointKernel approved these changes Jan 15, 2025
