Compute and use the initial string offset when building nested large string cols with chunked parquet reader
#17702
base: branch-25.02
CC: @etseidl would love your review here as well if possible!
@@ -51,6 +51,9 @@ TEST_F(MergeTest, MergeLargeStrings)
  CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(c, input);
}

// Unset the LIBCUDF_LARGE_STRINGS_THRESHOLD if already set.
Setting the env var to a smaller number may cause this test to fail (and rightfully so), so it is unset locally here.
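For readers who want to see what locally unsetting the variable could look like, here is a minimal sketch. It assumes a POSIX platform (`setenv`/`unsetenv`), and the `scoped_env_unset` guard is a hypothetical helper for illustration, not something in libcudf or in this PR. Unlike a bare `unsetenv` call, it restores the previous value when the scope ends.

```cpp
#include <cstdlib>    // std::getenv
#include <stdlib.h>   // setenv / unsetenv (POSIX, not standard C++)
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical RAII guard (not part of libcudf): unsets an environment
// variable for the lifetime of the guard and restores it afterwards.
class scoped_env_unset {
 public:
  explicit scoped_env_unset(std::string name) : name_(std::move(name))
  {
    // Remember the current value, if any, before removing the variable.
    if (char const* value = std::getenv(name_.c_str())) { saved_ = value; }
    unsetenv(name_.c_str());
  }

  ~scoped_env_unset()
  {
    // Restore only if the variable was set when the guard was created.
    if (saved_) { setenv(name_.c_str(), saved_->c_str(), 1); }
  }

  scoped_env_unset(scoped_env_unset const&)            = delete;
  scoped_env_unset& operator=(scoped_env_unset const&) = delete;

 private:
  std::string name_;
  std::optional<std::string> saved_;
};
```

Inside a test body, declaring `scoped_env_unset guard{"LIBCUDF_LARGE_STRINGS_THRESHOLD"};` would keep the variable unset until the end of the enclosing scope.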
 * atomically update the initial string offset to be used during large string column construction
 */
template <int block_size>
__device__ void compute_string_offsets(page_state_s* const state,
Comments about the name of this utility function are welcome! 🙂
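To illustrate the "atomically update the initial string offset" part of the doc comment above, here is a host-side sketch. It assumes the update is a minimum over the per-page offsets, which the excerpt does not confirm. It also swaps in `std::atomic` with a compare-exchange loop for whatever device-side atomic the kernel actually uses. The name `atomic_fetch_min` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: lowers `target` to `candidate` if `candidate` is
// smaller, using a compare-exchange loop. Safe to call concurrently.
inline void atomic_fetch_min(std::atomic<std::uint64_t>& target, std::uint64_t candidate)
{
  std::uint64_t current = target.load(std::memory_order_relaxed);
  // On CAS failure `current` is refreshed with the latest stored value,
  // so the loop re-checks whether `candidate` is still smaller.
  while (candidate < current &&
         !target.compare_exchange_weak(current, candidate, std::memory_order_relaxed)) {
  }
}
```

Starting from the maximum representable value and applying this once per page leaves the smallest page offset in `target`, regardless of the order in which pages are processed.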
Description
Closes #17692.
This PR enables computing the `str_offset` required to correctly compute the offsets columns for nested large strings columns with the chunked Parquet reader when `chunk_read_limit` is small, resulting in multiple output table chunks per subpass.
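As a rough illustration of why an initial offset matters when one subpass yields several output chunks, here is a host-side sketch. It assumes string offsets are first known relative to the whole subpass and that each output chunk needs an offsets column starting at zero. The function `chunk_offsets` and its parameters are hypothetical and do not reflect the reader's actual kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: builds the offsets column for rows [row_begin, row_end) of a
// subpass. `subpass_offsets` holds num_rows + 1 cumulative byte offsets
// relative to the start of the subpass (64-bit, as for large strings).
std::vector<std::int64_t> chunk_offsets(std::vector<std::int64_t> const& subpass_offsets,
                                        std::size_t row_begin,
                                        std::size_t row_end)
{
  assert(row_begin <= row_end && row_end < subpass_offsets.size());

  // Initial string offset of this chunk: bytes that precede its first row.
  std::int64_t const str_offset = subpass_offsets[row_begin];

  std::vector<std::int64_t> out;
  out.reserve(row_end - row_begin + 1);
  // Rebase every offset so the chunk's offsets column starts at zero.
  for (std::size_t i = row_begin; i <= row_end; ++i) {
    out.push_back(subpass_offsets[i] - str_offset);
  }
  return out;
}
```

For example, with subpass offsets `{0, 3, 5, 9, 12}` split into two chunks of two rows each, the second chunk has an initial offset of 5 and its rebased offsets are `{0, 4, 7}`. Without the subtraction, the second chunk's offsets would start at 5 and point past its own character data.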