Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Optimization of repetition and definition level decoding in the parquet reader kernel. #12633

Closed
nvdbaranec opened this issue Jan 27, 2023 · 2 comments
Assignees
Labels
cuIO cuIO issue feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@nvdbaranec
Copy link
Contributor

The parquet reader kernels (gpuDecodePages and gpuComputePageSizes) are heavily bottlenecked by the repetition/definition level decoding steps.

These data streams produce the information needed to decode output value indices, nulls and nesting depth information. We currently use a single warp to decode batches of the above data in a double buffered way, handing these buffers off to 3 further warps which then decode the actual output data.

The idea is that the level decoding warp can provide batches of information faster than the value decoding warps can consume them. In practice, this doesn't happen: the value decoding warps sit idle while the level decoding warp slogs through the data.

A proposed optimization here is:

Change the fundamental way the kernel is structured to a more direct producer-consumer model. The primary bottleneck is the single warp which walks through the level data. Eg:

[4 byte header + level data chunk] [4 byte header + level data chunk] ....

A better structure would be to have a single warp which produces a list of each one of the chunks and puts them in a work queue. Each of the individual chunks would then get it's own warp to do the decoding. So instead of having 1 warp decoding say 64 chunks, we could have 64 warps each decoding one chunk. There would have to be some additional coordination to determine sizes and output positions, but that would mostly be in the form of a scan/prefix-sum between the warps which is fast. In addition, if we do this, it seems possible that we could eliminate the double buffered intermediate storage of output indices - each warp decoding the level stream could also simply copy/decode the output values/validity themselves directly.

@nvdbaranec nvdbaranec added feature request New feature or request Needs Triage Need team to review and classify cuIO cuIO issue improvement Improvement / enhancement to an existing function labels Jan 27, 2023
@nvdbaranec nvdbaranec self-assigned this Jan 27, 2023
@mattahrens mattahrens added the Performance Performance related issue label Jan 27, 2023
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@mattahrens mattahrens removed the Needs Triage Need team to review and classify label Apr 17, 2023
@nvdbaranec
Copy link
Contributor Author

Update on this. I did quite a bit of work related to this. Notably:

  • Coming up with a mechanism for decoding repetition/definition levels (as well as dictionaries) using a block-wide mechanic
    instead of just a single warp. It is considerably faster (up to 80% in the preprocess kernel and 50% in the decode kernel).
  • Conversion of the remainder of the preprocess and decode kernels to also operate block-wide. The preprocess kernel optimization has been merged. Optimization to decoding of parquet level streams #13203

For the decode kernel, overall optimization works, but growing the kernel from being warp-based to being block based resulted in a large growth in register and shared memory usage that undid the performance wins. Full heavyweight kernel can be found here: https://github.com/nvdbaranec/cudf/blob/3b13588f01bfb5ba55970c660641478de5fbcb38/cpp/src/io/parquet/page_data.cu#L2460

The conclusion reached was that we need to start exploring using multiple, smaller kernels that target specific types of page data (fixed width vs nested vs. dictionary, etc) and run them in parallel on mulitple streams. The idea is that the individual kernels remain slimmer and retain the optimization wins specific to their cases. A proof of concept for fixed-width, non-dictionary data can be found here: #13622

What remains now is to determine the exact mix of kernels to use. There are many possible ways to break them down, but starting with the ones that use the most similar profile of shared memory (eg, dictionary vs. non-dictionary) is probably a good start.

@GregoryKimball
Copy link
Contributor

Closing in favor of #14953

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

No branches or pull requests

4 participants