
[FEA] Increase reader throughput by pipelining IO and compute #13828

Open

GregoryKimball commented Aug 7, 2023

Labels: 0 - Backlog, cuIO, feature request, libcudf, Spark

-- this is a draft, please do not comment yet --

The end-to-end throughput of a file reader is limited by the sequential read speed of the underlying data source. We can use "pipelining" to overlap processing data on the device with reading data from the data source. Pipelining works by processing the data in batches, so that the previous chunk can be processed while the next chunk is being read. Pipelined readers show higher end-to-end throughput when the time saved by overlapping reading and processing is greater than the overhead added by processing smaller batches.
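
As a rough model (back-of-envelope, not from a profile): with `n` batches, a pipelined read takes about `max(T_io, T_compute)` plus per-batch overhead, versus `T_io + T_compute` for a sequential reader, so pipelining pays off whenever the overlapped portion `min(T_io, T_compute)` exceeds the overhead that batching adds.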

In cuIO, multibyte_split uses a pipelined design that reads text data in ~33 MB chunks (2^25 bytes) into a pinned host buffer, copies the data to the device, and then generates offsets data. Here's a profile of reading "Common Crawl" document data with cudf.read_text from a 410 MB file:
[Profile image: cudf.read_text reading a 410 MB Common Crawl text file]

Note how the get_next_chunk function includes the OS read and the Memcpy HtoD, and how the Memcpy HtoD overlaps with the next OS read. Stream-ordered kernel launches also overlap with the next OS read. For each 10 ms OS read, there is 1.5 ms of overlapping copy/compute work and 0.2 ms of overhead between OS reads.
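
For illustration, here is a minimal double-buffered version of this pattern (a sketch, not the actual multibyte_split code; the chunk size matches the description above, but process_chunk and the launch configuration are placeholders):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Placeholder for the real work (e.g. generating offsets for the chunk).
__global__ void process_chunk(char const* data, size_t size) {}

int main(int argc, char** argv)
{
  constexpr size_t chunk_size = size_t{1} << 25;  // ~33 MB, as in multibyte_split

  if (argc < 2) return 1;
  std::FILE* file = std::fopen(argv[1], "rb");
  if (file == nullptr) return 1;

  // Two pinned host staging buffers, device buffers, and streams: while one
  // buffer's copy/compute is in flight, the next OS read fills the other.
  char* h_buf[2];
  char* d_buf[2];
  cudaStream_t stream[2];
  for (int i = 0; i < 2; ++i) {
    cudaMallocHost(&h_buf[i], chunk_size);
    cudaMalloc(&d_buf[i], chunk_size);
    cudaStreamCreate(&stream[i]);
  }

  int buf = 0;
  while (true) {
    // Make sure this buffer's previous copy has drained before overwriting it.
    cudaStreamSynchronize(stream[buf]);
    size_t bytes_read = std::fread(h_buf[buf], 1, chunk_size, file);
    if (bytes_read == 0) break;
    // The stream-ordered copy and kernel return immediately on the host, so
    // the next iteration's fread overlaps with this chunk's copy/compute.
    cudaMemcpyAsync(d_buf[buf], h_buf[buf], bytes_read,
                    cudaMemcpyHostToDevice, stream[buf]);
    process_chunk<<<256, 256, 0, stream[buf]>>>(d_buf[buf], bytes_read);
    buf ^= 1;
  }
  cudaDeviceSynchronize();

  for (int i = 0; i < 2; ++i) {
    cudaStreamDestroy(stream[i]);
    cudaFree(d_buf[i]);
    cudaFreeHost(h_buf[i]);
  }
  std::fclose(file);
  return 0;
}
```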

We can apply pipelining to the Parquet reader as well. Parquet reading includes several major steps: raw IO, header decoding, decompression, and data decoding. The runtime of each step varies based on the properties of the data in the file, including the data types, encoding efficiency, and compression efficiency. Furthermore, Parquet files have an internal row group and page structure that restricts how the file can be split. Here is an example profile reading the same "Common Crawl" data as above, but from a 240 MB Snappy-compressed Parquet file:
[Profile image: read_parquet on a 240 MB Snappy-compressed Parquet file]

Note how 90 ms is spent in the OS read of the file and ~20 ms is spent processing, with decompression taking most (11.5 ms) of the processing time. Also note the GPU utilization data during the read_parquet function, with zero GPU utilization during the copy, followed by good SM utilization and moderate warp utilization during the compute.
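
Plugging this profile into the rough model above: the sequential path spends about 90 ms + 20 ms = 110 ms, while perfect overlap would approach max(90, 20) = 90 ms plus the first batch's processing latency. That is at best an ~18% end-to-end improvement on this file, since the OS read dominates.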

We've completed prototyping work in #12358, experimenting with several approaches for pipelining the Parquet reader. Here are some performance analysis ideas for the next time we tackle this feature:

  • Curate a library of real-world (not generated) data files and use it to evaluate the performance of each pipelining approach
  • Analyze the copying, decompression, and decoding times across the curated library and track which files show the biggest benefit from pipelining
  • Consider setting a floor (such as 200 MB of compressed data) before pipelining kicks in, to make sure we aren't accruing too much overhead (see the sketch after this list)
  • Evaluate network-attached storage in addition to local NVMe data sources
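
A sketch of that floor check (the threshold and names here are placeholders, not an existing cuIO option):

```cpp
#include <cstddef>

// Hypothetical heuristic: only pipeline when the compressed input is large
// enough that overlap savings should outweigh per-batch overhead.
constexpr std::size_t pipelining_floor = std::size_t{200} * 1024 * 1024;  // 200 MB

bool should_pipeline(std::size_t compressed_size)
{
  return compressed_size >= pipelining_floor;
}
```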

As far as pipelining approaches, here are some areas to consider:

| Stream usage | Chunking pattern | Notes |
| --- | --- | --- |
| entire read per stream | row group | tbd |
| decompression stream and decoding stream | row group | tbd (see sketch below) |
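
For the second row, here is a minimal sketch of per-row-group pipelining across a decompression stream and a decoding stream, using an event so that decoding of a row group waits on its decompression while decompression of the next row group proceeds (the kernel names are placeholders, not libcudf APIs):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernels standing in for the real decompression and decode steps.
__global__ void decompress_row_group(int rg) {}
__global__ void decode_row_group(int rg) {}

void read_row_groups_pipelined(int num_row_groups)
{
  cudaStream_t decomp_stream, decode_stream;
  cudaStreamCreate(&decomp_stream);
  cudaStreamCreate(&decode_stream);

  std::vector<cudaEvent_t> done(num_row_groups);
  for (auto& e : done) cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

  for (int rg = 0; rg < num_row_groups; ++rg) {
    // Decompress row group rg on one stream...
    decompress_row_group<<<128, 128, 0, decomp_stream>>>(rg);
    cudaEventRecord(done[rg], decomp_stream);
    // ...and decode it on the other stream once its decompression finishes,
    // so decode of row group rg overlaps with decompression of rg + 1.
    cudaStreamWaitEvent(decode_stream, done[rg], 0);
    decode_row_group<<<128, 128, 0, decode_stream>>>(rg);
  }
  cudaStreamSynchronize(decode_stream);

  for (auto& e : done) cudaEventDestroy(e);
  cudaStreamDestroy(decomp_stream);
  cudaStreamDestroy(decode_stream);
}
```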

-- this is a draft, please do not comment yet --
