[FEA] Increase reader throughput by pipelining IO and compute #13828
Labels: 0 - Backlog, cuIO, feature request, libcudf, Spark
-- this is a draft, please do not comment yet --
The end-to-end throughput of a file reader is limited by the sequential read speed of the underlying data source. We can use "pipelining" to overlap processing data on the device with reading data from the data source. Pipelining works by processing the data in batches, so that the previous chunk can be processed while the next chunk is being read. Pipelined readers show higher end-to-end throughput when the overlap between reading and processing exceeds the overhead of processing smaller batches.
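For illustration, here's a minimal sketch of the double-buffered pattern in CUDA C++. `data_source::read` and `process_on_device` are hypothetical placeholders (not libcudf APIs) for the blocking OS read and the stream-ordered device work:

```cpp
// Minimal double-buffered pipeline sketch. `data_source::read` and
// `process_on_device` are hypothetical stand-ins, not libcudf APIs.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

struct data_source {
  // Hypothetical: blocking OS read of up to `max_bytes`; returns bytes read, 0 at EOF.
  std::size_t read(std::uint8_t* dst, std::size_t max_bytes);
};

// Hypothetical: enqueues stream-ordered kernels that consume `size` bytes at `data`.
void process_on_device(std::uint8_t const* data, std::size_t size, cudaStream_t stream);

void pipelined_read(data_source& source, cudaStream_t stream)
{
  constexpr std::size_t chunk_size = std::size_t{1} << 25;  // ~33 MB, as in multibyte_split
  std::uint8_t* h_buf[2];  // pinned host buffers: required for async H2D copies
  std::uint8_t* d_buf[2];
  for (int i = 0; i < 2; ++i) {
    cudaMallocHost(&h_buf[i], chunk_size);
    cudaMalloc(&d_buf[i], chunk_size);
  }

  int cur          = 0;
  std::size_t size = source.read(h_buf[cur], chunk_size);  // first read cannot overlap
  while (size > 0) {
    // Enqueue H2D copy and kernels for the current chunk; both calls return immediately.
    cudaMemcpyAsync(d_buf[cur], h_buf[cur], size, cudaMemcpyHostToDevice, stream);
    process_on_device(d_buf[cur], size, stream);

    // While the GPU copies/computes, the CPU blocks in the next OS read.
    int next              = cur ^ 1;
    std::size_t next_size = source.read(h_buf[next], chunk_size);

    // Before reusing h_buf[cur] on the following iteration, make sure the
    // copy out of it has finished.
    cudaStreamSynchronize(stream);
    cur  = next;
    size = next_size;
  }

  for (int i = 0; i < 2; ++i) {
    cudaFreeHost(h_buf[i]);
    cudaFree(d_buf[i]);
  }
}
```

In practice, recording a CUDA event after the `cudaMemcpyAsync` and waiting on that event would be enough to reuse the pinned buffer, letting kernels keep running under subsequent reads; the full-stream synchronize here just keeps the sketch simple.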
In cuIO, `multibyte_split` uses a pipelined design that reads text data in ~33 MB chunks (2^25 bytes) into a pinned host buffer, copies the data to device, and then generates the offsets data. Here's a profile reading "Common Crawl" document data with `cudf.read_text` from a 410 MB file:

Note how the `get_next_chunk` function includes the OS `read` and the `Memcpy HtoD`, and how the `Memcpy HtoD` overlaps with the next OS `read`. Stream-ordered kernel launches also overlap with the next OS `read`. For each 10 ms OS `read`, there is 1.5 ms of overlapping copy/compute work and 0.2 ms of overhead between each OS `read`.
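As a rough sanity check on these numbers (an estimate, not a measurement from the profile): a fully serial design would pay about 10 + 1.5 = 11.5 ms per chunk, while the pipelined design pays roughly max(10, 1.5) + 0.2 = 10.2 ms, since the copy/compute work hides behind the next read. That is on the order of a 10-15% end-to-end throughput gain, consistent with the rule above that pipelining pays off when the hidden overlap (1.5 ms) exceeds the added overhead (0.2 ms).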
We can apply pipelining to the Parquet reader as well. Parquet reading includes several major steps: raw IO, header decoding, decompression, and data decoding. The runtime of each step varies based on the properties of the data in the file, including the data types, encoding efficiency, and compression efficiency. Furthermore, Parquet files have an internal row group and page structure that restricts how the file can be split, as sketched below.
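To illustrate that constraint, here's a small sketch of batching row groups into pipeline chunks near a target byte size. The helper name `make_chunks` and its inputs are illustrative, not libcudf API; since row groups cannot be split, chunk boundaries snap to row-group boundaries:

```cpp
// Sketch: batch Parquet row groups into pipeline chunks. `make_chunks` is a
// hypothetical helper, not a libcudf API. Each chunk is a list of row-group
// indices whose total compressed size is near `target_chunk_bytes`.
#include <cstddef>
#include <vector>

std::vector<std::vector<int>> make_chunks(std::vector<std::size_t> const& row_group_sizes,
                                          std::size_t target_chunk_bytes)
{
  std::vector<std::vector<int>> chunks;
  std::vector<int> current;
  std::size_t current_bytes = 0;
  for (int rg = 0; rg < static_cast<int>(row_group_sizes.size()); ++rg) {
    // Close the current chunk once adding this row group would exceed the target.
    // A single row group larger than the target still forms its own chunk.
    if (!current.empty() && current_bytes + row_group_sizes[rg] > target_chunk_bytes) {
      chunks.push_back(std::move(current));
      current.clear();
      current_bytes = 0;
    }
    current.push_back(rg);
    current_bytes += row_group_sizes[rg];
  }
  if (!current.empty()) { chunks.push_back(std::move(current)); }
  return chunks;
}
```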
Here is an example profile reading the same "Common Crawl" data as above, but from a 240 MB Snappy-compressed Parquet file:

Note how 90 ms is spent in the OS read of the file and ~20 ms is spent processing, with decompression taking most (11.5 ms) of the processing time. Also note the GPU utilization data during the `read_parquet` function: zero GPU utilization during the copy, followed by good SM utilization and moderate warp utilization during the compute.

We've completed prototyping work in #12358, experimenting with several approaches for pipelining the Parquet reader. Here are some performance analysis ideas for the next time we tackle this feature:
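For example (a back-of-the-envelope estimate from the profile above, not a measured result): the serial cost is roughly 90 + 20 = 110 ms, while a pipeline that hides all ~20 ms of processing behind the 90 ms of IO would take about 90 ms, bounding the end-to-end improvement at roughly 18% for this file, minus any overhead from processing smaller batches.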
As for pipelining approaches, here are some areas to consider:
-- this is a draft, please do not comment yet --