[FEA] Increase reader throughput by pipelining IO and compute #13828
Labels: 0 - Backlog, cuIO, feature request, libcudf, Spark
-- this is a draft, please do not comment yet --
The end-to-end throughput of a file reader is limited by the sequential read speed of the underlying data source. We can use "pipelining" to overlap processing data on the device with reading data from the data source. Pipelining works by processing the data in batches, so that the previous chunk can be processed while the next chunk is being read. Pipelined readers show higher end-to-end throughput when the overlap between reading and processing exceeds the overhead of processing smaller batches.
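For illustration, here's a minimal sketch of the double-buffered pattern in CUDA C++. `data_source::read` and `process_on_device` are hypothetical placeholders (not libcudf APIs) for the blocking OS read and the stream-ordered device work:

```cpp
// Minimal double-buffered pipeline sketch. `data_source::read` and
// `process_on_device` are hypothetical stand-ins, not libcudf APIs.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

struct data_source {
  // Hypothetical: blocking OS read of up to `max_bytes`; returns bytes read, 0 at EOF.
  std::size_t read(std::uint8_t* dst, std::size_t max_bytes);
};

// Hypothetical: enqueues stream-ordered kernels that consume `size` bytes at `data`.
void process_on_device(std::uint8_t const* data, std::size_t size, cudaStream_t stream);

void pipelined_read(data_source& source, cudaStream_t stream)
{
  constexpr std::size_t chunk_size = std::size_t{1} << 25;  // ~33 MB, as in multibyte_split
  std::uint8_t* h_buf[2];  // pinned host buffers: required for async H2D copies
  std::uint8_t* d_buf[2];
  for (int i = 0; i < 2; ++i) {
    cudaMallocHost(&h_buf[i], chunk_size);
    cudaMalloc(&d_buf[i], chunk_size);
  }

  int cur          = 0;
  std::size_t size = source.read(h_buf[cur], chunk_size);  // first read cannot overlap
  while (size > 0) {
    // Enqueue H2D copy and kernels for the current chunk; both calls return immediately.
    cudaMemcpyAsync(d_buf[cur], h_buf[cur], size, cudaMemcpyHostToDevice, stream);
    process_on_device(d_buf[cur], size, stream);

    // While the GPU copies/computes, the CPU blocks in the next OS read.
    int next              = cur ^ 1;
    std::size_t next_size = source.read(h_buf[next], chunk_size);

    // Before reusing h_buf[cur] on the following iteration, make sure the
    // copy out of it has finished.
    cudaStreamSynchronize(stream);
    cur  = next;
    size = next_size;
  }

  for (int i = 0; i < 2; ++i) {
    cudaFreeHost(h_buf[i]);
    cudaFree(d_buf[i]);
  }
}
```

In practice, recording a CUDA event after the `cudaMemcpyAsync` and waiting on that event would be enough to reuse the pinned buffer, letting kernels keep running under subsequent reads; the full-stream synchronize here just keeps the sketch simple.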
In cuIO, `multibyte_split` uses a pipelined design that reads text data in ~33 MB chunks (2^25 bytes) into a pinned host buffer, copies the data to device, and then generates the offsets data. Here's a profile reading "Common Crawl" document data with `cudf.read_text` from a 410 MB file:

Note how the `get_next_chunk` function includes the OS `read` and the `Memcpy HtoD`, and how the `Memcpy HtoD` overlaps with the next OS `read`. Stream-ordered kernel launches also overlap with the next OS `read`. For each 10 ms OS `read`, there is 1.5 ms of overlapping copy/compute work and 0.2 ms of overhead between each OS `read`.
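As a rough sanity check on these numbers (an estimate, not a measurement from the profile): a fully serial design would pay about 10 + 1.5 = 11.5 ms per chunk, while the pipelined design pays roughly max(10, 1.5) + 0.2 = 10.2 ms, since the copy/compute work hides behind the next read. That is on the order of a 10-15% end-to-end throughput gain, consistent with the rule above that pipelining pays off when the hidden overlap (1.5 ms) exceeds the added overhead (0.2 ms).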
We can apply pipelining to the Parquet reader as well. Parquet reading includes several major steps: raw IO, header decoding, decompression, and data decoding. The runtime of each step varies based on the properties of the data in the file, including the data types, encoding efficiency, and compression efficiency. Furthermore, Parquet files have an internal row group and page structure that restricts how the file can be split, as sketched below.
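To illustrate that constraint, here's a small sketch of batching row groups into pipeline chunks near a target byte size. The helper name `make_chunks` and its inputs are illustrative, not libcudf API; since row groups cannot be split, chunk boundaries snap to row-group boundaries:

```cpp
// Sketch: batch Parquet row groups into pipeline chunks. `make_chunks` is a
// hypothetical helper, not a libcudf API. Each chunk is a list of row-group
// indices whose total compressed size is near `target_chunk_bytes`.
#include <cstddef>
#include <vector>

std::vector<std::vector<int>> make_chunks(std::vector<std::size_t> const& row_group_sizes,
                                          std::size_t target_chunk_bytes)
{
  std::vector<std::vector<int>> chunks;
  std::vector<int> current;
  std::size_t current_bytes = 0;
  for (int rg = 0; rg < static_cast<int>(row_group_sizes.size()); ++rg) {
    // Close the current chunk once adding this row group would exceed the target.
    // A single row group larger than the target still forms its own chunk.
    if (!current.empty() && current_bytes + row_group_sizes[rg] > target_chunk_bytes) {
      chunks.push_back(std::move(current));
      current.clear();
      current_bytes = 0;
    }
    current.push_back(rg);
    current_bytes += row_group_sizes[rg];
  }
  if (!current.empty()) { chunks.push_back(std::move(current)); }
  return chunks;
}
```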
Here is an example profile reading the same "Common Crawl" data as above, but from a 240 MB Snappy-compressed Parquet file:

Note how 90 ms is spent in the OS read of the file and ~20 ms is spent processing, with decompression taking most (11.5 ms) of the processing time. Also note the GPU utilization data during the `read_parquet` function: zero GPU utilization during the copy, followed by good SM utilization and moderate warp utilization during the compute.

We've completed prototyping work in #12358, experimenting with several approaches for pipelining the Parquet reader. Here are some performance analysis ideas for the next time we tackle this feature:
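For example (a back-of-the-envelope estimate from the profile above, not a measured result): the serial cost is roughly 90 + 20 = 110 ms, while a pipeline that hides all ~20 ms of processing behind the 90 ms of IO would take about 90 ms, bounding the end-to-end improvement at roughly 18% for this file, minus any overhead from processing smaller batches.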
As for pipelining approaches, here are some areas to consider:
-- this is a draft, please do not comment yet --