Thrust 1.8.0
Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__
code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__
code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq
execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream)
syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.
New Features
- Algorithms in CUDA
__device__
code-
Thrust algorithms may now be invoked from CUDA
__device__
and__host__ __device__
functions.Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:
__device__ int my_device_sort(int *data, size_t n) { thrust::sort(thrust::device, data, data + n); }
The following execution policies are supported in CUDA
__device__
code:thrust::seq
thrust::cuda::par
thrust::device
, whenTHRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA
Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.
-
- Execution Policies
-
CUDA Streams
Thethrust::cuda::par.on(stream)
syntax allows users to request that CUDA__global__
functions launched during algorithm execution should occur on a given stream:// execute for_each on stream s thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.
-
thrust::seq
Thethrust::seq
execution policy allows users to require that an algorithm execute sequentially in the calling thread:// execute for_each sequentially in this thread thrust::for_each(thrust::seq, begin, end, my_functor);
-
- Other
- The new thrust::complex template provides complex number support.
New Examples
simple_cuda_streams
demonstrates how to request a CUDA stream during algorithm execution.async_reduce
demonstrates ways to achieve algorithm invocations which are asynchronous with the calling thread.
Other Enhancements
- CUDA
sort
performance for user-defined types is 300% faster on Tesla K20c for large problem sizes. - CUDA
merge
performance is 200% faster on Tesla K20c for large problem sizes. - CUDA
sort
performance for primitive types is 50% faster on Tesla K20c for large problem sizes. - CUDA
reduce_by_key
performance is 25% faster on Tesla K20c for large problem sizes. - CUDA
scan
performance is 15% faster on Tesla K20c for large problem sizes. fallback_allocator
example is simpler.
Bug Fixes
- #364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
- #371 do not redefine
__CUDA_ARCH__
- #379 fix crash when dereferencing
transform_iterator
on the CPU - #391 avoid use of uppercase variable names
- #392 fix
thrust::copy
betweencusp::complex
&std::complex
- #396 program compiled with gcc < 4.3 hangs during comparison sort
- #406
fallback_allocator.cu
example checks device for unified addressing support - #417 avoid using
std::less<T>
in binary search algorithms - #418 avoid various warnings
- #443 including
version.h
no longer configures default systems - #578 nvcc produces warnings when sequential algorithms are used with cpu systems
Known Issues
- When invoked with primitive data types,
thrust::sort
,thrust::sort_by_key
,thrust::stable_sort
,thrust::stable_sort_by_key
may fail to link in some cases withnvcc -rdc=true
. - The CUDA implementation of
thrust::reduce_by_key
incorrectly outputs the last element in a segment of equivalent keys instead of the first.
Acknowledgments
- Thanks to Sean Baxter for contributing faster CUDA
reduce
,merge
, andscan
implementations. - Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
- Thanks to Filipe Maia for contributing the implementation of
thrust::complex
.