This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Thrust 1.8.0

@jaredhoberock jaredhoberock released this 12 Jan 22:16

Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.

New Features

  • Algorithms in CUDA __device__ code
    • Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions.

      Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:

       __device__ void my_device_sort(int *data, size_t n)
       {
         thrust::sort(thrust::device, data, data + n);
       }
      

      The following execution policies are supported in CUDA __device__ code:

      • thrust::seq
      • thrust::cuda::par
      • thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA

      Parallel execution of algorithms invoked from __device__ code requires CUDA Dynamic Parallelism; on devices without it, these invocations fall back to sequential execution in the calling thread.
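
      When sequential execution is all that is needed, thrust::seq sidesteps the Dynamic Parallelism requirement entirely. A minimal sketch (the function name my_thread_sort is illustrative, not part of the library):

      ```cuda
      #include <thrust/sort.h>
      #include <thrust/execution_policy.h>

      // Sort a small per-thread array sequentially in the calling CUDA thread.
      // thrust::seq never launches child kernels, so this compiles and runs
      // correctly even on devices without CUDA Dynamic Parallelism.
      __device__ void my_thread_sort(int *data, size_t n)
      {
        thrust::sort(thrust::seq, data, data + n);
      }
      ```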

  • Execution Policies
    • CUDA Streams
      The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm execution be issued on a given stream:

      // execute for_each on stream s
      thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
      

      Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.
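
      A fuller sketch of the stream workflow, showing the stream's lifecycle around the algorithm call (error checking omitted; the functor and sizes are illustrative):

      ```cuda
      #include <thrust/for_each.h>
      #include <thrust/device_vector.h>
      #include <thrust/execution_policy.h>

      struct scale
      {
        __device__ void operator()(int &x) const { x *= 2; }
      };

      int main()
      {
        thrust::device_vector<int> v(1 << 20, 1);

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Kernels launched by this for_each are issued on stream s.
        thrust::for_each(thrust::cuda::par.on(s), v.begin(), v.end(), scale());

        // Wait for work on s to complete before reusing or destroying the stream.
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        return 0;
      }
      ```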

    • thrust::seq
      The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread:

      // execute for_each sequentially in this thread
      thrust::for_each(thrust::seq, begin, end, my_functor);
      
  • Other
    • The new thrust::complex template provides complex number support.
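
    thrust::complex mirrors the familiar std::complex interface but is usable in both __host__ and __device__ code. A minimal host-side sketch:

    ```cpp
    #include <thrust/complex.h>
    #include <cassert>
    #include <cmath>

    int main()
    {
      thrust::complex<double> z(3.0, 4.0);

      // Member access and the usual free functions are provided.
      assert(z.real() == 3.0 && z.imag() == 4.0);
      assert(std::fabs(thrust::abs(z) - 5.0) < 1e-12);

      // z * conj(z) == |z|^2 on the real axis.
      thrust::complex<double> w = z * thrust::conj(z);
      assert(std::fabs(w.real() - 25.0) < 1e-12);
      return 0;
    }
    ```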

New Examples

  • simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution.
  • async_reduce demonstrates ways to invoke algorithms asynchronously with respect to the calling thread.

Other Enhancements

  • CUDA sort performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
  • CUDA merge performance is 200% faster on Tesla K20c for large problem sizes.
  • CUDA sort performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
  • CUDA reduce_by_key performance is 25% faster on Tesla K20c for large problem sizes.
  • CUDA scan performance is 15% faster on Tesla K20c for large problem sizes.
  • fallback_allocator example is simpler.

Bug Fixes

  • #364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
  • #371 do not redefine __CUDA_ARCH__
  • #379 fix crash when dereferencing transform_iterator on the CPU
  • #391 avoid use of uppercase variable names
  • #392 fix thrust::copy between cusp::complex & std::complex
  • #396 program compiled with gcc < 4.3 hangs during comparison sort
  • #406 fallback_allocator.cu example checks device for unified addressing support
  • #417 avoid using std::less<T> in binary search algorithms
  • #418 avoid various warnings
  • #443 including version.h no longer configures default systems
  • #578 nvcc produces warnings when sequential algorithms are used with cpu systems

Known Issues

  • When invoked with primitive data types, thrust::sort, thrust::sort_by_key, thrust::stable_sort, thrust::stable_sort_by_key may fail to link in some cases with nvcc -rdc=true.
  • The CUDA implementation of thrust::reduce_by_key incorrectly outputs the last element in a segment of equivalent keys instead of the first.
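
  The difference is only observable when keys compare equal without being identical, e.g. under a user-supplied equality predicate. A host-side sketch using thrust::seq, which follows the documented behavior of keeping the first key of each segment (the predicate name first_equal is illustrative):

  ```cpp
  #include <thrust/reduce.h>
  #include <thrust/execution_policy.h>
  #include <cassert>
  #include <utility>

  // Keys are (id, tag) pairs; two keys are equivalent if their ids match.
  struct first_equal
  {
    bool operator()(const std::pair<int,char> &a,
                    const std::pair<int,char> &b) const
    { return a.first == b.first; }
  };

  int main()
  {
    std::pair<int,char> keys[] = { {1,'a'}, {1,'b'}, {2,'c'} };
    int vals[] = { 10, 20, 30 };
    std::pair<int,char> out_keys[2];
    int out_vals[2];

    thrust::reduce_by_key(thrust::seq, keys, keys + 3, vals,
                          out_keys, out_vals, first_equal());

    // Documented behavior: the FIRST key of each equivalent segment is kept,
    // so out_keys[0] is (1,'a'); the buggy 1.8.0 CUDA path would yield (1,'b').
    assert(out_keys[0].second == 'a');
    assert(out_vals[0] == 30 && out_vals[1] == 30);
    return 0;
  }
  ```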

Acknowledgments

  • Thanks to Sean Baxter for contributing faster CUDA reduce, merge, and scan implementations.
  • Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
  • Thanks to Filipe Maia for contributing the implementation of thrust::complex.