This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Thrust 1.8.0

@jaredhoberock jaredhoberock released this 12 Jan 22:16

Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.

New Features

  • Algorithms in CUDA __device__ code
    • Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions.

      Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:

       __device__ void my_device_sort(int *data, size_t n)
       {
         thrust::sort(thrust::device, data, data + n);
       }
      

      The following execution policies are supported in CUDA __device__ code:

      • thrust::seq
      • thrust::cuda::par
      • thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA

      Parallel execution of algorithms invoked from __device__ code requires CUDA Dynamic Parallelism; on devices without it, these invocations fall back to sequential execution in the calling thread.
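
      When sequential execution is all that is needed, thrust::seq sidesteps the Dynamic Parallelism requirement entirely. A minimal sketch (the function name my_thread_sort is illustrative, not part of the library):

      ```cuda
      #include <thrust/sort.h>
      #include <thrust/execution_policy.h>

      // Sort a small per-thread array sequentially in the calling CUDA thread.
      // thrust::seq never launches child kernels, so this compiles and runs
      // correctly even on devices without CUDA Dynamic Parallelism.
      __device__ void my_thread_sort(int *data, size_t n)
      {
        thrust::sort(thrust::seq, data, data + n);
      }
      ```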

  • Execution Policies
    • CUDA Streams
      The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm execution be issued on a given stream:

      // execute for_each on stream s
      thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
      

      Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.
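
      A fuller sketch of the stream workflow, showing the stream's lifecycle around the algorithm call (error checking omitted; the functor and sizes are illustrative):

      ```cuda
      #include <thrust/for_each.h>
      #include <thrust/device_vector.h>
      #include <thrust/execution_policy.h>

      struct scale
      {
        __device__ void operator()(int &x) const { x *= 2; }
      };

      int main()
      {
        thrust::device_vector<int> v(1 << 20, 1);

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Kernels launched by this for_each are issued on stream s.
        thrust::for_each(thrust::cuda::par.on(s), v.begin(), v.end(), scale());

        // Wait for work on s to complete before reusing or destroying the stream.
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        return 0;
      }
      ```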

    • thrust::seq
      The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread:

      // execute for_each sequentially in this thread
      thrust::for_each(thrust::seq, begin, end, my_functor);
      
  • Other
    • The new thrust::complex template provides complex number support.
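
    thrust::complex mirrors the familiar std::complex interface but is usable in both __host__ and __device__ code. A minimal host-side sketch:

    ```cpp
    #include <thrust/complex.h>
    #include <cassert>
    #include <cmath>

    int main()
    {
      thrust::complex<double> z(3.0, 4.0);

      // Member access and the usual free functions are provided.
      assert(z.real() == 3.0 && z.imag() == 4.0);
      assert(std::fabs(thrust::abs(z) - 5.0) < 1e-12);

      // z * conj(z) == |z|^2 on the real axis.
      thrust::complex<double> w = z * thrust::conj(z);
      assert(std::fabs(w.real() - 25.0) < 1e-12);
      return 0;
    }
    ```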

New Examples

  • simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution.
  • async_reduce demonstrates ways to invoke algorithms asynchronously with respect to the calling thread.

Other Enhancements

  • CUDA sort performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
  • CUDA merge performance is 200% faster on Tesla K20c for large problem sizes.
  • CUDA sort performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
  • CUDA reduce_by_key performance is 25% faster on Tesla K20c for large problem sizes.
  • CUDA scan performance is 15% faster on Tesla K20c for large problem sizes.
  • fallback_allocator example is simpler.

Bug Fixes

  • #364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
  • #371 do not redefine __CUDA_ARCH__
  • #379 fix crash when dereferencing transform_iterator on the CPU
  • #391 avoid use of uppercase variable names
  • #392 fix thrust::copy between cusp::complex & std::complex
  • #396 program compiled with gcc < 4.3 hangs during comparison sort
  • #406 fallback_allocator.cu example checks device for unified addressing support
  • #417 avoid using std::less<T> in binary search algorithms
  • #418 avoid various warnings
  • #443 including version.h no longer configures default systems
  • #578 nvcc produces warnings when sequential algorithms are used with cpu systems

Known Issues

  • When invoked with primitive data types, thrust::sort, thrust::sort_by_key, thrust::stable_sort, thrust::stable_sort_by_key may fail to link in some cases with nvcc -rdc=true.
  • The CUDA implementation of thrust::reduce_by_key incorrectly outputs the last element in a segment of equivalent keys instead of the first.
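
  The difference is only observable when keys compare equal without being identical, e.g. under a user-supplied equality predicate. A host-side sketch using thrust::seq, which follows the documented behavior of keeping the first key of each segment (the predicate name first_equal is illustrative):

  ```cpp
  #include <thrust/reduce.h>
  #include <thrust/execution_policy.h>
  #include <cassert>
  #include <utility>

  // Keys are (id, tag) pairs; two keys are equivalent if their ids match.
  struct first_equal
  {
    bool operator()(const std::pair<int,char> &a,
                    const std::pair<int,char> &b) const
    { return a.first == b.first; }
  };

  int main()
  {
    std::pair<int,char> keys[] = { {1,'a'}, {1,'b'}, {2,'c'} };
    int vals[] = { 10, 20, 30 };
    std::pair<int,char> out_keys[2];
    int out_vals[2];

    thrust::reduce_by_key(thrust::seq, keys, keys + 3, vals,
                          out_keys, out_vals, first_equal());

    // Documented behavior: the FIRST key of each equivalent segment is kept,
    // so out_keys[0] is (1,'a'); the buggy 1.8.0 CUDA path would yield (1,'b').
    assert(out_keys[0].second == 'a');
    assert(out_vals[0] == 30 && out_vals[1] == 30);
    return 0;
  }
  ```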

Acknowledgments

  • Thanks to Sean Baxter for contributing faster CUDA reduce, merge, and scan implementations.
  • Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
  • Thanks to Filipe Maia for contributing the implementation of thrust::complex.