This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Releases: NVIDIA/thrust

Thrust 1.8.3 (CUDA Toolkit 8.0)

16 May 09:58

Thrust 1.8.3 is a small bug fix release.

New Examples

  • range_view demonstrates the use of a view (a non-owning wrapper for an iterator range with a container-like interface).

Bug Fixes

  • thrust::(min|max|minmax)_element can now accept raw device pointers when an explicit device execution policy is used.
  • clear() operations on vector types no longer require the element type to have a default constructor.

Thrust 1.8.2 (CUDA Toolkit 7.5)

28 Feb 20:44

Thrust 1.8.2 is a small bug fix release.

Bug Fixes

  • Avoid warnings and errors concerning user functions called from __host__ __device__ functions
  • #632 CUDA set_intersection_by_key error
  • #651 thrust::copy between host & device is not interoperable with thrust::cuda::par.on(stream)
  • #664 CUDA for_each ignores execution policy's stream

Known Issues

  • #628 CUDA's reduce_by_key fails on sm_50 devices

Thrust 1.8.1 (CUDA Toolkit 7.0)

18 Mar 23:39

Thrust 1.8.1 is a small bug fix release.

Bug Fixes

  • CUDA thrust::for_each accesses illegal memory locations when given a large range.

Thrust 1.8.0

12 Jan 22:16

Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.

New Features

  • Algorithms in CUDA __device__ code
    • Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions.

      Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:

       __device__ void my_device_sort(int *data, size_t n)
       {
         thrust::sort(thrust::device, data, data + n);
       }
      

      The following execution policies are supported in CUDA __device__ code:

      • thrust::seq
      • thrust::cuda::par
      • thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA

      Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.

  • Execution Policies
    • CUDA Streams
      The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm execution should occur on a given stream:

      // execute for_each on stream s
      thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
      

      Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.

    • thrust::seq
      The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread:

      // execute for_each sequentially in this thread
      thrust::for_each(thrust::seq, begin, end, my_functor);
      
  • Other
    • The new thrust::complex template provides complex number support.

New Examples

  • simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution.
  • async_reduce demonstrates ways to achieve algorithm invocations which are asynchronous with the calling thread.

Other Enhancements

  • CUDA sort performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
  • CUDA merge performance is 200% faster on Tesla K20c for large problem sizes.
  • CUDA sort performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
  • CUDA reduce_by_key performance is 25% faster on Tesla K20c for large problem sizes.
  • CUDA scan performance is 15% faster on Tesla K20c for large problem sizes.
  • fallback_allocator example is simpler.

Bug Fixes

  • #364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
  • #371 do not redefine __CUDA_ARCH__
  • #379 fix crash when dereferencing transform_iterator on the CPU
  • #391 avoid use of uppercase variable names
  • #392 fix thrust::copy between cusp::complex & std::complex
  • #396 program compiled with gcc < 4.3 hangs during comparison sort
  • #406 fallback_allocator.cu example checks device for unified addressing support
  • #417 avoid using std::less<T> in binary search algorithms
  • #418 avoid various warnings
  • #443 including version.h no longer configures default systems
  • #578 nvcc produces warnings when sequential algorithms are used with cpu systems

Known Issues

  • When invoked with primitive data types, thrust::sort, thrust::sort_by_key, thrust::stable_sort, thrust::stable_sort_by_key may fail to link in some cases with nvcc -rdc=true.
  • The CUDA implementation of thrust::reduce_by_key incorrectly outputs the last element in a segment of equivalent keys instead of the first.

Acknowledgments

  • Thanks to Sean Baxter for contributing faster CUDA reduce, merge, and scan implementations.
  • Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
  • Thanks to Filipe Maia for contributing the implementation of thrust::complex.

Thrust 1.7.2 (CUDA Toolkit 6.5)

16 May 09:57

Thrust 1.7.2 is a minor bug fix release.

Bug Fixes

  • Avoid use of std::min in generic find implementation.

Thrust 1.7.1 (CUDA Toolkit 6.0)

19 Jun 21:33

Thrust 1.7.1 is a minor bug fix release.

Bug Fixes

  • Eliminate identifiers in set_operations.cu example with leading underscore
  • Eliminate unused variable warning in CUDA reduce_by_key implementation
  • Avoid deriving function objects from std::unary_function and std::binary_function

Thrust 1.7.0 (CUDA Toolkit 5.5)

03 Jul 05:09

Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well as stencil versions of partitioning algorithms. thrust::tabulate has been introduced to tabulate the values of functions taking integers. For 32-bit types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation provides 80% faster performance.

Breaking API Changes

  • Dispatch
    • Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch.

      See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.

    • thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.

  • Iterators
    • iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated.
    • iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor).
    • iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade).
    • iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access).
    • All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
    • Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type.
  • Other
    • normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution).
    • Placeholder expressions may no longer include the comma operator.

New Features

  • Execution Policies
    • Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.

      For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution policy may be passed as an argument to an algorithm invocation to enable CUDA execution.

      The following execution policies are supported in this version:

      • thrust::host
      • thrust::device
      • thrust::cpp::par
      • thrust::cuda::par
      • thrust::omp::par
      • thrust::tbb::par
  • Algorithms
    • free
    • get_temporary_buffer
    • malloc
    • merge_by_key
    • partition with stencil
    • partition_copy with stencil
    • return_temporary_buffer
    • set_difference_by_key
    • set_intersection_by_key
    • set_symmetric_difference_by_key
    • set_union_by_key
    • stable_partition with stencil
    • stable_partition_copy with stencil
    • tabulate

New Examples

  • uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector.

Other Enhancements

  • Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter.
  • Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device.
  • THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend.
  • CUDA merge performance is 2-15x faster.
  • CUDA comparison sort performance is 1.3-4x faster.
  • CUDA set operation performance is 1.5-15x faster.
  • TBB reduce_by_key performance is 80% faster.
  • Several algorithms have been parallelized with TBB.
  • Support for user allocators in vectors has been improved.
  • The sparse_vector example is now implemented with merge_by_key instead of sort_by_key.
  • Warnings have been eliminated in various contexts.
  • Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts.
  • Documentation about algorithm requirements has been improved.
  • Simplified the minimal_custom_backend example.
  • Simplified the cuda/custom_temporary_allocation example.
  • Simplified the cuda/fallback_allocator example.

Bug Fixes

  • #248 fix broken counting_iterator<float> behavior with OpenMP
  • #231, #209 fix set operation failures with CUDA
  • #187 fix incorrect occupancy calculation with CUDA
  • #153 fix broken multigpu behavior with CUDA
  • #142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
  • #208 correctly initialize elements in temporary storage when necessary
  • #16 fix compilation error when sorting bool with CUDA
  • #10 fix ambiguous overloads of reinterpret_tag

Known Issues

  • g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation.

Acknowledgments

  • Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
  • Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
  • Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.

Thrust 1.5.3 (CUDA Toolkit 5.0)

16 May 09:50

Thrust 1.5.3 is a minor bug fix release.

Bug Fixes

  • Avoid warnings about potential race due to __shared__ non-POD variable

Thrust 1.6.0

16 May 09:53

Thrust 1.6.0 provides an interface for customization and extension and a new backend system based on the Threading Building Blocks library. With this new interface, programmers may customize the behavior of specific algorithms as well as control the allocation of temporary storage or invent entirely new backends. These enhancements also allow multiple different backend systems such as CUDA and OpenMP to coexist within a single program. Support for TBB allows Thrust programs to integrate more naturally into applications which may already employ the TBB task scheduler.

Breaking Changes

  • The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
  • thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
  • The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
  • The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
  • The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
  • thrust::host_space_tag has been renamed thrust::host_system_tag
  • thrust::device_space_tag has been renamed thrust::device_system_tag
  • thrust::any_space_tag has been renamed thrust::any_system_tag
  • thrust::iterator_space has been renamed thrust::iterator_system

New Features

  • Backend Systems
    • Threading Building Blocks (TBB) is now supported
  • Algorithms
    • thrust::for_each_n
    • thrust::raw_reference_cast
  • Types
    • thrust::pointer
    • thrust::reference

New Examples

  • cuda/custom_temporary_allocation
  • cuda/fallback_allocator
  • device_ptr
  • expand
  • minimal_custom_backend
  • raw_reference_cast
  • set_operations

Other Enhancements

  • thrust::for_each now returns the end of the input range, as most other algorithms do
  • thrust::pair and thrust::tuple have swap functionality
  • All CUDA algorithms now support large data types
  • Iterators may be dereferenced in user __device__ or __global__ functions
  • The safe use of different backend systems is now possible within a single binary

Bug Fixes

  • #469 min_element and max_element algorithms no longer require a const comparison operator

Known Issues

  • NVCC may crash when parsing TBB headers on Windows.

Thrust 1.5.2 (CUDA Toolkit 4.2)

16 May 09:49

Thrust 1.5.2 is a minor bug fix release.

Bug Fixes

  • Fixed warning about C-style initialization of structures