# Releases: NVIDIA/thrust
# Thrust 1.8.3 (CUDA Toolkit 8.0)

Thrust 1.8.3 is a small bug fix release.

## New Examples

- `range_view` demonstrates the use of a view (a non-owning wrapper for an iterator range with a container-like interface).
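The view idea the example demonstrates can be sketched in plain C++ (names here are illustrative, not the example's actual types):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A non-owning wrapper over an iterator range that exposes a
// container-like interface: the view stores only two iterators and
// never copies or owns the underlying elements.
template <typename Iterator>
class range_view_sketch
{
  Iterator first_, last_;   // the view does not own the elements
public:
  range_view_sketch(Iterator first, Iterator last) : first_(first), last_(last) {}
  Iterator begin() const { return first_; }
  Iterator end() const { return last_; }
  std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
  // indexing forwards to the underlying range, so writes are visible
  // in the viewed container
  decltype(auto) operator[](std::size_t i) const { return first_[i]; }
};
```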
## Bug Fixes

- `thrust::min_element`, `thrust::max_element`, and `thrust::minmax_element` can now accept raw device pointers when an explicit device execution policy is used.
- `clear` operations on vector types no longer require the element type to have a default constructor.
# Thrust 1.8.2 (CUDA Toolkit 7.5)

Thrust 1.8.2 is a small bug fix release.

## Bug Fixes

- Avoid warnings and errors concerning user functions called from `__host__ __device__` functions.
- #632: CUDA `set_intersection_by_key` error.
- #651: `thrust::copy` between host & device is not interoperable with `thrust::cuda::par.on(stream)`.
- #664: CUDA `for_each` ignores the execution policy's stream.

## Known Issues

- #628: CUDA's `reduce_by_key` fails on `sm_50` devices.
# Thrust 1.8.1 (CUDA Toolkit 7.0)

Thrust 1.8.1 is a small bug fix release.

## Bug Fixes

- CUDA `thrust::for_each` accesses illegal memory locations when given a large range.
# Thrust 1.8.0

Thrust 1.8.0 introduces support for algorithm invocation from CUDA `__device__` code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA `__device__` code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The `thrust::seq` execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The `.on(stream)` syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.
## New Features

- Algorithms in CUDA `__device__` code
  - Thrust algorithms may now be invoked from CUDA `__device__` and `__host__ __device__` functions. Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:

    ```cpp
    __device__ void my_device_sort(int *data, size_t n)
    {
      thrust::sort(thrust::device, data, data + n);
    }
    ```

  - The following execution policies are supported in CUDA `__device__` code:
    - `thrust::seq`
    - `thrust::cuda::par`
    - `thrust::device`, when `THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA`
  - Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.
- Execution Policies
  - CUDA Streams: the `thrust::cuda::par.on(stream)` syntax allows users to request that CUDA `__global__` functions launched during algorithm execution occur on a given stream:

    ```cpp
    // execute for_each on stream s
    thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
    ```

    Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.
  - `thrust::seq`: the `thrust::seq` execution policy allows users to require that an algorithm execute sequentially in the calling thread:

    ```cpp
    // execute for_each sequentially in this thread
    thrust::for_each(thrust::seq, begin, end, my_functor);
    ```

- Other
  - The new `thrust::complex` template provides complex number support.
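`thrust::complex` is designed to mirror the `std::complex` interface while also being usable in device code; the following runnable sketch therefore uses `std::complex` so it compiles without Thrust headers, and the same arithmetic carries over:

```cpp
#include <complex>

// Multiply two complex numbers; thrust::complex supports the same
// constructors, operators, and real()/imag() accessors as std::complex.
std::complex<float> complex_product_demo()
{
  std::complex<float> a(1.0f, 2.0f);   // 1 + 2i
  std::complex<float> b(3.0f, 4.0f);   // 3 + 4i
  return a * b;                        // (1+2i)(3+4i) = -5 + 10i
}
```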
New Examples
simple_cuda_streams
demonstrates how to request a CUDA stream during algorithm execution.async_reduce
demonstrates ways to achieve algorithm invocations which are asynchronous with the calling thread.
## Other Enhancements

- CUDA `sort` performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
- CUDA `merge` performance is 200% faster on Tesla K20c for large problem sizes.
- CUDA `sort` performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
- CUDA `reduce_by_key` performance is 25% faster on Tesla K20c for large problem sizes.
- CUDA `scan` performance is 15% faster on Tesla K20c for large problem sizes.
- The `fallback_allocator` example is simpler.
## Bug Fixes

- #364: iterators with unrelated system tags may be used with algorithms invoked with an execution policy.
- #371: do not redefine `__CUDA_ARCH__`.
- #379: fix crash when dereferencing `transform_iterator` on the CPU.
- #391: avoid use of uppercase variable names.
- #392: fix `thrust::copy` between `cusp::complex` & `std::complex`.
- #396: program compiled with gcc < 4.3 hangs during comparison sort.
- #406: the `fallback_allocator.cu` example checks the device for unified addressing support.
- #417: avoid using `std::less<T>` in binary search algorithms.
- #418: avoid various warnings.
- #443: including `version.h` no longer configures default systems.
- #578: nvcc produces warnings when sequential algorithms are used with CPU systems.
## Known Issues

- When invoked with primitive data types, `thrust::sort`, `thrust::sort_by_key`, `thrust::stable_sort`, and `thrust::stable_sort_by_key` may fail to link in some cases with `nvcc -rdc=true`.
- The CUDA implementation of `thrust::reduce_by_key` incorrectly outputs the last element in a segment of equivalent keys instead of the first.
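For reference, the intended `reduce_by_key` behavior can be sketched in plain C++ (sum reduction only; the helper name is hypothetical, and Thrust's version also accepts a key-equality predicate and a reduction operator):

```cpp
#include <utility>
#include <vector>

// Consecutive equal keys form a segment; each segment contributes one
// output key and one reduced value. The output key is the FIRST key of
// the segment -- the contract the known issue above violates by emitting
// the last key (observable when keys compare equal under a custom
// predicate without being identical).
std::pair<std::vector<int>, std::vector<int>>
reduce_by_key_sum(const std::vector<int>& keys, const std::vector<int>& vals)
{
  std::vector<int> keys_out, vals_out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (i == 0 || keys[i] != keys[i - 1]) {   // a new segment begins
      keys_out.push_back(keys[i]);            // first key of the segment
      vals_out.push_back(vals[i]);
    } else {
      vals_out.back() += vals[i];             // accumulate within the segment
    }
  }
  return {keys_out, vals_out};
}
```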
## Acknowledgments

- Thanks to Sean Baxter for contributing faster CUDA `reduce`, `merge`, and `scan` implementations.
- Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
- Thanks to Filipe Maia for contributing the implementation of `thrust::complex`.
# Thrust 1.7.2 (CUDA Toolkit 6.5)

Thrust 1.7.2 is a minor bug fix release.

## Bug Fixes

- Avoid use of `std::min` in the generic find implementation.
# Thrust 1.7.1 (CUDA Toolkit 6.0)

Thrust 1.7.1 is a minor bug fix release.

## Bug Fixes

- Eliminate identifiers with leading underscores in the `set_operations.cu` example.
- Eliminate an unused variable warning in the CUDA `reduce_by_key` implementation.
- Avoid deriving function objects from `std::unary_function` and `std::binary_function`.
# Thrust 1.7.0 (CUDA Toolkit 5.5)

Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of `thrust::merge` and the set operation algorithms have been added, as well as stencil versions of the partitioning algorithms. `thrust::tabulate` has been introduced to tabulate the values of functions taking integers. For 32b types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB `reduce_by_key` implementation provides 80% faster performance.
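The `tabulate` semantics mentioned above can be sketched in plain C++ (the helper name is hypothetical; Thrust's version operates on an arbitrary iterator range and runs under the selected backend):

```cpp
#include <vector>

// Fill a range so that element i holds f(i), for a function f taking
// an integer index -- the contract of thrust::tabulate.
template <typename T, typename UnaryOp>
void tabulate_sketch(std::vector<T>& v, UnaryOp f)
{
  for (std::size_t i = 0; i < v.size(); ++i)
    v[i] = f(static_cast<T>(i));   // element i receives f(i)
}
```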
## Breaking API Changes

- Dispatch
  - Custom user backend systems' tag types must now inherit from the corresponding system's `execution_policy` template (e.g. `thrust::cuda::execution_policy`) instead of the tag `struct` (e.g. `thrust::cuda::tag`). Otherwise, algorithm specializations will silently go unfound during dispatch. See `examples/minimal_custom_backend.cu` and `examples/cuda/fallback_allocator.cu` for usage examples.
  - `thrust::advance` and `thrust::distance` are no longer dispatched based on iterator system type and thus may no longer be customized.
- Iterators
  - `iterator_facade` and `iterator_adaptor`'s `Pointer` template parameters have been eliminated.
  - `iterator_adaptor` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_adaptor`).
  - `iterator_facade` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_facade`).
  - `iterator_core_access` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_core_access`).
  - All iterators' nested `pointer` `typedef` (the type of the result of `operator->`) is now `void` instead of a pointer type to indicate that such expressions are currently impossible.
  - Floating point `counting_iterator`s' nested `difference_type` `typedef` is now a signed integral type instead of a floating point type.
- Other
  - `normal_distribution` has been moved into the `thrust::random` namespace (previously `thrust::random::experimental::normal_distribution`).
  - Placeholder expressions may no longer include the comma operator.
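The dispatch mechanics behind the tag-inheritance requirement can be sketched in plain C++ (all names here are illustrative, not Thrust's actual internals): overloads written against the `execution_policy` template are found for a derived tag through the ordinary base-class conversion, so a custom tag that inherits from it participates in dispatch automatically.

```cpp
// A CRTP-style policy base, analogous to a system's execution_policy
// template.
template <typename Derived>
struct execution_policy {};

// A generic overload, written once against the policy template;
// returns 0 to mean "generic path taken".
template <typename Derived>
int which_backend(execution_policy<Derived>&) { return 0; }

// A custom backend tag: inheriting from execution_policy<my_tag> is
// what lets the overload above (and more specialized ones) see it.
struct my_tag : execution_policy<my_tag> {};

// A more specialized overload for the custom tag; returns 1 to mean
// "custom path taken". Overload resolution prefers it for my_tag.
int which_backend(my_tag&) { return 1; }
```

Calling `which_backend` on a `my_tag` picks the custom overload, while a reference to the `execution_policy<my_tag>` base falls back to the generic one; a tag that did not inherit would match neither.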
## New Features

- Execution Policies
  - Users may directly control the dispatch of algorithm invocations with optional execution policy arguments. For example, instead of wrapping raw pointers allocated by `cudaMalloc` with `thrust::device_ptr`, the `thrust::device` execution policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
  - The following execution policies are supported in this version:
    - `thrust::host`
    - `thrust::device`
    - `thrust::cpp::par`
    - `thrust::cuda::par`
    - `thrust::omp::par`
    - `thrust::tbb::par`
- Algorithms
  - `free`
  - `get_temporary_buffer`
  - `malloc`
  - `merge_by_key`
  - `partition` with stencil
  - `partition_copy` with stencil
  - `return_temporary_buffer`
  - `set_difference_by_key`
  - `set_intersection_by_key`
  - `set_symmetric_difference_by_key`
  - `set_union_by_key`
  - `stable_partition` with stencil
  - `stable_partition_copy` with stencil
  - `tabulate`
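The key/value pattern shared by the new `*_by_key` algorithms can be sketched in plain C++ for `merge_by_key` (the helper name is hypothetical; Thrust's version takes iterator ranges and an optional comparator):

```cpp
#include <vector>

// Merge two sorted key ranges and permute the corresponding value
// ranges with the same reordering. Ties are broken in favor of the
// first range, which keeps the merge stable.
void merge_by_key_sketch(const std::vector<int>& ka, const std::vector<int>& va,
                         const std::vector<int>& kb, const std::vector<int>& vb,
                         std::vector<int>& keys_out, std::vector<int>& vals_out)
{
  std::size_t i = 0, j = 0;
  while (i < ka.size() || j < kb.size()) {
    // take from the first range on ties, or when the second is exhausted
    bool from_a = j == kb.size() || (i < ka.size() && ka[i] <= kb[j]);
    keys_out.push_back(from_a ? ka[i] : kb[j]);
    vals_out.push_back(from_a ? va[i++] : vb[j++]);
  }
}
```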
## New Examples

- `uninitialized_vector` demonstrates how to use a custom allocator to avoid the automatic initialization of elements in `thrust::device_vector`.
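The allocator trick the example relies on can be sketched with a standard container (names are hypothetical, and this shows the general C++ technique rather than the example's actual code): an allocator whose no-argument `construct` is a no-op makes `resize` skip value-initialization, while constructions with arguments still forward to placement new.

```cpp
#include <memory>
#include <utility>
#include <vector>

// An allocator that leaves default-constructed elements uninitialized.
template <typename T>
struct uninitialized_allocator : std::allocator<T>
{
  template <typename U>
  struct rebind { typedef uninitialized_allocator<U> other; };

  uninitialized_allocator() = default;
  template <typename U>
  uninitialized_allocator(const uninitialized_allocator<U>&) {}

  // no-op: skip value-initialization (preferred over the variadic
  // overload below when no constructor arguments are given)
  template <typename U>
  void construct(U*) {}

  // constructions with arguments behave normally
  template <typename U, typename... Args>
  void construct(U* p, Args&&... args)
  {
    ::new (static_cast<void*>(p)) U(std::forward<Args>(args)...);
  }
};
```

This only pays off for trivially destructible element types, where skipping the initial fill is observable as saved bandwidth; reading an element before writing it remains undefined behavior.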
## Other Enhancements

- Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their `execution_policy` parameter.
- Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as `thrust::device`.
- `THRUST_DEVICE_SYSTEM_CPP` has been added as a compile-time target for the device backend.
- CUDA `merge` performance is 2-15x faster.
- CUDA comparison sort performance is 1.3-4x faster.
- CUDA set operation performance is 1.5-15x faster.
- TBB `reduce_by_key` performance is 80% faster.
- Several algorithms have been parallelized with TBB.
- Support for user allocators in vectors has been improved.
- The `sparse_vector` example is now implemented with `merge_by_key` instead of `sort_by_key`.
- Warnings have been eliminated in various contexts.
- Warnings about `__host__` or `__device__`-only functions called from `__host__ __device__` functions have been eliminated in various contexts.
- Documentation about algorithm requirements has been improved.
- Simplified the `minimal_custom_backend` example.
- Simplified the `cuda/custom_temporary_allocation` example.
- Simplified the `cuda/fallback_allocator` example.
## Bug Fixes

- #248: fix broken `counting_iterator<float>` behavior with OpenMP.
- #231, #209: fix set operation failures with CUDA.
- #187: fix incorrect occupancy calculation with CUDA.
- #153: fix broken multi-GPU behavior with CUDA.
- #142: eliminate warning produced by `thrust::random::taus88` and MSVC 2010.
- #208: correctly initialize elements in temporary storage when necessary.
- #16: fix compilation error when sorting `bool` with CUDA.
- #10: fix ambiguous overloads of `reinterpret_tag`.
## Known Issues

- `g++` versions 4.3 and lower may fail to dispatch `thrust::get_temporary_buffer` correctly, causing infinite recursion in examples such as `cuda/custom_temporary_allocation`.
## Acknowledgments
- Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
- Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
- Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.
# Thrust 1.5.3 (CUDA Toolkit 5.0)

Thrust 1.5.3 is a minor bug fix release.

## Bug Fixes

- Avoid warnings about a potential race due to a `__shared__` non-POD variable.
# Thrust 1.6.0

Thrust 1.6.0 provides an interface for customization and extension and a new backend system based on the Threading Building Blocks (TBB) library. With this new interface, programmers may customize the behavior of specific algorithms, control the allocation of temporary storage, or invent entirely new backends. These enhancements also allow multiple backend systems such as CUDA and OpenMP to coexist within a single program. Support for TBB allows Thrust programs to integrate more naturally into applications which may already employ the TBB task scheduler.
## Breaking Changes

- The header `<thrust/experimental/cuda/pinned_allocator.h>` has been moved to `<thrust/system/cuda/experimental/pinned_allocator.h>`.
- `thrust::experimental::cuda::pinned_allocator` has been moved to `thrust::cuda::experimental::pinned_allocator`.
- The macro `THRUST_DEVICE_BACKEND` has been renamed `THRUST_DEVICE_SYSTEM`.
- The macro `THRUST_DEVICE_BACKEND_CUDA` has been renamed `THRUST_DEVICE_SYSTEM_CUDA`.
- The macro `THRUST_DEVICE_BACKEND_OMP` has been renamed `THRUST_DEVICE_SYSTEM_OMP`.
- `thrust::host_space_tag` has been renamed `thrust::host_system_tag`.
- `thrust::device_space_tag` has been renamed `thrust::device_system_tag`.
- `thrust::any_space_tag` has been renamed `thrust::any_system_tag`.
- `thrust::iterator_space` has been renamed `thrust::iterator_system`.
## New Features

- Backend Systems
  - Threading Building Blocks (TBB) is now supported.
- Algorithms
  - `thrust::for_each_n`
  - `thrust::raw_reference_cast`
- Types
  - `thrust::pointer`
  - `thrust::reference`
New Examples
cuda/custom_temporary_allocation
cuda/fallback_allocator
device_ptr
expand
minimal_custom_backend
raw_reference_cast
set_operations
## Other Enhancements

- `thrust::for_each` now returns the end of the input range, similar to most other algorithms.
- `thrust::pair` and `thrust::tuple` have `swap` functionality.
- All CUDA algorithms now support large data types.
- Iterators may be dereferenced in user `__device__` or `__global__` functions.
- The safe use of different backend systems is now possible within a single binary.
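The new `for_each` return contract can be sketched in plain C++ (the helper name is hypothetical): unlike `std::for_each`, which returns the function object, Thrust's `for_each` now returns the end of the input range, so the result can feed directly into a follow-on algorithm call.

```cpp
#include <vector>

// Apply f to every element and return the end of the range, matching
// the new thrust::for_each contract.
template <typename Iterator, typename UnaryFunction>
Iterator for_each_sketch(Iterator first, Iterator last, UnaryFunction f)
{
  for (Iterator it = first; it != last; ++it)
    f(*it);
  return last;   // the end of the input range, per the new contract
}
```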
## Bug Fixes

- #469: the `min_element` and `max_element` algorithms no longer require a `const` comparison operator.
## Known Issues

- NVCC may crash when parsing TBB headers on Windows.
# Thrust 1.5.2 (CUDA Toolkit 4.2)

Thrust 1.5.2 is a minor bug fix release.

## Bug Fixes

- Fixed a warning about C-style initialization of structures.