This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Releases: NVIDIA/thrust

Thrust 1.8.3 (CUDA Toolkit 8.0)

16 May 09:58

Thrust 1.8.3 is a small bug fix release.

New Examples

  • range_view demonstrates the use of a view (a non-owning wrapper for an iterator range with a container-like interface).

Bug Fixes

  • thrust::(min|max|minmax)_element can now accept raw device pointers when an explicit device execution policy is used.
  • clear() operations on vector types no longer require the element type to have a default constructor.

Thrust 1.8.2 (CUDA Toolkit 7.5)

28 Feb 20:44

Thrust 1.8.2 is a small bug fix release.

Bug Fixes

  • Avoid warnings and errors concerning user functions called from __host__ __device__ functions
  • #632 CUDA set_intersection_by_key error
  • #651 thrust::copy between host & device is not interoperable with thrust::cuda::par.on(stream)
  • #664 CUDA for_each ignores execution policy's stream

Known Issues

  • #628 CUDA's reduce_by_key fails on sm_50 devices

Thrust 1.8.1 (CUDA Toolkit 7.0)

18 Mar 23:39

Thrust 1.8.1 is a small bug fix release.

Bug Fixes

  • CUDA thrust::for_each accesses illegal memory locations when given a large range.

Thrust 1.8.0

12 Jan 22:16

Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.

New Features

  • Algorithms in CUDA __device__ code
    • Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions.

      Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:

       __device__ void my_device_sort(int *data, size_t n)
       {
         thrust::sort(thrust::device, data, data + n);
       }
      

      The following execution policies are supported in CUDA __device__ code:

      • thrust::seq
      • thrust::cuda::par
      • thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA

      Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.

  • Execution Policies
    • CUDA Streams
      The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm execution should occur on a given stream:

      // execute for_each on stream s
      thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
      

      Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.

    • thrust::seq
      The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread:

      // execute for_each sequentially in this thread
      thrust::for_each(thrust::seq, begin, end, my_functor);
      
  • Other
    • The new thrust::complex template provides complex number support.

New Examples

  • simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution.
  • async_reduce demonstrates ways to achieve algorithm invocations which are asynchronous with the calling thread.

Other Enhancements

  • CUDA sort performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
  • CUDA merge performance is 200% faster on Tesla K20c for large problem sizes.
  • CUDA sort performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
  • CUDA reduce_by_key performance is 25% faster on Tesla K20c for large problem sizes.
  • CUDA scan performance is 15% faster on Tesla K20c for large problem sizes.
  • fallback_allocator example is simpler.

Bug Fixes

  • #364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
  • #371 do not redefine __CUDA_ARCH__
  • #379 fix crash when dereferencing transform_iterator on the CPU
  • #391 avoid use of uppercase variable names
  • #392 fix thrust::copy between cusp::complex & std::complex
  • #396 program compiled with gcc < 4.3 hangs during comparison sort
  • #406 fallback_allocator.cu example checks device for unified addressing support
  • #417 avoid using std::less<T> in binary search algorithms
  • #418 avoid various warnings
  • #443 including version.h no longer configures default systems
  • #578 nvcc produces warnings when sequential algorithms are used with cpu systems

Known Issues

  • When invoked with primitive data types, thrust::sort, thrust::sort_by_key, thrust::stable_sort, thrust::stable_sort_by_key may fail to link in some cases with nvcc -rdc=true.
  • The CUDA implementation of thrust::reduce_by_key incorrectly outputs the last element in a segment of equivalent keys instead of the first.

Acknowledgments

  • Thanks to Sean Baxter for contributing faster CUDA reduce, merge, and scan implementations.
  • Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
  • Thanks to Filipe Maia for contributing the implementation of thrust::complex.

Thrust 1.7.2 (CUDA Toolkit 6.5)

16 May 09:57

Thrust 1.7.2 is a minor bug fix release.

Bug Fixes

  • Avoid use of std::min in generic find implementation.

Thrust 1.7.1 (CUDA Toolkit 6.0)

19 Jun 21:33

Thrust 1.7.1 is a minor bug fix release.

Bug Fixes

  • Eliminate identifiers in set_operations.cu example with leading underscore
  • Eliminate unused variable warning in CUDA reduce_by_key implementation
  • Avoid deriving function objects from std::unary_function and std::binary_function

Thrust 1.7.0 (CUDA Toolkit 5.5)

03 Jul 05:09

Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well as stencil versions of partitioning algorithms. thrust::tabulate has been introduced to tabulate the values of functions taking integers. For 32-bit types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation provides 80% faster performance.

Breaking API Changes

  • Dispatch
    • Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch.

      See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.

    • thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.

  • Iterators
    • iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated.
    • iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor).
    • iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade).
    • iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access).
    • All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
    • Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type.
  • Other
    • normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution).
    • Placeholder expressions may no longer include the comma operator.

New Features

  • Execution Policies
    • Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.

      For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution policy may be passed as an argument to an algorithm invocation to enable CUDA execution.

      The following execution policies are supported in this version:

      • thrust::host
      • thrust::device
      • thrust::cpp::par
      • thrust::cuda::par
      • thrust::omp::par
      • thrust::tbb::par
  • Algorithms
    • free
    • get_temporary_buffer
    • malloc
    • merge_by_key
    • partition with stencil
    • partition_copy with stencil
    • return_temporary_buffer
    • set_difference_by_key
    • set_intersection_by_key
    • set_symmetric_difference_by_key
    • set_union_by_key
    • stable_partition with stencil
    • stable_partition_copy with stencil
    • tabulate

New Examples

  • uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector.

Other Enhancements

  • Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter.
  • Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device.
  • THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend.
  • CUDA merge performance is 2-15x faster.
  • CUDA comparison sort performance is 1.3-4x faster.
  • CUDA set operation performance is 1.5-15x faster.
  • TBB reduce_by_key performance is 80% faster.
  • Several algorithms have been parallelized with TBB.
  • Support for user allocators in vectors has been improved.
  • The sparse_vector example is now implemented with merge_by_key instead of sort_by_key.
  • Warnings have been eliminated in various contexts.
  • Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts.
  • Documentation about algorithm requirements has been improved.
  • Simplified the minimal_custom_backend example.
  • Simplified the cuda/custom_temporary_allocation example.
  • Simplified the cuda/fallback_allocator example.

Bug Fixes

  • #248 fix broken counting_iterator<float> behavior with OpenMP
  • #231, #209 fix set operation failures with CUDA
  • #187 fix incorrect occupancy calculation with CUDA
  • #153 fix broken multigpu behavior with CUDA
  • #142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
  • #208 correctly initialize elements in temporary storage when necessary
  • #16 fix compilation error when sorting bool with CUDA
  • #10 fix ambiguous overloads of reinterpret_tag

Known Issues

  • g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation.

Acknowledgments

  • Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
  • Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
  • Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.

Thrust 1.5.3 (CUDA Toolkit 5.0)

16 May 09:50

Thrust 1.5.3 is a minor bug fix release.

Bug Fixes

  • Avoid warnings about potential race due to __shared__ non-POD variable

Thrust 1.6.0

16 May 09:53

Thrust 1.6.0 provides an interface for customization and extension and a new backend system based on the Threading Building Blocks library. With this new interface, programmers may customize the behavior of specific algorithms as well as control the allocation of temporary storage or invent entirely new backends. These enhancements also allow multiple different backend systems such as CUDA and OpenMP to coexist within a single program. Support for TBB allows Thrust programs to integrate more naturally into applications which may already employ the TBB task scheduler.

Breaking Changes

  • The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
  • thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
  • The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
  • The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
  • The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
  • thrust::host_space_tag has been renamed thrust::host_system_tag
  • thrust::device_space_tag has been renamed thrust::device_system_tag
  • thrust::any_space_tag has been renamed thrust::any_system_tag
  • thrust::iterator_space has been renamed thrust::iterator_system

New Features

  • Backend Systems
    • Threading Building Blocks (TBB) is now supported
  • Algorithms
    • thrust::for_each_n
    • thrust::raw_reference_cast
  • Types
    • thrust::pointer
    • thrust::reference

New Examples

  • cuda/custom_temporary_allocation
  • cuda/fallback_allocator
  • device_ptr
  • expand
  • minimal_custom_backend
  • raw_reference_cast
  • set_operations

Other Enhancements

  • thrust::for_each now returns the end of the input range, as most other algorithms do
  • thrust::pair and thrust::tuple have swap functionality
  • All CUDA algorithms now support large data types
  • Iterators may be dereferenced in user __device__ or __global__ functions
  • The safe use of different backend systems is now possible within a single binary

Bug Fixes

  • #469 min_element and max_element algorithms no longer require a const comparison operator

Known Issues

  • NVCC may crash when parsing TBB headers on Windows.

Thrust 1.5.2 (CUDA Toolkit 4.2)

16 May 09:49

Thrust 1.5.2 is a minor bug fix release.

Bug Fixes

  • Fixed warning about C-style initialization of structures