Overview

Thrust: Code at the speed of light

Thrust is a C++ parallel programming library which resembles the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB, and OpenMP) facilitates integration with existing software. Develop high-performance applications rapidly with Thrust!

Thrust is included in the NVIDIA HPC SDK and the CUDA Toolkit.

Quick Start

Getting the Thrust Source Code

The CUDA Toolkit provides a recent release of the Thrust source code in include/thrust. This will be suitable for most users.

Users that wish to contribute to Thrust or try out newer features should recursively clone the Thrust GitHub repository:

git clone --recursive https://github.com/NVIDIA/thrust.git

Using Thrust From Your Project

Thrust is a header-only library; there is no need to build or install the project unless you want to run the Thrust unit tests.

For CMake-based projects, we provide a CMake package for use with find_package. See the CMake README for more information. Thrust can also be added via add_subdirectory or tools like the CMake Package Manager.
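For illustration, a minimal consuming CMakeLists.txt might look like the following sketch. The `thrust_create_target` helper is provided by Thrust's CMake package; the target name `my_app` and the file `main.cu` are hypothetical:

```cmake
# Sketch: consume Thrust from a CMake project, assuming find_package can locate it.
find_package(Thrust REQUIRED CONFIG)

# Create an interface target pinned to the CPP host system and CUDA device system.
thrust_create_target(ThrustCUDA HOST CPP DEVICE CUDA)

add_executable(my_app main.cu)
target_link_libraries(my_app PRIVATE ThrustCUDA)
```

See the CMake README for the full set of supported options.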

For non-CMake projects, compile with:

  • The Thrust include path (`-I<thrust repo root>`)
  • The CUB include path, if using the CUDA device system (`-I<thrust repo root>/dependencies/cub/`)
  • By default, the CPP host system and CUDA device system are used. These can be changed using compiler definitions:
    • -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_XXX, where XXX is CPP (serial, default), OMP (OpenMP), or TBB (Intel TBB)
    • -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_XXX, where XXX is CPP, OMP, TBB, or CUDA (default).
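The flags above combine as in the following illustrative command lines. The paths are placeholders: `$THRUST_ROOT` is a hypothetical variable pointing at a Thrust checkout, and `saxpy.cu`/`saxpy.cpp` are hypothetical source files:

```shell
# CUDA device system (the default), compiled with nvcc:
nvcc -I"$THRUST_ROOT" -I"$THRUST_ROOT/dependencies/cub" -o saxpy saxpy.cu

# OpenMP device system, compiled with an ordinary host compiler (no nvcc needed):
g++ -fopenmp -I"$THRUST_ROOT" \
    -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
    -o saxpy saxpy.cpp
```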

Examples

Thrust is best explained through examples. The following source code generates random numbers serially and then transfers them to a parallel device where they are sorted.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
  // generate 32M random numbers serially
  thrust::host_vector<int> h_vec(32 << 20);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device (846M keys per second on GeForce GTX 480)
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;
}
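Because Thrust deliberately mirrors the Standard Library's iterator interface, the same pipeline can be sketched serially with only std:: algorithms. This host-only analogue is for comparison and uses no Thrust headers:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Serial analogue of the Thrust example above: std::generate / std::sort /
// std::copy correspond one-to-one to thrust::generate / thrust::sort /
// thrust::copy; the vector copies stand in for the host<->device transfers.
std::vector<int> generate_and_sort(std::size_t n)
{
    std::vector<int> h_vec(n);
    std::generate(h_vec.begin(), h_vec.end(), std::rand);

    std::vector<int> d_vec = h_vec;         // "transfer to the device"
    std::sort(d_vec.begin(), d_vec.end());  // sort "on the device"

    std::copy(d_vec.begin(), d_vec.end(), h_vec.begin());  // "transfer back"
    return h_vec;
}
```

Swapping `std::` for `thrust::` (and `std::vector` for `thrust::device_vector` on the device side) is essentially all the Thrust version above adds.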

This code sample computes the sum of 100 random numbers in parallel:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
  // generate random data serially
  thrust::host_vector<int> h_vec(100);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer to device and compute sum
  thrust::device_vector<int> d_vec = h_vec;
  int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
  return 0;
}
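The call to thrust::reduce with an initial value of 0 and thrust::plus<int> corresponds to std::accumulate; a minimal serial sketch of the same sum:

```cpp
#include <numeric>
#include <vector>

// Serial counterpart of
//   thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
// std::accumulate folds the range left to right, starting from the init value.
int sum_values(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0);
}
```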

Additional usage examples can be found in the examples/ and testing/ directories of the GitHub repository.

Supported Compilers

Thrust is regularly tested using the specified versions of the following compilers. Unsupported versions may emit deprecation warnings, which can be silenced by defining THRUST_IGNORE_DEPRECATED_COMPILER during compilation.

  • NVCC 11.0+
  • NVC++ 20.9+
  • GCC 5+
  • Clang 7+
  • MSVC 2019+ (19.20/16.0/14.20)

Releases

Thrust is distributed with the NVIDIA HPC SDK and the CUDA Toolkit in addition to GitHub.

See the changelog for details about specific releases.

Thrust Release | Included In
---------------|-----------------------------------------
1.15.0         | TBD
1.14.0         | NVIDIA HPC SDK 21.9
1.13.1         | CUDA Toolkit 11.5
1.13.0         | NVIDIA HPC SDK 21.7
1.12.1         | CUDA Toolkit 11.4
1.12.0         | NVIDIA HPC SDK 21.3
1.11.0         | CUDA Toolkit 11.3
1.10.0         | NVIDIA HPC SDK 20.9 & CUDA Toolkit 11.2
1.9.10-1       | NVIDIA HPC SDK 20.7 & CUDA Toolkit 11.1
1.9.10         | NVIDIA HPC SDK 20.5
1.9.9          | CUDA Toolkit 11.0
1.9.8-1        | NVIDIA HPC SDK 20.3
1.9.8          | CUDA Toolkit 11.0 Early Access
1.9.7-1        | CUDA Toolkit 10.2 for Tegra
1.9.7          | CUDA Toolkit 10.2
1.9.6-1        | NVIDIA HPC SDK 20.3
1.9.6          | CUDA Toolkit 10.1 Update 2
1.9.5          | CUDA Toolkit 10.1 Update 1
1.9.4          | CUDA Toolkit 10.1
1.9.3          | CUDA Toolkit 10.0
1.9.2          | CUDA Toolkit 9.2
1.9.1-2        | CUDA Toolkit 9.1
1.9.0-5        | CUDA Toolkit 9.0
1.8.3          | CUDA Toolkit 8.0
1.8.2          | CUDA Toolkit 7.5
1.8.1          | CUDA Toolkit 7.0
1.8.0          |
1.7.2          | CUDA Toolkit 6.5
1.7.1          | CUDA Toolkit 6.0
1.7.0          | CUDA Toolkit 5.5
1.6.0          |
1.5.3          | CUDA Toolkit 5.0
1.5.2          | CUDA Toolkit 4.2
1.5.1          | CUDA Toolkit 4.1
1.5.0          |
1.4.0          | CUDA Toolkit 4.0
1.3.0          |
1.2.1          |
1.2.0          |
1.1.1          |
1.1.0          |
1.0.0          |

Development Process

Thrust uses the CMake build system to build unit tests, examples, and header tests. To build Thrust as a developer, the following recipe should be followed:

# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Create build directory:
mkdir build
cd build

# Configure -- use one of the following:
cmake ..   # Command line interface.
ccmake ..  # ncurses GUI (Linux only)
cmake-gui  # Graphical UI, set source/build directories in the app

# Build:
cmake --build . -j <num jobs>  # Invokes make (or ninja, etc.)

# Run tests and examples:
ctest
By default, a serial CPP host system, CUDA accelerated device system, and C++14 standard are used. This can be changed in CMake. More information on configuring your Thrust build and creating a pull request can be found in CONTRIBUTING.md.

Issues
  • Thrust equivalent to std::complex

    I've put all the code inside thrust/detail/complex. I've also created the unit tests and documentation.

    I've ported FreeBSD's C99 complex implementation, as it seems to be the highest-quality one available. All the functions, except for pow, are accurate to within a few ULPs.

    Complex atan() and atanh() require C++11 due to the lack of real atanh() in previous versions.

    I've tested with g++ and clang++, but I didn't have the opportunity to try MSVC as I don't have access to it.

    opened by FilipeMaia 39
  • Intermittent compilation failures with thrust, cuda 10.2 and MSVC 2019

    We experience intermittent compilation failures on our CI server. The CXX compiler identification is MSVC 19.25.28612.0; the CUDA compiler identification is NVIDIA 10.2.89.

    Retrying the compilation typically succeeds. Our CI server now retries compiling the project up to 5 times to avoid this issue. (The issue has never occurred 5 times in a row yet.)

    The error looks as follows

    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2993: 'T': illegal type for non-type template parameter '__formal'
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): note: see reference to class template instantiation 'thrust::detail::allocator_traits_detail::has_value_type<T>' being compiled
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2065: 'U1': undeclared identifier
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2923: 'std::_Select<__formal>::_Apply': 'U1' is not a valid template type argument for parameter '<unnamed-symbol>'
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ')'
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ';'
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2238: unexpected token(s) preceding ';'
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: ')'
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2988: unrecognizable template declaration/definition
    [2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: '<end Parse>'
    

    This affects at least two projects

    • https://github.com/apache/incubator-mxnet/issues/17935
    • https://github.com/pytorch/pytorch/issues/25393
    opened by leezu 33
  • Add gpuCI GPU testing, automatic GPU detection, and support for more compilers

    Add gpuCI GPU testing, automatic GPU detection, and support for more compilers:

    • CMake: Add support for detecting the compute archs of the GPUs in your system at configure time.
    • gpuCI: Add a GPU node configuration that builds and tests as little as possible.
    • gpuCI: Cleanup logic for different build and test configurations.
    • gpuCI: Fix an unfortunate typo in determine_build_parallelism.bash which led to the parallelism level not being set.
    • gpuCI: Add support for NVC++.
    • gpuCI: Update to CUDA 11.1 and Ubuntu 20.04.
    • gpuCI: Add NVC++ and ICC configurations to the CPU axis file.
    • gpuCI: Add a GPU axis file.
    • gpuCI: Increase the desired memory per build thread to 4GB.
    • gpuCI: Add a -j switch which controls build parallelism to ci/local/build.bash.
    opened by brycelelbach 32
  • OpenCL support

    While CUDA is good for Nvidia cards, the fact is that only half of the userbase can use CUDA. OpenCL support would allow for Thrust to be useful for people with AMD cards, as well as embedded Android platforms which are now getting OpenCL support.

    The other advantage from implementing OpenCL would be that OpenGL 4.x has interop with OpenCL, which would allow for more possibilities regarding cross platform support.

    Source on Android getting OpenCL: http://www.androidcentral.com/nexus-4-and-nexus-10-found-have-opencl-drivers

    type: enhancement 
    opened by peterclemenko 26
  • Add transform_input_output_iterator

    Adds a variant of transform iterator adapter that works as both an input iterator and an output iterator. The given input function is applied after reading from the wrapped iterator while the output function is applied before writing to the wrapped iterator. The implementation is largely based on transform_output_iterator, with additional operators added to the proxy reference.

    Also fixes some typos in transform_output_iterator.

    testing: gpuCI passed testing: internal ci in progress 
    opened by trevorsm7 24
  • Support adding Thrust to CMake projects with `add_subdirectory`

    I have been using the GitHub Thrust for a while, from before its recent merge with the CUDA one. I manage dependencies using CMake FetchContent. After the recent update in this repo, which includes a CMakeLists.txt, I found that it breaks commands like add_subdirectory() with the following error:

     MSVC_COMPILER_FLAGS:                                                                                                                                           
     | WARN_ALL : '/Wall'                                                                                                                                          
     | WARNINGS_AS_ERRORS : '/Wx'                                                                                                                                  
     | RELEASE : '/Ox'                                                                                                                                             
     | DEBUG : '/Zi;-D_DEBUG;/MTd'                                                                                                                                 
     | EXCEPTION_HANDLING : '/EHsc'                                                                                                                                
     | CPP : ''                                                                                                                                                    
     | OMP : '/openmp'                                                                                                                                             
     | TBB : ''                                                                                                                                                    
     | CUDA : ''                                                                                                                                                   
     | CUDA_BULK : ''                                                                                                                                              
     | WORKAROUNDS : '/DNOMINMAX;/wd4503'                                                                                                                          
     | C++03 : ''
     | C++11 : '-std=c++11'
    -- Looking for C++ include pthread.h
    -- Looking for C++ include pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - found
    -- Found Threads: TRUE
    -- Found CUDA: /net/software/modules-sw/cuda/10.1/Linux/RHEL6/x86_64 (found version "10.1")                                                           
    -- Found OpenMP_CXX: -fopenmp (found version "4.5")
    -- Found OpenMP: TRUE (found version "4.5")
    -- Found 49 examples
    -- Found 5 examples/cuda
    -- Found 4 examples/cpp_integration
    -- Found 152 tests in testing
    -- Found 59 tests in backend
    CMake Error at build_cuda/_deps/thrust-src/testing/CMakeLists.txt:48 (add_custom_target):                                                                     
      add_custom_target cannot create target "check" because another target with
      the same name already exists.  The existing target is a custom target
      created in source directory
      "/home/aznb/mycodes/SCgenome_scbmc/build_cuda/_deps/kokkos-src".                                                                         
      See documentation for policy CMP0002 for more details.
    

    Aside from the name conflict via add_custom_target, I expected to use Thrust as a header-only library, and I don't expect CMake to configure any of the test targets unless I ask it to.

    opened by Char-Aznable 24
  • Uninitialized __global__ memory in thrust::sort (cub::RadixSort) - incorrect results/segfaults in thrust::sort, thrust::remove_if, etc.

    We have been getting weird errors in the thrust functions sort_by_key, sort and remove_if in our custom code and in third-party code such as flann (kdtree on cuda) and MinkowskiEngine (pytorch custom lib). After a thorough investigation, we discovered that the mentioned functions sometimes randomly produce wrong results (sorted vectors contain values that were not in the original vectors, remove_if does not remove elements matching a condition, etc.). At first we thought the issues were related to pytorch, as they occurred when we linked the pytorch lib, but afterward we were able to produce a minimal example with errors even without any pytorch stuff. Also, the errors seem to randomly appear or disappear when a line of code is added/removed or a library (e.g. pytorch) is linked (but not used). I suppose this suggests there is some problem related to the physical address of the code/data.

    We tested our binaries with compute-sanitizer --tool initcheck and in cases when thrust::sort or thrust::remove_if returned corrupted results we got e.g. Uninitialized __global__ memory read of size 4 bytes... errors. As mentioned above, when we removed/added some code/library that did not affect the actual computation the results were miraculously fixed but compute-sanitizer --tool initcheck still returned the error. Therefore it seems sometimes the uninitialized memory actually contains the value it should be initialized with and everything runs okay-ish.

    We tested many versions of the example (below) as well as many versions of our internal code on at least:

    • nvidia devel ubuntu18.04 and ubuntu20.04 docker images with cuda 10.1, 10.2, 11.0, 11.1, 11.2
    • on ubuntu20.04 and arch linux distributions
    • with default thrust/cub and the latest thrust/cub (thrust version 1.12)

    The issues were present in every setup with slight variations - e.g. changing cuda seemed to fix the issue but adding an independent line of code broke the code again.

    We tested this particular example also on Windows and it seems it is the only place where the code runs without Uninitialized __global__ memory warning. But due to compilation difficulties, we were not able to compile our other programs with the same issue and test them yet.


    To reproduce one of the issues, create main.cu, Dockerfile and CMakeLists.txt (file contents below) and run the following commands:

    docker build -t test-docker-image .
    docker run -it --gpus 1 -v $(pwd):/xxx -w /xxx test-docker-image bash
    mkdir build
    cd build
    cmake ..
    make
    compute-sanitizer --tool initcheck bug_test
    

    You should get the following output:

    RUN 0, NUM 128, dev_ptr 0x7fe1c5800000: OK! 
    RUN 0, NUM 256, dev_ptr 0x7fe1c5800000: OK!
    ...
    ========= Uninitialized __global__ memory read of size 4 bytes
    =========     at 0x5b8 in void cub::DeviceRadixSortOnesweepKernel<cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800,bool=0,int,cub::NullType,int,int>(int*,int,bool=0*,int* const *,int*,int* const * const *,cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800*,int* const * const * const *,int*,int,int)
    =========     by thread (214,0,0) in block (0,0,0)
    =========     Address 0x7f13cdc09dd8
    =========     Saved host backtrace up to driver entry point at kernel launch time
    =========     Host Frame:cuLaunchKernel [0x7f1402c1ba6e]
    =========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
    =========     Host Frame: [0x7f1415e1862b]
    =========                in /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
    =========     Host Frame:cudaLaunchKernel [0x7f1415e585b1]
    =========                in /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
    ...
    

    When pytorch libs and a specific version of thrust are linked, we also get Host and device vector doesn't match! aside from the Uninitialized __global__ memory warning. Sometimes, in different setups, we got Uninitialized __global__ memory read of size 1 bytes ... or Floating point exception (core dumped).

    Also, we got the uninitialized memory warning when calling thrust::remove_if in one place of our code. Similarly to thrust::sort, the warning occurred when the outcome of the function was incorrect, but it also occurred when the outcome was (probably by chance) correct:

    ========= Uninitialized __global__ memory read of size 4 bytes
    =========     at 0x1d68 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::cuda_cub::__copy_if::no_stencil_tag_*,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::detail::unary_negate<minkowski::detail::is_first<unsigned int>>,int,int*>,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::cuda_cub::__copy_if::no_stencil_tag_*,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::detail::unary_negate<minkowski::detail::is_first<unsigned int>>,int,int*,cub::ScanTileState<int,bool=1>,unsigned long>(unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type)
    =========     by thread (30,0,0) in block (0,0,0)
    =========     Address 0x7fa75a017170
    =========     Saved host backtrace up to driver entry point at kernel launch time
    =========     Host Frame:cuLaunchKernel [0x7fa78effea6e]
    =========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
    =========     Host Frame: [0x7fa7fb2ec62b]
    =========                in /usr/local/cuda/lib64/libcudart.so.11.0
    =========     Host Frame:cudaLaunchKernel [0x7fa7fb32c5b1]
    =========                in /usr/local/cuda/lib64/libcudart.so.11.0
    

    Also a similar example of probably the same problem was mentioned by us in thrust issue https://github.com/NVIDIA/thrust/issues/1341#issuecomment-791642454 and pytorch issue https://github.com/pytorch/pytorch/issues/52663.


    The files:

    main.cu

    #include <iostream>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/equal.h>
    #include <thrust/functional.h>
    
    int main() 
    {
        for (size_t NUM = 128; NUM < 32768; NUM+=128) 
        {
            for (int run = 0; run < 1; run++) {
                thrust::host_vector<int> h(NUM);
                thrust::device_vector<int> d(NUM);
                for (int i = 0; i < NUM; i++) {
                    int random_number = rand() * 1000;
                    h[i] = random_number;
                    d[i] = random_number;
                }
                thrust::sort(h.begin(), h.end());
                thrust::sort(d.begin(), d.end());
        
                thrust::host_vector<int> d_host(d.begin(), d.end());
                bool sort_ok = thrust::equal(
                    d_host.begin(), d_host.end() - 1, d_host.begin() + 1,	
                    thrust::less_equal<int>());
                bool match = thrust::equal(d_host.begin(), d_host.end(), h.begin());
    
                std::cout << "RUN " << run << ", NUM " << NUM;
                std::cout << ", dev_ptr " << static_cast<void*>(thrust::raw_pointer_cast(d.data())) << ": ";
                if (sort_ok && match) { std::cout << "OK! "; }
                if (!sort_ok) { std::cout << "Wrong sort! "; }
                if (!match) { std::cout << "Host and device vector doesn't match! "; }
                std::cout << std::endl;
            }
        }
    
        return 0;
    }
    

    Dockerfile

    FROM nvidia/cuda:11.0-devel-ubuntu20.04
    RUN apt-get update && apt-get install -y wget
    RUN wget -qO- "https://cmake.org/files/v3.17/cmake-3.17.5-Linux-x86_64.tar.gz" | tar --strip-components=1 -xz -C /usr/local
    

    CMakeLists.txt

    cmake_minimum_required(VERSION 3.17.5)
    project(bug_test CUDA CXX)
    add_executable(bug_test main.cu)
    target_compile_options(bug_test PUBLIC $<$<COMPILE_LANGUAGE:CUDA>:-arch compute_XX>)
    target_compile_options(bug_test PUBLIC $<$<COMPILE_LANGUAGE:CUDA>:-code sm_XX>)
    target_compile_features(bug_test PRIVATE cuda_std_14)
    

    I'll gladly provide other examples if necessary. @allisonvacanti

    type: bug: functional P1: should have 
    opened by soCzech 23
  • NVBug 3262468: Bogus data produced in CUDA 11(.1) by thrust::sort_by_key with custom comparison operator over two (zipped) key arrays

    Hello,

    while testing GPUSPH on CUDA 11.1 I discovered that the thrust::sort_by_key procedure that we use to sort the particle indices is now producing bogus data. The sort uses a custom comparator that fetches data from two different arrays (using zip_iterator etc). The sort results in one of the (sorted) key arrays being clobbered with invalid data.

    I am currently in the process of finding a minimal test case, but in the meantime the bug can be observed in action on the cuda11-thrust-sort-bug branch of GPUSPH, by running

    make DamBreak3D && ./DamBreak3D --maxiter 1 | grep '64656 255 '
    

    from the git working directory. This should produce no results (as it does on CUDA 10) if the sort is correct, but it results in numerous hits (all clobbered entries with bogus values) on CUDA 11.

    Thanks for looking into this,

    nvbug type: bug: compiler 
    opened by Oblomov 19
  • Variadic tuple preparation

    Some simplifications preparing Thrust for a variadic tuple implementation (some day... #524). Other changes would require a bit more coordination and can come separately, assuming these sorts of changes are now mergeable.

    With -DTHRUST_DEVICE_SYSTEM=CPP I get:

    100% tests passed, 0 tests failed out of 151
    
    Total Test time (real) = 108.10 sec
    
    testing: gpuCI passed testing: internal ci passed 
    opened by andrewcorrigan 19
  • Updated shuffle implementation to use better hash function

    Fixes https://github.com/thrust/thrust/issues/1256 by changing the hash function from taus88 to wyhash. Adds a test for a random distribution of numbers.

    opened by djns99 19
  • NVBug 2318871: Compilation failure with `zip_iterator` and `complex<float>` in CUDA 9.2

    CUDA version: 9.2. Thrust version: 1.9.2 (bundled with CUDA 9.2). GCC version: 5.4.0.

    The following code fails to compile for the CUDA backend, specifically the line "auto zipout_end = ...". It compiles, however, if we make either of the following changes:

    1. change TYPE to float
    2. target the OpenMP backend.
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/copy.h>
    #include <thrust/gather.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/complex.h>
     
    /* g++ -std=c++11 -I/usr/local/cuda/include -O2 -x c++ -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp minimal.cu
     nvcc -std=c++11 --expt-extended-lambda minimal.cu
    */
     
    // typedef float TYPE;
    typedef thrust::complex<float> TYPE;
     
    int main()
    {
       thrust::device_vector<TYPE> d_vec(10);
       thrust::sequence(d_vec.begin(), d_vec.end());
       thrust::device_vector<TYPE> d_res(10);
     
       auto pred = [] __host__ __device__ (TYPE val) { return abs(val) > 5; };
       auto cntit_begin = thrust::make_counting_iterator(0);
       auto cntit_end = cntit_begin + 10;
     
       // generate indices and values in two calls
       thrust::device_vector<int> indices(10);
       auto indices_end = thrust::copy_if(cntit_begin, cntit_end, d_vec.begin(), indices.begin(), pred);
       thrust::gather(indices.begin(), indices_end, d_vec.begin(), d_res.begin());
     
       // generate indices and values in one call
       auto zipin_begin = thrust::make_zip_iterator(thrust::make_tuple(cntit_begin, d_vec.begin()));
       auto zipin_end = thrust::make_zip_iterator(thrust::make_tuple(cntit_end, d_vec.end()));
       auto zipout_begin = thrust::make_zip_iterator(thrust::make_tuple(indices.begin(), d_res.begin()));
       // the following line fails to compile for combination of cuda backend and complex type
       auto zipout_end = thrust::copy_if(zipin_begin, zipin_end, d_vec.begin(), zipout_begin, pred);
    }
    
    type: bug: functional nvbug 
    opened by pijyoi 19
  • error: cudaErrorIllegalAddress: an illegal memory access was encountered

    I need to copy the values of the thrust device vector PusiTidal_d to another vector Pusi in the class member function void VS::RestoreOmegaPusi(). It works well if I use a naive "for loop", but if I use the thrust::copy command, it fails with:

    terminate called after throwing an instance of 'thrust::system::detail::bad_alloc' what(): std::bad_alloc: cudaErrorIllegalAddress: an illegal memory access was encountered Caught signal 6 - SIGABRT (abort)

    Could you please help me out? What has happened in the copy function? Since the naive "for loop" works, the memory should be exactly there. The following is the class member function; both Pusi and PusiTidal_d are declared as private members.

    void VS::RestoreOmegaPusi() {
        // for (int i = 0; i < PusiTidal_d.size(); i++) Pusi[i] = PusiTidal_d[i];
        thrust::copy(thrust::device, PusiTidal_d.begin(), PusiTidal_d.end(), Pusi.begin());
    }

    opened by ztdepztdep 1
  • Fix RDC flags on nvc++ builds.

    nvcc defaults to rdc-off, nvc++ defaults to rdc-on. We need to explicitly enable or disable these flags for each CUDA target, rather than just enabling them when needed.

    blocked type: bug: functional P0: must have only: cmake compiler: nvc++ helps: nvc++ backend: CUDA area: cmake 
    opened by allisonvacanti 5
  • Deprecated Type Trait Primitives

    According to https://github.com/llvm/llvm-project/blob/release/13.x/clang/docs/LanguageExtensions.rst#type-trait-primitives, __has_trivial_copy and __has_trivial_constructor are deprecated. This causes warnings, which turn into hard errors when compiling with -Werror. Should be simple to fix.

    type: bug: functional P1: should have compiler: clang (c++) 
    opened by pca006132 3
  • Implement `shift_left` and `shift_right` algorithms

    This adds both flavors of the new C++20 shift algorithms.

    Fixes #1484

    NOTE: my stone-aged desktop does not have a CUDA-capable device and it might explode if I try to compile thrust with it, so this is totally eyeballed

    type: enhancement P2: nice to have 
    opened by miscco 1
  • Device-side launch of thrust::lower_bound is creating wrong results

    #include <thrust/binary_search.h>
    #include <thrust/execution_policy.h>
    #include <stdio.h>
    
    __global__ void lowerbound(float inp_val) {
        constexpr int size = 6;
        float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
        auto result = thrust::lower_bound(
            thrust::device, a, a + size, inp_val);
        printf("%ld\n", result - a);
    }
    
    int main() {
        lowerbound<<<1,1>>>(0.0);
        lowerbound<<<1,1>>>(0.1);
        lowerbound<<<1,1>>>(0.2);
        lowerbound<<<1,1>>>(0.3);
        lowerbound<<<1,1>>>(0.4);
        lowerbound<<<1,1>>>(0.5);
        cudaDeviceSynchronize();
    }
    

    I get

    0
    0
    0
    0
    0
    0
    

    on CUDA 11.7 with the latest thrust

    type: bug: functional P1: should have helps: pytorch backend: CUDA 
    opened by zasdfgbnm 10
Releases(1.17.0)
  • 1.17.0(May 9, 2022)

    Thrust 1.17.0

    Summary

    Thrust 1.17.0 is the final minor release of the 1.X series. This release provides GDB pretty-printers for device vectors/references, a new unique_count algorithm, and an easier way to create tagged Thrust iterators. Several documentation fixes are included, which can be found on the new Thrust documentation site at https://nvidia.github.io/thrust. We’ll be migrating existing documentation sources to this new location over the next few months.

    New Features

    • NVIDIA/thrust#1586: Add new thrust::make_tagged_iterator convenience function. Thanks to @karthikeyann for this contribution.
    • NVIDIA/thrust#1619: Add unique_count algorithm. Thanks to @upsj for this contribution.
    • NVIDIA/thrust#1631: Add GDB pretty-printers for device vectors/references to scripts/gdb-pretty-printers.py. Thanks to @upsj for this contribution.

    Bug Fixes

    • NVIDIA/thrust#1671: Fixed reduce_by_key when called with 2^31 elements.

    Other Enhancements

    • NVIDIA/thrust#1512: Use CUB to implement adjacent_difference.
    • NVIDIA/thrust#1555: Use CUB to implement scan_by_key.
    • NVIDIA/thrust#1611: Add new doxybook-based Thrust documentation at https://nvidia.github.io/thrust.
    • NVIDIA/thrust#1639: Fixed broken link in documentation. Thanks to @jrhemstad for this contribution.
    • NVIDIA/thrust#1644: Increase contrast of search input text in new doc site. Thanks to @bdice for this contribution.
    • NVIDIA/thrust#1647: Add __forceinline__ annotations to a functor wrapper. Thanks to @mkuron for this contribution.
    • NVIDIA/thrust#1660: Fixed typo in documentation example for permutation_iterator.
    • NVIDIA/thrust#1669: Add a new explicit_cuda_stream.cu example that shows how to use explicit CUDA streams and par/par_nosync execution policies.
    Source code(tar.gz)
    Source code(zip)
  • 1.16.0(Feb 8, 2022)

    Summary

    Thrust 1.16.0 provides a new “nosync” hint for the CUDA backend, as well as numerous bugfixes and stability improvements.

    New thrust::cuda::par_nosync Execution Policy

    Most of Thrust’s parallel algorithms are fully synchronous and will block the calling CPU thread until all work is completed. This design avoids many pitfalls associated with asynchronous GPU programming, resulting in simpler and less error-prone usage for new CUDA developers. Unfortunately, this improvement in user experience comes at a performance cost that often frustrates more experienced CUDA programmers.

    Prior to this release, the only synchronous-to-asynchronous migration path for existing Thrust codebases involved significant refactoring, replacing calls to Thrust algorithms with a limited set of future-based thrust::async algorithms or lower-level CUB kernels. The new thrust::cuda::par_nosync execution policy provides a less invasive entry point for asynchronous computation.

    par_nosync is a hint to the Thrust execution engine that any non-essential internal synchronizations should be skipped and that an explicit synchronization will be performed by the caller before accessing results.

    While some Thrust algorithms require internal synchronization to safely compute their results, many do not. For example, multiple thrust::for_each invocations can be launched without waiting for earlier calls to complete:

    // Queue three `for_each` kernels:
    thrust::for_each(thrust::cuda::par_nosync, vec1.begin(), vec1.end(), Op{});
    thrust::for_each(thrust::cuda::par_nosync, vec2.begin(), vec2.end(), Op{});
    thrust::for_each(thrust::cuda::par_nosync, vec3.begin(), vec3.end(), Op{});
    
    // Do other work while kernels execute:
    do_something();
    
    // Must explicitly synchronize before accessing `for_each` results:
    cudaDeviceSynchronize();
    

    Thanks to @fkallen for this contribution.

    Deprecation Notices

    CUDA Dynamic Parallelism Support

    A future version of Thrust will remove support for CUDA Dynamic Parallelism (CDP).

    This will only affect calls to Thrust algorithms made from CUDA device-side code that currently launches a kernel; such calls will execute sequentially on the calling GPU thread instead of launching a device-wide kernel.

    Breaking Changes

    • Thrust 1.14.0 included a change that aliased the cub namespace to thrust::cub. This has caused issues with ambiguous namespaces for projects that declare using namespace thrust; from the global namespace. We recommend against this practice.
    • NVIDIA/thrust#1572: Removed several unnecessary header includes. Downstream projects may need to update their includes if they were relying on this behavior.

    New Features

    • NVIDIA/thrust#1568: Add thrust::cuda::par_nosync policy. Thanks to @fkallen for this contribution.

    Enhancements

    • NVIDIA/thrust#1511: Use CUB’s new DeviceMergeSort API and remove Thrust’s internal implementation.
    • NVIDIA/thrust#1566: Improved performance of thrust::shuffle. Thanks to @djns99 for this contribution.
    • NVIDIA/thrust#1584: Support user-defined CMAKE_INSTALL_INCLUDEDIR values in Thrust’s CMake install rules. Thanks to @robertmaynard for this contribution.

    Bug Fixes

    • NVIDIA/thrust#1496: Fix some issues affecting icc builds.
    • NVIDIA/thrust#1552: Fix some collisions with the min/max macros defined in windows.h.
    • NVIDIA/thrust#1582: Fix issue with function type alias on 32-bit MSVC builds.
    • NVIDIA/thrust#1591: Workaround issue affecting compilation with nvc++.
    • NVIDIA/thrust#1597: Fix some collisions with the small macro defined in windows.h.
    • NVIDIA/thrust#1599, NVIDIA/thrust#1603: Fix some issues with version handling in Thrust’s CMake packages.
    • NVIDIA/thrust#1614: Clarify that scan algorithm results are non-deterministic for pseudo-associative operators (e.g. floating-point addition).
    Source code(tar.gz)
    Source code(zip)
  • 1.15.0(Oct 25, 2021)

    Summary

    Thrust 1.15.0 provides numerous bugfixes, including non-numeric thrust::sequence support, several MSVC-related compilation fixes, fewer conversion warnings, counting_iterator initialization, and documentation updates.

    Deprecation Notices

    A future version of Thrust will remove support for CUDA Dynamic Parallelism (CDP).

    This will only affect calls to Thrust algorithms made from CUDA device-side code that currently launches a kernel; such calls will execute sequentially on the calling GPU thread instead of launching a device-wide kernel.

    Bug Fixes

    • NVIDIA/thrust#1507: Allow thrust::sequence to work with non-numeric types. Thanks to Ben Jude (@bjude) for this contribution.
    • NVIDIA/thrust#1509: Avoid macro collision when calling max() on MSVC. Thanks to Thomas (@tomintheshell) for this contribution.
    • NVIDIA/thrust#1514: Initialize all members in counting_iterator's default constructor.
    • NVIDIA/thrust#1518: Fix std::allocator_traits on MSVC + C++17.
    • NVIDIA/thrust#1530: Fix several -Wconversion warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
    • NVIDIA/thrust#1539: Fixed typo in thrust::for_each documentation. Thanks to Salman (@untamedImpala) for this contribution.
    • NVIDIA/thrust#1548: Avoid name collision with B0 macro in termios.h system header. Thanks to Philip Deegan (@PhilipDeegan) for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.14.0(Aug 24, 2021)

    Thrust 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.

    This release adds the ability to wrap the thrust:: namespace in an external namespace, providing a workaround for a variety of shared library linking issues. Thrust also learned to detect when CUB's symbols are in a wrapped namespace and properly import them. To enable this feature, use #define THRUST_CUB_WRAPPED_NAMESPACE foo to wrap both Thrust and CUB in the foo:: namespace. See thrust/detail/config/namespace.h for details and more namespace options.

    Several bugfixes are also included: The tuple_size and tuple_element helpers now support cv-qualified types. scan_by_key uses less memory. thrust::iterator_traits is better integrated with std::iterator_traits. See below for more details and references.

    New Features

    • NVIDIA/thrust#1464: Add preprocessor hooks that allow thrust:: to be wrapped in an external namespace, and support cases when CUB is wrapped in an external namespace.

    Bug Fixes

    • NVIDIA/thrust#1457: Support cv-qualified types in thrust::tuple_size and thrust::tuple_element. Thanks to Jake Hemstad for this contribution.
    • NVIDIA/thrust#1471: Fixed excessive memory allocation in scan_by_key. Thanks to Lilo Huang for this contribution.
    • NVIDIA/thrust#1476: Removed dead code from the expand example. Thanks to Lilo Huang for this contribution.
    • NVIDIA/thrust#1488: Fixed the path to the installed CUB headers in the CMake find_package configuration files.
    • NVIDIA/thrust#1491: Fallback to std::iterator_traits when no thrust::iterator_traits specialization exists for an iterator type. Thanks to Divye Gala for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.13.1(Oct 25, 2021)

    Thrust 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.

    This release provides a new hook for embedding the thrust:: namespace inside a custom namespace. This is intended to work around various issues related to linking multiple shared libraries that use Thrust. The existing CUB_NS_PREFIX and CUB_NS_POSTFIX macros already provided this capability for CUB; this update provides a simpler mechanism that is extended to and integrated with Thrust. Simply define THRUST_CUB_WRAPPED_NAMESPACE to a namespace name, and both thrust:: and cub:: will be placed inside the new namespace. Using different wrapped namespaces for each shared library will prevent issues like those reported in NVIDIA/thrust#1401.

    New Features

    • NVIDIA/thrust#1464: Add THRUST_CUB_WRAPPED_NAMESPACE hooks.

    Bug Fixes

    • NVIDIA/thrust#1488: Fix path to installed CUB in Thrust's CMake config files.
    Source code(tar.gz)
    Source code(zip)
  • 1.13.0(Jun 15, 2021)

    Thrust 1.13.0 is the major release accompanying the NVIDIA HPC SDK 21.7 release.

    Notable changes include bfloat16 radix sort support (via thrust::sort) and memory handling fixes in the reserve method of Thrust's vectors. The CONTRIBUTING.md file has been expanded to include instructions for building CUB as a component of Thrust, and API documentation now refers to cppreference instead of SGI's STL reference.

    Breaking Changes

    • NVIDIA/thrust#1459: Remove deprecated aliases thrust::host_space_tag and thrust::device_space_tag. Use the equivalent thrust::host_system_tag and thrust::device_system_tag instead.

    New Features

    • NVIDIA/cub#306: Add radix-sort support for bfloat16 in thrust::sort. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
    • NVIDIA/thrust#1423: thrust::transform_iterator now supports non-copyable types. Thanks to Jake Hemstad (@jrhemstad) for this contribution.
    • NVIDIA/thrust#1459: Introduce a new THRUST_IGNORE_DEPRECATED_API macro that disables deprecation warnings on Thrust and CUB APIs.

    Bug Fixes

    • NVIDIA/cub#277: Fixed sanitizer warnings when thrust::sort calls into cub::DeviceRadixSort. Thanks to Andy Adinets (@canonizer) for this contribution.
    • NVIDIA/thrust#1442: Reduce extraneous comparisons in thrust::sort's merge sort implementation.
    • NVIDIA/thrust#1447: Fix memory leak and avoid overallocation when calling reserve on Thrust's vector containers. Thanks to Kai Germaschewski (@germasch) for this contribution.

    Other Enhancements

    • NVIDIA/thrust#1405: Update links to standard C++ documentations from sgi to cppreference. Thanks to Muhammad Adeel Hussain (@AdeilH) for this contribution.
    • NVIDIA/thrust#1432: Updated build instructions in CONTRIBUTING.md to include details on building CUB's test suite as part of Thrust.
    Source code(tar.gz)
    Source code(zip)
  • 1.12.1(Jun 15, 2021)

  • 1.12.0(Feb 23, 2021)

    Summary

    Thrust 1.12.0 is the major release accompanying the NVIDIA HPC SDK 21.3 and the CUDA Toolkit 11.4.

    It includes a new thrust::universal_vector, which holds data that is accessible from both host and device. This allows users to easily leverage CUDA's unified memory with Thrust.

    New asynchronous thrust::async::exclusive_scan and inclusive_scan algorithms have been added, and the synchronous versions of these have been updated to use cub::DeviceScan directly.

    Many compilation warnings and subtle overflow bugs were fixed in the device algorithms, including a long-standing bug that returned invalid temporary storage requirements when num_items was close to (but not exceeding) INT32_MAX.

    This release deprecates support for Clang < 7.0 and MSVC < 2019 (aka 19.20/16.0/14.20).

    Breaking Changes

    • NVIDIA/thrust#1372: Deprecate Clang < 7 and MSVC < 2019.
    • NVIDIA/thrust#1376: Standardize thrust::scan_by_key functors / accumulator types. This may change the results from scan_by_key when input, output, and initial value types are not the same type.

    New Features

    • NVIDIA/thrust#1251: Add two new thrust::async:: algorithms: inclusive_scan and exclusive_scan.
    • NVIDIA/thrust#1334: Add thrust::universal_vector, universal_ptr, and universal_allocator.

    Bug Fixes

    • NVIDIA/thrust#1347: Qualify calls to make_reverse_iterator.
    • NVIDIA/thrust#1359: Enable stricter warning flags. This fixes several outstanding issues:
      • NVIDIA/cub#221: Overflow in temp_storage_bytes when num_items close to (but not over) INT32_MAX.
      • NVIDIA/cub#228: CUB uses non-standard C++ extensions that break strict compilers.
      • NVIDIA/cub#257: Warning when compiling GridEvenShare with unsigned offsets.
      • NVIDIA/thrust#974: Conversion warnings in thrust::transform_reduce.
      • NVIDIA/thrust#1091: Conversion warnings in thrust::counting_iterator.
    • NVIDIA/thrust#1373: Fix compilation error when a standard library type is wrapped in thrust::optional. Thanks to Vukasin Milovanovic for this contribution.
    • NVIDIA/thrust#1388: Fix signbit(double) implementation on MSVC.
    • NVIDIA/thrust#1389: Support building Thrust tests without CUDA enabled.

    Other Enhancements

    • NVIDIA/thrust#1304: Use cub::DeviceScan to implement thrust::exclusive_scan and thrust::inclusive_scan.
    • NVIDIA/thrust#1362, NVIDIA/thrust#1370: Update smoke test naming.
    • NVIDIA/thrust#1380: Fix typos in set_operation documentation. Thanks to Hongyu Cai for this contribution.
    • NVIDIA/thrust#1383: Include FreeBSD license in LICENSE.md for thrust::complex implementation.
    • NVIDIA/thrust#1384: Add missing precondition to thrust::gather documentation.
    Source code(tar.gz)
    Source code(zip)
  • 1.11.0(Nov 23, 2020)

    Thrust 1.11.0 is a major release providing bugfixes and performance enhancements. It includes a new sort algorithm that provides up to 2x more performance from thrust::sort when used with certain key types and hardware. The new thrust::shuffle algorithm has been tweaked to improve the randomness of the output. Our CMake package and build system continue to see improvements with better add_subdirectory support, installation rules, status messages, and other features that make Thrust easier to use from CMake projects. The release includes several other bugfixes and modernizations, and received updates from 12 contributors.

    New Features

    • NVIDIA/cub#204: New implementation for thrust::sort on CUDA when using 32/64-bit numeric keys on Pascal and up (SM60+). This improved radix sort algorithm provides up to 2x more performance. Thanks to Andy Adinets for this contribution.
    • NVIDIA/thrust#1310, NVIDIA/thrust#1312: Various tuple-related APIs have been updated to use variadic templates. Thanks to Andrew Corrigan for these contributions.
    • NVIDIA/thrust#1297: Optionally add install rules when included with CMake's add_subdirectory. Thanks to Kai Germaschewski for this contribution.

    Bug Fixes

    • NVIDIA/thrust#1309: Fix thrust::shuffle to produce better quality random distributions. Thanks to Rory Mitchell and Daniel Stokes for this contribution.
    • NVIDIA/thrust#1337: Fix compile-time regression in transform_inclusive_scan and transform_exclusive_scan.
    • NVIDIA/thrust#1306: Fix binary search middle calculation to avoid overflows. Thanks to Richard Barnes for this contribution.
    • NVIDIA/thrust#1314: Use size_t for the index type parameter in thrust::tuple_element. Thanks to Andrew Corrigan for this contribution.
    • NVIDIA/thrust#1329: Fix runtime error when copying an empty thrust::device_vector in MSVC Debug builds. Thanks to Ben Jude for this contribution.
    • NVIDIA/thrust#1323: Fix and add test for cmake package install rules. Thanks to Keith Kraus and Kai Germaschewski for testing and discussion.
    • NVIDIA/thrust#1338: Fix GCC version checks in thrust::detail::is_pod implementation. Thanks to Anatoliy Tomilov for this contribution.
    • NVIDIA/thrust#1289: Partial fixes for Clang 10 as host/c++ compiler. Exposed an nvcc bug that will be fixed in a future version of the CUDA Toolkit (NVBug 3136307).
    • NVIDIA/thrust#1272: Fix ambiguous iter_swap call when using thrust::partition with STL containers. Thanks to Isaac Deutsch for this contribution.
    • NVIDIA/thrust#1281: Update our bundled FindTBB.cmake module to support latest MSVC.
    • NVIDIA/thrust#1298: Use semantic versioning rules for our CMake package's compatibility checks. Thanks to Kai Germaschewski for this contribution.
    • NVIDIA/thrust#1300: Use FindPackageHandleStandardArgs to print standard status messages when our CMake package is found. Thanks to Kai Germaschewski for this contribution.
    • NVIDIA/thrust#1320: Use feature-testing instead of a language dialect check for thrust::remove_cvref. Thanks to Andrew Corrigan for this contribution.
    • NVIDIA/thrust#1319: Suppress GPU deprecation warnings.

    Other Enhancements

    • NVIDIA/cub#213: Removed some tuning policies for unsupported hardware (<SM35).
    • References to the old Github repository and branch names were updated.
      • Github's thrust/cub repository is now NVIDIA/cub
      • Development has moved from the master branch to the main branch.
    Source code(tar.gz)
    Source code(zip)
  • 1.10.0(Sep 16, 2020)

    Thrust 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release and the CUDA Toolkit 11.2 release. It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017. It also overhauls CMake support. Finally, we now have a Code of Conduct for contributors: https://github.com/thrust/thrust/blob/main/CODE_OF_CONDUCT.md

    Breaking Changes

    • C++03 is no longer supported.
    • GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
    • C++11 is deprecated. Using this dialect will generate a compile-time warning. These warnings can be suppressed by defining THRUST_IGNORE_DEPRECATED_CPP_DIALECT or THRUST_IGNORE_DEPRECATED_CPP_11. Suppression is only a short term solution. We will be dropping support for C++11 in the near future.
    • Asynchronous algorithms now require C++14.
    • CMake < 3.15 is no longer supported.
    • The default branch on GitHub is now called main.
    • Allocator and vector classes have been replaced with alias templates.

    New Features

    • thrust/thrust#1159: CMake multi-config support, which allows multiple combinations of host and device systems to be built and tested at once. More details can be found here: https://github.com/thrust/thrust/blob/main/CONTRIBUTING.md#multi-config-cmake-options
    • CMake refactoring:
      • Added install targets to CMake builds.
      • Added support for CUB tests and examples.
      • Thrust can be added to another CMake project by calling add_subdirectory with the Thrust source root (see thrust/thrust#976). An example can be found here: https://github.com/thrust/thrust/blob/main/examples/cmake/add_subdir/CMakeLists.txt
      • CMake < 3.15 is no longer supported.
      • Dialects are now configured through target properties. A new THRUST_CPP_DIALECT option has been added for single config mode. Logic that modified CMAKE_CXX_STANDARD and CMAKE_CUDA_STANDARD has been eliminated.
      • Testing related CMake code has been moved to testing/CMakeLists.txt
      • Example related CMake code has been moved to examples/CMakeLists.txt
      • Header testing related CMake code has been moved to cmake/ThrustHeaderTesting.cmake
      • CUDA configuration CMake code has been moved to cmake/ThrustCUDAConfig.cmake.
      • Now we explicitly include(cmake/*.cmake) files rather than searching CMAKE_MODULE_PATH - we only want to use the ones in the repo.
    • thrust::transform_input_output_iterator, a variant of transform iterator adapter that works as both an input iterator and an output iterator. The given input function is applied after reading from the wrapped iterator while the output function is applied before writing to the wrapped iterator. Thanks to Trevor Smith for this contribution.

    Other Enhancements

    • Contributor documentation: https://github.com/thrust/thrust/blob/main/CONTRIBUTING.md
    • Code of Conduct: https://github.com/thrust/thrust/blob/main/CODE_OF_CONDUCT.md. Thanks to Conor Hoekstra for this contribution.
    • Support for all combinations of host and device systems.
    • C++17 support.
    • thrust/thrust#1221: Allocator and vector classes have been replaced with alias templates. Thanks to Michael Francis for this contribution.
    • thrust/thrust#1186: Use placeholder expressions to simplify the definitions of a number of algorithms. Thanks to Michael Francis for this contribution.
    • thrust/thrust#1170: More conforming semantics for scan algorithms:
      • Follow P0571's guidance regarding intermediate types.
        • https://wg21.link/P0571
        • The accumulator's type is now:
          • The type of the user-supplied initial value (if provided), or
          • The input iterator's value type if no initial value.
      • Follow C++ standard guidance for default binary operator type.
        • https://eel.is/c++draft/exclusive.scan#1
        • Thrust binary/unary functors now specialize a default void template parameter. Types are deduced and forwarded transparently.
        • Updated the scan's default binary operator to the new thrust::plus<> specialization.
      • The thrust::intermediate_type_from_function_and_iterators helper is no longer needed and has been removed.
    • thrust/thrust#1255: Always use cudaStreamSynchronize instead of cudaDeviceSynchronize if the execution policy has a stream attached to it. Thanks to Rong Ou for this contribution.
    • thrust/thrust#1201: Tests for correct handling of legacy and per-thread default streams. Thanks to Rong Ou for this contribution.

    Bug Fixes

    • thrust/thrust#1260: Fix thrust::transform_inclusive_scan with heterogeneous types. Thanks to Rong Ou for this contribution.
    • thrust/thrust#1258, NVC++ FS #28463: Ensure the CUDA radix sort backend synchronizes before returning; otherwise, copies from temporary storage will race with destruction of said temporary storage.
    • thrust/thrust#1264: Evaluate CUDA_CUB_RET_IF_FAIL macro argument only once. Thanks to Jason Lowe for this contribution.
    • thrust/thrust#1262: Add missing <stdexcept> header.
    • thrust/thrust#1250: Restore some THRUST_DECLTYPE_RETURNS macros in async test implementations.
    • thrust/thrust#1249: Use std::iota in CUDATestDriver::target_devices. Thanks to Michael Francis for this contribution.
    • thrust/thrust#1244: Check for macro collisions with system headers during header testing.
    • thrust/thrust#1224: Remove unnecessary SFINAE contexts from asynchronous algorithms.
    • thrust/thrust#1190: Make out_of_memory_recovery test trigger faster.
    • thrust/thrust#1187: Eliminate superfluous iterators specific to the CUDA backend.
    • thrust/thrust#1181: Various fixes for GoUDA. Thanks to Andrei Tchouprakov for this contribution.
    • thrust/thrust#1178, thrust/thrust#1229: Use transparent functionals in placeholder expressions, fixing issues with thrust::device_reference and placeholder expressions and thrust::find with asymmetric equality operators.
    • thrust/thrust#1153: Switch to placement new instead of assignment to construct items in uninitialized memory. Thanks to Hugh Winkler for this contribution.
    • thrust/thrust#1050: Fix compilation of asynchronous algorithms when RDC is enabled.
    • thrust/thrust#1042: Correct return type of thrust::detail::predicate_to_integral from bool to IntegralType. Thanks to Andreas Hehn for this contribution.
    • thrust/thrust#1009: Avoid returning uninitialized allocators. Thanks to Zhihao Yuan for this contribution.
    • thrust/thrust#990: Add missing <thrust/system/cuda/memory.h> include to <thrust/system/cuda/detail/malloc_and_free.h>. Thanks to Robert Maynard for this contribution.
    • thrust/thrust#966: Fix spurious MSVC conversion with loss of data warning in sort algorithms. Thanks to Zhihao Yuan for this contribution.
    • Add more metadata to mock specializations for testing iterator in testing/copy.cu.
    • Add missing include to shuffle unit test.
    • Specialize thrust::wrapped_function for void return types because MSVC is not a fan of the pattern return static_cast<void>(expr);.
    • Replace deprecated tbb/tbb_thread.h with <thread>.
    • Fix overcounting of initial value in TBB scans.
    • Use thrust::advance instead of += for generic iterators.
    • Wrap the OMP flags in -Xcompiler for NVCC.
    • Extend ASSERT_STATIC_ASSERT skip for the OMP backend.
    • Add missing header caught by tbb.cuda configs.
    • Fix "unsafe API" warnings in examples on MSVC: s/fopen/fstream/
    • Various C++17 fixes.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.10-1(Jul 27, 2020)

    Thrust 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release and the CUDA Toolkit 11.1 release.

    Bug Fixes

    • #1214, NVBug 200619442: Stop using std::allocator APIs deprecated in C++17.
    • #1216, NVBug 200540293: Make thrust::optional work with Clang when used with older libstdc++.
    • #1207, NVBug 200618218: Don't force C++14 with older compilers that don't support it.
    • #1218: Wrap includes of <memory> and <algorithm> to avoid circular inclusion with NVC++.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.10(May 16, 2020)

    Thrust 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release. It adds CMake support for compilation with NVC++ and includes a number of minor bug fixes for NVC++. It also adds CMake find_package support, which replaces the broken 3rd-party legacy FindThrust.cmake script. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely. All other deprecated platforms will be dropped in the near future.

    Breaking Changes

    • #1130: CMake find_package support. This is significant because a legacy FindThrust.cmake script, authored by a third party, is in widespread use in the community and has a bug in how it parses Thrust version numbers that causes it to incorrectly parse 1.9.10. That script only handles the first digit of each part of the Thrust version number correctly: for example, Thrust 17.17.17 would be interpreted as Thrust 1.1.1701717. You can find directions for using the new CMake find_package support and migrating away from the legacy FindThrust.cmake here
    • #1082: Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
    • #1089: C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining THRUST_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or THRUST_IGNORE_DEPRECATED_CPP_11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
    • #1089: GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining THRUST_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

    New Features

    • #1129: Added thrust::detail::single_device_tls_caching_allocator, a convenient way to get an MR caching allocator for device memory, which is used by NVC++.

    Other Enhancements

    • #1129: Refactored RDC handling in CMake to be a global option and not create two targets for each example and test.

    Bug Fixes

    • #1129: Fix the legacy thrust::return_temporary_buffer API to support passing a size. This was necessary to enable usage of Thrust caching MR allocators with synchronous Thrust algorithms. This change has allowed NVC++’s C++17 Parallel Algorithms implementation to switch to use Thrust caching MR allocators for device temporary storage, which gives a 2x speedup on large multi-GPU systems such as V100 and A100 DGX where cudaMalloc is very slow.
    • #1128: Respect CUDA_API_PER_THREAD_DEFAULT_STREAM. Thanks to Rong Ou for this contribution.
    • #1131: Fix the one-policy overload of thrust::async::copy to not copy the policy, resolving use-after-move issues.
    • #1145: When cleaning up type names in unittest::base_class_name, only call std::string::replace if we found the substring we are looking to replace.
    • #1139: Don't use cxx::__demangle in NVC++.
    • #1102: Don't use thrust::detail::normal_distribution_nvcc for Feta because it uses erfcinv, a non-standard function that Feta doesn't have.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.9(May 16, 2020)

    Thrust 1.9.9 adds support for NVC++, which uses Thrust to implement GPU-accelerated C++17 Parallel Algorithms. thrust::zip_function and thrust::shuffle were also added. As of this release, C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely. All other deprecated platforms will be dropped in the near future.

    Breaking Changes

    • #1082: Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
    • #1089: C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining THRUST_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or THRUST_IGNORE_DEPRECATED_CPP_11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
    • #1089: GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining THRUST_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

    New Features

    • #1086: Support for NVC++ aka "Feta". The most significant change is in how we use __CUDA_ARCH__. Now, there are four macros that must be used:
      • THRUST_IS_DEVICE_CODE, which should be used in an if statement around device-only code.
      • THRUST_INCLUDE_DEVICE_CODE, which should be used in an #if preprocessor directive inside of the if statement mentioned in the prior bullet.
      • THRUST_IS_HOST_CODE, which should be used in an if statement around host-only code.
      • THRUST_INCLUDE_HOST_CODE, which should be used in an #if preprocessor directive inside of the if statement mentioned in the prior bullet.
    • #1085: thrust::shuffle. Thanks to Rory Mitchell for this contribution.
    • #1029: thrust::zip_function, a facility for zipping functions that take N parameters instead of a tuple of N parameters as thrust::zip_iterator does. Thanks to Ben Jude for this contribution.
    • #1068: thrust::system::cuda::managed_memory_pointer, a strongly typed universal-memory pointer compatible with the ISO C++ Standard Library.
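    The four-macro pattern above can be illustrated with a host-only sketch. The macro names are the ones from #1086, but the 0/1 values here are illustrative stand-ins for what the compiler pass would actually define:

```cpp
#include <cstring>

// Hypothetical stand-in values for a pure host compilation pass; in real
// Thrust these macros are defined by the library based on the compiler.
#define THRUST_IS_HOST_CODE 1
#define THRUST_INCLUDE_HOST_CODE 1
#define THRUST_IS_DEVICE_CODE 0
#define THRUST_INCLUDE_DEVICE_CODE 0

const char* where_am_i() {
  // Runtime `if` guards execution; the nested `#if` guards compilation,
  // so device-only code is never even parsed in a host-only pass.
  if (THRUST_IS_DEVICE_CODE) {
    #if THRUST_INCLUDE_DEVICE_CODE
    return "device";
    #endif
  }
  if (THRUST_IS_HOST_CODE) {
    #if THRUST_INCLUDE_HOST_CODE
    return "host";
    #endif
  }
  return "unknown";
}
```

The pairing matters because NVC++ compiles host and device code in a single pass: the runtime `if` alone would still require both branches to compile everywhere.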

    Other Enhancements

    • #1029: Thrust is now built and tested with NVCC warnings treated as errors.
    • #1029: MSVC C++11 support.
    • #1029: THRUST_DEPRECATED abstraction for generating compile-time deprecation warning messages.
    • #1029: thrust::pointer<T>::pointer_to(reference).
    • #1070: Unit test for thrust::inclusive_scan with a user defined types. Thanks to Conor Hoekstra for this contribution.

    Bug Fixes

    • #1088: Allow thrust::replace to take functions that have non-const operator().
    • #1094: Add missing constexpr to par_t constructors. Thanks to Patrick Stotko for this contribution.
    • #1077: Remove __device__ from CUDA MR-based device allocators to fix obscure "host function called from host device function" warning that occurs when you use the new Thrust MR-based allocators.
    • #1029: Remove inconsistently-used THRUST_BEGIN/END_NS macros.
    • #1029: Fix C++ dialect detection on newer MSVC.
    • #1029: Use _Pragma/__pragma instead of #pragma in macros.
    • #1029: Replace raw __cplusplus checks with the appropriate Thrust macros.
    • #1105: Add a missing <math.h> include.
    • #1103: Fix regression of thrust::detail::temporary_allocator with non-CUDA back ends.
    • #1111: Use Thrust's random number engine instead of the std:: one in device code.
    • #1108: Get rid of a GCC 9 warning about deprecated generation of copy ctors.
  • 1.9.8-1(May 19, 2020)

    Thrust 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms.

  • 1.9.8(May 16, 2020)

    Thrust 1.9.8, which is included in the CUDA Toolkit 11.0 release, removes Thrust's internal derivative of CUB, upstreams all relevant changes to CUB, and adds CUB as a Git submodule. It will now be necessary to do git clone --recursive when checking out Thrust, and to update the CUB submodule when pulling in new Thrust changes. Additionally, CUB is now included as a first-class citizen in the CUDA toolkit. Thrust 1.9.8 also fixes bugs preventing most Thrust algorithms from working with more than 2^31-1 elements. Now, thrust::reduce, thrust::*_scan, and related algorithms (aka most of Thrust) work with large element counts.

    Breaking Changes

    • Thrust will now use the version of CUB in your include path instead of its own internal copy. If you are using your own version of CUB, it may be older and incompatible with Thrust. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.

    Other Enhancements

    • Refactor Thrust and CUB to support 64-bit indices in most algorithms. In most cases, Thrust now selects between kernels that use 32-bit indices and 64-bit indices at runtime depending on the size of the input. This means large element counts work, but small element counts do not have to pay for the register usage of 64-bit indices if they are not needed. Now, thrust::reduce, thrust::*_scan, and related algorithms (aka most of Thrust) work with more than 2^31-1 elements. Notably, thrust::sort is still limited to less than 2^31-1 elements.
    • CUB is now a submodule and the internal copy of CUB has been removed.
    • #1051: Stop specifying the __launch_bounds__ minimum blocks parameter because it messes up register allocation and increases register pressure, and we don't actually know at compile time how many blocks we will use (aside from single tile kernels).
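    The runtime index-width dispatch described above can be sketched on the host. This is an illustrative analogue, not Thrust's actual kernel-selection code: a small input takes the cheaper 32-bit-index path, a huge one would take the 64-bit path.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Run the "kernel" with a caller-chosen index type. In real Thrust the two
// instantiations are separate device kernels with different register costs.
template <typename Index>
long long sum_with_index_type(const std::vector<int>& v) {
  long long total = 0;
  for (Index i = 0; i < static_cast<Index>(v.size()); ++i)
    total += v[static_cast<std::size_t>(i)];
  return total;
}

long long dispatched_sum(const std::vector<int>& v) {
  // 2^31 - 1 is the largest count a signed 32-bit index can address.
  if (v.size() <= static_cast<std::size_t>(INT32_MAX))
    return sum_with_index_type<std::int32_t>(v);  // cheap-index path
  return sum_with_index_type<std::int64_t>(v);    // large-count path
}
```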

    Bug Fixes

    • #1020: After making a CUDA API call, always clear the global CUDA error state by calling cudaGetLastError.
    • #1021: Avoid calling destroy in the destructor of a Thrust vector if the vector is empty.
    • #1046: Actually throw thrust::bad_alloc when thrust::system::cuda::malloc fails instead of just constructing a temporary and doing nothing with it.
    • Add missing copy constructor or copy assignment operator to all classes that GCC 9's -Wdeprecated-copy complains about.
    • Add missing move operations to thrust::system::cuda::vector.
    • #1015: Check that the backend is CUDA before using CUDA-specifics in thrust::detail::temporary_allocator. Thanks to Hugh Winkler for this contribution.
    • #1055: More correctly detect the presence of aligned/sized new/delete.
    • #1043: Fix ill-formed specialization of thrust::system::is_error_code_enum for thrust::event_errc. Thanks to Toru Niina for this contribution.
    • #1027: Add tests for thrust::tuple_for_each and thrust::tuple_subset. Thanks to Ben Jude for this contribution.
    • #1027: Use correct macro in thrust::tuple_for_each. Thanks to Ben Jude for this contribution.
    • #1026: Use correct MSVC version formatting in CMake. Thanks to Ben Jude for this contribution.
    • Workaround an NVCC issue with type aliases with template template arguments containing a parameter pack.
    • Remove unused functions from the CUDA backend which call slow CUDA attribute query APIs.
    • Replace CUB_RUNTIME_FUNCTION with THRUST_RUNTIME_FUNCTION.
    • Correct typo in thrust::transform documentation. Thanks to Eden Yefet for this contribution.

    Known Issues

    • thrust::sort remains limited to 2^31-1 elements for now.
  • 1.9.7-1(May 18, 2020)

    Thrust 1.9.7-1 is a minor release accompanying the CUDA Toolkit 10.2 release for Tegra. It is nearly identical to 1.9.7.

    Bug Fixes

    • Remove support for GCC's broken nodiscard-like attribute.
  • 1.9.7(May 16, 2020)

    Thrust 1.9.7 is a minor release accompanying the CUDA Toolkit 10.2 release. Unfortunately, although the version and patch numbers are identical, one bug fix present in Thrust 1.9.7 (NVBug 2646034: Fix incorrect dependency handling for stream acquisition in thrust::future) was not included in the CUDA Toolkit 10.2 preview release for AArch64 SBSA. The tag cuda-10.2aarch64sbsa contains the exact version of Thrust present in the CUDA Toolkit 10.2 preview release for AArch64 SBSA.

    Bug Fixes

    • #967, NVBug 2448170: Fix the CUDA backend thrust::for_each so that it supports large input sizes with 64-bit indices.
    • NVBug 2646034: Fix incorrect dependency handling for stream acquisition in thrust::future.
      • Not present in the CUDA Toolkit 10.2 preview release for AArch64 SBSA.
    • #968, NVBug 2612102: Fix the thrust::mr::polymorphic_adaptor to actually use its template parameter.
  • 1.9.6-1(May 18, 2020)

    Thrust 1.9.6-1 is a variant of 1.9.6 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms when using the CUDA Toolkit 10.1 Update 2 release.

  • 1.9.6(May 16, 2020)

    Thrust 1.9.6 is a minor release accompanying the CUDA Toolkit 10.1 Update 2 release.

    Bug Fixes

    • NVBug 2509847: Inconsistent alignment of thrust::complex
    • NVBug 2586774: Compilation failure with Clang + older libstdc++ that doesn't have std::is_trivially_copyable
    • NVBug 200488234: CUDA header files contain Unicode characters, which leads to compilation errors on Windows
    • #949, #973, NVBug 2422333, NVBug 2522259, NVBug 2528822: thrust::detail::aligned_reinterpret_cast must be annotated with __host__ __device__.
    • NVBug 2599629: Missing include in the OpenMP sort implementation
    • NVBug 200513211: Truncation warning in test code under VC142
  • 1.9.5(May 14, 2019)

    Thrust v1.9.5 is a minor bugfix release accompanying the CUDA 10.1 Update 1 CUDA Toolkit release.

    Bug Fixes

    • 2502854 Assignment of complex vector between host and device fails to compile in CUDA >=9.1 with GCC 6.
  • 1.9.4(Mar 1, 2019)

    Thrust 1.9.4 adds asynchronous interfaces for parallel algorithms, a new allocator system including caching allocators and unified memory support, as well as a variety of other enhancements, mostly related to C++11/C++14/C++17/C++20 support. The new asynchronous algorithms in the thrust::async namespace return thrust::event or thrust::future objects, which can be waited upon to synchronize with the completion of the parallel operation.

    Breaking API Changes

    Synchronous Thrust algorithms now block until all of their operations have completed. Use the new asynchronous Thrust algorithms for non-blocking behavior.
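    By way of analogy only (using the C++ standard library, not Thrust's API), the blocking/non-blocking distinction resembles a plain call versus std::async returning a future to wait on:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Host-only analogy: a synchronous call blocks until the result is ready,
// while the asynchronous form returns a future to be waited on later.
// (Thrust's thrust::async algorithms return thrust::future, not std::future.)
long long sync_sum(const std::vector<int>& v) {
  return std::accumulate(v.begin(), v.end(), 0LL);  // blocks until done
}

std::future<long long> async_sum(const std::vector<int>& v) {
  return std::async(std::launch::async,
                    [&v] { return std::accumulate(v.begin(), v.end(), 0LL); });
}
```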

    New Features

    • thrust::event and thrust::future<T>, uniquely-owned asynchronous handles consisting of a state (ready or not ready), content (some value; for thrust::future only), and an optional set of objects that should be destroyed only when the future's value is ready and has been consumed.

      • The design is loosely based on C++11's std::future.
      • They can be .wait'd on, and the value of a future can be waited on and retrieved with .get or .extract.
      • Multiple thrust::events and thrust::futures can be combined with thrust::when_all.
      • thrust::futures can be converted to thrust::events.
      • Currently, these primitives are only implemented for the CUDA backend and are C++11 only.
    • New asynchronous algorithms that return thrust::event/thrust::futures, implemented as C++20 range style customization points:

      • thrust::async::reduce.
      • thrust::async::reduce_into, which takes a target location to store the reduction result into.
      • thrust::async::copy, including a two-policy overload that allows explicit cross system copies which execution policy properties can be attached to.
      • thrust::async::transform.
      • thrust::async::for_each.
      • thrust::async::stable_sort.
      • thrust::async::sort.
      • By default the asynchronous algorithms use the new caching allocators. Deallocation of temporary storage is deferred until the destruction of the returned thrust::future. The content of thrust::futures is stored in either device or universal memory and transferred to the host only upon request to prevent unnecessary data migration.
      • Asynchronous algorithms are currently only implemented for the CUDA system and are C++11 only.
    • exec.after(f, g, ...), a new execution policy method that takes a set of thrust::event/thrust::futures and returns an execution policy that operations on that execution policy should depend upon.

    • New logic and mindset for the type requirements for cross-system sequence copies (currently only used by thrust::async::copy), based on:

      • thrust::is_contiguous_iterator and THRUST_PROCLAIM_CONTIGUOUS_ITERATOR for detecting/indicating that an iterator points to contiguous storage.
      • thrust::is_trivially_relocatable and THRUST_PROCLAIM_TRIVIALLY_RELOCATABLE for detecting/indicating that a type is memcpyable (based on principles from https://wg21.link/P1144).
      • The new approach reduces buffering, increases performance, and increases correctness.
      • The fast path is now enabled when copying fp16 and CUDA vector types with thrust::async::copy.
    • All Thrust synchronous algorithms for the CUDA backend now actually synchronize. Previously, any algorithm that did not allocate temporary storage (counterexample: thrust::sort) and did not have a computation-dependent result (counterexample: thrust::reduce) would actually be launched asynchronously. Additionally, synchronous algorithms that allocated temporary storage would become asynchronous if a custom allocator was supplied that did not synchronize on allocation/deallocation, unlike cudaMalloc/cudaFree. So, now thrust::for_each, thrust::transform, thrust::sort, etc are truly synchronous. In some cases this may be a performance regression; if you need asynchrony, use the new asynchronous algorithms.

    • Thrust's allocator framework has been rewritten. It now uses a memory resource system, similar to C++17's std::pmr but supporting static polymorphism. Memory resources are objects that allocate untyped storage and allocators are cheap handles to memory resources in this new model. The new facilities live in <thrust/mr/*>.

      • thrust::mr::memory_resource<Pointer>, the memory resource base class, which takes a (possibly tagged) pointer to void type as a parameter.
      • thrust::mr::allocator<T, MemoryResource>, an allocator backed by a memory resource object.
      • thrust::mr::polymorphic_adaptor_resource<Pointer>, a type-erased memory resource adaptor.
      • thrust::mr::polymorphic_allocator<T>, a C++17-style polymorphic allocator backed by a type-erased memory resource object.
      • New tunable C++17-style caching memory resources, thrust::mr::(disjoint_)?(un)?synchronized_pool_resource, designed to cache both small object allocations and large repetitive temporary allocations. The disjoint variants use separate storage for management of the pool, which is necessary if the memory being allocated cannot be accessed on the host (e.g. device memory).
      • System-specific allocators were rewritten to use the new memory resource framework.
      • New thrust::device_memory_resource for allocating device memory.
      • New thrust::universal_memory_resource for allocating memory that can be accessed from both the host and device (e.g. cudaMallocManaged).
      • New thrust::universal_host_pinned_memory_resource for allocating memory that can be accessed from the host and the device but always resides in host memory (e.g. cudaMallocHost).
      • thrust::get_per_device_resource and thrust::per_device_allocator, which lazily create and retrieve a per-device singleton memory resource.
      • Rebinding mechanisms (rebind_traits and rebind_alloc) for thrust::allocator_traits.
      • thrust::device_make_unique, a factory function for creating a std::unique_ptr to a newly allocated object in device memory.
      • <thrust/detail/memory_algorithms>, a C++11 implementation of the C++17 uninitialized memory algorithms.
      • thrust::allocate_unique and friends, based on the proposed C++23 std::allocate_unique (https://wg21.link/P0211).
    • New type traits and metaprogramming facilities. Type traits are slowly being migrated out of thrust::detail:: and <thrust/detail/*>; their new home will be thrust:: and <thrust/type_traits/*>.

      • thrust::is_execution_policy.
      • thrust::is_operator_less_or_greater_function_object, which detects thrust::less, thrust::greater, std::less, and std::greater.
      • thrust::is_operator_plus_function_object, which detects thrust::plus and std::plus.
      • thrust::remove_cvref(_t)?, a C++11 implementation of C++20's std::remove_cvref(_t)?.
      • thrust::void_t, and various other new type traits.
      • thrust::integer_sequence and friends, a C++11 implementation of C++14's std::integer_sequence.
      • thrust::conjunction, thrust::disjunction, and thrust::negation, a C++11 implementation of C++17's logical metafunctions.
      • Some Thrust type traits (such as thrust::is_constructible) have been redefined in terms of C++11's type traits when they are available.
    • <thrust/detail/tuple_algorithms.h>, new std::tuple algorithms:

      • thrust::tuple_transform.
      • thrust::tuple_for_each.
      • thrust::tuple_subset.
    • Miscellaneous new std::-like facilities:

      • thrust::optional, a C++11 implementation of C++17's std::optional.
      • thrust::addressof, an implementation of C++11's std::addressof.
      • thrust::next and thrust::prev, an implementation of C++11's std::next and std::prev.
      • thrust::square, a <functional> style unary function object that multiplies its argument by itself.
      • <thrust/limits.h> and thrust::numeric_limits, a customized version of <limits> and std::numeric_limits.
    • <thrust/detail/preprocessor.h>, new general purpose preprocessor facilities:

      • THRUST_PP_CAT[2-5], concatenates two to five tokens.
      • THRUST_PP_EXPAND(_ARGS)?, performs double expansion.
      • THRUST_PP_ARITY and THRUST_PP_DISPATCH, tools for macro overloading.
      • THRUST_PP_BOOL, boolean conversion.
      • THRUST_PP_INC and THRUST_PP_DEC, increment/decrement.
      • THRUST_PP_HEAD, a variadic macro that expands to the first argument.
      • THRUST_PP_TAIL, a variadic macro that expands to all its arguments after the first.
      • THRUST_PP_IIF, bitwise conditional.
      • THRUST_PP_COMMA_IF, and THRUST_PP_HAS_COMMA, facilities for adding and detecting comma tokens.
      • THRUST_PP_IS_VARIADIC_NULLARY, returns true if called with a nullary __VA_ARGS__.
      • THRUST_CURRENT_FUNCTION, expands to the name of the current function.
    • New C++11 compatibility macros:

      • THRUST_NODISCARD, expands to [[nodiscard]] when available and the best equivalent otherwise.
      • THRUST_CONSTEXPR, expands to constexpr when available and the best equivalent otherwise.
      • THRUST_OVERRIDE, expands to override when available and the best equivalent otherwise.
      • THRUST_DEFAULT, expands to = default; when available and the best equivalent otherwise.
      • THRUST_NOEXCEPT, expands to noexcept when available and the best equivalent otherwise.
      • THRUST_FINAL, expands to final when available and the best equivalent otherwise.
      • THRUST_INLINE_CONSTANT, expands to inline constexpr when available and the best equivalent otherwise.
    • <thrust/detail/type_deduction.h>, new C++11-only type deduction helpers:

      • THRUST_DECLTYPE_RETURNS*, expand to function definitions with suitable conditional noexcept qualifiers and trailing return types.
      • THRUST_FWD(x), expands to ::std::forward<decltype(x)>(x).
      • THRUST_MVCAP, expands to a lambda move capture.
      • THRUST_RETOF, expands to a decltype computing the return type of an invocable.
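    As a concrete illustration of the tuple algorithms listed above, a tuple_for_each-style helper can be sketched in C++17 with std::apply (illustrative only; Thrust's version also works in device code and predates C++17):

```cpp
#include <tuple>
#include <utility>

// Apply f to every element of a std::tuple, in order, using a fold
// expression over the unpacked elements.
template <typename Tuple, typename F>
void tuple_for_each(Tuple&& t, F f) {
  std::apply(
      [&](auto&&... elems) { (f(std::forward<decltype(elems)>(elems)), ...); },
      std::forward<Tuple>(t));
}
```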

    New Examples

    • mr_basic demonstrates how to use the new memory resource allocator system.

    Other Enhancements

    • Tagged pointer enhancements:
      • New thrust::pointer_traits specialization for void const*.
      • nullptr support to Thrust tagged pointers.
      • New explicit operator bool for Thrust tagged pointers when using C++11 for std::unique_ptr interoperability.
      • Added thrust::reinterpret_pointer_cast and thrust::static_pointer_cast for casting Thrust tagged pointers.
    • Iterator enhancements:
      • thrust::iterator_system is now SFINAE friendly.
      • Removed cv qualifiers from iterator types when using thrust::iterator_system.
    • Static assert enhancements:
      • New THRUST_STATIC_ASSERT_MSG, takes an optional string constant to be used as the error message when possible.
      • Update THRUST_STATIC_ASSERT(_MSG) to use C++11's static_assert when it's available.
      • Introduce a way to test for static assertions.
    • Testing enhancements:
      • Additional scalar and sequence types, including non-builtin types and vectors with unified memory allocators, have been added to the list of types used by generic unit tests.
      • The generation of random input data has been improved to increase the range of values used and catch more corner cases.
      • New truncate_to_max_representable utility for avoiding the generation of ranges that cannot be represented by the underlying element type in generic unit test code.
      • The test driver now synchronizes with CUDA devices and checks for errors after each test, when switching devices, and after each raw kernel launch.
      • The warningtester uber header is now compiled with NVCC to avoid needing to disable CUDA-specific code with the preprocessor.
      • Fixed the unit test framework's ASSERT_* to print chars as ints.
      • New DECLARE_INTEGRAL_VARIABLE_UNITTEST test declaration macro.
      • New DECLARE_VARIABLE_UNITTEST_WITH_TYPES_AND_NAME test declaration macro.
      • thrust::system_error in the CUDA backend now prints out its cudaError_t enumerator in addition to the diagnostic message.
      • Stopped using conditionally signed types like char.

    Bug Fixes

    • #897, 2062242 Fix compilation error when using __device__ lambdas with reduce on MSVC.
    • #908, 2089386 Static assert that thrust::generate/thrust::fill don't operate on const iterators.
    • #919 Fix compilation failure with thrust::zip_iterator and thrust::complex<float>.
    • #924, 2096679, 2315990 Fix dispatch for the CUDA backend's thrust::reduce to use two functions (one with the pragma for disabling exec checks, one with THRUST_RUNTIME_FUNCTION) instead of one. This fixes a regression with device compilation that started in CUDA 9.2.
    • #928, 2341455 Add missing __host__ __device__ annotations to a thrust::complex::operator= to satisfy GoUDA.
    • 2094642 Make thrust::vector_base::clear not depend on the element type being default constructible.
    • 2289115 Remove flaky simple_cuda_streams example.
    • 2328572 Add missing thrust::device_vector constructor that takes an allocator parameter.
    • 2455740 Update the range_view example to not use device-side launch.
    • 2455943 Ensure that sized unit tests that use counting_iterator perform proper truncation.
    • 2455952 Refactor questionable copy_if unit tests.
  • 1.9.3(May 16, 2020)

    Thrust 1.9.3 unifies and integrates CUDA Thrust and GitHub Thrust.

    Bug Fixes

    • #725, #850, #855, #859, #860: Unify the thrust::iter_swap interface and fix thrust::device_reference swapping.
    • NVBug 2004663: Add a data method to thrust::detail::temporary_array and refactor temporary memory allocation in the CUDA backend to be exception and leak safe.
    • #886, #894, #914: Various documentation typo fixes.
    • #724: Provide NVVMIR_LIBRARY_DIR environment variable to NVCC.
    • #878: Optimize thrust::min/max_element to only use thrust::detail::get_iterator_value for non-numeric types.
    • #899: Make thrust::cuda::experimental::pinned_allocator's comparison operators const.
    • NVBug 2092152: Remove all includes of <cuda.h>.
    • #911: Fix default comparator element type for thrust::merge_by_key.

    Acknowledgments

    • Thanks to Andrew Corrigan for contributing fixes for swapping interfaces.
    • Thanks to Francisco Facioni for contributing optimizations for thrust::min/max_element.
  • 1.9.2(May 16, 2020)

    Thrust 1.9.2 brings a variety of performance enhancements, bug fixes, and test improvements. CUB 1.7.5 was integrated, enhancing the performance of thrust::sort on small data types and thrust::reduce. Changes were applied to complex to optimize memory access. Thrust now compiles with compiler warnings enabled and treated as errors. Additionally, the unit test suite and framework was enhanced to increase coverage.

    Breaking Changes

    • The fallback_allocator example was removed, as it was buggy and difficult to support.

    New Features

    • <thrust/detail/alignment.h>, utilities for memory alignment:
      • thrust::aligned_reinterpret_cast.
      • thrust::aligned_storage_size, which computes the amount of storage needed for an object of a particular size and alignment.
      • thrust::alignment_of, a C++03 implementation of C++11's std::alignment_of.
      • thrust::aligned_storage, a C++03 implementation of C++11's std::aligned_storage.
      • thrust::max_align_t, a C++03 implementation of C++11's std::max_align_t.
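    The aligned_storage_size utility described above amounts to rounding a size up to the nearest multiple of an alignment; a minimal sketch of the arithmetic:

```cpp
#include <cstddef>

// Smallest multiple of `alignment` that can hold `size` bytes
// (assumes alignment is nonzero; Thrust's utility is compile-time too).
constexpr std::size_t aligned_storage_size(std::size_t size,
                                           std::size_t alignment) {
  return ((size + alignment - 1) / alignment) * alignment;
}
```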

    Bug Fixes

    • NVBug 200385527, NVBug 200385119, NVBug 200385113, NVBug 200349350, NVBug 2058778: Various compiler warning issues.
    • NVBug 200355591: thrust::reduce performance issues.
    • NVBug 2053727: Fixed an ADL bug that caused user-supplied allocate to be overlooked but deallocate to be called with GCC <= 4.3.
    • NVBug 1777043: Fixed thrust::complex to work with thrust::sequence.
  • 1.9.1-2(May 18, 2020)

    Thrust 1.9.1 integrates version 1.7.4 of CUB and introduces a new CUDA backend for thrust::reduce based on CUB.

    Bug Fixes

    • NVBug 1965743: Remove unnecessary static qualifiers.
    • NVBug 1940974: Fix regression causing a compilation error when using thrust::merge_by_key with thrust::constant_iterators.
    • NVBug 1904217: Allow callables that take non-const refs to be used with thrust::reduce and thrust::*_scan.
  • 1.9.0-5(May 18, 2020)

    Thrust 1.9.0 replaces the original CUDA backend (bulk) with a new one written using CUB, a high performance CUDA collectives library. This brings a substantial performance improvement to the CUDA backend across the board.

    Breaking Changes

    • Any code depending on CUDA backend implementation details will likely be broken.

    New Features

    • New CUDA backend based on CUB which delivers substantially higher performance.
    • thrust::transform_output_iterator, a fancy iterator that applies a function to the output before storing the result.
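    A transform_output_iterator applies a function to each value as it is written through the iterator. A minimal host-only sketch of the idea (Thrust's real fancy iterator is a full iterator usable inside algorithms):

```cpp
// Assigning through the dereferenced iterator applies `f` before storing
// into the underlying output iterator.
template <typename OutIt, typename F>
struct transform_output_iterator {
  OutIt out;
  F f;
  struct proxy {
    OutIt out;
    F f;
    template <typename T>
    proxy& operator=(const T& value) {
      *out = f(value);  // transform on output
      return *this;
    }
  };
  proxy operator*() { return proxy{out, f}; }
  transform_output_iterator& operator++() { ++out; return *this; }
};

// Example transform: double every written value.
int times_two(int x) { return x * 2; }
```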

    New Examples

    • transform_output_iterator demonstrates use of the new fancy iterator thrust::transform_output_iterator.

    Other Enhancements

    • When C++11 is enabled, functors do not have to inherit from thrust::(unary|binary)_function anymore to be used with thrust::transform_iterator.
    • Added C++11 only move constructors and move assignment operators for thrust::detail::vector_base-based classes, e.g. thrust::host_vector, thrust::device_vector, and friends.

    Bug Fixes

    • sin(thrust::complex<double>) no longer has precision loss to float.

    Acknowledgments

    • Thanks to Manuel Schiller for contributing a C++11 based enhancement regarding the deduction of functor return types, improving the performance of thrust::unique and implementing thrust::transform_output_iterator.
    • Thanks to Thibault Notargiacomo for the implementation of move semantics for the thrust::vector_base-based classes.
    • Thanks to Duane Merrill for developing CUB and helping to integrate it into Thrust's backend.
  • 1.8.3(May 16, 2020)

    Thrust 1.8.3 is a small bug fix release.

    New Examples

    • range_view demonstrates the use of a view (a non-owning wrapper for an iterator range with a container-like interface).

    Bug Fixes

    • thrust::(min|max|minmax)_element can now accept raw device pointers when an explicit device execution policy is used.
    • clear operations on vector types no longer require the element type to have a default constructor.
  • 1.8.2(Feb 28, 2019)

    Thrust 1.8.2 is a small bug fix release.

    Bug Fixes

    • Avoid warnings and errors concerning user functions called from __host__ __device__ functions
    • #632 CUDA set_intersection_by_key error
    • #651 thrust::copy between host & device is not interoperable with thrust::cuda::par.on(stream)
    • #664 CUDA for_each ignores execution policy's stream

    Known Issues

    • #628 CUDA's reduce_by_key fails on sm_50 devices
  • 1.8.1(Mar 18, 2015)

  • 1.8.0(Jan 12, 2015)

    Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.

    New Features

    • Algorithms in CUDA __device__ code
      • Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions.

        Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:

         __device__ void my_device_sort(int *data, size_t n)
         {
           thrust::sort(thrust::device, data, data + n);
         }
        

        The following execution policies are supported in CUDA __device__ code:

        • thrust::seq
        • thrust::cuda::par
        • thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA

        Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.

    • Execution Policies
      • CUDA Streams The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm execution should occur on a given stream:

        // execute for_each on stream s
        thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
        

        Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.

      • thrust::seq The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread:

        // execute for_each sequentially in this thread
        thrust::for_each(thrust::seq, begin, end, my_functor);
        
    • Other
      • The new thrust::complex template provides complex number support.

    New Examples

    • simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution.
    • async_reduce demonstrates ways to achieve algorithm invocations which are asynchronous with the calling thread.

    Other Enhancements

    • CUDA sort performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
    • CUDA merge performance is 200% faster on Tesla K20c for large problem sizes.
    • CUDA sort performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
    • CUDA reduce_by_key performance is 25% faster on Tesla K20c for large problem sizes.
    • CUDA scan performance is 15% faster on Tesla K20c for large problem sizes.
    • fallback_allocator example is simpler.

    Bug Fixes

    • #364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
    • #371 do not redefine __CUDA_ARCH__
    • #379 fix crash when dereferencing transform_iterator on the CPU
    • #391 avoid use of uppercase variable names
    • #392 fix thrust::copy between cusp::complex & std::complex
    • #396 program compiled with gcc < 4.3 hangs during comparison sort
    • #406 fallback_allocator.cu example checks device for unified addressing support
    • #417 avoid using std::less<T> in binary search algorithms
    • #418 avoid various warnings
    • #443 including version.h no longer configures default systems
    • #578 nvcc produces warnings when sequential algorithms are used with cpu systems

    Known Issues

    • When invoked with primitive data types, thrust::sort, thrust::sort_by_key, thrust::stable_sort, thrust::stable_sort_by_key may fail to link in some cases with nvcc -rdc=true.
    • The CUDA implementation of thrust::reduce_by_key incorrectly outputs the last element in a segment of equivalent keys instead of the first.

    Acknowledgments

    • Thanks to Sean Baxter for contributing faster CUDA reduce, merge, and scan implementations.
    • Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
    • Thanks to Filipe Maia for contributing the implementation of thrust::complex.
  • 1.7.2 (May 16, 2020)
