Cooperative primitives for CUDA C++.

Overview

About CUB

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model:

Orientation of collective primitives within the CUDA software stack

CUB is included in the NVIDIA HPC SDK and the CUDA Toolkit.

We recommend the CUB Project Website for further information and examples.



A Simple Example

#include <cub/cub.cuh>

// Block-sorting CUDA kernel
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
     using namespace cub;

     // Specialize BlockRadixSort, BlockLoad, and BlockStore for 128 threads
     // owning 16 integer items each
     typedef BlockRadixSort<int, 128, 16>                     BlockRadixSort;
     typedef BlockLoad<int, 128, 16, BLOCK_LOAD_TRANSPOSE>   BlockLoad;
     typedef BlockStore<int, 128, 16, BLOCK_STORE_TRANSPOSE> BlockStore;

     // Allocate shared memory
     __shared__ union {
         typename BlockRadixSort::TempStorage  sort;
         typename BlockLoad::TempStorage       load;
         typename BlockStore::TempStorage      store;
     } temp_storage;

     int block_offset = blockIdx.x * (128 * 16);	  // OffsetT for this block's ment

     // Obtain a segment of 2048 consecutive keys that are blocked across threads
     int thread_keys[16];
     BlockLoad(temp_storage.load).Load(d_in + block_offset, thread_keys);
     __syncthreads();

     // Collectively sort the keys
     BlockRadixSort(temp_storage.sort).Sort(thread_keys);
     __syncthreads();

     // Store the sorted segment
     BlockStore(temp_storage.store).Store(d_out + block_offset, thread_keys);
}

Each thread block uses cub::BlockRadixSort to collectively sort its own input segment. The class is specialized by the data type being sorted, by the number of threads per block, by the number of keys per thread, and implicitly by the targeted compilation architecture.

The cub::BlockLoad and cub::BlockStore classes are similarly specialized. Furthermore, to provide coalesced accesses to device memory, these primitives are configured to access memory using a striped access pattern (where consecutive threads simultaneously access consecutive items) and then transpose the keys into a blocked arrangement of elements across threads.

Once specialized, these classes expose opaque TempStorage member types. The thread block uses these storage types to statically allocate the union of shared memory needed by the thread block. (Alternatively these storage types could be aliased to global memory allocations).



Supported Compilers

CUB is regularly tested using the specified versions of the following compilers. Unsupported versions may emit deprecation warnings, which can be silenced by defining CUB_IGNORE_DEPRECATED_COMPILER during compilation.

  • NVCC 11.0+
  • NVC++ 20.9+
  • GCC 5+
  • Clang 7+
  • MSVC 2019+ (19.20/16.0/14.20)



Releases

CUB is distributed with the NVIDIA HPC SDK and the CUDA Toolkit in addition to GitHub.

See the changelog for details about specific releases.

CUB Release Included In
1.14.0 NVIDIA HPC SDK 21.9
1.13.1 CUDA Toolkit 11.5
1.13.0 NVIDIA HPC SDK 21.7
1.12.1 CUDA Toolkit 11.4
1.12.0 NVIDIA HPC SDK 21.3
1.11.0 CUDA Toolkit 11.3
1.10.0 NVIDIA HPC SDK 20.9 & CUDA Toolkit 11.2
1.9.10-1 NVIDIA HPC SDK 20.7 & CUDA Toolkit 11.1
1.9.10 NVIDIA HPC SDK 20.5
1.9.9 CUDA Toolkit 11.0
1.9.8-1 NVIDIA HPC SDK 20.3
1.9.8 CUDA Toolkit 11.0 Early Access
1.9.8 CUDA 11.0 Early Access
1.8.0
1.7.5 Thrust 1.9.2
1.7.4 Thrust 1.9.1-2
1.7.3
1.7.2
1.7.1
1.7.0 Thrust 1.9.0-5
1.6.4
1.6.3
1.6.2 (previously 1.5.5)
1.6.1 (previously 1.5.4)
1.6.0 (previously 1.5.3)
1.5.2
1.5.1
1.5.0
1.4.1
1.4.0
1.3.2
1.3.1
1.3.0
1.2.3
1.2.2
1.2.0
1.1.1
1.0.2
1.0.1
0.9.4
0.9.2
0.9.1
0.9.0



Development Process

CUB and Thrust depend on each other. It is recommended to clone Thrust and build CUB as a component of Thrust.

CUB uses the CMake build system to build unit tests, examples, and header tests. To build CUB as a developer, the following recipe should be followed:

# Clone Thrust and CUB from Github. CUB is located in Thrust's
# `dependencies/cub` submodule.
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Create build directory:
mkdir build
cd build

# Configure -- use one of the following:
cmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..   # Command line interface.
ccmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..  # ncurses GUI (Linux only)
cmake-gui  # Graphical UI, set source/build directories and options in the app

# Build:
cmake --build . -j 
   
       # invokes make (or ninja, etc)

# Run tests and examples:
ctest

   

By default, the C++14 standard is targeted, but this can be changed in CMake. More information on configuring your CUB build and creating a pull request is found in CONTRIBUTING.md.



Open Source License

CUB is available under the "New BSD" open-source license:

Copyright (c) 2010-2011, Duane Merrill.  All rights reserved.
Copyright (c) 2011-2018, NVIDIA CORPORATION.  All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
   *  Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
   *  Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
   *  Neither the name of the NVIDIA CORPORATION nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Issues
  • Add BLOCK_LOAD_STRIPED and BLOCK_STORE_STRIPED

    Add BLOCK_LOAD_STRIPED and BLOCK_STORE_STRIPED

    This PR adds the following to BlockLoadAlgorithm

    1. BLOCK_LOAD_STRIPED It's basically BLOCK_LOAD_TRANSPOSE without the BlockExchange

    This PR adds the following to BlockStoreAlgorithm

    1. BLOCK_STORE_STRIPED It's basically BLOCK_STORE_TRANSPOSE without the BlockExchange
    type: enhancement testing: gpuCI passed P1: should have testing: internal ci passed 
    opened by mnicely 21
  • Support future value for initial value for device scan

    Support future value for initial value for device scan

    Prototyping the support of device pointer for cub device scan.

    Example Usage:

    cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, cub::FutureValue<float>(ptr), num_items);
    

    Tests pass for this PR, but there is no documentation yet. I will add doc if @allisonvacanti thinks this is a good idea.

    Please review https://github.com/NVIDIA/thrust/pull/1519 for thrust change.

    testing: gpuCI passed P1: should have testing: internal ci passed helps: pytorch 
    opened by zasdfgbnm 18
  • Static links to tag releases changed

    Static links to tag releases changed

    Hi @brycelelbach, something changed when you revised the release logs yesterday. We used to point to https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz but now it changes to https://github.com/NVlabs/cub/archive/1.8.0.tar.gz (the "v" is missing). I think it is a breaking change. Any chance it could be reverted?

    opened by leofang 16
  • Faster Least Significant Digit Radix Sort Implementation

    Faster Least Significant Digit Radix Sort Implementation

    • radix sort with decoupled look-back, 8 bits per pass and other optimizations
    • pull request to the previous CUB repository: https://github.com/brycelelbach/cub_historical_2019_2020/pull/26
    blocked testing: gpuCI passed testing: internal ci passed 
    opened by canonizer 11
  • Document that cub::DeviceRadixSort and cub::BlockRadixSort are stable

    Document that cub::DeviceRadixSort and cub::BlockRadixSort are stable

    There is nothing in the documentation https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html saying whether it is stable sort or not. It seems to be stable. If so, it would be great if this fact is mentioned in the documentation.

    only: docs good first issue 
    opened by zasdfgbnm 10
  • cub::DeviceReduce::ReduceByKey() results are non-deterministic for floats

    cub::DeviceReduce::ReduceByKey() results are non-deterministic for floats

    @allisonvacanti @senior-zero

    cub::DeviceReduce::ReduceByKey web page describes "run-to-run" determinism for addition of floating point types, but the result looks wrong to me. Is it an expected behavior or a bug?

    From my limited testing somehow I got run-to-run result from the below code with CUDA 11.6 SDK. BTW, you need to run the program multiple times and will occasionally see the error, it doesn't work like the documentation mentioned which provides "run-to-run" determinism.

    Note: this issue is mainly for cub, there is a similar issue for thrust (https://github.com/NVIDIA/thrust/issues/1621)

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    
    #include <thrust/copy.h>
    #include <thrust/fill.h>
    #include <thrust/sequence.h>
    #include <thrust/reduce.h>
    #include <iostream>
    
    #include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>
    
    
    int main() {
       auto const numElements = 250000;
       thrust::device_vector<double> data(numElements, 0.1);
       thrust::device_vector<double> keys(numElements, 1);
    
       thrust::device_vector<double> keys_out1(numElements);
       thrust::device_vector<double> keys_out2(numElements);
    
       thrust::device_vector<double> out1(numElements);
       thrust::device_vector<double> out2(numElements);
    
       thrust::host_vector<double> hostOut1(numElements);
       thrust::host_vector<double> hostOut2(numElements);
    
       thrust::device_vector<int> num_runs_out(1);
    
       // first run
       {
          size_t temp_storage_bytes = 0;
          cub::DeviceReduce::ReduceByKey(
             nullptr, temp_storage_bytes,
             keys.begin(), keys_out1.begin(),
             data.begin(), out1.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
          cub::DeviceReduce::ReduceByKey(
             d_temp_storage.data().get(), temp_storage_bytes,
             keys.begin(), keys_out1.begin(),
             data.begin(), out1.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          // copy out1 to the host
          thrust::copy(out1.begin(), out1.begin() + num_runs_out[0], hostOut1.begin());
       }
    
       // second run
       {
          size_t temp_storage_bytes = 0;
          cub::DeviceReduce::ReduceByKey(
             nullptr, temp_storage_bytes,
             keys.begin(), keys_out2.begin(),
             data.begin(), out2.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
          cub::DeviceReduce::ReduceByKey(
             d_temp_storage.data().get(), temp_storage_bytes,
             keys.begin(), keys_out2.begin(),
             data.begin(), out2.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          // copy out2 to the host
          thrust::copy(out2.begin(), out2.begin() + num_runs_out[0], hostOut2.begin());
       }
    
       // Check the outputs are exactly the same
       for (int i = 0; i < num_runs_out[0]; i++) {
          if (hostOut1[i] != hostOut2[i]) {
             std::cout << "Element " << i << " is not equal" << std::endl;
          }
       }
    
       return 0;
    }
    
    duplicate only: docs type: bug: functional P1: should have 
    opened by lilohuang 9
  • New segmented sort algorithm

    New segmented sort algorithm

    This PR includes a new segmented sort facility. Few approaches to this problem exist.

    Embed segment number into keys

    This approach provides an elegant solution to the load-balancing issue but can lead to slowdowns. It also can't be applicable if the number of bites representing segments number exceeds a maximal number of bytes used by keys.

    Modified merge sort approach

    This idea is implemented in modern GPU. I've used this approach as a reference for comparison with the new segmented sort algorithm. As I show below, this approach can be outperformed in most cases.

    Kernel specialisation

    The idea behind this approach is to partition input segments into size groups. Specialised kernels can further process each size group. The LRB approach initially discussed in the issue falls into this category. It also represents the approach that the new segmented sort algorithm relies on.

    I'm going to briefly describe the genesis of the new segmented sort algorithm to justify some design decisions.

    To minimise the number of kernel specialisations, I've benchmarked different approaches to small (under a few hundred items) segment sorting. I've benchmarked single-thread even-odd sorting, bitonic warp sorting and newly added warp merge sort. The warp-scope merge-sort approach demonstrated some advantages: it can sort bigger segments and outperforms other methods (in the majority of cases).

    Warp-scope merge sort is included in this PR as a separate facility. It's possible to partition architectural warp into multiple virtual ones to sort multiple segments simultaneously. The warp-scope merge sort duplicated a significant part of the previously introduced block-scope merge sort, so I extracted the merge-sort strategy into a separate facility. Both warp and block sort share this strategy.

    Here's the speedup of warp-scope merge sort over warp-bitonic sort: image

    And the speedup of warp-scope merge sort over single-thread odd-even sort: image

    In the figures above I vary segment sizes and segments number.

    To further increase the performance of warp-scope merge sort I needed to load and store data efficiently. I needed warp-scope load, store and exchange facilities. These facilities are also provided in this PR.

    Using a proper sorting algorithm was not enough. Initially, I've assigned a CUDA thread block to a segment. Although this approach demonstrated speedup over the existing one, it led to inefficient resource utilisation because most threads were idle. Nonetheless, a kernel like this is used as a fallback solution when there are not enough segments. If idle threads don't block other CTAs from execution, there's no reason to spend time on segments partitioning. The fallback kernel helped to eliminate cases when the partitioning stage led to the overall slowdown of the new algorithm.

    image

    Initially, I implemented a single kernel for all size groups. Depending on the CTA number, I allocated a different number of threads per segment. That is, if the segment size exceeded a few thousand items, I've used slow block-scope radix sort. If the data was about a thousand items and fit into shared memory, I've used in-shared-memory block-scope radix sort. In all these cases, the whole CTA was assigned to a single segment. If the CTA number exceeded the number of large segments, I've partitioned CTA into multiple warp groups, each processing a separate segment. It happened that the large-segment branch limited the occupancy of small-segment one. So I separated this kernel into two. One kernel processes large segments and contains two branches - in-shared-memory sort and slow block-scope sort. Another kernel processes medium and small segments.

    To overlap small/large segments processing, I've used concurrent kernels. This PR contains a single-stream implementation, though. The multi-stream API is out of the scope of this PR and might be introduced later.

    Segments partitioning

    The LRB approach discussed in the initial issue balances segments in a non-deterministic way. In some cases, it led to slowdowns because consecutive sub-warps might not end up processing consecutive segments. Here's the speedup of the LRB approach compared to the partitioning based approach.

    image

    I've tried applying LRB only to the large segments group. This approach also leads to controversial performance results. In rare cases, when an extremely large segment is located in the tail of the large segments group, LRB leads to performance improvements. Otherwise, there are enough segments to overlap its processing, and the LRB stage leads to slowdowns. Therefore, we decided to opt-in LRB. The API with pre-partitioned segments is going to be implemented later. Here's the speedup of the version where I apply LRB to the large segment group.

    image

    Instead of LRB, I've implemented a three-way partition facility, which is included in this PR. It leads to deterministic partitioning of segments and might be used outside of the segmented sort. The three-way partitioning stage is faster than the LRB stage.

    Temporary storage layout

    The temporary storage layout of the new segmented sort is quite complex. To simplify the code and make it safer, I've introduced temporary storage layout wrappers, which can be found in this PR.

    Performance

    To benchmark the new algorithm on real-world data, I've converted sparse matrices collection to segment sizes. Segments number is equal to the number of rows in a sparse matrix, while segment size is equal to the number of non-zero values in this row. Here's the speedup of the new algorithm compared to the cub::DeviceSegmentedRadixSort on A100 GPU while sorting pairs of std::uint32_t and std::uint64_t.

    image

    | speedup | new algorithm | modernGPU | | ----------- | ----------- | ----------- | | min | 1.11 | 0.16 | | median | 61.24 | 7.93 | | max | 5314.44 | 2219.73 |

    The speedup depends on the segments number: image

    testing: gpuCI in progress P0: must have testing: internal ci in progress 
    opened by senior-zero 9
  • Bug in ExclusiveSum

    Bug in ExclusiveSum

    The following code prints

    0 100 -128 0
    

    NOTE: It overflows!!!!

    nvcc --version
    

    prints

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Sun_Mar_21_19:15:46_PDT_2021
    Cuda compilation tools, release 11.3, V11.3.58
    Build cuda_11.3.r11.3/compiler.29745058_0
    
    #include <vector>
    
    #include "cub/cub.cuh"
    
    int main() {
      std::vector<int8_t> v = {100, 28, 1};
    
      int8_t *src;
      // ignore error check
      cudaMalloc((void **)&src, 3);
      cudaMemcpy(src, v.data(), 3, cudaMemcpyHostToDevice);
    
      int32_t *dst;
      cudaMalloc((void **)&dst, 4 * sizeof(int32_t));
    
      void *d_temp_storage = nullptr;
      size_t temp_storage_bytes = 0;
      cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, src, dst,
                                    3);
      cudaMalloc(&d_temp_storage, temp_storage_bytes);
      cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, src, dst,
                                    3);
    
      int32_t h[4];
      cudaMemcpy(h, dst, 4 * sizeof(int32_t), cudaMemcpyDeviceToHost);
    
      for (int32_t i = 0; i != 4; ++i) {
        std::cout << h[i] << " ";
      }
    
      std::cout << "\n";
      // ignore memory leak
      return 0;
    }
    
    question 
    opened by csukuangfj 9
  • 64-bit Offsets in DeviceRadixSort

    64-bit Offsets in DeviceRadixSort

    • 64-bit OffsetT is supported for onesweep sorting
    • for decoupled look-back, the partition kernel is broken into smaller parts (as before), and a separate 32-bit type is used there
    • for histograms, 32-bit counters are used in shared memory and OffsetT-sized counters in global memory
    type: enhancement testing: gpuCI in progress P1: should have testing: internal ci in progress 
    opened by canonizer 8
  • Faster segmented sorting (and segmented problems in general)

    Faster segmented sorting (and segmented problems in general)

    Segmented problems can suffer from workload imbalance if the distribution of the segment sizes vary. The workload imbalance becomes even more challenging when the number of segments is very large.

    The attached zip file has an efficient segmented sort that is based on NVIDIA's CUB library. Most of the sorting is done using CUB's sorting functionality. The key thing that this segmented sorting algorithm offers is simple and efficient load-balancing. lrb_sort.cuh.zip

    The solution used for the segmented sort is applicable to other segmented problems.

    Details of the algorithm (and performance) can be found in the following paper.

    type: enhancement P1: should have 
    opened by ogreen 8
  • Issue using cub reduce on more than elements than fit into a 4 byte integer

    Issue using cub reduce on more than elements than fit into a 4 byte integer

    Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false)

    my issue seems to be that num_items is of type int so when I try to reduce more elements than fit into a 4 byte integer it overflows and the code obviously doesn't work properly. Given that GPUs are both growing in RAM size and that we can now oversubscribe by using cudaSharedMalloc are there any plans to change that number to be able to receive type size_t?

    opened by felipeblazing 8
  • Dispatch mechanism may break when any two libraries that use CUB and/thrust have been compiled for different set of GPU architectures

    Dispatch mechanism may break when any two libraries that use CUB and/thrust have been compiled for different set of GPU architectures

    The following describes a problem observed in more "complex" software projects, where different components (or libraries) use CUB and/or thrust without separating CUB and/or thrust through namespace costumisation. This issue may be observed when linked libraries include CUB and/or thrust - even if the libraries' dependency on CUB and/or thrust is not apparent to the library user.

    Is this the issue that I'm having?

    If you are:

    • linking against another library that is using either CUB and/or thrust and
    • (your source files or a second library) are using CUB and/or thrust and
    • you are seeing an error like:
      • an exception like "merge_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument"
      • Running your program under cuda-memcheck or compute-sanitizer --tool memcheck reports out-of-bounds global memory reads or global memory writes (into temporary_storage) within a CUB (or thrust kernel)
      • cudaErrorInvalidValue: invalid argument thrown from a thrust algorithm
    • The issue you're running into is not deterministic. Whether you'll experience a problem or not is determined at load time(?). It may well be that you run your program once and everything works perfectly fine; you can run the affected thrust/CUB algorithm hundreds of times in a loop without any issue. But when you run your program the next time, it will fail (consistently).

    The root cause

    Situation

    • CUB is using tuning policies to determine the optimal "meta parameters" that are most efficient for a kernel on a specific GPU architecture.
    • There's a compile-time and a run-time component to the tuning policies. I'll refer to the run-time component as the "dispatch mechanism".
    • The compile-time component makes sure that, during a compilation pass for a specific GPU architecture, the kernel is compiled using the correct "meta parameters" for that specific architecture (e.g., getting the correct meta parameters may be implemented using __CUDA_ARCH__). Such meta parameters are parameters like BLOCK_THREADS (the number of threads per thread block), ITEMS_PER_THREAD (the number of items processed by each thread), etc.
    • At run-time, the dispatch mechanism needs to configure the kernel launch of a CUB algorithm. I.e., it needs to configure the correct block size (that corresponds to the kernel's BLOCK_THREADS) and the correct grid size. These run-time parameters need to match the parameters of the kernel that will actually get launched.
    • To determine the GPU architecture that a kernel will get dispatched for, CUB uses cudaFuncGetAttributes on cub::EmptyKernel to query the closest architecture for which EmptyKernel was compiled for, assuming that EmptyKernel has been compiled for exactly the same architectures as the kernels actually implementing the various algorithms (which usually is the case).

    Problem

    CUB's kernels have weak external linkage. All kernels from all translation units being linked will end up in the binary's fatbin. If there's multiple choices for a kernel, the CUDA runtime seems to choose any qualifying kernel candidate "at random".


    compilation

    nvcc -c -gencode arch=compute_52,code=compute_52 my_lib.cu 
    nvcc -c -gencode arch=compute_70,code=compute_70 main.cu 
    nvcc -o sort_test my_lib.o main.o  && compute-sanitizer --tool memcheck ./sort_test
    

    my_lib.cu

    #include <thrust/scan.h>
    
    void my_lib_scan(cudaStream_t stream)
    {
        // this can be an arbitrary library
        // imagine it uses some thrust algorithms (e.g., a scan)
        // and it comes pre-compiled for _some_ GPU architecture
        // In this case, just including the header is sufficient for EmptyKernel to be compiled in this TU
    }
    

    main.cu

    #include <thrust/sort.h>
    #include <thrust/device_vector.h>
    
    int main()
    {
        thrust::device_vector<int> d_vec(128 << 20);
        
        thrust::sort(d_vec.begin(), d_vec.end());
        
        cudaDeviceSynchronize();
        std::cout << cudaGetLastError() << "\n";
    }
    

    output

    Running on a V100

    #RUN 0
    compute-sanitizer --tool memcheck ./sort_test
    ========= COMPUTE-SANITIZER
    cudaFuncGetAttributes(EmptyKernel): 700
    0
    DeviceRadixSortHistogramKernel: 700
    DeviceRadixSortOnesweepKernel: 700
    DeviceRadixSortOnesweepKernel: 700
    DeviceRadixSortOnesweepKernel: 700
    DeviceRadixSortOnesweepKernel: 700
    0
    ========= ERROR SUMMARY: 0 errors
    
    #RUN 1
    compute-sanitizer --tool memcheck ./sort_test
    ========= COMPUTE-SANITIZER
    cudaFuncGetAttributes(EmptyKernel): 520
    0
    DeviceRadixSortUpsweepKernel: 700
    RadixSortScanBinsKernel: 700
    RadixSortScanBinsKernel: 700
    DeviceRadixSortUpsweepKernel: 700
    RadixSortScanBinsKernel: 700
    RadixSortScanBinsKernel: 700
    DeviceRadixSortUpsweepKernel: 700
    ========= Invalid __global__ write of size 4 bytes
    =========     at 0x74d0 in cub/agent/agent_radix_sort_downsweep.cuh:264:void cub::AgentRadixSortDownsweep<cub::AgentRadixSortDownsweepPolicy<(int)512, (int)23, int, (cub::BlockLoadAlgorithm)3, (cub::CacheLoadModifier)0, (cub::RadixRankAlgorithm)2, (cub::BlockScanAlgorithm)2, (int)7, cub::RegBoundScaling<(int)512, (int)23, int>>, (bool)0, int, cub::NullType, unsigned int>::ScatterKeys<(bool)1>(unsigned int (&)[23], unsigned int (&)[23], int (&)[23], unsigned int)
    =========     by thread (125,0,0) in block (0,0,0)
    [...]
    

    Potential Solutions

    Declare the CUB kernels static. Making sure that CUB kernels in translation unit A won't interfere with the kernels in translation unit B would be a viable solution. We currently have all the kernels from both translation units in the linked binary anyways. See below cuobjdump for the above code example.

    cuobjdump sort_test -xptx all
    Extracting PTX file and ptxas options    1: my_lib.sm_52.ptx -arch=sm_52  --generate-line-info 
    Extracting PTX file and ptxas options    2: main.sm_70.ptx -arch=sm_70  --generate-line-info
    
    cat my_lib.sm_52.ptx |c++filt|grep .entry
    .visible .entry void cub::EmptyKernel<void>()()
    
    cat main.sm_70.ptx |c++filt|grep .entry
    .visible .entry void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>(thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long)(
    .visible .entry void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<int*, int*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<int*, int*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<int*, int*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long)(
    .visible .entry void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::detail::normal_iterator<thrust::device_ptr<int> >, thrust::detail::normal_iterator<thrust::device_ptr<int> >, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::detail::normal_iterator<thrust::device_ptr<int> >, thrust::detail::normal_iterator<thrust::device_ptr<int> >, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<thrust::detail::normal_iterator<thrust::device_ptr<int> >, thrust::detail::normal_iterator<thrust::device_ptr<int> >, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long)(
    .visible .entry void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<int const*, thrust::device_ptr<int>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<int const*, thrust::device_ptr<int>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<int const*, thrust::device_ptr<int>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long)(
    .visible .entry void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::device_ptr<int>, int*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::device_ptr<int>, int*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<thrust::device_ptr<int>, int*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long)(
    .visible .entry void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::device_ptr<int>, thrust::device_ptr<int>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::device_ptr<int>, thrust::device_ptr<int>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<thrust::device_ptr<int>, thrust::device_ptr<int>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<int>, thrust::cuda_cub::__transform::always_true_predicate>, long)(
    .visible .entry void cub::EmptyKernel<void>()()
    .visible .entry void cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, false, int, cub::NullType, unsigned int>(int const*, int*, cub::NullType const*, cub::NullType*, unsigned int, int, int)(
    .visible .entry void cub::DeviceRadixSortUpsweepKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, false, false, int, unsigned int>(int const*, unsigned int*, unsigned int, int, int, cub::GridEvenShare<unsigned int>)(
    .visible .entry void cub::DeviceRadixSortUpsweepKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, true, false, int, unsigned int>(int const*, unsigned int*, unsigned int, int, int, cub::GridEvenShare<unsigned int>)(
    .visible .entry void cub::RadixSortScanBinsKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, unsigned int>(unsigned int*, int)(
    .visible .entry void cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, false, false, int, cub::NullType, unsigned int>(int const*, int*, cub::NullType const*, cub::NullType*, unsigned int*, unsigned int, int, int, cub::GridEvenShare<unsigned int>)(
    .visible .entry void cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, true, false, int, cub::NullType, unsigned int>(int const*, int*, cub::NullType const*, cub::NullType*, unsigned int*, unsigned int, int, int, cub::GridEvenShare<unsigned int>)(
    .visible .entry void cub::DeviceRadixSortHistogramKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, false, int, unsigned int>(unsigned int*, int const*, unsigned int, int, int)(
    .visible .entry void cub::DeviceRadixSortExclusiveSumKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, unsigned int>(unsigned int*)(
    .visible .entry void cub::DeviceRadixSortOnesweepKernel<cub::DeviceRadixSortPolicy<int, cub::NullType, unsigned int>::Policy800, false, int, cub::NullType, unsigned int, int, int>(int*, int*, unsigned int*, unsigned int const*, int*, int const*, cub::NullType*, cub::NullType const*, int, int, int)(
    

    List of issues that may be linked to this root cause:

    • https://github.com/NVIDIA/thrust/issues/1400
    • https://github.com/NVIDIA/thrust/issues/936
    opened by elstehle 2
  • Deprecate `cub::DeviceSpmv`

    Deprecate `cub::DeviceSpmv`

    The cub::DeviceSpmv was unmaintained for a while and probably contains bugs. Moreover, there are better implementations in specialized libraries like cuSPARSE. I suggest we deprecate it.

    release: breaking change 
    opened by senior-zero 0
  • SpMV with different matrix and vector types

    SpMV with different matrix and vector types

    Right now, the SpMV kernel allows to specify the matrix and vector type by specializing the ValueT type.

    In our case, our sparse matrix elements is stored in CRS format and the blocks are 3x3 matrices, and our vector elements are vector3.
    In practice, the 3x3 blocks are 3 float4, and the vector3 is a float4, and we have defined the multiplication operator.

    Do you think it's possible to separate the matrix type and the vector type in two different templated types?

    opened by michaelmigliore 3
  • Refine fallback kernel for segmented sort

    Refine fallback kernel for segmented sort

    Currently, cub::DeviceSegmentedSort has a fallback kernel, that apply different algorithms for different segment sizes. In particular, medium-size segments are sorted by merge sort. If segment doesn't fit into registers, it's sorted by in-shared-memory block radix sort. Otherwise, expensive in-global-memory radix sort is applied.

    The suggestion was made, that we have to add small-segment branch and potentially experiment with refining the granularity at which we assign threads to segments at fallback kernel.

    P2: nice to have 
    opened by senior-zero 1
  • First version of developer overview

    First version of developer overview

    This PR briefly explains the current CUB design. The document is intended to help contributors. Coming PTX dispatch changes will lead to changes in this document. Having a diff of design overview will ease the transition period (still having a few PRs in flight that apply existing dispatch logic).

    only: docs area: docs 
    opened by senior-zero 0
Releases(1.17.0)
  • 1.17.0(May 9, 2022)

    CUB 1.17.0

    Summary

    CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.

    Known Issues

    “Run-to-run” Determinism Broken

    Several CUB device algorithms are documented to provide deterministic results (per device) for non-associative reduction operators (e.g. floating-point addition). Unfortunately, the implementations of these algorithms contain performance optimizations that violate this guarantee. The DeviceReduce::ReduceByKey and DeviceScan algorithms are known to be affected. We’re currently evaluating the scope and impact of correcting this in a future CUB release. See NVIDIA/cub#471 for details.

    Bug Fixes

    • NVIDIA/cub#444: Fixed DeviceSelect to work with discard iterators and mixed input/output types.
    • NVIDIA/cub#452: Fixed install issue when CMAKE_INSTALL_LIBDIR contained nested directories. Thanks to @robertmaynard for this contribution.
    • NVIDIA/cub#462: Fixed bug that produced incorrect results from DeviceSegmentedSort on sm_61 and sm_70.
    • NVIDIA/cub#464: Fixed DeviceSelect::Flagged so that flags are normalized to 0 or 1.
    • NVIDIA/cub#468: Fixed overflow issues in DeviceRadixSort given num_items close to 2^32. Thanks to @canonizer for this contribution.
    • NVIDIA/cub#498: Fixed compiler regression in BlockAdjacentDifference. Thanks to @MKKnorr for this contribution.

    Other Enhancements

    • NVIDIA/cub#445: Remove device-sync in DeviceSegmentedSort when launched via CDP.
    • NVIDIA/cub#449: Fixed invalid link in documentation. Thanks to @kshitij12345 for this contribution.
    • NVIDIA/cub#450: BlockDiscontinuity: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
    • NVIDIA/cub#451: Replaced the deprecated TexRefInputIterator implementation with an alias to TexObjInputIterator. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB.
    • NVIDIA/cub#456: BlockAdjacentDifference: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
    • NVIDIA/cub#466: cub::DeviceAdjacentDifference API has been updated to use the new OffsetT deduction approach described in NVIDIA/cub#212.
    • NVIDIA/cub#470: Fix several doxygen-related warnings. Thanks to @karthikeyann for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.16.0(Feb 8, 2022)

    Summary

    CUB 1.16.0 is a major release providing several improvements to the device scope algorithms. DeviceRadixSort now supports large (64-bit indexed) input data. A new UniqueByKey algorithm has been added to DeviceSelect. DeviceAdjacentDifference provides new SubtractLeft and SubtractRight functionality.

    This release also deprecates several obsolete APIs, including type traits and BlockAdjacentDifference algorithms. Many bugfixes and documentation updates are also included.

    64-bit Offsets in DeviceRadixSort Public APIs

    Users frequently want to process large datasets using CUB’s device-scope algorithms, but the current public APIs limit input data sizes to those that can be indexed by a 32-bit integer. Beginning with this release, CUB is updating these APIs to support 64-bit offsets, as discussed in NVIDIA/cub#212.

    The device-scope algorithms will be updated with 64-bit offset support incrementally, starting with the cub::DeviceRadixSort family of algorithms. Thanks to @canonizer for contributing this functionality.

    New DeviceSelect::UniqueByKey Algorithm

    cub::DeviceSelect now provides a UniqueByKey algorithm, which has been ported from Thrust. Thanks to @zasdfgbnm for this contribution.

    New DeviceAdjacentDifference Algorithms

    The new cub::DeviceAdjacentDifference interface, also ported from Thrust, provides SubtractLeft and SubtractRight algorithms as CUB kernels.

    Deprecation Notices

    Synchronous CUDA Dynamic Parallelism Support

    A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

    This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. Such invocations will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

    Deprecated Traits

    CUB provided a variety of metaprogramming type traits in order to support C++03. Since C++14 is now required, these traits have been deprecated in favor of their STL equivalents, as shown below:

    | Deprecated CUB Trait | Replacement STL Trait | |-----------------------|-----------------------| | cub::If | std::conditional | | cub::Equals | std::is_same | | cub::IsPointer | std::is_pointer | | cub::IsVolatile | std::is_volatile | | cub::RemoveQualifiers | std::remove_cv | | cub::EnableIf | std::enable_if |

    CUB now uses the STL traits internally, resulting in a ~6% improvement in compile time.

    Misnamed cub::BlockAdjacentDifference APIs

    The algorithms in cub::BlockAdjacentDifference have been deprecated, as their names did not clearly describe their intent. The FlagHeads method is now SubtractLeft, and FlagTails has been replaced by SubtractRight.

    Breaking Changes

    • NVIDIA/cub#331: Deprecate the misnamed BlockAdjacentDifference::FlagHeads and FlagTails methods. Use the new SubtractLeft and SubtractRight methods instead.
    • NVIDIA/cub#364: Deprecate some obsolete type traits. These should be replaced by the equivalent traits in <type_traits> as described above.

    New Features

    • NVIDIA/cub#331: Port the thrust::adjacent_difference kernel and expose it as cub::DeviceAdjacentDifference.
    • NVIDIA/cub#405: Port the thrust::unique_by_key kernel and expose it as cub::DeviceSelect::UniqueByKey. Thanks to @zasdfgbmn for this contribution.

    Enhancements

    • NVIDIA/cub#340: Allow 64-bit offsets in DeviceRadixSort public APIs. Thanks to @canonizer for this contribution.
    • NVIDIA/cub#400: Implement a significant reduction in DeviceMergeSort compilation time.
    • NVIDIA/cub#415: Support user-defined CMAKE_INSTALL_INCLUDEDIR values in Thrust’s CMake install rules. Thanks for @robertmaynard for this contribution.

    Bug Fixes

    • NVIDIA/cub#381: Fix shared memory alignment in dyn_smem example.
    • NVIDIA/cub#393: Fix some collisions with the min/max macros defined in windows.h.
    • NVIDIA/cub#404: Fix bad cast in util_device.
    • NVIDIA/cub#410: Fix CDP issues in DeviceSegmentedSort.
    • NVIDIA/cub#411: Ensure that the nv_exec_check_disable pragma is only used on nvcc.
    • NVIDIA/cub#418: Fix -Wsizeof-array-div warning on gcc 11. Thanks to @robertmaynard for this contribution.
    • NVIDIA/cub#420: Fix new uninitialized variable warning in DiscardIterator on gcc 10.
    • NVIDIA/cub#423: Fix some collisions with the small macro defined in windows.h.
    • NVIDIA/cub#426: Fix some issues with version handling in CUB’s CMake packages.
    • NVIDIA/cub#430: Remove documentation for DeviceSpmv parameters that are absent from public APIs.
    • NVIDIA/cub#432: Remove incorrect documentation for DeviceScan algorithms that guaranteed run-to-run deterministic results for floating-point addition.
    Source code(tar.gz)
    Source code(zip)
  • 1.15.0(Oct 25, 2021)

    Summary

    CUB 1.15.0 includes a new cub::DeviceSegmentedSort algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort when sorting a large number of small segments. A new cub::FutureValue<T> helper allows the cub::DeviceScan algorithms to lazily load the initial_value from a pointer. cub::DeviceScan also added ScanByKey functionality.

    The new DeviceSegmentedSort algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.

    cub::FutureValue<T> provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization:

    int *d_intermediate_result = ...;
    intermediate_kernel<<<blocks, threads>>>(d_intermediate_result,  // output
                                             arg1,                   // input
                                             arg2);                  // input
    
    // Wrap the intermediate pointer in a FutureValue -- no need to explicitly
    // sync when both kernels are stream-ordered. The pointer is read after
    // the ExclusiveScan kernel starts executing.
    cub::FutureValue<int> init_value(d_intermediate_result);
    
    cub::DeviceScan::ExclusiveScan(d_temp_storage,
                                   temp_storage_bytes,
                                   d_in,
                                   d_out,
                                   cub::Sum(),
                                   init_value,
                                   num_items);
    

    Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.

    Deprecation Notices

    A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

    This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

    Breaking Changes

    • NVIDIA/cub#305: The template parameters of cub::DispatchScan have changed to support the new cub::FutureValue helper. More details under "New Features".
    • NVIDIA/cub#377: Remove broken operator->() from cub::TransformInputIterator, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

    New Features

    • NVIDIA/cub#305: Add overloads to cub::DeviceScan algorithms that allow the output of a previous kernel to be used as initial_value without explicit synchronization. See the new cub::FutureValue helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
    • NVIDIA/cub#354: Add cub::BlockRunLengthDecode algorithm. Thanks to Elias Stehle (@elstehle) for this contribution.
    • NVIDIA/cub#357: Add cub::DeviceSegmentedSort, an optimized version of cub::DeviceSegmentedSort with improved load balancing and small array performance.
    • NVIDIA/cub#376: Add "by key" overloads to cub::DeviceScan. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

    Bug Fixes

    • NVIDIA/cub#349: Doxygen and unused variable fixes.
    • NVIDIA/cub#363: Maintenance updates for the new cub::DeviceMergeSort algorithms.
    • NVIDIA/cub#382: Fix several -Wconversion warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
    • NVIDIA/cub#388: Fix debug assertion on MSVC when using cub::CachingDeviceAllocator.
    • NVIDIA/cub#395: Support building with __CUDA_NO_HALF_CONVERSIONS__. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.14.0(Aug 24, 2021)

    CUB 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.

    This release provides the often-requested merge sort algorithm, ported from the thrust::sort implementation. Merge sort provides more flexibility than the existing radix sort by supporting arbitrary data types and comparators, though radix sorting is still faster for supported inputs. This functionality is provided through the new cub::DeviceMergeSort and cub::BlockMergeSort algorithms.

    The namespace wrapping mechanism has been overhauled for 1.14. The existing macros (CUB_NS_PREFIX/CUB_NS_POSTFIX) can now be replaced by a single macro, CUB_WRAPPED_NAMESPACE, which is set to the name of the desired wrapped namespace. Defining a similar THRUST_CUB_WRAPPED_NAMESPACE macro will embed both thrust:: and cub:: symbols in the same external namespace. The prefix/postfix macros are still supported, but now require a new CUB_NS_QUALIFIER macro to be defined, which provides the fully qualified CUB namespace (e.g. ::foo::cub). See cub/util_namespace.cuh for details.

    Breaking Changes

    • NVIDIA/cub#350: When the CUB_NS_[PRE|POST]FIX macros are set, CUB_NS_QUALIFIER must also be defined to the fully qualified CUB namespace (e.g. #define CUB_NS_QUALIFIER ::foo::cub). Note that this is handled automatically when using the new [THRUST_]CUB_WRAPPED_NAMESPACE mechanism.

    New Features

    • NVIDIA/cub#322: Ported the merge sort algorithm from Thrust; cub::BlockMergeSort and cub::DeviceMergeSort are now available.
    • NVIDIA/cub#326: Simplify the namespace wrapper macros, and detect when Thrust's symbols are in a wrapped namespace.

    Bug Fixes

    • NVIDIA/cub#160, NVIDIA/cub#163, NVIDIA/cub#352: Fixed several bugs in cub::DeviceSpmv and added basic tests for this algorithm. Thanks to James Wyles and Seunghwa Kang for their contributions.
    • NVIDIA/cub#328: Fixed error handling bug and incorrect debugging output in cub::CachingDeviceAllocator. Thanks to Felix Kallenborn for this contribution.
    • NVIDIA/cub#335: Fixed a compile error affecting clang and NVRTC. Thanks to Jiading Guo for this contribution.
    • NVIDIA/cub#351: Fixed some errors in the cub::DeviceHistogram documentation.

    Enhancements

    • NVIDIA/cub#348: Add an example that demonstrates how to use dynamic shared memory with a CUB block algorithm. Thanks to Matthias Jouanneaux for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.13.1(Oct 25, 2021)

    CUB 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.

    This release provides a new hook for embedding the cub:: namespace inside a custom namespace. This is intended to work around various issues related to linking multiple shared libraries that use CUB. The existing CUB_NS_PREFIX and CUB_NS_POSTFIX macros already provided this capability; this update provides a simpler mechanism that is extended to and integrated with Thrust. Simply define THRUST_CUB_WRAPPED_NAMESPACE to a namespace name, and both thrust:: and cub:: will be placed inside the new namespace. Using different wrapped namespaces for each shared library will prevent issues like those reported in NVIDIA/thrust#1401.

    New Features

    • NVIDIA/cub#326: Add THRUST_CUB_WRAPPED_NAMESPACE hooks.
    Source code(tar.gz)
    Source code(zip)
  • 1.13.0(Jun 15, 2021)

    CUB 1.13.0 is the major release accompanying the NVIDIA HPC SDK 21.7 release.

    Notable new features include support for striped data arrangements in block load/store utilities, bfloat16 radix sort support, and fewer restrictions on offset iterators in segmented device algorithms. Several bugs in cub::BlockShuffle, cub::BlockDiscontinuity, and cub::DeviceHistogram have been addressed. The amount of code generated in cub::DeviceScan has been greatly reduced, leading to significant compile-time improvements when targeting multiple PTX architectures.

    This release also includes several user-contributed documentation fixes that will be reflected in CUB's online documentation in the coming weeks.

    Breaking Changes

    • NVIDIA/cub#320: Deprecated cub::TexRefInputIterator<T, UNIQUE_ID>. Use cub::TexObjInputIterator<T> as a replacement.

    New Features

    • NVIDIA/cub#274: Add BLOCK_LOAD_STRIPED and BLOCK_STORE_STRIPED functionality to cub::BlockLoadAlgorithm and cub::BlockStoreAlgorithm. Thanks to Matthew Nicely (@mnicely) for this contribution.
    • NVIDIA/cub#291: cub::DeviceSegmentedRadixSort and cub::DeviceSegmentedReduce now support different types for begin/end offset iterators. Thanks to Sergey Pavlov (@psvvsp) for this contribution.
    • NVIDIA/cub#306: Add bfloat16 support to cub::DeviceRadixSort. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
    • NVIDIA/cub#320: Introduce a new CUB_IGNORE_DEPRECATED_API macro that disables deprecation warnings on Thrust and CUB APIs.

    Bug Fixes

    • NVIDIA/cub#277: Fixed sanitizer warnings in RadixSortScanBinsKernels. Thanks to Andy Adinets (@canonizer) for this contribution.
    • NVIDIA/cub#287: cub::DeviceHistogram now correctly handles cases where OffsetT is not an int. Thanks to Dominique LaSalle (@nv-dlasalle) for this contribution.
    • NVIDIA/cub#311: Fixed several bugs and added tests for the cub::BlockShuffle collective operations.
    • NVIDIA/cub#312: Eliminate unnecessary kernel instantiations when compiling cub::DeviceScan. Thanks to Elias Stehle (@elstehle) for this contribution.
    • NVIDIA/cub#319: Fixed out-of-bounds memory access on debugging builds of cub::BlockDiscontinuity::FlagHeadsAndTails.
    • NVIDIA/cub#320: Fixed harmless missing return statement warning in unreachable cub::TexObjInputIterator code path.

    Other Enhancements

    • Several documentation fixes are included in this release.
      • NVIDIA/cub#275: Fixed comments describing the cub::If and cub::Equals utilities. Thanks to Rukshan Jayasekara (@rukshan99) for this contribution.
      • NVIDIA/cub#290: Documented that cub::DeviceSegmentedReduce will produce consistent results run-to-run on the same device for pseudo-associated reduction operators. Thanks to Himanshu (@himanshu007-creator) for this contribution.
      • NVIDIA/cub#298: CONTRIBUTING.md now refers to Thrust's build instructions for developer builds, which is the preferred way to build the CUB test harness. Thanks to Xiang Gao (@zasdfgbnm) for contributing.
      • NVIDIA/cub#301: Expand cub::DeviceScan documentation to include in-place support and add tests. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
      • NVIDIA/cub#307: Expand cub::DeviceRadixSort and cub::BlockRadixSort documentation to clarify stability, in-place support, and type-specific bitwise transformations. Thanks to Himanshu (@himanshu007-creator) for contributing.
      • NVIDIA/cub#316: Move WARP_TIME_SLICING documentation to the correct location. Thanks to Peter Han (@peter9606) for this contribution.
      • NVIDIA/cub#321: Update URLs from deprecated github.com to preferred github.io. Thanks to Lilo Huang (@lilohuang) for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.12.1(Jun 15, 2021)

  • 1.12.0(Feb 23, 2021)

    Summary

    CUB 1.12.0 is a bugfix release accompanying the NVIDIA HPC SDK 21.3 release and the CUDA Toolkit 11.4 release.

    Radix sort is now stable when both +0.0 and -0.0 are present in the input (they are treated as equivalent). Many compilation warnings and subtle overflow bugs were fixed in the device algorithms, including a long-standing bug that returned invalid temporary storage requirements when num_items was close to (but not exceeding) INT32_MAX. Support for Clang < 7.0 and MSVC < 2019 (aka 19.20/16.0/14.20) is now deprecated.

    Breaking Changes

    • NVIDIA/cub#256: Deprecate Clang < 7 and MSVC < 2019.

    New Features

    • NVIDIA/cub#218: Radix sort now treats -0.0 and +0.0 as equivalent for floating point types, which is required for the sort to be stable. Thanks to Andy Adinets for this contribution.

    Bug Fixes

    • NVIDIA/cub#247: Suppress newly triggered warnings in Clang. Thanks to Andrew Corrigan for this contribution.
    • NVIDIA/cub#249: Enable stricter warning flags. This fixes a number of outstanding issues:
      • NVIDIA/cub#221: Overflow in temp_storage_bytes when num_items close to (but not over) INT32_MAX.
      • NVIDIA/cub#228: CUB uses non-standard C++ extensions that break strict compilers.
      • NVIDIA/cub#257: Warning when compiling GridEvenShare with unsigned offsets.
    • NVIDIA/cub#258: Use correct OffsetT in DispatchRadixSort::InitPassConfig. Thanks to Felix Kallenborn for this contribution.
    • NVIDIA/cub#259: Remove some problematic __forceinline__ annotations.

    Other Enhancements

    • NVIDIA/cub#123: Fix incorrect issue number in changelog. Thanks to Peet Whittaker for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.11.0(Nov 23, 2020)

    Summary

    CUB 1.11.0 is a major release providing bugfixes and performance enhancements. It includes a new DeviceRadixSort backend that improves performance by up to 2x on supported keys and hardware. Our CMake package and build system continue to see improvements with add_subdirectory support, installation rules, status messages, and other features that make CUB easier to use from CMake projects. The release includes several other bugfixes and modernizations, and received updates from 11 contributors.

    Breaking Changes

    • NVIDIA/cub#201: The intermediate accumulator type used when DeviceScan is invoked with different input/output types is now consistent with P0571. This may produce different results for some edge cases when compared with earlier releases of CUB.

    New Features

    • NVIDIA/cub#204: Faster DeviceRadixSort, up to 2x performance increase for 32/64-bit keys on Pascal and up (SM60+). Thanks to Andy Adinets for this contribution.
    • Unroll loops in BlockRadixRank to improve performance for 32-bit keys by 1.5-2x on Clang CUDA. Thanks to Justin Lebar for this contribution.
    • NVIDIA/cub#200: Allow CUB to be added to CMake projects via add_subdirectory.
    • NVIDIA/cub#214: Optionally add install rules when included with CMake's add_subdirectory. Thanks to Kai Germaschewski for this contribution.

    Bug Fixes

    • NVIDIA/cub#215: Fix integer truncation in AgentReduceByKey, AgentScan, and AgentSegmentFixup. Thanks to Rory Mitchell for this contribution.
    • NVIDIA/cub#225: Fix compile-time regression when defining CUB_NS_PREFIX/CUB_NS_POSTFIX macro. Thanks to Elias Stehle for this contribution.
    • NVIDIA/cub#210: Fix some edge cases in DeviceScan:
      • Use values from the input when padding temporary buffers. This prevents custom functors from getting unexpected values.
      • Prevent integer truncation when using large indices via the DispatchScan layer.
      • Use timesliced reads/writes for types > 128 bytes.
    • NVIDIA/cub#217: Fix and add test for cmake package install rules. Thanks to Keith Kraus and Kai Germaschewski for testing and discussion.
    • NVIDIA/cub#170, NVIDIA/cub#233: Update CUDA version checks to behave on Clang CUDA and nvc++. Thanks to Artem Belevich, Andrew Corrigan, and David Olsen for these contributions.
    • NVIDIA/cub#220, NVIDIA/cub#216: Various fixes for Clang CUDA. Thanks to Andrew Corrigan for these contributions.
    • NVIDIA/cub#231: Fix signedness mismatch warnings in unit tests.
    • NVIDIA/cub#231: Suppress GPU deprecation warnings.
    • NVIDIA/cub#214: Use semantic versioning rules for our CMake package's compatibility checks. Thanks to Kai Germaschewski for this contribution.
    • NVIDIA/cub#214: Use FindPackageHandleStandardArgs to print standard status messages when our CMake package is found. Thanks to Kai Germaschewski for this contribution.
    • NVIDIA/cub#207: Fix CubDebug usage in CachingDeviceAllocator::DeviceAllocate. Thanks to Andreas Hehn for this contribution.
    • Fix documentation for DevicePartition. Thanks to ByteHamster for this contribution.
    • Clean up unused code in DispatchScan. Thanks to ByteHamster for this contribution.

    Other Enhancements

    • NVIDIA/cub#213: Remove tuning policies for unsupported hardware (<SM35).
    • References to the old Github repository and branch names were updated.
      • Github's thrust/cub repository is now NVIDIA/cub
      • Development has moved from the master branch to the main branch.
    Source code(tar.gz)
    Source code(zip)
  • 1.10.0(Sep 23, 2020)

    Summary

    CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release and the CUDA Toolkit 11.2 release. It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017. It also overhauls CMake support. Finally, we now have a Code of Conduct for contributors: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md

    Breaking Changes

    • C++03 is no longer supported.
    • GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
    • C++11 is deprecated. Using this dialect will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT or CUB_IGNORE_DEPRECATED_CPP_11. Suppression is only a short term solution. We will be dropping support for C++11 in the near future.
    • CMake < 3.15 is no longer supported.
    • The default branch on GitHub is now called main.

    Other Enhancements

    • Contributor documentation: https://github.com/thrust/cub/blob/main/CONTRIBUTING.md
    • Code of Conduct: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md. Thanks to Conor Hoekstra for this contribution.
    • Added install targets to CMake builds.
    • C++17 support.

    Bug Fixes

    • thrust/thrust#1244: Check for macro collisions with system headers during header testing.
    • thrust/thrust#1153: Switch to placement new instead of assignment to construct items in uninitialized memory. Thanks to Hugh Winkler for this contribution.
    • thrust/cub#38: Fix cub::DeviceHistogram for size_t OffsetTs. Thanks to Leo Fang for this contribution.
    • thrust/cub#35: Fix GCC-5 maybe-uninitialized warning. Thanks to Rong Ou for this contribution.
    • thrust/cub#36: Qualify namespace for va_printf in _CubLog. Thanks to Andrei Tchouprakov for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.10-1(Sep 23, 2020)

    Summary

    CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release and the CUDA Toolkit 11.1 release.

    Bug Fixes

    • #1217: Move static local in cub::DeviceCount to a separate host-only function because NVC++ doesn't support static locals in host-device functions.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.10(May 19, 2020)

    Summary

    Thrust 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release. It adds CMake find_package support. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.

    Breaking Changes

    • Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
    • C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or CUB_IGNORE_DEPRECATED_CPP11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
    • GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

    New Features

    • CMake find_package support. Just point CMake at the cmake folder in your CUB include directory (ex: cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .) and then you can add CUB to your CMake project with find_package(CUB REQUIRED CONFIG).
    Source code(tar.gz)
    Source code(zip)
  • 1.9.9(May 19, 2020)

    CUB 1.9.9 (CUDA 11.0)

    Summary

    CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release. It introduces CMake support, version macros, platform detection machinery, and support for NVC++, which uses Thrust (and thus CUB) to implement GPU-accelerated C++17 Parallel Algorithms. Additionally, the scan dispatch layer was refactored and modernized. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.

    Breaking Changes

    • Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
    • C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or CUB_IGNORE_DEPRECATED_CPP11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
    • GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

    New Features

    • CMake support. Thanks to Francis Lemaire for this contribution.
    • Refactorized and modernized scan dispatch layer. Thanks to Francis Lemaire for this contribution.
    • Policy hooks for device-wide reduce, scan, and radix sort facilities to simplify tuning and allow users to provide custom policies. Thanks to Francis Lemaire for this contribution.
    • <cub/version.cuh>: CUB_VERSION, CUB_VERSION_MAJOR, CUB_VERISON_MINOR, CUB_VERSION_SUBMINOR, and CUB_PATCH_NUMBER.
    • Platform detection machinery:
      • <cub/util_cpp_dialect.cuh>: Detects the C++ standard dialect.
      • <cub/util_compiler.cuh>: host and device compiler detection.
      • <cub/util_deprecated.cuh>: CUB_DEPRECATED.
      • <cub/config.cuh>: Includes<cub/util_arch.cuh>,<cub/util_compiler.cuh>,<cub/util_cpp_dialect.cuh>,<cub/util_deprecated.cuh>,<cub/util_macro.cuh>,<cub/util_namespace.cuh>`
    • cub::DeviceCount and cub::DeviceCountUncached, caching abstractions for cudaGetDeviceCount.

    Other Enhancements

    • Lazily initialize the per-device CUDAattribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. Thanks to the NVIDIA PyTorch team for bringing this to our attention.
    • Make cub::SwitchDevice avoid setting/resetting the device if the current device is the same as the target device.

    Bug Fixes

    • Add explicit failure parameter to CAS in the CUB attribute cache to workaround a GCC 4.8 bug.
    • Revert a change in reductions that changed the signedness of the lane_id variable to suppress a warning, as this introduces a bug in optimized device code.
    • Fix initialization in cub::ExclusiveSum. Thanks to Conor Hoekstra for this contribution.
    • Fix initialization of the std::array in the CUB attribute cache.
    • Fix -Wsign-compare warnings. Thanks to Elias Stehle for this contribution.
    • Fix test_block_reduce.cu to build without parameters. Thanks to Francis Lemaire for this contribution.
    • Add missing includes to grid_even_share.cuh. Thanks to Francis Lemaire for this contribution.
    • Add missing includes to thread_search.cuh. Thanks to Francis Lemaire for this contribution.
    • Add missing includes to cub.cuh. Thanks to Felix Kallenborn for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.8-1(May 19, 2020)

    Summary

    CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms.

    Source code(tar.gz)
    Source code(zip)
  • 1.9.8(May 19, 2020)

    Summary

    CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit. When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.

    Enhancements

    • (C++11 or later) Cache calls to cudaFuncGetAttributes and cudaDeviceGetAttribute within cub::PtxVersion and cub::SmVersion. These CUDA APIs acquire locks to CUDA driver/runtime mutex and perform poorly under contention; with the caching, they are 20 to 50x faster when called concurrently. Thanks to Bilge Acun for bringing this issue to our attention.
    • DispatchReduce now takes an OutputT template parameter so that users can specify the intermediate type explicitly.
    • Radix sort tuning policies updates to fix performance issues for element types smaller than 4 bytes.

    Bug Fixes

    • Change initialization style from copy initialization to direct initialization (which is more permissive) in AgentReduce to allow a wider range of types to be used with it.
    • Fix bad signed/unsigned comparisons in WarpReduce.
    • Fix computation of valid lanes in warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.
    Source code(tar.gz)
    Source code(zip)
  • 1.8.0(May 19, 2020)

    Summary

    CUB 1.8.0 introduces changes to the cub::Shuffle* interfaces.

    Breaking Changes

    • The interfaces of cub::ShuffleIndex, cub::ShuffleUp, and cub::ShuffleDown have been changed to allow for better computation of the PTX SHFL control constant for logical warps smaller than 32 threads.

    Bug Fixes

    • #112: Fix cub::WarpScan's broadcast of warp-wide aggregate for logical warps smaller than 32 threads.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.5(May 19, 2020)

    Summary

    CUB 1.7.5 adds support for radix sorting __half keys and improved sorting performance for 1 byte keys. It was incorporated into Thrust 1.9.2.

    Enhancements

    • Radix sort support for __half keys.
    • Radix sort tuning policy updates to improve 1 byte key performance.

    Bug Fixes

    • Syntax tweaks to mollify Clang.
    • #127: cub::DeviceRunLengthEncode::Encode returns incorrect results.
    • #128: 7-bit sorting passes fail for SM61 with large values.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.4(May 19, 2020)

    Summary

    CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.

    Bug Fixes

    • #114: Can't pair non-trivially-constructible values in radix sort.
    • #115: cub::WarpReduce segmented reduction is broken in CUDA 9 for logical warp sizes smaller than 32.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.3(May 19, 2020)

  • 1.7.2(May 19, 2020)

    Summary

    CUB 1.7.2 is a minor release.

    Bug Fixes

    • #104: Device-wide reduction is now "run-to-run" deterministic for pseudo-associative reduction operators (like floating point addition).
    Source code(tar.gz)
    Source code(zip)
  • 1.7.1(May 19, 2020)

    Summary

    CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling.

    Breaking Changes

    • Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.

    Other Enhancements

    • Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.

    Bug Fixes

    • #86: Incorrect results with reduce-by-key.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.0(May 19, 2020)

    Summary

    CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling. It was incorporated into Thrust 1.9.2.

    Breaking Changes

    • Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.

    Other Enhancements

    • Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.

    Bug Fixes

    • #86: Incorrect results with reduce-by-key.
    Source code(tar.gz)
    Source code(zip)
  • 1.6.4(May 19, 2020)

    Summary

    CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x (Pascal) GPUs.

    Enhancements

    • Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) - 3.5B and 3.4B 32 byte keys/s on TitanX and GTX 1080, respectively.

    Bug Fixes

    • Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
    • #65: cub::DeviceSegmentedRadixSort should allow inputs to have pointer-to-const type.
    • Mollify Clang device-side warnings.
    • Remove out-dated MSVC project files.
    Source code(tar.gz)
    Source code(zip)
  • 1.6.3(May 19, 2020)

    Summary

    CUB 1.6.3 improves support for Windows, changes cub::BlockLoad/cub::BlockStore interface to take the local data type, and enhances radix sort performance for SM6x (Pascal) GPUs.

    Breaking Changes

    • cub::BlockLoad and cub::BlockStore are now templated by the local data type, instead of the Iterator type. This allows for output iterators having void as their value_type (e.g. discard iterators).

    Other Enhancements

    • Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4 byte keys/s on GP100.
    • Improved support for Windows (warnings, alignment, etc).

    Bug Fixes

    • #74: cub::WarpReduce executes reduction operator for out-of-bounds items.
    • #72: cub:InequalityWrapper::operator should be non-const.
    • #71: cub::KeyValuePair won't work if Key has non-trivial constructor.
    • #69: cub::BlockStore::Storedoesn't compile ifOutputIteratorT::value_typeisn'tT`.
    • #68: cub::TilePrefixCallbackOp::WarpReduce doesn't permit PTX arch specialization.
    Source code(tar.gz)
    Source code(zip)
  • 1.6.2(May 19, 2020)

    Summary

    CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal) GPUs.

    Enhancements

    • Radix sort tuning policies updated for SM6x (Pascal) GPUs.

    Bug Fixes

    • Fix AArch64 compilation of cub::CachingDeviceAllocator.
    Source code(tar.gz)
    Source code(zip)
  • 1.6.1(May 19, 2020)

  • 1.6.0(May 19, 2020)

    Summary

    CUB 1.6.0 changes the scan and reduce interfaces. Exclusive scans now accept an "initial value" instead of an "identity value". Scans and reductions now support differing input and output sequence types. Additionally, many bugs have been fixed.

    Breaking Changes

    • Device/block/warp-wide exclusive scans have been revised to now accept an "initial value" (instead of an "identity value") for seeding the computation with an arbitrary prefix.
    • Device-wide reductions and scans can now have input sequence types that are different from output sequence types (as long as they are convertible).

    Other Enhancements

    • Reduce repository size by moving the doxygen binary to doc repository.
    • Minor reduction in cub::BlockScan instruction counts.

    Bug Fixes

    • Issue #55: Warning in cub/device/dispatch/dispatch_reduce_by_key.cuh.
    • Issue #59: cub::DeviceScan::ExclusiveSum can't prefix sum of float into double.
    • Issue #58: Infinite loop in cub::CachingDeviceAllocator::NearestPowerOf.
    • Issue #47: cub::CachingDeviceAllocator needs to clean up CUDA global error state upon successful retry.
    • Issue #46: Very high amount of needed memory from the cub::DeviceHistogram::HistogramEven.
    • Issue #45: cub::CachingDeviceAllocator fails with debug output enabled
    Source code(tar.gz)
    Source code(zip)
  • 1.5.2(May 19, 2020)

    Summary

    CUB 1.5.2 enhances cub::CachingDeviceAllocator and improves scan performance for SM5x (Maxwell).

    Enhancements

    • Improved medium-size scan performance on SM5x (Maxwell).
    • Refactored cub::CachingDeviceAllocator:
      • Now spends less time locked.
      • Uses C++11's std::mutex when available.
      • Failure to allocate a block from the runtime will retry once after freeing cached allocations.
      • Now respects max-bin, fixing an issue where blocks in excess of max-bin were still being retained in the free cache.

    Bug fixes:

    • Fix for generic-type reduce-by-key cub::WarpScan for SM3x and newer GPUs.
    Source code(tar.gz)
    Source code(zip)
  • 1.5.1(May 19, 2020)

    CUB 1.5.1

    Summary

    CUB 1.5.1 is a minor release.

    Bug Fixes

    • Fix for incorrect cub::DeviceRadixSort output for some small problems on SM52 (Mawell) GPUs.
    • Fix for macro redefinition warnings when compiling thrust::sort.
    Source code(tar.gz)
    Source code(zip)
  • 1.5.0(May 19, 2020)

    CUB 1.5.0

    CUB 1.5.0 introduces segmented sort and reduction primitives.

    New Features:

    • Segmented device-wide operations for device-wide sort and reduction primitives.

    Bug Fixes:

    • #36: cub::ThreadLoad generates compiler errors when loading from pointer-to-const.
    • #29: cub::DeviceRadixSort::SortKeys<bool> yields compiler errors.
    • #26: Misaligned address after cub::DeviceRadixSort::SortKeys.
    • #25: Fix for incorrect results and crashes when radix sorting 0-length problems.
    • Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and warp-reduction on non-primitive data types (e.g. user-defined structs).
    • Fix small radix sorting problems where 0 temporary bytes were required and users code was invoking malloc(0) on some systems where that returns NULL. CUB assumed the user was asking for the size again and not running the sort.
    Source code(tar.gz)
    Source code(zip)
Owner
NVIDIA Corporation
NVIDIA Corporation
An unified library for fitting primitives from 3D point cloud data with both C++&Python API.

PrimitivesFittingLib An unified library for fitting multiple primitives from 3D point cloud data with both C++&Python API. The supported primitives ty

Yueci Deng 10 Jun 30, 2022
A pytorch implementation of instant-ngp, as described in Instant Neural Graphics Primitives with a Multiresolution Hash Encoding.

torch-ngp A pytorch implementation of instant-ngp, as described in Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. Note: This

hawkey 627 Aug 10, 2022
A GPU (CUDA) based Artificial Neural Network library

Updates - 05/10/2017: Added a new example The program "image_generator" is located in the "/src/examples" subdirectory and was submitted by Ben Bogart

Daniel Frenzel 91 Jun 13, 2022
GPU Cloth TOP in TouchDesigner using CUDA-enabled NVIDIA Flex

This project demonstrates how to use NVIDIA FleX for GPU cloth simulation in a TouchDesigner Custom Operator. It also shows how to render dynamic meshes from the texture data using custom PBR GLSL material shaders inside TouchDesigner.

Vinícius Ginja 37 Jul 27, 2022
GPU PyTorch TOP in TouchDesigner with CUDA-enabled OpenCV

PyTorchTOP This project demonstrates how to use OpenCV with CUDA modules and PyTorch/LibTorch in a TouchDesigner Custom Operator. Building this projec

David 65 Jun 15, 2022
A CUDA implementation of Lattice Boltzmann for fluid dynamics simulation

Lattice Boltzmann simulation I am conscious of being only an individual struggling weakly against the stream of time. But it still remains in my power

Long Nguyen 17 Mar 1, 2022
Tiny CUDA Neural Networks

This is a small, self-contained framework for training and querying neural networks. Most notably, it contains a lightning fast "fully fused" multi-layer perceptron as well as support for various advanced input encodings, losses, and optimizers.

NVIDIA Research Projects 1.5k Aug 13, 2022
BM3D denoising filter for VapourSynth, implemented in CUDA

VapourSynth-BM3DCUDA Copyright© 2021 WolframRhodium BM3D denoising filter for VapourSynth, implemented in CUDA Description Please check VapourSynth-BM

null 50 Jul 25, 2022
HIPIFY: Convert CUDA to Portable C++ Code

Tools to translate CUDA source code into portable HIP C++ automatically

ROCm Developer Tools 174 Aug 10, 2022
A easy-to-use image processing library accelerated with CUDA on GPU.

gpucv Have you used OpenCV on your CPU, and wanted to run it on GPU. Did you try installing OpenCV and get frustrated with its installation. Fret not

shrikumaran pb 4 Aug 14, 2021
CUDA-accelerated Apriltag detection and pose estimation.

Isaac ROS Apriltag Overview This ROS2 node uses the NVIDIA GPU-accelerated AprilTags library to detect AprilTags in images and publishes their poses,

NVIDIA Isaac ROS 40 Aug 3, 2022
Hardware-accelerated DNN model inference ROS2 packages using NVIDIA Triton/TensorRT for both Jetson and x86_64 with CUDA-capable GPU.

Isaac ROS DNN Inference Overview This repository provides two NVIDIA GPU-accelerated ROS2 nodes that perform deep learning inference using custom mode

NVIDIA Isaac ROS 42 Jul 18, 2022
CUDA Custom Buffers and example blocks

gr-cuda CUDA Support for GNU Radio using the custom buffer changes introduced in GR 3.10. Custom buffers for CUDA-enabled hardware are provided that c

GNU Radio 4 Dec 9, 2021
Raytracer implemented with CPU and GPU using CUDA

Raytracer This is a training project aimed at learning ray tracing algorithm and practicing convert sequential CPU code into a parallelized GPU code u

Alex Kotovsky 2 Nov 29, 2021
PointPillars MultiHead 40FPS - A REAL-TIME 3D detection network [Pointpillars] compiled by CUDA/TensorRT/C++.

English | 简体中文 PointPillars High performance version of 3D object detection network -PointPillars, which can achieve the real-time processing (less th

Yan haixu 170 Jul 30, 2022
FoxRaycaster, optimized, fixed and with a CUDA option

Like FoxRaycaster(link) but with a nicer GUI, bug fixes, more optimized and with CUDA. Used in project: Code from FoxRaycaster, which was based on thi

Błażej Roszkowski 2 Oct 21, 2021
The dgSPARSE Library (Deep Graph Sparse Library) is a high performance library for sparse kernel acceleration on GPUs based on CUDA.

dgSPARSE Library Introdution The dgSPARSE Library (Deep Graph Sparse Library) is a high performance library for sparse kernel acceleration on GPUs bas

dgSPARSE 53 Aug 9, 2022
We implemented our own sequential version of GA, PSO, SA and ACA using C++ and the parallelized version with CUDA support

We implemented our own sequential version of GA, PSO, SA and ACA using C++ (some using Eigen3 as matrix operation backend) and the parallelized version with CUDA support. All of them are much faster than the popular lib scikit-opt.

Aron751 4 May 7, 2022
C++20 coroutines-based cooperative multitasking library

?? Coop Coop is a C++20 coroutines-based library to support cooperative multitasking in the context of a multithreaded application. The syntax will be

Jeremy Ong 73 Jul 3, 2022
cooperative testcases for ueb

tuwien-compilerbau-test-21 Kooperative Sammlung von Testfällen für die LVA "Übersetzerbau" der TU Wien im SS21. Bitte beachtet, dass dies eine koopera

svchost.exe 16 Jun 12, 2021
experimental cooperative threading library for gba in pure C

gba-co-thread Experimental cooperative threading library for Gameboy Advance in pure C. See co_thread.h and co_thread.c for the tiny threading library

Evan Bowman 12 Jul 24, 2022
Concurrency Kit 2.1k Aug 2, 2022
Optimized primitives for collective multi-GPU communication

NCCL Optimized primitives for inter-GPU communication. Introduction NCCL (pronounced "Nickel") is a stand-alone library of standard communication rout

NVIDIA Corporation 1.8k Aug 4, 2022
High-level networking API for real-time simulations with primitives for remote procedure call and object state replication

tnl2 - Torque Network Library version 2 tnl2 is a high-level networking API for real-time simulations with primitives for remote procedure call and o

Mark Frohnmayer 23 Apr 10, 2022
Reference implementations of post-quantum cryptographic primitives

PQ Crypto Catalog Implementation of quantum-safe signature and KEM schemes submitted to NIST PQC Standardization Process. The goal is to provide an ea

Kris Kwiatkowski 21 Jul 7, 2022
Triton - a language and compiler for writing highly efficient custom Deep-Learning primitives.

Triton - a language and compiler for writing highly efficient custom Deep-Learning primitives.

OpenAI 3.9k Aug 12, 2022
An unified library for fitting primitives from 3D point cloud data with both C++&Python API.

PrimitivesFittingLib An unified library for fitting multiple primitives from 3D point cloud data with both C++&Python API. The supported primitives ty

Yueci Deng 10 Jun 30, 2022
Thread pool - Thread pool using std::* primitives from C++17, with optional priority queue/greenthreading for POSIX.

thread_pool Thread pool using std::* primitives from C++11. Also includes a class for a priority thread pool. Requires concepts and C++17, including c

Tyler Hardin 75 Jul 13, 2022
A pytorch implementation of instant-ngp, as described in Instant Neural Graphics Primitives with a Multiresolution Hash Encoding.

torch-ngp A pytorch implementation of instant-ngp, as described in Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. Note: This

hawkey 627 Aug 10, 2022