Cooperative primitives for CUDA C++.

Overview

About CUB

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model.

(Figure: orientation of CUB's collective primitives within the CUDA software stack.)

CUB is included in the NVIDIA HPC SDK and the CUDA Toolkit.

We recommend the CUB Project Website for further information and examples.



A Simple Example

#include <cub/cub.cuh>

// Block-sorting CUDA kernel
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
     using namespace cub;

     // Specialize BlockRadixSort, BlockLoad, and BlockStore for 128 threads
     // owning 16 integer items each
     typedef BlockRadixSort<int, 128, 16>                     BlockRadixSort;
     typedef BlockLoad<int, 128, 16, BLOCK_LOAD_TRANSPOSE>   BlockLoad;
     typedef BlockStore<int, 128, 16, BLOCK_STORE_TRANSPOSE> BlockStore;

     // Allocate shared memory
     __shared__ union {
         typename BlockRadixSort::TempStorage  sort;
         typename BlockLoad::TempStorage       load;
         typename BlockStore::TempStorage      store;
     } temp_storage;

     int block_offset = blockIdx.x * (128 * 16);   // OffsetT for this block's segment

     // Obtain a segment of 2048 consecutive keys that are blocked across threads
     int thread_keys[16];
     BlockLoad(temp_storage.load).Load(d_in + block_offset, thread_keys);
     __syncthreads();

     // Collectively sort the keys
     BlockRadixSort(temp_storage.sort).Sort(thread_keys);
     __syncthreads();

     // Store the sorted segment
     BlockStore(temp_storage.store).Store(d_out + block_offset, thread_keys);
}

Each thread block uses cub::BlockRadixSort to collectively sort its own input segment. The class is specialized by the data type being sorted, by the number of threads per block, by the number of keys per thread, and implicitly by the targeted compilation architecture.

The cub::BlockLoad and cub::BlockStore classes are similarly specialized. Furthermore, to provide coalesced accesses to device memory, these primitives are configured to access memory using a striped access pattern (where consecutive threads simultaneously access consecutive items) and then transpose the keys into a blocked arrangement of elements across threads.

Once specialized, these classes expose opaque TempStorage member types. The thread block uses these storage types to statically allocate the union of shared memory needed by the thread block. (Alternatively, these storage types could be aliased to global memory allocations.)
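
For context, a minimal host-side sketch of how such a kernel might be driven is shown below. The buffer management, launch configuration, and the assumption that num_items is a multiple of the 2048-key tile are illustrative additions, not part of the original example.

// Hypothetical host-side driver for BlockSortKernel (illustrative only).
// Assumes num_items is a multiple of 2048 (128 threads x 16 keys per thread).
void SortWithBlockSortKernel(const int *h_in, int *h_out, int num_items)
{
     const int tile_size = 128 * 16;   // keys per thread block
     int *d_in = nullptr, *d_out = nullptr;

     cudaMalloc(&d_in,  num_items * sizeof(int));
     cudaMalloc(&d_out, num_items * sizeof(int));
     cudaMemcpy(d_in, h_in, num_items * sizeof(int), cudaMemcpyHostToDevice);

     // One 128-thread block per 2048-key tile
     BlockSortKernel<<<num_items / tile_size, 128>>>(d_in, d_out);

     cudaMemcpy(h_out, d_out, num_items * sizeof(int), cudaMemcpyDeviceToHost);
     cudaFree(d_in);
     cudaFree(d_out);
}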



Supported Compilers

CUB is regularly tested using the specified versions of the following compilers. Unsupported versions may emit deprecation warnings, which can be silenced by defining CUB_IGNORE_DEPRECATED_COMPILER during compilation.

  • NVCC 11.0+
  • NVC++ 20.9+
  • GCC 5+
  • Clang 7+
  • MSVC 2019+ (19.20/16.0/14.20)



Releases

CUB is distributed with the NVIDIA HPC SDK and the CUDA Toolkit in addition to GitHub.

See the changelog for details about specific releases.

| CUB Release | Included In |
|-------------|-------------|
| 1.14.0 | NVIDIA HPC SDK 21.9 |
| 1.13.1 | CUDA Toolkit 11.5 |
| 1.13.0 | NVIDIA HPC SDK 21.7 |
| 1.12.1 | CUDA Toolkit 11.4 |
| 1.12.0 | NVIDIA HPC SDK 21.3 |
| 1.11.0 | CUDA Toolkit 11.3 |
| 1.10.0 | NVIDIA HPC SDK 20.9 & CUDA Toolkit 11.2 |
| 1.9.10-1 | NVIDIA HPC SDK 20.7 & CUDA Toolkit 11.1 |
| 1.9.10 | NVIDIA HPC SDK 20.5 |
| 1.9.9 | CUDA Toolkit 11.0 |
| 1.9.8-1 | NVIDIA HPC SDK 20.3 |
| 1.9.8 | CUDA Toolkit 11.0 Early Access |
| 1.8.0 | |
| 1.7.5 | Thrust 1.9.2 |
| 1.7.4 | Thrust 1.9.1-2 |
| 1.7.3 | |
| 1.7.2 | |
| 1.7.1 | |
| 1.7.0 | Thrust 1.9.0-5 |
| 1.6.4 | |
| 1.6.3 | |
| 1.6.2 (previously 1.5.5) | |
| 1.6.1 (previously 1.5.4) | |
| 1.6.0 (previously 1.5.3) | |
| 1.5.2 | |
| 1.5.1 | |
| 1.5.0 | |
| 1.4.1 | |
| 1.4.0 | |
| 1.3.2 | |
| 1.3.1 | |
| 1.3.0 | |
| 1.2.3 | |
| 1.2.2 | |
| 1.2.0 | |
| 1.1.1 | |
| 1.0.2 | |
| 1.0.1 | |
| 0.9.4 | |
| 0.9.2 | |
| 0.9.1 | |
| 0.9.0 | |



Development Process

CUB and Thrust depend on each other. It is recommended to clone Thrust and build CUB as a component of Thrust.

CUB uses the CMake build system to build unit tests, examples, and header tests. To build CUB as a developer, follow this recipe:

# Clone Thrust and CUB from Github. CUB is located in Thrust's
# `dependencies/cub` submodule.
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Create build directory:
mkdir build
cd build

# Configure -- use one of the following:
cmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..   # Command line interface.
ccmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..  # ncurses GUI (Linux only)
cmake-gui  # Graphical UI, set source/build directories and options in the app

# Build:
cmake --build . -j                       # invokes make (or ninja, etc.)

# Run tests and examples:
ctest

By default, the C++14 standard is targeted, but this can be changed in CMake. More information on configuring your CUB build and creating a pull request can be found in CONTRIBUTING.md.



Open Source License

CUB is available under the "New BSD" open-source license:

Copyright (c) 2010-2011, Duane Merrill.  All rights reserved.
Copyright (c) 2011-2018, NVIDIA CORPORATION.  All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
   *  Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
   *  Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
   *  Neither the name of the NVIDIA CORPORATION nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Comments
  • Add BLOCK_LOAD_STRIPED and BLOCK_STORE_STRIPED

    This PR adds the following to BlockLoadAlgorithm:

    1. BLOCK_LOAD_STRIPED: essentially BLOCK_LOAD_TRANSPOSE without the BlockExchange step.

    This PR adds the following to BlockStoreAlgorithm:

    1. BLOCK_STORE_STRIPED: essentially BLOCK_STORE_TRANSPOSE without the BlockExchange step. (A usage sketch follows.)
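
    As a rough illustration (not taken from the PR), a kernel specializing cub::BlockLoad and cub::BlockStore with the new striped algorithms might look like the sketch below; the kernel name and sizes are made up.

    // Hypothetical kernel using the new striped algorithms (sizes are arbitrary).
    __global__ void StripedCopyKernel(const int *d_in, int *d_out)
    {
        using BlockLoadT  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_STRIPED>;
        using BlockStoreT = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_STRIPED>;

        __shared__ union {
            typename BlockLoadT::TempStorage  load;
            typename BlockStoreT::TempStorage store;
        } temp_storage;

        int items[4];
        int block_offset = blockIdx.x * (128 * 4);

        // Striped arrangement: thread i handles items i, i + 128, i + 256, ...
        BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
        __syncthreads();
        BlockStoreT(temp_storage.store).Store(d_out + block_offset, items);
    }
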
    type: enhancement testing: gpuCI passed P1: should have testing: internal ci passed 
    opened by mnicely 21
  • Support future value for initial value for device scan


    Prototyping the support of device pointer for cub device scan.

    Example Usage:

    cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, cub::FutureValue<float>(ptr), num_items);
    

    Tests pass for this PR, but there is no documentation yet. I will add doc if @allisonvacanti thinks this is a good idea.

    Please review https://github.com/NVIDIA/thrust/pull/1519 for thrust change.

    testing: gpuCI passed P1: should have testing: internal ci passed helps: pytorch 
    opened by zasdfgbnm 18
  • Static links to tag releases changed


    Hi @brycelelbach, something changed when you revised the release logs yesterday. We used to point to https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz, but now it has changed to https://github.com/NVlabs/cub/archive/1.8.0.tar.gz (the "v" is missing). I think this is a breaking change. Any chance it could be reverted?

    opened by leofang 16
  • Faster Least Significant Digit Radix Sort Implementation


    • radix sort with decoupled look-back, 8 bits per pass and other optimizations
    • pull request to the previous CUB repository: https://github.com/brycelelbach/cub_historical_2019_2020/pull/26
    blocked testing: gpuCI passed testing: internal ci passed 
    opened by canonizer 11
  • Document that cub::DeviceRadixSort and cub::BlockRadixSort are stable


    There is nothing in the documentation https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html saying whether it is stable sort or not. It seems to be stable. If so, it would be great if this fact is mentioned in the documentation.
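
    As a rough illustration of what stability means here (not taken from the issue): with cub::DeviceRadixSort::SortPairs, values attached to equal keys should retain their original relative order. The function name and data below are made up.

    // Illustrative only. With a stable sort, the two equal keys (5) keep the
    // original order of their values:
    //   keys_in  = {5, 3, 5, 1}    values_in  = {'a', 'b', 'c', 'd'}
    //   keys_out = {1, 3, 5, 5}    values_out = {'d', 'b', 'a', 'c'}
    void StableSortPairs(const int *d_keys_in, int *d_keys_out,
                         const char *d_values_in, char *d_values_out, int num_items)
    {
        void *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;
        cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                        d_keys_in, d_keys_out,
                                        d_values_in, d_values_out, num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                        d_keys_in, d_keys_out,
                                        d_values_in, d_values_out, num_items);
    }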

    only: docs good first issue 
    opened by zasdfgbnm 10
  • cub::DeviceReduce::ReduceByKey() results are non-deterministic for floats


    @allisonvacanti @senior-zero

    The cub::DeviceReduce::ReduceByKey web page describes "run-to-run" determinism for addition of floating-point types, but the result looks wrong to me. Is this expected behavior or a bug?

    From my limited testing, I got run-to-run variation from the code below with the CUDA 11.6 SDK. Note that you need to run the program multiple times and will only occasionally see the error; it does not behave as the documentation describes, which promises "run-to-run" determinism.

    Note: this issue is mainly for cub, there is a similar issue for thrust (https://github.com/NVIDIA/thrust/issues/1621)

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    
    #include <thrust/copy.h>
    #include <thrust/fill.h>
    #include <thrust/sequence.h>
    #include <thrust/reduce.h>
    #include <iostream>
    
    #include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>
    
    
    int main() {
       auto const numElements = 250000;
       thrust::device_vector<double> data(numElements, 0.1);
       thrust::device_vector<double> keys(numElements, 1);
    
       thrust::device_vector<double> keys_out1(numElements);
       thrust::device_vector<double> keys_out2(numElements);
    
       thrust::device_vector<double> out1(numElements);
       thrust::device_vector<double> out2(numElements);
    
       thrust::host_vector<double> hostOut1(numElements);
       thrust::host_vector<double> hostOut2(numElements);
    
       thrust::device_vector<int> num_runs_out(1);
    
       // first run
       {
          size_t temp_storage_bytes = 0;
          cub::DeviceReduce::ReduceByKey(
             nullptr, temp_storage_bytes,
             keys.begin(), keys_out1.begin(),
             data.begin(), out1.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
          cub::DeviceReduce::ReduceByKey(
             d_temp_storage.data().get(), temp_storage_bytes,
             keys.begin(), keys_out1.begin(),
             data.begin(), out1.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          // copy out1 to the host
          thrust::copy(out1.begin(), out1.begin() + num_runs_out[0], hostOut1.begin());
       }
    
       // second run
       {
          size_t temp_storage_bytes = 0;
          cub::DeviceReduce::ReduceByKey(
             nullptr, temp_storage_bytes,
             keys.begin(), keys_out2.begin(),
             data.begin(), out2.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
          cub::DeviceReduce::ReduceByKey(
             d_temp_storage.data().get(), temp_storage_bytes,
             keys.begin(), keys_out2.begin(),
             data.begin(), out2.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          // copy out2 to the host
          thrust::copy(out2.begin(), out2.begin() + num_runs_out[0], hostOut2.begin());
       }
    
       // Check the outputs are exactly the same
       for (int i = 0; i < num_runs_out[0]; i++) {
          if (hostOut1[i] != hostOut2[i]) {
             std::cout << "Element " << i << " is not equal" << std::endl;
          }
       }
    
       return 0;
    }
    
    duplicate only: docs type: bug: functional P1: should have 
    opened by lilohuang 9
  • New segmented sort algorithm


    This PR includes a new segmented sort facility. A few approaches to this problem exist.

    Embed segment number into keys

    This approach provides an elegant solution to the load-balancing issue but can lead to slowdowns. It also isn't applicable if the number of bits needed to represent the segment number exceeds the number of bits available in the key type.

    Modified merge sort approach

    This idea is implemented in moderngpu. I've used this approach as a reference for comparison with the new segmented sort algorithm. As I show below, this approach can be outperformed in most cases.

    Kernel specialisation

    The idea behind this approach is to partition input segments into size groups. Specialised kernels can further process each size group. The LRB approach initially discussed in the issue falls into this category. It also represents the approach that the new segmented sort algorithm relies on.

    I'm going to briefly describe the genesis of the new segmented sort algorithm to justify some design decisions.

    To minimise the number of kernel specialisations, I've benchmarked different approaches to sorting small segments (under a few hundred items): single-thread odd-even sorting, bitonic warp sorting, and the newly added warp merge sort. The warp-scope merge-sort approach demonstrated some advantages: it can sort bigger segments and outperforms the other methods in the majority of cases.

    Warp-scope merge sort is included in this PR as a separate facility. It's possible to partition architectural warp into multiple virtual ones to sort multiple segments simultaneously. The warp-scope merge sort duplicated a significant part of the previously introduced block-scope merge sort, so I extracted the merge-sort strategy into a separate facility. Both warp and block sort share this strategy.

    (Figures in the PR show the speedup of warp-scope merge sort over warp-bitonic sort and over single-thread odd-even sort, varying segment size and segment count.)

    To further increase the performance of warp-scope merge sort I needed to load and store data efficiently. I needed warp-scope load, store and exchange facilities. These facilities are also provided in this PR.

    Using a proper sorting algorithm was not enough. Initially, I've assigned a CUDA thread block to a segment. Although this approach demonstrated speedup over the existing one, it led to inefficient resource utilisation because most threads were idle. Nonetheless, a kernel like this is used as a fallback solution when there are not enough segments. If idle threads don't block other CTAs from execution, there's no reason to spend time on segments partitioning. The fallback kernel helped to eliminate cases when the partitioning stage led to the overall slowdown of the new algorithm.


    Initially, I implemented a single kernel for all size groups. Depending on the CTA number, I allocated a different number of threads per segment. That is, if the segment size exceeded a few thousand items, I've used slow block-scope radix sort. If the data was about a thousand items and fit into shared memory, I've used in-shared-memory block-scope radix sort. In all these cases, the whole CTA was assigned to a single segment. If the CTA number exceeded the number of large segments, I've partitioned CTA into multiple warp groups, each processing a separate segment. It happened that the large-segment branch limited the occupancy of small-segment one. So I separated this kernel into two. One kernel processes large segments and contains two branches - in-shared-memory sort and slow block-scope sort. Another kernel processes medium and small segments.

    To overlap small/large segments processing, I've used concurrent kernels. This PR contains a single-stream implementation, though. The multi-stream API is out of the scope of this PR and might be introduced later.

    Segments partitioning

    The LRB approach discussed in the initial issue balances segments in a non-deterministic way. In some cases, it led to slowdowns because consecutive sub-warps might not end up processing consecutive segments. (A figure in the PR shows the speedup of the LRB approach compared to the partitioning-based approach.)

    I've tried applying LRB only to the large-segments group. This approach also leads to mixed performance results. In rare cases, when an extremely large segment is located in the tail of the large-segments group, LRB leads to performance improvements. Otherwise, there are enough segments to overlap its processing, and the LRB stage leads to slowdowns. Therefore, we decided to make LRB opt-in. An API with pre-partitioned segments is going to be implemented later. (A figure in the PR shows the speedup of the version where LRB is applied to the large-segment group.)

    Instead of LRB, I've implemented a three-way partition facility, which is included in this PR. It leads to deterministic partitioning of segments and might be used outside of the segmented sort. The three-way partitioning stage is faster than the LRB stage.

    Temporary storage layout

    The temporary storage layout of the new segmented sort is quite complex. To simplify the code and make it safer, I've introduced temporary storage layout wrappers, which can be found in this PR.

    Performance

    To benchmark the new algorithm on real-world data, I've converted a sparse matrix collection into segment sizes. The number of segments is equal to the number of rows in a sparse matrix, while each segment size is equal to the number of non-zero values in that row. Here's the speedup of the new algorithm compared to cub::DeviceSegmentedRadixSort on an A100 GPU while sorting pairs of std::uint32_t and std::uint64_t (full results are plotted in the PR):

    | speedup | new algorithm | modernGPU |
    |---------|---------------|-----------|
    | min     | 1.11          | 0.16      |
    | median  | 61.24         | 7.93      |
    | max     | 5314.44       | 2219.73   |

    The speedup depends on the number of segments (see the figure in the PR).

    testing: gpuCI in progress P0: must have testing: internal ci in progress 
    opened by senior-zero 9
  • Bug in ExclusiveSum


    The following code prints

    0 100 -128 0
    

    NOTE: It overflows!!!!

    nvcc --version
    

    prints

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Sun_Mar_21_19:15:46_PDT_2021
    Cuda compilation tools, release 11.3, V11.3.58
    Build cuda_11.3.r11.3/compiler.29745058_0
    
    #include <cstdint>
    #include <iostream>
    #include <vector>

    #include "cub/cub.cuh"
    
    int main() {
      std::vector<int8_t> v = {100, 28, 1};
    
      int8_t *src;
      // ignore error check
      cudaMalloc((void **)&src, 3);
      cudaMemcpy(src, v.data(), 3, cudaMemcpyHostToDevice);
    
      int32_t *dst;
      cudaMalloc((void **)&dst, 4 * sizeof(int32_t));
    
      void *d_temp_storage = nullptr;
      size_t temp_storage_bytes = 0;
      cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, src, dst,
                                    3);
      cudaMalloc(&d_temp_storage, temp_storage_bytes);
      cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, src, dst,
                                    3);
    
      int32_t h[4];
      cudaMemcpy(h, dst, 4 * sizeof(int32_t), cudaMemcpyDeviceToHost);
    
      for (int32_t i = 0; i != 4; ++i) {
        std::cout << h[i] << " ";
      }
    
      std::cout << "\n";
      // ignore memory leak
      return 0;
    }
    
    question 
    opened by csukuangfj 9
  • Adds DeviceBatchMemcpy algorithm and tests


    Algorithm Overview

    The DeviceBatchMemcpy takes N input buffers and N output buffers and copies buffer_size[i] bytes from the i-th input buffer to the i-th output buffer. If any input buffer aliases memory from any output buffer the behavior is undefined. If any output buffer aliases memory of another output buffer the behavior is undefined. Input buffers can alias one another.
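
    The semantics can be pictured with a naive reference kernel like the sketch below. This is not the PR's implementation (which uses the size-class partitioning described next); the kernel and its launch shape are illustrative.

    // Naive reference for the described semantics: one thread per buffer,
    // byte-by-byte copy. The real DeviceBatchMemcpy is far more sophisticated.
    __global__ void NaiveBatchMemcpy(void *const *in_buffers,
                                     void *const *out_buffers,
                                     const size_t *buffer_sizes,
                                     int num_buffers)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_buffers)
        {
            const char *src = static_cast<const char *>(in_buffers[i]);
            char *dst       = static_cast<char *>(out_buffers[i]);
            for (size_t b = 0; b < buffer_sizes[i]; ++b)
                dst[b] = src[b];   // copy buffer_sizes[i] bytes from input i to output i
        }
    }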

    Implementation Details

    We distinguish each buffer by its size and assign it to one of three size classes:

    1. Thread-level buffer (TLEV buffer). A buffer that is processed by one or more threads but not a whole warp (e.g., up to 32 bytes).
    2. Warp-level buffer (WLEV buffer). A buffer that is processed by a whole warp (e.g., above 32 bytes but only up to 1024 bytes).
    3. Block-level buffer (BLEV buffer). A buffer that is processed by one or more thread blocks. The number of thread blocks assigned to such a buffer is proportional to its size (e.g., all buffers above 1024 bytes).

    Step 1: Partitioning Buffers by Size

    1. Each thread block loads a tile of buffer_size[i].
    2. Threads compute a three-bin histogram over their assigned buffer_size[ITEMS_PER_THREAD] chunk, binning buffers by the size class they fall into.
    3. An exclusive prefix sum is computed over the histograms. The prefix sum's aggregate reflects the number of buffers that fall into each size class. The prefix sum of each thread corresponds to the relative offset within each partition.
    4. Scatter the buffers into their partition. For each buffer, we scatter the tuple: {tile_buffer_id, buffer_size}, where tile_buffer_id is the buffer id, relative to the tile (i.e., from the interval [0, TILE_SIZE)). buffer_size is only defined for buffers that belong to the tlev partition and corresponds to the buffer's size (number of bytes) in that case.

    Before partitioning:

    | tile_buffer_id    | 0 | 1  | 2  | 3 | 4 | 5    | 6  | 7    |
    |-------------------|---|----|----|---|---|------|----|------|
    | tile_buffer_sizes | 3 | 37 | 17 | 4 | 9 | 4242 | 11 | 2000 |

    After partitioning:

    | size class        | T | T  | T | T | T  | W | B | B |
    |-------------------|---|----|---|---|----|---|---|---|
    | tile_buffer_id    | 0 | 2  | 3 | 4 | 6  | 1 | 5 | 7 |
    | tile_buffer_size  | 3 | 17 | 4 | 9 | 11 | - | - | - |

    Note that the partitioning does not need to be stable, though stability may be desirable if we expect neighbouring buffers to hold neighbouring byte segments.

    After the partitioning, each partition represents all the buffers that belong to the respective size class (i.e., one of TLEV, WLEV, BLEV). Depending on the size class, a different logic is applied. We process each partition separately.

    Step 2.a: Copying TLEV Buffers

    Usually, TLEV buffers are buffers of only a few bytes. Vectorised loads and stores do not really pay off here, as there are only a few bytes that can actually be read from a four-byte-aligned address. It does not pay off to have two different code paths for (a) loading individual bytes from non-aligned addresses and (b) doing vectorised loads from aligned addresses.

    Instead, we use the BlockRunLengthDecode algorithm to both (a) coalesce reads and writes and (b) load-balance the number of bytes copied by each thread. Specifically, we are able to assign neighbouring bytes to neighbouring threads.

    The following table illustrates how the first 8 bytes from the TLEV buffers get assigned to threads. Starting from the TLEV partition above:

    | tile_buffer_id   | 0 | 2  | 3 | 4 | 6  |
    |------------------|---|----|---|---|----|
    | tile_buffer_size | 3 | 17 | 4 | 9 | 11 |

    After the run-length decode [1], the first 8 bytes map to threads t0..t7:

    | thread           | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 |
    |------------------|----|----|----|----|----|----|----|----|
    | buffer_id        | 0  | 0  | 0  | 2  | 2  | 2  | 2  | 2  |
    | byte_of_buffer   | 0  | 1  | 2  | 0  | 1  | 2  | 3  | 4  |

    [1] Use BlockRunLengthDecode using the tile_buffer_id as the "unique_items" and each buffer's size as the respective run's length. The result from the run-length decode yields the assignment from threads to the buffer along with the specific byte from that buffer.

    Step 2.b: Copying WLEV Buffers

    A full warp is assigned to each WLEV buffer. Loads from the input buffer are vectorised (aliased to a wider data type), loading 4, 8, or even 16 bytes at a time from the input buffer's first address that is aligned to the aliased data type. The implementation of the vectorised copy is based on @gaohao95's (thanks!) string gather improvement in https://github.com/rapidsai/cudf/pull/7980/files

    I think we want to have the vectorised copy as a reusable component. But I wanted to coordinate on what exactly that would look like first. Should this be (a) a warp-/block-level copy or should we (b) separate it into a warp-&block-level vectorised load (which will also have the async copy, maybe) and a warp-&block-level vectorised store?

    Step 2.c: Enqueueing BLEV Buffers

    These are buffers that may be very large. We want to avoid a scenario where one very large buffer is being copied by a single thread block while other thread blocks sit idle. To avoid this, BLEV buffers are put into a queue that is picked up by a subsequent kernel. In that kernel, the number of thread blocks assigned to each buffer is proportional to the buffer's size.

    testing: gpuCI passed helps: rapids 
    opened by elstehle 8
  • 64-bit Offsets in DeviceRadixSort


    • 64-bit OffsetT is supported for onesweep sorting
    • for decoupled look-back, the partition kernel is broken into smaller parts (as before), and a separate 32-bit type is used there
    • for histograms, 32-bit counters are used in shared memory and OffsetT-sized counters in global memory
    type: enhancement testing: gpuCI in progress P1: should have testing: internal ci in progress 
    opened by canonizer 8
  • Faster segmented sorting (and segmented problems in general)


    Segmented problems can suffer from workload imbalance if the distribution of segment sizes varies. The workload imbalance becomes even more challenging when the number of segments is very large.

    The attached zip file has an efficient segmented sort that is based on NVIDIA's CUB library. Most of the sorting is done using CUB's sorting functionality. The key thing that this segmented sorting algorithm offers is simple and efficient load balancing (attachment: lrb_sort.cuh.zip).

    The solution used for the segmented sort is applicable to other segmented problems.

    Details of the algorithm (and performance) can be found in the following paper.

    type: enhancement P1: should have 
    opened by ogreen 8
  • Optimize Decoupled Look-Back


    Decoupled look-back is at the core of many CUB algorithms. This PR provides a few optimizations that help reduce contention on L2 and improve overall performance. This PR is intended as the first in a series of optimizations/tunings, so selecting the best parameters is out of scope for now. This PR also addresses the following issue by relying on strong operations (.relaxed, .release, etc.) instead of hints (.cg).

    Optimizations

    • Introduce a larger delay before loading the first look-back window since the data has a low probability of being updated.
    • Try loading tile states before falling into the spin loop. Once the first window is in the partial state, the previous ones should likely be as well. In other words, there's no point in waiting before loading.
    • Introduce a delay into the spin loop of WaitForValid to reduce contention and help the signal propagate faster.
    • Make tile state size at least four bytes (used to be two) to distribute the load between a larger number of cache lines.
    • Make flag size U32 instead of U8 (for the same reasons above) when message size doesn't let us use a single architectural word, and we have to store flags separately.

    Fixes

    The increase in tile state size revealed issues with .cg loads on P100 when the message size doesn't fit the architectural word. The fix consists of voting in the spin loop: while (WARP_ANY((status == SCAN_TILE_INVALID), 0xffffffff));. Although this might be considered a breaking change since ScanTileState<T, false>::WaitForValid didn't use to be cooperative, I think it's okay since ScanTileState<T, true>::WaitForValid is cooperative, and we haven't guaranteed that anyway.

    Results

    To benchmark the proposed optimizations, I've selected various GPUs and all algorithms that depend on decoupled look-back. Since the algorithm is sensitive to the compute/memory clock ratio, I've run benchmarks with both base and TDP-locked clocks. In general, on large problem sizes, the speedup is significant enough not to lock clocks. The PR includes a plot of the distribution of execution times for the old (main) and new (optimized) decoupled look-back for the device exclusive sum on A100.

    Apart from the speedup, the deviation of the new version is smaller. To illustrate broader results, I've grouped multiple benchmarks by the underlying algorithm. For instance, the select.if speedups for different input patterns and operations are combined into a single list as follows:

    |    T     |  Op  |  Pattern  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
    |----------|------|-----------|------------|------------|-------------|------------|-------------|--------------|---------|----------|
    |   U32    | Mid  |    Seq    |    2^28    |   3.302 ms |       3.31% |   2.484 ms |       0.35% |  -817.584 us | -24.76% |   FAIL   |
    |   U32    | Mid  |   Const   |    2^28    |   2.946 ms |       3.06% |   2.148 ms |       0.32% |  -797.100 us | -27.06% |   FAIL   |
    |   U32    | Mid  |   Rand    |    2^28    |   3.642 ms |       2.64% |   2.867 ms |       0.35% |  -774.785 us | -21.27% |   FAIL   |
    |   U32    | Zero |    Seq    |    2^28    |   3.610 ms |       2.86% |   2.820 ms |       0.35% |  -790.071 us | -21.89% |   FAIL   |
    |   U32    | Zero |   Const   |    2^28    |   3.543 ms |       3.05% |   2.750 ms |       0.37% |  -792.352 us | -22.37% |   FAIL   |
    |   U32    | Zero |   Rand    |    2^28    |   3.551 ms |       3.04% |   2.748 ms |       0.31% |  -803.270 us | -22.62% |   FAIL   |
    |   U32    | Even |    Seq    |    2^28    |   3.358 ms |       3.23% |   2.449 ms |       0.34% |  -908.775 us | -27.07% |   FAIL   |
    |   U32    | Even |   Const   |    2^28    |   3.704 ms |       3.05% |   2.748 ms |       0.33% |  -955.250 us | -25.79% |   FAIL   |
    |   U32    | Even |   Rand    |    2^28    |   3.389 ms |       2.64% |   2.524 ms |       0.27% |  -864.592 us | -25.51% |   FAIL   |
    

    Turns into:

    speedups = [24.76, 27.06, 21.27, 21.89, 22.37, 22.62, 27.07, 25.79, 25.51]
    

    Each such list is presented as a bar in the bar plots attached to the PR, so error bars should be read as different speedups in different scenarios rather than run-to-run variance. The PR contains aggregate results for both base clocks and TDP-locked clocks.

    type: bug: functional testing: gpuCI passed P0: must have area: performance 
    opened by senior-zero 0
  • Clarify alignment requirements


    Currently, we rely on AliasTemporaries to adjust alignment for allocations within temporary storage, so users should not have to ensure proper alignment of the temporary storage themselves. This is neither documented nor tested, so it's easy to break this behavior or to introduce new functionality without these guarantees.

    We should add tests and document alignment policy for temporary storage.
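
    For reference, the usual user-side pattern looks like the sketch below (the algorithm call is a placeholder, not part of this issue). Memory returned by cudaMalloc is aligned to at least 256 bytes, which is one reason alignment problems rarely surface; users who carve d_temp_storage out of a larger byte buffer are the ones relying on the AliasTemporaries behavior described above.

    // Typical two-phase temporary storage pattern (illustrative):
    void SumExample(const int *d_in, int *d_out, int num_items)
    {
        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;

        // First call only computes the required temporary storage size.
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

        // cudaMalloc returns memory aligned to at least 256 bytes; a sub-range of a
        // user-managed byte buffer might not be, which is where the undocumented
        // alignment fix-up matters.
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    }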

    P2: nice to have 
    opened by senior-zero 0
  • `DeviceHistogram` doesn't support `CounterT=int64_t` due to missing `atomicAdd` overload


    I am trying to use cub::DeviceHistogram::HistogramEven with CounterT=int64_t and get the following error:

    cub/agent/agent_histogram.cuh(370): error: no instance of overloaded function "atomicAdd" matches the argument list
                argument types are: (int64_t *, int64_t)
              detected during:
                instantiation of "void cub::CUB_101702_800_NS::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS, SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::AccumulatePixels(cub::CUB_101702_800_NS::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS, SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::SampleT (*)[NUM_CHANNELS], __nv_bool *, CounterT **, cub::CUB_101702_800_NS::Int2Type<1>) [with AgentHistogramPolicyT=cub::CUB_101702_800_NS::DispatchHistogram<1, 1, const uint32_t *, int64_t, int64_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=0, NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=const uint32_t *, CounterT=int64_t, PrivatizedDecodeOpT=cub::CUB_101702_800_NS::DispatchHistogram<1, 1, const uint32_t *, int64_t, int64_t, int>::ScaleTransform, OutputDecodeOpT=cub::CUB_101702_800_NS::DispatchHistogram<1, 1, const uint32_t *, int64_t, int64_t, int>::PassThruTransform, OffsetT=int, PTX_ARCH=800]"
    
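    A minimal sketch that should trigger the same instantiation (reconstructed from the error message above, not copied from the issue):

    #include <cstdint>
    #include <cub/cub.cuh>

    // int64_t bin counters select CounterT=int64_t, which hits the missing
    // atomicAdd(int64_t*, int64_t) overload reported above.
    void BuildHistogram(const std::uint32_t *d_samples, std::int64_t *d_histogram,
                        int num_samples)
    {
        const int num_levels = 257;   // 256 evenly spaced bins
        void *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;

        cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                            d_samples, d_histogram, num_levels,
                                            std::int64_t{0}, std::int64_t{1024},
                                            num_samples);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                            d_samples, d_histogram, num_levels,
                                            std::int64_t{0}, std::int64_t{1024},
                                            num_samples);
    }
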
    type: bug: functional P0: must have 
    opened by Nyrio 2
  • Handle operators that accept lvalue references


    Since 2.0 we compute the accumulator type for algorithms like reduce and scan as invoke_result_t<OpT, InitT, InputT>. This leads to compilation issues when the operator accepts lvalue references:

    #include <cub/cub.cuh>
    
    struct op_t {
    public:
      __host__ __device__ int operator()(int &v1, int &v2) { 
        return v1 * v2; 
      }
    };
    
    int main() {
      static_assert(std::is_same_v<cuda::std::invoke_result_t<op_t, int, int>, int>);
    }
    

    Having lvalue references shouldn't be allowed, since we assume that arguments are not modified when invoking the operator. Algorithms like std::accumulate also require:

    op must not modify any elements of the range involved. The signature of the function should be equivalent to the following:

    Ret fun(const Type1 &a, const Type2 &b); 
    

    In our documentation we explicitly require the operator to provide a T operator()(const T &a, const T &b) method. We should adjust the accumulator type computation to pass const references into invoke_result_t.
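
    A sketch of the documented operator form, together with a const-reference invoke_result_t query of the kind suggested above (the query shown is an assumption about the fix, not code from CUB):

    #include <type_traits>
    #include <cuda/std/type_traits>

    struct op_t
    {
      // The documented form: const lvalue references, no mutation.
      __host__ __device__ int operator()(const int &v1, const int &v2) const
      {
        return v1 * v2;
      }
    };

    // Hypothetical accumulator-type computation passing const references,
    // which accepts operators written against the documented signature:
    static_assert(
      std::is_same_v<cuda::std::invoke_result_t<op_t, const int &, const int &>, int>);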

    P1: should have 
    opened by senior-zero 0
  • Support for negative segment sizes


    In our segmented reduce we state:

    If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i-th segment is considered empty.

    We state the same thing in our segmented sort and segmented radix sort.

    While in the segmented reduce implementation, we check that:

    if (segment_begin == segment_end)
    

    to recognize empty segments. I don't think we test for negative segment sizes.

    I suggest we decide whether we want to support negative offsets and then add the necessary tests/fixes, or change the docs.
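
    If the decision is to support them, a minimal sketch of the guard (an assumption, not something decided in this issue):

    // Hypothetical guard extending the current equality check so that
    // negative-size segments are also treated as empty:
    if (segment_begin >= segment_end)
    {
      // empty (or negative-size) segment: nothing to reduce
    }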

    opened by senior-zero 2
Releases (1.17.2)
  • 1.17.2(Sep 13, 2022)

    Summary

    CUB 1.17.2 is a minor bugfix release.

    • NVIDIA/cub#547: Introduce an annotated inline namespace to prevent issues with collisions and mismatched kernel configurations across libraries. The new namespace encodes the CUB version and target SM architectures.
  • 2.0.0-rc2(Aug 15, 2022)

    Summary

    The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.

    This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.

    Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.

    Breaking Changes

    • NVIDIA/cub#448 Add libcu++ dependency (v1.8.0+).
    • NVIDIA/cub#448: The following macros are no longer defined by default. They can be re-enabled by defining CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release.
      • CUB_IS_HOST_CODE: Replace with NV_IF_TARGET.
      • CUB_IS_DEVICE_CODE: Replace with NV_IF_TARGET.
      • CUB_INCLUDE_HOST_CODE: Replace with NV_IF_TARGET.
      • CUB_INCLUDE_DEVICE_CODE: Replace with NV_IF_TARGET.
    • NVIDIA/cub#486: CUB’s CUDA Runtime support macros have been updated to support NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
      • CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.
        • Old behavior:
          • RDC enabled: Defined to __host__ __device__
          • RDC not enabled:
            • NVCC host pass: Defined to __host__ __device__
            • NVCC device pass: Defined to __host__
        • New behavior:
          • RDC enabled: Defined to __host__ __device__
          • RDC not enabled: Defined to __host__
      • CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:
        • RDC enabled: Macro is defined.
        • RDC not enabled:
          • NVCC host pass: Macro is defined.
          • NVCC device pass: Macro is not defined.
      • CUB_RDC_ENABLED: New macro, may be combined with NV_IF_TARGET to replace most usages of CUB_RUNTIME_ENABLED. Behavior:
        • RDC enabled: Macro is defined.
        • RDC not enabled: Macro is not defined.
    • NVIDIA/cub#509: A compile-time error is now emitted when a __device__-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).
      • Due to limitations in the CUDA programming model, the result of this query is unreliable and will silently return an incorrect result. This leads to difficult-to-debug errors.
      • When using libcu++ 1.9.0, an error will be emitted with information about work-arounds:
        • Use a named function object with a __device__-only implementation of operator().
        • Use a __host__ __device__ lambda.
        • Use cuda::proclaim_return_type (Added in libcu++ 1.9.0)
    • NVIDIA/cub#509: Use the result type of the binary reduction operator for accumulating intermediate results in the DeviceReduce algorithm, following guidance from http://wg21.link/P2322R6.
      • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
      • In addition to the behavioral changes, the interfaces for the Dispatch*Reduce layer have changed:
        • DispatchReduce:
          • Now accepts accumulator type as last parameter.
          • Now accepts initializer type instead of output iterator value type.
          • Constructor now accepts init as initial type instead of output iterator value type.
        • DispatchSegmentedReduce:
          • Accepts accumulator type as last parameter.
          • Accepts initializer type instead of output iterator value type.
      • Thread operators now accept parameters using different types: Equality, Inequality, InequalityWrapper, Sum, Difference, Division, Max, ArgMax, Min, ArgMin.
      • ThreadReduce now accepts accumulator type and uses a different type for prefix.
    • NVIDIA/cub#511: Use the result type of the binary operator for accumulating intermediate results in the DeviceScan, DeviceScanByKey, and DeviceReduceByKey algorithms, following guidance from http://wg21.link/P2322R6.
      • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
      • In addition to the behavioral changes, the interfaces for the Dispatch layer have changed:
        • DispatchScan now accepts accumulator type as a template parameter.
        • DispatchScanByKey now accepts accumulator type as a template parameter.
        • DispatchReduceByKey now accepts accumulator type as the last template parameter.
    • NVIDIA/cub#527: Deprecate the debug_synchronous flags on device algorithms.
      • This flag no longer has any effect. Define CUB_DEBUG_SYNC during compilation to enable these checks.
      • Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.

    New Features

    • NVIDIA/cub#514: Support __half in DeviceHistogram.
    • NVIDIA/cub#516: Add support for single-threaded invocations of WarpReduce.
    • NVIDIA/cub#516: Use __reduce_add_sync hardware acceleration for WarpReduce on supported architectures.

    Bug Fixes

    • NVIDIA/cub#481: Fix the device-wide radix sort implementations to simply copy the input to the output when begin_bit == end_bit.
    • NVIDIA/cub#487: Fix DeviceHistogram::Even for a variety of edge cases:
      • Bin ids are now correctly computed when mixing different types for SampleT and LevelT.
      • Bin ids are now correctly computed when LevelT is an integral type and the number of levels does not evenly divide the level range.
    • NVIDIA/cub#508: Ensure that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
    • NVIDIA/cub#508: Remove excessive calls to the binary operator given to the AdjacentDifferenceCopy device algorithms.
    • NVIDIA/cub#533: Fix debugging utilities when RDC is disabled.

    Other Enhancements

    • NVIDIA/cub#448: Removed special case code for unsupported CUDA architectures.
    • NVIDIA/cub#448: Replace several usages of __CUDA_ARCH__ with <nv/target> to handle host/device code divergence.
    • NVIDIA/cub#448: Mark unused PTX arch parameters as legacy.
    • NVIDIA/cub#476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
    • NVIDIA/cub#480: Add CUB_DISABLE_BF16_SUPPORT to avoid including the cuda_bf16.h header or using the __nv_bfloat16 type.
    • NVIDIA/cub#486: Add debug log messages for post-kernel debug synchronizations.
    • NVIDIA/cub#490: Clarify documentation for in-place usage of DeviceScan algorithms.
    • NVIDIA/cub#494: Clarify documentation for in-place usage of DeviceHistogram algorithms.
    • NVIDIA/cub#495: Clarify documentation for in-place usage of DevicePartition algorithms.
    • NVIDIA/cub#499: Clarify documentation for in-place usage of Device*Sort algorithms.
    • NVIDIA/cub#500: Clarify documentation for in-place usage of DeviceReduce algorithms.
    • NVIDIA/cub#501: Clarify documentation for in-place usage of DeviceRunLengthEncode algorithms.
    • NVIDIA/cub#503: Clarify documentation for in-place usage of DeviceSelect algorithms.
    • NVIDIA/cub#518: Fix typo in WarpMergeSort documentation.
    • NVIDIA/cub#519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.
  • 1.17.1(Aug 15, 2022)

    Summary

    CUB 1.17.1 is a minor bugfix release.

    • NVIDIA/cub#508: Ensure that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
    • NVIDIA/cub#508: Remove excessive calls to the binary operator given to the AdjacentDifferenceCopy device algorithms.
    • Fix device-side debug synchronous behavior in DeviceSegmentedSort.
  • 1.17.0(May 9, 2022)

    CUB 1.17.0

    Summary

    CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.

    Known Issues

    “Run-to-run” Determinism Broken

    Several CUB device algorithms are documented to provide deterministic results (per device) for non-associative reduction operators (e.g. floating-point addition). Unfortunately, the implementations of these algorithms contain performance optimizations that violate this guarantee. The DeviceReduce::ReduceByKey and DeviceScan algorithms are known to be affected. We’re currently evaluating the scope and impact of correcting this in a future CUB release. See NVIDIA/cub#471 for details.

    Bug Fixes

    • NVIDIA/cub#444: Fixed DeviceSelect to work with discard iterators and mixed input/output types.
    • NVIDIA/cub#452: Fixed install issue when CMAKE_INSTALL_LIBDIR contained nested directories. Thanks to @robertmaynard for this contribution.
    • NVIDIA/cub#462: Fixed bug that produced incorrect results from DeviceSegmentedSort on sm_61 and sm_70.
    • NVIDIA/cub#464: Fixed DeviceSelect::Flagged so that flags are normalized to 0 or 1.
    • NVIDIA/cub#468: Fixed overflow issues in DeviceRadixSort given num_items close to 2^32. Thanks to @canonizer for this contribution.
    • NVIDIA/cub#498: Fixed compiler regression in BlockAdjacentDifference. Thanks to @MKKnorr for this contribution.

    Other Enhancements

    • NVIDIA/cub#445: Remove device-sync in DeviceSegmentedSort when launched via CDP.
    • NVIDIA/cub#449: Fixed invalid link in documentation. Thanks to @kshitij12345 for this contribution.
    • NVIDIA/cub#450: BlockDiscontinuity: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
    • NVIDIA/cub#451: Replaced the deprecated TexRefInputIterator implementation with an alias to TexObjInputIterator. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB.
    • NVIDIA/cub#456: BlockAdjacentDifference: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
    • NVIDIA/cub#466: cub::DeviceAdjacentDifference API has been updated to use the new OffsetT deduction approach described in NVIDIA/cub#212.
    • NVIDIA/cub#470: Fix several doxygen-related warnings. Thanks to @karthikeyann for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.16.0(Feb 8, 2022)

    Summary

    CUB 1.16.0 is a major release providing several improvements to the device scope algorithms. DeviceRadixSort now supports large (64-bit indexed) input data. A new UniqueByKey algorithm has been added to DeviceSelect. DeviceAdjacentDifference provides new SubtractLeft and SubtractRight functionality.

    This release also deprecates several obsolete APIs, including type traits and BlockAdjacentDifference algorithms. Many bugfixes and documentation updates are also included.

    64-bit Offsets in DeviceRadixSort Public APIs

    Users frequently want to process large datasets using CUB’s device-scope algorithms, but the current public APIs limit input data sizes to those that can be indexed by a 32-bit integer. Beginning with this release, CUB is updating these APIs to support 64-bit offsets, as discussed in NVIDIA/cub#212.

    The device-scope algorithms will be updated with 64-bit offset support incrementally, starting with the cub::DeviceRadixSort family of algorithms. Thanks to @canonizer for contributing this functionality.

    New DeviceSelect::UniqueByKey Algorithm

    cub::DeviceSelect now provides a UniqueByKey algorithm, which has been ported from Thrust. Thanks to @zasdfgbnm for this contribution.

    New DeviceAdjacentDifference Algorithms

    The new cub::DeviceAdjacentDifference interface, also ported from Thrust, provides SubtractLeft and SubtractRight algorithms as CUB kernels.

    Deprecation Notices

    Synchronous CUDA Dynamic Parallelism Support

    A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

    This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. Such invocations will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

    Deprecated Traits

    CUB provided a variety of metaprogramming type traits in order to support C++03. Since C++14 is now required, these traits have been deprecated in favor of their STL equivalents, as shown below:

    | Deprecated CUB Trait | Replacement STL Trait | |-----------------------|-----------------------| | cub::If | std::conditional | | cub::Equals | std::is_same | | cub::IsPointer | std::is_pointer | | cub::IsVolatile | std::is_volatile | | cub::RemoveQualifiers | std::remove_cv | | cub::EnableIf | std::enable_if |

    CUB now uses the STL traits internally, resulting in a ~6% improvement in compile time.

    Misnamed cub::BlockAdjacentDifference APIs

    The algorithms in cub::BlockAdjacentDifference have been deprecated, as their names did not clearly describe their intent. The FlagHeads method is now SubtractLeft, and FlagTails has been replaced by SubtractRight.

    Breaking Changes

    • NVIDIA/cub#331: Deprecate the misnamed BlockAdjacentDifference::FlagHeads and FlagTails methods. Use the new SubtractLeft and SubtractRight methods instead.
    • NVIDIA/cub#364: Deprecate some obsolete type traits. These should be replaced by the equivalent traits in <type_traits> as described above.

    New Features

    • NVIDIA/cub#331: Port the thrust::adjacent_difference kernel and expose it as cub::DeviceAdjacentDifference.
    • NVIDIA/cub#405: Port the thrust::unique_by_key kernel and expose it as cub::DeviceSelect::UniqueByKey. Thanks to @zasdfgbnm for this contribution.

    Enhancements

    • NVIDIA/cub#340: Allow 64-bit offsets in DeviceRadixSort public APIs. Thanks to @canonizer for this contribution.
    • NVIDIA/cub#400: Implement a significant reduction in DeviceMergeSort compilation time.
    • NVIDIA/cub#415: Support user-defined CMAKE_INSTALL_INCLUDEDIR values in Thrust’s CMake install rules. Thanks to @robertmaynard for this contribution.

    Bug Fixes

    • NVIDIA/cub#381: Fix shared memory alignment in dyn_smem example.
    • NVIDIA/cub#393: Fix some collisions with the min/max macros defined in windows.h.
    • NVIDIA/cub#404: Fix bad cast in util_device.
    • NVIDIA/cub#410: Fix CDP issues in DeviceSegmentedSort.
    • NVIDIA/cub#411: Ensure that the nv_exec_check_disable pragma is only used on nvcc.
    • NVIDIA/cub#418: Fix -Wsizeof-array-div warning on gcc 11. Thanks to @robertmaynard for this contribution.
    • NVIDIA/cub#420: Fix new uninitialized variable warning in DiscardIterator on gcc 10.
    • NVIDIA/cub#423: Fix some collisions with the small macro defined in windows.h.
    • NVIDIA/cub#426: Fix some issues with version handling in CUB’s CMake packages.
    • NVIDIA/cub#430: Remove documentation for DeviceSpmv parameters that are absent from public APIs.
    • NVIDIA/cub#432: Remove incorrect documentation for DeviceScan algorithms that guaranteed run-to-run deterministic results for floating-point addition.
  • 1.15.0(Oct 25, 2021)

    Summary

    CUB 1.15.0 includes a new cub::DeviceSegmentedSort algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort when sorting a large number of small segments. A new cub::FutureValue<T> helper allows the cub::DeviceScan algorithms to lazily load the initial_value from a pointer. cub::DeviceScan also added ScanByKey functionality.

    The new DeviceSegmentedSort algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.
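
    A hedged usage sketch of the new segmented sort (function and parameter names are illustrative, and the usual two-phase temporary storage pattern is assumed):

    #include <cub/cub.cuh>

    // Illustrative only: sorts each segment [d_begin_offsets[i], d_end_offsets[i])
    // of d_keys_in into d_keys_out.
    void SegmentedSortExample(const float *d_keys_in, float *d_keys_out,
                              int num_items, int num_segments,
                              const int *d_begin_offsets, const int *d_end_offsets)
    {
        void *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;
        cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                           d_keys_in, d_keys_out,
                                           num_items, num_segments,
                                           d_begin_offsets, d_end_offsets);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                           d_keys_in, d_keys_out,
                                           num_items, num_segments,
                                           d_begin_offsets, d_end_offsets);
    }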

    cub::FutureValue<T> provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization:

    int *d_intermediate_result = ...;
    intermediate_kernel<<<blocks, threads>>>(d_intermediate_result,  // output
                                             arg1,                   // input
                                             arg2);                  // input
    
    // Wrap the intermediate pointer in a FutureValue -- no need to explicitly
    // sync when both kernels are stream-ordered. The pointer is read after
    // the ExclusiveScan kernel starts executing.
    cub::FutureValue<int> init_value(d_intermediate_result);
    
    cub::DeviceScan::ExclusiveScan(d_temp_storage,
                                   temp_storage_bytes,
                                   d_in,
                                   d_out,
                                   cub::Sum(),
                                   init_value,
                                   num_items);
    

    Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.

    Deprecation Notices

    A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

    This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

    Breaking Changes

    • NVIDIA/cub#305: The template parameters of cub::DispatchScan have changed to support the new cub::FutureValue helper. More details under "New Features".
    • NVIDIA/cub#377: Remove broken operator->() from cub::TransformInputIterator, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

    New Features

    • NVIDIA/cub#305: Add overloads to cub::DeviceScan algorithms that allow the output of a previous kernel to be used as initial_value without explicit synchronization. See the new cub::FutureValue helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
    • NVIDIA/cub#354: Add cub::BlockRunLengthDecode algorithm. Thanks to Elias Stehle (@elstehle) for this contribution.
    • NVIDIA/cub#357: Add cub::DeviceSegmentedSort, an optimized version of cub::DeviceSegmentedRadixSort with improved load balancing and small array performance.
    • NVIDIA/cub#376: Add "by key" overloads to cub::DeviceScan. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

    Bug Fixes

    • NVIDIA/cub#349: Doxygen and unused variable fixes.
    • NVIDIA/cub#363: Maintenance updates for the new cub::DeviceMergeSort algorithms.
    • NVIDIA/cub#382: Fix several -Wconversion warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
    • NVIDIA/cub#388: Fix debug assertion on MSVC when using cub::CachingDeviceAllocator.
    • NVIDIA/cub#395: Support building with __CUDA_NO_HALF_CONVERSIONS__. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
  • 1.14.0(Aug 24, 2021)

    CUB 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.

    This release provides the often-requested merge sort algorithm, ported from the thrust::sort implementation. Merge sort provides more flexibility than the existing radix sort by supporting arbitrary data types and comparators, though radix sorting is still faster for supported inputs. This functionality is provided through the new cub::DeviceMergeSort and cub::BlockMergeSort algorithms.
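
    A hedged sketch of the new merge sort with a custom comparator (names are illustrative, and the usual two-phase temporary storage pattern is assumed):

    #include <cub/cub.cuh>

    // Illustrative only: user-defined ordering, something radix sort cannot do.
    struct CustomGreater
    {
        __device__ bool operator()(const int &lhs, const int &rhs) const
        {
            return lhs > rhs;   // sort descending
        }
    };

    void MergeSortExample(int *d_keys, int num_items)
    {
        void *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;
        cub::DeviceMergeSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys, num_items, CustomGreater{});
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceMergeSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys, num_items, CustomGreater{});
    }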

    The namespace wrapping mechanism has been overhauled for 1.14. The existing macros (CUB_NS_PREFIX/CUB_NS_POSTFIX) can now be replaced by a single macro, CUB_WRAPPED_NAMESPACE, which is set to the name of the desired wrapped namespace. Defining a similar THRUST_CUB_WRAPPED_NAMESPACE macro will embed both thrust:: and cub:: symbols in the same external namespace. The prefix/postfix macros are still supported, but now require a new CUB_NS_QUALIFIER macro to be defined, which provides the fully qualified CUB namespace (e.g. ::foo::cub). See cub/util_namespace.cuh for details.
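
    For illustration, the wrapped-namespace mechanism described above might be used like this (the namespace name foo is arbitrary):

    // Illustrative only: wrap CUB's symbols in ::foo::cub.
    #define CUB_WRAPPED_NAMESPACE foo
    #include <cub/cub.cuh>

    // CUB's symbols are now reachable as foo::cub::...
    using BlockReduceT = foo::cub::BlockReduce<int, 128>;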

    Breaking Changes

    • NVIDIA/cub#350: When the CUB_NS_[PRE|POST]FIX macros are set, CUB_NS_QUALIFIER must also be defined to the fully qualified CUB namespace (e.g. #define CUB_NS_QUALIFIER ::foo::cub). Note that this is handled automatically when using the new [THRUST_]CUB_WRAPPED_NAMESPACE mechanism.

    New Features

    • NVIDIA/cub#322: Ported the merge sort algorithm from Thrust; cub::BlockMergeSort and cub::DeviceMergeSort are now available.
    • NVIDIA/cub#326: Simplify the namespace wrapper macros, and detect when Thrust's symbols are in a wrapped namespace.

    Bug Fixes

    • NVIDIA/cub#160, NVIDIA/cub#163, NVIDIA/cub#352: Fixed several bugs in cub::DeviceSpmv and added basic tests for this algorithm. Thanks to James Wyles and Seunghwa Kang for their contributions.
    • NVIDIA/cub#328: Fixed error handling bug and incorrect debugging output in cub::CachingDeviceAllocator. Thanks to Felix Kallenborn for this contribution.
    • NVIDIA/cub#335: Fixed a compile error affecting clang and NVRTC. Thanks to Jiading Guo for this contribution.
    • NVIDIA/cub#351: Fixed some errors in the cub::DeviceHistogram documentation.

    Enhancements

    • NVIDIA/cub#348: Add an example that demonstrates how to use dynamic shared memory with a CUB block algorithm. Thanks to Matthias Jouanneaux for this contribution.
  • 1.13.1(Oct 25, 2021)

    CUB 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.

    This release provides a new hook for embedding the cub:: namespace inside a custom namespace. This is intended to work around various issues related to linking multiple shared libraries that use CUB. The existing CUB_NS_PREFIX and CUB_NS_POSTFIX macros already provided this capability; this update provides a simpler mechanism that is extended to and integrated with Thrust. Simply define THRUST_CUB_WRAPPED_NAMESPACE to a namespace name, and both thrust:: and cub:: will be placed inside the new namespace. Using different wrapped namespaces for each shared library will prevent issues like those reported in NVIDIA/thrust#1401.

    New Features

    • NVIDIA/cub#326: Add THRUST_CUB_WRAPPED_NAMESPACE hooks.
  • 1.13.0(Jun 15, 2021)

    CUB 1.13.0 is the major release accompanying the NVIDIA HPC SDK 21.7 release.

    Notable new features include support for striped data arrangements in block load/store utilities, bfloat16 radix sort support, and fewer restrictions on offset iterators in segmented device algorithms. Several bugs in cub::BlockShuffle, cub::BlockDiscontinuity, and cub::DeviceHistogram have been addressed. The amount of code generated in cub::DeviceScan has been greatly reduced, leading to significant compile-time improvements when targeting multiple PTX architectures.

    This release also includes several user-contributed documentation fixes that will be reflected in CUB's online documentation in the coming weeks.

    Breaking Changes

    • NVIDIA/cub#320: Deprecated cub::TexRefInputIterator<T, UNIQUE_ID>. Use cub::TexObjInputIterator<T> as a replacement.
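    A minimal migration sketch is shown below; the pointer, size, and function names are placeholders, and error handling is abbreviated.

    #include <cub/cub.cuh>

    // Placeholder host-side setup: bind a device buffer to a texture object,
    // use the iterator (e.g. as input to a device-wide algorithm), then unbind.
    cudaError_t ReadThroughTexture(const float *d_in, size_t num_items)
    {
        cub::TexObjInputIterator<float> itr;

        cudaError_t error = itr.BindTexture(d_in, num_items * sizeof(float));
        if (error != cudaSuccess) return error;

        // ... pass itr to a kernel or a CUB device-wide algorithm here ...

        return itr.UnbindTexture();
    }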

    New Features

    • NVIDIA/cub#274: Add BLOCK_LOAD_STRIPED and BLOCK_STORE_STRIPED functionality to cub::BlockLoadAlgorithm and cub::BlockStoreAlgorithm; a brief sketch follows this list. Thanks to Matthew Nicely (@mnicely) for this contribution.
    • NVIDIA/cub#291: cub::DeviceSegmentedRadixSort and cub::DeviceSegmentedReduce now support different types for begin/end offset iterators. Thanks to Sergey Pavlov (@psvvsp) for this contribution.
    • NVIDIA/cub#306: Add bfloat16 support to cub::DeviceRadixSort. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
    • NVIDIA/cub#320: Introduce a new CUB_IGNORE_DEPRECATED_API macro that disables deprecation warnings on Thrust and CUB APIs.
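    As referenced above, here is a minimal sketch of the new striped block load/store algorithms. The kernel name, tile shape (128 threads by 4 items), and buffers are placeholders, not part of the release notes.

    #include <cub/cub.cuh>

    // Placeholder kernel: with the striped algorithms, thread i handles items
    // i, i + 128, i + 256, ... of the block's tile rather than a contiguous run.
    __global__ void StripedCopyKernel(const int *d_in, int *d_out)
    {
        using BlockLoad  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_STRIPED>;
        using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_STRIPED>;

        __shared__ union {
            typename BlockLoad::TempStorage  load;
            typename BlockStore::TempStorage store;
        } temp_storage;

        int block_offset = blockIdx.x * (128 * 4);
        int items[4];

        BlockLoad(temp_storage.load).Load(d_in + block_offset, items);
        __syncthreads();
        BlockStore(temp_storage.store).Store(d_out + block_offset, items);
    }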

    Bug Fixes

    • NVIDIA/cub#277: Fixed sanitizer warnings in RadixSortScanBinsKernels. Thanks to Andy Adinets (@canonizer) for this contribution.
    • NVIDIA/cub#287: cub::DeviceHistogram now correctly handles cases where OffsetT is not an int. Thanks to Dominique LaSalle (@nv-dlasalle) for this contribution.
    • NVIDIA/cub#311: Fixed several bugs and added tests for the cub::BlockShuffle collective operations.
    • NVIDIA/cub#312: Eliminate unnecessary kernel instantiations when compiling cub::DeviceScan. Thanks to Elias Stehle (@elstehle) for this contribution.
    • NVIDIA/cub#319: Fixed out-of-bounds memory access on debugging builds of cub::BlockDiscontinuity::FlagHeadsAndTails.
    • NVIDIA/cub#320: Fixed harmless missing return statement warning in unreachable cub::TexObjInputIterator code path.

    Other Enhancements

    • Several documentation fixes are included in this release.
      • NVIDIA/cub#275: Fixed comments describing the cub::If and cub::Equals utilities. Thanks to Rukshan Jayasekara (@rukshan99) for this contribution.
      • NVIDIA/cub#290: Documented that cub::DeviceSegmentedReduce will produce consistent results run-to-run on the same device for pseudo-associative reduction operators. Thanks to Himanshu (@himanshu007-creator) for this contribution.
      • NVIDIA/cub#298: CONTRIBUTING.md now refers to Thrust's build instructions for developer builds, which is the preferred way to build the CUB test harness. Thanks to Xiang Gao (@zasdfgbnm) for contributing.
      • NVIDIA/cub#301: Expand cub::DeviceScan documentation to include in-place support and add tests. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
      • NVIDIA/cub#307: Expand cub::DeviceRadixSort and cub::BlockRadixSort documentation to clarify stability, in-place support, and type-specific bitwise transformations. Thanks to Himanshu (@himanshu007-creator) for contributing.
      • NVIDIA/cub#316: Move WARP_TIME_SLICING documentation to the correct location. Thanks to Peter Han (@peter9606) for this contribution.
      • NVIDIA/cub#321: Update URLs from deprecated github.com to preferred github.io. Thanks to Lilo Huang (@lilohuang) for this contribution.
  • 1.12.1(Jun 15, 2021)

  • 1.12.0(Feb 23, 2021)

    Summary

    CUB 1.12.0 is a bugfix release accompanying the NVIDIA HPC SDK 21.3 release and the CUDA Toolkit 11.4 release.

    Radix sort is now stable when both +0.0 and -0.0 are present in the input (they are treated as equivalent). Many compilation warnings and subtle overflow bugs were fixed in the device algorithms, including a long-standing bug that returned invalid temporary storage requirements when num_items was close to (but not exceeding) INT32_MAX. Support for Clang < 7.0 and MSVC < 2019 (aka 19.20/16.0/14.20) is now deprecated.

    Breaking Changes

    • NVIDIA/cub#256: Deprecate Clang < 7 and MSVC < 2019.

    New Features

    • NVIDIA/cub#218: Radix sort now treats -0.0 and +0.0 as equivalent for floating point types, which is required for the sort to be stable. Thanks to Andy Adinets for this contribution.

    Bug Fixes

    • NVIDIA/cub#247: Suppress newly triggered warnings in Clang. Thanks to Andrew Corrigan for this contribution.
    • NVIDIA/cub#249: Enable stricter warning flags. This fixes a number of outstanding issues:
      • NVIDIA/cub#221: Overflow in temp_storage_bytes when num_items close to (but not over) INT32_MAX.
      • NVIDIA/cub#228: CUB uses non-standard C++ extensions that break strict compilers.
      • NVIDIA/cub#257: Warning when compiling GridEvenShare with unsigned offsets.
    • NVIDIA/cub#258: Use correct OffsetT in DispatchRadixSort::InitPassConfig. Thanks to Felix Kallenborn for this contribution.
    • NVIDIA/cub#259: Remove some problematic __forceinline__ annotations.

    Other Enhancements

    • NVIDIA/cub#123: Fix incorrect issue number in changelog. Thanks to Peet Whittaker for this contribution.
  • 1.11.0(Nov 23, 2020)

    Summary

    CUB 1.11.0 is a major release providing bugfixes and performance enhancements. It includes a new DeviceRadixSort backend that improves performance by up to 2x on supported keys and hardware. Our CMake package and build system continue to see improvements with add_subdirectory support, installation rules, status messages, and other features that make CUB easier to use from CMake projects. The release includes several other bugfixes and modernizations, and received updates from 11 contributors.

    Breaking Changes

    • NVIDIA/cub#201: The intermediate accumulator type used when DeviceScan is invoked with different input/output types is now consistent with P0571. This may produce different results for some edge cases when compared with earlier releases of CUB.

    New Features

    • NVIDIA/cub#204: Faster DeviceRadixSort, up to 2x performance increase for 32/64-bit keys on Pascal and up (SM60+). Thanks to Andy Adinets for this contribution.
    • Unroll loops in BlockRadixRank to improve performance for 32-bit keys by 1.5-2x on Clang CUDA. Thanks to Justin Lebar for this contribution.
    • NVIDIA/cub#200: Allow CUB to be added to CMake projects via add_subdirectory.
    • NVIDIA/cub#214: Optionally add install rules when included with CMake's add_subdirectory. Thanks to Kai Germaschewski for this contribution.

    Bug Fixes

    • NVIDIA/cub#215: Fix integer truncation in AgentReduceByKey, AgentScan, and AgentSegmentFixup. Thanks to Rory Mitchell for this contribution.
    • NVIDIA/cub#225: Fix compile-time regression when defining CUB_NS_PREFIX/CUB_NS_POSTFIX macro. Thanks to Elias Stehle for this contribution.
    • NVIDIA/cub#210: Fix some edge cases in DeviceScan:
      • Use values from the input when padding temporary buffers. This prevents custom functors from getting unexpected values.
      • Prevent integer truncation when using large indices via the DispatchScan layer.
      • Use timesliced reads/writes for types > 128 bytes.
    • NVIDIA/cub#217: Fix and add test for cmake package install rules. Thanks to Keith Kraus and Kai Germaschewski for testing and discussion.
    • NVIDIA/cub#170, NVIDIA/cub#233: Update CUDA version checks to behave correctly on Clang CUDA and nvc++. Thanks to Artem Belevich, Andrew Corrigan, and David Olsen for these contributions.
    • NVIDIA/cub#220, NVIDIA/cub#216: Various fixes for Clang CUDA. Thanks to Andrew Corrigan for these contributions.
    • NVIDIA/cub#231: Fix signedness mismatch warnings in unit tests.
    • NVIDIA/cub#231: Suppress GPU deprecation warnings.
    • NVIDIA/cub#214: Use semantic versioning rules for our CMake package's compatibility checks. Thanks to Kai Germaschewski for this contribution.
    • NVIDIA/cub#214: Use FindPackageHandleStandardArgs to print standard status messages when our CMake package is found. Thanks to Kai Germaschewski for this contribution.
    • NVIDIA/cub#207: Fix CubDebug usage in CachingDeviceAllocator::DeviceAllocate. Thanks to Andreas Hehn for this contribution.
    • Fix documentation for DevicePartition. Thanks to ByteHamster for this contribution.
    • Clean up unused code in DispatchScan. Thanks to ByteHamster for this contribution.

    Other Enhancements

    • NVIDIA/cub#213: Remove tuning policies for unsupported hardware (<SM35).
    • References to the old GitHub repository and branch names were updated.
      • GitHub's thrust/cub repository is now NVIDIA/cub.
      • Development has moved from the master branch to the main branch.
  • 1.10.0(Sep 23, 2020)

    Summary

    CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release and the CUDA Toolkit 11.2 release. It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017. It also overhauls CMake support. Finally, we now have a Code of Conduct for contributors: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md

    Breaking Changes

    • C++03 is no longer supported.
    • GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
    • C++11 is deprecated. Using this dialect will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT or CUB_IGNORE_DEPRECATED_CPP_11. Suppression is only a short term solution. We will be dropping support for C++11 in the near future.
    • CMake < 3.15 is no longer supported.
    • The default branch on GitHub is now called main.

    Other Enhancements

    • Contributor documentation: https://github.com/thrust/cub/blob/main/CONTRIBUTING.md
    • Code of Conduct: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md. Thanks to Conor Hoekstra for this contribution.
    • Added install targets to CMake builds.
    • C++17 support.

    Bug Fixes

    • thrust/thrust#1244: Check for macro collisions with system headers during header testing.
    • thrust/thrust#1153: Switch to placement new instead of assignment to construct items in uninitialized memory. Thanks to Hugh Winkler for this contribution.
    • thrust/cub#38: Fix cub::DeviceHistogram for size_t OffsetTs. Thanks to Leo Fang for this contribution.
    • thrust/cub#35: Fix GCC-5 maybe-uninitialized warning. Thanks to Rong Ou for this contribution.
    • thrust/cub#36: Qualify namespace for va_printf in _CubLog. Thanks to Andrei Tchouprakov for this contribution.
  • 1.9.10-1(Sep 23, 2020)

    Summary

    CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release and the CUDA Toolkit 11.1 release.

    Bug Fixes

    • #1217: Move static local in cub::DeviceCount to a separate host-only function because NVC++ doesn't support static locals in host-device functions.
  • 1.9.10(May 19, 2020)

    Summary

    CUB 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release. It adds CMake find_package support. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.

    Breaking Changes

    • Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
    • C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or CUB_IGNORE_DEPRECATED_CPP_11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
    • GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

    New Features

    • CMake find_package support. Just point CMake at the cmake folder in your CUB include directory (ex: cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .) and then you can add CUB to your CMake project with find_package(CUB REQUIRED CONFIG).
  • 1.9.9(May 19, 2020)


    Summary

    CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release. It introduces CMake support, version macros, platform detection machinery, and support for NVC++, which uses Thrust (and thus CUB) to implement GPU-accelerated C++17 Parallel Algorithms. Additionally, the scan dispatch layer was refactored and modernized. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.

    Breaking Changes

    • Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
    • C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or CUB_IGNORE_DEPRECATED_CPP_11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
    • GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

    New Features

    • CMake support. Thanks to Francis Lemaire for this contribution.
    • Refactored and modernized the scan dispatch layer. Thanks to Francis Lemaire for this contribution.
    • Policy hooks for device-wide reduce, scan, and radix sort facilities to simplify tuning and allow users to provide custom policies. Thanks to Francis Lemaire for this contribution.
    • <cub/version.cuh>: CUB_VERSION, CUB_VERSION_MAJOR, CUB_VERSION_MINOR, CUB_VERSION_SUBMINOR, and CUB_PATCH_NUMBER.
    • Platform detection machinery:
      • <cub/util_cpp_dialect.cuh>: Detects the C++ standard dialect.
      • <cub/util_compiler.cuh>: Host and device compiler detection.
      • <cub/util_deprecated.cuh>: CUB_DEPRECATED.
      • <cub/config.cuh>: Includes <cub/util_arch.cuh>, <cub/util_compiler.cuh>, <cub/util_cpp_dialect.cuh>, <cub/util_deprecated.cuh>, <cub/util_macro.cuh>, and <cub/util_namespace.cuh>.
    • cub::DeviceCount and cub::DeviceCountUncached, caching abstractions for cudaGetDeviceCount.
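    A minimal usage sketch of the new caching device-count helper follows; the function name is a placeholder, and a negative return value is assumed to indicate a failed query.

    #include <cub/cub.cuh>

    // Placeholder helper: cub::DeviceCount() caches the result of
    // cudaGetDeviceCount(); a negative value is assumed to mean the query failed.
    int SelectLastDevice()
    {
        int num_devices = cub::DeviceCount();   // cached after the first call
        if (num_devices <= 0)
            return -1;                          // no usable device

        int last = num_devices - 1;
        cudaSetDevice(last);
        return last;
    }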

    Other Enhancements

    • Lazily initialize the per-device CUDA attribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. Thanks to the NVIDIA PyTorch team for bringing this to our attention.
    • Make cub::SwitchDevice avoid setting/resetting the device if the current device is the same as the target device.

    Bug Fixes

    • Add explicit failure parameter to CAS in the CUB attribute cache to work around a GCC 4.8 bug.
    • Revert a change in reductions that changed the signedness of the lane_id variable to suppress a warning, as this introduces a bug in optimized device code.
    • Fix initialization in cub::ExclusiveSum. Thanks to Conor Hoekstra for this contribution.
    • Fix initialization of the std::array in the CUB attribute cache.
    • Fix -Wsign-compare warnings. Thanks to Elias Stehle for this contribution.
    • Fix test_block_reduce.cu to build without parameters. Thanks to Francis Lemaire for this contribution.
    • Add missing includes to grid_even_share.cuh. Thanks to Francis Lemaire for this contribution.
    • Add missing includes to thread_search.cuh. Thanks to Francis Lemaire for this contribution.
    • Add missing includes to cub.cuh. Thanks to Felix Kallenborn for this contribution.
  • 1.9.8-1(May 19, 2020)

    Summary

    CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms.

  • 1.9.8(May 19, 2020)

    Summary

    CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit. When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.

    Enhancements

    • (C++11 or later) Cache calls to cudaFuncGetAttributes and cudaDeviceGetAttribute within cub::PtxVersion and cub::SmVersion. These CUDA APIs acquire a lock on the CUDA driver/runtime mutex and perform poorly under contention; with caching, they are 20 to 50x faster when called concurrently. A brief sketch follows this list. Thanks to Bilge Acun for bringing this issue to our attention.
    • DispatchReduce now takes an OutputT template parameter so that users can specify the intermediate type explicitly.
    • Radix sort tuning policy updates to fix performance issues for element types smaller than 4 bytes.
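    As referenced above, here is a minimal sketch of querying the (now cached) PTX version. The wrapper function is a placeholder and error handling is reduced to a sentinel return value.

    #include <cub/cub.cuh>

    // Placeholder helper: returns the PTX version targeted for the current
    // device, or -1 on error. In C++11 builds the underlying attribute query
    // is cached, so repeated calls are cheap even across host threads.
    int QueryPtxVersion()
    {
        int ptx_version = 0;
        if (cub::PtxVersion(ptx_version) != cudaSuccess)
            return -1;
        return ptx_version;   // e.g. 700 when compiled for SM70
    }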

    Bug Fixes

    • Change initialization style from copy initialization to direct initialization (which is more permissive) in AgentReduce to allow a wider range of types to be used with it.
    • Fix bad signed/unsigned comparisons in WarpReduce.
    • Fix computation of valid lanes in warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.
  • 1.8.0(May 19, 2020)

    Summary

    CUB 1.8.0 introduces changes to the cub::Shuffle* interfaces.

    Breaking Changes

    • The interfaces of cub::ShuffleIndex, cub::ShuffleUp, and cub::ShuffleDown have been changed to allow for better computation of the PTX SHFL control constant for logical warps smaller than 32 threads.

    Bug Fixes

    • #112: Fix cub::WarpScan's broadcast of warp-wide aggregate for logical warps smaller than 32 threads.
  • 1.7.5(May 19, 2020)

    Summary

    CUB 1.7.5 adds support for radix sorting __half keys and improves sorting performance for 1-byte keys. It was incorporated into Thrust 1.9.2.

    Enhancements

    • Radix sort support for __half keys.
    • Radix sort tuning policy updates to improve 1-byte key performance.

    Bug Fixes

    • Syntax tweaks to mollify Clang.
    • #127: cub::DeviceRunLengthEncode::Encode returns incorrect results.
    • #128: 7-bit sorting passes fail for SM61 with large values.
  • 1.7.4(May 19, 2020)

    Summary

    CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.

    Bug Fixes

    • #114: Can't pair non-trivially-constructible values in radix sort.
    • #115: cub::WarpReduce segmented reduction is broken in CUDA 9 for logical warp sizes smaller than 32.
  • 1.7.3(May 19, 2020)

  • 1.7.2(May 19, 2020)

    Summary

    CUB 1.7.2 is a minor release.

    Bug Fixes

    • #104: Device-wide reduction is now "run-to-run" deterministic for pseudo-associative reduction operators (like floating point addition).
  • 1.7.1(May 19, 2020)

    Summary

    CUB 1.7.1 is a minor update to the 1.7 line, which brings support for CUDA 9.0 and SM7x (Volta) GPUs and is compatible with independent thread scheduling.

    Breaking Changes

    • Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions lack the lane mask needed for collectives to run on SM7x and newer GPUs, which have independent thread scheduling.

    Other Enhancements

    • Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.

    Bug Fixes

    • #86: Incorrect results with reduce-by-key.
  • 1.7.0(May 19, 2020)

    Summary

    CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling. It was incorporated into Thrust 1.9.0-5.

    Breaking Changes

    • Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions lack the lane mask needed for collectives to run on SM7x and newer GPUs, which have independent thread scheduling.

    Other Enhancements

    • Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.

    Bug Fixes

    • #86: Incorrect results with reduce-by-key.
  • 1.6.4(May 19, 2020)

    Summary

    CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x (Pascal) GPUs.

    Enhancements

    • Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) - 3.5B and 3.4B 32-bit keys/s on TitanX and GTX 1080, respectively.

    Bug Fixes

    • Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
    • #65: cub::DeviceSegmentedRadixSort should allow inputs to have pointer-to-const type.
    • Mollify Clang device-side warnings.
    • Remove out-dated MSVC project files.
  • 1.6.3(May 19, 2020)

    Summary

    CUB 1.6.3 improves support for Windows, changes cub::BlockLoad/cub::BlockStore interface to take the local data type, and enhances radix sort performance for SM6x (Pascal) GPUs.

    Breaking Changes

    • cub::BlockLoad and cub::BlockStore are now templated by the local data type, instead of the Iterator type. This allows for output iterators having void as their value_type (e.g. discard iterators).

    Other Enhancements

    • Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4-byte keys/s on GP100.
    • Improved support for Windows (warnings, alignment, etc).

    Bug Fixes

    • #74: cub::WarpReduce executes reduction operator for out-of-bounds items.
    • #72: cub::InequalityWrapper::operator() should be non-const.
    • #71: cub::KeyValuePair won't work if Key has non-trivial constructor.
    • #69: cub::BlockStore::Store doesn't compile if OutputIteratorT::value_type isn't T.
    • #68: cub::TilePrefixCallbackOp::WarpReduce doesn't permit PTX arch specialization.
  • 1.6.2(May 19, 2020)

    Summary

    CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal) GPUs.

    Enhancements

    • Radix sort tuning policies updated for SM6x (Pascal) GPUs.

    Bug Fixes

    • Fix AArch64 compilation of cub::CachingDeviceAllocator.
  • 1.6.1(May 19, 2020)

  • 1.6.0(May 19, 2020)

    Summary

    CUB 1.6.0 changes the scan and reduce interfaces. Exclusive scans now accept an "initial value" instead of an "identity value". Scans and reductions now support differing input and output sequence types. Additionally, many bugs have been fixed.

    Breaking Changes

    • Device/block/warp-wide exclusive scans have been revised to accept an "initial value" (instead of an "identity value") for seeding the computation with an arbitrary prefix; a brief sketch follows this list.
    • Device-wide reductions and scans can now have input sequence types that are different from output sequence types (as long as they are convertible).
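    As referenced above, a minimal sketch of the revised exclusive-scan interface at block scope is shown below. The kernel, the 128-thread tile, and the INT_MIN seed are placeholders chosen for a max-scan; they are not part of the release notes.

    #include <climits>

    #include <cub/cub.cuh>

    // Placeholder kernel: block-wide exclusive max-scan seeded with INT_MIN as
    // the initial value (an arbitrary prefix, not an identity).
    __global__ void ExclusiveMaxScanKernel(const int *d_in, int *d_out)
    {
        using BlockScan = cub::BlockScan<int, 128>;
        __shared__ typename BlockScan::TempStorage temp_storage;

        int value = d_in[threadIdx.x];
        BlockScan(temp_storage).ExclusiveScan(value, value, INT_MIN, cub::Max());
        d_out[threadIdx.x] = value;
    }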

    Other Enhancements

    • Reduce repository size by moving the doxygen binary to the doc repository.
    • Minor reduction in cub::BlockScan instruction counts.

    Bug Fixes

    • Issue #55: Warning in cub/device/dispatch/dispatch_reduce_by_key.cuh.
    • Issue #59: cub::DeviceScan::ExclusiveSum can't prefix sum of float into double.
    • Issue #58: Infinite loop in cub::CachingDeviceAllocator::NearestPowerOf.
    • Issue #47: cub::CachingDeviceAllocator needs to clean up CUDA global error state upon successful retry.
    • Issue #46: Very high amount of needed memory from the cub::DeviceHistogram::HistogramEven.
    • Issue #45: cub::CachingDeviceAllocator fails with debug output enabled.