libcu++: The C++ Standard Library for Your Entire System

Overview

libcu++, the NVIDIA C++ Standard Library, is the C++ Standard Library for your entire system. It provides a heterogeneous implementation of the C++ Standard Library that can be used in and between CPU and GPU code.

If you know how to use your C++ Standard Library, then you know how to use libcu++. All you have to do is add cuda/std/ to the start of your Standard Library includes and cuda:: before any uses of std::, like so:

#include <cuda/std/atomic>
cuda::std::atomic<int> x;

The NVIDIA C++ Standard Library is an open source project; it is available on GitHub and included in the NVIDIA HPC SDK and CUDA Toolkit. If you have one of those SDKs installed, no additional installation or compiler flags are needed to use libcu++.

cuda:: and cuda::std::

When used with NVCC, NVIDIA C++ Standard Library facilities live in their own header hierarchy and namespace with the same structure as, but distinct from, the host compiler's Standard Library:

  • std::/<*>: When using NVCC, this is the host compiler's Standard Library, which works in __host__ code only, although the --expt-relaxed-constexpr flag lets you use its constexpr functions in __device__ code. With NVCC, libcu++ does not replace or interfere with the host compiler's Standard Library.
  • cuda::std::/<cuda/std/*>: Strictly conforming implementations of facilities from the Standard Library that work in __host__ __device__ code.
  • cuda::/<cuda/*>: Conforming extensions to the Standard Library that work in __host__ __device__ code.
  • cuda::device::/<cuda/device/*>: Conforming extensions to the Standard Library that work only in __device__ code.

// Standard C++, __host__ only.
#include <atomic>
std::atomic<int> x;

// CUDA C++, __host__ __device__.
// Strictly conforming to the C++ Standard.
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++, __host__ __device__.
// Conforming extensions to the C++ Standard.
#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;

libcu++ is Heterogeneous

The NVIDIA C++ Standard Library works across your entire codebase, both in and across host and device code. libcu++ is a C++ Standard Library for your entire system, not just your GPU. Everything in cuda:: is __host__ __device__.

libcu++ facilities are designed to be passed between host and device code. Unless otherwise noted, any libcu++ object which is copyable or movable can be copied or moved between host and device code.

Synchronization objects work across host and device code, and can be used to synchronize between host and device threads. However, there are some restrictions to be aware of; please see the synchronization library section for more details.
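As an illustration, here is a minimal sketch (not taken from the libcu++ docs) of a host thread releasing a spinning device thread through a single system-scoped atomic in managed memory. It assumes a platform where host and device can concurrently access managed memory (cudaDevAttrConcurrentManagedAccess is nonzero); on other platforms this pattern deadlocks.

```cuda
#include <cuda/atomic>
#include <cstdio>
#include <new>

__global__ void wait_for_flag(cuda::atomic<int, cuda::thread_scope_system>* flag) {
    // Device thread spins until the host publishes the flag.
    while (flag->load(cuda::memory_order_acquire) == 0) { }
    printf("device observed the flag\n");
}

int main() {
    cuda::atomic<int, cuda::thread_scope_system>* flag;
    cudaMallocManaged(&flag, sizeof(*flag));
    new (flag) cuda::atomic<int, cuda::thread_scope_system>(0);

    wait_for_flag<<<1, 1>>>(flag);               // kernel starts spinning
    flag->store(1, cuda::memory_order_release);  // host thread releases it
    cudaDeviceSynchronize();
    cudaFree(flag);
}
```

The same cuda::atomic type works on both sides of the launch; only the thread scope (here, thread_scope_system) must be wide enough to cover both participating threads.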

cuda::device::

A small number of libcu++ facilities only work in device code, usually because there is no sensible implementation in host code.

Such facilities live in cuda::device::.

libcu++ is Incremental

Today, the NVIDIA C++ Standard Library delivers a high-priority subset of the C++ Standard Library, and each release expands the feature set. But it is a subset; not everything is available yet. The Standard API section lists the facilities available and the releases in which they were first introduced.

Licensing

The NVIDIA C++ Standard Library is an open source project developed on GitHub. It is NVIDIA's variant of LLVM's libc++. libcu++ is distributed under the Apache License v2.0 with LLVM Exceptions.

Conformance

The NVIDIA C++ Standard Library aims to be a conforming implementation of the C++ Standard, ISO/IEC IS 14882, Clause 16 through 32.

ABI Evolution

The NVIDIA C++ Standard Library does not maintain long-term ABI stability. Promising long-term ABI stability would prevent us from fixing mistakes and providing best-in-class performance. So, we make no such promises.

Every major CUDA Toolkit release, the ABI will be broken. The life cycle of an ABI version is approximately one year. Long-term support for an ABI version ends after approximately two years. Please see the versioning section for more details.

We recommend that you always recompile your code and dependencies with the latest NVIDIA SDKs and use the latest NVIDIA C++ Standard Library ABI. Live at head.

Comments
  • Add atomics for floating point types.


    This PR is a draft to add support for float/double atomics.

    Please review and let me know what is missing. Unfortunately, the diff between the old and new codegen output is a mess due to the reordering of operations.

    Also rolls back #282 and fixes #279

    enhancement testing: internal ci passed bug: functional 
    opened by sleeepyjack 18
  • Port `std::span` and enable it for C++11 onwards to support mdspan


    In addition to porting the span implementation from libc++ the PR does the following:

    • Remove outdated support for tuple interface
    • Remove outdated support for const_iterator
    • Change index_type to size_type

    I intentionally did not adopt the ranges support, as that is out of scope.

    testing: internal ci passed 
    opened by miscco 11
  • Alignment requirements of cuda::std::complex


    The following two static_asserts compile without issues:

    #include <cuda/std/complex>
    static_assert(alignof(cuda::std::complex<double>) == 8);
    static_assert(alignof(cuda::std::complex<float>) == 4);
    

    I'd expected them to be 16 and 8 to match the double2 and float2 types.

    bug: performance helps: quda 
    opened by gonzalobg 9
  • Redistributable


    I'm interested in supporting this library on my own compiler. Do you provide a binary redistributable that covers the non-header portions of the library? Would that be hiding in the hpc toolkit somewhere? Basically I'm looking for guidance on deployment.

    opened by seanbaxter 8
  • Backport C++17 type_traits and cuda::std::byte to C++14


    Broken off from https://github.com/NVIDIA/libcudacxx/pull/10

    • [x] Backports <type_traits> features from C++17 to make them available in C++14

    • [x] Backports tests of type traits to make them supported in C++14

    • [x] Backports cuda::std::byte to be available in C++14 (this was necessary to make some tests pass as a result of the <type_traits> changes)

    opened by jrhemstad 8
  • `cuda::atomic_ref<float>::fetch_min` and `fetch_max` incorrect results on values of different signs


    It looks like cuda::atomic_ref<T>::fetch_min for floating point types of T is not generating correct code. For these types a CAS loop implementation should be used, but as this Godbolt PTX shows, an unsigned integer atom.max is being generated. https://godbolt.org/z/sWeezx1a1

    The problem is in the conversion to unsigned: negative and positive floats reinterpreted as unsigned integers will not compare in the same order.

    The code in the Godbolt does a max of positive values with the initial max set to a negative value. It also does a min of negative values with the initial min set to a positive value. Both result in incorrect results.

    The correct output of the program should be

    0: Min: 0 Max: 511
    1: Min: 512 Max: 1023
    2: Min: 1024 Max: 1535
    3: Min: 1536 Max: 2047
    4: Min: 2048 Max: 2559
    5: Min: 2560 Max: 3071
    6: Min: 3072 Max: 3583
    7: Min: 3584 Max: 4095
    

    Instead, the program prints

    0: Min: 3.40282e+38 Max: -1
    1: Min: 3.40282e+38 Max: -1
    2: Min: 3.40282e+38 Max: -1
    3: Min: 3.40282e+38 Max: -1
    4: Min: 3.40282e+38 Max: -1
    5: Min: 3.40282e+38 Max: -1
    6: Min: 3.40282e+38 Max: -1
    7: Min: 3.40282e+38 Max: -1
    

    Tested on CUDA 11.5 on a Tesla V100 (DGX system).

    Discovered when trying to convert cuSpatial to use libcudacxx atomic_ref rather than a custom implementation which uses atomicCAS. RAPIDS would like to eliminate its custom atomic operation implementations, but this bug needs to be fixed first.

    CC @jrhemstad

    P0: must have bug: functional helps: rapids 
    opened by harrism 7
  • Fix buggy numerics of tanh(complex) at inf


    Because lim[x->+inf] tanh(x+iy) = 1 and lim[x->-inf] tanh(x+iy) = -1.

    Test:

    #include <complex>
    #include <cuda/std/complex>
    #include <iostream>
    
    constexpr float inf = std::numeric_limits<float>::infinity();
    
    int main() {
        float values[] = {inf, 1, 0, -1, -inf};
        for (float r : values) {
            for (float i : values) {
                std::complex<float> s = {r, i};
                cuda::std::complex<float> c = {r, i};
    
                auto ts = std::tanh(s);
                auto tc = cuda::std::tanh(c);
    
                std::cout << "input: (" << r << ", " << i << ")" << std::endl;
                std::cout << "std: (" << ts.real() << ", " << ts.imag() << ")" << std::endl;
                std::cout << "cuda::std: (" << tc.real() << ", " << tc.imag() << ")" << std::endl;
                std::cout << std::endl;
            }
        }
    }
    

    Before:

    input: (inf, inf)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 1)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 0)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, -1)
    std: (1, -0)
    cuda::std: (1, -0)
    
    input: (inf, -inf)
    std: (1, -0)
    cuda::std: (1, 0)
    
    input: (1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (1, 1)
    std: (1.08392, 0.271753)
    cuda::std: (1.08392, 0.271753)
    
    input: (1, 0)
    std: (0.761594, 0)
    cuda::std: (0.761594, 0)
    
    input: (1, -1)
    std: (1.08392, -0.271753)
    cuda::std: (1.08392, -0.271753)
    
    input: (1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, 1)
    std: (0, 1.55741)
    cuda::std: (0, 1.55741)
    
    input: (0, 0)
    std: (0, 0)
    cuda::std: (0, 0)
    
    input: (0, -1)
    std: (0, -1.55741)
    cuda::std: (0, -1.55741)
    
    input: (0, -inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, 1)
    std: (-1.08392, 0.271753)
    cuda::std: (-1.08392, 0.271753)
    
    input: (-1, 0)
    std: (-0.761594, 0)
    cuda::std: (-0.761594, 0)
    
    input: (-1, -1)
    std: (-1.08392, -0.271753)
    cuda::std: (-1.08392, -0.271753)
    
    input: (-1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-inf, inf)
    std: (-1, 0)
    cuda::std: (1, 0)
    
    input: (-inf, 1)
    std: (-1, 0)
    cuda::std: (1, 0)
    
    input: (-inf, 0)
    std: (-1, 0)
    cuda::std: (1, 0)
    
    input: (-inf, -1)
    std: (-1, -0)
    cuda::std: (1, -0)
    
    input: (-inf, -inf)
    std: (-1, -0)
    cuda::std: (1, 0)
    

    After:

    input: (inf, inf)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 1)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 0)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, -1)
    std: (1, -0)
    cuda::std: (1, -0)
    
    input: (inf, -inf)
    std: (1, -0)
    cuda::std: (1, 0)
    
    input: (1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (1, 1)
    std: (1.08392, 0.271753)
    cuda::std: (1.08392, 0.271753)
    
    input: (1, 0)
    std: (0.761594, 0)
    cuda::std: (0.761594, 0)
    
    input: (1, -1)
    std: (1.08392, -0.271753)
    cuda::std: (1.08392, -0.271753)
    
    input: (1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, 1)
    std: (0, 1.55741)
    cuda::std: (0, 1.55741)
    
    input: (0, 0)
    std: (0, 0)
    cuda::std: (0, 0)
    
    input: (0, -1)
    std: (0, -1.55741)
    cuda::std: (0, -1.55741)
    
    input: (0, -inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, 1)
    std: (-1.08392, 0.271753)
    cuda::std: (-1.08392, 0.271753)
    
    input: (-1, 0)
    std: (-0.761594, 0)
    cuda::std: (-0.761594, 0)
    
    input: (-1, -1)
    std: (-1.08392, -0.271753)
    cuda::std: (-1.08392, -0.271753)
    
    input: (-1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-inf, inf)
    std: (-1, 0)
    cuda::std: (-1, 0)
    
    input: (-inf, 1)
    std: (-1, 0)
    cuda::std: (-1, 0)
    
    input: (-inf, 0)
    std: (-1, 0)
    cuda::std: (-1, 0)
    
    input: (-inf, -1)
    std: (-1, -0)
    cuda::std: (-1, -0)
    
    input: (-inf, -inf)
    std: (-1, -0)
    cuda::std: (-1, 0)
    

    Thanks a lot cc @mruberry @ngimel for discussion.

    testing: internal ci passed 
    opened by zasdfgbnm 7
  • Revamp samples and benchmarks


    • [x] Created benchmarks/ to hold libcu++ benchmarks
    • [x] Moved benchmark.cu/.cpp from samples/ to benchmarks (renamed to concurrency_host/device, but this can be changed)
    • [x] Overhauled the old CMakeLists.txt for building the benchmarks to use more modern CMake
    • [x] Add CMakeLists.txt to samples/ to build the existing samples
    opened by jrhemstad 7
  • Move and update "memory model" docs


    This PR moves the "memory model" related docs to a separate page, so that they can be easily linked from the Programming Guide.

    It also updates the memory model docs.

    It provides a new guarantee that we did not provide before: a system-scope atomic operation is also atomic if it is a load or store that affects a naturally-aligned object of size 1, 2, 4, or 8 bytes on mapped memory.

    The scripts to build the site have stopped working, so this PR also updates those.

    The site layout is constrained to an 800px body width, which makes tables render poorly. This PR bumps the width to 1000px.

    Syntax highlighting within HTML tables seems to be buggy with Jekyll.

    opened by gonzalobg 6
  • Add clang-11 docker configurations


    Based on nvbug https://nvbugs/200700358

    added the required docker configuration related to the build with Clang-11

    The image is based on Ubuntu 20.04. In addition, we define DEBIAN_FRONTEND and TZ in the Docker layer.

    P1: should have only: tests 
    opened by veictry 6
  • Fix <tuple> on MSVC


    decltype is a culprit for a slew of MSVC bugs. About 50 failures have been fixed by hacking the __tuple_sfinae_base trait.

    Tests that still need to be addressed:

    Failing Tests (8):
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.apply/apply_extended_types.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/PR27684_contains_ref_to_incomplete_type.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/UTypes.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/alloc.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/deduct.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/nothrow_cnstr.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/test_lazy_sfinae.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.helper/tuple_size_structured_bindings.pass.cpp
    
      Expected Passes    : 70
      Unexpected Failures: 8
    
    opened by wmaxey 6
  • Improve tooling for error handling


    CUDA API function calls may fail in a variety of ways, some of which are more frequent than in host code (e.g. memory allocation). Therefore, coders need all the help they can get, which is the simple motivation behind this issue. The approach is multi-pronged:

    1. Implement std::(experimental::) expected, scope_exit, scope_fail, and scope_success

    These are time-tested tools that help with restoring the correct system state in the presence of errors. expected allows APIs that can be used with or without exceptions, and scope_* allow simple, correct transactional code in the presence of any early returns.

    2. Standardize the CUDA_SAFE_CALL (or similar) function/macro that everybody and their cat are defining on their own. Writing such an artifact has become a rite of passage for any programmer going from "I'm trying CUDA" to "I'm writing a CUDA application". Factoring such code into libcu++ will have a terrific reuse rate.

    I gratuitously include the ~~macro~~function we defined in our own app, partly to illustrate how inconsistent error information fetching is across CUDA/host, CUDA/device, cuBLAS, and cuSolver.

    template <typename T>
    void cuda_safe_call(
            const T status, const std::experimental::source_location loc = std::experimental::source_location::current()) {
        // All "success" statuses are zero
        static_assert(cudaSuccess == 0 && CUDA_SUCCESS == 0 && CUBLAS_STATUS_SUCCESS == 0 && CUSOLVER_STATUS_SUCCESS == 0,
            "Please revise this function.");
    
        // Common early exit test for all cases
        if (status == 0)
            return;
    
        int dev = -1;
        cudaGetDevice(&dev);
    
        if constexpr (std::is_same_v<T, cudaError_t>) {
            fprintf(stderr, "%s(%u:%u) [device %d] CUDA error in %s: %s (%s).\n", loc.file_name(), loc.line(), loc.column(),
                    dev, loc.function_name(), cudaGetErrorString(status), cudaGetErrorName(status));
        } else if constexpr (std::is_same_v<T, CUresult>) {
            const char* error_string;
            cuGetErrorString(status, &error_string);
            const char* error_name;
            cuGetErrorName(status, &error_name);
            fprintf(stderr, "%s(%u:%u) [device %d] CUDA DRIVER error in %s: %s (%s).\n", loc.file_name(), loc.line(),
                    loc.column(), dev, loc.function_name(), error_string, error_name);
        } else if constexpr (std::is_same_v<T, cublasStatus_t>) {
            fprintf(stderr, "%s(%u:%u) [device %d] CUBLAS error in %s: %s.\n", loc.file_name(), loc.line(), loc.column(),
                    dev, loc.function_name(), cublasGetStatusString(status));
        } else {
            static_assert(std::is_same_v<T, cusolverStatus_t>, "Error: not a CUDA status.");
            fprintf(stderr, "%s(%u:%u) [device %d] CUSOLVER error in call %s: %s.\n", loc.file_name(), loc.line(),
                    loc.column(), dev, loc.function_name(), cusolverGetErrorString(status));
        }
    
        abort();
    }
    
    opened by andralex 0
  • Modularize <iterator>


    This modularizes the <iterator> header as done by libc++.

    This is a precursor to updating to a more recent state of the code and bringing in the C++20 ranges machinery.

    opened by miscco 1
  • `__libcpp_isnan_or_builtin` and `__libcpp_isinf_or_builtin` cause implicit conversion to fp64


    The __libcpp_isnan_or_builtin and __libcpp_isinf_or_builtin functions will invoke __isnan and __isinf when invoked from device code.

    These functions will implicitly convert float to double, which can have performance impacts on surrounding code. This is because the __isinf provided by cuda/include/crt/math_functions.hpp only has an overload for double.

    The corresponding isnan and isinf functions have overloads for both float and double that avoid the conversion.

    We should use isnan and isinf directly instead of __isnan/__isinf.

    opened by jrhemstad 0
  • Minor compare exchange optimization


    Currently, compare_exchange_strong is using __stronger_order_cuda:

    inline __host__ __device__ int __stronger_order_cuda(int __a, int __b) {
        int const __max = __a > __b ? __a : __b;
        if(__max != __ATOMIC_RELEASE)
            return __max;
        static int const __xform[] = {
            __ATOMIC_RELEASE,
            __ATOMIC_ACQ_REL,
            __ATOMIC_ACQ_REL,
            __ATOMIC_RELEASE };
        return __xform[__a < __b ? __a : __b];
    }
    

    The code above leads to actual memory loads. We can consider the following optimization:

    inline __host__ __device__ int __stronger_order_cuda(int __a, int __b) {
        int const __max = __a > __b ? __a : __b;
        if(__max != __ATOMIC_RELEASE)
            return __max;
        const int __min = __a < __b ? __a : __b;
        return __ATOMIC_ACQ_REL - ((__min & 1) == (__min >> 1));
    }
    

    The change leads to about 4% better performance of compare exchange on a mobile 3070 Ti when the memory ordering is not known at compile time:

    [benchmark plot: switch_vs_array]

    When the memory ordering is known at compile time, there's no difference in generated SASS for both versions. Here's the benchmark:

    #include <iostream>
    #include <cuda/atomic>
    
    constexpr int threads_in_block = 1024;
    
    __launch_bounds__(threads_in_block)
    __global__ void kernel(int *ptr, int target, cuda::memory_order success, cuda::memory_order failure) {
      __shared__ int cache;
    
      int expected = -1;
      if (threadIdx.x == target) {
        cache = expected;
      }
      __syncthreads();
    
      cuda::atomic_ref<int, cuda::thread_scope_block> ref(cache);
    
      if (ref.compare_exchange_strong(expected, threadIdx.x, success, failure)) {
        ptr[blockIdx.x] = threadIdx.x;
      }
    }
    
    int main() {
      int blocks_in_grid = 256 * 1024;
      int n = blocks_in_grid;
    
      int *ptr{};
      cudaMalloc(&ptr, sizeof(int) * n);
      cudaMemset(ptr, 0, sizeof(int) * n);
    
      cudaEvent_t begin, end;
      cudaEventCreate(&begin);
      cudaEventCreate(&end);
    
      cudaEventRecord(begin);
      kernel<<<blocks_in_grid, threads_in_block>>>(ptr, 0, cuda::memory_order_release, cuda::memory_order_relaxed);
      cudaEventRecord(end);
      cudaEventSynchronize(end);
    
      float ms{};
      cudaEventElapsedTime(&ms, begin, end);
    
      std::cout << ms << std::endl;
    
      cudaEventDestroy(end);
      cudaEventDestroy(begin);
    
      cudaFree(ptr);
    }
    
    nvcc -gencode arch=compute_86,code=sm_86 -std=c++17 -DNDEBUG -O3 main.cu
    
    P3: backlog 
    opened by senior-zero 3
  • libcu++ should document policy for mixing a custom version with the CUDA Toolkit


    Users of libcu++ may obtain the library by either using what is shipped in the CUDA toolkit or by fetching a version from GitHub.

    Furthermore, there are headers (e.g., cuda_bf16.h) in the CUDA toolkit that rely on the version of libcu++ shipped with that particular toolkit.

    This has the potential to cause problems. For example, if a user is using an older version of libcu++ than was shipped with the CTK, then cuda_bf16.h will end up using an older version of libcu++ header. This could break if cuda_bf16.h was relying on features/changes in the corresponding version of libcu++ shipped in the same CTK version.

    libcu++ should have a section in its documentation about guarantees (if any) and best practices when it comes to using libcu++ from GitHub and mixing it with the CUDA toolkit.

    opened by jrhemstad 1
Releases (1.9.0-rc1)
  • 1.9.0-rc1(Oct 12, 2022)

    libcu++ 1.9.0

    Adds float and double support to cuda::std::atomic and cuda::atomic. This release also adds workflows for contributors based on Docker to improve testing and coverage.

    Supported ABI Versions: 4 (default), 3, and 2.

    New Features

    • #286: Add atomics for floating point types.
      • Thanks Daniel Jünger for this contribution.
    • #284: cuda::proclaim_return_type for use with extended lambda support in NVCC.
    • #267: Docker refactor, parameterizes OS and compiler versions.

    Issues Fixed

    • #280: NVHPC: Disable <nv/target> macro code paths when compiled without -stdpar.
    • #282: Prevent usage of cuda::atomic::fetch_max/min for non-integral types.
    • #288: Fix shortcut in fetch_min CAS loop.
      • Thanks Daniel Jünger for this contribution.
    • #291: Remove usage of find_path to locate cuda/std/detail/__config.
      • Thanks Robert Maynard for this contribution.
    • #276: Delete tests for unsupported header <compare>.
    • #293: Fix failures in several tests unsupportable by NVRTC.
    • #303: Move the emission of atomic errors on unsupported platforms to <atomic>.
    • #305: Add workflow to add issues/PRs to Project.
    • #314: Remove SM_35 from testing.
    • #312: Use escape hook for removal of <ciso646>.
    • #310: <atomics> Remove defaulted copy constructor from __cxx_atomic_lock_impl.
    • #300: Soundness bugfix for barrier<thread_scope_block> on sm_70.
    • #319: Fix ubuntu18 failing in CI due to missing lit prereqs.
    • #318: Fix gcc12 issues.
    • #320: Use cache_from to speed up builds if local versions exist.
    • #304: Fix <chrono> and <atomic> build errors with clang-cuda.
    • #324: Also disable tests on windows && pre-sm-70.
    Source code(tar.gz)
    Source code(zip)
  • 1.8.1(Apr 22, 2022)

    libcu++ 1.8.1 (CUDA Toolkit 11.8)

    libcu++ 1.8.1 is a minor release. It fixes minor issues in source, tests, and documentation.

    Supported ABI Versions: 4 (default), 3, and 2.

    Issues Fixed

    • #275: Fixes long double warnings in <cuda/std/cmath>.
    • #274: Fixes multiple definition errors.
      • Thanks Gonzalo Brito for this contribution.
    • #273: Fixes pedantic build warnings on IBM's xlc.
    • #271: Changes XFAIL state for CTAD tests on clang.
    • #266: Fix typo leading to broken link.
      • Thanks Daniel Lustig for this contribution.
    • #269: Fix several MSVC Test failures.
    • #268: Remove NVIDIA internal paths from CMake includes.
    • #265: Move pipeline into libcudacxx. Previously it was a separate CTK component.
    • #264: Fix builds using NVHPC by adding a new line.
      • Thanks Chengjie Wang and Royil Damer for this contribution.
    • #261: Fix extra line in perform_tests.bash causing invalid test results.
      • Thanks Chengjie Wang and Royil Damer for this contribution.
    • #246: Documentation fixes regarding atomics in GPU memory.
      • Thanks Daniel Lustig for this contribution.
    • #258: Lock contrast of our documentation's search text field.
      • Thanks Bradley Dice for this contribution.
    • #259: Add system_header pragma to portions of <nv/target>.
    • #249: Documentation update for building libcudacxx.
    • #247: Update godbolt links in examples.
      • Thanks Asher Mancinelli for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.8.0-post-release(Feb 19, 2022)

    libcu++ 1.8.0 (CUDA Toolkit 11.7)

    libcu++ 1.8.0 is a major release. It adds several constexpr bit manipulation functions from C++20's <bit> to C++11 and up. Also added is cuda::std::array providing fixed size arrays and iterators for both host and device code.

    Supported ABI Versions: 4 (default), 3, and 2.

    New Features

    • #237: Add <cuda/std/bit> and enable backports to C++11.
    • #243: Add <cuda/std/array> and <cuda/std/iterator>.

    Issues Fixed

    • #234: Fix building with GCC/Clang when NVCC was not being used.
    • #240: Create a config for lit to generate a JSON output of the build status.
      • Thanks Royil Damer for this contribution.
    • #241: Fix octal notation of libcudacxx version number.
    • #242: Add support for find_package and add_subdirectory in CMake.
    • #244: Merge build system improvements from NVC++ branch.
    • #250: Fix pragma typo on MSVC.
    • #251: Add several new compilers versions to our docker suite.
    • #252: Fix several deprecations in Clang 13.
    • #253: Fix truncations and warnings in numerics.
    • #254: Fix warnings in <array> tests and move __cuda_std__ escapes in <algorithm>.
    • #255: Fix deprecated copy ctor warnings in __annotated_ptr for Clang 13.
    • #256: Fix SM detection in the perform_tests script.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.0(Jan 14, 2022)

    libcu++ 1.7.0 (CUDA Toolkit 11.6)

    libcu++ 1.7.0 is a major release. It adds cuda::std::atomic_ref for integral types. cuda::std::atomic_ref may potentially replace uses of CUDA specific atomicOperator(_Scope) calls and provides a singular API for host and device code.

    Supported ABI Versions: 4 (default), 3, and 2.

    New Features

    • #203 Implements cuda::std::atomic_ref for integral types.

    Issues Fixed

    • #204: Fallback macro backend in <nv/target> when C or pre-C++11 dialects are used.
    • #206: Fix compilation with ASAN enabled.
      • Thanks Janusz Lisiecki for this contribution.
    • #207: Fix compilation of <cuda/std/atomic> for GCC/Clang.
    • #208: Flip an internal directory symlink, fixes packaging issues for internal tools.
    • #212: Fix <nv/target> on MSVC, fallback macros would always choose pre-C++11 backend.
    • #216: Annotated Pointer documentation.
      • Thanks Gonzalo Brito for this contribution.
    • #215: Add SM87 awareness to <nv/target>.
    • #217: Fix how CUDACC version is calculated for __int128 support.
    • #228: Fix LLVM lit pattern matching in test score calculation.
    • #227: Silence 4296 for type_traits.
    • #225: Fix calculation of _LIBCUDACXX_CUDACC_VER broken from #217.
      • Thanks Robert Maynard for this contribution.
    • #220: memcpy_async should cache only in L2 when possible.
    • #219: Change atomic/atomic_ref ctors to prevent copy construction.
    Source code(tar.gz)
    Source code(zip)
  • 1.6.0(Aug 9, 2021)

    libcu++ 1.6.0 (CUDA Toolkit 11.5)

    libcu++ 1.6.0 is a major release. It changes the default alignment of cuda::std::complex for better code generation and changes cuda::std::atomic to use <nv/target> as the primary dispatch mechanism.

    This release adds cuda::annotated_ptr and cuda::access_property, two APIs that allow associating an address space and an explicit caching policy with a pointer, and the related cuda::apply_access_property, cuda::associate_access_property and cuda::discard_memory APIs.

    This release introduces ABI version 4, which is now the default.

    Supported ABI Versions: 4 (default), 3, and 2.

    Issues Fixed

    • #194: <cuda/std/barrier> and <cuda/std/atomic> failed to compile with NVRTC.
    • #179: Refactors the atomic layer to allow for layering the host device/host abstractions.
    • #189: Changed pragmas for silencing chrono long double warnings.
    • #186: Allows <nv/target> to be used under NVRTC.
    • #177: Allows <nv/target> to build when compiled under C and C++98.
      • Thanks to David Olsen for this contribution.
    • #172: Introduces ABI version 4.
      • Forces cuda::std::complex alignment for enhanced performance.
      • Sets the internal representation of cuda::std::chrono literals to double.
    • #165: For tests on some older distributions keep using Python 3, but downgrade lit.
    • #164: Fixes testing issues related to Python 2/3 switch for lit.
      • Thanks to Royil Damer for this contribution.
    Source code(tar.gz)
    Source code(zip)
  • 1.5.0(Apr 29, 2021)

    libcu++ 1.5.0 (CUDA Toolkit 11.4)

    libcu++ 1.5.0 is a major release. It adds <nv/target>, the library support header for the new if target target specialization mechanism.

    Supported ABI Versions: 3 (default) and 2.

    Included in: CUDA Toolkit 11.4.

    New Features

    • <nv/target> - Portability macros for NVCC/NVC++ and other compilers.

    Issues Fixed

    • Documentation: Several typo fixes.
    • #126: Compiler warnings in <cuda/atomic>.
      • Thanks to anstellaire for this contribution.
  • 1.4.1(Feb 25, 2021)

    libcu++ 1.4.1 (CUDA Toolkit 11.3)

    libcu++ 1.4.1 is a minor bugfix release.

    Supported ABI versions: 3 (default) and 2.

    Included in: CUDA Toolkit 11.3.

    Other Enhancements

    • Documentation: Several enhancements; fixed a few broken links.
    • #108: Added constexpr to synchronization object constructors.
      • Thanks to Olivier Giroux for this contribution.

    Issues Fixed

    • #106: Fixed host code atomics on VS 2019 Version 16.5 / MSVC 1925 and above.
    • #101: Fixed cuda::std::complex for NVRTC.
    • #118: Renamed __is_convertible, which NVCC treats as a context sensitive keyword.
  • 1.4.0(Dec 4, 2020)

    libcu++ 1.4.0 adds <cuda/std/complex>, NVCC + MSVC support for <cuda/std/tuple>, and backports of C++20 <cuda/std/chrono> and C++17 <cuda/std/type_traits> features to C++14.

    Supported ABI versions: 3 (default) and 2.

    New Features

    • #32: <cuda/std/complex>.
      • long double is not supported and disabled when building with NVCC.
    • #34: C++17/20 <cuda/std/chrono> backported to C++14.
      • Thanks to Jake Hemstad and Paul Taylor for this contribution.
    • #44: C++17 <cuda/std/type_traits> backported to C++14.
      • Thanks to Jake Hemstad and Paul Taylor for this contribution.
    • #66: C++17 cuda::std::byte (in <cuda/std/cstddef>) backported to C++14.
      • Thanks to Jake Hemstad and Paul Taylor for this contribution.
    • #76: C++20 cuda::std::is_constant_evaluated backported to C++11.
      • Thanks to Jake Hemstad and Paul Taylor for this contribution.

    Other Enhancements

    • Documentation has been improved and reorganized.
    • #43: Atomics on MSVC have been decoupled from host Standard Library.
    • #78: Fixed header licensing.
    • #31: Revamped examples and benchmarks.
      • Thanks to Jake Hemstad for this contribution.

    Issues Fixed

    • #53, #80, #81: Improved documentation for <cuda/pipeline> and the asynchronous operations API.
    • #14: NVRTC missing definitions for several macros.
      • Thanks to Ben Barsdell for this contribution.
    • #56: <cuda/std/tuple> now works with a set of recent MSVC compilers.
    • #66, #82: <cuda/std/chrono>/<cuda/std/type_traits> backports.
      • Thanks to Jake Hemstad and Paul Taylor for this contribution.
  • 1.3.0(Oct 3, 2020)

    libcu++ 1.3.0 adds <cuda/std/tuple> and cuda::std::pair, although they are not supported with NVCC + MSVC. It also adds documentation.

    Supported ABI versions: 3 (default) and 2.

    Included in: CUDA 11.2.

    New Features

    • #17: <cuda/std/tuple>: cuda::std::tuple, a fixed-size collection of heterogeneous values. Not supported with NVCC + MSVC.
    • #17: <cuda/std/utility>: cuda::std::pair, a collection of two heterogeneous values. The only <cuda/std/utility> facility supported is cuda::std::pair. Not supported with NVCC + MSVC.
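    A minimal sketch of what these additions allow, assuming a non-MSVC host compiler (the function here is hypothetical):

    ```cpp
    #include <cuda/std/tuple>
    #include <cuda/std/utility>

    // Usable from both host and device code.
    __host__ __device__ cuda::std::pair<int, float> first_two() {
      cuda::std::tuple<int, float, char> t{1, 2.5f, 'c'};
      return {cuda::std::get<0>(t), cuda::std::get<1>(t)};
    }
    ```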

    Issues Fixed

    • #21: Disable __builtin_is_constant_evaluated usage with NVCC in C++11 mode because it's broken.
    • #25: Fix some declarations/definitions in __threading_support which have inconsistent qualifiers. Thanks to Gonzalo Brito Gadeschi for this contribution.
  • 1.2.0(Oct 29, 2020)

    libcu++ 1.2.0 adds <cuda/pipeline>/cuda::pipeline, a facility for coordinating cuda::memcpy_async operations. This release introduces ABI version 3, which is now the default.

    Supported ABI versions: 3 (default) and 2.

    Included in: CUDA 11.1.

    ABI Breaking Changes

    • ABI version 3 has been introduced and is now the default. A new ABI version was necessary to improve the performance of cuda::[std::]barrier by changing its alignment. Users may define _LIBCUDACXX_CUDA_ABI_VERSION=2 before including any libcu++ or CUDA headers to use ABI version 2, which was the default for the 1.1.0 / CUDA 11.0 release. Both ABI version 3 and ABI version 2 will be supported until the next major CUDA release.
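    For example, opting back into ABI version 2 looks like this (a configuration fragment; the header shown is just one possibility):

    ```cpp
    // Must come before any libcu++ or CUDA header.
    #define _LIBCUDACXX_CUDA_ABI_VERSION 2
    #include <cuda/std/barrier>
    ```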

    New Features

    • <cuda/pipeline>: cuda::pipeline, a facility for coordinating cuda::memcpy_async operations.
    • <cuda/std/version>: API version macros _LIBCUDACXX_CUDA_API_VERSION, _LIBCUDACXX_CUDA_API_VERSION_MAJOR, _LIBCUDACXX_CUDA_API_VERSION_MINOR, and _LIBCUDACXX_CUDA_API_VERSION_PATCH.
    • ABI version switching: users can define _LIBCUDACXX_CUDA_ABI_VERSION to request a particular supported ABI version. _LIBCUDACXX_CUDA_ABI_VERSION_LATEST is set to the latest ABI version, which is always the default.
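    A minimal single-stage sketch of the cuda::pipeline API (the kernel name and buffer size are hypothetical; requires nvcc and the CUDA Toolkit):

    ```cpp
    #include <cuda/pipeline>
    #include <cooperative_groups.h>

    __global__ void staged_copy(int* out, const int* in) {
      __shared__ int buf[256];
      __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
      auto block = cooperative_groups::this_thread_block();
      auto pipe  = cuda::make_pipeline(block, &state);

      // Stage a block-wide asynchronous copy into shared memory...
      pipe.producer_acquire();
      cuda::memcpy_async(block, buf, in, sizeof(buf), pipe);
      pipe.producer_commit();

      // ...then wait for it to land before consuming it.
      pipe.consumer_wait();
      out[threadIdx.x] = buf[threadIdx.x];
      pipe.consumer_release();
    }
    ```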

    Other Enhancements

    • <cuda/latch>/<cuda/semaphore>: <cuda/*> headers added for cuda::latch, cuda::counting_semaphore, and cuda::binary_semaphore. These features were available in prior releases, but you had to include <cuda/std/latch> and <cuda/std/semaphore> to access them.
    • NVCC + GCC 10 support.
    • NVCC + Clang 10 support.
  • 1.1.0(Oct 29, 2020)

    libcu++ 1.1.0 introduces the world's first implementation of the Standard C++20 synchronization library: <cuda/[std/]barrier>, <cuda/std/latch>, <cuda/std/semaphore>, cuda::[std::]atomic_flag::test, cuda::[std::]atomic::wait, and cuda::[std::]atomic::notify*. An extension for managing asynchronous local copies, cuda::memcpy_async, is introduced as well. It also adds <cuda/std/chrono>, <cuda/std/ratio>, and most of <cuda/std/functional>.

    ABI Breaking Changes

    • ABI version 2 has been introduced and is now the default. A new ABI version was introduced because it is our policy to do so in every major CUDA toolkit release. ABI version 1 is no longer supported.

    API Breaking Changes

    • Atomics on Pascal + Windows are disabled because the platform does not support them; on this platform, the CUDA driver rejects binaries containing these operations.

    New Features

    • <cuda/[std/]barrier>: C++20's cuda::[std::]barrier, an asynchronous thread coordination mechanism whose lifetime consists of a sequence of barrier phases, where each phase allows at most an expected number of threads to block until the expected number of threads arrive at the barrier. It is backported to C++11. The cuda::barrier variant takes an additional cuda::thread_scope parameter.
    • <cuda/barrier>: cuda::memcpy_async, asynchronous local copies. This facility is NOT for transferring data between threads or transferring data between host and device; it is not a cudaMemcpyAsync replacement or abstraction. It uses cuda::[std::]barrier objects to synchronize the copies.
    • <cuda/std/functional>: common function objects, such as cuda::std::plus, cuda::std::minus, etc. cuda::std::function, cuda::std::bind, cuda::std::hash, and cuda::std::reference_wrapper are omitted.
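    A sketch of how the barrier and cuda::memcpy_async compose (the kernel name and tile size are hypothetical; requires nvcc and the CUDA Toolkit):

    ```cpp
    #include <cuda/barrier>

    // Hypothetical kernel: overlap a copy into shared memory with other
    // work, then synchronize on the block-scoped barrier.
    __global__ void copy_tile(const int* global_in) {
      __shared__ int tile[256];
      __shared__ cuda::barrier<cuda::thread_scope_block> bar;
      if (threadIdx.x == 0)
        init(&bar, blockDim.x);  // found via ADL; one thread initializes
      __syncthreads();

      cuda::memcpy_async(&tile[threadIdx.x], &global_in[threadIdx.x],
                         sizeof(int), bar);
      bar.arrive_and_wait();  // wait for all copies in this phase
      // ... use tile ...
    }
    ```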

    Other Enhancements

    • Upgraded to a newer version of upstream libc++.
    • Standalone NVRTC support.
    • C++17 support.
    • NVCC + GCC 9 support.
    • NVCC + Clang 9 support.
    • Build with warnings-as-errors.

    Issues Fixed

    • Made __cuda_memcmp inline to fix ODR violations when compiling multiple translation units.
  • 1.0.0(Oct 29, 2020)

    libcu++ 1.0.0 is the first release of libcu++, the C++ Standard Library for your entire system. It brings C++ atomics to CUDA: <cuda/[std/]atomic>. It also introduces <cuda/std/type_traits>, <cuda/std/cassert>, <cuda/std/cfloat>, <cuda/std/cstddef>, and <cuda/std/cstdint>.

    New Features

    • <cuda/[std/]atomic>:
      • cuda::thread_scope: An enumeration that specifies which group of threads can synchronize with each other using a concurrency primitive.
      • cuda::atomic<T, Scope>: Scoped atomic objects.
      • cuda::std::atomic<T>: Atomic objects.
    • <cuda/std/type_traits>: Type traits and metaprogramming facilities.
    • <cuda/std/cassert>: assert, an error-reporting mechanism.
    • <cuda/std/cstddef>: Builtin fundamental types.
    • <cuda/std/cstdint>: Builtin integral types.
    • <cuda/std/cfloat>: Builtin floating point types.
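    A minimal sketch of scoped versus unscoped atomics (the kernel and counter names are hypothetical; requires nvcc and the CUDA Toolkit):

    ```cpp
    #include <cuda/atomic>

    // Hypothetical kernel: block-scoped atomics only synchronize threads
    // within one block, which can be cheaper than device-wide atomics.
    __global__ void count(cuda::atomic<int, cuda::thread_scope_block>* per_block,
                          cuda::atomic<int, cuda::thread_scope_device>* global) {
      per_block->fetch_add(1, cuda::std::memory_order_relaxed);
      global->fetch_add(1, cuda::std::memory_order_relaxed);
    }
    ```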

    Known Issues

    • Due to circumstances beyond our control, the NVIDIA-provided Debian packages install libcu++ to the wrong path. This makes libcu++ unusable if installed from the NVIDIA-provided Debian packages and may interfere with the operation of your host C++ Standard Library.