Thin C++-flavored wrappers for the CUDA Runtime API

Overview

NVIDIA's Runtime API for CUDA is intended for use in both C and C++ code. As such, it uses a C-style API, the lowest common denominator (with a few notable exceptions of templated function overloads).

This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of C++ (including some C++11) for using the runtime API - but without reducing expressivity or increasing the level of abstraction (as in, e.g., the Thrust library). Using cuda-api-wrappers, you still have your devices, streams, events and so on - but they will be more convenient to work with in more C++-idiomatic ways.

Key features

  • All functions and methods throw exceptions on failure - no need to check return values (the exceptions carry the status information); see the sketch after this list.
  • Judicious namespacing (and some internal namespace-like classes) for better clarity and for semantically grouping related functionality together.
  • There are proxy/wrapper objects for devices, streams, events, kernels and so on, using RAII to relieve you of remembering to free or destroy resources.
  • You can mostly forget about numeric IDs and handles; the proxy classes will fit everywhere.
  • Various Plain Old Data structs adorned with convenience methods and operators.
  • Aims for clarity and straightforwardness in naming and semantics, so that you don't need to look concepts up in the official documentation to understand what each class and function does.
  • Thin and lightweight:
    • No work done behind your back, no caches or indices or any such thing.
    • No costly inheritance structure, vtables, virtual methods and so on - vanishes almost entirely on compilation.
    • Doesn't really "hide" any of CUDA's complexity or functionality; it only simplifies use of the Runtime API.
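
To illustrate the exception-based error handling mentioned in the list above, here is a minimal sketch. It only combines calls shown elsewhere in this README (the runtime_api.hpp header, cuda::device::current::get(), the device proxy's memory allocation method); the oversized allocation is just a hypothetical way to trigger a failure, and exact signatures may vary between library versions:

#include <cuda/runtime_api.hpp>
#include <cstddef>
#include <iostream>

int main()
{
	try {
		auto device = cuda::device::current::get();
		// Deliberately request an absurd amount of device memory;
		// on failure, the wrapper throws rather than returning a status code
		auto allocation = device.memory.allocate(std::size_t{1} << 48);
		(void) allocation;
	} catch (const cuda::runtime_error& e) {
		// The exception carries the CUDA status information
		std::cerr << "CUDA error: " << e.what() << '\n';
	}
}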

Detailed documentation

Detailed, nearly-complete Doxygen-generated documentation is available.

Requirements

  • CUDA v8.0 or later is recommended; v7.5 should be supported (but is untested), and CUDA 6.x should probably be OK as well.
  • A C++11-capable compiler compatible with your version of CUDA.
  • CMake v3.8 or later - although most of the library will work as simple headers with no building.

Coverage of the Runtime API

Considering the list of runtime API modules, the library currently has the following (w.r.t. CUDA 8.x):

Coverage level | Modules
-------------- | -------
Full           | Error Handling, Stream Management, Event Management, Version Management, Peer Device Memory Access, Occupancy, Unified Addressing
Almost full    | Device Management (no chooseDevice, cudaSetValidDevices), Memory Management, Execution Control (no support for working with parameter buffers)
Partial        | 2D & 3D Arrays, Texture Object Management, Texture Reference Management
(Deprecated)   | Thread Management
No coverage    | Graph Management, OpenGL Interoperability, Direct3D Interoperability, VDPAU Interoperability, EGL Interoperability, Graphics Interoperability, Surface Reference Management, Surface Object Management

The Milestones page indicates some features which aren't yet covered but are slated for future work.

Since I am not currently working on anything graphics-related, there are no short-term plans to extend coverage to any of the graphics-related modules.

A taste of the key features in play

We've all dreamed of being able to type in:

my_stream.enqueue.callback(
	[&foo](cuda::stream_t stream, cuda::status_t status) {
		std::cout << "Hello " << foo << " world!\n";
	}
);

... and have that just work, right? Well, now it does!

On a slightly more serious note, though, let's demonstrate the principles listed above:

Use of namespaces (and internal classes)

With this library, you would do cuda::memory::host::allocate() instead of cudaMallocHost() and cuda::device_t::memory::allocate() instead of setting the current device and then cudaMalloc(). Note, though, that device_t::memory::allocate() is not a freestanding function but a method of an internal class, so a call to it might be cuda::device::get(my_device_id).memory.allocate(my_size). The compiled version of this supposedly complicated construct will be nothing but the sequence of cudaSetDevice() and cudaMalloc() calls.
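
For instance - a minimal sketch using only the calls named above (return types and the corresponding freeing calls are left out, as they may differ between library versions; the make_unique variants shown further below relieve you of manual freeing):

#include <cuda/runtime_api.hpp>
#include <cstddef>

void allocation_example(cuda::device::id_t my_device_id, std::size_t my_size)
{
	// Pinned host memory - instead of cudaMallocHost():
	auto host_buffer = cuda::memory::host::allocate(my_size);

	// Global memory on a specific device - instead of
	// cudaSetDevice() followed by cudaMalloc():
	auto device_buffer = cuda::device::get(my_device_id).memory.allocate(my_size);

	// ... use (and eventually free) both buffers
	(void) host_buffer; (void) device_buffer;
}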

Adorning POD structs with convenience methods

The expression

my_device.properties().compute_capability() >= cuda::make_compute_capability(50)

is a valid comparison, true for all devices with a Maxwell-or-later microarchitecture - despite the fact that struct cuda::compute_capability_t is a POD type with two unsigned integer fields, not a scalar. Note that struct cuda::device::properties_t (which is essentially the Runtime API's own struct cudaDeviceProp) does not have a compute_capability field.
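
In context, this looks as follows (a short sketch; cuda::device::current::get() is as used elsewhere in this README):

auto device = cuda::device::current::get();
if (device.properties().compute_capability() >= cuda::make_compute_capability(50)) {
	// Maxwell or later - e.g. we may rely on Maxwell-era hardware features here
}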

Meaningful naming

Instead of using

cudaError_t cudaEventCreateWithFlags(
    cudaEvent_t* event, 
    unsigned int flags) 

which requires you to remember what you need to specify as flags and how, you create a cuda::event_t proxy object, using the function:

cuda::event_t cuda::event::create(
    cuda::device_t  device,
    bool            uses_blocking_sync,
    bool            records_timing      = cuda::event::do_record_timing,
    bool            interprocess        = cuda::event::not_interprocess)

The default values here are enum : bool constants, which you can also use yourself when creating events with non-default parameters - making the call more readable than with bare true or false.
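
For example (a sketch; the two named constants are those listed above, while a bare true is passed for uses_blocking_sync, as this README doesn't show a named constant for it):

auto device = cuda::device::current::get();
auto event = cuda::event::create(
	device,
	true,                            // uses_blocking_sync
	cuda::event::do_record_timing,   // clearer than a bare true
	cuda::event::not_interprocess);  // clearer than a bare false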

Example programs

More detailed documentation / feature walk-through is forthcoming. For now, I'm providing two kinds of short example programs; after browsing their source, you'll know essentially all there is to know about the API wrappers.

To build and run the examples (just as a sanity check), execute the following:

[user@host:/path/to/cuda-api-wrappers/]$ cmake -S . -B build -DBUILD_EXAMPLES=ON && cmake --build build/ && find build/examples/bin -type f -exec "{}" ";"

Modified CUDA samples

The CUDA distribution contains sample programs demonstrating various features and concepts. A few of these - those not focused on device-side work - have been adapted to use the API wrappers, completely foregoing direct use of the CUDA Runtime API itself. You will find them in the modified CUDA samples example programs folder.

'Coverage' test programs - by module of the Runtime API

Gradually, an example program is being added for each of the CUDA Runtime API modules, demonstrating how calls to that module's API can be replaced with use of the API wrappers. These per-module example programs can be found here.

Bugs, suggestions, feedback

I would like some help with building up documentation and perhaps a Wiki here; if you can spare the time, do write me. You can also do so if you're interested in collaborating on some related project, or for general comments/feedback/suggestions.

If you notice a specific issue which needs addressing, especially any sort of bug or compilation error, please file the issue here on GitHub.

Issues
  • Non-Portable Path to NVTX Headers [Windows]

    On Windows with CUDA 10.0 installed, the paths to the NVTX headers that are used in profiling.cpp need to be prepended with nvtx3 so that the full include path is nvtx3/nvToolsExt.h and nvtx3/nvToolsExtCudaRt.h. Not sure how to make a "portable" solution here other than just hacking together the include paths based on the OS.

    One would think that the include paths should be the same if everything is being found correctly via CMake.

    bug nvtx ms-windows 
    opened by DeveloperPaul123 20
  • examples fail on kepler GPU

    I'm trying to use this project but many examples fail on machines that have older GPUs. I have tried on a few machines with Tesla K40c, and one Titan V. Most tests passed on the Titan V except two, but many tests do not pass on K40c, including many flavors of VectorAdd. Could you please help? Here are my specs:

    Failed configuration: Tesla K40c (compute capability 3.5); NVIDIA-SMI 470.103.01; Driver Version 470.103.01; CUDA Version 11.4; CMake 3.23.0; g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    Failed tests: asyncAPI, binaryPartitionCG, event_management, execution_control, inlinePTX, p2pBandwidthLatencyTest, simpleIPC, simpleStreams, stream_management, vectorAdd, vectorAddManaged, vectorAddMapped, vectorAddMMAP

    Configuration with fewer failures: Titan V (compute capability 7.0); NVIDIA-SMI 470.74; Driver Version 470.74; CUDA Version 11.4; CMake 3.23.0; g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    Failed tests: simpleStreams, vectorAddMMAP

    bug 
    opened by rainli323 19
  • cannot find -lcudadevrt -lcudart_static

    $ mkdir build && cd build
    
    $ cmake ..
    
    $ make
    ...
    [ 13%] Linking CXX executable examples/bin/unified_addressing
    /usr/bin/x86_64-linux-gnu-ld: cannot find -lcudadevrt
    /usr/bin/x86_64-linux-gnu-ld: cannot find -lcudart_static
    collect2: error: ld returned 1 exit status
    CMakeFiles/unified_addressing.dir/build.make:84: recipe for target 'examples/bin/unified_addressing' failed
    
    $ ls ~/.local/cuda-9.0/lib64 | grep cuda
    libcudadevrt.a
    libcudart.so
    libcudart.so.9.0
    libcudart.so.9.0.176
    libcudart_static.a
    

    I'm using CUDA 9.0 and g++ 6.5.0. Looks like it cannot find my customized CUDA lib folder. Any solutions for this?

    Also, some similar warnings are given when compiling:

    [ 11%] Building CXX object CMakeFiles/unified_addressing.dir/examples/by_runtime_api_module/unified_addressing.cpp.o                                  
    In file included from /home/user/cuda-api-wrappers/src/cuda/api/device.hpp:13:0,                                                                       
                     from /home/user/cuda-api-wrappers/examples/by_runtime_api_module/unified_addressing.cpp:15:                                           
    /home/user/cuda-api-wrappers/src/cuda/api/device_properties.hpp:99:13: warning: In the GNU C Library, "major" is defined                               
     by <sys/sysmacros.h>. For historical compatibility, it is                                                                                            
     currently defined by <sys/types.h> as well, but we plan to                                                                                           
     remove this soon. To use "major", include <sys/sysmacros.h>                                                                                          
     directly. If you did not intend to use a system-defined macro                                                                                        
     "major", you should undefine it after including <sys/types.h>.                                                                                       
      unsigned as_combined_number() const noexcept { return major() * 10 + minor_; }                                                                      
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                              
    
    opened by zingdle 19
  • Fix/compiling shared libs 76

    Note that I also had to add the following code in the CMakeLists.txt to get everything to compile (not including examples).

    set(CUDA_LIBRARIES "cudadevrt.lib;cudart.lib")
    target_link_libraries(${PROJECT_NAME} PUBLIC ${CUDA_LIBRARIES})
    

    Let me know if you have questions.

    Another thing to note is that the install command installs all the libraries and include files correctly, but the cmake export files aren't installed. Is this on purpose? I would consider installing these and also adding a cuda-api-wrappersConfig.cmake file as part of the install so that the installed version of the library can easily be used by other cmake based projects. This would also make it trivial to create installation packages for releases of your library if/when you decide to do that.

    opened by DeveloperPaul123 19
  • `make_unique` without device behaves strangely in the example `vectorAdd`

    My environment is

    • GCC 12.1.0
    • CMake 3.23.1
    • CUDA 11.6
    • cuda-api-wrappers latest commit
    • RTX 2070 Super
    • Linux 5.17.9

    In the example vectorAdd, if I change

    auto device = cuda::device::current::get();
    auto d_A = cuda::memory::device::make_unique<float[]>(device, numElements);
    auto d_B = cuda::memory::device::make_unique<float[]>(device, numElements);
    auto d_C = cuda::memory::device::make_unique<float[]>(device, numElements);
    

    to

    auto d_A = cuda::memory::device::make_unique<float[]>(numElements);
    auto d_B = cuda::memory::device::make_unique<float[]>(numElements);
    auto d_C = cuda::memory::device::make_unique<float[]>(numElements);
    

    I will get

    terminate called after throwing an instance of 'cuda::runtime_error'
      what():  Failed releasing the reference to the primary context for device 0: invalid device context
    

    Sometimes I get a SIGSEGV.

    I thought these two pieces of code were equivalent.

    bug resolved-on-development 
    opened by QuarticCat 18
  • Bus error in `array_management.cu`

    Is a device.synchronize(); call missing at https://github.com/eyalroz/cuda-api-wrappers/blob/36c0fa021ceacfe96548636d3716913e009634b6/examples/other/array_management.cu#L162 ? If I do not add it, I get a bus error. But it is just a simple copy; I do not actually expect a synchronize to be needed here.

    bug resolved-on-development 
    opened by szsdk 17
  • Support for CUDA arrays and texture memory

    I think it would be useful to provide wrappers for using CUDA arrays and texture memory. I am going to implement initial functionality in the next two weeks. Maybe you can give me feedback on the implementation, so that I can later create a pull request. See here:

    https://github.com/codecircuit/cuda-api-wrappers

    Some ideas:

    
    // only support for these types is given by cudaChannelFormatDesc
    auto arr0 = cuda::array::create<float, 3>({w, h, d});
    // Is it inconvenient to give the number of dimensions as a template parameter?
    // Maybe this can be deduced from `w, h, d`
    auto arr1 = cuda::array::create<unsigned, 3>({w, h, d});
    auto arr2 = cuda::array::create<int, 3>({w, h, d});
    
    // only support for 2 and 3 dimensional arrays
    auto arr3 = cuda::array::create<float, 2>({w, h});
    auto arr4 = cuda::array::create<unsigned, 2>({w, h});
    auto arr5 = cuda::array::create<int, 2>({w, h});
    
    auto ptr = cuda::memory::device::make_unique<float[]>(device, w * h * d);
    
    // copy data into array
    cuda::memory::copy(arr0, ptr.get());
    // copy data into buffer
    cuda::memory::copy(ptr.get(), arr1);
    
    // Create a texture with reasonable defaults:
    //	texDesc.addressMode[0]   = cudaAddressModeBorder;
    //	texDesc.addressMode[1]   = cudaAddressModeBorder;
    //	texDesc.addressMode[2]   = cudaAddressModeBorder;
    //	texDesc.filterMode       = cudaFilterModePoint;
    //	texDesc.readMode         = cudaReadModeElementType;
    //	texDesc.normalizedCoords = 0;
    //
    // Create texture from an array
    auto texture1 = cuda::texture::create(arr0);
    // is this misleading, because texture1 is not owning memory?
    
    // use in the kernel
    cudaTextureObject_t raw = texture1.get(); // expose C-Object
    cuda::launch(kernel, {1,1}, texture1.get(), ...);
    
    enhancement resolved-on-development 
    opened by codecircuit 17
  • CMake 3.17 automatically adds Threads::Threads as interface library

    The cuda-api-wrappers library depends on PThreads, which is now handled as a separate target in CMake. Thus, if you just include cuda-api-wrappers in your own CMake project, CMake complains about the missing target Threads::Threads. This is not just a 'nice-to-have' dependency, because PThreads is used in the compiled part of the wrappers, not just in one stand-alone header.

    The CMake module CMakeFindDependencyMacro is helpful for that. If you need help with this, I can create a PR.

    bug nvtx 
    opened by codecircuit 16
  • Allow for obtaining architecture and compute capability information at compile-time

    https://github.com/eyalroz/cuda-api-wrappers/blob/b33726d4cd72760fbd9d06c370e5fd5ee59c7d31/src/cuda/api/device_properties.cpp#L24-L37

    For several things, we need these values known at compile time. I am not familiar with find_if - will it be able to find the value for the corresponding arch at compile time?

    This is necessary for simple and performance critical tasks, stuff like kernel launch/bound configs, static shared memory, etc.

    task resolved-on-development 
    opened by neoblizz 14
  • region_t and const pointers

    Consider the case you want to do a prefetch on managed memory within this function:

    
    void f(const double* ptr, ..., size_t N) {
        cuda::memory::managed::region_t region;
        region.start = ptr; // impossible, because we cannot cast `const double*` to `void*`
        region.size_in_bytes = N * sizeof(double);
        cuda::memory::managed::prefetch_to_host(region);
    }
    

    Do you think it is reasonable to support prefetches on const data, because CUDA also supports it? The signature is cudaMemPrefetchAsync(const void* ptr, ...). Currently a const_cast is needed in the snippet above.

    question task resolved-on-development 
    opened by codecircuit 13
  • Build error with CUDA 11.2 & MSVC 19.28

    Hi, I'd like to use the API wrappers in a Windows-targeted application; however, I was able to build neither the examples (using CMake) nor my own application when using the API wrappers (a mere include of cuda/runtime_api.hpp already leads to compilation errors).

    Configuration:

    • Win10 x64
    • MSVC ~~16.8.4~~ 19.28.29336, Visual Studio 16.8.4
    • CUDA 11.2
    • CMake 3.19.4

    Steps to reproduce: I cloned version v0.4 of the repo and performed two slight modifications regarding CMake variables to allow compilation using MSVC:

    1. Add variable CMAKE_CUDA_ARCHITECTURES and set it to 52, otherwise the CMake generation will fail with BUILD_EXAMPLES enabled
    2. Add /Zc:__cplusplus to CMAKE_CUDA_FLAGS* (-Xcompiler="*other flags* /Zc:__cplusplus"), otherwise the predefined macro __cplusplus will always report 199711L and a static_assert will be hit: runtime_api.hpp(11): static_assert(__cplusplus >= 201103L, "The CUDA Runtime API headers can only be compiled with C++11 or a later version of the C++ language standard");

    The nvtx project now builds without any errors; the example projects, however, are failing. The compilation log for vectorAdd is attached; the main cause of the trouble seems to be:

    1>cuda/api/memory.hpp(1030): error : template instantiation resulted in unexpected function type of "__nv_bool (cuda::memory::managed::region_t, cudaMemRangeAttribute)" (the meaning of a name may have changed since the template declaration -- the type of the template is "T (cuda::memory::region_t, cudaMemRangeAttribute)")
    1>          detected during instantiation of "cuda::memory::managed::detail::get_scalar_range_attribute" based on template argument <__nv_bool>
    1>(1030): here
    

    I'm a little lost on this one. Any suggestions? Thanks in advance for any support.

    bug ms-windows 
    opened by quxflux 13
  • A builder-class for NVRTC programs

    It would be useful if one could build NVRTC programs incrementally, adding and setting headers, options, etc. at one's convenience rather than when constructing a program_t object.

    enhancement resolved-on-development 
    opened by eyalroz 0
  • Add support for the PTX compilation API

    Somehow, NVIDIA's separate library for compiling PTX code into SASS escaped me...

    It's documented at:

    https://docs.nvidia.com/cuda/ptx-compiler-api/index.html

    and we should definitely add support for it.

    There's a "handle" type, similar to an NVRTC "program"; and the functions are:

    Functions:

    • nvPTXCompilerCompile ( nvPTXCompilerHandle compiler, int numCompileOptions, const char** compileOptions )
      Compile a PTX program with the given compiler options.
    • nvPTXCompilerCreate ( nvPTXCompilerHandle* compiler, size_t ptxCodeLen, const char* ptxCode )
      Obtains the handle to an instance of the PTX compiler initialized with the given PTX program ptxCode.
    • nvPTXCompilerDestroy ( nvPTXCompilerHandle* compiler )
      Destroys and cleans the already created PTX compiler.
    • nvPTXCompilerGetCompiledProgram ( nvPTXCompilerHandle compiler, void* binaryImage )
      Obtains the image of the compiled program.
    • nvPTXCompilerGetCompiledProgramSize ( nvPTXCompilerHandle compiler, size_t* binaryImageSize )
      Obtains the size of the image of the compiled program.
    • nvPTXCompilerGetErrorLog ( nvPTXCompilerHandle compiler, char* errorLog )
      Query the error message that was seen previously for the handle.
    • nvPTXCompilerGetErrorLogSize ( nvPTXCompilerHandle compiler, size_t* errorLogSize )
      Query the size of the error message that was seen previously for the handle.
    • nvPTXCompilerGetInfoLog ( nvPTXCompilerHandle compiler, char* infoLog )
      Query the information message that was seen previously for the handle.
    • nvPTXCompilerGetInfoLogSize ( nvPTXCompilerHandle compiler, size_t* infoLogSize )
      Query the size of the information message that was seen previously for the handle.

    and a version function:

    • nvPTXCompileResult nvPTXCompilerGetVersion ( unsigned int* major, unsigned int* minor )

    Need to think about where that's going to go, namespace-wise (and reconsider the same for NVRTC - if it even has a version number).

    enhancement task 
    opened by eyalroz 1
  • Compilation log vector<char> contains trailing '\0'

    It seems nvrtcGetProgramLogSize() includes 1 for a trailing '\0' character, and so we end up placing it in our return value - which is not a C-style string. Let's not do that.

    bug resolved-on-development 
    opened by eyalroz 0
Releases (latest: v0.5.3-b2)
  • v0.5.3-b2(Jul 26, 2022)

    Runtime program compilation (NVRTC) improvements

    • #379: Can get the compilation log, PTX, cubin or NVVM in a user-provided rather than self-allocated buffer
    • #388: A builder interface for NVRTC programs
    • #386: Add support for nvrtcGetSupportedArchs()
    • #375: Support adding arbitrary options when dynamically compiling a CUDA program
    • #265: Support for diag-suppress/error/warn compilation options

    Runtime-compilation-related Bug fixes

    • #384: Make nvrtc depend on runtime-and-driver
    • #376: When rendering compilation options to a string, we get an extra space
    • #378: Compilation log vector contains trailing '\0'
    • #387: nvrtc.h included in wrong file

    Other features

    • #248: Support asynchronous memory allocation (in v0.5.2 we only had allocation, no freeing)

    Caveats

    Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.

  • v0.5.2(Jun 18, 2022)

    Full MS Windows support is restored in this version (AFAICT). Also worked out some kinks and polished a few interfaces.

    Bug fixes

    • #330, #369, #372 Corrected some launch_config_builder logic bugs.
    • #368 Fixed an accidental primary context deactivation in p2pBandwidthLatencyTest
    • #360 Was missing an implementation of context_t::create_event()
    • #357 All assignment operators updated to appropriately handle primary context reference unit propagation
    • #351 Fixed a typo in Windows-target-only code
    • #335 Redundant 0x in error messages
    • #329 marshalled_options.hpp errors with C++17
    • #324 marshalled_options.hpp needs cuda::span, but doesn't see it
    • #325 nvrtc/compilation_options.hpp needs to know about device_t

    Windows compatibility

    • #345 Avoid non-portable assumptions regarding thread handles in vectorAdd_profiled
    • #344 Workaround for an MSVC SFINAE error with std::iterator_traits<Iter>
    • #343 std::experimental::filesystem not properly supported on Windows
    • #342 Don't try to use mkstemp on Windows
    • #341 Avoid size_t <-> unsigned overload clash on Windows
    • #340 Apply the CUDA_CB decoration to shared memory size-determiner function - it's actually necessary on Windows
    • #339 Avoid some MSVC compiler warnings
    • #338 Added missing inclusions to have Windows NT HANDLE defined
    • #337 Support for MSVC's standard-incompliant __cplusplus value
    • #347 Using ::std:: rather than std::, to avoid clashes with NVIDIA's libcustd, which is included by default by CUDA 11.7's nvcc.

    Interface tweaks

    RTC compilation options

    • #364 marshal() and render() are now stand-alone functions.
    • #363 Can now render compilation options to an ::std::string (in case you want to save/print them)
    • #362 Add a clear_language_dialect() to rtc::compilation_options_t
    • #361 If an rtc::compilation_options_t is asked to set the language dialect to an empty or null string - unset it instead
    • #355 Support taking the C++ language dialect as an ::std::string, not just a C-style string.

    Other classes

    • #365 module::get_kernel() can now take an ::std::string
    • #359 Now exposing the interface for enqueuing kernels with type-erased arguments, passed via an array of void* (so far, you could only enqueue by passing the parameters with their types).
    • #356 (Almost) all proxy classes are now move-assignable and move-constructible, but not copy-assignable or copy-constructible. Move them or use const references.
    • #358 link_t should have a device_id()

    Miscellaneous and internal issues

    • #367 Avoiding a redundant scoped context setting when enqueuing a kernel
    • #366 Spruced up CUDA_DEVICE_OR_THIS_SCOPE() and CUDA_CONTEXT_FOR_THIS_SCOPE()
    • #353 Added missing PCI function initializer to the PCI location wrappers class.
    • #352 Simplified the options marshalling code
    • #349 Prefix CMake options with CAW_, for use as a subproject (e.g. FetchContent)
    • #346 Fix CUDA installation in GitHub action scripts
    • #326 Drop redundant inclusions and make include order more "challenging" in vectorAdd examples
    • #328 Reduce gratuitous API calls in current_device::detail::set()
    • #331 Can now load a module from file into any context, not just the current context
    • #334 Reduce the number of redundant informative API calls
    • #333 Don't treat freeing in a destroyed context as an error
    • #303 Use CUDA_VERSION instead of CUDART_VERSION
    • #370 cuda::context::current::exists() now returns false, rather than throwing, if the CUDA driver has not been initialized
    • #373 In Debug builds, now validating launch configuration grid dimensions before enqueueing/launching anything (as CUDA tends to fail silently, e.g. for empty grids)

    Caveats

    Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.

  • v0.5.1(May 9, 2022)

    Build mechanism

    • #307 The library is now entirely header-only (the NVTX wrappers, which used to be compiled, are now all within headers).

    New supported features

    • #308 Supporting both narrow/regular and wide character inputs for NVRTC compilation.
    • #309 Support for naming streams, devices and events with NVTX

    Concepts/facilities introduced

    • #311 A Builder-pattern class for building launch configurations more easily.

    Compatibility

    • #304 : Now compatible with all CUDA versions between 9.0 and 11.6

    Bug fixes

    • #320 No longer getting an error message about module::create() when including only runtime_api.hpp.
    • #317 No longer "leaking" references to device primary contexts which made them never be destroyed after some point. Fixing this exposed a few other latent issues involving non-existence of primary contexts: #316.
    • #314 No longer failing to enqueue events when there is no current context.
    • #305, #306 :
      • Added missing named errors to cuda::status
      • Now using driver error codes wherever applicable (they only started to coincide with Runtime API error codes in a recent CUDA version)
      • Renamed mis-named error: cuda::status::not_ready -> cuda::status::async_operations_not_yet_completed.
    • #315 In one of the example programs, we were launching a kernel on the current device rather than the one the user had chosen.

    Miscellaneous and internal issues

    • #310 NVTX wrapper now uses driver-API-style
    • #303 Using CUDA_VERSION instead of CUDA_RT_VERSION where relevant.
    • #320 Added an example program only explicitly including runtime-API-related headers.
    • #321 Weakened requirement from kernel parameter types from TriviallyCopyable to just being trivially copy-constructible.

    Caveats

    Windows support is partially broken in this version.

  • v0.5.0(Feb 19, 2022)

    This is a near-complete under-the-hood rewrite of the API wrappers library, while maintaining its existing API almost entirely: The library now primarily relies on CUDA Driver API calls, with Runtime API calls used only where the driver does not straightforwardly provide the same functionality.

    If you are only interested in the Runtime API, you may wish to use the latest 0.4.x release. At the moment, that is 0.4.7.

    Fundamental feature set additions

    • #9 Driver API support
    • #228, #262 : NVRTC support

    Wrapper classes introduced

    • Contexts: context_t.
    • Dynamically vs. statically compiled kernels: kernel_t and apriori_compiled_kernel_t
    • Device primary contexts: device::primary_context_t
    • link_t: Linking together compiled code to satisfy symbol definition requirements and complete executables.
    • link_options_t defining options for linking.
    • Virtual memory: physical_allocation_t, address_range_reservation_t and mapping_t between pairs of the former.
    • Modules: module_t, made up of compiled binary/PTX code - functions, global symbols etc - which may be loaded into contexts

    and via NVRTC support:

    • Programs: rtc::program_t, made up of CUDA or PTX source code.
    • Compilation options, rtc::compilation_options_t defining options for compiling programs.

    (All of the classes above are under the cuda:: namespace)

    Concepts/facilities introduced

    • Treatment of the primary context as a context and its creation or destruction
    • The context stack
    • The current context
    • Waiting on the value of a scalar in global device memory
    • Access by specific contexts to specific contexts of peer devices

    Caveats

    Windows support is partially broken in this version.

  • v0.4.7(Mar 12, 2022)

    This version has very few changes relative to 0.4.6. These are:

    Bug fixes

    • #301 : Now ensuring launch configurations can be assigned to each other.

    Note: Users' help is kindly requested in preparing for the next major release, which will cover both the runtime and the driver API, and NVRTC as well. See this branch and contact me / open relevant issues.

  • v0.4.6(Mar 9, 2022)

    (v0.4.5 was discarded due to an invalid version string; this is essentially the same as v0.4.5 but with the version string fixed.)

    Changes since v0.4.4:

    API changes

    • #298 : The wrap() methods, which take raw CUDA handles for events, devices, streams etc. and wrap them in, well, the library's wrapper objects (as opposed to otherwise getting/creating wrapper objects directly, with no raw handles) - are now out of the detail_:: namespace and part of the library's proper API.

    Bug fixes

    • #300: Was hiding some CUDA 11 stream-related features due to faulty runtime API version check.
    • #299: Now correctly copying stream properties.
    • #296: (Probably) fixed a Win64-to-Win32 cross-build compilation issue with callback function signatures.

    Note: Users' help is kindly requested in preparing for the next major release, which will cover both the runtime and the driver API, and NVRTC as well. See this branch and contact me / open relevant issues.

  • v0.4.4(Dec 23, 2021)

    Changes since v0.4.3:

    Bug fixes

    • Device-properties-related functions using baked-in data corrected for some compute capabilities.

    New functionality

    • #284 Introduced a grid-and-block-dimensions structure, grid::complete_dimensions_t
    • Additional variants of cuda::memory::set(), so that you may use either regions or plain pointers.
    • device_t::global_memory_t now has an associated_device() member.
    • #223, #272 Support for CUDA 11.0 stream attributes.
    • Added device_t::supports_block_cooperation().
    • Additional variants of cuda::memory::copy() for convenience.
    • #292: Device-properties-related functions requiring baked-in data now support Ampere GPUs (CC 8.0, 8.6).
    • #293: Some methods of compute_architecture_t are now available only for compute_capability_t, as it is no longer reasonable to rely on microarchitecture-generation-default values (e.g. amount of shared memory per block, number of in-flight threads per multiprocessor etc.)

    Changes to existing functionality

    • #280 Events and streams now have "handles" rather than "ids".
    • Partial revamp of the CUDA array wrapper classes (e.g. no templatization).
    • #258 Block "Cooperativity" is now part of the launch configuration, so fewer launch variants are necessary.
    • #250 Now offering const variants for both regular and mapped memory.
    • #269 Renamed cuda::device::resource_limit_t -> cuda::device::limit_t.
    • Support for GitHub workflows
    • #267 the NVTX library now depends on CUDA::nvToolsExt (which it should).
    • #268 Now exporting the requirement for the CUDAToolkit dependency.
    • cuda::runtime_error can now also be constructed from an r-value string reference, not just a constant l-value reference.
    • Removed some unnecessary explicit namespace specification in error.hpp.
    • Now using uniform parameter name in allocation functions
    • Renamed: array_t::associated_device() -> array_t::device().
    • #285, #289 Now using the wrap() idiom for constructing device_t's
    • #273: Added device-setter RAII objects to some asynchronous stream methods.
    • Rework of (global-memory) symbol handling: no more symbol_t type; functionality moved from cuda::memory:: to cuda::symbol::; and symbol location now accepts arguments of any type.

    Build mechanism

    • Avoid always re-determining CUDA architectures by minding the cache.
    • Fixed the CompileWithWarnings.cmake module to pass the appropriate flags to the appropriate executables (NVCC front-end vs. actual compiler, MSVC vs. GCC/clang)

    Other changes

    • Multiple cosmetic changes to avoid MSVC compilation warnings, e.g. explicit narrowing casts.
    • Example program changes, including utility headers.
    • Added a modified version of the CUDA sample program binaryPartitionCG.
    • Some internal changes to wrapper classes with no external interface change.
    • NVTX exception what() message fix.
    • #283 : Some wrapper identification string generator functions in detail_ subnamespaces.

    This version is known to work with CUDA versions up to 11.5; pre-11.0 CUDA versions are supported, but not tested routinely.

  • v0.4.3(Aug 20, 2021)

    Changes since v0.4.2:

    New functionality

    • Support for working with CUDA symbols.
    • Support for asynchronous memory allocation.
    • Classes for all memory regions - both managed and regular, both constant and non-constant memory (we used to have some of these only).

    Changes to existing functionality

    • launch_configuration_t is now constexpr.
    • Arguably better interface for the partially-existing managed memory region classes.
    • Pervasive use of regions as parameters to API functions involving memory: Copying, allocating, modifying attributes etc.
    • Renamed: no_shared_memory -> no_dynamic_shared_memory.

    Other changes

    • CMake-based build mechanism changes to rely on CMake 3.17 changes to CUDA support (no effect on the use of the library).
    • Replaced the internal detail namespaces with detail_, for libcu++ compatibility.
    • Dropped the FindCUDAAPIWrappers.cmake module.

    This version is known to work with CUDA versions up to 11.4 (but old CUDA versions are not routinely tested).

  • v0.4.2(Feb 24, 2021)

    This is a minor release, with mostly bug fixes and compatibility improvements. Other than in its version number, it is identical to 0.4.1, which was retracted due to a version numbering issue.

    Changes since 0.4:

    • Can now access all devices as a range: for(auto device : cuda::devices()) { /* etc. etc. */ }.
    • Wrapper classes (specifically, events and streams) now have non-owning copy constructors.
    • A stream priority range is now its own class.

    Bug fixes:

    • Dropped invalid stream-priority-related constant.
    • The device management test was getting the direction of priority ranges backwards.
    • The p2pBandwidthLatencyTest example program was failing with cross-device event wait attempts, due to calling wait() and record() on the wrong stream.
    • Removed a spurious template specifier in device.hpp
    • Can now construct cuda::launch_configuration_t from two integers with C++14 and later.

    Build, compatibility, usability:

    • CMake 3.18 and later no longer complain about the lack of a CUDA_ARCHITECTURES value.
    • Should now be compatible with MSVC 16.8 on Windows.
  • v0.4.1rc1(Feb 10, 2021)

    This is a minor release, with mostly bug fixes and compatibility improvements.

    Changes since 0.4:

    • Can now access all devices as a range: for(auto device : cuda::devices()) { /* etc. etc. */ }.
    • Wrapper classes (specifically, events and streams) now have non-owning copy constructors.
    • A stream priority range is now its own class.

    Bug fixes:

    • Dropped invalid stream-priority-related constant.
    • The device management test was getting the direction of priority ranges backwards.
    • The p2pBandwidthLatencyTest example program was failing with cross-device event wait attempts, due to calling wait() and record() on the wrong stream.
    • Removed a spurious template specifier in device.hpp
    • Can now construct cuda::launch_configuration_t from two integers with C++14 and later.

    Build, compatibility, usability:

    • CMake 3.18 and later no longer complain about the lack of a CUDA_ARCHITECTURES value.
    • Should now be compatible with MSVC 16.8 on Windows.
  • v0.4(Oct 14, 2020)

    Main changes since 0.3.3:

    • The runtime API wrappers are now a header-only library.
    • Split the NVTX wrappers and the Runtime API wrappers into two separate libraries.
    • Added several fundamental types which were implicit in previous versions: cuda::size_t, cuda::dimensionality_t.

    Minor API tweaks:

    • Renamed launch -> enqueue_launch
    • Can now schedule managed memory region attachment on streams
    • Now wrapping cudaMemAdvise() advice.
    • Array copying uses typed pointers
    • Added: A cuda::managed::device_side_pointer_for() standalone function
    • Added: A container facade for the sequence of all devices, so you can now write for (auto device : cuda::devices() ) { }.
    • De-templatized: device setter RAII class
    • Added: a freestanding cuda::synchronize() function instead of some wrapper methods
    • Moved some type definitions from inside device_t to the device:: namespace
    • Added: A subclass of memory::region_t for managed memory
    • Using memory::region_t in more API functions
    • Dropped cuda::kernel::maximum_dynamic_shared_memory_per_block().
    • Centralized the definitions of take_ownership and do_not_take_ownership
    • Made stream_t& parameters into const stream_t&, almost universally.

    Bug fixes:

    • Cross-device waiting on events
    • Error message fixes
    • Not assuming the uintNN_t types are in the default namespace

    Build, compatibility, usability:

    • Fix support for CMake 3.8 (CMakeLists.txt was using some post-3.8 features)
    • Clang-related:
      • Skipping examples which clang++ doesn't support yet (need
      • Only enabling separable compilation and CUDA
      • const-cast'ing const void* kernel function pointers before reinterpretation - clang won't let it otherwise
      • GNU extension dropped when compiling examples with CUDA (clang doesn't support this)
      • Fixed std::max() call issue
    • CMake targets depending on the wrappers should now have a C++11 language standard requirement for compilation
    • The wrappers now assert C++11 or later is used, instead of letting you just fail somewhere.
  • v0.3.3(Jul 20, 2020)

    This release includes both significant additions to the coverage by the wrappers, as well as major changes to the existing wrappers API.

    Main changes since 0.2.0:

    • Forget about numeric handles! The wrapper classes no longer take numeric handles as parameters, in methods exposed to the user. You'll be dealing with device_t's, event_t's, stream_t's etc. - not device::id_t, device::stream_t and device::event_t's.
    • Wrapper classes are no longer templated. That means, on the one hand, you don't have to worry about the template argument of "do we assume the wrapper's device is the current one?"; but on the other hand, every use of the wrapper will set the current device (even if it's already the right one). A lot of code was simplified or even removed thanks to this change.
    • device_function_t is now named kernel_t, as only kernels are acceptable by the CUDA Runtime API calls mentioning "device functions". Also, kernel_t's are now a pair of (kernel, device), as the settings which can be made for a kernel are mostly/entirely device-specific.
    • The examples CMakeLists.txt has been split off from the main CMakeFiles.txt and moved into a subdirectory, removing any dependencies it may have.
    • Kernel launching now uses perfect forwarding of all parameters.
    • The library is now almost completely header-only. The single exception to this rule is profiling-related code. If you don't use it - the library is header-only for you.
    • Changed my email address in the code...

    Main additions since 0.2.0:

    • 2D and 3D Array support.
    • 2D and 3D texture support.
    • A single set() and get() for all memory spaces.

    Plus a few bug fixes, and another example program from the CUDA samples.

    Changes from 0.3.0:

    • Fixed: Self-recursion in one of the memory allocation functions.
    • Fixed: Added missing inline specifiers to some functions
    • White space tweaks
  • release_0_2_0(Jan 20, 2020)

    This repository has not really needed "releases" so far:

    • We're gradually wrapping an API, with the underlying API changing occasionally - so breaking changes are made frequently.
    • The master branch is always the most stable and rounded-out version of the code one can use.

    However, with other code potentially starting to depend on this repository, and with the CMake scripts maturing somewhat (thanks goes to @codecircuit for the latter) - named/versioned releases start to make more sense, if only for referential convenience.

    Of course, there's the question of a versioning scheme. If we go with semantic versioning, we're going to be switching major version numbers all the time.

    For now, versions will be numbered as follows: A.B.C or A.B.C-string.

    • A is the major version number. It will increase with major changes to the library's overall functionality relative to the previous major version. What counts as major? If a whole lot of your host-side code has to change for it to work, then the library change is major.
    • B is the minor version number. It will increase with changes to the library's functionality - including its API; and unlike SemVer, this change is not necessarily an addition. The change may be rather big in terms of code, but not in terms of the fundamental use patterns.
    • C is a "patch" version number. These changes are for bugfixes and minor tweaks. They often don't affect the API at all - but they might, in some small, subtle way.

    Finally, why 0.2.0? Well, it's somewhat arbitrary; but the library has had its "core" functionality pretty stable for a while now, with quite a few users - so 0.1.0 feels a bit "premature", which this isn't. On the other hand, 1.0.0 is too presumptuous, since:

    • We don't have decent feature-test coverage of most of the library (though the examples cover a lot);
    • We don't have full, nor effectively-full, support of CUDA 9.x;
    • We don't have good enough unit test coverage.

    So 1.0.0 is a while off; enjoy 0.2.0 for now.

Owner

Eyal Rozenberg
Researcher-Developer, focusing on GPU computing and analytic DBMS architecture deeply integrating heterogeneous parallel processors.