Thin C++-flavored wrappers for the CUDA Runtime API

Overview

NVIDIA's Runtime API for CUDA is intended for use both in C and C++ code. As such, it uses a C-style API, the lowest common denominator (with a few notable exceptions of templated function overloads).

This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of C++ (including some C++11) for using the runtime API - but without reducing expressivity or increasing the level of abstraction (as in, e.g., the Thrust library). Using cuda-api-wrappers, you still have your devices, streams, events and so on - but they will be more convenient to work with in more C++-idiomatic ways.

Key features

  • All functions and methods throw exceptions on failure - no need to check return values (the exceptions carry the status information); see the sketch following this list.
  • Judicious namespacing (and some internal namespace-like classes) for better clarity and for semantically grouping related functionality together.
  • There are proxy/wrapper objects for devices, streams, events, kernels and so on, using RAII to relieve you of remembering to free or destroy resources.
  • You can mostly forget about numeric IDs and handles; the proxy classes will fit everywhere.
  • Various Plain Old Data structs adorned with convenience methods and operators.
  • Aims for clarity and straightforwardness in naming and semantics, so that you don't need to look concepts up in the official documentation to understand what each class and function does.
  • Thin and lightweight:
    • No work done behind your back, no caches or indices or any such thing.
    • No costly inheritance structure, vtables, virtual methods and so on - vanishes almost entirely on compilation.
    • Doesn't really "hide" any of CUDA's complexity or functionality; it only simplifies use of the Runtime API.
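
For instance, here is a minimal sketch of the exception-based error handling (assuming the cuda/runtime_api.hpp umbrella header mentioned in the release notes below, and using the device-memory allocation call demonstrated further down in this README; the absurd allocation size is just a way to force a failure):

#include <cuda/runtime_api.hpp>
#include <cstddef>
#include <iostream>

int main()
{
	try {
		auto device = cuda::device::current::get();
		// An absurdly-sized allocation will fail - and the wrapper
		// throws a cuda::runtime_error instead of returning a status:
		auto buffer = device.memory.allocate(std::size_t{1} << 48);
		(void) buffer;
	}
	catch (const cuda::runtime_error& e) {
		// The exception carries the failure status; what() describes it
		std::cerr << "CUDA error: " << e.what() << '\n';
	}
}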

Detailed documentation

Detailed, nearly-complete Doxygen-generated documentation is available.

Requirements

  • CUDA v8.0 or later is recommended; v7.5 should be supported (but is untested), and CUDA 6.x should probably be OK as well.
  • A C++11-capable compiler compatible with your version of CUDA.
  • CMake v3.8 or later - although most of the library will work as simple headers with no building.

Coverage of the Runtime API

Considering the list of runtime API modules, the library currently has the following coverage (w.r.t. CUDA 8.x):

  • Full coverage: Error Handling, Stream Management, Event Management, Version Management, Peer Device Memory Access, Occupancy, Unified Addressing
  • Almost full coverage: Device Management (no chooseDevice, cudaSetValidDevices), Memory Management, Execution Control (no support for working with parameter buffers)
  • Partial coverage: 2D & 3D Arrays, Texture Object Management, Texture Reference Management
  • (Deprecated module): Thread Management
  • No coverage: Graph Management, OpenGL Interoperability, Direct3D Interoperability, VDPAU Interoperability, EGL Interoperability, Graphics Interoperability, Surface Reference Management, Surface Object Management

The Milestones indicate some features which aren't covered and are slated for future work.

Since I am not currently working on anything graphics-related, there are no short-term plans to extend coverage to any of the graphics-related modules.

A taste of the key features in play

We've all dreamed of being able to type in:

my_stream.enqueue.callback(
	[&foo](cuda::stream_t stream, cuda::status_t status) {
		std::cout << "Hello " << foo << " world!\n";
	}
);

... and have that just work, right? Well, now it does!

On a slightly more serious note, though, let's demonstrate the principles listed above:

Use of namespaces (and internal classes)

With this library, you would do cuda::memory::host::allocate() instead of cudaMallocHost() and cuda::device_t::memory::allocate() instead of setting the current device and then cudaMalloc(). Note, though, that device_t::memory::allocate() is not a freestanding function but a method of an internal class, so a call to it might be cuda::device::get(my_device_id).memory.allocate(my_size). The compiled version of this supposedly complicated construct will be nothing but the sequence of cudaSetDevice() and cudaMalloc() calls.
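
In code, that might look as follows (a sketch using only the calls named above; return types are elided with auto, and freeing via the corresponding cuda::memory::...::free() functions is omitted for brevity):

#include <cuda/runtime_api.hpp>
#include <cstddef>

void allocate_both_ways(cuda::device::id_t my_device_id, std::size_t my_size)
{
	// Instead of cudaMallocHost() - pinned host memory:
	auto host_buffer = cuda::memory::host::allocate(my_size);

	// Instead of cudaSetDevice() followed by cudaMalloc():
	auto device_buffer = cuda::device::get(my_device_id).memory.allocate(my_size);

	// ... use the buffers; freeing is elided in this sketch
	(void) host_buffer; (void) device_buffer;
}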

Adorning POD structs with convenience methods

The expression

my_device.properties().compute_capability() >= cuda::make_compute_capability(50)

is a valid comparison, true for all devices with a Maxwell-or-later micro-architecture. This, despite the fact that struct cuda::compute_capability_t is a POD type with two unsigned integer fields, not a scalar. Note that struct cuda::device::properties_t (which is essentially the Runtime API's own struct cudaDeviceProp) does not itself have a compute_capability field.
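
Here is that check in a fuller (sketched) context, applied to the current device:

#include <cuda/runtime_api.hpp>
#include <iostream>

int main()
{
	auto my_device = cuda::device::current::get();
	// Two unsigned fields, yet comparable as though it were a scalar:
	if (my_device.properties().compute_capability() >= cuda::make_compute_capability(50)) {
		std::cout << "This device has a Maxwell-or-later micro-architecture\n";
	}
}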

Meaningful naming

Instead of using

cudaError_t cudaEventCreateWithFlags(
    cudaEvent_t* event, 
    unsigned int flags) 

which requires you to remember what you need to specify as flags and how, you create a cuda::event_t proxy object, using the function:

cuda::event_t cuda::event::create(
    cuda::device_t  device,
    bool            uses_blocking_sync,
    bool            records_timing      = cuda::event::do_record_timing,
    bool            interprocess        = cuda::event::not_interprocess)

The default values here are enum : bool constants, which you can also use yourself when creating events with non-default parameters - making the call more readable than with bare true or false.
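
For example (a sketch; the named constant for the blocking-sync flag is not shown in the signature above, so a plain true is passed for it):

auto my_device = cuda::device::current::get();
auto my_event = cuda::event::create(
	my_device,
	true,                            // uses_blocking_sync
	cuda::event::do_record_timing,   // clearer than a bare true
	cuda::event::not_interprocess);  // clearer than a bare false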

Example programs

More detailed documentation / a feature walk-through is forthcoming. For now, I'm providing two kinds of short example programs; browsing their source, you'll know essentially all there is to know about the API wrappers.

To build and run the examples (just as a sanity check), execute the following:

[user@host:/path/to/cuda-api-wrappers/]$ cmake -S . -B build -DBUILD_EXAMPLES=ON && cmake --build build/ && find build/examples/bin -exec "{}" ";"

Modified CUDA samples

The CUDA distribution contains sample programs demonstrating various features and concepts. A few of these - which are not focused on device-side work - have been adapted to use the API wrappers, completely foregoing direct use of the CUDA Runtime API itself. You will find them in the modified CUDA samples example programs folder.

'Coverage' test programs - by module of the Runtime API

Gradually, an example program is being added for each of the CUDA Runtime API modules, demonstrating how use of that module's API calls can be replaced with use of the API wrappers. These per-module example programs can be found here.

Bugs, suggestions, feedback

I would like some help with building up documentation and perhaps a Wiki here; if you can spare the time - do write me. You can also do so if you're interested in collaborating on some related project or for general comments/feedback/suggestions.

If you notice a specific issue which needs addressing, especially any sort of bug or compilation error, please file the issue here on GitHub.

Issues
  • Non-Portable Path to NVTX Headers [Windows]

    On Windows with CUDA 10.0 installed, the paths to the NVTX headers that are used in profiling.cpp need to be prepended with nvtx3 so that the full include path is nvtx3/nvToolsExt.h and nvtx3/nvToolsExtCudaRt.h. Not sure how to make a "portable" solution here other than just hacking together the include paths based on the OS.

    One would think that the include paths should be the same if everything is being found correctly via CMake.

    bug nvtx ms-windows 
    opened by DeveloperPaul123 20
  • Multi GPU support is broken from v0.5.4 to the current development branch

    Hi,

    The current multi GPU support does not work. It either hangs or causes a:

    terminate called after throwing an instance of 'cuda::runtime_error'
      what():  Synchronously copying data: an illegal memory access was encountered
    

    It seems to me that the problem is that data is not allocated on the correct device by cuda::memory::device::make_unique.

    example here: https://github.com/DiamonDinoia/test-cuda/tree/master/cuda-api-wrappers-multi-example

    Until version 0.5.0 this worked fine. Unfortunately, I cannot use that version anymore as systems were updated to cuda 11.8 which causes the ::std issue.

    Thanks, Marco

    opened by DiamonDinoia 19
  • examples fail on kepler GPU

    I'm trying to use this project but many examples fail on machines that have older GPUs. I have tried on a few machines with Tesla K40c, and one Titan V. Most tests passed on the Titan V except two, but many tests do not pass on K40c, including many flavors of VectorAdd. Could you please help? Here are my specs:

    --failed calculations and configurations-- Tesla K40c (compute capability 3.5) NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 cmake version 3.23.0 c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 failed tests: asyncAPI, binaryPartitionCG, event_management, execution_control, inlinePTX, p2pBandwidthLatencyTest, simpleIPC, simpleStreams, stream_management, vectorAdd, vectorAddManaged, vectorAddMapped, vectorAddMMAP

    --less failed calculations and configurations-- Titan V (compute capability 7.0) NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4
    cmake version 3.23.0 c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 failed tests: simpleStreams, vectorAddMMAP

    bug 
    opened by rainli323 19
  • cannot find -lcudadevrt -lcudart_static

    $ mkdir build && cd build
    
    $ cmake ..
    
    $ make
    ...
    [ 13%] Linking CXX executable examples/bin/unified_addressing
    /usr/bin/x86_64-linux-gnu-ld: cannot find -lcudadevrt
    /usr/bin/x86_64-linux-gnu-ld: cannot find -lcudart_static
    collect2: error: ld returned 1 exit status
    CMakeFiles/unified_addressing.dir/build.make:84: recipe for target 'examples/bin/unified_addressing' failed
    
    $ ls ~/.local/cuda-9.0/lib64 | grep cuda
    libcudadevrt.a
    libcudart.so
    libcudart.so.9.0
    libcudart.so.9.0.176
    libcudart_static.a
    

    I'm using CUDA 9.0 and g++ 6.5.0. Looks like it cannot find my customized CUDA lib folder. Any solutions for this?

    Also, some similar warnings are given when compiling:

    [ 11%] Building CXX object CMakeFiles/unified_addressing.dir/examples/by_runtime_api_module/unified_addressing.cpp.o                                  
    In file included from /home/user/cuda-api-wrappers/src/cuda/api/device.hpp:13:0,                                                                       
                     from /home/user/cuda-api-wrappers/examples/by_runtime_api_module/unified_addressing.cpp:15:                                           
    /home/user/cuda-api-wrappers/src/cuda/api/device_properties.hpp:99:13: warning: In the GNU C Library, "major" is defined                               
     by <sys/sysmacros.h>. For historical compatibility, it is                                                                                            
     currently defined by <sys/types.h> as well, but we plan to                                                                                           
     remove this soon. To use "major", include <sys/sysmacros.h>                                                                                          
     directly. If you did not intend to use a system-defined macro                                                                                        
     "major", you should undefine it after including <sys/types.h>.                                                                                       
      unsigned as_combined_number() const noexcept { return major() * 10 + minor_; }                                                                      
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                              
    
    opened by zingdle 19
  • Fix/compiling shared libs 76

    Note that I also had to add the following code in the CMakeLists.txt to get everything to compile (not including examples).

    set(CUDA_LIBRARIES "cudadevrt.lib;cudart.lib")
    target_link_libraries(${PROJECT_NAME} PUBLIC ${CUDA_LIBRARIES})
    

    Let me know if you have questions.

    Another thing to note is that the install command installs all the libraries and include files correctly, but the cmake export files aren't installed. Is this on purpose? I would consider installing these and also adding a cuda-api-wrappersConfig.cmake file as part of the install so that the installed version of the library can easily be used by other cmake based projects. This would also make it trivial to create installation packages for releases of your library if/when you decide to do that.

    opened by DeveloperPaul123 19
  • `make_unique` without device behaves strangely in the example `vectorAdd`

    My environment is

    • GCC 12.1.0
    • CMake 3.23.1
    • CUDA 11.6
    • cuda-api-wrappers latest commit
    • RTX 2070 Super
    • Linux 5.17.9

    In the example vectorAdd, if I change

    auto device = cuda::device::current::get();
    auto d_A = cuda::memory::device::make_unique<float[]>(device, numElements);
    auto d_B = cuda::memory::device::make_unique<float[]>(device, numElements);
    auto d_C = cuda::memory::device::make_unique<float[]>(device, numElements);
    

    to

    auto d_A = cuda::memory::device::make_unique<float[]>(numElements);
    auto d_B = cuda::memory::device::make_unique<float[]>(numElements);
    auto d_C = cuda::memory::device::make_unique<float[]>(numElements);
    

    I will get

    terminate called after throwing an instance of 'cuda::runtime_error'
      what():  Failed releasing the reference to the primary context for device 0: invalid device context
    

    Sometimes I get a SIGSEGV.

    I thought these two code pieces are equivalent.

    bug resolved-on-development 
    opened by QuarticCat 18
  • Bus error in `array_management.cu`

    Is device.synchronize(); missing in https://github.com/eyalroz/cuda-api-wrappers/blob/36c0fa021ceacfe96548636d3716913e009634b6/examples/other/array_management.cu#L162 ? If I do not add it, I get a bus error. But it is just a simple copy; I do not actually expect a synchronize here.

    bug resolved-on-development 
    opened by szsdk 17
  • Support for CUDA arrays and texture memory

    I think it would be useful to provide wrappers for using CUDA arrays and texture memory. I am going to implement initial functionality in the next two weeks. Maybe you can give me feedback on the implementation, such that I can later create a pull request. See here:

    https://github.com/codecircuit/cuda-api-wrappers

    Some ideas:

    
    // only support for these types is given by cudaChannelFormatDesc
    auto arr0 = cuda::array::create<float, 3>({w, h, d});
    // Is it inconvenient to give the number of dimensions as a template parameter?
    // Maybe this can be deduced from `w, h, d`
    auto arr1 = cuda::array::create<unsigned, 3>({w, h, d});
    auto arr2 = cuda::array::create<int, 3>({w, h, d});
    
    // only support for 2 and 3 dimensional arrays
    auto arr3 = cuda::array::create<float, 2>({w, h});
    auto arr4 = cuda::array::create<unsigned, 2>({w, h});
    auto arr5 = cuda::array::create<int, 2>({w, h});
    
    auto ptr = cuda::memory::device::make_unique<float[]>(device, w * h * d);
    
    // copy data into array
    cuda::memory::copy(arr0, ptr.get());
    // copy data into buffer
    cuda::memory::copy(ptr.get(), arr1);
    
    // Create a texture with reasonable defaults:
    //	texDesc.addressMode[0]   = cudaAddressModeBorder;
    //	texDesc.addressMode[1]   = cudaAddressModeBorder;
    //	texDesc.addressMode[2]   = cudaAddressModeBorder;
    //	texDesc.filterMode       = cudaFilterModePoint;
    //	texDesc.readMode         = cudaReadModeElementType;
    //	texDesc.normalizedCoords = 0;
    //
    // Create texture from an array
    auto texture1 = cuda::texture::create(arr0);
    // is this misleading, because texture1 is not owning memory?
    
    // use in the kernel
    cudaTextureObject_t raw = texture1.get(); // expose C-Object
    cuda::launch(kernel, {1,1}, texture1.get(), ...);
    
    enhancement resolved-on-development 
    opened by codecircuit 17
  • CMake 3.17 automatically adds Threads::Threads as interface library

    The cuda-api-wrappers library depends on PThreads, which is now handled as a separate target in CMake. Thus, if you just include cuda-api-wrappers in your own CMake project, CMake complains about the missing target Threads::Threads. This is not just a 'nice-to-have' dependency, because PThreads is used in the compiled part of the wrappers, not just in one stand-alone header.

    Helpful for that is the CMake module CMakeFindDependencyMacro. If you need help with this I can create a PR.

    bug nvtx 
    opened by codecircuit 16
  • v0.5.5 does not compile

    Hi, I am using gcc-11 and I am unable to compile with the following error:

    cudawrappers-src/src/cuda/api/launch_config_builder.hpp(158): error: namespace "cuda::detail_" has no member "poor_mans_optional"
    cudawrappers-src/src/cuda/api/launch_config_builder.hpp(158): error: expected a ";"
    cudawrappers-src/src/cuda/api/launch_config_builder.hpp(180): error: namespace "cuda::detail_" has no member "poor_mans_optional"
    cudawrappers-src/src/cuda/api/launch_config_builder.hpp(180): error: expected a ")"
    cudawrappers-src/src/cuda/api/launch_config_builder.hpp(182): error: identifier "maybe_id" is undefined
    bug resolved-on-development 
    opened by DiamonDinoia 15
  • Allow for obtaining architecture and compute capability information at compile-time

    https://github.com/eyalroz/cuda-api-wrappers/blob/b33726d4cd72760fbd9d06c370e5fd5ee59c7d31/src/cuda/api/device_properties.cpp#L24-L37

    For several things, we need these values known at compile time. I am not familiar with find_if, will it be able to find the value for the corresponding arch at compile time?

    This is necessary for simple and performance critical tasks, stuff like kernel launch/bound configs, static shared memory, etc.

    task resolved-on-development 
    opened by neoblizz 14
  • cuda::profiling::mark::range_start and range_end call create_attributions the wrong way

    The functions cuda::profiling::mark::range_start() and cuda::profiling::mark::range_end() call create_attributions(description, color), while:

    1. The function is within the detail_ sub-namespace.
    2. The function takes a color first and a description later
    bug resolved-on-development 
    opened by eyalroz 0
  • Accept "execution affinity" info when creating a context

    Recently (CUDA 12?), a new version of cuCtxCreate has emerged (v3), which takes CUexecAffinityParam* paramsArray, int numParams. Let's be willing to accept such information, e.g. in an optional<span>.

    opened by eyalroz 0
  • Support obtaining unique stream and context IDs

    CUDA 12 has added APIs to obtain unique stream and context IDs from user-provided objects:

    cuStreamGetId(CUstream hStream, unsigned long long *streamId)
    cuCtxGetId(CUcontext ctx, unsigned long long *ctxId)
    

    While I don't understand how these are even useful - I suppose we should expose them to users.

    opened by eyalroz 0
  • Support use of external semaphores

    CUDA supports the importation and use (waiting/signaling) of external semaphores; see here.

    We currently don't support this anywhere - neither the functions directly involving semaphores, nor the graph functionality involving them. We probably should.

    task 
    opened by eyalroz 0
  • Should we expose pointer-to-const versions of CUDA driver handles?

    Many of the CUDA driver handles of different entities are pointers-to-structs, e.g.:

    typedef struct CUctx_st *CUcontext;                          /**< CUDA context */
    typedef struct CUmod_st *CUmodule;                           /**< CUDA module */
    typedef struct CUfunc_st *CUfunction;                        /**< CUDA function */
    typedef struct CUarray_st *CUarray;                          /**< CUDA array */
    

    The thing is that if we use const CUcontext (or in our case, const cuda::context::handle_t) - we get a constant pointer to non-constant data. While the user should never be accessing this data directly - from the language's perspective, s/he very well might do so.

    Should we, therefore, peddle (also) in constant versions of these handles, e.g.:

    namespace cuda {
    namespace context {
    using handle_t       = CUctx_st *;
    using const_handle_t = CUctx_st const *;
    } // namespace context
    } // namespace cuda
    
    question 
    opened by eyalroz 0
  • Member initialisation order error.

    First, thanks for putting the time into making this library.

    I got an error when using cuda::memory::device::unique_ptr<T[]>

    cuda\api\array.hpp(160): error : the initialization of member "cuda::array_t<T, NumDimensions>::dimensions_" will be done before that of member "cuda::array_t<T, NumDimensions>::context_handle_"
    

    In version 0.6.1.

    Swapping the initialisation order in the constructor to match the member list fixes the issue.

    task resolved-on-development 
    opened by j-horner-c4x 3
Releases
  • v0.6.1(Dec 7, 2022)

    Bug fixes

    • #442 Changed a no-longer-valid use of link::input_type_t in link.hpp which was triggering an error when building with C++17.
    • #438 Corrected the make_cuda_host_alloc_flags() function, which was bitwise-AND-ing instead of bitwise-OR-ing.

    Other changes

    • #441 kernel_t::context() now uses wrap() and is noexcept
    • #436, #437 Now respecting the CUDA_NO_HALF preprocessor define: when it is defined, half-precision-related code is neither defined nor included.
  • v0.6(Oct 8, 2022)

    PTX Compilation library

    This version introduces a single major change: a PTX compilation library.

    Note: The CUDA driver already supports compilation of PTX code, but it has limited support for various compilation options; plus, it requires a driver to be loaded, i.e. it requires kernel involvement and a GPU on your system. This library does not.

    Value-vs-reference issues

    • #430 : Now passing kernel-like objects by reference rather than by value where relevant in the kernel launch wrapper functions.
    • #433 : Now passing program name by value rather than by reference.

    Other changes

    • #431 : The NVTX wrappers no longer depend on a thread support library
    • #436 : The wrapper library now respects CUDA_NO_HALF, for when you want to avoid CUDA defining the half type
    • #432 : Removed some std:: namespace qualifications which should have been ::std:: and had snuck into the codebase recently (unqualified std:: causes trouble with NVIDIA's cuda::std namespace).
    • #435 : Updated static data tables for the Ampere/Lovelace (8.x) and Hopper architectures.
  • v0.5.6(Oct 8, 2022)

    New functionality

    • #423 Add an implementation of the surface and texture reference getters for modules (getting raw references, not corresponding wrapper classes for these objects, which this library does not currently offer)

    C++14-and-later compatibility fixes

    • #415: Resolved incompatibility of std::optional/std::experimental::optional with the internal poor_mans_optional
    • #416: corrected placement of inclusion of std::experimental::optional

    Other changes

    • #428, #429 : Minor fixes and tweaks to CUDA array code (via the cuda::array_t class template)
    • #427, #406 : Stream and Event wrapper class instances are now non-copyable (you need to either move them or pass references/pointers to them)
    • #425, #426: Error and exception handling improvements (with a slight performance benefit)
    • #424 : Link options now passed by const-reference, not by value
    • #411: Add :: prefix to occurrences of std:: (which snuck in again in recent versions; these potentially clash with NVIDIA's standard library constructs)
    • #413: Added missing intra-library #include directives which were masked when including all APIs, but not when including individual headers. Also, removed inappropriate inline decorators from declaration-only lines
    • #420: Internal renaming
    • #417: Internal placement of functionality in header files (files in cuda/api/ vs in cuda/api/multi_wrapper_impls).
    • #412: bandwidthtest now includes <iostream> on its own
    • #409: Moved pci_id_impl.hpp into the detail/ subfolder (and renamed it)
  • v0.5.5(Sep 10, 2022)

    Run-time compilation functionality

    • #397 : The NVRTC compilation options class now supports passing extra options to PTXAS, and also supports --dopt
    • #403 : The program builder class can now accept named header additions using std::string's for the name and/or header source (rather than only C-style const char* strings).

    Bug fixes

    • #396 : scoped_existence_ensurer_t, the gadget for ensuring there is some current context (regardless of which one), will now make sure the driver has been initialized.
    • #395 : Can now start profiling with our NVTX component even if the driver has not yet been initialized.

    Other changes

    • #400 : Added an alias for waiting/synchronizing on an event: You can now execute cuda::wait(my_event), not just cuda::synchronize(my_event).
    • #399 : time_elapsed_between() can now accept std::pair's of events.
    • #398 : Added another example program, the CUDA sample bandwidthtest
    • #401 : Made all stream enqueuing methods const (so you can now enqueue on a stream passed by const-reference).
    • #404 : Can now construct grid::overall_dimensions_t from a dim3 object, so that they're more interoperable with CUDA-related values you obtained elsewhere.
  • v0.5.4(Aug 19, 2022)

    This is a bug fix release over v0.5.3 (see its release notes).

    Build-related fixes

    • #392 Made the NVTX and NVRTC wrappers usable in multiple translation units within the same executable
    • #393 Made the NVTX dependency on libdl (on Linux) explicit

    Other changes

    • #394 Avoiding redundant cuInit() call when getting a device's name
  • v0.5.3(Jul 26, 2022)

    Runtime program compilation (NVRTC) improvements

    • #379: Can get the compilation log, PTX, cubin or NVVM in a user-provided rather than self-allocated buffer
    • #388: A builder interface for NVRTC programs
    • #386: Add support for nvrtcGetSupportedArchs()
    • #375: Support adding arbitrary options when dynamically compiling a CUDA program
    • #265: Support for diag-suppress/error/warn compilation options

    Runtime-compilation-related Bug fixes

    • #391: Fix for a CUDA 10.0 support regression
    • #384: Make nvrtc depend on runtime-and-driver
    • #376: When rendering compilation options to a string, we get an extra space
    • #378: Compilation log vector contains trailing '\0'
    • #387: nvrtc.h included in wrong file

    Other changes

    • #390: Avoiding a memory leak when getting a CUDA device's name
    • #248: Support asynchronous memory allocation (in v0.5.2 we only had allocation, no freeing)

    Caveats

    Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.

  • v0.5.2(Jun 18, 2022)

    Full MS Windows support is restored in this version (AFAICT). Also worked out some kinks and polished a few interfaces.

    Bug fixes

    • #330, #369, #372 Corrected some launch_config_builder logic bugs.
    • #368 Fixed an accidental primary context deactivation in p2pBandwidthLatencyTest
    • #360 Was missing an implementation of context_t::create_event()
    • #357 All assignment operators updated to appropriately handle primary context reference unit propagation
    • #351 Fixed a typo in Windows-target-only code
    • #335 Redundant 0x in error messages
    • #329 marshalled_options.hpp errors with C++17
    • #324 marshalled_options.hpp needs cuda::span, but doesn't see it
    • #325 nvrtc/compilation_options.hpp needs to know about device_t

    Windows compatibility

    • #345 Avoid non-portable assumptions regarding thread handles in vectorAdd_profiled
    • #344 Workaround for an MSVC SFINAE error with std::iterator_traits<Iter>
    • #343 std::experimental::filesystem not properly supported on Windows
    • #342 Don't try to use mkstemp on Windows
    • #341 Avoid size_t <-> unsigned overload clash on Windows
    • #340 Apply the CUDA_CB decoration to shared memory size-determiner function - it's actually necessary on Windows
    • #339 Avoid some MSVC compiler warnings
    • #338 Added missing inclusions to have Windows NT HANDLE defined
    • #337 Support for MSVC's standard-incompliant __cplusplus value
    • #347 Using ::std:: rather than std::, to avoid clashes with NVIDIA's libcustd - that is included by default by CUDA 11.7's nvcc.

    Interface tweaks

    RTC compilation options

    • #364 marshal() and render() are now stand-alone functions.
    • #363 Can now render compilation options to an ::std::string (in case you want to save/print them)
    • #362 Add a clear_language_dialect() to rtc::compilation_options_t
    • #361 If an rtc::compilation_options_t is asked to set the language dialect to an empty or null string - unset it instead
    • #355 Support taking the C++ language dialect as an ::std::string, not just a C-style string.

    Other classes

    • #365 module::get_kernel() can now take an ::std::string
    • #359 Now exposing the interface for enqueuing kernels with type-erased arguments, passed via an array of void* (so far, you could only enqueue when you passed the parameter types).
    • #356 (Almost) all proxy classes are now move-assignable and move-constructible, but not copy-assignable or copy-constructible. Move them or use const-refs.
    • #358 link_t should have a device_id()

    Miscellaneous and internal issues

    • #367 Avoiding a redundant scoped context setting when enqueuing a kernel
    • #366 Spruced up CUDA_DEVICE_OR_THIS_SCOPE() and CUDA_CONTEXT_FOR_THIS_SCOPE()
    • #353 Added missing PCI function initializer to the PCI location wrappers class.
    • #352 Simplified the options marshalling code
    • #349 Prefix CMake options with CAW_, for use as a subproject (e.g. FetchContent)
    • #346 Fix CUDA installation in GitHub action scripts
    • #326 Drop redundant inclusions and make include order more "challenging" in vectorAdd examples
    • #328 Reduce gratuitous API calls in current_device::detail::set()
    • #331 Can now load a module from file into any context, not just the current context
    • #334 Reduce the number of redundant informative API calls
    • #333 Don't treat freeing in a destroyed context as an error
    • #303 Use CUDA_VERSION instead of CUDART_VERSION
    • #370 cuda::context::current::exists() now returns false, rather than throwing, if the CUDA driver has not been initialized
    • #373 In Debug builds, now validating launch configuration grid dimensions before enqueueing/launching anything (as CUDA tends to fail silently, e.g. for empty grids)

    Caveats

    Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.

  • v0.5.1(May 9, 2022)

    Build mechanism

    • #307 The library is now entirely header-only (the NVTX wrappers, which used to be compiled, are now all within headers).

    New supported features

    • #308 Supporting both narrow/regular and wide character inputs for NVRTC compilation.
    • #309 Support for naming streams, devices and events with NVTX

    Concepts/facilities introduced

    • #311 A Builder-pattern class for building launch configurations more easily.

    Compatibility

    • #304 : Now compatible with all CUDA versions between 9.0 and 11.6

    Bug fixes

    • #320 No longer getting an error message about module::create() when including only runtime_api.hpp.
    • #317 No longer "leaking" references to device primary contexts which made them never be destroyed after some point. Fixing this exposed a few other latent issues involving non-existence of primary contexts: #316.
    • #314 No longer failing to enqueue events when there is no current context.
    • #305, #306 :
      • Added missing named errors to cuda::status
      • Now using driver error codes wherever applicable (they only started to coincide with Runtime API error codes in a recent CUDA version)
      • Renamed mis-named error: cuda::status::not_ready -> cuda::status::async_operations_not_yet_completed.
    • #315 In one of the example programs, we were launching a kernel on the current device rather than the one the user had chosen.

    Miscellaneous and internal issues

    • #310 NVTX wrapper now uses driver-API-style
    • #303 Using CUDA_VERSION instead of CUDA_RT_VERSION where relevant.
    • #320 Added an example program only explicitly including runtime-API-related headers.
    • #321 Weakened the requirement on kernel parameter types from TriviallyCopyable to just trivially copy-constructible.

    Caveats

    Windows support is partially broken in this version.

  • v0.5.0(Feb 19, 2022)

    This is a near-complete under-the-hood rewrite of the API wrappers library, while maintaining its existing API almost entirely: The library now primarily relies on CUDA Driver API calls, with Runtime API calls used only where the driver does not straightforwardly provide the same functionality.

    If you are only interested in the Runtime API, you may wish to use the latest 0.4.x release. At the moment, that is 0.4.7.

    Fundamental feature set additions

    • #9 Driver API support
    • #228, #262 : NVRTC support

    Wrapper classes introduced

    • Contexts: context_t.
    • Dynamically vs. statically compiled kernels: kernel_t and apriori_compiled_kernel_t
    • Device primary contexts: device::primary_context_t
    • link_t: Linking together compiled code to satisfy symbol definition requirements and complete executables.
    • link_options_t defining options for linking.
    • Virtual memory: physical_allocation_t, address_range_reservation_t and mapping_t between pairs of the former.
    • Modules: module_t, made up of compiled binary/PTX code - functions, global symbols etc - which may be loaded into contexts

    and via NVRTC support:

    • Programs: rtc::program_t, made up of CUDA or PTX source code.
    • Compilation options, rtc::compilation_options_t defining options for compiling programs.

    (All of the classes above are under the cuda:: namespace)

    Concepts/facilities introduced

    • Treatment of the primary context as a context and its creation or destruction
    • The context stack
    • The current context
    • Waiting on the value of a scalar in global device memory
    • Access by specific contexts to specific contexts of peer devices

    Caveats

    Windows support is partially broken in this version.

  • v0.4.7(Mar 12, 2022)

    This version has very few changes relative to v0.4.6. These are:

    Bug fixes

    • #301 : Now ensuring launch configurations can be assigned to each other.

    Note: Users' help is kindly requested in preparing for the next major release, which will cover both the runtime and driver APIs, as well as NVRTC. See this branch and contact me / open relevant issues.

  • v0.4.6(Mar 9, 2022)

    (v0.4.5 was discarded due to an invalid version string; this is essentially the same as v0.4.5 but with the version string fixed.)

    Changes since v0.4.4:

    API changes

    • #298 : The wrap() methods, which take raw CUDA handles for events, devices, streams etc. and wrap them in, well, the library's wrapper objects (as opposed to otherwise getting/creating wrapper objects directly, with no raw handles), are now out of the detail_:: namespace and part of the library's proper API.

    Bug fixes

    • #300: Was hiding some CUDA 11 stream-related features due to faulty runtime API version check.
    • #299: Now correctly copying stream properties.
    • #296: (Probably) fixed a Win64-to-Win32 cross-build compilation issue with callback function signatures.

    Note: Users' help is kindly requested in preparing for the next major release, which will cover both the runtime and driver APIs, as well as NVRTC. See this branch and contact me / open relevant issues.

  • v0.4.4(Dec 23, 2021)

    Changes since v0.4.3:

    Bug fixes

    • Device-properties-related functions using baked-in data corrected for some compute capabilities.

    New functionality

    • #284 Introduced a grid-and-block-dimensions structure, grid::complete_dimensions_t
    • Additional variants of cuda::memory::set(), so that you may use either regions or plain pointers.
    • device_t::global_memory_t now has an associated_device() member.
    • #223, #272 Support for CUDA 11.0 stream attributes.
    • Added device_t::supports_block_cooperation().
    • Additional variants of cuda::memory::copy() for convenience.
    • #292: Device-properties-related functions requiring baked-in data now support Ampere GPUs (CC 8.0, 8.6).
    • #293: Some methods of compute_architecture_t are now available only for compute_capability_t, as it is no longer reasonable to rely on microarchitecture-generation-default values (e.g. amount of shared memory per block, number of in-flight threads per multiprocessor etc.)

    Changes to existing functionality

    • #280 Events and streams now have "handles" rather than "ids".
    • Partial revamp of the CUDA array wrapper classes (e.g. no templatization).
    • #258 Block "Cooperativity" is now part of the launch configuration, so fewer launch variants are necessary.
    • #250 Now offering const variants for both regular and mapped memory.
    • #269 Renamed cuda::device::resource_limit_t -> cuda::device::limit_t.
    • Support for GitHub workflows
    • #267 the NVTX library now depends on CUDA::nvToolsExt (which it should).
    • #268 Now exporting the requirement for the CUDAToolkit dependency.
    • cuda::runtime_error can now be constructed also using an r-value string reference, not just a constant l-value reference.
    • Removed some unnecessary explicit namespace specification in error.hpp.
    • Now using uniform parameter name in allocation functions
    • Renamed: array_t::associated_device() -> array_t::device().
    • #285, #289 Now using the wrap() idiom for constructing device_t's
    • #273: Added device-setter RAII objects to some asynchronous stream methods.
    • Rework of (global-memory) symbol handling: No more symbol_t type; functionality moved from cuda::memory:: to cuda::symbol::; and now willing to locate any-type-argument.

    Build mechanism

    • Avoid always re-determining CUDA architectures by minding the cache.
    • Fixed the CompileWithWarnings.cmake module to pass the appropriate flags to the appropriate executables (NVCC front-end vs. actual compiler, MSVC vs. GCC/clang)

    Other changes

    • Multiple cosmetic changes to avoid MSVC compilation warnings, e.g. explicit narrowing casts.
    • Example program changes, including utility headers.
    • Added a modified version of the CUDA sample program binaryPartitionCG.
    • Some internal changes to wrapper classes with no external interface change.
    • NVTX exception what() message fix.
    • #283 : Some wrapper identification string generator functions in detail_ subnamespaces.

    This version is known to work with CUDA versions up to 11.5; pre-11.0 CUDA versions are supported, but not routinely tested.

  • v0.4.3(Aug 20, 2021)

    Changes since v0.4.2:

    New functionality

    • Support for working with CUDA symbols.
    • Support for asynchronous memory allocation.
    • Classes for all memory regions - both managed and regular, both constant and non-constant memory (we used to have some of these only).

    Changes to existing functionality

    • launch_configuration_t is now constexpr.
    • Arguably better interface for the partially-existing managed memory region classes.
    • Pervasive use of regions as parameters to API functions involving memory: Copying, allocating, modifying attributes etc.
    • Renamed: no_shared_memory -> no_dynamic_shared_memory.

    Other changes

    • CMake-based build mechanism changes to rely on CMake 3.17 changes to CUDA support (no effect on the use of the library).
    • Replaced the internal detail namespaces with detail_, for libcu++ compatibility.
    • Dropped the FindCUDAAPIWrappers.cmake module.

    This version is known to work with CUDA versions up to 11.4 (but old CUDA versions are not routinely tested).

  • v0.4.2(Feb 24, 2021)

    This is a minor release, with mostly bug fixes and compatibility improvements. Other than in its version number, it is identical to 0.4.1, which was retracted due to a version numbering issue.

    Changes since 0.4:

    • Can now access all devices as a range: for(auto device : cuda::devices()) { /* etc. etc. */ }.
    • Wrapper classes (specifically, events and streams) now have non-owning copy constructors.
    • A stream priority range is now its own class.

    Bug fixes:

    • Dropped invalid stream-priority-related constant.
    • The device management test was getting the direction of priority ranges backwards.
    • The p2pBandwidthLatencyTest example program was failing with cross-device event wait attempts, due to calling wait() and record() on the wrong stream.
    • Removed a spurious template specifier in device.hpp
    • Can now construct cuda::launch_configuration_t from two integers with C++14 and later.

    Build, compatibility, usability:

    • CMake 3.18 and later no longer complain about the lack of a CUDA_ARCHITECTURES value.
    • Should now be compatible with MSVC 16.8 on Windows.
  • v0.4.1rc1(Feb 10, 2021)

    This is a minor release, with mostly bug fixes and compatibility improvements.

    Changes since 0.4:

    • Can now access all devices as a range: for(auto device : cuda::devices()) { /* etc. etc. */ }.
    • Wrapper classes (specifically, events and streams) now have non-owning copy constructors.
    • A stream priority range is now its own class.

    Bug fixes:

    • Dropped invalid stream-priority-related constant.
    • The device management test was getting the direction of priority ranges backwards.
    • The p2pBandwidthLatencyTest example program was failing with cross-device event wait attempts, due to calling wait() and record() on the wrong stream.
    • Removed a spurious template specifier in device.hpp
    • Can now construct cuda::launch_configuration_t from two integers with C++14 and later.

    Build, compatibility, usability:

    • CMake 3.18 and later no longer complain about the lack of a CUDA_ARCHITECTURES value.
    • Should now be compatible with MSVC 16.8 on Windows.
  • v0.4(Oct 14, 2020)

    Main changes since 0.3.3:

    • The runtime API wrappers are now a header-only library.
    • Split the NVTX wrappers and the Runtime API wrappers into two separate libraries.
    • Added several fundamental types which were implicit in previous versions: cuda::size_t, cuda::dimensionality_t.

    Minor API tweaks:

    • Renamed launch -> enqueue_launch
    • Can now schedule managed memory region attachment on streams
    • Now wrapping cudaMemAdvise() advice.
    • Array copying uses typed pointers
    • Added: A cuda::managed::device_side_pointer_for() standalone function
    • Added: A container facade for the sequence of all devices, so you can now write for (auto device : cuda::devices() ) { }.
    • De-templatized: device setter RAII class
    • Added: a freestanding cuda::synchronize() function instead of some wrapper methods
    • Moved some type definitions from inside device_t to the device:: namespace
    • Added: A subclass of memory::region_t for managed memory
    • Using memory::region_t in more API functions
    • Dropped cuda::kernel::maximum_dynamic_shared_memory_per_block().
    • Centralized the definitions of take_ownership and do_not_take_ownership
    • Made stream_t& parameters into const stream_t&, almost universally.

    Bug fixes:

    • Cross-device waiting on events
    • Error message fixes
    • Not assuming the uintNN_t types are in the default namespace

    Build, compatibility, usability:

    • Fix support for CMake 3.8 (CMakeLists.txt was using some post-3.8 features)
    • Clang-related:
      • Skipping examples which clang++ doesn't support yet
      • Only enabling separable compilation and CUDA
      • const-cast'ing const void * kernel function pointers before reinterpretation - clang won't let it otherwise
      • GNU extension dropped when compiling examples with CUDA (clang doesn't support this)
      • Fixed std::max() call issue
    • CMake targets depending on the wrappers should now have a C++11 language standard requirement for compilation
    • The wrappers now assert C++11 or later is used, instead of letting you just fail somewhere.
  • v0.3.3(Jul 20, 2020)

    This release includes both significant additions to the coverage by the wrappers, as well as major changes to the existing wrappers API.

    Main changes since 0.2.0:

    • Forget about numeric handles! The wrapper classes no longer take numeric handles as parameters, in methods exposed to the user. You'll be dealing with device_t's, event_t's, stream_t's etc. - not device::id_t, device::stream_t and device::event_t's.
    • Wrapper classes are no longer templated. That means, on one hand, you don't have to worry about the template argument of "do we assume the wrapper's device is the current one?"; but on the other hand, every use of the wrapper will set the current device (even if it's already the right one). A lot of code was simplified or even removed thanks to this change.
    • device_function_t is now named kernel_t, as only kernels are acceptable by the CUDA Runtime API calls mentioning "device functions". Also, kernel_t's are now a pair of (kernel, device), as the settings which can be made for a kernel are mostly/entirely device-specific.
    • The examples CMakeLists.txt has been split off from the main CMakeLists.txt and moved into a subdirectory, removing any dependencies it may have.
    • Kernel launching now uses perfect forwarding of all parameters.
    • The library is now almost completely header-only. The single exception to this rule is profiling-related code. If you don't use it - the library is header-only for you.
    • Changed my email address in the code...

    Main additions since 0.2.0:

    • 2D and 3D Array support.
    • 2D and 3D texture support.
    • A single set() and get() for all memory spaces.

    Plus a few bug fixes, and another example program from the CUDA samples.

    Changes from 0.3.0:

    • Fixed: Self-recursion in one of the memory allocation functions.
    • Fixed: Added missing inline specifiers to some functions
    • White space tweaks
  • release_0_2_0(Jan 20, 2020)

    This repository has not really needed "releases" so far:

    • We're gradually wrapping an API, with the underlying API changing occasionally - so breaking changes are made frequently.
    • The master branch is always the most stable and rounded-out version of the code one can use.

    However, with other code potentially starting to depend on this repository, and with the CMake scripts maturing somewhat (thanks goes to @codecircuit for the latter) - named/versioned releases start to make more sense, if only for referential convenience.

    Of course, there's the question of a versioning scheme. If we go with semantic versioning, we're going to be switching major version numbers all the time.

    For now, versions will be numbered as follows: A.B.C or A.B.C-string.

    • A is the major version number. It will increase with major changes to the library's overall functionality relative to the previous major version. What counts as major? If a whole lot of your host-side code has to change for it to work, then the library change is major.
    • B is the minor version number. It will increase with changes to the library's functionality - including its API; unlike SemVer, such a change is not necessarily an addition. The change may be rather big in terms of code, but not in terms of the fundamental use patterns.
    • C is a "patch" version number. These changes are for bugfixes and minor tweaks. They often don't affect the API at all - but they might, in some small, subtle way.

    Finally, why 0.2.0? Well, it's somewhat arbitrary; but the library's "core" functionality has been pretty stable for a while now, with quite a few users; so 0.1.0 feels a bit "premature", which this isn't. On the other hand, 1.0.0 is too presumptuous, since:

    • We don't have decent feature-test coverage of most of the library (though the examples cover a lot);
    • We don't have full, nor effectively-full, support of CUDA 9.x;
    • We don't have good enough unit test coverage.

    So 1.0.0 is a while off; enjoy 0.2.0 for now.
