Multi-backend implementation of SYCL for CPUs and GPUs

Overview

Project logo

hipSYCL - a SYCL implementation for CPUs and GPUs

hipSYCL is a modern SYCL implementation targeting CPUs and GPUs, with a focus on leveraging existing toolchains such as CUDA or HIP. hipSYCL currently targets the following devices:

  • Any CPU via OpenMP
  • NVIDIA GPUs via CUDA
  • AMD GPUs via HIP/ROCm
  • Intel GPUs via oneAPI Level Zero and SPIR-V (highly experimental and WIP!)

hipSYCL supports compiling source files into a single binary that can run on all these backends when building against appropriate clang distributions. More information about the compilation flow can be found here.
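When built this way, a single invocation can embed device code for several backends at once. A hypothetical invocation might look as follows (sketch only; the `--hipsycl-targets` list syntax is the one used by recent hipSYCL versions, and the file names are made up):

```shell
# Sketch only, echoed rather than executed so it runs without a hipSYCL
# install: one binary carrying CPU (OpenMP) and CUDA device code.
echo 'syclcc -O3 --hipsycl-targets="omp;cuda:sm_70" -o app app.cpp'
```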

The runtime architecture of hipSYCL consists of the main library hipSYCL-rt, as well as independent, modular plugin libraries for the individual backends:

(Runtime architecture diagram)

hipSYCL's compilation and runtime design allows it to effectively aggregate multiple toolchains that are otherwise incompatible, making them accessible through a single SYCL interface.

The philosophy behind hipSYCL is to leverage such existing toolchains as much as possible. This not only brings maintenance and stability advantages, but by design enables performance on par with those established toolchains, and also allows for maximum interoperability with existing compute platforms. For example, the hipSYCL CUDA and ROCm backends rely on the clang CUDA/HIP frontends, which hipSYCL augments to additionally understand SYCL code. This means that the hipSYCL compiler can compile not only SYCL code but also CUDA/HIP code, even when they are mixed in the same source file, making all CUDA/HIP features - such as the latest device intrinsics - available from SYCL code as well (details). Additionally, vendor-optimized template libraries such as rocPRIM or CUB can be used with hipSYCL. Consequently, hipSYCL allows for highly optimized code paths in SYCL code for specific devices.

Because a SYCL program compiled with hipSYCL looks just like any other CUDA or HIP program to vendor-provided software, vendor tools such as profilers or debuggers also work well with hipSYCL.

The following image illustrates how hipSYCL fits into the wider SYCL implementation ecosystem:

(SYCL implementation ecosystem diagram)

About the project

While hipSYCL started its life as a hobby project, development is now led and funded by Heidelberg University. hipSYCL not only serves as a research platform, but is also a solution used in production on machines of all scales, including some of the most powerful supercomputers.

Contributing to hipSYCL

We encourage contributions and are looking forward to your pull request! Please have a look at CONTRIBUTING.md. If you need any guidance, please just open an issue and we will get back to you shortly.

If you are a student at Heidelberg University and wish to work on hipSYCL, please get in touch with us. There are various options possible and we are happy to include you in the project :-)

Citing hipSYCL

hipSYCL is a research project. As such, if you use hipSYCL in your research, we kindly request that you cite:

Aksel Alpay and Vincent Heuveline. 2020. SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 8, 1. DOI:https://doi.org/10.1145/3388333.3388658

(This is a talk that is available online. Note that some of its content is now outdated.)

Acknowledgements

We gratefully acknowledge contributions from the community.

Performance

hipSYCL has been repeatedly shown to deliver very competitive performance compared to other SYCL implementations or proprietary solutions like CUDA. See for example:

  • Sohan Lal, Aksel Alpay, Philip Salzmann, Biagio Cosenza, Nicolai Stawinoga, Peter Thoman, Thomas Fahringer, and Vincent Heuveline. 2020. SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 10, 1. DOI:https://doi.org/10.1145/3388333.3388669
  • Brian Homerding and John Tramm. 2020. Evaluating the Performance of the hipSYCL Toolchain for HPC Kernels on NVIDIA V100 GPUs. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 16, 1–7. DOI:https://doi.org/10.1145/3388333.3388660
  • Tom Deakin and Simon McIntosh-Smith. 2020. Evaluating the performance of HPC-style SYCL applications. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 12, 1–11. DOI:https://doi.org/10.1145/3388333.3388643

Benchmarking hipSYCL

When targeting the CUDA or HIP backends, hipSYCL just massages the AST slightly to get clang -x cuda and clang -x hip to accept SYCL code. hipSYCL is not involved in the actual code generation. Therefore any significant deviation in kernel performance compared to clang-compiled CUDA or clang-compiled HIP is unexpected.

As a consequence, when comparing hipSYCL to other LLVM-based compilers, please make sure to compile hipSYCL against the same LLVM version. Otherwise you would effectively just be comparing the performance of two different LLVM versions. This is particularly true when comparing against clang CUDA or clang HIP.

Current state

hipSYCL is not yet a fully conformant SYCL implementation, although many SYCL programs already work with hipSYCL.

Hardware and operating system support

Supported hardware:

  • Any CPU for which a C++17 OpenMP compiler exists
  • NVIDIA CUDA GPUs. Note that clang, which hipSYCL relies on, may not always support the very latest CUDA version, which can sometimes impact support for brand-new hardware. See the clang documentation for more details.
  • AMD GPUs that are supported by ROCm

Operating system support currently strongly focuses on Linux. On Mac, only the CPU backend is expected to work. Windows support with CPU and CUDA backends is experimental, see Using hipSYCL on Windows.

Installing and using hipSYCL

In order to compile software with hipSYCL, use syclcc, which automatically adds all required compiler arguments to the CUDA/HIP compiler. syclcc can be used like a regular compiler, i.e. you can use syclcc -o test test.cpp to compile your SYCL application test.cpp into an executable named test.

syclcc accepts both command line arguments and environment variables to configure its behavior (e.g., to select whether to compile for CUDA, ROCm, or the CPU). See syclcc --help for a comprehensive list of options.

When targeting a GPU, you will need to provide a target GPU architecture. The expected formats are defined by clang CUDA/HIP. Examples:

  • sm_52: NVIDIA Maxwell GPUs
  • sm_60: NVIDIA Pascal GPUs
  • sm_70: NVIDIA Volta GPUs
  • gfx900: AMD Vega 10 GPUs
  • gfx906: AMD Vega 20 GPUs
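Putting the pieces together, representative invocations might look as follows (sketch only; syclcc must be on PATH after installing hipSYCL, and the flags are the `--hipsycl-platform` / `--hipsycl-gpu-arch` forms used elsewhere in this document):

```shell
# Sketch only: the commands are echoed rather than executed, so this
# snippet runs even without a hipSYCL installation.
echo "syclcc --hipsycl-platform=cpu -O3 test.cpp -o test-cpu"
echo "syclcc --hipsycl-platform=cuda --hipsycl-gpu-arch=sm_70 -O3 test.cpp -o test-cuda"
echo "syclcc --hipsycl-platform=rocm --hipsycl-gpu-arch=gfx906 -O3 test.cpp -o test-rocm"
```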

The full documentation of syclcc and hints for the CMake integration can be found in using hipSYCL.

Documentation

Comments
  • ROCm backend - build instructions

    ROCm backend - build instructions

    Hi,

    to build hipSYCL with the ROCm backend, the instructions say that the "amd-common" branch of llvm/clang/lld from AMD should be used (Link). But currently this is llvm/clang/lld version 10 (Link).

    And according to the "CMakeLists.txt" of hipSYCL it supports only llvm/clang/lld up to version 9.

    Shouldn't it be the "roc-ocl-2.7x" branch instead of "amd-common"?

    opened by justxi 75
  • Fix cmake config file generation to use correct function

    Fix cmake config file generation to use correct function

    Prior to this change, the codebase was using a regular configure_file to generate the package config file, which renders the install files non-relocatable. This commit replaces the regular configure_file call with the configure_package_config_file command.

    Apart from the major change mentioned above, this commit also makes a couple of additional changes:

    • Adds OpenMP flags to the interface link options of the hipSYCL-rt target so that they are passed to any application using the CMake target. This was required on Arch Linux; without it, the examples fail to build with undefined reference errors.
    • Fixes the include path values passed to the hipSYCL-rt target
    opened by 9prady9 47
  • MacOS CMake issues

    MacOS CMake issues

    Prompted by https://github.com/illuhad/hipSYCL/issues/222#issuecomment-601104011, I tried to build on Mac but it seems to ignore my explicit specification of Clang location and instead finds /usr/bin/clang++.

    It is possible that the issue here is that the Homebrew installs of LLVM don't have the required components, but CMake is not doing a proper job detecting that.

    jrhammon-mac02:build jrhammon$ cmake .. -DCMAKE_INSTALL_PREFIX=/opt/hipsycl -DCMAKE_CXX_COMPILER=/usr/local/Cellar/llvm/9.0.1/bin/clang++ -DCMAKE_C_COMPILER=/usr/local/Cellar/llvm/9.0.1/bin/clang -DDISABLE_LLVM_VERSION_CHECK=ON
    -- The C compiler identification is Clang 9.0.1
    -- The CXX compiler identification is Clang 9.0.1
    -- Check for working C compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang
    -- Check for working C compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang++
    -- Check for working CXX compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Warning (dev) at CMakeLists.txt:17 (set):
      implicitly converting 'INTEGER' to 'STRING' type.
    This warning is for project developers.  Use -Wno-dev to suppress it.
    
    -- Could NOT find LLVM (missing: LLVM_DIR)
    -- Building hipSYCL against LLVM configured from LLVM_DIR-NOTFOUND
    -- Selecting clang: /usr/bin/clang++
    CMake Error at CMakeLists.txt:77 (message):
      clang include path CLANG_INCLUDE_PATH-NOTFOUND does not exist.  Please
      provide clang's internal include path manually.
    
    
    -- Using clang include directory: CLANG_INCLUDE_PATH-NOTFOUND
    -- Configuring incomplete, errors occurred!
    See also "/Users/jrhammon/Work/SYCL/hipSYCL/build/CMakeFiles/CMakeOutput.log".
    
    opened by jeffhammond 36
  • atomics on CPU

    atomics on CPU

    I have the following simple atomic counter example that compiles and runs fine for CUDA, but doesn't compile on the CPU.

    #include <iostream>
    #include <CL/sycl.hpp>
    
    namespace s = cl::sycl;
    
    int main()
    {
        s::queue q;
    
        int counter = 0;
        {
            s::buffer<int> counter_buf(&counter, 1);
    
            q.submit([&](cl::sycl::handler& cgh)
            {
                auto access_counter = counter_buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    
                cgh.parallel_for<class atomic_increment>(s::range<1>(1 << 30), [=] (cl::sycl::id<1> tid)
                {
                    s::atomic<int> atomic_counter { s::global_ptr<int> {&access_counter[0]} };
                    atomic_counter.fetch_add(1);
                });
            });
        }
    
        std::cout << "Counter: " << counter << std::endl;
    }
    
    /opt/hipSYCL/CUDA/bin/syclcc-clang --hipsycl-gpu-arch=sm_52 -O3 atomic-counter.cpp -o atomic-counter-cuda
    

    compiles and runs fine.

    /opt/hipSYCL/CUDA/bin/syclcc-clang --hipsycl-platform=cpu -g atomic-counter.cpp -o atomic-counter-cpu
    In file included from atomic-counter.cpp:3:
    In file included from /opt/hipSYCL/CUDA/bin/../include/CL/sycl.hpp:58:
    /opt/hipSYCL/CUDA/bin/../include/CL/sycl/atomic.hpp:103:12: error: use of undeclared identifier 'atomicAdd'
        return atomicAdd(_ptr, operand);
               ^
    atomic-counter.cpp:23:32: note: in instantiation of function template specialization 'cl::sycl::atomic<int, cl::sycl::access::address_space::global_space>::fetch_add<int, nullptr>' requested here
                    atomic_counter.fetch_add(1);
    

    I know atomics are not fully supported (they are listed as a limitation in README), but other issues suggest that only minor features of atomics are missing. Is the above error by design, or is it something that can be easily fixed?
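    For reference, on the CPU the expected behaviour of the snippet above corresponds to a plain std::atomic counter. The following standard C++ sketch (unrelated to hipSYCL's implementation) shows the semantics that fetch_add should provide per work-item:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <thread>
    #include <vector>

    int main() {
      std::atomic<int> counter{0};
      std::vector<std::thread> threads;
      // 4 "work-items", each incrementing the shared counter 1000 times.
      for (int t = 0; t < 4; ++t)
        threads.emplace_back([&] {
          for (int i = 0; i < 1000; ++i)
            counter.fetch_add(1); // atomic increment, as in the SYCL example
        });
      for (auto &th : threads)
        th.join();
      assert(counter.load() == 4000); // no increments lost
      return 0;
    }
    ```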

    opened by mrzv 24
  • windows compilation report

    windows compilation report "lld-link: error: undefined symbol: cuModuleGetFunction"

    After the failure, repeat 'ninja':

    E:\hipSYCL-sycl-2020\build>ninja
    [1/1] Linking CXX shared library src\runtime\rt-backend-cuda.dll
    FAILED: src/runtime/rt-backend-cuda.dll src/runtime/rt-backend-cuda.lib
    cmd.exe /C "cd . && E:\hipSYCL-sycl-2020\LLVM\bin\clang++.exe -fuse-ld=lld-link -nostartfiles -nostdlib -O2 -g -DNDEBUG -Xclang -gcodeview -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -shared -o src\runtime\rt-backend-cuda.dll -Xlinker /implib:src\runtime\rt-backend-cuda.lib -Xlinker /pdb:src\runtime\rt-backend-cuda.pdb -Xlinker /version:0.0 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_event.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_queue.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_allocator.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_device_manager.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_hardware_manager.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_backend.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj src/runtime/hipSYCL-rt.lib E:/CUDA11/lib/x64/cudart_static.lib -lkernel32 -luser32 -lgdi32 -lwinspool -lshell32 -lole32 -loleaut32 -luuid -lcomdlg32 -ladvapi32 -loldnames && cd ."
    lld-link: error: undefined symbol: cuModuleGetFunction

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_queue.cpp:329 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_queue.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_queue::submit_kernel_from_module(class hipsycl::rt::cuda_module_manager &, class hipsycl::rt::cuda_module const &, class std::basic_string<char, struct std::char_traits, class std::allocator> const &, class hipsycl::rt::static_array<3> const &, class hipsycl::rt::static_array<3> const &, unsigned int, void **))

    lld-link: error: undefined symbol: cuLaunchKernel

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_queue.cpp:337 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_queue.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_queue::submit_kernel_from_module(class hipsycl::rt::cuda_module_manager &, class hipsycl::rt::cuda_module const &, class std::basic_string<char, struct std::char_traits, class std::allocator> const &, class hipsycl::rt::static_array<3> const &, class hipsycl::rt::static_array<3> const &, unsigned int, void **))

    lld-link: error: undefined symbol: cuModuleUnload

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_module.cpp:131 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj:(public: __cdecl hipsycl::rt::cuda_module_manager::~cuda_module_manager(void)) referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_module.cpp:182 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_module_manager::load(class hipsycl::rt::device_id, class hipsycl::rt::cuda_module const &, struct CUmod_st *&))

    lld-link: error: undefined symbol: cuModuleLoadDataEx

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_module.cpp:192 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_module_manager::load(class hipsycl::rt::device_id, class hipsycl::rt::cuda_module const &, struct CUmod_st *&)) clang++: error: linker command failed with exit code 1 (use -v to see invocation) ninja: build stopped: subcommand failed.


    I follow the steps from https://github.com/illuhad/hipSYCL/wiki/Using-hipSYCL-on-Windows, using the prebuilt LLVM 11.1.0 and Boost 1.75 binaries and the hipSYCL "sycl/2020" branch. hipSYCL-rt.dll and rt-backend-omp.dll are successfully generated, but the CUDA backend part fails to build. I tried CUDA 10.2 and 11; both report the same errors. I cannot understand why lld-link reports "undefined symbol: cuModuleGetFunction" when linking with cudart_static.lib. I use the official CUDA Windows 10 x86_64 release downloaded from the NVIDIA website.

    discussion 
    opened by mz24cn 23
  • Problems with compilation on ubuntu 18.04 with rocm 2.3

    Problems with compilation on ubuntu 18.04 with rocm 2.3

    I tried to run cmake with default parameters; on my system it compiles with gcc 7.3.0, but I am getting errors like error: no member named 'make_unique' in namespace 'std'. So I tried to switch to clang 6.0.0, which I have on my system (CC=clang CXX=clang++ cmake), but with the same result; I thought clang 6 defaults to C++14. So I tried CXXFLAGS+=-std=c++17 cmake, but I am getting the error The platform rocm was explicitly chosen, but it is not available. This was again with gcc 7.3.0, so I suppose gcc is not supported. I would expect your cmake config to try to choose clang as the C++ compiler, as I see it listed as a dependency, and to add -std=c++14 at least.

    At last I tried CC=clang CXX=clang++ CXXFLAGS+=-std=c++17 cmake and it was finally successful (with warnings in many places: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]). But when I tried to compile a SYCL source with syclcc test.cpp --hipsycl-platform=amd --std=c++17, I got this error:

    warning: argument unused during compilation: '-L/usr/local/bin/../lib' [-Wunused-command-line-argument]
    ld: /opt/rocm/bin/../lib/libhip_hcc.so: undefined reference to symbol '[email protected]@ROCR_1'
    //opt/rocm/hsa/lib/libhsa-runtime64.so.1: error adding symbols: DSO missing from command line
    clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
    

    This clang-9 comes from the ROCm hcc. So I am going to try to set the default C and C++ compiler on my system to the newest clang.

    So I tried it with clang-7. I again had to specify CXXFLAGS=-std=c++14 before cmake. This time I tried C++14 and hipSYCL compiled without warnings. But syclcc test.cpp --hipsycl-platform=rocm gives this strange error:

    warning: argument unused during compilation: '-L/usr/local/bin/../lib' [-Wunused-command-line-argument]
    syclcc fatal error: [Errno 2] No such file or directory: 'hipsycl_211bd330025b7952.cpp'
    

    I also tried the Singularity container. I see that it is using clang 6 and adds -std=c++14:

    + export CXX=clang++-6.0
    + cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_CXX_FLAGS=-std=c++14 -DWITH_CPU_BACKEND=ON -DWITH_ROCM_BACKEND=ON ..
    -- The C compiler identification is GNU 5.4.0
    -- The CXX compiler identification is Clang 6.0.0
    

    But when compiling a SYCL program I am getting a similar error as previously (with an additional perl locale warning):

    [email protected]:~$ syclcc test.cpp --hipsycl-platform=rocm -std=c++14
    perl: warning: Setting locale failed.
    perl: warning: Please check that your locale settings:
    	LANGUAGE = "en_US:en",
    	LC_ALL = (unset),
    	LANG = "en_US.UTF-8"
        are supported and installed on your system.
    perl: warning: Falling back to the standard locale ("C").
    warning: argument unused during compilation: '-L/usr/bin/../lib' [-Wunused-command-line-argument]
    ld: /opt/rocm/bin/../lib/libhip_hcc.so: undefined reference to symbol '[email protected]@ROCR_1'
    //opt/rocm/hsa/lib/libhsa-runtime64.so.1: error adding symbols: DSO missing from command line
    clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
    

    I have the latest ROCm 2.3. I have now tested a sample HIP program and an hcc program which calls the function hsa_system_major_extension_supported and links with -lhsa-runtime64, and everything seems to be working fine.

    But it seems when I skip -lhsa-runtime64, I get a similar error as with syclcc:

    ld: /tmp/tmp.EcD56X0on0/main.host.o: undefined reference to symbol '[email protected]@ROCR_1'
    //opt/rocm/hsa/lib/libhsa-runtime64.so.1: error adding symbols: DSO missing from command line
    clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
    

    And this finally worked from the Singularity container: syclcc test.cpp --hipsycl-platform=rocm -std=c++14 -lhsa-runtime64

    opened by misos1 22
  • Using multiple queues within OpenMP threads slows down SYCL

    Using multiple queues within OpenMP threads slows down SYCL

    Hi,

    I'm trying to optimize a piece of code by using multiple GPUs. To achieve this I first enumerate the devices on the host, create one queue for each device and then within a for loop access the "correct" queue by using omp_get_thread_num(). The problem is that using multiple queues slows down the entire computation by almost half. Using the same code with only one GPU (and thus one queue) gives the expected performance.

    Since the code is not mine to share I have tried to condense the relevant bits:

    const std::vector<sycl::device> devices = enumerate_devices();
    std::vector<sycl::queue> queues;
    for (auto device : devices) {
      queues.push_back(sycl::queue{device});
    }
    omp_set_num_threads(queues.size());
    #pragma omp parallel for
    for (...) {
      auto Q = queues[omp_get_thread_num()];
      // Each iteration will allocate quite a bit of memory
      // then run quite a few kernels
      // Lastly, free device memory again
    }
    

    I get the correct results in both instances, but using multiple devices takes the same total amount of time as using a single device: kernel compute and memory handling take roughly twice as much time as with a single queue.

    I'm wondering if this is a bad approach and if there are any suggestions for using multiple GPUs with hipSYCL (I have looked at the multi device queue, but it seems more difficult than the above with memory handling).

    discussion 
    opened by nordmoen 21
  • [SYCL2020] Group functions

    [SYCL2020] Group functions

    This PR adds a naive implementation for group_functions to hipSYCL. This includes implementations of

    • group_broadcast
    • group_barrier
    • group_{any,all,none}_of
    • group_reduce
    • group_{ex,in}clusive_scan

    on CPU and NVIDIA/AMD GPUs, as well as tests for these functions. I will provide optimized versions in later PRs. It also includes (group_)functions using two pointers (beginning/end), but these are not in the specification and are not meant to be used yet (as such they reside in the detail namespace).

    At the moment all tests pass (except for some problems with the pointer-based functions on CPU, which sometimes fail; I am still investigating). (private results)

    I would love to get some feedback. If you find some template parameters or formatting you don't like, there is a chance I missed them in one of my cleanup/refactoring attempts; just tell me so I can fix it. Some small changes, like splitting the tests into multiple files for faster compilation, might be added here.

    opened by DieGoldeneEnte 21
  • Permission issue in manual build

    Permission issue in manual build

    Hello, I'm trying to build hipSYCL manually, following the steps in the README file.

    git clone --recurse-submodules https://github.com/illuhad/hipSYCL
    cd hipSYCL
    mkdir build
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=. ..
    

    and this is the output

    -- The C compiler identification is GNU 4.8.5
    -- The CXX compiler identification is GNU 4.8.5
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Warning (dev) at CMakeLists.txt:17 (set):
      implicitly converting 'INTEGER' to 'STRING' type.
    This warning is for project developers.  Use -Wno-dev to suppress it.
    
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - not found
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    -- Found CUDA: /software/nvidia/cuda/10.0 (found version "10.0")
    -- Boost version: 1.57.0
    -- Found the following Boost libraries:
    --   filesystem
    --   system
    -- Boost version: 1.57.0
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /path/hipSYCL/build
    

    then I tried to build it with: make install it causes fatal error:

    make install
    Scanning dependencies of target hipSYCL_cuda
    [  2%] Building CXX object src/libhipSYCL/CMakeFiles/hipSYCL_cuda.dir/application.cpp.o
    syclcc fatal error: [Errno 13] Permission denied: '/path/hipSYCL/bin/'
    make[2]: *** [src/libhipSYCL/CMakeFiles/hipSYCL_cuda.dir/application.cpp.o] Error 255
    make[1]: *** [src/libhipSYCL/CMakeFiles/hipSYCL_cuda.dir/all] Error 2
    make: *** [all] Error 2
    

    However, the permissions for the source and build directories have not changed since their creation, and /path/hipSYCL/bin/ exists and contains syclcc and syclcc-clang.

    opened by mteimoori 20
  • Add dedicated backend queues for inorder queues and priority queue support

    Add dedicated backend queues for inorder queues and priority queue support

    • Introduces rt::inorder_executor for simple, straight-forward in-order execution
    • Make rt::multi_queue_executor rely on multiple inorder_executor
    • Add mechanism to runtime to create dedicated inorder_executor objects (which backends may or may not support)
    • Change default behavior of in-order queue to request dedicated in-order executor, if supported by backend (otherwise, uses backend-provided executor). All fancy hipSYCL extensions like hipSYCL_retarget should still work, even if a dedicated inorder-executor has been requested, so no code should break. This allows expert users to have more control over scheduling decisions
    • Add hipSYCL_priority{int} queue property. When used with dedicated in-order executor (i.e. in-order queue), passes this priority to HIP/CUDA stream creation methods. On HIP, uses hipStreamCreateWithPriority. I'm not sure if this actually does something or if we need to create the stream with CU mask to achieve the desired effect.

    @al42and @pszi1ard this one is for you, apologies for the delay. Things were buggy for a long time, I hope it works now, although I would still consider it early access ;)

    opened by illuhad 19
  • Strategy: The future of the source-to-source transformation

    Strategy: The future of the source-to-source transformation

    Since we'll have a clang plugin hopefully soonish that will directly allow the clang CUDA/HIP frontend to ingest SYCL code (see issue #34), we could in principle drop the source-to-source transformation entirely. I'd like to start a discussion here with hipSYCL users and developers to get some feedback on possible futures of the hipSYCL compilation toolchain. Is the source-to-source transformation important to you and we should support both source-to-source and the new clang plugin? Do you need nvcc support? Or is clang support (with the plugin) sufficient for you?

    Here are some pros of the source-to-source transformation that come to my mind:

    • Allows compilation of SYCL code with nvcc. This can be interesting from a marketing position ("you can do anything that nvcc can and can use the newest CUDA features right away")
    • Possible to specify areas in the code with preprocessor definitions that hipSYCL shouldn't modify. This could be beneficial if you're interested in mix-and-match with SYCL and CUDA/HIP code.

    The new clang plugin on the other hand gives us:

    • Much more robustness and reliability (there are edge cases in the source-to-source transformation...)
    • Faster compilation speed
    • Solid SYCL support
    • Paves the way for runtime selection whether a kernel should be executed on host or device (could in principle however also be implemented with source-to-source and clang's CUDA implementation [but likely not with nvcc])
    • Potentially even some parts of the C++ standard library could be used in kernels (although not yet with the initial version of the plugin)
    • Implementation of specific optimizations in the future since we have access to the IR

    While not impossible, it may require some additional effort to support both the current source-to-source transformation and the new clang plugin approach because the clang plugin treats any function without attributes implicitly as __host__ __device__. This means that all functions for SYCL kernels (e.g. math functions) must also support compilation for both host and device. At the moment, we assume in the runtime that everything used in kernels is __device__ only. This is also assumed by the current source-to-source transformation. Also, if we still have to support source-to-source, it may limit our ability to implement things with IR transformations.

    discussion 
    opened by illuhad 18
  • [WIP] Add runtime components for Metal backend

    [WIP] Add runtime components for Metal backend

    This is a work-in-progress draft of the Metal backend. It currently adds the metal_hardware_context class and half-implements the Metal allocator. Many components, such as the allocator and blit encoder, have working prototypes written in Swift. They're located at metal-usm. I just need to translate them to C++.

    opened by philipturner 0
  • hipSYCL generates extra empty kernels

    hipSYCL generates extra empty kernels

    With this code, 5 kernels are generated for a single GPU target. It seems each kernel lambda generates one kernel with actual content and one extra empty kernel (and one additional empty kernel is generated regardless of how many kernels there are). There are even more when I use "named" kernels.

    	sycl::queue q;
    	q.single_task([]()
    	{
    		__hipsycl_if_target_hip(asm("s_nop 1"));
    	}).wait();
    	q.single_task([]()
    	{
    		__hipsycl_if_target_hip(asm("s_nop 2"));
    	}).wait();
    

    Only the two kernels which start with _Z16 contain actual code.

    _Z16__hipsycl_kernelIZ4mainEUlvE_Evv.kd
    _Z30__hipsycl_kernel_name_templateIZ4mainEUlvE0_Evv.kd
    _Z30__hipsycl_kernel_name_templateIZ4mainEUlvE_Evv.kd
    _Z16__hipsycl_kernelIZ4mainEUlvE0_Evv.kd
    _Z30__hipsycl_kernel_name_templateI24__hipsycl_unnamed_kernelEvv.kd
    
    discussion 
    opened by misos1 1
  • A few questions about hipSYCL's capabilities

    A few questions about hipSYCL's capabilities

    Hi, apologies if this is mentioned anywhere in the docs but I couldn't find the answers to some questions I had about hipSYCL:

    • AMD has notoriously poor support for ROCm on their consumer GPUs. What happens if a program built with hipSYCL detects an AMD GPU that isn't supported by ROCm? Does it fall back to basic OpenCL, the CPU, etc.?
    • Is Aarch64 a valid target, or will it be? I found mentions of a Metal backend, which would have to work on Aarch64 for any Apple Silicon device; would this transfer to being able to run on other Aarch64-based platforms? My employer is looking to do some demonstrations on machines that have an Arm-based CPU and an NVIDIA or AMD GPU.
    • Also, based on the Metal ticket, that would mean macOS support. Is there any plan for non-experimental Windows support? I saw you have a page for working on Windows, but it's not a stable process. This isn't critical, but it would be preferred if we could target all three major platforms.

    Thanks

    discussion 
    opened by LouChiSoft 3
  • WIP: Generic half (fp16) support

    WIP: Generic half (fp16) support

    This WIP PR adds a generic sycl::half class that is supported on all backends/compilation flows; however, when native half support is unavailable, arithmetic operations may be carried out in fp32.

    This PR only provides the class itself and basic arithmetic functionality, no math builtins. Also no tests yet :P

    Depends on #862, because implementing this requires taking into account the SSCP future.

    opened by illuhad 0
  • Set `local_size` in reduction to default value

    Set `local_size` in reduction to default value

    Passing a local_size of 1 in a parallel_for reduction previously caused an infinite loop. This PR sets the local_size to a default value of 128 if the user passed 1.

    Fixes #857

    opened by nilsfriess 2
Releases(v0.9.3)
  • v0.9.3(Aug 31, 2022)

    Highlights

    • Improved compatibility with new clang versions and ROCm clang
    • New extensions, e.g.
      • coarse-grained events. These have zero construction cost at the expense of lower synchronization performance, making them a good match when the returned event of an operation is not expected to be used
      • queue priorities for in-order queues on certain backends
    • Added hip.explicit-multipass compilation flow
    • Multiple optimizations that can potentially reduce runtime overheads substantially
      • Use event pools in CUDA/HIP backends
      • Use asynchronous garbage collector thread to clean up old DAG nodes to remove garbage collection from the kernel submission path
      • Use std::weak_ptr instead of std::shared_ptr to express dependencies in the DAG, making old DAG nodes and their associated events eligible for reuse by the event pool earlier
    • In-order queues map 1:1 to dedicated CUDA or HIP streams for more explicit scheduling control
    • Unified kernel cache and data format for all explicit multipass compilation flows (hipSYCL container format, HCF)
    • Manage hipSYCL runtime lifetime by refcounting all SYCL objects created by the user instead of just having a global object; this can resolve errors when terminating the program on some backends.
    • Simplify deployment when no std::filesystem is available
    • New tool: hipsycl-hcf-tool to inspect and edit HCF files
    • New tool: hipsycl-info to print information about detected devices.

    What's Changed (details)

    • Fix SPIR-V isnan() builtin by @illuhad in https://github.com/illuhad/hipSYCL/pull/710
    • Don't spill OpenMP pragmas and add .sycl as file ending by @illuhad in https://github.com/illuhad/hipSYCL/pull/711
    • Update installation scripts by @sbalint98 in https://github.com/illuhad/hipSYCL/pull/677
    • Fix typo in macro name causing harmless warnings by @al42and in https://github.com/illuhad/hipSYCL/pull/715
    • Check all dyn casts in analyzeModule. by @fodinabor in https://github.com/illuhad/hipSYCL/pull/717
    • Align name mangling in clang 13 host pass with upstream clang and restrict uses of createDeviceMangleContext() by @illuhad in https://github.com/illuhad/hipSYCL/pull/720
    • Add missing include directive for unordered_map by @normallytangent in https://github.com/illuhad/hipSYCL/pull/735
    • Make random number generators for embedded_pointer unique id thread_local by @illuhad in https://github.com/illuhad/hipSYCL/pull/738
    • Fix multi-threaded task processing by @illuhad in https://github.com/illuhad/hipSYCL/pull/739
    • dag_node: Only use backend wait() functionality if we are not yet complete by @illuhad in https://github.com/illuhad/hipSYCL/pull/742
    • Describe boost 1.78 build system bug in documentation by @illuhad in https://github.com/illuhad/hipSYCL/pull/744
    • Add released LLVM 14 to Linux CIs. by @fodinabor in https://github.com/illuhad/hipSYCL/pull/747
    • Add global kernel cache and HCF infrastructure by @illuhad in https://github.com/illuhad/hipSYCL/pull/736
    • Fix finding boost library path for boost with cmake integration by @sbalint98 in https://github.com/illuhad/hipSYCL/pull/748
    • Use reference-counting of user SYCL objects to manage runtime lifetime by @illuhad in https://github.com/illuhad/hipSYCL/pull/749
    • Restrict queries of event state by @illuhad in https://github.com/illuhad/hipSYCL/pull/750
    • Fix signature of __hipsycl_atomic_store for double and float by @al42and in https://github.com/illuhad/hipSYCL/pull/751
    • [CUDA][HIP] Add event pool by @illuhad in https://github.com/illuhad/hipSYCL/pull/757
    • Add coarse grained events extension by @illuhad in https://github.com/illuhad/hipSYCL/pull/754
    • Make max cached nodes configurable by @illuhad in https://github.com/illuhad/hipSYCL/pull/759
    • [cbs] Fix compatibility issues with upstream Clang/LLVM by @aaronmondal in https://github.com/illuhad/hipSYCL/pull/763
    • [CBS] Fix runtime issues with opaque pointers by @fodinabor in https://github.com/illuhad/hipSYCL/pull/765
    • [Plugin] Resolve version macros in HIPSYCL_STRINGIFY by @aaronmondal in https://github.com/illuhad/hipSYCL/pull/773
    • Add missing sycl::nd_range::get_group_range function by @al42and in https://github.com/illuhad/hipSYCL/pull/775
    • Add HIPSYCL_RT_SANITIZE cmake option by @illuhad in https://github.com/illuhad/hipSYCL/pull/779
    • Update ROCm installation documentation by @illuhad in https://github.com/illuhad/hipSYCL/pull/780
    • Remove unnecessary linking against boost for the clang plugin by @illuhad in https://github.com/illuhad/hipSYCL/pull/781
    • Use weak_ptr in node requirements list by @illuhad in https://github.com/illuhad/hipSYCL/pull/771
    • [CI] fix compilation on MSVC 2017 by @fxzjshm in https://github.com/illuhad/hipSYCL/pull/784
    • dag_submitted_ops: Manage node lifetime by asynchronously waiting instead of event queries by @illuhad in https://github.com/illuhad/hipSYCL/pull/761
    • Optimize queue::wait() by waiting on nodes in reverse submission order by @illuhad in https://github.com/illuhad/hipSYCL/pull/787
    • Remove OpenMP dependency for sequential backend by @illuhad in https://github.com/illuhad/hipSYCL/pull/786
    • Optimize inorder queue::wait() by @illuhad in https://github.com/illuhad/hipSYCL/pull/788
    • Add support for HIP explicit multipass by @illuhad in https://github.com/illuhad/hipSYCL/pull/790
    • Add hipsycl-info tool by @illuhad in https://github.com/illuhad/hipSYCL/pull/791
    • Fix ThreadSanitizer complaint about worker_thread::_continue by @al42and in https://github.com/illuhad/hipSYCL/pull/794
    • Avoid printing unprintable from memset_operation::dump by @al42and in https://github.com/illuhad/hipSYCL/pull/795
    • Fix linking errors with libstdc++ < 9 by @al42and in https://github.com/illuhad/hipSYCL/pull/667
    • Use device managers in allocators instead of setting device directly by @illuhad in https://github.com/illuhad/hipSYCL/pull/796
    • Work around nvc++ bug by not having empty if target branches in mem_fence() by @illuhad in https://github.com/illuhad/hipSYCL/pull/798
    • Manually check version of clang if ROCm is used. by @fodinabor in https://github.com/illuhad/hipSYCL/pull/800
    • Implement sincos and sinh math builtins by @nmnobre in https://github.com/illuhad/hipSYCL/pull/802
    • Add dedicated backend queues for inorder queues and priority queue support by @illuhad in https://github.com/illuhad/hipSYCL/pull/770
    • Add HIPSYCL_EXT_QUEUE_PRIORITY flag by @al42and in https://github.com/illuhad/hipSYCL/pull/804
    • Fix CMake error with ROCm 4.5 Clang by @al42and in https://github.com/illuhad/hipSYCL/pull/806
    • Add option to compile tests with reduced local mem usage by @illuhad in https://github.com/illuhad/hipSYCL/pull/805
    • omp.library-only: Fix incorrect addition of master group offset to group id by @illuhad in https://github.com/illuhad/hipSYCL/pull/814
    • Bump version to 0.9.3 by @illuhad in https://github.com/illuhad/hipSYCL/pull/803

    New Contributors

    • @normallytangent made their first contribution in https://github.com/illuhad/hipSYCL/pull/735
    • @aaronmondal made their first contribution in https://github.com/illuhad/hipSYCL/pull/763
    • @nmnobre made their first contribution in https://github.com/illuhad/hipSYCL/pull/802

    Thank you to our first-time contributors!

    Full Changelog: https://github.com/illuhad/hipSYCL/compare/v0.9.2...v0.9.3

    Source code(tar.gz)
    Source code(zip)
  • v0.9.2(Feb 14, 2022)

    Changes compared to the previous release 0.9.1 (selection)

    The following is an incomplete list of changes and improvements:

    Highlights

    • Initial support for operating as a pure CUDA library for NVIDIA's proprietary nvc++ compiler, without any additional hipSYCL compiler magic. In this flow, LLVM is not required and new NVIDIA hardware can be targeted as soon as NVIDIA adds support in nvc++.
    • Initial support for dedicated compiler support in the CPU backend. These new compilation passes can greatly improve performance of nd_range parallel for kernels on CPU. This allows executing SYCL code efficiently on any CPU supported by LLVM.
    • Scoped parallelism API v2 for a more performance portable programming model
    • Reimplement explicit multipass support for clang >= 13. This allows targeting multiple backends simultaneously, and was previously only supported on clang 11. Kernel names in the binary are now always demangleable as __hipsycl_kernel<KernelNameT> or __hipsycl_kernel<KernelBodyT>.

    SYCL support

    • Support for new SYCL 2020 features such as atomic_ref, device selector API, device aspect API and others
    • Support for SYCL 2020 final group algorithm interface
    • Add support for the profiling API
    • ... more

    Extensions

    • Add initial support for multi-device queue hipSYCL extension to automatically distribute work across multiple devices
    • Add initial support for queue::get_wait_list() hipSYCL extension to allow barrier-like semantics at the queue level
    • Add accessor_variant extension which allows accessors to automatically optimize the internal data layout of the accessor object depending on how they were constructed. This can save registers on device without any changes needed by the user.
    • Add handler::update_device() extension in analogy to the already existing update_host(). This can be used, e.g., to prefetch data.
    • Complete buffer-USM interoperability API
    • Add support for explicit buffer policy extension and asynchronous buffers

    See the documentation on extensions for more details.

    Optimizations

    • Automatic work distribution across multiple streams
    • Fix massive performance bug caused by a bug in the kernel cache in the Level Zero backend
    • Optimize CUDA backend to perform aggressive CUDA module caching in an explicit multipass scenario. This can greatly improve performance of the cuda.explicit-multipass compilation flow when multiple translation units are involved.
    • Several performance fixes and improvements in the hipSYCL runtime. Especially when spawning many tasks, performance can now be significantly better.
    • ... more

    Bug fixes and other improvements

    Yes, a lot of them :-)

    Source code(tar.gz)
    Source code(zip)
  • v0.9.1(Mar 29, 2021)

    hipSYCL 0.9.1

    -- This release is dedicated to the memory of Oliver M. Some things just end too soon.

    New major features

    • Add new "explicit multipass" compilation model, allowing all of hipSYCL's backends to be targeted simultaneously. This means hipSYCL can now compile to a binary that can run on devices from multiple vendors. Details on the compilation flow can be found here: https://github.com/illuhad/hipSYCL/blob/develop/doc/compilation.md
    • Introduce plugin architecture for backends of the hipSYCL runtime. This means hipSYCL now looks for backend plugins at runtime, allowing an existing hipSYCL installation to be extended with support for additional hardware without changing the already installed components.
    • Initial, experimental support for Intel GPUs using Level Zero and SPIR-V
    • Introducing initial support for large portions of oneDPL using our fork at https://github.com/hipSYCL/oneDPL
    • hipSYCL is now also tested on Windows in CI, although Windows support is still experimental.

    New features and extensions

    • Command group properties that can influence how kernels or other operations are scheduled or executed:
      • hipSYCL_retarget command group property. Execute an operation submitted to a queue on an arbitrary device instead of the one the queue is bound to.
      • hipSYCL_prefer_group_size<Dim> command group property. Provides a recommendation to hipSYCL which group size to choose for basic parallel for kernels.
      • hipSYCL_prefer_execution_lane command group property. Provides a hint to the runtime on which backend queue (e.g. CUDA stream) an operation should be executed. This can be used to optimize kernel concurrency or overlap of data transfers and compute in case the hipSYCL scheduler does not already automatically submit an optimal configuration.
    • Comprehensive interoperability framework between buffers and USM pointers. This includes extracting USM pointers from existing buffer objects, turning any buffer into a collection of USM pointers, as well as constructing buffer objects on top of existing USM pointers.
    • The hipSYCL_page_size buffer property can be used to enable data state tracking inside a buffer at a granularity below the buffer size. This can be used to allow multiple kernels to concurrently write to the same buffer as long as they access different hipSYCL data management pages. Unlike subbuffers, this also works with multi-dimensional strided memory accesses.
    • Synchronous sycl::mem_advise() as free function
    • handler::prefetch_host() and queue::prefetch_host() for a simpler mechanism of prefetching USM allocations to host memory.
    • Explicit buffer policies to make programmer intent clearer as well as asynchronous buffer types that do not block in the destructor, which can improve performance. For example, auto v = sycl::make_async_view(ptr, range) constructs a buffer that operates directly on the input pointer and does not block in the destructor.
    • The HIPSYCL_VISIBILITY_MASK environment variable can be used to select which backends should be loaded.

    See https://github.com/illuhad/hipSYCL/blob/develop/doc/extensions.md for a list of all hipSYCL extensions with more details.

    Optimizations and improvements

    • Hand-tuned optimizations for SYCL 2020 group algorithms
    • Automatic distribution of kernels across multiple CUDA/HIP streams
    • Improved support for newer ROCm versions
    • SYCL 2020 accessor deduction guides and host_accessor
    • Improve handling of Multi-GPU setups
    • Significant performance improvements for queue::wait()
    • Early DAG optimizations to improve handling of complex and large dependency graphs
    • Optimizations to elide unnecessary synchronization between DAG nodes

    Bug fixes and other improvements

    Yes, a lot of them!

    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Dec 10, 2020)

    hipSYCL 0.9.0

    hipSYCL 0.9 is packed with tons of new features compared to the older 0.8 series:

    Support for key SYCL 2020 features

    hipSYCL 0.9.0 introduces support for several key SYCL 2020 features, including:

    • Unified shared memory provides a pointer-based memory model as an alternative to the traditional buffer-accessor model
    • SYCL 2020 generalized backend model and backend interoperability provides generic mechanisms for interoperability between the underlying backend objects and SYCL
    • Queue shortcuts for kernel invocation and USM memory management functions
    • Inorder queues to submit kernels in order when a task graph is not required
    • Unnamed kernel lambdas (requires building hipSYCL against clang >= 10)
    • Subgroups
    • Group algorithms for parallel primitives at work group and subgroup level (Note that the interface may change slightly with the release of SYCL 2020 final, optimization is ongoing)
    • Reductions provide a simple way to carry out an arbitrary number of reduction operations across all work items of a kernel using either predefined or user-provided reduction operators (note that the interface may change slightly with the release of SYCL 2020 final; optimization is ongoing). Currently only scalar reductions are supported, but multiple simultaneous reductions are possible. In addition to the requirements of the SYCL specification, we also support reductions in the hierarchical and scoped parallelism models.
    • ... and more! See here for more information on the SYCL 2020 coverage of current hipSYCL: https://github.com/hipSYCL/featuresupport

    Unique hipSYCL extensions

    There are two new extensions in hipSYCL 0.9.0:

    • Enqueuing custom backend operations for highly efficient backend interoperability: https://github.com/illuhad/hipSYCL/blob/develop/doc/enqueue-custom-operation.md
    • Scoped parallelism is a novel kernel execution model designed for performance portability between host and device backends: https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md

    New runtime library

    hipSYCL 0.9.0 is the first release containing the entirely rewritten, brand new runtime library, which includes features such as:

    • Single library for all backends (libhipSYCL-rt) instead of libraries for each backend (libhipSYCL_cpu, libhipSYCL_cuda etc)
    • Strict separation between backend-specific code and generic code, with a clear and simple interface that makes it easy to add new backends in the future
    • Multiple runtime backends can now be active at the same time and interact
    • SYCL interface is now header-only; bootstrap mode in syclcc is no longer required and has been removed. When building hipSYCL, only the runtime needs to be compiled which can be done with any regular C++ compiler. This should simplify the build process greatly.
    • Architecture supports arbitrary execution models in different backends - queue/stream based, task graphs etc.
    • CUDA and CPU backends do not depend on HIP API anymore. The CUDA backend now goes directly to CUDA without going through HIP, and the CPU backend goes directly to OpenMP without going through hipCPU. hipCPU and HIP submodules are no longer required and have been removed.
    • Strict separation between SYCL interface and runtime, making it easy to expose new features (e.g. SYCL 2020) in the SYCL interface by leveraging the SYCL runtime interfaces underneath.
    • For each operation, SYCL interface can pass additional information to runtime/scheduler using hints framework. Device on which an operation is executed is just another hint for the runtime.
    • Support for lazy DAG execution (Note: Only partially activated by default)
    • Almost entirely callback-free execution model in CUDA/ROCm backends for potentially higher task throughput
    • New memory management system and improved multi-GPU support
      • manages arbitrary allocations on multiple devices
      • manages memory potentially below buffer granularity, using 3D page table to track invalid memory regions (not yet fully exposed)
    • Backend queues (e.g. CUDA streams) are maintained by the backend in a pool, the scheduler then distributes operations across the queues. No matter how many sycl::queues exist, compute/memory-overlap always works equally well. This means a sycl::queue is now nothing more than an interface to the runtime.
    • Vastly improved error handling. Proper implementation of async errors/error handlers. Task execution will be cancelled when an error is detected.
    • ROCm backend: Add support for 3D data transfers

    syclcc and compilation improvements

    • new --hipsycl-targets flag that allows compiling for multiple targets and backends; e.g. syclcc --hipsycl-targets="omp;hip:gfx906,gfx900" compiles for the OpenMP backend as well as for Vega 10 and Vega 20. Note that simultaneous compilation for both NVIDIA and AMD GPUs is not supported due to clang limitations.
    • The compiler arguments and linker flags passed to backend compilers are now all exposed in cmake (and syclcc.json), giving the user more control to adapt the compilation flow to individual requirements. This can be helpful for uncommon setup scenarios where different flags may be required.

    Performance improvements

    • New execution model for nd_range parallel for on CPU, bringing several orders of magnitude higher performance. Note that nd_range parallel for is inherently difficult to implement in library-only CPU backends, and basic parallel for or our scoped parallelism extension should be preferred if possible.

    Fixes and other improvements

    Yes, a lot of them :-)

    Source code(tar.gz)
    Source code(zip)
  • v0.8.0(Sep 24, 2019)

    Note: hipSYCL 0.8.0 is deprecated; users are encouraged to use our package repositories instead

    This is the release of hipSYCL 0.8.0. We provide the following packages:

    • hipSYCL-base provides the basic LLVM compiler stack that is needed in any case
    • hipSYCL-rocm provides a compatible ROCm stack that additionally allows hipSYCL to target AMD GPUs
    • hipSYCL provides the actual hipSYCL libraries, tools and headers

    While we cannot provide matching CUDA packages for NVIDIA support for legal reasons, scripts for installing a matching CUDA distribution as well as scripts to generate CUDA packages are provided. You will find further information in the README here on GitHub.

    At the moment, Arch Linux, CentOS 7 and Ubuntu 18.04 packages are provided.

    Source code(tar.gz)
    Source code(zip)
    hipSYCL-0.8.0-20190924-archlinux-x86_64.pkg.tar.xz(339.03 KB)
    hipSYCL-0.8.0-20190924-centos-7.x86_64.rpm(386.55 KB)
    hipSYCL-0.8.0-20190924-ubuntu-18.04.deb(342.65 KB)
    hipSYCL-base-0.8.0-20190924-archlinux-x86_64.pkg.tar.xz(246.50 MB)
    hipSYCL-base-0.8.0-20190924-centos-7.x86_64.rpm(357.18 MB)
    hipSYCL-base-0.8.0-20190924-ubuntu-18.04.deb(258.21 MB)
    hipSYCL-rocm-0.8.0-20190924-archlinux-x86_64.pkg.tar.xz(569.58 MB)
    hipSYCL-rocm-0.8.0-20190924-centos-7.x86_64.rpm(712.43 MB)
    hipSYCL-rocm-0.8.0-20190924-ubuntu-18.04.deb(606.87 MB)
  • v0.8.0-rc1(Sep 19, 2019)

    This is a prerelease of hipSYCL 0.8.0. In particular, it serves to test new packages of the entire hipSYCL stack. We provide the following packages:

    • hipSYCL-base provides the basic LLVM compiler stack that is needed in any case
    • hipSYCL-rocm provides a compatible ROCm stack that additionally allows hipSYCL to target AMD GPUs
    • hipSYCL provides the actual hipSYCL libraries, tools and headers

    While we cannot provide matching CUDA packages due to legal reasons, CUDA installation scripts will be provided for the actual hipSYCL 0.8.0 release.

    At the moment, Arch Linux and Ubuntu 18.04 packages are provided.

    Source code(tar.gz)
    Source code(zip)
    hipSYCL-0.8.0-20190919-arch-x86_64.pkg.tar.xz(337.23 KB)
    hipSYCL-0.8.0-20190919-ubuntu-18.04.deb(340.08 KB)
    hipSYCL-base-0.8.0-20190919-arch-x86_64.pkg.tar.xz(240.88 MB)
    hipSYCL-base-0.8.0-20190919-ubuntu-18.04.deb(251.18 MB)
    hipSYCL-rocm-0.8.0-20190919-arch-x86_64.pkg.tar.xz(574.30 MB)
    hipSYCL-rocm-0.8.0-20190919-ubuntu-18.04.deb(612.35 MB)
Owner
Aksel Alpay