Kokkos C++ Performance Portability Programming EcoSystem: The Programming Model - Parallel Execution and Memory Abstraction

Overview

Kokkos

Kokkos: Core Libraries

Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. For that purpose it provides abstractions for both parallel execution of code and data management. Kokkos is designed to target complex node architectures with N-level memory hierarchies and multiple types of execution resources. It currently can use CUDA, HPX, OpenMP and Pthreads as backend programming models with several other backends in development.

Kokkos Core is part of the Kokkos C++ Performance Portability Programming EcoSystem, which also provides math kernels (https://github.com/kokkos/kokkos-kernels), as well as profiling and debugging tools (https://github.com/kokkos/kokkos-tools).

Learning about Kokkos

A programming guide can be found on the Wiki; the API reference is under development.

For questions, find us on Slack (https://kokkosteam.slack.com) or open a GitHub issue.

For non-public questions send an email to crtrott(at)sandia.gov

A separate repository with extensive tutorial material can be found under https://github.com/kokkos/kokkos-tutorials.

Furthermore, the 'example/tutorial' directory provides step-by-step tutorial examples which explain many of the features of Kokkos. They work with simple Makefiles. To build with g++ and OpenMP, simply type 'make' in the 'example/tutorial' directory. This will build all examples in the subfolders. To change the build options, refer to the compilation section of the Programming Guide.

To learn more about Kokkos, consider watching one of our presentations.

Contributing to Kokkos

We welcome and encourage contributions from external developers. To contribute, please first open an issue describing the contribution, then issue a pull request against the develop branch. For larger features, it may be helpful to get guidance from the core development team first through the GitHub issue.

Note that Kokkos Core is licensed under standard 3-clause BSD terms of use, which means that contributing to Kokkos allows anyone else to use your contributions not just for public purposes but also for closed-source commercial projects. For specifics, see the LICENSE file contained in the repository or distribution.

Requirements

Primary tested compilers on X86 are:

  • GCC 5.3.0
  • GCC 5.4.0
  • GCC 5.5.0
  • GCC 6.1.0
  • GCC 7.2.0
  • GCC 7.3.0
  • GCC 8.1.0
  • Intel 17.0.1
  • Intel 17.4.196
  • Intel 18.2.128
  • Clang 4.0.0
  • Clang 6.0.0 for CUDA (CUDA Toolkit 9.0)
  • Clang 7.0.0 for CUDA (CUDA Toolkit 9.1)
  • Clang 8.0.0 for CUDA (CUDA Toolkit 9.2)
  • PGI 18.7
  • NVCC 9.1 for CUDA (with gcc 6.1.0)
  • NVCC 9.2 for CUDA (with gcc 7.2.0)
  • NVCC 10.0 for CUDA (with gcc 7.4.0)
  • NVCC 10.1 for CUDA (with gcc 7.4.0)
  • NVCC 11.0 for CUDA (with gcc 8.4.0)

Primary tested compilers on Power 8 are:

  • GCC 6.4.0 (OpenMP,Serial)
  • GCC 7.2.0 (OpenMP,Serial)
  • IBM XL 16.1.0 (OpenMP, Serial)
  • NVCC 9.2.88 for CUDA (with gcc 7.2.0 and XL 16.1.0)

Primary tested compilers on Intel KNL are:

  • Intel 17.2.174 (with gcc 6.2.0 and 6.4.0)
  • Intel 18.2.199 (with gcc 6.2.0 and 6.4.0)

Primary tested compilers on ARM (Cavium ThunderX2)

  • GCC 7.2.0
  • ARM/Clang 18.4.0

Other compilers working:

  • X86:
    • Cygwin 2.1.0 64bit with gcc 4.9.3
    • GCC 8.1.0 (not warning free)

Known non-working combinations:

  • Power8:
    • Pthreads backend
  • ARM
    • Pthreads backend

Build system:

  • CMake >= 3.10: required
  • CMake >= 3.13: recommended
  • CMake >= 3.18: Fortran linkage. This does not affect most mixed Fortran/Kokkos builds. See build issues.

Primary tested compilers are passing in release mode with warnings treated as errors. They are also tested with a comprehensive set of backend combinations (i.e. OpenMP, Pthreads, Serial, OpenMP+Serial, ...). We use the following sets of flags:

  • GCC:

       -Wall -Wunused-parameter -Wshadow -pedantic
       -Werror -Wsign-compare -Wtype-limits
       -Wignored-qualifiers -Wempty-body
       -Wclobbered -Wuninitialized
    
  • Intel:

      -Wall -Wunused-parameter -Wshadow -pedantic
      -Werror -Wsign-compare -Wtype-limits
      -Wuninitialized
    
  • Clang:

      -Wall -Wunused-parameter -Wshadow -pedantic
      -Werror -Wsign-compare -Wtype-limits
      -Wuninitialized
    
  • NVCC:

      -Wall -Wunused-parameter -Wshadow -pedantic
      -Werror -Wsign-compare -Wtype-limits
      -Wuninitialized
    

Other compilers are tested occasionally, in particular when pushing from develop to master branch. These are tested less rigorously without -Werror and only for a select set of backends.

Building and Installing Kokkos

Kokkos provides a CMake build system and a raw Makefile build system. The CMake build system is strongly encouraged and will be the most rigorously supported in future releases. Full details are given in the build instructions. Basic setups are shown here:

CMake

The best way to install Kokkos is using the CMake build system. Assuming Kokkos lives in $srcdir:

cmake $srcdir \
  -DCMAKE_CXX_COMPILER=$path_to_compiler \
  -DCMAKE_INSTALL_PREFIX=$path_to_install \
  -DKokkos_ENABLE_OPENMP=On \
  -DKokkos_ARCH_HSW=On \
  -DKokkos_ENABLE_HWLOC=On \
  -DKokkos_HWLOC_DIR=$path_to_hwloc

then simply type make install. The Kokkos CMake package will then be installed in $path_to_install to be used by downstream packages.

To validate the Kokkos build, configure with

 -DKokkos_ENABLE_TESTS=On

and run make test after completing the build.

In a CMake project using Kokkos, add code such as the following to your CMakeLists.txt:

find_package(Kokkos)
...
target_link_libraries(myTarget Kokkos::kokkos)

Your configure line should additionally include

-DKokkos_DIR=$path_to_install/lib/cmake/Kokkos

or

-DKokkos_ROOT=$path_to_install

for the install location given above.
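Putting these pieces together, a downstream project's CMakeLists.txt might look like the following (a sketch; the project name, target name, and source file are illustrative):

```cmake
cmake_minimum_required(VERSION 3.16)
project(MyApp CXX)

# Located via -DKokkos_ROOT (or -DKokkos_DIR) at configure time.
find_package(Kokkos REQUIRED)

add_executable(myTarget main.cpp)
# Pulls in Kokkos include paths, compile flags, and link dependencies.
target_link_libraries(myTarget Kokkos::kokkos)
```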

Spack

An alternative to manually building with CMake is to use the Spack package manager. To get started, download the Spack repo.

A basic installation would be done as:
> spack install kokkos

Spack allows options and compilers to be tuned in the install command.

> spack install [email protected] %[email protected] +openmp

This example illustrates the three most common parameters to Spack:

  • Variants: specified with, e.g., +openmp; this activates (or deactivates with, e.g., ~openmp) certain options.
  • Version: immediately following kokkos, @version specifies a particular Kokkos version to build.
  • Compiler: a default compiler will be chosen if not specified, but an exact compiler version can be given with the % option.

For a complete list of Kokkos options, run:

> spack info kokkos

Spack currently installs packages to a location determined by a unique hash, which is not human readable. In general, Spack usage should never require you to reference this computer-generated install folder. More details are given in the build instructions. If you must know, you can locate Spack Kokkos installations with:

> spack find -p kokkos ...

where ... is the unique spec identifying the particular Kokkos configuration and version. More details can be found in the Kokkos Spack documentation or on the Spack website.

Raw Makefile

A bash script is provided to generate raw makefiles. To install Kokkos as a library, create a build directory and run the following:

> $KOKKOS_PATH/generate_makefile.bash --prefix=$path_to_install

Once the Makefile is generated, run:

> make kokkoslib
> make install

To additionally run the unit tests:

> make build-test
> make test

Run generate_makefile.bash --help for more detailed options such as changing the device type for which to build.

Inline Builds vs. Installed Package

For individual projects, it may be preferable to build Kokkos inline rather than link to an installed package. The main reason is that you may otherwise need many different configurations of Kokkos installed, depending on the compile-time features an application needs. For example, there is only one default execution space, which means you need different installations to have OpenMP or Pthreads as the default space. Also, for the CUDA backend there are certain choices, such as allowing relocatable device code, which must be made at installation time. Building Kokkos inline uses largely the same process as compiling an application against an installed Kokkos library.

For CMake, this means copying over the Kokkos source code into your project and adding add_subdirectory(kokkos) to your CMakeLists.txt.
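A minimal inline-build setup might look like the following (a sketch; the project name, target name, and source file are illustrative):

```cmake
cmake_minimum_required(VERSION 3.16)
project(MyApp CXX)

# Kokkos source tree copied (or added as a git submodule) into ./kokkos;
# backend and architecture options can be passed at configure time,
# e.g. -DKokkos_ENABLE_OPENMP=On.
add_subdirectory(kokkos)

add_executable(myTarget main.cpp)
target_link_libraries(myTarget Kokkos::kokkos)
```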

For raw Makefiles, see the example benchmarks/bytes_and_flops/Makefile, which can be used with an installed library or an inline build.

Kokkos and CUDA UVM

Kokkos supports UVM as a specific memory space called CudaUVMSpace. Allocations made with that space are accessible from host and device. You can tell Kokkos to use it as the default space for Cuda allocations. In either case, UVM comes with a number of restrictions:

  • You can't access allocations on the host while a kernel is potentially running; doing so will lead to segfaults. To avoid that, either call Kokkos::Cuda::fence() (or just Kokkos::fence()) after kernels, or set the environment variable CUDA_LAUNCH_BLOCKING=1.
  • On multi-socket, multi-GPU machines without NVLINK, UVM defaults to zero-copy allocations for technical reasons related to using multiple GPUs from the same process. If an executable does not use multiple GPUs per process (e.g., each MPI rank of an application uses a single GPU, possibly the same GPU for multiple ranks), you can set CUDA_MANAGED_FORCE_DEVICE_ALLOC=1. This enforces proper UVM allocations, but can lead to errors if more than a single GPU is used by a single process.
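The first restriction can be illustrated with a short sketch (this requires a CUDA-enabled Kokkos build, so it is shown for illustration only):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // UVM allocation: accessible from both host and device.
    Kokkos::View<double*, Kokkos::CudaUVMSpace> data("data", 1000000);

    Kokkos::parallel_for("fill", 1000000, KOKKOS_LAMBDA(const int i) {
      data(i) = 2.0 * i;
    });

    // Without this fence (or CUDA_LAUNCH_BLOCKING=1), the kernel may still
    // be running when the host touches the allocation, which can segfault.
    Kokkos::fence();

    double first = data(0);  // host access is now safe
    (void)first;
  }
  Kokkos::finalize();
}
```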

Citing Kokkos

If you publish work which mentions Kokkos, please cite the following paper:

@article{CarterEdwards20143202,
  title = "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns ",
  journal = "Journal of Parallel and Distributed Computing ",
  volume = "74",
  number = "12",
  pages = "3202 - 3216",
  year = "2014",
  note = "Domain-Specific Languages and High-Level Frameworks for High-Performance Computing ",
  issn = "0743-7315",
  doi = "https://doi.org/10.1016/j.jpdc.2014.07.003",
  url = "http://www.sciencedirect.com/science/article/pii/S0743731514001257",
  author = "H. Carter Edwards and Christian R. Trott and Daniel Sunderland"
}

License

Under the terms of Contract DE-NA0003525 with NTESS, the U.S. Government retains certain rights in this software.

Comments
  • ScatterView

    ScatterView

    @stanmoore1

    This is an implementation of ReductionView (issue #825) that seems to meet the implementation and performance requirements. The interface may need tweaking in terms of names, etc.

    One key difference from what was proposed in #825 : due to the way UniqueToken works, an extra level of indirection is needed to encapsulate the calls to UniqueToken::acquire and release at the beginning and end of the user's code. Basically, you call view.access() and get another thing that can actually be accessed like a View:

    auto view_access = reduction_view.access();
    view_access(i, 0) += dx;
    view_access(i, 1) += dy;
    view_access(i, 2) += dz;
    

    Unit tests are in place that suggest there are no race conditions or bugs in the implementation. An OpenMP performance test was also added which confirms that the data-duplicated implementation is faster than the atomic one.
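The duplication idea can be sketched without Kokkos using plain C++ threads (a hypothetical illustration only; scatter_add_duplicated and its parameters are invented for this example, and the real ScatterView additionally abstracts over an atomic fallback):

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Data-duplicated scatter-add, the idea behind ScatterView: each thread
// accumulates into its own private copy of the result array, and the
// copies are reduced ("contributed") at the end, avoiding atomics.
std::vector<double> scatter_add_duplicated(const std::vector<int>& bins,
                                           const std::vector<double>& weights,
                                           int num_bins, int num_threads) {
  // One private copy of the result per thread ("duplication").
  std::vector<std::vector<double>> copies(
      num_threads, std::vector<double>(num_bins, 0.0));

  std::vector<std::thread> workers;
  const std::size_t n = bins.size();
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      // Each thread gets its own copy, like UniqueToken::acquire().
      std::vector<double>& mine = copies[t];
      for (std::size_t i = t; i < n; i += num_threads)
        mine[bins[i]] += weights[i];  // no race: copy is thread-private
    });
  }
  for (auto& w : workers) w.join();

  // Contribute step: reduce the per-thread copies into one result.
  std::vector<double> result(num_bins, 0.0);
  for (const auto& copy : copies)
    for (int b = 0; b < num_bins; ++b) result[b] += copy[b];
  return result;
}
```

The trade-off the performance test measures is exactly this: duplication costs memory proportional to the thread count but avoids atomic contention on every update.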

    Feature Request 
    opened by ibaned 100
  • Control Profiling hooks programmatically

    Control Profiling hooks programmatically

    Per #2973 , this adds two features (though it doesn't close 2973, I need one more orthogonal PR for that).

    This allows

    1. A code to set the profiling hooks via an API call
    2. A code to pause/resume profiling

    For each callback, there's an associated setter. When a profiling hook is set, it's recorded in a struct. On pause, every callback is null'd out. On resume, the callbacks are repopulated from that struct.
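The pause/resume mechanism can be sketched as follows (names are illustrative, not the actual Kokkos API):

```cpp
#include <cassert>

// Sketch of the pause/resume scheme described above: callbacks live in a
// struct; pausing nulls out the active set, resuming restores it from a
// saved copy recorded by the setter.
using BeginCallback = void (*)(const char* name);

struct EventSet {
  BeginCallback begin_parallel_for = nullptr;
};

EventSet active;  // callbacks currently invoked by the runtime
EventSet saved;   // snapshot used to repopulate on resume

void set_begin_parallel_for(BeginCallback cb) {
  active.begin_parallel_for = cb;
  saved.begin_parallel_for = cb;  // recorded so resume can restore it
}

void pause_tools() { active = EventSet{}; }  // null out every callback
void resume_tools() { active = saved; }      // repopulate from the record

int call_count = 0;
void counting_callback(const char*) { ++call_count; }

// What the runtime would do at the start of a parallel_for.
void begin_parallel_for(const char* name) {
  if (active.begin_parallel_for) active.begin_parallel_for(name);
}
```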

    opened by DavidPoliakoff 64
  • jje/par deep copy

    jje/par deep copy

    There are some things to TODO, particularly, making my casts C++ style.

    I've been passing tests in Trilinos/Kokkos, but I recently made a few changes per D. Sunderland. I haven't rigorously tested the copy since.

    The code is intended to fit into the Kokkos Core library (that is, it is compiled w/a CPP, so this actually speeds up build times.)

Scalability data is in my gitlab repo. On Power9 and HSW the threaded transfer rates are substantially faster, but there is a threshold that needs to be tuned to avoid using threads when the work is too small.

    opened by jjellio 63
  • HPX backend

    HPX backend

    This adds an HPX backend to the Kokkos::Experimental namespace. It contains (AFAICT) all functionality required of a Kokkos backend. It only runs with deprecated code disabled.

    There is an HPX option called enable_async_dispatch which makes parallel dispatch asynchronous with the HPX backend. It's off by default. All work submitted through Kokkos is still processed sequentially. (By the way, I've gone and added a bunch of Kokkos::fence calls to tests to deal with the asynchronous version. I may have added them in too many places.)

    A preprocessor option (KOKKOS_HPX_IMPLEMENTATION) switches between two different implementations for most of the parallel dispatch functions. One is the obvious HPX way of implementing the backend, but since HPX still lacks some optimizations there is a slightly more verbose and optimized version. The optimized version is the default. I haven't made a build system option for this since I don't expect any user to want to change that---it's mainly for testing.

    I've tried to use existing Kokkos utilities as much as possible but some things I ended up implementing just for the HPX backend.

    I've had it running within Trilinos (with HPX as a TPL) but I'm not very confident that I've done everything the correct tribits way. Comments on this would be appreciated.

    KokkosKernels PR (and maybe Trilinos) will follow. I see there's no support for ROCm in KokkosKernels or Trilinos. Is it because you don't want to do that until it's out of the Kokkos::Experimental namespace or just because it hasn't been done yet?

    Finally, let me know how you prefer to deal with:

    • formatting: I've run clang-format with default settings on new files related to the HPX backend, and haven't changed formatting in other files. Is that OK?
    • rebasing/squashing: I haven't cleaned up the history yet at all. Do you prefer one big squashed commit, or a few cleaned up commits?
    • documentation: Is the wiki the place to document details about using the backend? E.g. I would recommend users to start the HPX runtime instead of letting Kokkos do it for performance reasons. Not that I expect people to flock to the HPX backend.

    Edit: This PR also changes the calculation of chunk sizes to suit HPX better, but that can be separated out (especially if something like #1866 would be available).

    Blocks Promotion 
    opened by msimberg 51
  • Over-fencing in profiling system

    Over-fencing in profiling system

Currently, both the "begin" and "end" APIs in the Kokkos profiling interface will call Kokkos::fence(). For a program running with CUDA_LAUNCH_BLOCKING=1 in the environment, that means that a Kokkos CUDA kernel is surrounded by 3 synchronizations: begin, CUDA_LAUNCH_BLOCKING, and end.

    We are thinking of:

    1. not doing the "end" fence if CUDA_LAUNCH_BLOCKING=1
    2. not doing the "begin" fence ever
    Enhancement 
    opened by ibaned 48
  • List of Catchable Antipatterns in Kokkos

    List of Catchable Antipatterns in Kokkos

    This is an initial list of things we want a compiler tool to at least check for and maybe fix automatically in projects using Kokkos.

    • [ ] Implicit this capture in parallel_for
      • [X] Detected when parallel_for takes a lambda using implicit this capture
      • [X] Avoid detection when one of the arguments to the parallel_for is a TeamThreadRange
      • [ ] Other exceptions? @dhollman?
      • [ ] Harden for general use, for example ensure that parallel_for is located in the Kokkos namespace and is not a different parallel_for, also ensure that it works with named lambda instead of temporary ones.
    • [ ] User configurable warning on capture by value or capture by reference for specific types
    • [ ] Ensure that functor in parallel_{scan,reduce} takes the reduction argument by reference.
    opened by calewis 47
  • Kokkos Tuning Variables

    Kokkos Tuning Variables

    This is a draft PR primarily intended to spur discussion. The idea behind the PR is that we have no idea what future architectures might bring, and that having to do all the crazy tuning that we do and make the compromises that we make might be incredibly expensive in terms of human hours and sanity. It would be nice if we could instead declare our tuning parameters, and then tune them until they are good.

    This aims to give us an interface in which we declare tuning variables and expose them to Tools via hooks, much like our Profiling hooks. This stab at the problem has two main pieces: Tuning Variables and Context Variables ("Thing to learn via ML" and "Features for ML algorithms to eat" respectively).

    First, the existence of such a Variable is declared. Each variable has

    1. a name
    2. a unique id
    3. a value type (boolean, int, float, text/char*)
    4. a value category (categorical, ordinal, interval, ratio)
    5. a descriptor of the candidate values (are they a set or a range)
    6. The actual candidate values

Note that these are not independent; a boolean is not going to be interval data. Then, we can declare the values of context variables, or request the values of tuning variables. My hope is that the declaration of context variables might leak out into codes, so we don't just learn block size from kernel name and bounds, but from "number of particles" or "mesh refinement."
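The six-part declaration above might be sketched as a plain struct (illustrative names only; the real interface would live in the Kokkos Tools headers):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of a tuning-variable declaration as described above.
enum class ValueType { Boolean, Int, Float, Text };
enum class ValueCategory { Categorical, Ordinal, Interval, Ratio };
enum class CandidateKind { Set, Range };

struct TuningVariable {
  std::string name;            // 1. a name
  std::uint64_t id;            // 2. a unique id
  ValueType type;              // 3. value type
  ValueCategory category;      // 4. value category
  CandidateKind candidates;    // 5. set or range of candidates
  std::vector<double> values;  // 6. the actual candidate values (set case)
};

TuningVariable declare_block_size() {
  // Block size over a fixed set of candidate values.
  return TuningVariable{"block_size", 1, ValueType::Int,
                        ValueCategory::Categorical, CandidateKind::Set,
                        {32, 64, 128, 256, 512}};
}
```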

    This is still very much a WIP, but I wanted to discuss

    1. Code organization. I think we no longer just have Kokkos_Profiling_Interface.hpp, we might now have Kokkos_Tools_Interface, which includes Kokkos_Tuning_Interface, Kokkos_Profiling_Interface, Kokkos_Debugging_Interface (once I get around to writing it), but I want people's views
    2. Is this approach utterly "set it on fire" wrong? Mildly wrong? In each case, how can we make it even more wrong? (edit: @crtrott apparently wants code that is less wrong)
    3. If not, what should be tuning variables in an initial draft? So far Christian has told me block size (Ye Auld Autotuning Parameter) and whether to store functors in constant memory

    I'll probably be taking a true swing at this next week, but I wanted this to marinade somebody's brain over the weekend so they'd tell me what needs to happen here.

    Related Tools PR: https://github.com/kokkos/kokkos-tools/pull/89 Related Kernels PR with implementation example: https://github.com/kokkos/kokkos-kernels/pull/712

    opened by DavidPoliakoff 46
  • Non-fencing, asynchrony-aware, tools interface

    Non-fencing, asynchrony-aware, tools interface

    This is an issue to request people's thoughts for a version 2 of the Tools interface, one which doesn't require fences, and so doesn't perturb the application beyond all reason. I'm curious about everybody's thoughts, but really want feedback from

    • @jrmadsen
    • @daboehme
    • @wrwilliams
    • @nmhamster
    • @khuck (and the rest of my UO compatriots. I don't know Sameer's user ID, but I hope we can get his feedback)

    Update: if you're coming to this topic, start here, the rest of this is a bit of a terrible intro to the problem

    As they're the people I know who have dove headfirst into such concepts already. Some hard requirements:

    • Tools targeting V1 should work for the foreseeable future
    • This interface shall-must-will-shalt be C
    • Need to be able to prove that this works with Kokkos Graphs, and with code that uses separate Cuda (for example) instances

Initial thoughts: one question I'm relatively confident in an answer to, but wanted to raise, is whether it's Kokkos or the tool that's responsible for managing things like the CUDA event API. That feels like our problem; asking every tool to write the same code has never been a favorite of mine.

    But I think the answer to that question shapes the most important parts of the design, if we decide it's the Tool's problem to look at the underlying asynchrony, it's a hugely different design.

    In either branch of this tree, I think what winds up happening is that v2 callbacks also return a decision on whether to fence. Pseudocode for a design here would look like

    
    struct toolResponse {
      int should_fence; // int-as-bool, can use bool if people prefer
      int unused_1; // allow for us to expand what a tool can tell us later without changing interface
      ...
      int unused_n;
    };

    struct EventSet { // currently exists, showing some changes
      beginParallelForFunction begin_v1;
      v2BeginParallelForFunction begin_v2; // the v2 function takes a toolResponse* as an arg, otherwise is as in v1
      ...
    };

    void Profiling::initialize() {
      events.begin_v2 = dlsym("kokkosp_begin_parallel_for_v2");
      if(events.begin_v2 == nullptr) {
        events.begin_v1 = dlsym("kokkosp_begin_parallel_for");
      }
    }

    toolResponse create_default_tool_response(){
      return toolResponse { 1 }; // should_fence = true
    }

    void Profiling::beginParallelFor(all_the_args_begin_parallel_for_takes) {
      if((events.begin_v1 == nullptr) && (events.begin_v2 == nullptr)) { return; }
      auto toolResponse = create_default_tool_response();
      if(events.begin_v2 != nullptr) {
        (*events.begin_v2)(all_the_v1_args, &toolResponse);
      }
      else if(events.begin_v1 != nullptr) {
        (*events.begin_v1)(all_the_v1_args);
      }
      if(toolResponse.should_fence != 0) {
        Kokkos::fence();
      }
    }
    

    I could see slight tweaks there, perhaps a tool doesn't return whether to fence, but what ("you need to fence the GPU").

    Also, I've basically left this huge gap, where some magical solution figures out how to hide the CUDA event API, the HIP equivalent, OpenMPTarget's stuff, whatever SYCL thinks up... I'm going to be honest and say I don't have a good idea there. That's a place I'd love to lean on the expertise of the community. I have a feeling we have an enum-union around a set of different [somethings], and the tool looks at whether it can handle the kind of asynchrony, based on the enum, to decide whether to fence. I don't know though, my best answer so far is to tag smart people and see what they come up with.

    So the asks in this issue are

    • Am I missing any requirements? Are any of my requirements wrong?
    • Does the toolResponse part of the interface sound reasonable?
    • How on earth are we going to abstract these different asynchronous interfaces?
    opened by DavidPoliakoff 44
  • Modernize stack trace functionality

    Modernize stack trace functionality

    @mhoemmen contributed this code for new and improved stack traces.

    We want to first merge this in on its own and then later integrate it with other mechanisms such as contracts for error handling.

    opened by ibaned 44
  • Kokkos release 3.6.0

    Kokkos release 3.6.0

    This issue is to track status of the kokkos 3.6.0 release.

    Pull requests have been created for: Kokkos: #4827 KokkosKernels: kokkos/kokkos-kernels#1345 Trilinos: trilinos/Trilinos#10253

    InDevelop 
    opened by ndellingwood 39
  • CMake: Fix having kokkos as a subdirectory in a pure cmake project

    CMake: Fix having kokkos as a subdirectory in a pure cmake project

Ok I worked a bit on this. This is a simple CMakeLists.txt for a project which has snapshotted kokkos into the top-level project directory.

    cmake_minimum_required (VERSION 2.8.12)
    project (Example)
    
    #include Kokkos:
    add_subdirectory (kokkos)
    include_directories(kokkos/core/src)
    include_directories(kokkos/algorithms/src)
    include_directories(kokkos/containers/src)
    include_directories(${CMAKE_BINARY_DIR}/kokkos/core/src)
    include_directories(${CMAKE_BINARY_DIR}/kokkos/algorithms/src)
    include_directories(${CMAKE_BINARY_DIR}/kokkos/containers/src)
    
    #What variables are set??
    #get_cmake_property(_variableNames VARIABLES)
    #foreach (_variableName ${_variableNames})
    #    message(STATUS "${_variableName}=${${_variableName}}")
    #endforeach()
    
    add_executable (example simple_reduce_lambda.cpp)
    target_link_libraries (example kokkoscore)
    
    set(CMAKE_CXX_FLAGS "-std=c++11 -O3")
    

    The directory contains: CMakeLists.txt kokkos simple_reduce_lambda.cpp

    Creating a build dir and then doing cmake ../ from it configures the build, make builds.

But: can we do better than the include_directories business? I.e., define variables inside kokkos which can be consumed on the outside. We also need to figure out all the architecture settings which are currently not cmake variables.

    Enhancement 
    opened by crtrott 39
  • WIP: Fixup shared spaces

    WIP: Fixup shared spaces

    Fixup for discussion https://github.com/kokkos/kokkos/pull/5405#discussion_r982473816 (continued in the dev-meeting + decision to only define the symbols KOKKOS_HAS_SHARED_SPACE and KOKKOS_HAS_SHARED_HOST_PINNED_SPACE but not define them to 1 for enabled. Decision mainly to keep it consistent in Kokkos). I also removed some unnecessary brackets in cmake. Already rebased on https://github.com/kokkos/kokkos/pull/5405 and https://github.com/kokkos/kokkos/pull/5508 thus will be blocked by these two.

    opened by JBludau 0
  • Fix warning and undefined behavior in tools unit tests

    Fix warning and undefined behavior in tools unit tests

    • Update cmake test registering helper function to avoid runtime warnings about using the deprecated KOKKOS_PROFILE_LIBRARY environment variable
    Warning: environment variable 'KOKKOS_PROFILE_LIBRARY' is deprecated. Use 'KOKKOS_TOOLS_LIBS' instead. Raised by Kokkos::initialize().
    
    • Partial fix for issues raised by Clang's UndefinedBehaviorSanitizer
    13: /kokkos/core/src/impl/Kokkos_Profiling.cpp:322:5: runtime error: call to function kokkosp_allocate_data through pointer to incorrect function type 'void (*)(Kokkos_Profiling_SpaceHandle, const char *, const void *, unsigned long)'
    13: /kokkos/core/unit_test/tools/printing-tool.cpp:90: note: kokkosp_allocate_data defined here
    13: SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /kokkos/core/src/impl/Kokkos_Profiling.cpp:322:5 in
    13: /kokkos/core/src/impl/Kokkos_Profiling.cpp:322:5: runtime error: call to function kokkosp_begin_deep_copy through pointer to incorrect function type 'void (*)(Kokkos_Profiling_SpaceHandle, const char *, const void *, Kokkos_Profiling_SpaceHandle, const char *, const void *, unsigned long)'
    13: /kokkos/core/unit_test/tools/printing-tool.cpp:106: note: kokkosp_begin_deep_copy defined here
    13: SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /kokkos/core/src/impl/Kokkos_Profiling.cpp:322:5 in
    13: /kokkos/core/src/impl/Kokkos_Profiling.cpp:322:5: runtime error: call to function kokkosp_push_profile_region through pointer to incorrect function type 'void (*)(const char *)'
    13: /kokkos/core/unit_test/tools/printing-tool.cpp:81: note: kokkosp_push_profile_region defined here
    13: SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /kokkos/core/src/impl/Kokkos_Profiling.cpp:322:5 in
    

The changes proposed resolve the kokkosp_push_profile_region issue, but the kokkosp_allocate_data and kokkosp_begin_deep_copy errors are still there because of the SpaceHandle argument and our reliance on type punning. Fixing these would be much more involved (because of backward compatibility considerations) and beyond the scope of this PR.

    opened by dalg24 0
  • Serialize `SharedSpace` test

    Serialize `SharedSpace` test

Fix #5502. As Trilinos and other projects might run tests in parallel, and we saw that TestSharedSpace might be sensitive to other work running on the same GPU, it is marked with the property RUN_SERIAL.

    opened by JBludau 1
  • Kokkos does not export AMD flags to Trilinos

    Kokkos does not export AMD flags to Trilinos

Describe the bug: When configuring Trilinos to build Kokkos with HIP enabled, the --offload-arch=gfx90N flag is apparently not added to Trilinos' flags. This leads to obvious compilation issues; setting the environment variable ROCM_TARGET_LST fixes the issue but defeats the point of specifying the ARCH to Kokkos at configure time. Adding the following logic in CMakeLists.txt at line 239 also fixes this issue:

      IF(KOKKOS_ENABLE_HIP)
        FOREACH(OPTION ${KOKKOS_AMDGPU_OPTIONS})
          STRING(FIND "${OPTION}" " " OPTION_HAS_WHITESPACE)
          IF(OPTION_HAS_WHITESPACE EQUAL -1)
            LIST(APPEND KOKKOS_COMPILE_OPTIONS_TMP "${OPTION}")
          ELSE()
            LIST(APPEND KOKKOS_COMPILE_OPTION_TMP "\"${OPTION}\"")
          ENDIF()
        ENDFOREACH()
      ENDIF()
    
    Patch Release 
    opened by lucbv 4
  • SYCL:  Remove workaround for submit_barrier not being enqueued properly

    SYCL: Remove workaround for submit_barrier not being enqueued properly

    Needs https://github.com/intel/llvm/pull/6359 and https://github.com/intel/llvm/pull/6888 (or manually setting SYCL_PI_LEVEL_ZERO_USE_MULTIPLE_COMMANDLIST_BARRIERS=1 as environment variable). Also updates the AOT architectures. I would consider merging it already if it passes CI (which I expect).

    opened by masterleinad 0