Patterns and behaviors for GPU computing


moderngpu 2.0

(c) 2016 Sean Baxter

You can drop me a line here

Full documentation with github wiki under heavy construction.

Latest update:

2.12 2016 June 8 -
  Fixed problem in load_two_streams_reg when loading from unaligned types.

moderngpu is a productivity library for general-purpose computing on GPUs. It is a header-only C++ library written for CUDA. The unique value of the library is in its accelerated primitives for solving irregularly parallel problems.

Release notes

2.11 2016 June 6 -
  Removed decltype() calls on __device__-tagged lambdas. This introduces
    two breaking changes: transform_scan and fill_function now take explicit
    types as their first template arguments.

2.10 2016 May 15 -
  Allow for non-pow2 sized launches. Rewrote cta_reduce_t to support these

2.09 2016 May 7 -
  Greatly improved and more standard tuple class.
  Optimized tuple value caching for lbs-related functions. 

2.08 2016 Apr 24 -
  Restricted pointer promotion on transform functions using variadic arguments.
  Fixed reduction bug in stream compaction.

2.07 2016 Apr 17 -
  Added mechanism for passing kernel arguments through variadic parameter pack.
  Added occupancy calculator in launch_box.hxx.
2.06 2016 Apr 12 - 
  Fixed critical kernel versioning bug. Now uses cudaFuncGetAttributes
  ptxVersion instead of binaryVersion to select launch box parameters.
  Thanks @antonmks.
2.05 2016 Apr 3 -
  Restructured segmented sort and added segmented_sort_indices.
  Wrote more robust test.
  Modified cities demo to use segmented_sort_indices.
  TODO: Segmented sort with bitfield for segmented headers.

2.04 2016 Apr 2 -
  Fixed multiple definition of dummy_k kernel when including 
    standard_context_t in multiple translation units.

2.03 2016 Apr 2 -
  Added transform_compact pattern and

2.02 2016 Apr 1 -
  Added dynamic work-creation function lbs_workcreate.
  Improved ease of calling cta_scan_t.
  cta_reduce_t now uses shfl for all datatypes and operators on sm_30+.

2.01 2016 Mar 31 -
  Refactored scan implementation and added scan_event.

2.00 2016 Mar 28 - 
  moderngpu 2.0 first release.

  Everything rewritten.
  Use -std=c++11 and --expt-extended-lambda to build.
  Developed and tested with CUDA 7.5.17, g++ 4.9.3 on 64-bit linux.

  Tests and samples build on OSX clang-700-1.81.
  Invocations of these kernels with certain arguments can push Visual 
  Studio 2013 to the breaking point. "Exceeded token length" is a common
  error when passing iterators to high-level transforms. Users may be better
  off avoiding Visual Studio until CUDA 8.0 is released with VS 2015 support.

  Visual Studio 2013 has broken constexpr support, so with that version
  moderngpu chooses to redefine its constexpr functions as macros, injecting
  them into the global scope. Sorry!

  TODO: Complete kernel unit tests. 
  TODO: API references.
  TODO: Multiset operations for feature parity with moderngpu 1.0.
  TODO: Restore the moderngpu 1.0 text by integrating it with this markdown.
  • Improving moderngpu's support (Windows, CMake, ...)

    Improving moderngpu's support (Windows, CMake, ...)

    • [x] Simple cmake support, maybe this can be done better or expanded on.
    • [x] Tested with the tests folder, works really well.
    • [x] Removed std::binary_function, apparently, you don't need it: Definitely the cause of problem with C++17.
    • [x] Enable all SMs.
    • [x] We now add ModernGPU version to cmake as well, this increments the minor version to 2.13.
    • [x] Addressing the use of __attribute__((alignment))
    • [x] Added support for demos and tutorials, tested and compiles.
    • [x] Added CI using github actions (workflows for both Windows and Ubuntu).
    • [x] Not doing this; ~~Switch to using C++ alignas?~~

    NOTE: This PR removes the old Makefile.

    opened by neoblizz 3
  • Cuda 8, VS2105

    Cuda 8, VS2105


    I am having trouble getting reduce_by_key to compile. I think there are some compiler issues. My system is VS2015 with Cuda 8.

    The following "attribute((aligned))" in tuple.hxx


    makes the compiler choke.

    The second problem is the restrict keyword usage in meta.hxx:


    It seems, that the __restrict__keyword ist not allowed for type definitions. It throws this compiler error:

    Removing problem 1 and all types that include restrict works. However, I am not sure how this influences the functionality of the library.

    Thanks for the great work!

    opened by crohkohl 3
  • wrong tid in cta_launch with non power of two NT

    wrong tid in cta_launch with non power of two NT

    using cta_launch with NT not being a power of two results in wrong tid values passed to the lambda. simple test case:

    static const int NT = 96
    mgpu::cta_launch<NT>([=] MGPU_DEVICE (const int tid, const int block) {
          printf("thread %d %d %d\n", threadIdx.x, tid, threadIdx.x & (NT - 1));
    }, 1, ctx);

    For NT = 32, 64 and 128 this works fine. However setting NT to any multiple of 32 should be valid, right?

    opened by christiankerl 3
  • Error while running test

    Error while running test

    I am trying to run test on mergesort and segreduce, only to find the folling error

    NVIDIA GeForce RTX 2080 : 1710.000 Mhz   (Ordinal 0)
    46 SMs enabled. Compute Capability sm_75
    FreeMem:   7862MB   TotalMem:   7981MB   64-bit pointers.
    Mem Clock: 7000.000 Mhz x 256 bits   (448.0 GB/s)
    ECC Disabled
    Floating point exception (core dumped)

    How could I fix the bug?

    opened by fishmingyu 2
  • Avoid cudaMalloc and cudaFree within the mergesort kernel

    Avoid cudaMalloc and cudaFree within the mergesort kernel

    Hello, I am testing both moderngpu's mergesort and cub's radixsort for an implementation that I am currently working on.

    I have benchmarked both and came to the conclusion that (at least for my data) the radix sort implementation from the CUB library is slightly faster due to the fact that it does not need several mallocs while sorting. On the other hand, moderngpu's mergesort appears to be performing mallocs every iteration, see below:


    The box highlight in red is composed of several cudaMalloc and cudaFree. I was wondering if it is possible to get rid of these allocs, since this would make moderngpu faster than CUB in my application. For instance, in CUB I can simply pass the pointers with pre-allocated memory to the kernel launch.

    Is there anyway to do this in moderngpu? Thank you for your time, Esteban

    ----------Edit: more information: I timed the time of one my apps run and found that the sys time is higher than that of CUB (I assume due to the cudaMallocs, because as it is, mgpu takes approx 10 to 30% less time.) Run with CUB: real 3m26.482s user 3m10.728s sys 0m15.584s

    Run with Mgpu: real 3m34.427s user 3m11.548s sys 0m22.884s

    opened by estebanpw 2
  • Update


    Variable is just used to store and is not being used for anything else, this will cause warnings and errors while compiling with options. Just to be safer side, it would be better without the variable.

    opened by rgsl888prabhu 2
  • advice on tuning launch_params for transform_lbs

    advice on tuning launch_params for transform_lbs

    hi, thanks for the nice library. I'm using transform_lbs in a scenario where every work item needs to perform quite some computation and I'm unsure how to properly set the launch_params. Just reducing the values from the default launch_params_t<128, 11, 8> to something like launch_params_t<128, 3, 3> already increased the performance quite a bit. However, I'm unsure about how to set vt and vt0. should it be odd numbers? always vt > vt0? is the correctness affected if these numbers are not properly set? thanks for your help.

    opened by christiankerl 2
  • moderngpu on mac os x

    moderngpu on mac os x

    anyone having (recent) experience on converting *.sln files to *.cmake? successfully used one perl script in the past, now trying

    opened by nyotis 2
  • Cannot build demos on Linux

    Cannot build demos on Linux

    make: *** No rule to make target demo.o', needed bydemo'. Stop.

    [email protected]:~/Development/quantcrunch/demo$ make --version GNU Make 3.81 Copyright (C) 2006 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    This program built for x86_64-pc-linux-gnu

    opened by ghost 2
  • Dependency on CUDA 5.0.props - Error opening the project in VS

    Dependency on CUDA 5.0.props - Error opening the project in VS


    I have CUDA 5.5 installed and VS2012.

    I tried to open the project but I am getting following error -

    C:\Alenka\moderngpu\demo\demo.vcxproj(32,5): The imported project "C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V110\BuildCustomizations\CUDA 5.0.props" was not found. Confirm that the path in the declaration is correct, and that the file exists on disk.

    I am getting the same error for all the projects.

    Thanks, Sawan

    opened by ghost 2
  • Error in building

    Error in building

    When I run make I get the following error:

    demo/ internal error: assertion failed: gen_paren_or_brace_dynamic_init: bad kind (cp_gen_be.c, line 21252 in gen_paren_or_brace_dynamic_init)

    1 catastrophic error detected in the compilation of "demo/". Compilation aborted. Aborted (core dumped)

    I am not sure how to fix this. Any suggestions?

    opened by ullasAparanji 1
  • Broken link in

    Broken link in "Reduce and Scan"

    The "Reduce and Scan" chapter tries to link to "NVIDIA's Mark Harris on Optimizing Parallel Reduction in CUDA" but uses


    instead of

    opened by didito 0
  • Problem with compiling bf-knn

    Problem with compiling bf-knn

    cd bf-knn nvcc -arch=sm_21 -I ../moderngpu/include/ -o bf-knn-demo i am facing a problem when compiling this code in CMD. help me error1

    ERROR:<44> : fatal error C1083: can not open include file 'moderngpu,cuh' No such a file or directory kindly help me how to include this file

    opened by ghost 0
  • Usage of caching iterators

    Usage of caching iterators

    I am trying to optimize the runtime of many calls to mgpu::lbs_segreduce with the same segment configuration. If I understand correctly, there is a way to use caching iterators. Could you maybe give an example how to use them? There is no documentation on that topic yet.

    opened by crohkohl 0
  • cudaErrorInvalidConfiguration (error 9) on segmented_sort() of trivial-sized data set

    cudaErrorInvalidConfiguration (error 9) on segmented_sort() of trivial-sized data set

    moderngpu git tag: v2.12_june_8_2016 CentOS 7, Linux version 3.10.0-327.22.2.el7.x86_64 gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6) Cuda compilation tools, release 7.5, V7.5.17

    Calling segmented_sort() on a trivial data set (e.g., 1 segment of 24 values) results in cudaErrorInvalidConfiguration (error 9). Determined the cause to be located at kernel_segsort.hxx:69:

    op_counters = fill<int2>(int2(), num_passes, context);

    Per kernel_segsort.hxx:43-45, num_passes is set to 0 if num_ctas is equal to 1. Allocating an extra element in op_counters fixes the issue by allowing fill() to succeed:

    op_counters = fill<int2>(int2(), num_passes + 1, context);

    Minimum working example:

    using uint = unsigned int;
    #include "moderngpu/context.hxx"
    #include "moderngpu/kernel_segsort.hxx"
    #include <exception>
    #include <sstream>
    int main(int, char**) {
        using namespace mgpu;
        using namespace std;
        standard_context_t context;
        uint nValues = 24, nSegments = 1;
        uint *dKeys, *dVals, *dSegs;
        cudaMalloc(&dKeys, sizeof(*dKeys) * nValues);
        cudaMalloc(&dVals, sizeof(*dVals) * nValues);
        cudaMalloc(&dSegs, sizeof(*dSegs) * (nSegments + 1));
        uint hKeys[] = {
            98, 63, 82, 50, 40, 36, 44, 36, 49, 17, 21, 71,
            82, 67, 54, 74, 60, 15, 85, 58, 11, 42, 22, 97
        uint hVals[] = {
            34, 11, 98, 41, 83, 34, 49, 40, 61, 14, 10, 82,
            19, 27, 80, 66, 54, 99, 79, 90, 70, 42, 16, 78
        uint hSegs[] = {0, nValues};
        cudaMemcpy(dKeys, &hKeys[0], sizeof(*dKeys) * nValues, cudaMemcpyHostToDevice);
        cudaMemcpy(dVals, &hVals[0], sizeof(*dVals) * nValues, cudaMemcpyHostToDevice);
        cudaMemcpy(dSegs, &hSegs[0], sizeof(*dSegs) * (nSegments + 1), cudaMemcpyHostToDevice);
            dKeys, dVals, (int)nValues,
            dSegs, (int)nSegments,
            less_t<uint>(), context);
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess) {
            ostringstream msg;
            msg << "cudaGetLastError(): " << cudaGetErrorString(error) << endl;
            throw runtime_error(msg.str());
        return 0;
    opened by impracticably 0
  • good launch_params_t for mergesort

    good launch_params_t for mergesort

    I'm currently in the process of replacing thrust with moderngpu because of

    My problem now is to choose an adequate launch_params_t depending on the number of elements to be sorted, as that number can be changed by the user.

    Thrust chooses this automatically in a way, I did not find out how, but it works quite good.

    Any recommendations?

    opened by florianjacob 0
  • Support for

    Support for "tie"

    Hi! Would it be possible to complete support for mgpu::tie? As is, it's missing support for the assignment of mgpu::tuple<args_t...> to


    #include <moderngpu/tuple.hxx>
    mgpu::tuple<int, float>
    example_tuple() {
      return mgpu::make_tuple(1, 2.f);
    mgpu::tuple<int, float>
    example_using_tie() {
      int a;
      float b;
      mgpu::tie(a,b) = example_tuple();
      return mgpu::make_tuple(a,b);

    yields error: no operator "=" matches these operands
                operand types are: mgpu::tuple<int &, float &> = mgpu::tuple<int, float>


    opened by bergdorf 0
  • v2.13.0(Dec 22, 2021)

    General build advice; please see what GitHub actions are successfully building with and use that instead. The latest version of CUDA and the respective environment at the time of release builds fine! The gist of this release is improving the general build support for Linux and Windows. Added some CI to help make the project more robust.

    v2.13.0 Changelog


    • cmake support (requires minimum cmake version 3.19) (also fixes #44),
    • Windows support,
    • GitHub actions --- continuous integration, supports latest ubuntu and windows builds,
    • Add supports for ampere architecture (SM 86 and below).


    • Add supports for demos and tutorials, tested (Fixes #45),
    • Removed std::binary_function (deprecated in C++11 and removed in C++17), fixes compilation for C++17 (Fixes #46),
    • Alignment issues solved (maybe in future we can use C++ alignas instead),
    • Untested, but may have fixes #43, issue with clang-cuda compile,
    • Fixes issue #30, added a possible way to cite the project.


    • is_restrict and remove_restrict, cause a compiler issue on Windows (Fixes #23)
    Source code(tar.gz)
    Source code(zip)
A C++ GPU Computing Library for OpenCL

Boost.Compute Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL. The core library is a thin C++ wrapper over the OpenCL API an 1.4k Jan 5, 2023
A C++ GPU Computing Library for OpenCL

Boost.Compute Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL. The core library is a thin C++ wrapper over the OpenCL API an 1.3k Dec 30, 2022
Fidelius - YeeZ Privacy Computing

Fidelius - YeeZ Privacy Computing Introduction In order to empower data collaboration between enterprises and help enterprises use data to enhance the

YeeZTech 59 Dec 9, 2022
A C++17 thread pool for high-performance scientific computing.

We present a modern C++17-compatible thread pool implementation, built from scratch with high-performance scientific computing in mind. The thread pool is implemented as a single lightweight and self-contained class, and does not have any dependencies other than the C++17 standard library, thus allowing a great degree of portability

Barak Shoshany 1.1k Jan 4, 2023
ArrayFire: a general purpose GPU library.

ArrayFire is a general-purpose library that simplifies the process of developing software that targets parallel and massively-parallel architectures i

ArrayFire 4k Dec 27, 2022
OpenCL based GPU accelerated SPH fluid simulation library

libclsph An OpenCL based GPU accelerated SPH fluid simulation library Can I see it in action? Demo #1 Demo #2 Why? Libclsph was created to explore the

null 47 Jul 27, 2022
Optimized primitives for collective multi-GPU communication

NCCL Optimized primitives for inter-GPU communication. Introduction NCCL (pronounced "Nickel") is a stand-alone library of standard communication rout

NVIDIA Corporation 1.9k Dec 30, 2022
stdgpu: Efficient STL-like Data Structures on the GPU

stdgpu: Efficient STL-like Data Structures on the GPU Features | Examples | Documentation | Building | Integration | Contributing | License | Contact

Patrick Stotko 777 Jan 8, 2023
Concurrency Kit 2.1k Jan 4, 2023
An implementation of Actor, Publish-Subscribe, and CSP models in one rather small C++ framework. With performance, quality, and stability proved by years in the production.

What is SObjectizer? What distinguishes SObjectizer? SObjectizer is not like TBB, taskflow or HPX Show me the code! HelloWorld example Ping-Pong examp

Stiffstream 314 Dec 26, 2022
Simple and fast C library implementing a thread-safe API to manage hash-tables, linked lists, lock-free ring buffers and queues

libhl C library implementing a set of APIs to efficiently manage some basic data structures such as : hashtables, linked lists, queues, trees, ringbuf

Andrea Guzzo 392 Dec 3, 2022
Sqrt OS is a simulation of an OS scheduler and memory manager using different scheduling algorithms including Highest Priority First (non-preemptive), Shortest Remaining Time Next, and Round Robin

A CPU scheduler determines an order for the execution of its scheduled processes; it decides which process will run according to a certain data structure that keeps track of the processes in the system and their status.

null 10 Sep 7, 2022
RocketOS is a Unix based OS that uses legacy BIOS and GRUB and is written in C17. It is being developed for educational purposes primarily, but it still is a serious project. It is currently in its infancy.

RocketOS What is RocketOS? RocketOS is a Unix based OS that uses legacy BIOS and GRUB and is written in C17. It is being developed for educational pur

null 30 Sep 19, 2022
Suman Raj Khanal 7 Nov 24, 2021
OOX: Out-of-Order Executor library. Yet another approach to efficient and scalable tasking API and task scheduling.

OOX Out-of-Order Executor library. Yet another approach to efficient and scalable tasking API and task scheduling. Try it Requirements: Install cmake,

Intel Corporation 18 Oct 25, 2022
Parallel-hashmap - A family of header-only, very fast and memory-friendly hashmap and btree containers.

The Parallel Hashmap Overview This repository aims to provide a set of excellent hash map implementations, as well as a btree alternative to std::map

Gregory Popovitch 1.7k Jan 3, 2023
Bolt is a C++ template library optimized for GPUs. Bolt provides high-performance library implementations for common algorithms such as scan, reduce, transform, and sort.

Bolt is a C++ template library optimized for heterogeneous computing. Bolt is designed to provide high-performance library implementations for common

null 360 Dec 27, 2022
A General-purpose Parallel and Heterogeneous Task Programming System

Taskflow Taskflow helps you quickly write parallel and heterogeneous tasks programs in modern C++ Why Taskflow? Taskflow is faster, more expressive, a

Taskflow 7.6k Dec 31, 2022