Portable header-only C++ low level SIMD library

Overview

libsimdpp

Join the chat at https://gitter.im/libsimdpp/Lobby

libsimdpp is a portable, header-only, zero-overhead C++ low-level SIMD library. The library presents a single interface over the SIMD instruction sets present in x86, ARM, PowerPC and MIPS architectures. On architectures that support several SIMD instruction sets, the library allows the same source files to be compiled once for each instruction set and then hooked into an internal or third-party dynamic dispatch mechanism. This allows the capabilities of the processor to be queried at runtime and the most efficient implementation to be selected.
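The dispatch idea described above can be sketched without the library itself. The following is a minimal illustration in plain C++, not libsimdpp's own dispatcher: `Level`, `detect_level`, `sum_baseline`, `sum_wide`, `select_sum` and `sum` are all hypothetical names, and the two implementations merely stand in for one source file compiled once per instruction set.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical capability levels; a real dispatcher would cover more sets.
enum class Level { Baseline, Wide };

// Hypothetical capability query. On GCC and Clang for x86 it uses the
// __builtin_cpu_supports builtin; everywhere else it reports the baseline.
inline Level detect_level() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return Level::Wide;
#endif
    return Level::Baseline;
}

// Stand-in for the build of the source file that targets the baseline.
inline float sum_baseline(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Stand-in for the build that targets a wider instruction set: four
// partial sums mimic what a vectorized loop would keep in one register.
inline float sum_wide(const float* p, std::size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k) acc[k] += p[i + k];
    float s = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < n; ++i) s += p[i];  // leftover elements
    return s;
}

using SumFn = float (*)(const float*, std::size_t);

// Maps a capability level to the matching implementation.
inline SumFn select_sum(Level level) {
    return level == Level::Wide ? &sum_wide : &sum_baseline;
}

// Public entry point: the processor is queried once, on the first call.
inline float sum(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    static const SumFn fn = select_sum(detect_level());
    return fn(p, n);
}
```

Callers only ever see `sum`; which implementation runs is decided on the machine the program executes on, not at build time.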

The library sits somewhere in the middle between programming directly in SIMD intrinsics and even higher-level SIMD libraries. As much control as possible is given to the developer, so that it's possible to exactly predict what code the compiler will generate.

No API-breaking changes are planned for the foreseeable future.

Documentation

Online documentation is provided here.

Compiler and instruction set support

  • This describes the current branch only, which may be unstable or otherwise unfit for use. For available releases, please see the libsimdpp wiki.

The library supports the following architectures and instruction sets:

  • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX512F, AVX512BW, AVX512DQ, AVX512VL, XOP, popcnt
  • ARM 32-bit: NEON, NEONv2
  • ARM 64-bit: NEON, NEONv2
  • PowerPC 32-bit big-endian: Altivec, VSX v2.06, VSX v2.07
  • PowerPC 64-bit little-endian: Altivec, VSX v2.06, VSX v2.07
  • MIPS 32-bit little-endian: MSA
  • MIPS 64-bit little-endian: MSA

The primary development of the library happens in C++11. A C++98-compatible version of the library is provided on the cxx98 branch.

Supported compilers:

  • C++11 version:

    • GCC: 4.8-7.x
    • Clang: 3.3-4.0
    • Xcode 7.0-9.x
    • MSVC: 2013, 2015, 2017
    • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017
  • C++98 version

    • GCC: 4.4-7.x
    • Clang: 3.3-4.0
    • Xcode 7.0-9.x
    • MSVC: 2013, 2015, 2017
    • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017

Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

Various compiler versions are not supported on various instruction sets due to compiler bugs or incompletely implemented instruction sets. See simdpp/detail/workarounds.h for more details.

  • MSVC and ICC are only supported on x86 and x86-64.

  • AVX is not supported on Clang 3.6 or GCC 4.4.

  • AVX2 is not supported on Clang 3.6.

  • AVX512F is not supported on:

    • GCC 5.x and older
    • Clang 5.0 and older
    • MSVC
  • NEON armv7 is not supported on Clang 3.3 and older.

  • NEON aarch64 is not supported on GCC 4.8 and older.

  • Altivec on little-endian PPC is not supported on GCC 5.x and older.

  • VSX on big-endian PPC is not supported on GCC 5.x and older.

  • MSA is not supported on GCC 6.x and older.

Contributing

Contributions are welcome. Please see CONTRIBUTING.md for more information.

License

The library may be freely used in commercial and non-commercial software. The code is distributed under the Boost Software License, Version 1.0. Some internal development scripts are licensed under different licenses -- see comments in these files. The documentation is licensed under CC-BY-SA.

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Issues
  • Dispatcher macros don't work with template functions

    I'm trying to figure out how to generate the macros when using the dispatcher. First, one needs to add -DEMIT_DISPATCHER and the list of supported platforms, but that's not really great for cross-platform automated builds. Then, my functions are templated factory builders, and this doesn't work...

    I was wondering if using the get compilable macro with CMake could generate a comma-separated list (easy to do with CMake) and then use it with Boost preprocessor macros to generate the proper code in a template-acceptable way. Any thoughts on this? Without these two features, the SIMD filters I'm trying to build for my library are just unusable :/

    opened by mbrucher 24
  • patch to build ARM, ARM+NEON on cxx98 branch

    cxx98's working great for me on x86! I had to make these changes to get ARM / ARM+NEON to build today (using GCC 4.7 from Android NDK r8e, though that's only because I've been too lazy to upgrade to r9 / GCC 4.8). Most changes are uninteresting. On the interesting side, I wasn't sure what to do about vceqq_s64, which is ARM64-only.

    opened by mtklein 9
  • Cxx03

    I've been eyeing libsimdpp as a potential way to unify my project's jumbled mix of various non-vectorized, SSE2, SSSE3, and NEON implementations. This project can't require C++11 yet, and probably won't be able to for several years.

    Do you feel very strongly that libsimdpp must stay as a C++11 codebase?

    This is an incomplete patch with some of the changes that might be necessary if you were to drop back to targeting an older version of C++. I still need to remove lambdas, fix up more returns like in c586a73, find an alternative for std::array, and I'm sure more. I'm happy to do all this work and test it however you recommend. But I figured at this point there are enough changes here that you'd probably have an opinion on whether you'd want to upstream C++<11 support and live with the code like this, or if you want to stick to C++11.

    Mike

    opened by mtklein 8
  • SSE2 implementation of i_to_float32(const uint32<4>& a) could be more efficient

    the conditional div/2, conv, mul*2 looks less efficient and is less accurate than the bias, conv, bias method:

    float conv(uint32_t a)
    {
        return (float)((int32_t)(a - 0x80000000)) + (float)0x80000000;
    }
    

    A similar method can be used to reverse this for converting float->uint32_t, with an additional trunc() step to properly truncate towards 0 and not 0x80000000.

    opened by peabody-korg 7
  • VS 2017 strange behavior for AVX512

    Can be seen here: https://ci.appveyor.com/project/mbrucher/audiotk/build/2.2.0.565/job/2rl0xf4j6aa3fbam For VS2015, everything is fine, the AVX512 test fails, and we don't compile the AVX version of the code. For VS2017, the test succeeds, but can't be compiled after. Known issue or new one?

    opened by mbrucher 7
  • Possible performance problem?

    I am trying to multiply two float32<8> numbers with SIMDPP_ARCH_X86_AVX setting.

    The code is something like:

    float32 bigi = load(i);
    float32 bigm = load(modifiers);
    bigi = mul(bigi, bigm);

    It works OK, but when I step through the code I see that after the multiplication execution reaches the following piece of code:

    template<class R, class T> SIMDPP_INL
    R cast_memcpy(const T& t)
    {
        static_assert(sizeof(R) == sizeof(T), "Size mismatch");
        R r;
        ::memcpy(&r, &t, sizeof(R));
        return r;
    }

    I don't understand why we need to do memcpy after each operation. It's a big performance gap.
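For context on the pattern quoted in this report: copying bytes with memcpy is the usual standard-conforming way to reinterpret one type as another of the same size. Optimizing compilers typically replace a fixed-size memcpy like this with a register move, so a call that is visible when stepping through an unoptimized build is often absent from an optimized one. The sketch below is a standalone copy of the pattern under that assumption, not the library's code; `float_bits` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "the example assumes IEEE 754 single-precision floats");

// Reinterprets the bytes of t as an object of type R.
// Both types must have the same size.
template<class R, class T>
inline R cast_memcpy(const T& t) {
    static_assert(sizeof(R) == sizeof(T), "Size mismatch");
    R r;
    std::memcpy(&r, &t, sizeof(R));  // byte-wise copy, no aliasing violation
    return r;
}

// Hypothetical helper: the raw bit pattern of a float.
inline std::uint32_t float_bits(float f) {
    return cast_memcpy<std::uint32_t>(f);
}
```

Comparing the generated assembly of a debug and a release build is the quickest way to check whether the copy survives optimization on a given compiler.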

    opened by davagin 5
  • how to build the dynamic_dispatch example ?

    Hello, Probably a silly question, but since there is absolutely no install or basic usage documentation...

    So I got the git repo, then:

    cd examples/dynamic_dispatch
    make test
    

    and I get:

    In file included from test.cc:4:0:
    ../../simdpp/dispatch/get_arch_gcc_builtin_cpu_supports.h: In function ‘simdpp::Arch simdpp::get_arch_gcc_builtin_cpu_supports()’:
    ../../simdpp/dispatch/get_arch_gcc_builtin_cpu_supports.h:24:41: error: Parameter to builtin not valid: avx512f
         if (__builtin_cpu_supports("avx512f")) {
    
    
    
    gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
    
    
    opened by kforner 5
  • math ops fail to do int/float conversion or generate an error on SSE2

    float32x4 foo(float32x4 a, int32x4 b)
    {
        return a+b;
    }
    

    results in a single addps instruction for SSE2, as if b were a float32.

    float32x4 foo(float32x4 a, int32x4 b)
    {
        return add(a,b);
    }
    

    results in a compilation error, which is resolved by explicitly converting b with to_float32(). Presumably use of + should fail in the same way as use of add() (or, even better, convert automatically between float and int, just as scalar operations would).
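The difference this report describes, between adding the value of b and adding its bits as if they already were floats, can be shown on scalars. This is only an illustration in plain C++ with hypothetical helper names; it does not use the library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 &&
              sizeof(float) == sizeof(std::int32_t),
              "the example assumes 32-bit IEEE 754 floats");

// Reinterprets the 32 bits of an integer as a float. This is what a
// floating-point add does to integer lanes it is handed unconverted.
inline float bits_as_float(std::int32_t i) {
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}

// Value conversion, the behaviour scalar C++ arithmetic would give.
inline float value_as_float(std::int32_t i) {
    return static_cast<float>(i);
}

inline float add_reinterpreted(float a, std::int32_t b) {
    return a + bits_as_float(b);
}

inline float add_converted(float a, std::int32_t b) {
    return a + value_as_float(b);
}
```

With a = 1.5 and b = 2, the converted sum is 3.5, while the reinterpreted sum stays at 1.5 because the bit pattern of the integer 2 is a tiny denormal float.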

    opened by peabody-korg 4
  • Feature Request: foreach on vectors

    Hi!

    I'm not very experienced with C++ and especially not with this library, but I've found that some of my core uses of this library require patterns like:

    uint64_t count = _mm_popcnt_u64(extract<0>(x));
    #if UINT64_VECTOR_SIZE >= 2
    count += _mm_popcnt_u64(extract<1>(x));
    #if UINT64_VECTOR_SIZE >= 4
    count += _mm_popcnt_u64(extract<2>(x));
    count += _mm_popcnt_u64(extract<3>(x));
    #if UINT64_VECTOR_SIZE >= 8
    count += _mm_popcnt_u64(extract<4>(x));
    count += _mm_popcnt_u64(extract<5>(x));
    count += _mm_popcnt_u64(extract<6>(x));
    count += _mm_popcnt_u64(extract<7>(x));
    #if UINT64_VECTOR_SIZE > 8
    #error "we do not support vectors longer than 8, please file an issue"
    #endif
    #endif
    #endif
    

    It would be awesome if there was some syntax like:

    uint64_t count = 0
    x.foreach<64>( [=](e) {
      count += _mm_popcnt_u64(e);
    })
    

    I'm happy to hack this up, but I'd need some guidance/scaffolding about how to approach the problem in the framework of libsimdpp.
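The v2.1 release notes list for_each and reduce_popcnt among the new functions. Independently of the library, the shape of the requested helper can be sketched in standard C++. Everything here is hypothetical: `Lanes` stands in for a vector of 64-bit elements, and `popcount64` replaces the `_mm_popcnt_u64` intrinsic so the sketch stays portable.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a SIMD vector: N lanes of 64-bit integers.
template<std::size_t N>
using Lanes = std::array<std::uint64_t, N>;

// Calls f(lane) for every lane, in index order.
template<std::size_t N, class F>
inline void for_each_lane(const Lanes<N>& v, F f) {
    for (std::size_t i = 0; i < N; ++i) f(v[i]);
}

// Portable population count of one 64-bit value.
inline std::uint64_t popcount64(std::uint64_t x) {
    return std::bitset<64>(x).count();
}

// Sums the population counts of all lanes, whatever the lane count is.
template<std::size_t N>
inline std::uint64_t total_popcount(const Lanes<N>& v) {
    std::uint64_t count = 0;
    // The accumulator is captured by reference; a by-value capture
    // ([=]) would only update a copy inside the lambda.
    for_each_lane(v, [&count](std::uint64_t e) { count += popcount64(e); });
    return count;
}
```

Because the lane count is a template parameter, the nested `#if` ladder from the report collapses into one loop that the compiler can unroll.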

    opened by danking 4
  • CMake tip - if dispatcher generated cpp files are not rebuilding after changing a header

    (OSX 10.10, Apple Clang 7.0.2, CMake 3.5.0)

    This is probably a bug in CMake (still looking into it) but I found it while working with libsimdpp, so thought other users might find this helpful.

    File setup

    • src/main.cpp //main program entry point
    • src/code.cpp //the code that dispatcher will copy
    • src/common.h //some common header, included in code.cpp and main.cpp
    • include/simdpp/... //simdpp include dir

    CMakeLists.txt:

    [...]
    simdpp_get_runnable_archs(RUNNABLE_ARCHS)
    simdpp_multiarch(GEN_ARCH_FILES src/code.cpp ${RUNNABLE_ARCHS})
    add_executable(simd-test src/main.cpp ${GEN_ARCH_FILES})
    target_include_directories(simd-test PRIVATE ${CMAKE_SOURCE_DIR}/include/)
    

    Background

    The simdpp_multiarch() CMake function (from SimdppMultiarch.cmake) will use configure_file() to copy ${CMAKE_SOURCE_DIR}/src/code.cpp into the build dir (e.g. ${CMAKE_BINARY_DIR}/src/code_simdpp_-x86_avx.cpp etc). It will also manually add the include dir back to the original location:

    SimdppMultiarch.cmake line 434:
    set(CXX_FLAGS "-I\"${CMAKE_CURRENT_SOURCE_DIR}/${SRC_PATH}\" ${CXX_FLAGS}")
    

    This ensures that local includes, such as #include "common.h" in code.cpp will still work at compile time.

    Problem

    The problem is that when CMake generates the file dependencies, it seems to ignore the file-specific include search path set on the generated files. This means that ${CMAKE_BINARY_DIR}/CMakeFiles/simd-test.dir/depend.make will not include src/common.h and when you change common.h without changing code.cpp, none of the generated files are recompiled! This results in linking with stale object files (which include the old version of common.h) and programs that could crash or be incorrect in subtle ways.

    Workaround

    Add the local directory of code.cpp to the target include dir with a command like this:

    target_include_directories(simd-test PRIVATE ${PROJECT_SOURCE_DIR}/src)
    

    (You may need to update the path for your project, or add several of these lines if you pass files from multiple directories to simdpp_multiarch().)

    There may be a way to update simdpp_multiarch() to handle this automatically but a simple solution eludes me at the moment.

    opened by JoshBlake 4
  • workaround: _mm_set_epi64x identifier not found

    I am building a Python extension that uses libsimdpp. As I want to provide compatibility with Python 2.7 (yep, it's still pretty popular) I need to compile against VS 2008 (using the cxx98 branch). There, I am getting the following error when compiling with SSE2 options enabled:

    error C3861: '_mm_set_epi64x': identifier not found
    

    According to https://msdn.microsoft.com/en-us/library/dk2sdw0h(v=vs.90).aspx the correct header file is intrin.h and just adding it makes it indeed work.

    Newer VS versions don't have that problem. Have you heard about this before? Why does libsimdpp not include intrin.h? Does my workaround look ok?

    Full logs: https://ci.spacy.io/builders/sense2vec-win64-py27-64-install/builds/47/steps/shell_2/logs/stdio Workaround: https://github.com/spacy-io/sense2vec/commit/1d94617e324dc635dcf95bb2d793d8358d64a405

    opened by henningpeters 4
  • How can we combine with Intel® SSE2 (Streaming SIMD Extensions 2)?

    Hi,

    I have existing code written with Intel SSE2 intrinsics. How can I use the libsimdpp API if the variables I have are SSE2 values, for example ones created with _mm_set_epi32? Is there any way to write something such as: int32<4> = add(A, B); where A and B are SSE values created with _mm_set_epi32? Many thanks.

    opened by ardianumam 4
  • Fused multiply-add/sub not emulated

    The fused multiply add/sub operations are unavailable for some instruction sets and result in a linker error that is hard to track down. There is also no preprocessor definition that would make it easy to detect this absence.

    Why are those instructions not emulated if not available? I don't see any reason why this should be problematic.
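One point that matters for any emulation, offered here as background rather than as the maintainers' reasoning: a fallback built from a separate multiply and add rounds twice, so it is not bit-identical to a fused operation, which rounds once. A minimal scalar sketch with hypothetical function names:

```cpp
#include <cassert>
#include <cmath>

// Unfused fallback: the product is rounded to double before the addition.
// The volatile store keeps the compiler from contracting the two
// operations into a hardware fused multiply-add.
inline double mul_add_unfused(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused: a * b + c is computed exactly and rounded only once.
inline double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For a = 1 + 2^-27 and c = -(1 + 2^-26), the exact value of a * a + c is 2^-54. The fused version returns it, while the unfused version returns 0 because the 2^-54 term is lost when the product is rounded.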

    opened by Jazzdoodle 0
  • sign(float64<N>) generates incorrect code on NEON64 using gcc 9.3.0 with -ffast-math

    auto foo24(float64x2 a) { return sign(a); }

    generates:

    0000000000000000 <Inspiration::foo24(Inspiration::FloatVectorBase<simdpp::arch_neonfltsp::float64<4u, void> >)>:
       0:	6f00e402 	movi	v2.2d, #0x0
       4:	d10383ff 	sub	sp, sp, #0xe0
       8:	910383ff 	add	sp, sp, #0xe0
       c:	4e221c00 	and	v0.16b, v0.16b, v2.16b
      10:	4e221c21 	and	v1.16b, v1.16b, v2.16b
      14:	d65f03c0 	ret
    

    The problem seems to trace back to the way bit_and() is being processed in the implementation of _i_sign(float64):

        return bit_and(a, 0x8000000000000000);
    

    The 0x8000000000000000 constant gets bit-converted to the floating-point constant -0.0. Because -ffast-math treats -0.0 and 0.0 as equivalent, this winds up masking with 0 instead.
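The bit-level fact behind this report can be checked on a scalar: in IEEE 754 double precision, the pattern with only the top bit set is -0.0. The sketch below applies the mask in the integer domain, so the mask never exists as a floating-point constant. `sign_bit_only` is a hypothetical name, and whether an integer-domain mask avoids the miscompilation on the compiler mentioned above is an untested assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 &&
              sizeof(double) == sizeof(std::uint64_t),
              "the example assumes 64-bit IEEE 754 doubles");

// Returns a double that keeps only the sign bit of x: -0.0 for inputs
// with the sign bit set and +0.0 otherwise.
inline double sign_bit_only(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= std::uint64_t(1) << 63;   // integer mask 0x8000000000000000
    double r;
    std::memcpy(&r, &bits, sizeof r);
    return r;
}
```

Note that -0.0 and 0.0 compare equal, so the result has to be inspected with std::signbit rather than with ==.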

    opened by peabody-korg 0
Releases(v2.1)
  • v2.1(Dec 14, 2017)

    The library supports the following architectures and instruction sets:

    • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX512F, AVX512BW, AVX512DQ, AVX512VL, XOP
    • ARM 32-bit: NEON, NEONv2
    • ARM 64-bit: NEON, NEONv2
    • PowerPC 32-bit big-endian: Altivec, VSX v2.06, VSX v2.07
    • PowerPC 64-bit little-endian: Altivec, VSX v2.06, VSX v2.07
    • MIPS 32-bit little-endian: MSA
    • MIPS 64-bit little-endian: MSA

    Supported compilers:

    • C++11 version:

      • GCC: 4.8-7.x
      • Clang: 3.3-4.0
      • Xcode 7.0-9.x
      • MSVC: 2013, 2015, 2017
      • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017
    • C++98 version

      • GCC: 4.4-7.x
      • Clang: 3.3-4.0
      • Xcode 7.0-9.x
      • MSVC: 2013, 2015, 2017
      • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017

    Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

    Changes since v2.0:

    • Various bug fixes
    • Documentation has been significantly improved. The public API is now almost fully documented.
    • Added support for MIPS MSA instruction set.
    • Added support for PowerPC VSX v2.06 and v2.07 instruction sets.
    • Added support for x86 AVX512BW, AVX512DQ and AVX512VL instruction sets.
    • Added support for 64-bit little-endian PowerPC.
    • Added support for arbitrary width vectors in extract() and insert().
    • Added support for arbitrary source vectors to to_int8(), to_uint8(), to_int16(), to_uint16(), to_int32(), to_uint32(), to_int64(), to_uint64(), to_float32(), to_float64().
    • Added support for per-element integer shifts to shift_r() and shift_l(). Fallback paths are provided for SSE2-AVX instruction sets that lack hardware per-element integer shift support.
    • Made shuffle_bytes16(), shuffle_zbytes16(), permute_bytes16() and permute_zbytes16() more generic.
    • New functions: popcnt, reduce_popcnt, for_each, to_mask().
    • Xcode is now supported.
    • The library has been refactored in such a way that older compilers are able to optimize vector emulation code paths much better than before.
    • Deprecation: implicit conversion operators to native vector types have been deprecated and a replacement method has been provided instead. The implicit conversion operators may lead to wrong code being accepted without a compile error on Clang.
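To illustrate what a fallback path for per-element shifts has to compute, here is a scalar sketch over plain arrays. It is not the library's implementation: `shift_l_per_element` is a hypothetical name, and treating counts of 32 or more as producing 0 is an assumption made for this sketch (shifting a 32-bit value by 32 or more is undefined in C++, so some rule is needed).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Shifts each 32-bit lane of v left by the count held in the same lane
// of counts. Counts of 32 or more yield 0 (assumption for this sketch).
template<std::size_t N>
inline std::array<std::uint32_t, N> shift_l_per_element(
        const std::array<std::uint32_t, N>& v,
        const std::array<std::uint32_t, N>& counts) {
    std::array<std::uint32_t, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        r[i] = counts[i] < 32
             ? static_cast<std::uint32_t>(v[i] << counts[i])
             : 0u;
    }
    return r;
}
```

A vector shift with one shared count is the special case in which every lane of `counts` holds the same value.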
  • cxx98-v2.1(Dec 14, 2017)

    The release notes are identical to those of the v2.1 release above.
  • v2.0(Aug 20, 2017)

    The library supports the following architectures and instruction sets:

    • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX-512F, XOP
    • ARM, ARM64: NEON
    • PowerPC: Altivec

    Supported compilers:

    • C++11 version:

      • GCC: 4.8-6.x
      • Clang: 3.3-4.0
      • MSVC: 2013, 2015, 2017
      • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017
    • C++98 version

      • GCC: 4.4-6.x
      • Clang: 3.3-4.0
      • MSVC: 2013, 2015, 2017
      • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017

    Clang 3.3 is not supported on ARM. MSVC and ICC are only supported on x86 and x86-64.

    Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

    Changes since 2.0-rc2:

    • Intel compiler is now supported on Windows. Newer versions of other compilers are now supported.
    • Various bug fixes.

    Changes since 1.0:

    • Expression template-based backend. It is used only for functions that may benefit from micro-optimizations (e.g. when several instructions can be merged into one).
    • Support for vectors much longer than the native vector type. The only limitation is that the length must be a power of 2. The widest available instructions are used for the particular vector type.
    • Visual Studio and Intel Compiler support
    • AVX-512F, Altivec and NEONv2 support
    • Vector initialization is simplified, for example: int32<8> v = make_uint(2); or int* p = ...; v = load(p);.
    • The curiously recurring template pattern (CRTP) is used to categorize vector types. Function templates no longer need to be written for each vector type or combination of types; instead, an appropriate vector category may be used.
    • Each vector type can be explicitly constructed from any other vector with the same size.
    • Most functions accept a much wider range of vector type combinations. For example, bitwise functions accept any two vectors of the same size.
    • If different vector types are used as arguments to such functions, the return type is computed as if one or both of the arguments were "promoted" according to certain rules. For example, int32 + int32 --> int32, whereas uint32 + int32 --> uint32, and uint32 + float32 --> float32. See simdpp/types/tag.h for more information.
    • API break: int128 and int256 types have been removed. On some architectures such as AVX512 it's more efficient to have different physical representations for vectors with different element widths. E.g. 8-bit integer elements would use 256-bit vectors and 32-bit integer elements would use 512-bit vectors.
    • API break: basic_int## types have been removed. The CRTP-based type categorization and promotion rules make a second, inheritance-based vector categorization system impossible. In the majority of cases basic_int## can be straightforwardly replaced with uint##.
    • API break: {vector type}::make_const, {vector type}::zero and {vector type}::ones have been removed to simplify the library. Use the new make_int, make_uint, make_float, make_zero and make_ones free functions that produce a construct expression.
    • API break: the broadcast family of functions has been renamed to splat.
    • API break: the permute family of functions has been renamed to permute2 and permute4, depending on the number of template arguments taken.
    • API break: value conversion functions such as to_float32x4 have been renamed and now return a vector with the same number of elements as the source vector.
    • API break: SIMDPP_USER_ARCH_INFO now accepts any expression, not only a function
    • API break: unsigned conversions have been renamed to to_uintXX to reduce confusion.
    • API break: saturated add and sub are now called add_sat and sub_sat
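The three promotion examples given in the list above (int32 + int32, uint32 + int32, uint32 + float32) can be mimicked with a small type trait. This is a sketch restricted to 32-bit element types; `promote` is a hypothetical name, and the library's actual rules live in simdpp/types/tag.h and may cover more cases.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical element-type promotion for 32-bit elements:
// any float operand wins, then any unsigned operand, otherwise signed.
template<class A, class B>
struct promote {
    using type = typename std::conditional<
        std::is_floating_point<A>::value || std::is_floating_point<B>::value,
        float,
        typename std::conditional<
            std::is_unsigned<A>::value || std::is_unsigned<B>::value,
            std::uint32_t,
            std::int32_t>::type>::type;
};

// The three examples from the release notes, checked at compile time.
static_assert(std::is_same<promote<std::int32_t, std::int32_t>::type,
                           std::int32_t>::value, "int32 + int32");
static_assert(std::is_same<promote<std::uint32_t, std::int32_t>::type,
                           std::uint32_t>::value, "uint32 + int32");
static_assert(std::is_same<promote<std::uint32_t, float>::type,
                           float>::value, "uint32 + float32");
```

The ordering float, then unsigned, then signed matches the listed examples; it happens to mirror the usual arithmetic conversions that C++ applies to the corresponding scalar types.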

    No further significant API changes are planned.

  • cxx98-v2.0(Aug 20, 2017)

    The release notes are identical to those of the v2.0 release above.
  • v2.0-rc2(Apr 3, 2016)

    The library supports the following architectures and instruction sets:

    • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX-512F, XOP
    • ARM, ARM64: NEON
    • PowerPC: Altivec

    Supported compilers:

    • C++11 version:
      • GCC: 4.8-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015
    • C++98 version
      • GCC: 4.4-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015

    Clang 3.3 is not supported on ARM. MSVC and ICC are only supported on x86 and x86-64.

    Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

    Changes since 1.0:

    • Expression template-based backend. It is used only for functions that may benefit from micro-optimizations (e.g. when several instructions can be merged into one).
    • Support for vectors much longer than the native vector type. The only limitation is that the length must be a power of 2. The widest available instructions are used for the particular vector type.
    • Visual Studio and Intel Compiler support
    • AVX-512F, Altivec and NEONv2 support
    • Vector initialization is simplified, for example: int32<8> v = make_uint(2); or int* p = ...; v = load(p);.
    • Curriously recurring template pattern is used to categorize vector types. Function templates no longer need to be written for each vector type or their combination, instead, an appropriate vector category may be used.
    • Each vector type can be explicitly constructed from any other vector with the same size.
    • Most functions accept much wider range of vector type combinations. For example, bitwise functions accept any two vectors of the same size.
    • If different vector types are used as arguments to such functions, the return type is computed as if one or both of the arguments were "promoted" according to certain rules. For example, int32 + int32 --> int32, whereas uint32 + int32 --> uint32, and uint32 + float32 --> float32. See simdpp/types/tag.h for more information.
    • API break: int128 and int256 types have been removed. On some architectures such as AVX512 it's more efficient to have different physical representations for vectors with different element widths. E.g. 8-bit integer elements would use 256-bit vectors and 32-bit integer elements would use 512-bit vectors.
    • API break: basic_int## types have been removed. The CRTP-based type categorization and promotion rules make a second, inheritance-based vector categorization system impossible. In the majority of cases basic_int## can be straightforwardly replaced with uint##.
    • API break: {vector type}::make_const, {vector type}::zero and {vector type}::ones have been removed to simplify the library. Use the new make_int, make_uint, make_float, make_zero and make_ones free functions that produce a construct expression.
    • API break: the broadcast family of functions has been renamed to splat.
    • API break: the permute family of functions has been renamed to permute2 and permute4 depending on the number of template arguments taken.
    • API break: value conversion functions such as to_float32x4 have been renamed and now return a vector with the same number of elements as the source vector.
    • API break: SIMDPP_USER_ARCH_INFO now accepts any expression, not only a function
    • API break: unsigned conversions have been renamed to to_uintXX to reduce confusion.
    • API break: saturated add and sub are now called add_sat and sub_sat
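The CRTP-based categorization mentioned in the list above can be pictured with a small standalone sketch. It does not use libsimdpp: the names vec_category, demo_int32x4, demo_float32x8 and sum_lanes are hypothetical and exist only to show how one function template can accept every type in a category.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical category base (not libsimdpp's actual class): every vector
// type derives from vec_category<Self>, so templates can match the category.
template<class V>
struct vec_category {
    const V& self() const { return static_cast<const V&>(*this); }
};

// Two illustrative vector types modelled as plain lane arrays.
struct demo_int32x4 : vec_category<demo_int32x4> {
    std::array<std::int32_t, 4> lanes;
};
struct demo_float32x8 : vec_category<demo_float32x8> {
    std::array<float, 8> lanes;
};

// A single function template serves every type in the category;
// V is deduced from the CRTP base of the argument.
template<class V>
double sum_lanes(const vec_category<V>& v) {
    double total = 0.0;
    for (auto lane : v.self().lanes) {
        total += static_cast<double>(lane);
    }
    return total;
}
```

Calling sum_lanes on a demo_int32x4 or a demo_float32x8 resolves to the same template, which is the property the release note describes: no overload per vector type is needed.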

    No further significant API changes are planned.

  • cxx98-v2.0-rc2(Apr 3, 2016)

    The library supports the following architectures and instruction sets:

    • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX-512F, XOP
    • ARM, ARM64: NEON
    • PowerPC: Altivec

    Supported compilers:

    • C++11 version:
      • GCC: 4.8-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015
    • C++98 version
      • GCC: 4.4-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015

    Clang 3.3 is not supported on ARM. MSVC and ICC are only supported on x86 and x86-64.

    Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

    Changes since 1.0:

    • Expression template-based backend. It is used only for functions that may benefit from micro-optimizations (e.g. when several instructions can be merged into one).
    • Support for vectors much longer than the native vector type. The only limitation is that the length must be a power of 2. The widest available instructions are used for the particular vector type.
    • Visual Studio and Intel Compiler support
    • AVX-512F, Altivec and NEONv2 support
    • Vector initialization is simplified, for example: int32<8> v = make_uint(2); or int* p = ...; v = load(p);.
    • The curiously recurring template pattern (CRTP) is used to categorize vector types. Function templates no longer need to be written for each vector type or combination of types; instead, an appropriate vector category may be used.
    • Each vector type can be explicitly constructed from any other vector with the same size.
    • Most functions accept a much wider range of vector type combinations. For example, bitwise functions accept any two vectors of the same size.
    • If different vector types are used as arguments to such functions, the return type is computed as if one or both of the arguments were "promoted" according to certain rules. For example, int32 + int32 --> int32, whereas uint32 + int32 --> uint32, and uint32 + float32 --> float32. See simdpp/types/tag.h for more information.
    • API break: int128 and int256 types have been removed. On some architectures such as AVX512 it's more efficient to have different physical representations for vectors with different element widths. E.g. 8-bit integer elements would use 256-bit vectors and 32-bit integer elements would use 512-bit vectors.
    • API break: basic_int## types have been removed. The CRTP-based type categorization and promotion rules make a second, inheritance-based vector categorization system impossible. In the majority of cases basic_int## can be straightforwardly replaced with uint##.
    • API break: {vector type}::make_const, {vector type}::zero and {vector type}::ones have been removed to simplify the library. Use the new make_int, make_uint, make_float, make_zero and make_ones free functions that produce a construct expression.
    • API break: the broadcast family of functions has been renamed to splat.
    • API break: the permute family of functions has been renamed to permute2 and permute4 depending on the number of template arguments taken.
    • API break: value conversion functions such as to_float32x4 have been renamed and now return a vector with the same number of elements as the source vector.
    • API break: SIMDPP_USER_ARCH_INFO now accepts any expression, not only a function
    • API break: unsigned conversions have been renamed to to_uintXX to reduce confusion.
    • API break: saturated add and sub are now called add_sat and sub_sat
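The promotion examples in the list above (int32 + int32 --> int32, uint32 + int32 --> uint32, uint32 + float32 --> float32) are consistent with a simple rank ordering. The sketch below models only those three cases. The tag names and the promote trait are hypothetical; the library's real rules are the ones in simdpp/types/tag.h.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical element tags, ordered by rank:
// signed int < unsigned int < float.
struct int32_tag   { static constexpr int rank = 0; };
struct uint32_tag  { static constexpr int rank = 1; };
struct float32_tag { static constexpr int rank = 2; };

// The result type is whichever operand tag has the higher rank.
template<class A, class B>
struct promote {
    using type = typename std::conditional<(A::rank >= B::rank), A, B>::type;
};

// The three cases quoted in the release notes.
static_assert(std::is_same<promote<int32_tag, int32_tag>::type, int32_tag>::value,
              "int32 + int32 --> int32");
static_assert(std::is_same<promote<uint32_tag, int32_tag>::type, uint32_tag>::value,
              "uint32 + int32 --> uint32");
static_assert(std::is_same<promote<uint32_tag, float32_tag>::type, float32_tag>::value,
              "uint32 + float32 --> float32");
```

Because the checks are static_assert declarations, a mismatch with the quoted rules would fail at compile time rather than at run time.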

    No further significant API changes are planned.

  • v2.0-rc1(Mar 16, 2016)

    The library supports the following architectures and instruction sets:

    • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX-512F, XOP
    • ARM, ARM64: NEON
    • PowerPC: Altivec

    Supported compilers:

    • C++11 version:
      • GCC: 4.8-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015
    • C++98 version
      • GCC: 4.4-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015

    Clang 3.3 is not supported on ARM. MSVC and ICC are only supported on x86 and x86-64.

    Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

    Changes since 1.0:

    • Expression template-based backend. It is used only for functions that may benefit from micro-optimizations (e.g. when several instructions can be merged into one).
    • Support for vectors much longer than the native vector type. The only limitation is that the length must be a power of 2. The widest available instructions are used for the particular vector type.
    • Visual Studio and Intel Compiler support
    • AVX-512F, Altivec and NEONv2 support
    • Vector initialization is simplified, for example: int32<8> v = make_uint(2); or int* p = ...; v = load(p);.
    • The curiously recurring template pattern (CRTP) is used to categorize vector types. Function templates no longer need to be written for each vector type or combination of types; instead, an appropriate vector category may be used.
    • Each vector type can be explicitly constructed from any other vector with the same size.
    • Most functions accept a much wider range of vector type combinations. For example, bitwise functions accept any two vectors of the same size.
    • If different vector types are used as arguments to such functions, the return type is computed as if one or both of the arguments were "promoted" according to certain rules. For example, int32 + int32 --> int32, whereas uint32 + int32 --> uint32, and uint32 + float32 --> float32. See simdpp/types/tag.h for more information.
    • API break: int128 and int256 types have been removed. On some architectures such as AVX512 it's more efficient to have different physical representations for vectors with different element widths. E.g. 8-bit integer elements would use 256-bit vectors and 32-bit integer elements would use 512-bit vectors.
    • API break: basic_int## types have been removed. The CRTP-based type categorization and promotion rules make a second, inheritance-based vector categorization system impossible. In the majority of cases basic_int## can be straightforwardly replaced with uint##.
    • API break: {vector type}::make_const, {vector type}::zero and {vector type}::ones have been removed to simplify the library. Use the new make_int, make_uint, make_float, make_zero and make_ones free functions that produce a construct expression.
    • API break: the broadcast family of functions has been renamed to splat.
    • API break: the permute family of functions has been renamed to permute2 and permute4 depending on the number of template arguments taken.
    • API break: value conversion functions such as to_float32x4 have been renamed and now return a vector with the same number of elements as the source vector.
    • API break: SIMDPP_USER_ARCH_INFO now accepts any expression, not only a function
    • API break: unsigned conversions have been renamed to to_uintXX to reduce confusion.
    • API break: saturated add and sub are now called add_sat and sub_sat
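For readers unfamiliar with the saturated operations renamed in the last item above, the following scalar model shows what saturation means for a single lane. It is an illustration only: add_sat_lane and sub_sat_lane are hypothetical helpers, not the library's add_sat and sub_sat, which operate on whole vectors.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Scalar model of one lane: compute in a wider type, then clamp to T's range.
// Intended for 8- and 16-bit lane types, whose sums always fit in long.
template<class T>
T add_sat_lane(T a, T b) {
    long wide = static_cast<long>(a) + static_cast<long>(b);
    const long lo = std::numeric_limits<T>::min();
    const long hi = std::numeric_limits<T>::max();
    if (wide < lo) wide = lo;
    if (wide > hi) wide = hi;
    return static_cast<T>(wide);
}

// Same idea for subtraction: the result sticks at the range limits
// instead of wrapping around.
template<class T>
T sub_sat_lane(T a, T b) {
    long wide = static_cast<long>(a) - static_cast<long>(b);
    const long lo = std::numeric_limits<T>::min();
    const long hi = std::numeric_limits<T>::max();
    if (wide < lo) wide = lo;
    if (wide > hi) wide = hi;
    return static_cast<T>(wide);
}
```

With these helpers, add_sat_lane<std::int8_t>(100, 100) yields 127 rather than wrapping to a negative value, and sub_sat_lane<std::uint8_t>(10, 20) yields 0.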

    No further significant API changes are planned.

  • cxx98-v2.0-rc1(Mar 16, 2016)

    The library supports the following architectures and instruction sets:

    • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX-512F, XOP
    • ARM, ARM64: NEON
    • PowerPC: Altivec

    Supported compilers:

    • C++11 version:
      • GCC: 4.8-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015
    • C++98 version
      • GCC: 4.4-5.3
      • Clang: 3.3-3.8
      • MSVC: 2013
      • ICC: 2013, 2015

    Clang 3.3 is not supported on ARM. MSVC and ICC are only supported on x86 and x86-64.

    Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

    Changes since 1.0:

    • Expression template-based backend. It is used only for functions that may benefit from micro-optimizations (e.g. when several instructions can be merged into one).
    • Support for vectors much longer than the native vector type. The only limitation is that the length must be a power of 2. The widest available instructions are used for the particular vector type.
    • Visual Studio and Intel Compiler support
    • AVX-512F, Altivec and NEONv2 support
    • Vector initialization is simplified, for example: int32<8> v = make_uint(2); or int* p = ...; v = load(p);.
    • The curiously recurring template pattern (CRTP) is used to categorize vector types. Function templates no longer need to be written for each vector type or combination of types; instead, an appropriate vector category may be used.
    • Each vector type can be explicitly constructed from any other vector with the same size.
    • Most functions accept a much wider range of vector type combinations. For example, bitwise functions accept any two vectors of the same size.
    • If different vector types are used as arguments to such functions, the return type is computed as if one or both of the arguments were "promoted" according to certain rules. For example, int32 + int32 --> int32, whereas uint32 + int32 --> uint32, and uint32 + float32 --> float32. See simdpp/types/tag.h for more information.
    • API break: int128 and int256 types have been removed. On some architectures such as AVX512 it's more efficient to have different physical representations for vectors with different element widths. E.g. 8-bit integer elements would use 256-bit vectors and 32-bit integer elements would use 512-bit vectors.
    • API break: basic_int## types have been removed. The CRTP-based type categorization and promotion rules make a second, inheritance-based vector categorization system impossible. In the majority of cases basic_int## can be straightforwardly replaced with uint##.
    • API break: {vector type}::make_const, {vector type}::zero and {vector type}::ones have been removed to simplify the library. Use the new make_int, make_uint, make_float, make_zero and make_ones free functions that produce a construct expression.
    • API break: the broadcast family of functions has been renamed to splat.
    • API break: the permute family of functions has been renamed to permute2 and permute4 depending on the number of template arguments taken.
    • API break: value conversion functions such as to_float32x4 have been renamed and now return a vector with the same number of elements as the source vector.
    • API break: SIMDPP_USER_ARCH_INFO now accepts any expression, not only a function
    • API break: unsigned conversions have been renamed to to_uintXX to reduce confusion.
    • API break: saturated add and sub are now called add_sat and sub_sat
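The item above about vectors longer than the native vector type can be made concrete with a little arithmetic. The sketch assumes that a long vector is backed by a whole number of native registers, which the release notes do not spell out. Both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// The release notes require the vector length to be a power of 2.
constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical helper: how many native registers back a vector of `elems`
// lanes of `elem_bytes` each when registers are `native_bytes` wide.
constexpr std::size_t native_register_count(std::size_t elems,
                                            std::size_t elem_bytes,
                                            std::size_t native_bytes) {
    return (elems * elem_bytes <= native_bytes)
               ? 1
               : (elems * elem_bytes) / native_bytes;
}

// 16 lanes of 32-bit integers on 128-bit (16-byte) registers need 4 registers.
static_assert(is_power_of_two(16), "length must be a power of 2");
static_assert(native_register_count(16, 4, 16) == 4, "64 bytes / 16 bytes");
```

Under the same assumption, the same 16-lane vector would fit in a single 512-bit register, which is why the widest available instructions matter for a given vector type.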

    No further significant API changes are planned.

Owner
Povilas Kanapickas