Bitset Sort, a faster std::sort replacement.

Related tags

CLI bitsetsort
Overview

Bitset Sort

Bitset Sort is a variant of quick sort, specifically BlockQuickSort. Bitset Sort uses a carefully written partition function to let the compiler generates SIMD instructions without actually writing SIMD intrinsics in the loop.

RandomInput

Bitset Sort is interface compatible with std::sort and meant to replace std::sort in libc++.

Bitset Sort is 3.4x faster (or spends 71% less time) than libc++ std::sort when sorting uint64s and 1.58x faster (or spends 37% less time) when sorting std::string.

Bitset Sort uses multiple techniques to improve runtime performance of sort. This includes sorting networks, a variant of merge sort called Bitonic Order Merge Sort that is faster for small N, and pattern recognitions.

Improvement Highlights

Bitset Sort recognizes 3 key areas in a generic quick sort algorithm to improve sorting performance.

  • Block Partition
  • Small sort
  • Pattern recognition

Block Partition

A typical quick sort algorithm has the following partition loop. For randomized input data, the outcome of comparison has 50-50 chance of true or false therefore a hardware branch predictor cannot be effective.

        while (first < last) {
            std::iter_swap(first, last);
            while (comp(*++first, pivot));
            while (!comp(*--last, pivot));
        }

(excerpt from pdqsort.h)

BlockQuickSort almost eliminates branch predictions in the quick sort by doing comparisons in a block and swaps elements in a batch.

  ...
     for i = 0, ... , B − 1 do
       offsetsL[numL] ← i
       numL += (pivot ≥ A[l + i])
     end for
  ...
  num = min(numL, numR)
  for j = 0, . . . , num − 1 do
    swap(A[l] + offsetsL[startL + j], A[r − offsetsR[startR + j])
  end for

(excerpt from BlockQuickSort Algorithm 3)

This loop will make B comparisons in a single for-loop and stores offsets only when the comparison returns true (pivot >= A[l + i]). This loop does not have a branch but a data dependency on variable numL. Swaps can be done in a batch without a branch.

Bitset Sort improves the above partition function by using a bitset (64 bit set) which allows the C++ compiler to generate SIMD instructions.

    for (int __j = 0; __j < _Bitset::__block_size;) {
        __left_bitset |= (static_cast<__storage_t>(__comp(__pivot, *__iter)) << __j);
        __j++;
        __iter++;
    }

(excerpt from bitsetsort.h)

This loop is very similar to the BlockQuickSort except that it does not have a data dependency on any variable. It deterministically stores a comparison result into a bitset (a single 64-bit integer).

For a block size of 8, 16, 32, and 64, this for-loop can be compiled into an unrolled loop of SIMD instructions.

To see the unrolled loop in action, the following is the generated code of the Bitset partition function by Clang++ 12.

clang++12 -O3 -march=znver3 -stdlib=libc++

.LBB4_16:                               #   in Loop: Header=BB4_15 Depth=1
        vmovdqu xmm2, xmmword ptr [r13]
        vmovdqu xmm3, xmmword ptr [r13 + 16]
        vmovdqu xmm4, xmmword ptr [r13 + 32]
        vpcmpgtd        xmm2, xmm2, xmm0
        vpcmpgtd        xmm3, xmm3, xmm0
        vpcmpgtd        xmm4, xmm4, xmm0
        vpmovzxdq       ymm2, xmm2              # ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
        vpmovzxdq       ymm3, xmm3              # ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
        vpand   ymm2, ymm2, ymmword ptr [rip + .LCPI4_0]
        vpand   ymm3, ymm3, ymmword ptr [rip + .LCPI4_1]
        vpmovzxdq       ymm4, xmm4              # ymm4 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
        vpand   ymm4, ymm4, ymmword ptr [rip + .LCPI4_2]
        vpor    ymm2, ymm3, ymm2
        vmovdqu xmm3, xmmword ptr [r13 + 48]
        vpcmpgtd        xmm3, xmm3, xmm0
        vpmovzxdq       ymm3, xmm3              # ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
        vpand   ymm3, ymm3, ymmword ptr [rip + .LCPI4_3]
        vpor    ymm3, ymm4, ymm3
        vmovdqu xmm4, xmmword ptr [r13 + 80]
        vpor    ymm2, ymm2, ymm3
        vmovdqu xmm3, xmmword ptr [r13 + 64]
        vpcmpgtd        xmm4, xmm4, xmm0
        vpcmpgtd        xmm3, xmm3, xmm0
        vpmovzxdq       ymm4, xmm4              # ymm4 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
        vpmovzxdq       ymm3, xmm3              # ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
        vpand   ymm4, ymm4, ymm6
        vpand   ymm3, ymm3, ymm5
        vpor    ymm3, ymm3, ymm4
        vmovdqu xmm4, xmmword ptr [r13 + 96]
        vpcmpgtd        xmm4, xmm4, xmm0
...

Small sort

Quicksort has a higher overhead for sorting smaller arrays. Many practical implementation uses a simpler sort algorithm such as insertion sort when dealing with smaller arrays.

Bitset Sort uses a sorting network (Batcher sorting network) and Bitonic Order Merge Sort for smaller arrays. For N <= 8, it uses a conditional move based sorting networks. For N > 8 and N <= 32, it uses Bitonic Order Merge Sort for faster sorting.

Bitonic Order Merge Sort

Bitset Sort uses Bitonic Order Merge Sort for N <= 32 as it is faster than insertion sorts or other small sorts. Bitonic Order Merge Sort deserves a separate document but we will only explain the fundamental ideas for the sake of this document.

A bitonic order is a repeatition of ascending sorted arrays followed by descending sorted arrays.

Example.

[1 5] [6 2] [3 10] [5 4]

A typical merge of two sorted arrays would require 3 comparisons for each element. See the following code.

// Merges a[lstart..lend] and a[rstart..rend] into b[out...]
l = lstart;
r = rstart;
while (l < lend && r < rend) {
    if (a[l] < a[r]) {
        b[out++] = a[l++];
    } else {
        b[out++] = a[r++];
    }
}
// Copies remainder of a[l..lend] or a[r..rend] to b[out...]

It requires 1) a boundary check on l, 2) a boundary check on r, 3) a comparison on element.

If an array is bitonic order, it can replace two boundary checks by a single for-loop. Typically, a for-loop is better executed in modern environments as it is more heavily optimized in both compilers and processors.

// Merges a[lstart..lend] and a[rstart..rend] into b[out...], assuming len = total number of elements to be merged.
l = lstart;
r = rend-1;
for (int i = 0; i < len; i++) {
    if (a[l] < a[r]) {
        b[out++] = a[l++];
    } else {
        b[out++] = b[r--];
    }
}

You might wonder what would happen if l reaches lend or r reaches rstart.

[1 5 7 10] [1000 900 800 700]
 ^                       ^
 l                       r

 out = [1]

[1 5 7 10] [1000 900 800 700]
   ^                     ^
   l                     r

out = [1 5]

[1 5 7 10] [1000 900 800 700]
     ^                   ^
     l                   r

out = [1 5 7]

[1 5 7 10] [1000 900 800 700]
       ^                 ^
       l                 r

out = [1 5 7 10]

[1 5 7 10] [1000 900 800 700]
            ^            ^
            l            r

out = [1 5 7 10 700]

The edge case is nicely handled because a[l] is the largest element on the right hand-side. It works similarly when r crosses over to the left hand-side.

Pattern Recognition

PdqSort nicely integrated a pattern recognition for ascending, descending, pipe organ, and single element patterns into the quick sort function. By carefully picking pivots and partitioning arrays, PdqSort does not mix up already sorted arrays like a simple quick sort.

Bitset Sort adopts the idea of PdqSort partially to exhibit O(n) behavior for known patterns. It might be possible to exploit more of known patterns but the author does not intend to optimize these patterns any further at this point. This is the author's belief that real world workload resembles randomized inputs rather than a few simple patterns.

Test Result

There are two setups.

Ryzen 5950X

  • AMD Ryzen 5950X 3.4G 16 core processor (Zen 3 architecture). Precision Boost Overdrive (an overclocking feature) is turned off for a stable result.
  • Ubuntu 20.04 LTS under Windows Subsystem for Linux 2 in Windows 10.
  • clang-12 deb packages for Ubuntu 20.04. (https://apt.llvm.org/)

Intel E5 1650

  • Intel E5 1650 3.6G 12 core processor (Broadwell architecture). Dynamic frequency scaling is turned off.
  • Debian Linux
  • clang-12 pre-built packages.

The graph shows time spent per element. It is expected to follow O(n log2 n).

AMD Ryzen 5950x (Zen 3)

AMD Ryzen 5950x

Intel E5 1650 (Broadwell)

Intel E5 1650

Bitset Sort performs better than PdqSort and std::sort in randomized inputs.

Bitset Sort performs worse in a few patterns in a few sizes. Please note the absolute timescale. Compared to the randomized inputs, these were a few nanosecond per element.

Acknowledgements

BLockQuickSort deserves much more attention than it has. BlockQuickSort trail-blazed a new way of sorting.

PdqSort improved BlockQuickSort with various ideas, making it very practical implementation.

Bitset Sort implementation took PdqSort as a motivational example, reimplemented the partition function and various part of the code. The author acknowledges PdqSort's neat implementation and various techniques to improve sorting performance.

References

  1. BlockQuickSort, Stefan Edelkamp and Armin Weiß, https://dl.acm.org/doi/fullHtml/10.1145/3274660
  2. PdpSort, Orson Peters, https://github.com/orlp/pdqsort
Issues
  • Fix slow path in reverse_conditional_swap

    Fix slow path in reverse_conditional_swap

    It seems that this is slow branch, and its logic should be identical to fast branch.

      inline void operator()(_RandomAccessIterator __x,
                             _RandomAccessIterator __y) const {
        typedef typename _VSTD::iterator_traits<_RandomAccessIterator>::value_type
            value_type;
        bool __result = !comp_(*__x, *__y);
        // Expect a compiler would short-circuit the following if-block.
        if (_VSTD::is_trivially_copy_constructible<value_type>::value &&
            _VSTD::is_trivially_copy_assignable<value_type>::value &&
            sizeof(value_type) <= 4 * sizeof(size_t)) {
          value_type __min = __result ? _VSTD::move(*__x) : _VSTD::move(*__y);
          *__y = __result ? _VSTD::move(*__y) : _VSTD::move(*__x);
          *__x = _VSTD::move(__min);
        } else {
          if (!__result) {
            _VSTD::iter_swap(__x, __y);
          }
        }
      }
    

    Logic of fast branch,

    if (result)
        min = x;
        y = y;
        x = x;
    else
        min = y;
        y = x;
        x = y;
    

    We swap elements only if result is false.

    Example to reproduce sort bug:

    int main(int argc, char ** argv)
    {
        (void)(argc);
        (void)(argv);
    
        std::vector<std::pair<int64_t, int64_t>> values = {
            {1, 1},
            {3, -1},
            {2, 1},
            {7, -1},
            {3, 1},
            {999, -1},
            {4, 1},
            {7, -1},
            {5, 1},
            {8, -1}
        };
    
        ::stdext::bitsetsort(values.begin(), values.end());
        bool is_sorted = std::is_sorted(values.begin(), values.end());
    
        std::cout << "Array " << values.size() << " is sorted " << is_sorted << std::endl;
    
        for (auto & value : values)
            std::cout << value.first << " " << value.second << std::endl;
    
        return 0;
    }
    

    Output before fix:

    Array 10 is sorted 0
    1 1
    2 1
    3 -1
    3 1
    4 1
    7 -1
    7 -1
    8 -1
    5 1
    999 -1
    

    Output after fix

    Array 10 is sorted 1
    1 1
    2 1
    3 -1
    3 1
    4 1
    5 1
    7 -1
    7 -1
    8 -1
    999 -1
    
    opened by kitaisreal 0
A readline and libedit replacement that supports UTF-8, syntax highlighting, hints and Windows and is BSD licensed.

Read Evaluate Print Loop ++ A small, portable GNU readline replacement for Linux, Windows and MacOS which is capable of handling UTF-8 characters. Unl

Marcin Konarski 577 Aug 6, 2022
Yori is a CMD replacement shell that supports backquotes, job control, and improves tab completion, file matching, aliases, command history, and more.

Yori is a CMD replacement shell that supports backquotes, job control, and improves tab completion, file matching, aliases, command history, and more.

Malcolm Smith 1.1k Aug 7, 2022
A wrapper of C++ std::vector for functional programming

Say hello to functional C++ vectors A wrapper for C++ std::vector geared towards functional programming and fluent APIs. The primary focus is readabil

null 19 Jul 26, 2022
A dynamic array similar to std::vector

Vector [] vector is a dynamic array implemented using c++ and it has almost all the features of std::vector including iterators. Usage can be used in

Tanish Porwal 1 Dec 21, 2021
C++ functions matching the interface and behavior of python string methods with std::string

Pystring is a collection of C++ functions which match the interface and behavior of python's string class methods using std::string. Implemented in C+

Sony Pictures Imageworks 732 Aug 10, 2022
A drop-in replacement for std::list with 293% faster insertion, 57% faster erasure, 17% faster iteration and 77% faster sorting on average. 20-24% speed increase in use-case testing.

plf_list A drop-in replacement for std::list with (on average): 293% faster insertion 57% faster erasure 17% faster iteration 77% faster sorting 70% f

Matt Bentley 116 Jul 25, 2022
A drop-in replacement for std::list with 293% faster insertion, 57% faster erasure, 17% faster iteration and 77% faster sorting on average. 20-24% speed increase in use-case testing.

plf::list A drop-in replacement for std::list with (on average): 293% faster insertion 57% faster erasure 17% faster iteration 77% faster sorting 70%

Matt Bentley 116 Jul 25, 2022
Parallel algorithms (quick-sort, merge-sort , enumeration-sort) implemented by p-threads and CUDA

程序运行方式 一、编译程序,进入sort-project(cuda-sort-project),输入命令行 make 程序即可自动编译为可以执行文件sort(cudaSort)。 二、运行可执行程序,输入命令行 ./sort 或 ./cudaSort 三、删除程序 make clean 四、指定线程

Fu-Yun Wang 3 May 30, 2022
Simple Useful Libraries: The C++17 header-only dynamic bitset

dynamic_bitset Simple Useful Libraries: The C++17 header-only dynamic bitset Requirements To use this dynamic bitset, you will need a C++17 compliant

Maxime Pinard 112 Aug 5, 2022
Simple Useful Libraries: C++17/20 header-only dynamic bitset

dynamic_bitset Simple Useful Libraries: C++17/20 header-only dynamic bitset Requirements To use this dynamic bitset, you will need a C++17 (or later)

Maxime Pinard 112 Aug 5, 2022
A faster,smaller,Address Sanitizer,200X Faster,95% Smaller.

FirASAN(Fir Address Sanitizer) A faster,smaller,Address Sanitizer 200X Faster,95% Smaller. FirASAN 结论: 内存消耗 CPU消耗 ASAN原版 +100-150% +200%以上 FirASAN +5%

null 10 Jul 14, 2022
A faster drop-in replacement for giflib. It uses more RAM, but you get more speed.

GIFLIB-Turbo What is it? A faster drop-in replacement for GIFLIB Why did you write it? Starting in the late 80's, I was fascinated with computer graph

Larry Bank 27 Jun 9, 2022
mold is a faster drop-in replacement for existing Unix linkers

mold: A Modern Linker mold is a faster drop-in replacement for existing Unix linkers. It is several times faster than LLVM lld linker, the second-fast

Rui Ueyama 8.4k Aug 11, 2022
Benchmarking a trivial replacement for std::vector

std::vector replacement benchmark Dependencies You'll need gnuplot and bash to run ./bench.sh. In addition to that, you'll need to have gcc and clang

Dale Weiler 8 Jul 2, 2022
Improved and configurable drop-in replacement to std::function that supports move only types, multiple overloads and more

fu2::function an improved drop-in replacement to std::function Provides improved implementations of std::function: copyable fu2::function move-only fu

Denis Blank 397 Aug 3, 2022
Fast C++ container combining the best features of std::vector and std::deque

veque The double-ended vector A very fast C++17 container combining the best features of std::vector and std::deque "In Most Cases, Prefer Using deque

null 99 Jul 22, 2022
Invoke.hpp - std::invoke/std::apply analogs for C++11/14

invoke.hpp std::invoke/std::apply analogs for C++11/14 Requirements gcc >= 4.9 clang >= 3.8 msvc >= 2015 Installation invoke.hpp is a header-only libr

Matvey Cherevko 32 Jun 30, 2022
Cpp-std-fwd - forward declarations for C++ std headers

cpp-std-fwd Forward declarations for most useful runtime classes of the C++ 17 standard library. DISCLAIMER: This project is meant as a proof-of-conce

Philip Trettner 67 Jul 10, 2022
Bolt is a C++ template library optimized for GPUs. Bolt provides high-performance library implementations for common algorithms such as scan, reduce, transform, and sort.

Bolt is a C++ template library optimized for heterogeneous computing. Bolt is designed to provide high-performance library implementations for common

null 355 Jun 27, 2022
A stable adaptive partitioning comparison sort.

Intro This document describes a partitioning stable adaptive comparison-based sort named gridsort. Binary Cube Gridsort sorts data by storing data in

null 199 Jul 13, 2022
Fast comparison-based sort algorithm

nanosort Algorithm nanosort aims to be a fast comparison-based sorting algorithm, tuned for POD types of reasonably small sizes. nanosort implements a

Arseny Kapoulkine 37 Aug 2, 2022
SortNode is a JS binding for SORT: Simple, online, and real-time tracking of multiple objects in a video sequence.

SortNode is a JS binding for SORT: Simple, online, and real-time tracking of multiple objects in a video sequence.

Techainer 10 Aug 2, 2022
Blitsort is an in-place stable adaptive rotate merge sort

Origin Blitsort is a rotate merge sort based on quadsort. Visualization In the visualization below nine tests are performed on 256 elements. Random or

null 338 Jul 26, 2022
A stable adaptive branchless partitioning comparison sort.

Intro This document describes a partitioning stable adaptive comparison-based sort named fluxsort. Benchmarks and a visualization are available at the

null 185 Jul 18, 2022
Quick sort is a highly efficient sorting algorithm and is based on partitioning of array of data into smaller arrays.

Quick sort is a highly efficient sorting algorithm and is based on partitioning of array of data into smaller arrays. A large array is partitioned into two arrays one of which holds values smaller than the specified value, say pivot, based on which the partition is made and another array holds values greater than the pivot value.

Lakshan Sandanayaka 2 Sep 21, 2021
Hand tuned sort routines for different C types

c 2008,2009,2021 Andrew I. Schein See LICENSE file. Update for 2021 This repo disappeared from the internet when Bitbucket turned off its mercurial se

Andrew Schein 2 Nov 24, 2021
This is a Program, to sort Arrays with the QuickSort Algorithm.

QuickSort This is a program, to sort arrays with the QuickSort Algorithm. The Algorithm is optimized to be quick, but it isn't the fastest. I have wri

Niklas 1 Oct 29, 2021
This repo will sort of document my story of learning vulkan with VulkanSDK and cl (msvc) on windows.

Learning Vulkan This repo is a means of documenting my journey with learning Vulkan's basics on windows. Because of the binaries in the LunarG VulkanS

null 2 Dec 8, 2021
An injector is simply a program that injects some sort of file into your game

example-injector What it injector? An injector is simply a program that injects some sort of file into your game. This could be something as benign as

Speedy 19 Jul 23, 2022