A stable adaptive branchless partitioning comparison sort.

Overview

Intro

This document describes a partitioning stable adaptive comparison-based sort named fluxsort. Benchmarks and a visualization are available at the bottom.

Analyzer

Fluxsort starts out with an analyzer that detects fully sorted arrays and sorts reverse order arrays using n comparisons. It also obtains a measure of presortedness and switches to quadsort if the array is less than 25% random.

Partitioning

Partitioning is performed in a top-down manner similar to quicksort. Fluxsort obtains the pseudomedian of 9 for partitions smaller than 1024 elements, and the pseudomedian of 15 otherwise. The median element obtained will be referred to as the pivot. Partitions that grow smaller than 24 elements are sorted with quadsort.

After obtaining a pivot the array is parsed from start to end. Elements smaller than the pivot are copied in-place to the start of the array, elements greater than the pivot are copied to swap memory. The partitioning routine is called recursively on the two partitions in main and swap memory.

Recursively partitioning through both swap and main memory is accomplished through the ptx pointer, which despite being simple in implementation is likely a novel technique since the logic behind it is a bit of a brain-twister.

Worst case handling

To avoid run-away recursion fluxsort switches to quadsort for both partitions if one partition is less than 1/16th the size of the other partition. On a distribution of random unique values the chance of a false positive is 1 in 65,536 for the pseudomedian of 9 and 1 in 16,777,216 for the pseudomedian of 15.

Combined with the analyzer fluxsort starts out with this makes the existence of killer patterns unlikely, other than a 2x performance slowdown by triggering the use of quadsort prematurely.

Branchless optimizations

Fluxsort uses a branchless comparison optimization similar to the method described in "BlockQuicksort: How Branch Mispredictions don't affect Quicksort" by Stefan Edelkamp and Armin Weiss.

Median selection uses a novel branchless comparison technique that selects the pseudomedian of 9 using between 8 and 12 (10.66 average) comparisons, and the pseudomedian of 15 using between 14 and 25 (21.33 average) comparisons.

These optimizations do not work as well when the comparisons themselves are branched and the largest performance increase is on 32 and 64 bit integers.

Memory

Fluxsort allocates n elements of swap memory, which is shared with quadsort. Recursion requires log n stack memory.

Data Types

The C implementation of fluxsort supports long doubles and 8, 16, 32, and 64 bit data types. By using pointers it's possible to sort any other data type, like strings.

Interface

Fluxsort uses the same interface as qsort, which is described in man qsort.

Big O

                 ┌───────────────────────┐┌───────────────────────┐
                 │comparisons            ││swap memory           
┌───────────────┐├───────┬───────┬───────┤├───────┬───────┬───────┤┌──────┐┌─────────┐┌─────────┐
│name           ││min    │avg    │max    ││min    │avg    │max    ││stable││partition││adaptive │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│fluxsort       ││n      │n log n│n log n││1      │n      │n      ││yes   ││yes      ││yes     
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quadsort       ││n      │n log n│n log n││1      │n      │n      ││yes   ││no       ││yes     
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quicksort      ││n      │n log n│n²     ││1      │1      │1      ││no    ││yes      ││no      
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│introsort      ││n log n│n log n│n log n││1      │1      │1      ││no    ││yes      ││no      
└───────────────┘└───────┴───────┴───────┘└───────┴───────┴───────┘└──────┘└─────────┘└─────────┘

Porting

People wanting to port fluxsort might want to have a look at twinsort, which is a simplified implementation of quadsort. Fluxsort itself is relatively simple.

Visualization

In the visualization below four tests are performed on 512 elements: Random, Generic, Random Half, and Ascending Tiles. Partitions greater than 48 elements use the pseudomedian of 15 to select the pivot.

fluxsort visualization

Benchmarks

The following benchmark was on WSL gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) using the wolfsort benchmark. The source code was compiled using g++ -O3 -w -fpermissive bench.c. The bar graph shows the best run out of 100 on 100,000 32 bit integers. Comparisons for fluxsort and std::stable_sort are inlined.

fluxsort vs stdstablesort

data table
Name Items Type Best Average Loops Samples Distribution
stablesort 100000 64 0.006123 0.006153 1 100 random order
fluxsort 100000 64 0.002477 0.002488 1 100 random order
Name Items Type Best Average Loops Samples Distribution
stablesort 100000 32 0.006008 0.006033 1 100 random order
fluxsort 100000 32 0.002329 0.002345 1 100 random order
stablesort 100000 32 0.000679 0.000683 1 100 ascending order
fluxsort 100000 32 0.000037 0.000037 1 100 ascending order
stablesort 100000 32 0.001375 0.001402 1 100 ascending saw
fluxsort 100000 32 0.000842 0.000854 1 100 ascending saw
stablesort 100000 32 0.003827 0.003853 1 100 generic order
fluxsort 100000 32 0.001129 0.001140 1 100 generic order
stablesort 100000 32 0.000901 0.000912 1 100 descending order
fluxsort 100000 32 0.000048 0.000048 1 100 descending order
stablesort 100000 32 0.001021 0.001035 1 100 descending saw
fluxsort 100000 32 0.000351 0.000365 1 100 descending saw
stablesort 100000 32 0.002034 0.002060 1 100 random tail
fluxsort 100000 32 0.001485 0.001492 1 100 random tail
stablesort 100000 32 0.003520 0.003542 1 100 random half
fluxsort 100000 32 0.002077 0.002088 1 100 random half
stablesort 100000 32 0.000922 0.000956 1 100 ascending tiles
fluxsort 100000 32 0.000674 0.000692 1 100 ascending tiles

The following benchmark was on WSL gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1). The source code was compiled using gcc -O3 bench.c. The bar graph shows the best run out of 100 on 100,000 32 bit integers. Comparisons for fluxsort and qsort are not inlined. The stdlib qsort() in the benchmark is a mergesort variant.

fluxsort vs qsort

data table
Name Items Type Best Average Compares Samples Distribution
qsort 100000 64 0.009427 0.009512 1536491 100 random order
fluxsort 100000 64 0.004853 0.004862 2001035 100 random order
Name Items Type Best Average Compares Samples Distribution
qsort 100000 32 0.008472 0.008609 1536634 100 random order
fluxsort 100000 32 0.004117 0.004136 1990342 100 random order
qsort 100000 32 0.002017 0.002194 815024 100 ascending order
fluxsort 100000 32 0.000140 0.000140 99999 100 ascending order
qsort 100000 32 0.002817 0.002855 915019 100 ascending saw
fluxsort 100000 32 0.001514 0.001525 558847 100 ascending saw
qsort 100000 32 0.006382 0.006435 1532339 100 generic order
fluxsort 100000 32 0.002333 0.002349 1269601 100 generic order
qsort 100000 32 0.002450 0.002479 853904 100 descending order
fluxsort 100000 32 0.000150 0.000150 99999 100 descending order
qsort 100000 32 0.002826 0.002916 1063907 100 descending saw
fluxsort 100000 32 0.001171 0.001200 697343 100 descending saw
qsort 100000 32 0.003705 0.003751 1012028 100 random tail
fluxsort 100000 32 0.002251 0.002261 681125 100 random tail
qsort 100000 32 0.005447 0.005497 1200835 100 random half
fluxsort 100000 32 0.003728 0.003747 1889402 100 random half
qsort 100000 32 0.003873 0.004301 1209200 100 ascending tiles
fluxsort 100000 32 0.000999 0.001005 400063 100 ascending tiles
Comments
  • fluxsort comparison behavior

    fluxsort comparison behavior

    Hi, I've recently been doing some research on sort implementations and noticed that in my tests that fluxsort is not stable. The test and benchmark suite is written in Rust, the following test fails for fluxsort but passes for other stable sorts such as Rust slice::sort and C++ std::stable_sort but not for fluxsort. You can find the test code here https://github.com/Voultapher/sort-research-rs/blob/2642b98dc60cb6c9a6dbdec706a71ebdfbdf89e1/tests/main.rs#L219. In essence it packs two i32 into one u64 and extracts in the comparison function the two i32 again and only uses the first one as comparison and then later checks that the order of the ascending second elements remains in order.

    Also on a sidenote, I'm seeing rather different benchmark results, where fluxsort pretty much never beats pdqsort.

    opened by Voultapher 8
  • Suboptimal code-gen in the fundamental branchless-swap building block

    Suboptimal code-gen in the fundamental branchless-swap building block

    The fundamental branchless swap_if code produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3, this will of course depend on the the hardware, but cmov seems to yield better results than setl/setg style code that is currently being produced. Probably helped by doing 8 instead of 10 instructions.

    Here is the current version:

    • C https://godbolt.org/z/39jaxjzh9
    • Rust https://godbolt.org/z/vYerGMcqq

    And here is the version that produces cmov code:

    • C https://godbolt.org/z/GrTvx1z8x (WIP, only good code gen for clang LLVM)
    • Rust https://godbolt.org/z/9qnfY6h3v

    I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.

    opened by Voultapher 3
  • "Long double" is not an 128bit integer

    iirc, long double is a weird 80-bit floating point number that x86 apparently supports Depending on the compiled platform (32bit x86 or x86_64) the 80 bits get rounded up to 12 or 16 bytes https://github.com/scandum/fluxsort/blob/main/src/fluxsort.h#L177

    opened by Soveu 2
  • Improve benchmark

    Improve benchmark

    Time is a lousy benchmark, especially with a low numbers of elements running on an OS. There a lot of way of profiling but I would suggest using an LLVM based profiling solution like this one because it measures the number of intermediate language instructions executed. This avoids compiler/microcode optimizations and gives you the most representative output for all platforms.

    opened by GravisZro 2
  • Collaborating?

    Collaborating?

    First off, sorry again to have jumped to conclusions on Hacker News. As a programming language guy I'm in the market for a stable sort and Fluxsort seems to have the fundamentals down!

    I wanted to know if you're interested in incorporating some or all of what I do into this repository. I'd like to help improve Fluxsort's worst cases and pattern recognition with techniques like pdqsort uses, and expand the interface to support other operations. Some possible topics, which I can also split into individual issues:

    • Add more sophisticated benchmarks with subtle patterns
    • Candidate randomization to avoid poor pivot selection for patterned as well as random data
    • Move the call stack into the memory buffer to avoid stack overflow and improve speed
    • Recognize signs of mostly-sorted inputs during pivot selection
    • Specialized integer code; maybe SIMD instructions, non-stable optimizations, or hybridization with counting sort

    (pdqsort depends on instability a lot, but I think there are usually ways to avoid this)

    As evidence that I have some clue what I'm talking about, here's a small improvement: the following inner loop for flux_partition gives about a 3% speedup in my benchmarks on random 1e6-element 4-byte int arrays.

    size_t n=0;
    val = cmp(ptx + 0, &piv) <= 0; pta[n] = ptx[0]; pts[0-n] = ptx[0]; n += val;
    // etc.
    
    pta += n; pts += 8-n;
    ptx += 8;
    
    opened by mlochbaum 28
Owner
null
Space partitioning structures visualization

vptree-draw SVG export of several 2D space partitioning structures. Code structure geo.h minimalist 2D geometry structures (Point, Vector, Box, Sphere

Basile Fraboni 35 Dec 7, 2022
This is a Program, to sort Arrays with the QuickSort Algorithm.

QuickSort This is a program, to sort arrays with the QuickSort Algorithm. The Algorithm is optimized to be quick, but it isn't the fastest. I have wri

Niklas 1 Oct 29, 2021
A stable adaptive partitioning comparison sort.

Intro This document describes a partitioning stable adaptive comparison-based sort named gridsort. Binary Cube Gridsort sorts data by storing data in

null 200 Dec 7, 2022
Blitsort is an in-place stable adaptive rotate merge sort

Origin Blitsort is a rotate merge sort based on quadsort. Visualization In the visualization below nine tests are performed on 256 elements. Random or

null 667 Jan 1, 2023
Parallel algorithms (quick-sort, merge-sort , enumeration-sort) implemented by p-threads and CUDA

程序运行方式 一、编译程序,进入sort-project(cuda-sort-project),输入命令行 make 程序即可自动编译为可以执行文件sort(cudaSort)。 二、运行可执行程序,输入命令行 ./sort 或 ./cudaSort 三、删除程序 make clean 四、指定线程

Fu-Yun Wang 3 May 30, 2022
Quick sort is a highly efficient sorting algorithm and is based on partitioning of array of data into smaller arrays.

Quick sort is a highly efficient sorting algorithm and is based on partitioning of array of data into smaller arrays. A large array is partitioned into two arrays one of which holds values smaller than the specified value, say pivot, based on which the partition is made and another array holds values greater than the pivot value.

Lakshan Sandanayaka 4 Jan 2, 2023
Fast comparison-based sort algorithm

nanosort Algorithm nanosort aims to be a fast comparison-based sorting algorithm, tuned for POD types of reasonably small sizes. nanosort implements a

Arseny Kapoulkine 40 Dec 29, 2022
Bitset Sort, a faster std::sort replacement.

Bitset Sort Bitset Sort is a variant of quick sort, specifically BlockQuickSort. Bitset Sort uses a carefully written partition function to let the co

null 66 Dec 1, 2022
Space partitioning structures visualization

vptree-draw SVG export of several 2D space partitioning structures. Code structure geo.h minimalist 2D geometry structures (Point, Vector, Box, Sphere

Basile Fraboni 35 Dec 7, 2022
Shared-Memory Parallel Graph Partitioning for Large K

KaMinPar The graph partitioning software KaMinPar -- Karlsruhe Minimal Graph Partitioning. KaMinPar is a shared-memory parallel tool to heuristically

Karlsruhe High Quality Graph Partitioning 17 Nov 10, 2022
Simple command line tool that processes image files using the FidelityFX Super Resolution (FSR) or Contrast Adaptive Sharpening (CAS) shader systems.

Simple command line tool that processes image files using the FidelityFX Super Resolution (FSR) or Contrast Adaptive Sharpening (CAS) shader systems.

GPUOpen Effects 190 Dec 12, 2022
around — adaptive rounding operation

around — adaptive rounding operation Attempts to perform nice rounding of a floating point number, like a human would do. Usage: around.h #include "ar

Jan Ringoš 9 Oct 21, 2022
Adaptive Runtime AUTOSAR Linux Simulator

Adaptive-AUTOSAR Adaptive AUTOSAR is a simulated Adaptive Platform enviroment over Linux defined by AUTOSAR. The goal of this project is to implement

Armin Kassemi Langroodi 154 Jan 4, 2023
A composable container for Adaptive ROS 2 Node computations. Select between FPGA, CPU or GPU at run-time.

adaptive_component A composable stateless container for Adaptive ROS 2 Node computations. Select between FPGA, CPU or GPU at run-time. Nodes using har

ROS 2 Hardware Acceleration Working Group 7 Sep 9, 2022
BAAF-Net - Semantic Segmentation for Real Point Cloud Scenes via Bilateral Augmentation and Adaptive Fusion (CVPR 2021)

Semantic Segmentation for Real Point Cloud Scenes via Bilateral Augmentation and Adaptive Fusion (CVPR 2021) This repository is for BAAF-Net introduce

null 90 Dec 29, 2022
A software serial driver package by using the hardware timer capture / comparison functionality.

Soft serial 1.简介 Soft serial 是利用硬件定时器捕获/比较功能实现软件模拟串口的软件包。 1.1目录结构 Soft serial 软件包目录结构如下所示: soft_serial ├───inc // 头文件目录 │

齐永忠 2 Jul 14, 2022
C implementation of C++ Utility functions Integer Comparison Macros

C implementation of C++ Utility functions Integer Comparison Macros

Robert C. Seacord 21 Oct 31, 2022
Harbour DB speed tests comparison

hbDBSpeedTests Harbour DB speed tests comparison - Registers Count: 821051 MySql configuration( 1 or 2 ) /data/mysql/dbstru.zip - Import structure of

DIEGO H FAZIO 3 Nov 18, 2021
Web to Desktop framework comparison

Web to Desktop framework comparison This repository was made to create an objective comparison of multiple framework that grant us to "transform" our

Axel 593 Dec 29, 2022
Fft-benchmark - A benchmark for comparison of FFT algorithms performance

FFT benchmark A benchmark for comparison of FFT algorithms performance. Currently supports Intel IPP, KFR, FFTW and KissFFT. Requires: Clang 6.0+ (GCC

KFR 17 Dec 21, 2022