libsais is a library for linear time suffix array and burrows wheeler transform construction based on induced sorting algorithm.

Overview

libsais

libsais is a library for fast (see Benchmarks below) linear time suffix array and Burrows-Wheeler transform construction based on induced sorting algorithm described in the following paper: Ge Nong, Sen Zhang and Wai Hong Chan, Two Efficient Algorithms for Linear Suffix Array Construction, 2008.

Copyright (c) 2021 Ilya Grebnov [email protected]

This program is based on the work of sais-lite library by Yuta Mori.

Introduction

libsais library provides simple C99 API to construct suffix array and Burrows-Wheeler transformed string from a given string over constant-size alphabet. The algorithm runs in a linear time using typically only ~12KB of extra memory with 2n bytes as absolute worst-case extra working space, where n is the length of the string.

Note, libsais library is heavily dependent on fast memory and software prefetching, so it might not be suitable for embeded workloads or heavily concurrent workloads which could saturate dual channel RAM.

License

libsais library is released under the Apache License Version 2.0

Changes

  • February 23, 2021
    • Initial release.

API

    /**
    * Constructs the suffix array of a given string.
    * @param T [0..n-1] The input string.
    * @param SA [0..n-1] The output array of suffixes.
    * @param n The length of the given string.
    * @return 0 if no error occurred, -1 or -2 otherwise.
    */
    int libsais(const unsigned char * T, int * SA, int n);

    /**
    * Constructs the burrows-wheeler transformed string of a given string.
    * @param T [0..n-1] The input string.
    * @param U [0..n-1] The output string. (can be T)
    * @param A [0..n-1] The temporary array.
    * @param n The length of the given string.
    * @return The primary index if no error occurred, -1 or -2 otherwise.
    */
    int libsais_bwt(const unsigned char * T, unsigned char * U, int * A, int n);

Benchmarks

  • CPU: Intel Core i7-9700K Processor (12M Cache, 5 GHz OC)
  • RAM: 16 GB Dual Channel 4133 MHz (17-17-17-37-400-2T OC)
  • OS: Microsoft Windows 10 Pro 64 Bit
  • Compiler: Clang 10.0.0 '-O3 -DNDEBUG'

The times are the minimum of five runs measuring single-threaded performance of suffix array construction.

Silesia Corpus

file size libsais 1.0.0 divsufsort 2.0.2 improvement
dickens 10192446 0.202 sec ( 50.43 MB/s) 0.448 sec ( 22.78 MB/s) +121.43%
mozilla 51220480 1.120 sec ( 45.73 MB/s) 1.803 sec ( 28.41 MB/s) +60.93%
mr 9970564 0.185 sec ( 53.90 MB/s) 0.374 sec ( 26.66 MB/s) +102.23%
nci 33553445 0.625 sec ( 53.70 MB/s) 1.184 sec ( 28.35 MB/s) +89.42%
ooffice 6152192 0.116 sec ( 53.03 MB/s) 0.178 sec ( 34.61 MB/s) +53.25%
osdb 10085684 0.228 sec ( 44.31 MB/s) 0.368 sec ( 27.38 MB/s) +61.83%
reymont 6627202 0.123 sec ( 53.78 MB/s) 0.253 sec ( 26.15 MB/s) +105.65%
samba 21606400 0.407 sec ( 53.06 MB/s) 0.663 sec ( 32.57 MB/s) +62.93%
sao 7251944 0.181 sec ( 40.04 MB/s) 0.232 sec ( 31.29 MB/s) +27.99%
webster 41458703 1.019 sec ( 40.70 MB/s) 2.195 sec ( 18.89 MB/s) +115.49%
xml 5345280 0.079 sec ( 68.05 MB/s) 0.139 sec ( 38.36 MB/s) +77.39%

Large Canterbury Corpus

file size libsais 1.0.0 divsufsort 2.0.2 improvement
bible.txt 4047392 0.066 sec ( 61.53 MB/s) 0.142 sec ( 28.48 MB/s) +116.00%
E.coli 4638690 0.078 sec ( 59.12 MB/s) 0.203 sec ( 22.83 MB/s) +158.97%
world192.txt 2473400 0.037 sec ( 67.32 MB/s) 0.075 sec ( 32.92 MB/s) +104.48%

Manzini's Corpus

file size libsais 1.0.0 divsufsort 2.0.2 improvement
chr22.dna 34553758 0.772 sec ( 44.76 MB/s) 2.050 sec ( 16.86 MB/s) +165.50%
etext99 105277340 3.190 sec ( 33.01 MB/s) 7.091 sec ( 14.85 MB/s) +122.30%
howto 39422105 0.926 sec ( 42.58 MB/s) 1.890 sec ( 20.86 MB/s) +104.08%
jdk13c 69728899 1.641 sec ( 42.50 MB/s) 3.086 sec ( 22.60 MB/s) +88.08%
linux-2.4.5.tar 116254720 2.744 sec ( 42.36 MB/s) 5.073 sec ( 22.92 MB/s) +84.85%
rctail96 114711151 3.183 sec ( 36.04 MB/s) 6.512 sec ( 17.62 MB/s) +104.56%
rfc 116421901 2.919 sec ( 39.88 MB/s) 5.710 sec ( 20.39 MB/s) +95.60%
sprot34.dat 109617186 3.010 sec ( 36.42 MB/s) 6.408 sec ( 17.11 MB/s) +112.90%
w3c2 104201579 2.524 sec ( 41.29 MB/s) 4.612 sec ( 22.59 MB/s) +82.72%

Large Text Compression Benchmark Corpus

file size libsais 1.0.0 divsufsort 2.0.2 improvement
enwik8 100000000 2.878 sec ( 34.74 MB/s) 6.216 sec ( 16.09 MB/s) +115.96%
enwik9 1000000000 36.786 sec ( 27.18 MB/s) 77.817 sec ( 12.85 MB/s) +111.54%

The Gauntlet Corpus

file size libsais 1.0.0 divsufsort 2.0.2 improvement
abac 200000 0.003 sec ( 75.94 MB/s) 0.002 sec ( 129.08 MB/s) -41.17%
abba 10500596 0.167 sec ( 62.77 MB/s) 0.444 sec ( 23.66 MB/s) +165.35%
book1x20 15375420 0.323 sec ( 47.55 MB/s) 0.719 sec ( 21.38 MB/s) +122.37%
fib_s14930352 14930352 0.344 sec ( 43.38 MB/s) 1.119 sec ( 13.34 MB/s) +225.11%
fss10 12078908 0.240 sec ( 50.22 MB/s) 0.850 sec ( 14.20 MB/s) +253.60%
fss9 2851443 0.041 sec ( 69.53 MB/s) 0.128 sec ( 22.25 MB/s) +212.42%
houston 3839141 0.037 sec ( 104.15 MB/s) 0.022 sec ( 177.07 MB/s) -41.18%
paper5x80 956322 0.014 sec ( 69.76 MB/s) 0.024 sec ( 40.22 MB/s) +73.45%
test1 2097152 0.041 sec ( 51.06 MB/s) 0.073 sec ( 28.75 MB/s) +77.63%
test2 2097152 0.039 sec ( 54.12 MB/s) 0.049 sec ( 42.85 MB/s) +26.29%
test3 2097088 0.036 sec ( 58.05 MB/s) 0.045 sec ( 47.04 MB/s) +23.41%

Pizza & Chilli Corpus

file size libsais 1.0.0 divsufsort 2.0.2 improvement
dblp.xml 296135874 8.628 sec ( 34.32 MB/s) 16.145 sec ( 18.34 MB/s) +87.11%
dna 403927746 14.983 sec ( 26.96 MB/s) 37.108 sec ( 10.89 MB/s) +147.67%
english.1024MB 1073741824 46.777 sec ( 22.95 MB/s) 99.729 sec ( 10.77 MB/s) +113.20%
pitches 55832855 1.253 sec ( 44.56 MB/s) 2.390 sec ( 23.36 MB/s) +90.74%
proteins 1184051855 59.239 sec ( 19.99 MB/s) 120.934 sec ( 9.79 MB/s) +104.15%
sources 210866607 5.407 sec ( 39.00 MB/s) 10.227 sec ( 20.62 MB/s) +89.13%

Pizza & Chilli Repetitive Corpus

file size libsais 1.0.0 divsufsort 2.0.2 improvement
cere 461286644 16.447 sec ( 28.05 MB/s) 35.326 sec ( 13.06 MB/s) +114.79%
coreutils 205281778 5.260 sec ( 39.03 MB/s) 10.842 sec ( 18.93 MB/s) +106.12%
einstein.de.txt 92758441 2.493 sec ( 37.21 MB/s) 4.696 sec ( 19.75 MB/s) +88.41%
einstein.en.txt 467626544 15.487 sec ( 30.19 MB/s) 32.412 sec ( 14.43 MB/s) +109.28%
Escherichia_Coli 112689515 3.381 sec ( 33.33 MB/s) 7.690 sec ( 14.65 MB/s) +127.44%
influenza 154808555 3.843 sec ( 40.28 MB/s) 9.292 sec ( 16.66 MB/s) +141.76%
kernel 257961616 7.039 sec ( 36.65 MB/s) 13.718 sec ( 18.80 MB/s) +94.88%
para 429265758 14.429 sec ( 29.75 MB/s) 33.080 sec ( 12.98 MB/s) +129.26%
world_leaders 46968181 0.881 sec ( 53.34 MB/s) 1.313 sec ( 35.76 MB/s) +49.17%
Comments
  • Crash for file size close to 2GB

    Crash for file size close to 2GB

    As an old user looking forward to switch from libdivsufsort, I noticed libsais would crash as file size approaches 2GB (with or without giving extra space), while divsufsort won't as long as file size is strictly under 2G (210241024*1024). I wonder what is the max size doable without switching to 64-bit version.

    opened by wupengcheng6819 5
  • Feature request?

    Feature request?

    When dealing with large files, for space reasons I often need a partial suffix array, that is, the suffix array of every other (or every n-th) positions of the file. I am guessing it would be much faster than sorting the whole file. Would this be a sensible feature or it would require a considerate amount of work?

    opened by wupengcheng6819 2
  • Overlapping parameters are passed to memcpy

    Overlapping parameters are passed to memcpy

    I was playing around with libsais and Silesia corpus when I encountered a possibility for overlapping parameters passed to memcpy. This can happen precisely at libsais_reconstruct_compacted_lms_suffixes_32s_2k_omp libsais/src/libsais.c:3912.

    Whether this causes problems or not, seems to depend on environment. In an environment where calls to memcpy are redirected to memmove this does not seem to cause problems and is only detected with AddressSanitizer. However in an environment where memcpy is actually utilized, this causes either a corrupted SA and/or segmentation fault, and is detected also with Valgrind.

    I was able to isolate the part of the payload causing this error and wrote a minimal reproducer for it. Payload is a 32kiB block extracted from concatenation of Silesia corpus contents in alphabetical order.

    How to use reproducer: Compile with gcc -o reproducer -g -fsanitize=address reproducer.c libsais.c Run with ./reproducer payload

    If there is some information that I could provide additionally, please feel free to ask.

    opened by akiutoslahti 2
  • UB in libsais_unbwt

    UB in libsais_unbwt

    Hi! The following C code will trigger undefined behaviour in libsais' unbwt code:

    unsigned char arr[] = { 0xFB, 0xB7, 0x46, 0xA8, 0x13, 0xBC, 0x88, 0xC8, 0x9B, 0xBC, 0x97, 0xCB, 0x1A, 0xA6, 0xAE, 0x96, 0xBC, 0xBD, 0x13, 0xB7, 0xA3, 0xE2, 0x95, 0x88, 0x9B, 0xB6, 0xC2, 0x87, 0x65, 0x77, 0xF7, 0xB8, 0x8E, 0xCE, 0xE1, 0xCB, 0x9F, 0x63, 0x9B, 0xF3, 0xCB, 0x63, 0x42, 0x26, 0x14, 0x2F, 0xC4, 0xCE, 0x43 };
    int size = 49; int bwt_idx = 1;
    
    int main(void) {
        s32 * A = (s32 *) malloc(sizeof(s32) * (size + 1));
        libsais_unbwt(arr, arr, A, size, NULL, bwt_idx);
        printf("%d\n", arr[0]);
    }
    

    The Valgrind output follows:

    ==2044781== Use of uninitialised value of size 8
    ==2044781==    at 0x11D5C2: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==
    ==2044781== Use of uninitialised value of size 8
    ==2044781==    at 0x11D5EA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==
    ==2044781== Use of uninitialised value of size 8
    ==2044781==    at 0x11D5A8: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==
    ==2044781== Conditional jump or move depends on uninitialised value(s)
    ==2044781==    at 0x11D5CA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==
    ==2044781== Use of uninitialised value of size 8
    ==2044781==    at 0x11D607: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
    ==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
    

    After a closer investigation, this happens on the if statement below:

    static void libsais_unbwt_decode_1(u8 * RESTRICT U, sa_uint_t * RESTRICT P, sa_uint_t * RESTRICT bucket2,
                                       u16 * RESTRICT fastbits, fast_uint_t shift, fast_uint_t * i0, fast_uint_t k) {
        u16 * RESTRICT U0 = (u16 *)(void *)U;
    
        fast_uint_t i, p0 = *i0;
    
        for (i = 0; i != k; ++i) {
            u16 c0 = fastbits[p0 >> shift];
            if (bucket2[c0] <= p0) {
                do {
                    c0++;
                } while (bucket2[c0] <= p0);
            }
            p0 = P[p0];
            U0[i] = libsais_bswap16(c0);
        }
    
        *i0 = p0;
    }
    
    opened by kspalaiologos 1
  • __builtin_bswap16 presence check for (older) GCCs that don't support it

    __builtin_bswap16 presence check for (older) GCCs that don't support it

    __builtin_bswap16(x) is gcc and clang built-in 16-bit byte swap macro. In libsais it is chosen by preprocessor if __GNUC__ or __clang__ are defined but __builtin_bswap16(x) is not present in every gcc even though __GNUC__ is defined. Although as it's been apparently patched in new gcc but it is important to be aware of the fact that checking for __GNUC__ is not sufficient indicator of the built-in presence. Hence the pull request.

    opened by tansy 0
  • Add CMake script for library build and install

    Add CMake script for library build and install

    The install logic also allows seamlessly use the library in any another project via

    find_package(sais REQUIRED)
    ...
    target_link_libraries(mytarget sais::sais)
    

    The user should use the #include <libsais/libsais.h> include path. This path will be transitively added via target_link_libraries().

    The test_package dir contains the simplest project which tries to link library using the find_package().

    opened by leha-bot 0
Releases(v2.7.1)
Owner
Ilya Grebnov
Ilya Grebnov
This algorithm is amazing and take a high performance to search something under array.

Sequential Binary Algorithm O(n) Algoritmo Este é um algoritmo de complexidade O(log n), que possui uma alta performance em percorrer um vetor de inte

Guilherme Serafim Kollet 1 Oct 26, 2021
A drop-in replacement for std::list with 293% faster insertion, 57% faster erasure, 17% faster iteration and 77% faster sorting on average. 20-24% speed increase in use-case testing.

plf::list A drop-in replacement for std::list with (on average): 293% faster insertion 57% faster erasure 17% faster iteration 77% faster sorting 70%

Matt Bentley 117 Nov 8, 2022
Sorting data on a stack with a limited set of instructions

push_swap Sorting data on a stack with a limited set of instructions Final Score: 125/100 Average 100 set speed: 575 Actions (Minimum 490 / Maximum 61

null 14 Nov 22, 2022
Repository containing all the sorting algorithms.

Sorting-Algorithms Repository containing all the sorting algorithms. Bubble Sort Bubble Sort is a simple algorithm which is used to sort a given set o

Rimpi Rani Baruah 4 Nov 15, 2022
Linear Linked List Library

list.h Implementations for singly-linked and doubly-linked list functions. Basic Working Example #include <stdio.h> #include <stdlib.h> #include "list

Nick Bulischeck 45 Nov 17, 2022
Is a linear data structure with O(log n) searches and O(cbrt n) insertions and index lookups.

A binary cube is a linear data structure that contains a sorted two dimensional dynamic array of nodes which each point to a sorted array

null 22 Jul 13, 2022
Random access array of tightly packed unsigned integers

PackedArray: random access array of tightly packed unsigned integers TLDR PackedArray comes to the rescue when you're in a desperate need for an uint9

Gregory Pakosz 137 Nov 11, 2022
A collection of array rotation algorithms.

The most commonly used rotation algorithms (aka block swaps) were documented around 1981 and haven't notably changed since. Below I'll describe severa

null 35 Nov 12, 2022
simple hash table linear probing method can expand/reduce

hash table a simple c hash table implementation based on https://benhoyt.com/writings/hash-table-in-c/ project can store different data types (data al

Fabio Murer 1 Oct 3, 2021
Structure-of-array synthesis in C++20

Structure-of-array synthesis in C++20

celtera 62 Nov 7, 2022
Data Structure Studying Group: array, linked list, stack, queue, deque, tree, graph. Summary of these concepts are descripted in the markdown files.

DATA_STRUCTURE STUDY ?? CURRICULUM ✔️ Day1 array linked list ✔️ Day2 circular linked list double linked list polynomial addtion reverse linked list ✔️

Soyeon Kim 3 Jul 22, 2022
Simple C++ Genetic Algorithm library

crsGA: Simple C++ Genetic Algorithm library crsGA is a simple C++ template library for developing genetic algorithms, plus some other utilities (Logge

Rafael Gaitán 6 Apr 24, 2022
Explore the world of Data Structures and Algorithm

Hey Everyone! ?? DSA- PlayYard is the first open source project of Lets Grow More Community. It is the perfect place to start with or to test your DSA

LetsGrowMore 20 Oct 9, 2022
Implementation of two of the most famous subdivision algorithm: Loops Subdivision and CatMull-Clark Subdivision.

3D-Subdivision-Surface Implementation of two of the most famous subdivision algorithms: Loops Subdivision for Triangles and CatMull-Clark Subdivision

Phan Sang 11 Nov 18, 2022
I have shared all the operations I learnt (still learning) regarding DataStructures and Algorithm.

DataStructures-and-Algorithm If you appreciate my work, please ?? this repository. It motivates me. Why companies like Amazon, Microsoft, Google focus

Subham Sharma 25 Nov 6, 2022
Data Structures and Algorithm, UPES

Data Structures and Algorithm, UPES This repository contains all the Lab experiments I have performed while learning the course Data Structures and Al

Mohak Bajaj 3 Oct 31, 2022
180+ Algorithm & Data Structure Problems using C++

180+ Algorithm & Data Structure Problems using C++

Ravi Mandliya 5.1k Nov 24, 2022
Teaching materials for Algorithm Bootcamp: Data Structure.

Data Structure Materials Materials Topics Code Introduction to Data Structures Struct Pointer Dynamic Memory Allocation 00_intro_to_ds.cpp Linked List

pakgembus 49 Oct 15, 2022
Open-source graph editor, with built-it step-by-step Dijkstra's Algorithm.

Visual Dijkstra - Simple visual graph editor, with built-in step-by-step Dijkstra's algorithm Visual Dijkstra is a free and open-source tool, designed

Samuele Girgenti 31 Oct 10, 2022