🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes

Overview

The PGM-index

The Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.

Website | Documentation | Paper | Slides | Python wrapper | A³ Lab

Quickstart

This is a header-only library. It does not need to be installed. Just clone the repo with

git clone https://github.com/gvinciguerra/PGM-index.git
cd PGM-index

and copy the include/pgm directory to your system's or project's include path.

The examples/simple.cpp file shows how to index and query a vector of random integers with the PGM-index:

#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include "pgm/pgm_index.hpp"

int main() {
    // Generate some random data
    std::vector<int> data(1000000);
    std::generate(data.begin(), data.end(), std::rand);
    data.push_back(42);
    std::sort(data.begin(), data.end());

    // Construct the PGM-index
    const int epsilon = 128; // space-time trade-off parameter
    pgm::PGMIndex<int, epsilon> index(data);

    // Query the PGM-index
    auto q = 42;
    auto range = index.search(q);
    auto lo = data.begin() + range.lo;
    auto hi = data.begin() + range.hi;
    std::cout << *std::lower_bound(lo, hi, q);

    return 0;
}

Run and edit this and other examples on Repl.it. Or run it locally via:

g++ examples/simple.cpp -std=c++17 -I./include -o simple
./simple

Classes overview

Besides the pgm::PGMIndex class used in the example above, this library provides the following classes:

  • pgm::DynamicPGMIndex supports insertions and deletions.
  • pgm::MultidimensionalPGMIndex stores points in k dimensions and supports orthogonal range queries.
  • pgm::MappedPGMIndex stores data on disk and uses a PGMIndex for fast search operations.
  • pgm::CompressedPGMIndex compresses the segments to reduce the space usage of the index.
  • pgm::OneLevelPGMIndex uses a binary search on the segments rather than a recursive structure.
  • pgm::BucketingPGMIndex uses a top-level lookup table to speed up the search on the segments.
  • pgm::EliasFanoPGMIndex uses a top-level succinct structure to speed up the search on the segments.

The full documentation is available here.
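
For instance, here is a minimal sketch of pgm::DynamicPGMIndex used as a key-to-value map with updates and point lookups. Take it as a starting point only: the member functions used below (insert_or_assign, erase, find, size_in_bytes) are assumptions based on common usage of this class, so check the documentation for the exact interface.

#include <cstdint>
#include <iostream>
#include "pgm/pgm_index_dynamic.hpp"

int main() {
    // A dynamic key -> value index; it starts empty and grows with insertions
    pgm::DynamicPGMIndex<uint64_t, uint64_t> index;

    for (uint64_t k = 0; k < 1000; ++k)
        index.insert_or_assign(k, 2 * k);  // insert a new pair or overwrite the value

    index.erase(42);                       // remove a key

    auto it = index.find(7);               // point lookup
    if (it != index.end())
        std::cout << "7 -> " << it->second << '\n';

    std::cout << "index size: " << index.size_in_bytes() << " bytes\n";
    return 0;
}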

Compile the tests and the tuner

After cloning the repository, build the project with

cmake . -DCMAKE_BUILD_TYPE=Release
make -j8

The test runner will be placed in test/, the tuner executable in tuner/, and the benchmark executable in benchmark/.

License

This project is licensed under the terms of the Apache License 2.0.

If you use the library, please put a link to the website and cite the following paper:

Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8): 1162-1175, 2020.

@article{Ferragina:2020pgm,
  Author = {Paolo Ferragina and Giorgio Vinciguerra},
  Title = {The {PGM-index}: a fully-dynamic compressed learned index with provable worst-case bounds},
  Year = {2020},
  Volume = {13},
  Number = {8},
  Pages = {1162--1175},
  Doi = {10.14778/3389133.3389135},
  Url = {https://pgm.di.unipi.it},
  Issn = {2150-8097},
  Journal = {{PVLDB}}}
Comments
  • Application of PGM-index

    Hi, this is an interesting work! Thanks for your contribution.

    But I have some problems:

    1. In the examples, we can see that it trains a mapping from a key's value to its stored position. If we really want to use it in a real application, the user wants to access a value based on its key. How should values then be organized with the PGM-index? In other words, many (key, value) pairs are stored in a large array sorted by key (a value may be a pointer or an offset, etc., which suggests the lengths of keys and values are almost fixed). How can we make full use of the PGM-index to search for a key and return the corresponding value? I believe it should find the position and then use (position + the length of the key) to return the value; however, the PGM-index seems to need a significant upgrade to support this. Is there any other solution? (See the sketch after this list for one possible layout.)

    2. In the paper, the claim that it "works for any kind of keys that can be mapped to reals while preserving their order. Examples include integers, strings, etc." is very difficult to understand and hard to implement. My reasons are given below:

      • issue one: the mapping function maps a key (a string) to a real, and then the sorted reals are used to construct the PGM-index and find the offset? Why not map the key directly to its offset or value? If a mapping between keys and reals is acceptable, then a mapping between keys and offsets (values) should also be acceptable.
      • issue two: one important question is how strings (the key type in many DBs) can be supported in the current system. A string can have unlimited length, so it should be very hard to bound the errors, especially when many short fixed-length strings are mixed with a few long strings. I also looked at #8; that hint is very nice, but other issues come up: mapping a string directly to uint128_t at bit level may solve issue one, but it implies that (1) the maximum string length is 24 (?) and (2) space is wasted, say some strings have 19 chars and can still only use uint128_t for the mapping, so some space is wasted.
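
    A minimal sketch of the layout hinted at in point 1, under the assumption that the records sit in one key-sorted array (the Record type and the lookup helper below are hypothetical, not part of the library): build the PGM-index over the keys alone, then use the returned range to finish with a lower_bound over the records.

    #include <algorithm>
    #include <cstdint>
    #include <optional>
    #include <vector>
    #include "pgm/pgm_index.hpp"

    struct Record { uint64_t key; uint64_t value; };  // hypothetical fixed-size record

    // Collect the keys of a key-sorted record array to build the index on.
    std::vector<uint64_t> extract_keys(const std::vector<Record> &records) {
        std::vector<uint64_t> keys;
        keys.reserve(records.size());
        for (const auto &r : records)
            keys.push_back(r.key);
        return keys;
    }

    // Narrow the search with the index, then scan only the epsilon-bounded range.
    std::optional<uint64_t> lookup(const std::vector<Record> &records,
                                   const pgm::PGMIndex<uint64_t, 64> &index,
                                   uint64_t key) {
        auto range = index.search(key);
        auto first = records.begin() + range.lo;
        auto last = records.begin() + range.hi;
        auto it = std::lower_bound(first, last, key,
                                   [](const Record &r, uint64_t k) { return r.key < k; });
        if (it != last && it->key == key)
            return it->value;
        return std::nullopt;
    }

    // Usage: pgm::PGMIndex<uint64_t, 64> index(extract_keys(records));
    //        auto value = lookup(records, index, some_key);
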
    question 
    opened by authwork 17
  • Add support for a vector of int as a key

    Hello @gvinciguerra! With growing big data, sometimes we may need a vector of integers as the key to index some values. Currently, the supported approach is to use a hash function to combine such a vector of integers into a uint128_t number.

    Is it possible to extend the built-in ML model (f(x) = kx + b)? For example, given [x1, x2, x3, ..., xn], build an ML model f(x1, x2, x3, ..., xn) = c0 + c1*x1 + c2*x2 + ... + cn*xn?

    In this way, it could become more general. Currently, I use a multi-level PGM-index (#26) to implement this, but it seems to have much more overhead (size, time, etc.) than a single-level one.

    Many thanks.

    question 
    opened by authwork 12
  • PGMIndex return incorrect range on books dataset

    Hi,

    For some values of epsilon, PGM returns incorrect ranges on the books dataset from SOSD.

    Dataset: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/A6HDNT&version=10.0

    Code to reproduce the issue:

    #include <vector>
    #include <fstream>
    #include <iostream>
    
    #include "pgm/pgm_index.hpp"
    
    
    template<typename Key>
    std::vector<Key> load_data(const std::string &filename) {
        using key_type = Key;
    
        /* Open file. */
        std::ifstream in(filename, std::ios::binary);
        if (!in.is_open())
            exit(EXIT_FAILURE);
    
        /* Read number of keys. */
        uint64_t n_keys;
        in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));
    
        /* Initialize vector. */
        std::vector<key_type> data;
        data.resize(n_keys);
    
        /* Read keys. */
        in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
        in.close();
    
        return data;
    }
    
    int main(int argc, char *argv[])
    {
        /* Load keys. */
        auto keys = load_data<uint64_t>("data/books_200M_uint64");
        std::cout << "keys loaded" << std::endl;
    
        /* Build PGM. */
        const int epsilon = 2048;
        pgm::PGMIndex<uint64_t, epsilon> pgm(keys);
        std::cout << "pgm built" << std::endl;
    
        /* Lookup all keys. */
        for (std::size_t i = 0; i != keys.size(); ++i) {
            auto key = keys.at(i);
            auto range = pgm.search(key);
            if (range.lo > i or range.hi < i) {
                std::cout << "key not in range\n"
                          << "key: " << key << '\n'
                          << "pos: " << i << '\n'
                          << "lo: " << range.lo << '\n'
                          << "hi: " << range.hi << '\n'
                          << std::endl;
            }
        }
    }
    

    Output for epsilon=2048:

    ...
    key not in range
    key: 880841552717461696
    pos: 47236183
    lo: 47236184
    hi: 47240282
    
    key not in range
    key: 880842439537799360
    pos: 47236231
    lo: 47236232
    hi: 47240330
    

    Output for epsilon=16384:

    ...
    key not in range
    key: 1154823330085936896
    pos: 61887138
    lo: 61887141
    hi: 61919911
    
    key not in range
    key: 1154823338493908224
    pos: 61887139
    lo: 61887141
    hi: 61919911
    
    key not in range
    key: 1154823341621615232
    pos: 61887140
    lo: 61887141
    hi: 61919911
    
    opened by marcelmaltry 11
  • multi-level PGM-index and its size

    @gvinciguerra Hello, I have done an experiment to build a multi-level PGM-index. Why? I use it to support very large integers, like a 256-bit integer. Assume I have a uint256_t num; I set num[0] to a uint128_t holding the high 128 bits of num, and num[1] to a uint128_t holding the low 128 bits. To index with the PGM-index:

    typedef PGM-index<uint128_t, value> type1;
    typedef PGM-index<uint128_t, type1 *> type2;
    

    For insertion, I will first insert a type1 variable into type2, then insert the real value into the corresponding type1, such as:

    type2 index2nd;
    auto pp = index2nd.find(num[0]);
    if(pp == index2nd.end()){
        type1 *pp2 = new type1();
        index2nd.insert_or_assign(num[0], pp2);
        pp2->insert_or_assign(num[1], value);
    }
    else{
        pp.second->insert_or_assign(num[1], value);
    }
    

    To calculate the size of the whole index:

    uint64_t size = index2nd.size_in_bytes();
    for (auto &e : index2nd) {
          size += e.second->size_in_bytes();
    }
    

    Does size correctly calculate the memory size of index2nd?

    question 
    opened by authwork 6
  • Scalability issue

    Hello @gvinciguerra. I have done a comparison between the PGM-index and unordered_map (a hash map) using:

    pgm::DynamicPGMIndex<__uint128_t, A> d_index; // A is a structure with multiple uint64_t
    for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
    	A a;
    	a.a = (uint64_t)i;
    	a.b = (uint64_t)2*i;
    	a.c = (uint64_t)3*i;
    	d_index.insert_or_assign(i, a);
    }
    A b;
    A *c = &b;
    for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
    	auto iter = d_index.find(i);
    	*c = iter->second;
    	std::cout << b.a << " " << b.b << " " << b.c << std::endl;
    }
    

    The memory consumption of DynamicPGMIndex can reach 98%, and it will cause:

    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    Aborted
    

    while the case using unordered_map has only very low memory consumption. Theoretically speaking, a hash map should use much more memory (pointers in linked lists, RB trees, etc.). Why does this experiment show that DynamicPGMIndex uses more memory? And how can we solve it?

    Another issue is mentioned in #17: how to achieve a dynamically insertable set using the PGM-index?

    Many thanks

    question 
    opened by authwork 6
  • What does this adjustment do?

    https://github.com/gvinciguerra/PGM-index/commit/e346995724dd12a7b0b5c039ec3475fe097cb54d added code that appears to add 1 to the end of a run of 2 or more equal values, so long as that wouldn't result in the new value creating or extending a run. But the title is "small fixes" and there isn't any comment. Can you help me understand what this is for?

    documentation question 
    opened by taralx 6
  • Request guidance on object-to-integer conversion.

    This question was asked earlier in https://github.com/gvinciguerra/PGM-index/issues/17.

    I do not want to beat a dead horse. I just want to put a better/harder/clearer period to it. I've taken the advice in 17. Unless a reply here convinces me otherwise I will use a radix trie as per author's recommendation and come back to PGM when it's a better fit.

    But to make sure I'm not missing something, it's worth pointing out that whatever the underlying search structure we wish to adjoin the PGM to (or, even better, replace in whole or in part with PGM indexing), we are confronted with the problem of converting keys to integers. My keys are something like 32 bytes or so on average. There seems to be no trivial way to do this. Certainly it's not the PGM authors' problem to solve, but it might be helpful to identify practical or semi-practical avenues:

    • Regard the key as a bit string and turn it into a (potentially very large) double, then compress it, perhaps à la https://en.wikipedia.org/wiki/Arithmetic_coding
    • Some compression repos that might help are https://github.com/RoaringBitmap/CRoaring https://github.com/powturbo/TurboPFor-Integer-Compression https://github.com/lemire/streamvbyte https://github.com/facebook/zstd

    But please note that while compression offers many gains here and in other parts of the code, compression has to be order-preserving to be helpful for indexing.

    • In defense of the PGM, the authors might point out that extended doubles (even if they far exceed a typical 8-byte Intel double while still honoring vanilla 8-byte double properties) still offer benefits that far outweigh carrying keys around for indexing.

    There's also the idea of order-preserving minimal hashes: hash a key and take the resulting hash into the PGM. The only thing I could find was https://github.com/DominikHorn/exotic-hashing, but I don't think those implementations go very far.

    Finally, it might be worth pointing out that databases, through other work they already had to do, have integers for keys in hand because they needed an associative map taking integer i to the i-th key. Perhaps i was a byte offset in a file, or an offset from a base memory address, etc. In that case the PGM can be adjoined to good benefit.

    To close this issue, can we collectively point out some 3-5 avenues for good, practical, order-preserving integer production from keys, for those who don't naturally fall into it otherwise?
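
    As one more semi-practical avenue in this spirit, here is a minimal sketch of a bit-level packing (an illustration, not a library API): interpret the first 16 bytes of the key as a big-endian __uint128_t (a GCC/Clang extension). It preserves lexicographic order whenever keys differ within their first 16 bytes; longer keys that share a 16-byte prefix collapse to the same integer and need a secondary check, which is exactly the limitation raised above.

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // Order-preserving (on the first 16 bytes) mapping from a byte string to a
    // 128-bit integer; shorter keys are implicitly zero-padded on the right.
    __uint128_t key_to_uint128(const std::string &key) {
        __uint128_t x = 0;
        for (std::size_t i = 0; i < 16; ++i) {
            x <<= 8;
            if (i < key.size())
                x |= static_cast<unsigned char>(key[i]);
        }
        return x;
    }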

    opened by rodgarrison 5
  • Use of GPL code is incompatible with Apache License

    The sdsl library is licensed under the GPL v3 license. You cannot use code with that license in this library without explicit permission from the owner of that library as the terms are incompatible.

    opened by lederernc 5
  • Multiple-level Dynamic PGM-index

    @gvinciguerra I have done a test among unordered_map, ALEX, and pgm_index. The result is very strange:

    It costs 932159
    Hash table size is 175731056
    It costs 8694683
    alex size is 1575900636
    Killed # pgm test for too large memory consumption
    

    Is there any suggestion for designing a multiple-level dynamic PGM-index? Many thanks in advance. If you would like to give it a try, please use:

    g++ -std=c++17 -fopenmp -I./PGM-index/include/ -I./ALEX/src/ bench.cpp -march=native -Wall -Wextra
    
    #include <iostream>
    #include <sys/random.h>
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <string>
    #include <time.h>
    #include <vector>
    #include <iostream>
    #include <algorithm>
    #include <unordered_map>
    #include <sys/time.h>
    #include "pgm/pgm_index.hpp"
    #include "pgm/pgm_index_dynamic.hpp"
    #include "pgm/pgm_index_variants.hpp"
    #include "core/alex.h"
    
    struct KKK {
        uint64_t a;
        uint32_t b, c, d;
    };
    
    struct VVV{
    	uint64_t a, b, c;
    };
    
    typedef pgm::DynamicPGMIndex<uint32_t, VVV> u32P;
    typedef pgm::DynamicPGMIndex<uint64_t, u32P*> u64PGM2;
    typedef pgm::DynamicPGMIndex<uint64_t, u64PGM2*> u64PGM1;
    
    void pgm_test(std::vector<KKK> key, std::vector<VVV> value){
    	u64PGM1 table;
    	int number = key.size();
    	struct timeval start, end;
    	gettimeofday(&start, NULL);
    
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1);
    		if(iter == table.end()){
    			table.insert_or_assign(k1, new u64PGM2());
    			iter = table.find(k1);
    		}
    		auto p2 = iter->second;
    		auto iter1 = p2->find(k2);
    		if(iter1 == p2->end()){
    			p2->insert_or_assign(k2, new u32P());
    			iter1 = p2->find(k2);
    		}
    		iter1->second->insert_or_assign(k3, value[i]);
    	}
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1)->second->find(k2)->second->find(k3);
    		VVV b = iter->second;
    		if (b.a != value[i].a || b.b != value[i].b
    				|| b.c != value[i].c) {
    			printf("failed\n");
    			return;
    		}
    	}
    	gettimeofday(&end, NULL);
    	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
    			+ (end.tv_usec - start.tv_usec);
    	std::cout << "It costs " << time_len << std::endl;
    	uint64_t size = table.size_in_bytes();
    
    	for (auto &e : table) {
    		size += e.second->size_in_bytes();
    		for(auto &e1 : *(e.second)){
    			size += e1.second->size_in_bytes();
    		}
    	}
    	printf("pgm size is %lld\n", size);
    }
    
    typedef alex::Alex<uint32_t, VVV> u32A;
    typedef alex::Alex<uint64_t, u32A*> u64P2;
    typedef alex::Alex<uint64_t, u64P2*> u64P1;
    
    void alex_test(std::vector<KKK> key,
    		std::vector<VVV> value) {
    	u64P1 table;
    	int number = key.size();
    	struct timeval start, end;
    	gettimeofday(&start, NULL);
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1);
    		if (iter == table.end()) {
    			table.insert(k1, new u64P2());
    			iter = table.find(k1);
    		}
    		auto p2 = iter.payload();
    		auto iter1 = p2->find(k2);
    		if (iter1 == p2->end()) {
    			p2->insert(k2, new u32A());
    			iter1 = p2->find(k2);
    		}
    		iter1.payload()->insert(k3, value[i]);
    	}
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1).payload()->find(k2).payload()->find(k3);
    		VVV b = iter.payload();
    		if (b.a != value[i].a || b.b != value[i].b
    				|| b.c != value[i].c) {
    			printf("failed\n");
    			return;
    		}
    	}
    	gettimeofday(&end, NULL);
    	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
    			+ (end.tv_usec - start.tv_usec);
    	std::cout << "It costs " << time_len << std::endl;
    
    	uint64_t size = table.data_size() + table.model_size();
    
    	for (auto e = table.begin(); e != table.end(); e++) {
    		auto t2 = e.payload();
    		size += t2->data_size();
    		size += t2->model_size();
    		for(auto e1 = t2->begin(); e1 != t2->end(); e1++){
    			size += e1.payload()->data_size();
    			size += e1.payload()->model_size();
    		}
    	}
    	printf("alex size is %lld\n", size);
    }
    
    struct TupleHasher {
        std::size_t
        operator()(const KKK &key) const {
            return key.a;
        }
    };
    
    struct TupleEqualer {
        bool operator()(const KKK &lhs, const KKK &rhs) const {
            return (lhs.a == rhs.a) && (lhs.b == rhs.b) && (lhs.c == rhs.c) && (lhs.d == rhs.d);
        }
    };
    
    void hashmap_test(std::vector<KKK> key,
    		std::vector<VVV> value) {
    	std::unordered_map<KKK, VVV, TupleHasher, TupleEqualer> table;
    	int number = key.size();
    	struct timeval start, end;
    	gettimeofday(&start, NULL);
    	for (int i = 0; i < number; i++) {
    		table[key[i]] = value[i];
    	}
    	for (int i = 0; i < number; i++) {
    		VVV b = table[key[i]];
    		if (b.a != value[i].a || b.b != value[i].b
    				|| b.c != value[i].c) {
    			printf("failed\n");
    			return;
    		}
    	}
    	gettimeofday(&end, NULL);
    	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
    			+ (end.tv_usec - start.tv_usec);
    	std::cout << "It costs " << time_len << std::endl;
    	uint64_t size = 0;
    
    	for (unsigned i = 0; i < table.bucket_count(); ++i) {
    		size_t bucket_size = table.bucket_size(i);
    		if (bucket_size == 0) {
    			size += sizeof(void*);
    		} else {
    			size += bucket_size * (sizeof(KKK) + sizeof(VVV));
    			size += (bucket_size - 1) * sizeof(void*);
    		}
    	}
    
    	printf("Hash table size is %lld\n", size);
    }
    
    int main(){
    	std::vector<KKK> x_key;
    	std::vector<VVV> x_value;
    	for(int i=0; i<3199807; i++){
    		KKK a;
    		VVV b;
                   // you can also generate them randomly
    		a.a = i >> 96;
    		a.b = i >> 64;
    		a.c = i >> 32;
    		a.d = i;
    		b.a = 1 * i;
    		b.b = 2 * i;
    		b.c = 3 * i;
    		x_key.push_back(a);
    		x_value.push_back(b);
    	}
    	hashmap_test(x_key, x_value);
    	alex_test(x_key, x_value);
    	pgm_test(x_key, x_value);
    }
    
    duplicate invalid 
    opened by authwork 4
  • Recommendations for large list of strings?

    I tried to convert my 125 million domains to a unique set of integers, but the integer values exceeded the max for 64-bit ints. Does anyone know a way to solve this? Maybe something obvious I am not seeing.

    question 
    opened by mySYSMON 3
  • Logic error: Points must be increasing by x.

    Hello! Got another one for you :)

    Dataset: https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7

    PGMIndex<uint64_t, 16, 4> and PGMIndex<uint64_t, 32, 4> both build without an issue, but PGMIndex<uint64_t, 64, 4> fails:

    terminate called after throwing an instance of 'std::logic_error'                                                  
      what():  Points must be increasing by x.                                                                                                           
    

    Backtrace:

    (gdb) bt                                                                                                           
    #0  0x00007ffff7a87ce5 in raise () from /usr/lib/libc.so.6                                                         
    #1  0x00007ffff7a71857 in abort () from /usr/lib/libc.so.6                                                         
    #2  0x00007ffff7e2c81d in __gnu_cxx::__verbose_terminate_handler ()                                                
        at /build/gcc/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95                                                  
    #3  0x00007ffff7e392ca in __cxxabiv1::__terminate (handler=<optimized out>)                                        
        at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48                                                
    #4  0x00007ffff7e39337 in std::terminate () at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58        
    #5  0x00007ffff7e3959e in __cxxabiv1::__cxa_throw (obj=obj@entry=0x555555b29420, 
        tinfo=0x555555b04098 <typeinfo for std::logic_error@@GLIBCXX_3.4>, 
        dest=0x7ffff7e4f630 <std::logic_error::~logic_error()>)
        at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
    #6  0x00005555557ad18e in OptimalPiecewiseLinearModel<unsigned long, unsigned long>::add_point (
        this=this@entry=0x7fffffffd9b0, x=@0x7fffffffd760: 0, y=@0x7fffffffd768: 8942)
        at /usr/include/c++/9.3.0/bits/stl_iterator.h:806
    #7  0x00005555557bdeeb in make_segmentation<PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1, auto:2, auto:3)#3}>(unsigned long, unsigned long, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}) (n=n@entry=8943, error=error@entry=4, in=..., out=..., out@entry=...)
        at /usr/include/c++/9.3.0/bits/stl_pair.h:378
    #8  0x00005555557bee24 in PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > > (last=0, first=..., 
        this=0x555555b2dea0) at /usr/include/c++/9.3.0/bits/stl_vector.h:1040
    #9  PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex (
        data=std::vector of length 200000000, capacity 268435456 = {...}, this=0x555555b2dea0)
        at /home/ryan/SOSD-private/competitors/PGM-index/include/pgm_index.hpp:113
    #10 std::make_unique<PGMIndex<unsigned long, 64ul, 4ul, double>, std::vector<unsigned long, std::allocator<unsigned
     long> >&> () at /usr/include/c++/9.3.0/bits/unique_ptr.h:857
    #11 PGM<unsigned long, 64>::Build (data=..., this=0x7fffffffdc30)
        at /home/ryan/SOSD-private/dtl/../competitors/pgm_index.h:23
    
    
    opened by RyanMarcus 3
  • Add CodeQL workflow for GitHub code scanning

    Hi gvinciguerra/PGM-index!

    This is a one-off automatically generated pull request from LGTM.com. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

    With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

    This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working. In fact, you might already have seen some alerts appear on this pull request!

    Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

    Questions? Check out the FAQ below!

    FAQ

    Click here to expand the FAQ section

    How often will the code scanning analysis run?

    By default, code scanning will trigger a scan with the CodeQL engine on the following events:

    • On every pull request — to flag up potential security problems for you to investigate before merging a PR.
    • On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.
    • Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

    What will this cost?

    Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

    What types of problems does CodeQL find?

    The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

    How do I upgrade my CodeQL engine?

    No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

    The analysis doesn’t seem to be working

    If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

    How do I disable LGTM.com?

    If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

    Which source code hosting platforms does code scanning support?

    GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

    How do I know this PR is legitimate?

    This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

    I have another question / how do I get in touch?

    Please join the discussion here to ask further questions and send us suggestions!

    opened by lgtm-com[bot] 0
  • Implementation of a one-level PGM based on the Eytzinger array

    Since this structure implies a binary search between levels, we thought we could replace the regular array with one optimized for cache-efficient queries, which accordingly speeds up the binary search. We also added types for a single-level PGM and a BPGM with this optimization. In this version we store both arrays, since after rearranging the data it is impossible to iterate over it in the desired order.
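
    For context, below is a minimal generic sketch of the Eytzinger (BFS-order) layout and a lower-bound search over it; this is the textbook technique the change builds on, not the code of this pull request (__builtin_ffsll is a GCC/Clang builtin).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Recursively place a sorted array into a 1-indexed BFS (Eytzinger) layout.
    void to_eytzinger(const std::vector<uint64_t> &sorted, std::vector<uint64_t> &out,
                      std::size_t &i, std::size_t k = 1) {
        if (k < out.size()) {
            to_eytzinger(sorted, out, i, 2 * k);      // left subtree
            out[k] = sorted[i++];                     // current node
            to_eytzinger(sorted, out, i, 2 * k + 1);  // right subtree
        }
    }

    // Return the Eytzinger index of the smallest element >= x, or 0 if none exists.
    std::size_t eytzinger_lower_bound(const std::vector<uint64_t> &t, uint64_t x) {
        std::size_t k = 1;
        while (k < t.size())
            k = 2 * k + (t[k] < x);                             // comparison-driven descent
        k >>= __builtin_ffsll(~static_cast<long long>(k));      // undo the trailing right turns
        return k;
    }

    // Usage: out has size sorted.size() + 1 (slot 0 is unused);
    //        std::size_t i = 0; to_eytzinger(sorted, out, i);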

    help wanted 
    opened by RomaA2000 2
  • Add support for Windows?

    On Visual Studio 2019, the code does not build:

    $ cl -I../include /O2 /permissive- /std:c++17 simple.cpp
    Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29337 for x64
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    simple.cpp
    Compilation with -fopenmp is optional but recommended
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C4235: nonstandard extension used: '__int128' keyword not supported on this architecture
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C2059: syntax error: '>'
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C2062: type 'unknown-type' unexpected
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(40): error C2631: 'OptimalPiecewiseLinearModel': a class or enum cannot be defined in an alias template
    ../include\pgm/pgm_index.hpp(24): error C2143: syntax error: missing ';' before '{'
    ../include\pgm/pgm_index.hpp(24): error C2447: '{': missing function header (old-style formal list?)
    simple.cpp(24): error C2039: 'PGMIndex': is not a member of 'pgm'
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(32): note: see declaration of 'pgm'
    simple.cpp(24): error C2065: 'PGMIndex': undeclared identifier
    simple.cpp(24): error C2062: type 'int' unexpected
    simple.cpp(28): error C2065: 'index': undeclared identifier
    simple.cpp(31): error C2672: 'lower_bound': no matching overloaded function found
    simple.cpp(31): error C2893: Failed to specialize function template '_FwdIt std::lower_bound(_FwdIt,_FwdIt,const _Ty &)'
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\xutility(5871): note: see declaration of 'std::lower_bound'
    simple.cpp(31): note: With the following template arguments:
    simple.cpp(31): note: '_FwdIt=auto'
    simple.cpp(31): note: '_Ty=auto'
    simple.cpp(31): error C2780: '_FwdIt std::lower_bound(_FwdIt,const _FwdIt,const _Ty &,_Pr)': expects 4 arguments - 3 provided
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\xutility(5849): note: see declaration of 'std::lower_bound'
    simple.cpp(35): fatal error C1004: unexpected end-of-file found
    
    enhancement 
    opened by e4lam 0
  • Embed into other languages

    Hey, thanks for the library. Is there any possibility of a C version of this? Then it would be much easier to embed in other languages. A C wrapper would also be a good compromise.

    enhancement good first issue 
    opened by flip111 6
  • implicit interval tree

    I'm curious what would be required to support an interval tree in the PGM.

    I see that the type K used in the index must be numeric. Perhaps this could be relaxed?

    enhancement good first issue 
    opened by ekg 1
Owner
Giorgio Vinciguerra
PhD Student in Computer Science