­čĆůState-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes

Overview

The PGM-index

The Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.

Website | Documentation | Paper | Slides | Python wrapper | A┬│ Lab

GitHub Workflow Status Language grade: C/C++ License GitHub stars GitHub forks Run on Repl.it

Quickstart

This is a header-only library. It does not need to be installed. Just clone the repo with

git clone https://github.com/gvinciguerra/PGM-index.git
cd PGM-index

and copy the include/pgm directory to your system's or project's include path.

The examples/simple.cpp file shows how to index and query a vector of random integers with the PGM-index:

#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include "pgm/pgm_index.hpp"

int main() {
    // Generate some random data
    std::vector<int> data(1000000);
    std::generate(data.begin(), data.end(), std::rand);
    data.push_back(42);
    std::sort(data.begin(), data.end());

    // Construct the PGM-index
    const int epsilon = 128; // space-time trade-off parameter
    pgm::PGMIndex<int, epsilon> index(data);

    // Query the PGM-index
    auto q = 42;
    auto range = index.search(q);
    auto lo = data.begin() + range.lo;
    auto hi = data.begin() + range.hi;
    std::cout << *std::lower_bound(lo, hi, q);

    return 0;
}

Run and edit this and other examples on Repl.it. Or run it locally via:

g++ examples/simple.cpp -std=c++17 -I./include -o simple
./simple

Classes overview

Other than the pgm::PGMIndex class in the example above, this library provides the following classes:

  • pgm::DynamicPGMIndex supports insertions and deletions.
  • pgm::MultidimensionalPGMIndex stores points in k dimensions and supports orthogonal range queries.
  • pgm::MappedPGMIndex stores data on disk and uses a PGMIndex for fast search operations.
  • pgm::CompressedPGMIndex compresses the segments to reduce the space usage of the index.
  • pgm::OneLevelPGMIndex uses a binary search on the segments rather than a recursive structure.
  • pgm::BucketingPGMIndex uses a top-level lookup table to speed up the search on the segments.
  • pgm::EliasFanoPGMIndex uses a top-level succinct structure to speed up the search on the segments.

The full documentation is available here.

Compile the tests and the tuner

After cloning the repository, build the project with

cmake . -DCMAKE_BUILD_TYPE=Release
make -j8

The test runner will be placed in test/. The tuner executable will be placed in tuner/.

License

This project is licensed under the terms of the Apache License 2.0.

If you use the library please put a link to the website and cite the following paper:

Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8): 1162-1175, 2020.

@article{Ferragina:2020pgm,
  Author = {Paolo Ferragina and Giorgio Vinciguerra},
  Title = {The {PGM-index}: a fully-dynamic compressed learned index with provable worst-case bounds},
  Year = {2020},
  Volume = {13},
  Number = {8},
  Pages = {1162--1175},
  Doi = {10.14778/3389133.3389135},
  Url = {https://pgm.di.unipi.it},
  Issn = {2150-8097},
  Journal = {{PVLDB}}}
Issues
  • Application of PGM-index

    Application of PGM-index

    Hi, this is a interesting work! Thanks for your contribution.

    But I have some problems:

    1. in the examples, we can see it trains a map function from "Key" (the value of key) to its stored positions. If we really want to use it in real example: the user want to access the value based on the key. Then how values should be organzed with PGM-index? In other words, many (key, value) pairs are stored in a large array and sorted by key (maybe a value is a pointer or offset, etc. It suggests the length of key and value is almost fixed). How to make full use of PGM-index to search a key and return the corresponding value? I believe it should find the position, and then use (position + the length of key) to return the value. However, PGM-index seems need significant upgrade to support this function. Is there any other solutions?

    2. In the paper, it seems "works for any kind of keys that can be mapped to reals while preserving their order. Examples include integers, strings, etc." is very difficult to understand and hard to make implementation. My reasons are given below:

      • issue one: the function of map is to map a key (string) to a real, and then use the sorted reals to construct PGM-index and find the offset? Why not I directly map the key to its offset or value. Since the mapping between keys and reals is acceptable, then mapping between keys and offsets (values) should also be acceptable.
      • issue two: One important thing is that how string (key type in may DB) can be support in current system. As you can see, a string could have unlimited length. It should be very hard to bound the errors, especially when many short fixed-length strings mixed with a few long strings appears. I also refer to the #8, this hint is very nice, but other issues come out: map string directly to uint128_t at bit-level (maybe) can solve the issue one, but it caused: (1) the max length of string is 24 (?) (2) the wasted space, said some strings have 19 chars, and it can only use uint128_t to do the map. some spaces is wasted then.
    question 
    opened by authwork 17
  • Add support for a vector of int as a key

    Add support for a vector of int as a key

    Hello! @gvinciguerra With the growing big data, sometimes we may need a vector of integers as the key to index some values. Currectly support is to use a hash function to combine such a vector of integer into a uint128_t number.

    Is there any possible to extend the build-in ML model (f(x) = kx + b). For example, [x1, x2, x3, ..., xn], build a ML model f(x1, x2, x3, ..., xn) = c0 + c1x1 + c2x2 + ... + cnxn?

    In this way, it could become more general. Currently, I use a multiple-level PGM index (#26) to implement this, but seems much more overhead (size, time, etc) than the single-level.

    Many thanks.

    question 
    opened by authwork 12
  • PGMIndex return incorrect range on books dataset

    PGMIndex return incorrect range on books dataset

    Hi,

    for some values of epsilon, PGM returns incorrect ranges on the books dataset from SOSD.

    Dataset: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/A6HDNT&version=10.0

    Code to reproduce the issue:

    #include <vector>
    #include <fstream>
    #include <iostream>
    
    #include "pgm/pgm_index.hpp"
    
    
    template<typename Key>
    std::vector<Key> load_data(const std::string &filename) {
        using key_type = Key;
    
        /* Open file. */
        std::ifstream in(filename, std::ios::binary);
        if (!in.is_open())
            exit(EXIT_FAILURE);
    
        /* Read number of keys. */
        uint64_t n_keys;
        in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));
    
        /* Initialize vector. */
        std::vector<key_type> data;
        data.resize(n_keys);
    
        /* Read keys. */
        in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
        in.close();
    
        return data;
    }
    
    int main(int argc, char *argv[])
    {
        /* Load keys. */
        auto keys = load_data<uint64_t>("data/books_200M_uint64");
        std::cout << "keys loaded" << std::endl;
    
        /* Build PGM. */
        const int epsilon = 2048;
        pgm::PGMIndex<uint64_t, epsilon> pgm(keys);
        std::cout << "pgm built" << std::endl;
    
        /* Lookup all keys. */
        for (std::size_t i = 0; i != keys.size(); ++i) {
            auto key = keys.at(i);
            auto range = pgm.search(key);
            if (range.lo > i or range.hi < i) {
                std::cout << "key not in range\n"
                          << "key: " << key << '\n'
                          << "pos: " << i << '\n'
                          << "lo: " << range.lo << '\n'
                          << "hi: " << range.hi << '\n'
                          << std::endl;
            }
        }
    }
    

    Output for epsilon=2048:

    ...
    key not in range
    key: 880841552717461696
    pos: 47236183
    lo: 47236184
    hi: 47240282
    
    key not in range
    key: 880842439537799360
    pos: 47236231
    lo: 47236232
    hi: 47240330
    

    Output for epsilon=16384:

    ...
    key not in range
    key: 1154823330085936896
    pos: 61887138
    lo: 61887141
    hi: 61919911
    
    key not in range
    key: 1154823338493908224
    pos: 61887139
    lo: 61887141
    hi: 61919911
    
    key not in range
    key: 1154823341621615232
    pos: 61887140
    lo: 61887141
    hi: 61919911
    
    opened by marcelmaltry 11
  • multi-level PGM-index and its size

    multi-level PGM-index and its size

    @gvinciguerra Hello, I have done a experiment to build multi-level PGM-index. Why? I use it to support very large integer, like a 256-bit long integer. Assume I have a uint256_t num, I set num[0] is a uint128_t with high 128 bits of num, num[1] is also a uint128_t with low 128 bits of num. To index with PGM-index

    typedef PGM-index<uint128_t, value> type1;
    typedef PGM-index<uint128_t, type1 *> type2;
    

    for insert. I will first insert a type1 variable into type2, then insert the real value to the corresponding type1, such as:

    type2 index2nd;
    auto pp = index2nd.find(num[0]);
    if(pp == index2nd.end()){
        type1 *pp2 = new type1();
        index2nd.insert_or_assign(num[0], pp2);
        pp2->insert_or_assign(num[1], value);
    }
    else{
        pp.second->insert_or_assign(num[1], value);
    }
    

    To calculate the size of the whole index:

    uint64_t size = index2nd.size_in_bytes();
    for (auto &e : index2nd) {
          size += e.second->size_in_bytes();
    }
    

    Does size correctly calculate the memory size of index2nd?

    question 
    opened by authwork 6
  • What does this adjustment do?

    What does this adjustment do?

    https://github.com/gvinciguerra/PGM-index/commit/e346995724dd12a7b0b5c039ec3475fe097cb54d added code that appears to add 1 to the end of a run of 2 or more equal values, so long as that wouldn't result in the new value creating or extending a run. But the title is "small fixes" and there isn't any comment. Can you help me understand what this is for?

    documentation question 
    opened by taralx 6
  • Scalability issue

    Scalability issue

    Hello,@gvinciguerra. I have done a comparison betweem PGM-index with unordered_map (hash map) by using

    pgm::DynamicPGMIndex<__uint128_t, A> d_index; // A is a structure with multiple uint64_t
    for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
    	A a;
    	a.a = (uint64_t)i;
    	a.b = (uint64_t)2*i;
    	a.c = (uint64_t)3*i;
    	d_index.insert_or_assign(i, a);
    }
    A b;
    A *c = &b;
    for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
    	auto iter = d_index.find(i);
    	*c = iter->second;
    	std::cout << b.a << " " << b.b << " " << b.c << std::endl;
    }
    

    The memory consumption of DynamicPGMIndex can achieve 98% and it will cause:

    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    Aborted
    

    while the case using unordered_map only has very low memory consumption. Theoretical speaking, Hash map should use much more memory (pointers in linked list, RB tree etc) Why this experiment show DynamicPGMIndex uses more memory? And how we solve it?

    Another issue is mentioned in #17 How to achieve dynamic insertable set by using PGM index?

    Many thanks

    question 
    opened by authwork 6
  • Use of GPL code is incompatible with Apache License

    Use of GPL code is incompatible with Apache License

    The sdsl library is licensed under the GPL v3 license. You cannot use code with that license in this library without explicit permission from the owner of that library as the terms are incompatible.

    opened by lederernc 5
  • Multiple-level Dynamic PGM-index

    Multiple-level Dynamic PGM-index

    @gvinciguerra I have done a test among unordered_map, ALEX, and pgm_index. The result is very strange

    It costs 932159
    Hash table size is 175731056
    It costs 8694683
    alex size is 1575900636
    Killed # pgm test for too large memory consumption
    

    Is there any suggestion for designing Multiple-level Dynamic PGM-index? Many thanks in advance. If you would like to have a try, please use

    g++ -std=c++17 -fopenmp -I./PGM-index/include/ -I./ALEX/src/ bench.cpp -march=native -Wall -Wextra
    
    #include <iostream>
    #include <sys/random.h>
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <string>
    #include <time.h>
    #include <vector>
    #include <iostream>
    #include <algorithm>
    #include <unordered_map>
    #include <sys/time.h>
    #include "pgm/pgm_index.hpp"
    #include "pgm/pgm_index_dynamic.hpp"
    #include "pgm/pgm_index_variants.hpp"
    #include "core/alex.h"
    
    struct KKK {
        uint64_t a;
        uint32_t b, c, d;
    };
    
    struct VVV{
    	uint64_t a, b, c;
    };
    
    typedef pgm::DynamicPGMIndex<uint32_t, VVV> u32P;
    typedef pgm::DynamicPGMIndex<uint64_t, u32P*> u64PGM2;
    typedef pgm::DynamicPGMIndex<uint64_t, u64PGM2*> u64PGM1;
    
    void pgm_test(std::vector<KKK> key, std::vector<VVV> value){
    	u64PGM1 table;
    	int number = key.size();
    	struct timeval start, end;
    	gettimeofday(&start, NULL);
    
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1);
    		if(iter == table.end()){
    			table.insert_or_assign(k1, new u64PGM2());
    			iter = table.find(k1);
    		}
    		auto p2 = iter->second;
    		auto iter1 = p2->find(k2);
    		if(iter1 == p2->end()){
    			p2->insert_or_assign(k2, new u32P());
    			iter1 = p2->find(k2);
    		}
    		iter1->second->insert_or_assign(k3, value[i]);
    	}
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1)->second->find(k2)->second->find(k3);
    		VVV b = iter->second;
    		if (b.a != value[i].a || b.b != value[i].b
    				|| b.c != value[i].c) {
    			printf("failed\n");
    			return;
    		}
    	}
    	gettimeofday(&end, NULL);
    	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
    			+ (end.tv_usec - start.tv_usec);
    	std::cout << "It costs " << time_len << std::endl;
    	uint64_t size = table.size_in_bytes();
    
    	for (auto &e : table) {
    		size += e.second->size_in_bytes();
    		for(auto &e1 : *(e.second)){
    			size += e1.second->size_in_bytes();
    		}
    	}
    	printf("pgm size is %lld\n", size);
    }
    
    typedef alex::Alex<uint32_t, VVV> u32A;
    typedef alex::Alex<uint64_t, u32A*> u64P2;
    typedef alex::Alex<uint64_t, u64P2*> u64P1;
    
    void alex_test(std::vector<KKK> key,
    		std::vector<VVV> value) {
    	u64P1 table;
    	int number = key.size();
    	struct timeval start, end;
    	gettimeofday(&start, NULL);
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1);
    		if (iter == table.end()) {
    			table.insert(k1, new u64P2());
    			iter = table.find(k1);
    		}
    		auto p2 = iter.payload();
    		auto iter1 = p2->find(k2);
    		if (iter1 == p2->end()) {
    			p2->insert(k2, new u32A());
    			iter1 = p2->find(k2);
    		}
    		iter1.payload()->insert(k3, value[i]);
    	}
    	for (int i = 0; i < number; i++) {
    		uint64_t k1 = key[i].a;
    		uint64_t k2 = ((uint64_t)key[i].b) << 32 + (uint64_t)key[i].c;
    		uint32_t k3 = key[i].d;
    		auto iter = table.find(k1).payload()->find(k2).payload()->find(k3);
    		VVV b = iter.payload();
    		if (b.a != value[i].a || b.b != value[i].b
    				|| b.c != value[i].c) {
    			printf("failed\n");
    			return;
    		}
    	}
    	gettimeofday(&end, NULL);
    	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
    			+ (end.tv_usec - start.tv_usec);
    	std::cout << "It costs " << time_len << std::endl;
    
    	uint64_t size = table.data_size() + table.model_size();
    
    	for (auto e = table.begin(); e != table.end(); e++) {
    		auto t2 = e.payload();
    		size += t2->data_size();
    		size += t2->model_size();
    		for(auto e1 = t2->begin(); e1 != t2->end(); e1++){
    			size += e1.payload()->data_size();
    			size += e1.payload()->model_size();
    		}
    	}
    	printf("alex size is %lld\n", size);
    }
    
    struct TupleHasher {
        std::size_t
        operator()(const KKK &key) const {
            return key.a;
        }
    };
    
    struct TupleEqualer {
        bool operator()(const KKK &lhs, const KKK &rhs) const {
            return (lhs.a == rhs.a) && (lhs.b == rhs.b) && (lhs.c == rhs.c) && (lhs.d == rhs.d);
        }
    };
    
    void hashmap_test(std::vector<KKK> key,
    		std::vector<VVV> value) {
    	std::unordered_map<KKK, VVV, TupleHasher, TupleEqualer> table;
    	int number = key.size();
    	struct timeval start, end;
    	gettimeofday(&start, NULL);
    	for (int i = 0; i < number; i++) {
    		table[key[i]] = value[i];
    	}
    	for (int i = 0; i < number; i++) {
    		VVV b = table[key[i]];
    		if (b.a != value[i].a || b.b != value[i].b
    				|| b.c != value[i].c) {
    			printf("failed\n");
    			return;
    		}
    	}
    	gettimeofday(&end, NULL);
    	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
    			+ (end.tv_usec - start.tv_usec);
    	std::cout << "It costs " << time_len << std::endl;
    	uint64_t size = 0;
    
    	for (unsigned i = 0; i < table.bucket_count(); ++i) {
    		size_t bucket_size = table.bucket_size(i);
    		if (bucket_size == 0) {
    			size += sizeof(void*);
    		} else {
    			size += bucket_size * (sizeof(KKK) + sizeof(VVV));
    			size += (bucket_size - 1) * sizeof(void*);
    		}
    	}
    
    	printf("Hash table size is %lld\n", size);
    }
    
    int main(){
    	std::vector<KKK> x_key;
    	std::vector<VVV> x_value;
    	for(int i=0; i<3199807; i++){
    		KKK a;
    		VVV b;
                   // you can also generate them randomly
    		a.a = i >> 96;
    		a.b = i >> 64;
    		a.c = i >> 32;
    		a.d = i;
    		b.a = 1 * i;
    		b.b = 2 * i;
    		b.c = 3 * i;
    		x_key.push_back(a);
    		x_value.push_back(b);
    	}
    	hashmap_test(x_key, x_value);
    	alex_test(x_key, x_value);
    	pgm_test(x_key, x_value);
    }
    
    duplicate invalid 
    opened by authwork 4
  • Logic error: Points must be increasing by x.

    Logic error: Points must be increasing by x.

    Hello! Got another one for you :)

    Dataset: https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7

    PGMIndex<uint64_t, 16, 4> and PGMIndex<uint64_t, 32, 4> both build without an issue, but PGMIndex<uint64_t, 64, 4> fails:

    terminate called after throwing an instance of 'std::logic_error'                                                  
      what():  Points must be increasing by x.                                                                                                           
    

    Backtrace:

    (gdb) bt                                                                                                           
    #0  0x00007ffff7a87ce5 in raise () from /usr/lib/libc.so.6                                                         
    #1  0x00007ffff7a71857 in abort () from /usr/lib/libc.so.6                                                         
    #2  0x00007ffff7e2c81d in __gnu_cxx::__verbose_terminate_handler ()                                                
        at /build/gcc/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95                                                  
    #3  0x00007ffff7e392ca in __cxxabiv1::__terminate (handler=<optimized out>)                                        
        at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48                                                
    #4  0x00007ffff7e39337 in std::terminate () at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58        
    #5  0x00007ffff7e3959e in __cxxabiv1::__cxa_throw ([email protected]=0x555555b29420, 
        tinfo=0x555555b04098 <typeinfo for std::[email protected]@GLIBCXX_3.4>, 
        dest=0x7ffff7e4f630 <std::logic_error::~logic_error()>)
        at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
    #6  0x00005555557ad18e in OptimalPiecewiseLinearModel<unsigned long, unsigned long>::add_point (
        [email protected]=0x7fffffffd9b0, [email protected]: 0, [email protected]: 8942)
        at /usr/include/c++/9.3.0/bits/stl_iterator.h:806
    #7  0x00005555557bdeeb in make_segmentation<PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1, auto:2, auto:3)#3}>(unsigned long, unsigned long, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}) ([email protected]=8943, [email protected]=4, in=..., out=..., [email protected]=...)
        at /usr/include/c++/9.3.0/bits/stl_pair.h:378
    #8  0x00005555557bee24 in PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > > (last=0, first=..., 
        this=0x555555b2dea0) at /usr/include/c++/9.3.0/bits/stl_vector.h:1040
    #9  PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex (
        data=std::vector of length 200000000, capacity 268435456 = {...}, this=0x555555b2dea0)
        at /home/ryan/SOSD-private/competitors/PGM-index/include/pgm_index.hpp:113
    #10 std::make_unique<PGMIndex<unsigned long, 64ul, 4ul, double>, std::vector<unsigned long, std::allocator<unsigned
     long> >&> () at /usr/include/c++/9.3.0/bits/unique_ptr.h:857
    #11 PGM<unsigned long, 64>::Build (data=..., this=0x7fffffffdc30)
        at /home/ryan/SOSD-private/dtl/../competitors/pgm_index.h:23
    
    
    opened by RyanMarcus 3
  • Recommendations for large list of strings?

    Recommendations for large list of strings?

    I tried to convert my 125 million domains to a unique set of integers but the integer values exceeded the max for 64bit ints. Does anyone know a way to solve this? Maybe something obvious I am not seeing.

    question 
    opened by mySYSMON 3
  • Implementation of a one-level PGM based on the Eytzinger array

    Implementation of a one-level PGM based on the Eytzinger array

    Since this structure implies the presence of binary search between levels, we thought that we could replace a regular array with one that is optimized for cache queries, accordingly, this speeds up binary search. We also added types for a single-level PGM and a BPGM with this optimization. In this version, we store both arrays, since after transferring the data over them, it is impossible to iterate in the desired order.

    help wanted 
    opened by RomaA2000 1
  • Add support for Windows?

    Add support for Windows?

    On Visual Studio 2019, the code does not build:

    $ cl -I../include /O2 /permissive- /std:c++17 simple.cpp
    Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29337 for x64
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    simple.cpp
    Compilation with -fopenmp is optional but recommended
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C4235: nonstandard extension used: '__int128' keyword not supported on this architecture
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C2059: syntax error: '>'
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C2062: type 'unknown-type' unexpected
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(40): error C2631: 'OptimalPiecewiseLinearModel': a class or enum cannot be defined in an alias template
    ../include\pgm/pgm_index.hpp(24): error C2143: syntax error: missing ';' before '{'
    ../include\pgm/pgm_index.hpp(24): error C2447: '{': missing function header (old-style formal list?)
    simple.cpp(24): error C2039: 'PGMIndex': is not a member of 'pgm'
    C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(32): note: see declaration of 'pgm'
    simple.cpp(24): error C2065: 'PGMIndex': undeclared identifier
    simple.cpp(24): error C2062: type 'int' unexpected
    simple.cpp(28): error C2065: 'index': undeclared identifier
    simple.cpp(31): error C2672: 'lower_bound': no matching overloaded function found
    simple.cpp(31): error C2893: Failed to specialize function template '_FwdIt std::lower_bound(_FwdIt,_FwdIt,const _Ty &)'
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\xutility(5871): note: see declaration of 'std::lower_bound'
    simple.cpp(31): note: With the following template arguments:
    simple.cpp(31): note: '_FwdIt=auto'
    simple.cpp(31): note: '_Ty=auto'
    simple.cpp(31): error C2780: '_FwdIt std::lower_bound(_FwdIt,const _FwdIt,const _Ty &,_Pr)': expects 4 arguments - 3 provided
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\xutility(5849): note: see declaration of 'std::lower_bound'
    simple.cpp(35): fatal error C1004: unexpected end-of-file found
    
    enhancement 
    opened by e4lam 0
  • Embed into other languages

    Embed into other languages

    Hey thanks for the library. Is there any possibility for a C version of this? Then it would be much easier to embed in other languages. A C-wrapper would also be a good compromise.

    enhancement good first issue 
    opened by flip111 6
  • Postgres?

    Postgres?

    This sounds exciting. Is it going to be implemented in a DB like Postgres soon?

    enhancement 
    opened by GeoffreyPlitt 3
  • implicit interval tree

    implicit interval tree

    I'm curious what would be required to support an interval tree in the PGM.

    I see that the type K used in the index must be numeric. Perhaps this could be relaxed?

    enhancement good first issue 
    opened by ekg 1
Owner
Giorgio Vinciguerra
PhD Student in Computer Science
Giorgio Vinciguerra
C++ implementation of a fast hash map and hash set using robin hood hashing

A C++ implementation of a fast hash map and hash set using robin hood hashing The robin-map library is a C++ implementation of a fast hash map and has

Thibaut Goetghebuer-Planchon 579 Jul 27, 2021
C++ implementation of a fast hash map and hash set using hopscotch hashing

A C++ implementation of a fast hash map and hash set using hopscotch hashing The hopscotch-map library is a C++ implementation of a fast hash map and

Thibaut Goetghebuer-Planchon 479 Jul 22, 2021
A fast, memory efficient hash map for C++

I now recommend using the parallel hashmap instead of sparsepp, unless if you are stuck with a non C++11 compatible compiler, or if using a little bit

Gregory Popovitch 1.1k Jul 23, 2021
A c++ toolbox of locality-sensitive hashing (LSH), provides several popular LSH algorithms, also support python and matlab.

LSHBOX-0.9 A C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval, Also Support Python and MATLAB. Change Log 2015.07.04 A new LS

null 251 Jul 6, 2021
Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20

Ô×Á robin_hood unordered map & set robin_hood::unordered_map and robin_hood::unordered_set is a platform independent replacement for std::unordered_map

Martin Ankerl 849 Jul 24, 2021
ring-span lite - A C++yy-like ring_span type for C++98, C++11 and later in a single-file header-only library

ring-span lite: A circular buffer view for C++98 and later Contents Example usage In a nutshell Dependencies Installation Synopsis Reported to work wi

Martin Moene 102 Jul 19, 2021
Simple Useful Libraries: The C++17 header-only dynamic bitset

dynamic_bitset Simple Useful Libraries: The C++17 header-only dynamic bitset Requirements To use this dynamic bitset, you will need a C++17 compliant

Maxime Pinard 67 Jul 14, 2021
Retro coding in C/C++ in a 3D template with full low-level control.

BRIEF INFO ON THE 2021 WORLD TEMPLATE Project website: https://jacco.ompf2.com/voxel-world-template/ Purpose: This template has been designed to make

Jacco Bikker 59 Jun 12, 2021
A drop-in replacement for std::list with 293% faster insertion, 57% faster erasure, 17% faster iteration and 77% faster sorting on average. 20-24% speed increase in use-case testing.

plf_list A drop-in replacement for std::list with (on average): 293% faster insertion 57% faster erasure 17% faster iteration 77% faster sorting 70% f

Matt Bentley 105 Jul 19, 2021
Template Library of Tree Data Structures in C++17

Template Library of Tree Data Structures in C++17 Implementations AVL Tree Binary Search Tree BTree KD-Tree Splay Tree Trie Notes This project is for

George Fotopoulos 149 Feb 8, 2021
Prime Number Projects in C#/C++/Python

Primes | A Software Drag Race Source code to Dave's Garage video benchmarking the same prime number sieve in Python, C#, and C++. Community contributi

null 2 Jul 25, 2021