Simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.

Overview

Sparkey is a simple constant key/value storage library. It is mostly suited for read-heavy systems with infrequent large bulk inserts. It includes both a C library for working with sparkey index and log files (libsparkey), and a command line utility for getting info about and reading values from a sparkey index/log (sparkey).

Travis

Continuous integration with Travis.


Dependencies

  • GNU build system (autoconf, automake, libtool)
  • Snappy

Optional

  • Doxygen (for generating the API documentation)

Building

autoreconf --install
./configure
make
make check

API documentation can be generated with doxygen.

Installing

sudo make install && sudo ldconfig

Related projects

  • sparkey-java - a Java implementation of Sparkey (see Limitations below)

Description

Sparkey is an extremely simple persistent key-value store. You could think of it as a read-only hashtable on disk and you wouldn't be far off. It is designed and optimized for some server-side use cases at Spotify, but it is written to be completely generic and makes no assumptions about what kind of data is stored.

Some key characteristics:

  • Supports data sizes up to 2^63 - 1 bytes.
  • Supports iteration, get, put and delete.
  • Optimized for bulk writes.
  • Immutable hash table.
  • Any number of concurrent independent readers.
  • Only allows one writer at a time per storage unit.
  • Cross-platform storage file.
  • Low overhead per entry.
  • Constant read startup cost.
  • Low number of disk seeks per read.
  • Support for block-level compression.
  • Data agnostic: it just maps byte arrays to byte arrays.

What it's not:

  • It's not a distributed key-value store - it's just a hash table on disk.
  • It's not a compacted data store, but that can be implemented on top of it, if needed.
  • It's not robust against data corruption.

The use case we have for it at Spotify is serving data that rarely gets updated to users or other services. The fast and efficient bulk writes make it feasible to periodically rebuild the data, and the fast random access reads make it suitable for high throughput, low latency services. For some services we have been able to saturate network interfaces while keeping CPU usage very low.

Limitations

The hash writing process requires a memory allocation of num_entries * 16 * 1.3 bytes. This means that you may run out of memory if you try to write a hash index for too many entries. For instance, with 16 GB of available RAM you can write roughly 825 million entries.
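
For a quick back-of-the-envelope check of that limit, the formula can be evaluated directly. This is only a trivial sketch restating the numbers above (the 1.3 factor is the default hash density described later in this README):

#include <stdint.h>
#include <stdio.h>

/* Approximate RAM needed to build the hash index for n entries:
 * n slots * 16 bytes per slot * 1.3 density factor. */
static double hash_build_bytes(uint64_t n) {
  return (double) n * 16.0 * 1.3;
}

int main(void) {
  /* ~825 million entries -> about 17.2e9 bytes, i.e. roughly 16 GiB. */
  printf("%.1f GiB\n",
         hash_build_bytes(825000000ULL) / (1024.0 * 1024.0 * 1024.0));
  return 0;
}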

This limitation has been removed in sparkey-java, but it has not yet been implemented in this version.

Usage

Sparkey is meant to be used as a library embedded in other software. Take a look at the API documentation, which gives examples of how to use it.
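
Since the API documentation is not reproduced here, the following is a minimal, hedged sketch of the typical write-then-read flow. It only uses calls that also appear in the issue reports further down this page; the file names, the 4-byte hash size and the CHECK macro are illustrative choices, not recommendations:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sparkey/sparkey.h>

/* Abort on any non-successful return code (illustrative error handling). */
#define CHECK(call) do { \
  if ((call) != SPARKEY_SUCCESS) { fprintf(stderr, "%s failed\n", #call); exit(1); } \
} while (0)

int main(void) {
  sparkey_logwriter *writer = NULL;
  sparkey_logreader *reader = NULL;
  sparkey_logiter *iter = NULL;

  /* 1. Bulk-append key/value pairs to an uncompressed log file. */
  CHECK(sparkey_logwriter_create(&writer, "data.spl", SPARKEY_COMPRESSION_NONE, 0));
  CHECK(sparkey_logwriter_put(writer, 4, (uint8_t *) "key1", 6, (uint8_t *) "value1"));
  CHECK(sparkey_logwriter_put(writer, 4, (uint8_t *) "key2", 6, (uint8_t *) "value2"));
  CHECK(sparkey_logwriter_close(&writer));

  /* 2. Build the immutable hash index (here with 4-byte hash values). */
  CHECK(sparkey_hash_write("data.spi", "data.spl", 4));

  /* 3. Read back: open the log, create an iterator and fetch the first key. */
  CHECK(sparkey_logreader_open(&reader, "data.spl"));
  CHECK(sparkey_logiter_create(&iter, reader));
  CHECK(sparkey_logiter_next(iter, reader));

  uint64_t wanted = sparkey_logiter_keylen(iter);
  uint64_t actual = 0;
  uint8_t *buf = malloc(wanted + 1);
  if (!buf) exit(1);
  CHECK(sparkey_logiter_fill_key(iter, reader, wanted, buf, &actual));
  buf[actual] = '\0';
  printf("first key: %s\n", buf);
  free(buf);

  sparkey_logiter_close(&iter);
  sparkey_logreader_close(&reader);
  return 0;
}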

License

Apache License, Version 2.0

Design

Sparkey uses two files on disk to store its data. The first is the sparkey log file (.spl), which is simply a sequence of key value pairs. This is an append-only file. You're not allowed to modify it in the middle, and you can't use more than one writer to append to it.

The other file is the sparkey index file (.spi), which is just a hashtable pointing at entries in the log. This is an immutable file, so you would typically only update it once you're done with your bulk appends.

Doing a random lookup involves first finding the proper entry in the hashtable, and then doing a seek to the right offset in the log file. On average, this means two disk seeks per access for a cold disk cache. If you mlock the index file, it goes down to one seek. For some of our use cases, the total data set is less than the available RAM, so it makes sense to mlock everything.

The advantage of having two files instead of just one (another solution would be to append the hash table at the end) is that it's trivial to mlock one of the files and not the other. It also enables us to append more data to existing log files, even after they're already in use.
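
As a concrete illustration of the mlock idea, pinning the index file in memory could look roughly like the sketch below. These are plain POSIX calls, not part of the Sparkey API, and the error handling is deliberately minimal:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and pin its pages in RAM so lookups touching it never
 * have to wait for the disk. Returns the mapping, or NULL on failure. */
static void *pin_file(const char *path, size_t *size_out) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return NULL;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return NULL; }
  void *p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);  /* the mapping stays valid after closing the descriptor */
  if (p == MAP_FAILED) return NULL;
  if (mlock(p, (size_t) st.st_size) != 0)
    perror("mlock");  /* may fail due to RLIMIT_MEMLOCK; the mapping still works */
  *size_out = (size_t) st.st_size;
  return p;
}

Pinning only the .spi file keeps the hash table resident; if the whole data set fits in RAM, the same can be done for the .spl file.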

History

Sparkey is the product of hackdays at Spotify, where our developers get to spend some of their time on anything they think is interesting.

We have several use cases where we need to serve large amounts of static data with high throughput and low latency. To do this, we've built our own services, backed by various storage systems. Our flow consists of first generating large static storage files in our offline systems, which then get pushed out to the user-facing services to serve the data.

The storage solutions we used for that have all served us well for a time, but they had limitations that became problematic.

  • We used to rely a lot on CDB (which is a really great piece of software). It performed blazingly fast and produced compact files. We only stopped using it when our data started growing close to the 4 GB limit.
  • We also used (and still use) Tokyo Cabinet for a bunch of use cases. It performs really well for reading, but the write throughput really suffers when you can no longer keep the entire dataset in memory, and there were issues with opening the same file multiple times from the same process.

We needed a key-value store with the following characteristics:

  • random read throughput comparable to Tokyo Cabinet and CDB.
  • high throughput bulk writes.
  • low overhead.
  • high limit on data size.

For fun, we started hacking on a new key-value store during one of those hackdays, and the result is this project.

Performance

A very simple benchmark program is included - see src/bench.c. The program is designed to be easily extended to measure other key-value stores if anyone wants to. Running it on a production-like server (Intel(R) Xeon(R) CPU L5630 @ 2.13GHz), we get the following:

Testing bulk insert of 1000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     0.00
    creation time (cpu):      0.00
    throughput (puts/cpusec): 1098272.88
    file size:                28384
    lookup time (wall):          0.50
    lookup time (cpu):           0.58
    throughput (lookups/cpusec): 1724692.62

Testing bulk insert of 1000.000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     0.50
    creation time (cpu):      0.69
    throughput (puts/cpusec): 1448618.25
    file size:                34177984
    lookup time (wall):          1.00
    lookup time (cpu):           0.78
    throughput (lookups/cpusec): 1284477.75

Testing bulk insert of 10.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     7.50
    creation time (cpu):      7.73
    throughput (puts/cpusec): 1294209.62
    file size:                413777988
    lookup time (wall):          1.00
    lookup time (cpu):           0.99
    throughput (lookups/cpusec): 1014608.94

Testing bulk insert of 100.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     82.00
    creation time (cpu):      81.58
    throughput (puts/cpusec): 1225726.75
    file size:                4337777988
    lookup time (wall):          2.00
    lookup time (cpu):           1.98
    throughput (lookups/cpusec): 503818.84

Testing bulk insert of 1000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     0.00
    creation time (cpu):      0.00
    throughput (puts/cpusec): 1101445.38
    file size:                19085
    lookup time (wall):          3.50
    lookup time (cpu):           3.30
    throughput (lookups/cpusec): 303335.78

Testing bulk insert of 1000.000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     0.50
    creation time (cpu):      0.75
    throughput (puts/cpusec): 1333903.25
    file size:                19168683
    lookup time (wall):          3.00
    lookup time (cpu):           2.91
    throughput (lookups/cpusec): 343833.28

Testing bulk insert of 10.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     8.50
    creation time (cpu):      8.50
    throughput (puts/cpusec): 1176634.25
    file size:                311872187
    lookup time (wall):          3.00
    lookup time (cpu):           2.99
    throughput (lookups/cpusec): 334490.22

Testing bulk insert of 100.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     90.50
    creation time (cpu):      90.46
    throughput (puts/cpusec): 1105412.00
    file size:                3162865465
    lookup time (wall):          3.50
    lookup time (cpu):           3.60
    throughput (lookups/cpusec): 277477.41

File format details

Log file format

The log file starts with a constant-size header describing some metadata about the file. After that is simply a sequence of entries, where each entry consists of a type, a key and a value.

Each entry begins with two Variable Length Quantity (VLQ) non-negative integers, A and B. The type is determined by A. If A = 0, the entry is a DELETE, and B is the length of the key to delete. If A > 0, the entry is a PUT, the key length is A - 1, and the value length is B.

(It gets slightly more complex if block level compression is used, but we'll ignore that for now.)
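
To make the header encoding concrete, here is a hedged sketch of how the two integers at the start of an uncompressed entry could be decoded. The 7-bits-per-byte, low-bits-first layout is an assumption about the particular VLQ flavour; the A/B interpretation is exactly the rule stated above:

#include <stddef.h>
#include <stdint.h>

/* Decode one VLQ-encoded non-negative integer from buf, assuming the common
 * little-endian base-128 form (7 payload bits per byte, high bit set on every
 * byte except the last). Returns the number of bytes consumed. */
static size_t vlq_decode(const uint8_t *buf, uint64_t *out) {
  uint64_t value = 0;
  int shift = 0;
  size_t i = 0;
  for (;;) {
    uint8_t b = buf[i++];
    value |= (uint64_t)(b & 0x7f) << shift;
    if ((b & 0x80) == 0) break;
    shift += 7;
  }
  *out = value;
  return i;
}

/* A == 0: DELETE with key length B. A > 0: PUT with key length A - 1 and
 * value length B. The key and value bytes follow directly after the header. */
static void parse_entry_header(const uint8_t *buf, int *is_put,
                               uint64_t *keylen, uint64_t *valuelen) {
  uint64_t a, b;
  size_t n = vlq_decode(buf, &a);
  vlq_decode(buf + n, &b);
  if (a == 0) { *is_put = 0; *keylen = b; *valuelen = 0; }
  else        { *is_put = 1; *keylen = a - 1; *valuelen = b; }
}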

Hash file format

The hash file starts with a constant-size header, similar to the log file. The rest of the file is a hash table, represented as capacity * slotsize bytes. The capacity is simply an upper bound of the number of live entries multiplied by a hash density factor > 1.0; the default implementation uses a density factor of 1.3.

Each slot consists of two parts: the hash value and the address. The size of the hash value is either 4 or 8 bytes, depending on the hash algorithm: it currently uses murmurhash32 if the number of entries is small, and a 64 bit truncation of murmurhash128 if the number of entries is large. The address is simply a reference into the log file, either 4 or 8 bytes depending on the size of the log file. That means the slotsize is usually 16 bytes for any reasonably large set of entries. By storing the hash value itself in each slot we waste some space, but in return we can expect to avoid visiting the log file in most cases.
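
As a rough illustration of the sizing described above (not the library's actual header layout; the cut-off points used here are assumptions), the slot width and table size could be computed like this:

#include <stdint.h>
#include <stdio.h>

/* 4-byte hashes for small entry counts, 8-byte hashes otherwise; the
 * threshold used here is an assumption, not the library's real one. */
static uint32_t hash_size(uint64_t num_entries) {
  return num_entries < (1u << 23) ? 4 : 8;
}

/* 4-byte addresses if the log file fits in 32 bits, 8-byte otherwise. */
static uint32_t addr_size(uint64_t log_file_size) {
  return log_file_size < (1ull << 32) ? 4 : 8;
}

int main(void) {
  uint64_t num_entries = 100000000ull;            /* 100 million live entries */
  uint64_t log_size = 5000000000ull;              /* 5 GB log file */
  double density = 1.3;                           /* default density factor */

  uint64_t capacity = (uint64_t)(num_entries * density);
  uint32_t slotsize = hash_size(num_entries) + addr_size(log_size);
  printf("capacity=%llu slotsize=%u table bytes=%llu\n",
         (unsigned long long) capacity, slotsize,
         (unsigned long long) (capacity * slotsize));
  return 0;
}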

Hash lookup algorithm

One of the few non-trivial parts of Sparkey is the way it does hash lookups. With hashtables there is always a risk of collisions: even if the hash values themselves don't collide, the assigned slots may.

(It recently came to my attention that the method described below is basically the same thing as Robin Hood hashing with backward shift deletion.)

Let's define displacement as the distance from the calculated optimal slot for a given hash to the slot it's actually placed in. Distance in this case is defined as the number of steps you need to move forward from your optimal slot to reach the actual slot.

The trivial and naive solution is to start with an empty hash table, move through the entries and put each one in the first available slot, starting from its optimal slot - and this is almost what we do.

If we consider the average displacement, we can't really do better than that. We can however minimize the maximum displacement, which gives us some nice properties:

  • We can store the maximum displacement in the header, so we have an upper bound on traversals. We could possibly even use this information to binary search for the entry.
  • As soon as we reach an entry with higher displacement than the thing we're looking for, we can abort the lookup.

It's very easy to set up the hash table like this; we just need to do insertions into slots instead of appends. As soon as we reach a slot whose entry has a smaller displacement than our own, we shift that slot and all following slots (up to the first empty slot) forward by one step and insert our own element in the freed slot.
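
In code, a hedged sketch of this insertion step might look as follows. It works on simplified in-memory slots (hash value 0 is treated as "empty" purely for the sketch, and the table is assumed to never be full thanks to the density factor), and it is not the library's actual implementation. The worked example below traces the same procedure by hand:

#include <stdint.h>

typedef struct {
  uint64_t hash;    /* 0 means "empty" in this sketch */
  uint64_t offset;  /* log file offset of the entry */
} slot;

/* Displacement of an entry with this hash if it sits in slot i. */
static uint64_t displacement(uint64_t hash, uint64_t i, uint64_t capacity) {
  uint64_t optimal = hash % capacity;
  return (i + capacity - optimal) % capacity;
}

static void insert(slot *table, uint64_t capacity, uint64_t hash, uint64_t offset) {
  uint64_t i = hash % capacity;
  uint64_t disp = 0;

  /* Walk forward until we hit an empty slot or a resident entry with a
   * strictly smaller displacement than ours. */
  while (table[i].hash != 0 &&
         displacement(table[i].hash, i, capacity) >= disp) {
    i = (i + 1) % capacity;
    disp++;
  }

  /* Shift the contiguous run of occupied slots starting at i forward by one
   * step (wrapping around), then claim slot i for our own entry. */
  if (table[i].hash != 0) {
    uint64_t end = i;
    while (table[end].hash != 0) end = (end + 1) % capacity;  /* first empty */
    while (end != i) {
      uint64_t prev = (end + capacity - 1) % capacity;
      table[end] = table[prev];
      end = prev;
    }
  }
  table[i].hash = hash;
  table[i].offset = offset;
}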

Let's illustrate this with an example. We start off with an empty hash table with a capacity of 7:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 |            |            |
       +------------+------------+
slot 4 |            |            |
       +------------+------------+
slot 5 |            |            |
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

We add the key "key0" which happens to end up in slot 3, h("key0") % 7 == 3. The slot is empty, so this is trivial:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 |            |            |
       +------------+------------+
slot 5 |            |            |
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

Now we add "key1" which happens to end up in slot 4:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key1)    | 11         |  4               0
       +------------+------------+
slot 5 |            |            |
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

Now we add "key2" which also wants to be in slot 3. This is a conflict, so we skip forward until we found a slot which has a lower displacement than our current displacement. When we find that slot, all following entries until the next empty slot move down one step:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key2)    | 21         |  3               1
       +------------+------------+
slot 5 | h(key1)    | 11         |  4               1
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

Let's add "key3" which maps to slot 5. We can't push down key1, because it already has displacement 1 and our current displacement for key3 is 0, so we have to move forward:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key2)    | 21         |  3               1
       +------------+------------+
slot 5 | h(key1)    | 11         |  4               1
       +------------+------------+
slot 6 | h(key3)    | 31         |  5               1
       +------------+------------+

Adding "key4" for slot 3. It ends up in slot 5 with displacement 2 and key3 loops around to slot 0:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 | h(key3)    | 31         |  5               2
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key2)    | 21         |  3               1
       +------------+------------+
slot 5 | h(key4)    | 41         |  3               2
       +------------+------------+
slot 6 | h(key1)    | 11         |  4               2
       +------------+------------+

Now, if we search for key123, which maps to slot 3 (but doesn't exist!), we can stop scanning as soon as we reach slot 6, because at that point our current displacement (3) is higher than the displacement of the entry in that slot (2).
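
A hedged sketch of the corresponding lookup, reusing the same simplified slot layout and displacement helper as the insertion sketch above. The real library must additionally read the log entry at the returned offset and compare the actual key, since different keys can share a hash value:

#include <stdint.h>

typedef struct { uint64_t hash; uint64_t offset; } slot;  /* as above */

static uint64_t displacement(uint64_t hash, uint64_t i, uint64_t capacity) {
  return (i + capacity - hash % capacity) % capacity;
}

/* Returns 1 and the log offset of the first candidate slot for this hash,
 * or 0 if the key cannot be present in the table. */
static int lookup(const slot *table, uint64_t capacity, uint64_t hash,
                  uint64_t *offset_out) {
  uint64_t i = hash % capacity;
  uint64_t disp = 0;
  while (table[i].hash != 0) {
    if (table[i].hash == hash) {
      *offset_out = table[i].offset;
      return 1;
    }
    /* Early abort: this resident is closer to its optimal slot than we are
     * to ours, so our entry cannot be stored at or after this slot. */
    if (displacement(table[i].hash, i, capacity) < disp) return 0;
    i = (i + 1) % capacity;
    disp++;
  }
  return 0;  /* reached an empty slot: not present */
}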

Compression

Sparkey also supports block-level compression using Google Snappy. You select a block size, which is then used to split the contents of the log into blocks. Each block is compressed independently with Snappy. This can be useful if your bottleneck is file size and there is a lot of redundant data across adjacent entries. The downside is that during lookups, at least one block needs to be decompressed. The larger the blocks you choose, the better compression you may get, but the higher the lookup cost. This is a tradeoff that needs to be empirically evaluated for each use case.
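
To make the tradeoff concrete, here is a hedged sketch of the general idea: split the log payload into fixed-size blocks and compress each one independently with Snappy's C bindings. This is not Sparkey's actual on-disk framing, just an illustration of why only the block(s) covering an entry have to be decompressed on lookup:

#include <stdio.h>
#include <stdlib.h>
#include <snappy-c.h>

/* Compress `len` bytes of payload in independent blocks of `block_size`
 * bytes. Each block can later be decompressed on its own. */
static int compress_blocks(const char *data, size_t len, size_t block_size) {
  for (size_t off = 0; off < len; off += block_size) {
    size_t in_len = (len - off < block_size) ? (len - off) : block_size;
    size_t out_len = snappy_max_compressed_length(in_len);
    char *out = malloc(out_len);
    if (!out) return -1;
    if (snappy_compress(data + off, in_len, out, &out_len) != SNAPPY_OK) {
      free(out);
      return -1;
    }
    /* A real writer would now emit out_len followed by the compressed bytes. */
    printf("block at %zu: %zu -> %zu bytes\n", off, in_len, out_len);
    free(out);
  }
  return 0;
}

int main(void) {
  const char payload[] = "key1value1key2value2key2value2";  /* redundant data compresses well */
  return compress_blocks(payload, sizeof(payload) - 1, 16);
}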

Issues
  • Completion of error handling

    I have looked at a few source files for your current software. I have noticed that some checks for return codes are missing.

    Would you like to add more error handling for return values from functions like the following?

    opened by elfring 12
  • Improve hash algorithm description in README

    I downloaded the source code and read it today. But I still have some difficulty understanding how the hash algorithm works. Could you write more description about the algorithm in this project to help us learn faster? A more detailed design might be better. Thanks a lot.

    opened by liangdong 10
  • Add zstd support

    Zstd is a fast compressor with a better compression ratio than Snappy. It's been gaining support pretty quickly! This PR adds a "compressor" abstraction to Sparkey, and adds Zstd as an option.

    Tests and benchmarks seem to indicate this works well. Data is about 60% of the size of Snappy data, but lookups are between 30-70% slower on 4K blocks—which makes sense, Zstd usually targets larger block sizes. For my use case, that's a very worthwhile tradeoff, I'm limited on I/O much more than CPU.

    I also have a Java port: https://github.com/vasi/sparkey-java/tree/zstd

    It might be annoying to some folks that this now requires Zstd. We could make it optional, but then someone's build of Sparkey may or may not be able to open a given file, so it's a trade-off. Happy to defer to you on this.

    opened by vasi 8
  • sparkey_logiter_reset doesn't reset

    I may have misinterpreted the documentation, but it doesn't look like sparkey_logiter_reset actually resets the iterator.

    Test case:

    
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <sparkey/sparkey.h>
    #include <assert.h>
    
    #define SPARKEY_ASSERT(rc) ({  \
      if (SPARKEY_SUCCESS != rc) { \
        fprintf(                   \
            stderr                 \
          , "error: %s (%d)\n"     \
          , sparkey_errstring(rc)  \
          , __LINE__               \
        );                         \
        exit(1);                   \
      }                            \
    });
    
    int
    main(void) {
      sparkey_logwriter *writer = NULL;
      sparkey_logreader *reader = NULL;
      sparkey_logiter *iterator = NULL;
      const char *key1 = "key1";
      const char *value1 = "value1";
      size_t key1size = strlen(key1);
      size_t value1size = strlen(value1);
      const char *key2 = "key2";
      const char *value2 = "value2";
      size_t key2size = strlen(key2);
      size_t value2size = strlen(value2);
      uint64_t wanted;
      uint64_t actual;
      uint8_t *buffer = NULL;
    
      // create a log
      SPARKEY_ASSERT(sparkey_logwriter_create(
          &writer
        , "test.spl"
        , SPARKEY_COMPRESSION_NONE
        , 0
      ));
    
      // write some stuff
      SPARKEY_ASSERT(sparkey_logwriter_put(
          writer
        , key1size
        , (uint8_t *) key1
        , value1size
        , (uint8_t *) value1
      ));
      SPARKEY_ASSERT(sparkey_logwriter_put(
          writer
        , key2size
        , (uint8_t *) key2
        , value2size
        , (uint8_t *) value2
      ));
    
      SPARKEY_ASSERT(sparkey_logwriter_close(&writer));
    
      SPARKEY_ASSERT(sparkey_logreader_open(&reader, "test.spl"));
      SPARKEY_ASSERT(sparkey_logiter_create(&iterator, reader));
    
      // get first key
      SPARKEY_ASSERT(sparkey_logiter_next(iterator, reader));
      wanted = sparkey_logiter_keylen(iterator);
      assert((buffer = malloc(wanted)));
      SPARKEY_ASSERT(sparkey_logiter_fill_key(
          iterator
        , reader
        , wanted
        , buffer
        , &actual
      ));
    
      printf("buffer: %s\n", buffer);
      assert(0 == strcmp("key1", (char *) buffer));
      free(buffer);
    
      // reset iterator
      SPARKEY_ASSERT(sparkey_logiter_reset(iterator, reader));
    
      // get key again
      SPARKEY_ASSERT(sparkey_logiter_next(iterator, reader));
      wanted = sparkey_logiter_keylen(iterator);
      assert((buffer = malloc(wanted)));
      SPARKEY_ASSERT(sparkey_logiter_fill_key(
          iterator
        , reader
        , wanted
        , buffer
        , &actual
      ));
    
      printf("buffer: %s (after reset)\n", buffer);
      assert(0 == strcmp("key1", (char *) buffer));
      free(buffer);
    
      // cleanup
      sparkey_logiter_close(&iterator);
      sparkey_logreader_close(&reader);
      free(buffer);
    
      return 0;
    }
    

    Yields:

    $ gcc -lsparkey reset.c -o reset -Wall -Wextra
    $ ./reset
    buffer: key1
    buffer: key2 (after reset)
    Assertion failed: (0 == strcmp("key1", (char *) buffer)), function main, file reset.c, line 98.
    Abort trap: 6
    
    opened by stephenmathieson 6
  • Non-exact seeks?

    Assuming I have two keys in my store: a and c. I can get exact keys, but would it be possible to seek for the next match?

    Example:

    sparkey_hash_seek(myreader, (uint8_t*)"a", 1, myiter); // exact match, same as sparkey_hash_get
    sparkey_hash_seek(myreader, (uint8_t*)"b", 1, myiter); // cannot find "b", so advance to next live key: "c" 
    

    I am not sure if/how this could work with the slot traversal algorithm, but this is quite a useful feature when e.g. looking up ranges, etc.

    Thanks

    opened by dim 5
  • Converting multiple log-files into a hash-file in bulk

    I'm looking to do parallel bulk writes.

    Since log-files can only be written sequentially, a solution might be to write a log-file per concurrent process and then merge them. Merging the log-files into one can be done quite efficiently, but it might make more sense to allow converting multiple log-files into a hash-file directly at sparkey_hash_write.

    It would save the extra work of merging that can only be done with limited parallelism, plus it would for instance allow to spread out the log-files among multiple disks to improve write speeds (and read speeds for creating the hash table).

    If more than one log-file is passed to sparkey_hash_write, might that also allow to do part of the process in parallel? At least reading the log-files and (re-)hashing the keys.

    I'd like to hear the maintainer's opinion on it.

    EDIT: I see that the hash-table doesn't contain the key-values but it addresses the log-file. That means that multiple log-files (in multiple disks) might also improve general read performance.

    opened by oersted 4
  • sparkey_hash_write returns successfully when given an invalid hash size

    I know this is a bit edge-casey, but when clobbering an existing hash and providing an invalid hash size, sparkey_hash_write incorrectly returns SPARKEY_SUCCESS.

    Test case:

    
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <sparkey/sparkey.h>
    #include <assert.h>
    
    #define check(rc) ({  \
      if (SPARKEY_SUCCESS != rc) { \
        fprintf(stderr, "error: %s (L#%d)\n", sparkey_errstring(rc), __LINE__); \
        exit(1); \
      } \
    });
    
    int
    main(void) {
      sparkey_logwriter *writer = NULL;
      check(sparkey_logwriter_create(&writer, "test.spl", SPARKEY_COMPRESSION_NONE, 0));
      check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key1", 7, (uint8_t *) "value1"));
      check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key2", 7, (uint8_t *) "value2"));
      check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key3", 7, (uint8_t *) "value3"));
    
      // write a hash with size 4
      check(sparkey_hash_write("test.spi", "test.spl", 4));
      // write a hash with invalid size
      check(sparkey_hash_write("test.spi", "test.spl", 3456789));
    
      assert(0 && "wrote test.spi with an invalid size :(");
    
      return 0;
    }
    
    opened by stephenmathieson 4
  • libsparkey.so.0: cannot open shared object file

    I just installed sparkey (commit ec80630f) on my local machine (Ubuntu 12.04) as described in the README file:

    $ git clone git@github.com:spotify/sparkey.git
    $ cd sparkey
    $ autoreconf --install
    $ ./configure
    $ make
    $ make check
    $ sudo make install
    

    But if I want to execute sparkey, I get the following error:

    $ sparkey
    sparkey: error while loading shared libraries: libsparkey.so.0:
      cannot open shared object file: No such file or directory
    

    Can someone please tell me where sparkey is searching for the file? It is available in /usr/local/lib:

    $ ls -la /usr/local/lib/ | grep sparkey
    -rw-r--r-- 1 root root  79842 Sep  3 16:02 libsparkey.a
    -rwxr-xr-x 1 root root    974 Sep  3 16:02 libsparkey.la
    lrwxrwxrwx 1 root root     19 Sep  3 16:02 libsparkey.so -> libsparkey.so.0.0.0
    lrwxrwxrwx 1 root root     19 Sep  3 16:02 libsparkey.so.0 -> libsparkey.so.0.0.0
    -rwxr-xr-x 1 root root  50900 Sep  3 16:02 libsparkey.so.0.0.0
    

    I also tried to copy/symlink the files to /usr/lib but this didn't work.

    opened by janpieper 3
  • OS X automake failure.

    Hello,

    Thanks for the very interesting library. However, "clock_gettime" is not available on OS X, so configuration fails on this system. I think it is missing on Windows too. The "clock_gettime" API could be emulated using gettimeofday or mach_absolute_time. Maybe building the benchmark app could be optional?

    I'd suggest moving this project to CMake, as this would simplify the "configure" step and make integration into other projects easier.

    opened by mpapierski 3
  • Tag a Release

    Hey @spkrka,

    I know that you've not worked on this repo for a while and are working on a full Java version now, but I was wondering if you could tag a release at whatever latest point you feel is reasonable. I'd like to use this as a dependency for a project, but I'm hesitant (out of habit) to simply refer to HEAD or a specific commit hash.

    Thanks!

    opened by haneefmubarak 2
  • make failure

    Under Ubuntu 12.04 with gcc 4.6.3, when I ran make I got the following message:

    bench.c:253:3: error: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘size_t’ [-Werror=format]

    I fixed it by changing

    printf("    file size:                %ld\n", total_file_size(c->files()));

    to

    printf("    file size:                %ld\n", (long int) total_file_size(c->files()));
    
    opened by orikoon 2
  • Build breaks with newer versions of GCC

    Changing the .travis.yml to use newer versions of Ubuntu installs newer versions of gcc, which have implemented additional compilation checks. Various files fail various checks. See here: https://travis-ci.org/github/mvanwyk/sparkey/builds/725252565

    opened by mvanwyk 0
  • Comparison to NuDB?

    I've been polishing up my NuDB key/value database library, which we're using in rippled (https://github.com/ripple/rippled), so I'm looking at other key-value systems for comparison. We were very unhappy with the performance of RocksDB for our use case (which admittedly is rather unique). Interestingly, sparkey comes the closest in terms of implementation to NuDB. They both use an append-only data file and a key file which implements an on-disk hash table. One difference is that NuDB uses linear hashing so that buckets reach 50% occupancy. When a bucket gets full, it is "spilled" over into the data file (that bucket then becomes a linked list of buckets).

    I'd be very interested in your feedback on NuDB's design in comparison to sparkey. I'd like to hear about problems you encountered with other key value stores and the data access pattern of your software. I'm curious to know how NuDB compares for your use case, or if there are any techniques in sparkey that I could adopt. The repo is here: https://github.com/vinniefalco/NuDB

    Thanks!

    opened by vinniefalco 6
  • Problem creating hash indices larger than available RAM

    Currently the hash writing requires malloc of the entire index size. If this is larger than available memory, the operation fails.

    We have a use case requiring more than 1.65 billion entries, which results in a 32 GB+ hash index. This is larger than the available memory on the machine that builds it.

    We could replace the malloc with mmapping the file itself, but that would lead to a lot of random reads and writes from disk, which would be horribly slow.

    I have an idea to reduce this limitation, but no time to do it at the moment.

    1. Write all hash entries to some sort of log file (64 bit hash value, 64 bit data offset).
    2. Calculate hash capacity (num_entries * factor)
    3. Sort the entries in the file by hash value % capacity. (This can be done efficiently without having everything in RAM)
    4. Mmap the large index file
    5. Iterate over the sorted hash entries and write them to the mmapped index. This should be efficient since the mmapped writes will happen with good locality and in sequential order. Except for the last values which may wrap around to the beginning of the file, but that's a small part of the data.

    For sorting the large set of entries, split it into reasonably small chunks and sort those in memory. Then run a tree of sequential file merges.

    opened by spkrka 7