RemixDB: A read- and write-optimized concurrent KV store. Fast point and range queries. Extremely low write-amplification.

Overview

REMIX and RemixDB

The REMIX data structure was introduced in paper "REMIX: Efficient Range Query for LSM-trees", FAST'21.

This repository maintains a reference implementation of the REMIX index data structure, as well as a thread-safe embedded key-value store implementation, namely RemixDB. It compiles on recent Linux/FreeBSD/MacOS and supports x86_64 and AArch64 CPUs.

Optimization: Minimizing REMIX (Re-)Building Cost

This implementation employs an optimization to minimize the REMIX building cost. This optimization improves the throughput by 2x (0.96MOPS vs. 0.50MOPS) in a random-write experiment, compared to the implementation described in the REMIX paper. Configuration: klen=16; vlen=120; 2.02 billion KVs; 256GB valid KV data; single-threaded loading in random order; no compression.

When creating a new table file, RemixDB can create a copy of all the keys in the table. Specificially, it encodes all the keys (without values) in sorted order using prefix compression, which creates a Compressed Keys Block (CKB). The CKB is stored at the end of the table file. This feature can be freely turned on and off. There is no compatibility issue when tables with and without the CKB are used together.

When creating a new REMIX, the building process will check if every input table contains a CKB. If true, the process will build the new REMIX using these CKBs. It also leverages the existing REMIX to avoid unncecssary key comparisons. In this way, the new REMIX will be created by reading the old REMIX and the CKBs, without accessing the key-value data blocks of the table files.

In a running system the old REMIX structures are usually cache-resident. The CKBs are only used for REMIX building, which are read into memory in batch, and discarded once the building is finished.

A CKB is often much smaller than the original key-value data block, unless the system manages huge keys with small values. Suppose the average CKB size is 10% of the average key-value data block size, this optimization trades 10% more write I/O and storage space usage for a 90% reduction of read I/O during REMIX building.

remixdb_open opens/creates a remixdb with the optimization turned on. Each newly created sstable will have the CKB. You should use remixdb_open unless it's absolutely necessary to save a little bit disk space. remixdb_open_compact opens a remixdb with the optimization turned off. Each newly created sstable will not contain a CKB. A store created by one of these functions can be safely opened by the other function.

Limitations of the Current Implementation

  • KV size: The maximum key+value size is capped at 65500 bytes. This roughly corresponds to the 64KB block size limit. TODO: store huge KV pairs in a separate file and store the file address of the KV pair in RemixDB.

Configuration and Tuning

CPU affinity

RemixDB employs background threads to perform asynchronous compaction. When possible (on Linux or FreeBSD), these threads are pinned on specific cores for efficiency. To avoid interferences with the foreground threads, it is necessary to separate the cores used by different threads. By default, RemixDB pins 4 compaction threads on the last four cores of the current process's affinity list. For example, on a machine with two 10-core processors, cores 0,2,4,...,16,18 belong to numa node 0, and the rest cores belong to numa node 1. The default behavior is to use the cores from 16 to 19, which is a suboptimal setup. To avoid the performance penalty, one should set environment variable XDB_CPU_LIST=core1,core2,... to specify the cores used by the background threads. There can be up to four compaction threads. The default number is four. A preferred setup for the machine mentioned above would be like this:

$ export XDB_CPU_LIST=12,14,16,18
$ numactl -C 0,2,4,6 ./xdbdemo.out

Maximum number of open files

The current implementation keeps every table file open at run time. This requires a large nofile in /etc/security/limits.conf. For example, add * - nofile 100000 to limits.conf, reboot/relogin, and double-check with ulimit -n.

Maximum Table File Size

MSSTZ_NBLKS (sst.c) controls the maximum number of 4KB blocks in an SST file. The default number is 20400. The maximum value is 65520 (256MB data blocks, plus metadata).

Hugepages

Configuring huge pages can effectively improve RemixDB's performance. Usually a few hundred 2MB hugepages would be sufficient for memory allocation in MemTables. The block cache automatically detects and uses 1GB huge pages when available (otherwise, fall back to 2MB pages, and then 4KB pages). 4x 1GB huge pages should be configured if you set cache size to 4GB.

Getting Started

RemixDB by default uses liburing (io_uring) and thus requires a Linux kernel >= 5.1. It also works with POSIX AIO on all the supported platforms but the performance can be negatively affected.

clang is the default compiler. It usually produces faster code than GCC. To use GCC:

$ make CCC=gcc

jemalloc is highly recommended. If jemalloc is available and you prefer to use it, use M=j with make:

$ make M=j

Similarly, tcmalloc can be linked with M=t.

The xdbdemo.c contains sample code that uses the remixdb_* functions. These functions present a clean programming interface without using special data types or structures.

xdbdemo

To compile and run the demo code:

$ make M=j xdbdemo.out
$ ./xdbdemo.out

xdbtest

xdbtest is a stress test program that uses the remixdb_* functions. It trys to use all the available cores on the affinity list, which can lead to mediocre performance. You should use numactl to specify what cores are available for the tester threads. Suppose you have eight cores (0...7) in total, the best practice is to let the testers to run on the first four cores and assign the last four cores to the compaction threads. The following examples use this configuration.

Run with a 4GB block cache, 4GB MemTables, and a dataset with 32 million KVs (2^25), performing 1 million updates in each round (2^20):

$ make M=j xdbtest.out
$ export XDB_CPU_LIST=4,5,6,7
$ numactl -C 0,1,2,3 ./xdbtest.out /tmp/xdbtest 4096 25 20

To run with smaller memory footprint (a 256MB block cache, 256MB Memtables, and 1 million KVs):

$ export XDB_CPU_LIST=4,5,6,7
$ numactl -C 0,1,2,3 ./xdbtest.out /tmp/xdbtest 256 20 20

This setup consumes up to 850MB memory (RSS) and 1.8GB space in /tmp/xdbtest.

A first run of xdbtest.out should always show stale=0. If you run it again without deleting /tmp/xdbtest, it will show non-zero stale numbers at the beginning but it will quickly drop and eventually reach zero.

xdbexit

xdbexit is a simple program testing crash-recovery. It inserts some new keys and calls remixdb_sync() to make all buffered data persist in the WAL. Then it immediately calls exit() without doing any clean-up. Run it repeatedly. In each run it should show that all the previously inserted KVs are found.

Run with a small footprint:

$ for i in $(seq 1 30); do ./xdbexit.out ./dbdir 256 256; done

Run with in a regular-sized setup:

$ for i in $(seq 1 30); do ./xdbexit.out ./dbdir 4096 4096; done

libremixdb.so

To use remixdb as a shared library, run make libremixdb.so and make install. A PKGBUILD (for Archlinux's pacman) is included as an example packaging script.

You might also like...
Ntfs-3g - NTFS-3G Safe Read/Write NTFS Driver

INTRODUCTION ============ The NTFS-3G driver is an open source, freely available read/write NTFS driver for Linux, FreeBSD, macOS, NetBSD, OpenIndia

This is a set of utilities that allow you to read, write or erase SPI flash chips using a Raspberry Pi Pico (RP2040) chip.

Pico SPI Utilities This is a set of utilities that allow you to read, write or erase SPI flash chips using a Raspberry Pi Pico (RP2040) chip. While th

Android PoC to read/write Huawei's NVME image

hisi-nve Android PoC to read/write Huawei's NVME image Disclaimers Use this tool at your own risk and always backup NVME. This tool was made for educa

CC2500 Low-Cost Low-Power 2.4 GHz RF Transceiver driver for esp-idf
CC2500 Low-Cost Low-Power 2.4 GHz RF Transceiver driver for esp-idf

esp-idf-cc2500 CC2500 Low-Cost Low-Power 2.4 GHz RF Transceiver driver for esp-idf. I ported from this. 2.00mm pitch External Antena 1.27mm pitch PCB

Treexy is a library that implements a compact hierarchical data structure that can store and manipulate volumetric data, discretized on a three-dimensional grid
Treexy is a library that implements a compact hierarchical data structure that can store and manipulate volumetric data, discretized on a three-dimensional grid

Treexy is a library that implements a compact hierarchical data structure that can store and manipulate volumetric data, discretized on a three-dimens

Repository created to store a C function library to use in 42 School
Repository created to store a C function library to use in 42 School

Libft of 42. Make with ❤︎ for Luiz Cezario 📌 Index What's this Repo? List of Functions Technologies How to Run Find a Bug? Or somenthing need to chan

This is where I store my CS homework in BUPT. Also synced with my own gitea instance

BUPT-Homework Introduction This is my repository for the BUPT-Homework project. Structure Root directory: semesters problem sets solutions to the prob

Opt - Class template designed to express optionality without having to sacrifice memory to store additional bool flag

mp::optT, Policy mp::optT, Policy is a class template designed to express optionality. It has interface similar to std::optionalT (see here) but

Comments
  • compile error

    compile error

    First of all, thank you for sharing your works : )

    I tried to compile with clang-12 and kernel version 5.12.10 , however, I still see errors

    [root@localhost remixdb]# make M=j xdbdemo.out
    xdbdemo.out <= xdbdemo.c lib.c kv.c wh.c blkio.c sst.c xdb.c 
    clang -std=gnu18   -march=native -mtune=native -pthread -Wall -Wextra -Wshadow  -fno-builtin-memcpy -fno-builtin-memmove -fno-builtin-memcmp -DNDEBUG -g3 -O3 -flto -fno-stack-protector -ferror-limit=3  -rdynamic -o xdbdemo.out xdbdemo.c lib.c kv.c wh.c blkio.c sst.c xdb.c  -lm -ljemalloc -lrt
    lib.c:710:59: error: use of undeclared identifier 'MAP_HUGE_SHIFT'
      return pages_do_alloc(sz, MAP_PRIVATE | MAP_ANONYMOUS | PAGES_FLAGS_1G);
                                                              ^
    lib.c:696:47: note: expanded from macro 'PAGES_FLAGS_1G'
    #define PAGES_FLAGS_1G ((MAP_HUGETLB | (30 << MAP_HUGE_SHIFT)))
                                                  ^
    lib.c:724:59: error: use of undeclared identifier 'MAP_HUGE_SHIFT'
      return pages_do_alloc(sz, MAP_PRIVATE | MAP_ANONYMOUS | PAGES_FLAGS_2M);
                                                              ^
    lib.c:697:47: note: expanded from macro 'PAGES_FLAGS_2M'
    #define PAGES_FLAGS_2M ((MAP_HUGETLB | (21 << MAP_HUGE_SHIFT)))
                                                  ^
    2 errors generated.
    make: *** [Makefile.common:180: xdbdemo.out] Error 1
    
    

    And I add header in lib.h

    #include <asm-generic/mman-common.h>
    

    and new errors:

    [root@localhost remixdb]# make M=j xdbdemo.out
    xdbdemo.out <= xdbdemo.c lib.c kv.c wh.c blkio.c sst.c xdb.c 
    clang -std=gnu18   -march=native -mtune=native -pthread -Wall -Wextra -Wshadow  -fno-builtin-memcpy -fno-builtin-memmove -fno-builtin-memcmp -DNDEBUG -g3 -O3 -flto -fno-stack-protector -ferror-limit=3  -rdynamic -o xdbdemo.out xdbdemo.c lib.c kv.c wh.c blkio.c sst.c xdb.c  -lm -ljemalloc -lrt
    lib.c:2015:10: error: always_inline function '_pext_u32' requires target feature 'bmi2', but would be inlined into function 'm128_movemask_u16' that is compiled without support for 'bmi2'
      return _pext_u32((u32)_mm_movemask_epi8(v), 0xaaaau);
             ^
    1 error generated.
    make: *** [Makefile.common:180: xdbdemo.out] Error 1
    
    opened by AnonymousUsersX 6
  • Aborts with wsl2

    Aborts with wsl2

    The current Makefile compiles with liburing enabled. But liburing is not implemented in wsl2. When running remixdb tests in wsl2 it will get error 38 ENOSYS (Function not implemented), which will crash with a stack trace printed out.

    Possible workarounds include running in a different environment or manually disabling liburing in the build system or source code.

    bug 
    opened by wuxb45 1
  • Benchmark Results

    Benchmark Results

    YCSB benchmark results as of 7/14/2021. 16B keys, 120B values, 2.02 billion keys, no compression; 4GB block cache (user space), max 4GB memtables, max 8GB WALs. The loading phase only uses one user thread.

    Screenshot-20210714165559-1514x680

    wontfix 
    opened by wuxb45 0
Owner
Xingbo Wu
cache miss
Xingbo Wu
AMD K6-2 (CXT) / K6-2+ / K6-3 / K6-3+ Write Allocate / Write Combine / Write Ordering / Frequency Multiplier Initialization driver for MS-DOS

K6INIT What is this? This is a driver for MS-DOS to replace k6dos.sys which is a bit useless and unflexible. It does not support the CXT versions of t

null 10 Sep 11, 2022
A C++ concepts and range based character encoding and code point enumeration library

Travis CI (Linux:gcc) Text_view A C++ Concepts based character encoding and code point enumeration library. This project is the reference implementati

Tom Honermann 121 Sep 9, 2022
An extremely basic Python script to split word-based data into high and low byte files.

An extremely basic Python script to split word-based data into high and low byte files. This is for use in programming 16 bit computer ROMs.

null 2 Dec 2, 2022
Code for the paper Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform (SBWT)

SBWT This is the code for the paper Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform (SBWT). The

Algorithmic Bioinformatics Group @ University of Helsinki 14 Jun 18, 2022
SX1276/77/78/79 Low Power Long Range Transceiver driver for esp-idf

esp-idf-sx127x SX1276/77/78/79 Low Power Long Range Transceiver driver for esp-idf. I based on this. Changes from the original Added support for ESP32

null 20 Jan 4, 2023
SX1262//68 Low Power Long Range Transceiver driver for esp-idf

esp-idf-sx126x SX1262//68 Low Power Long Range Transceiver driver for esp-idf. I ported from here. Ai-Thinker offers several LoRa modules. You can get

null 3 May 9, 2022
The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.

Wren is a small, fast, class-based concurrent scripting language Think Smalltalk in a Lua-sized package with a dash of Erlang and wrapped up in a fami

Wren 6.1k Dec 30, 2022
dwm is an extremely fast, small, and dynamic window manager for X.

dwm - dynamic window manager dwm is an extremely fast, small, and dynamic window manager for X. My Patches This is in the order that I patched everyth

Christian Chiarulli 32 Dec 23, 2022
Haxe native extension to read and write windows clipboard.

Haxe Clipboard This is a native library to read and write clipboard data from Haxe. It uses Ammer to generate bindings. Note: This is a Windows only l

Ludovic Bas 12 Nov 11, 2022
This software brings you the possibility to Read and Write the internal Flash of the Nordic nRF52 series with an ESP32

ESP32 nRF52 SWD flasher This software brings you the possibility to Read and Write the internal Flash of the Nordic nRF52 series with an ESP32 using t

null 140 Dec 31, 2022