RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

Overview

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat and Jeff Dean.

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/ and https://rocksdb.slack.com/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.

Comments
  • Memory grows without limit

    Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://www.facebook.com/groups/rocksdb.dev

    Expected behavior

    Process consumes about 10 megabytes

    Actual behavior

    Memory grows without limit

    Steps to reproduce the behavior

    Run this code:

    https://pastebin.com/Ch8RhsSB

    Sorry, RocksDB team, but this is a huge problem.

    This is a trivial test, and I expect it to work like a finite state machine: populate memory, flush, re-use memory.

    Instead, I see memory keep growing.

    abandoned-or-aged-out waiting 
    opened by toktarev 181
  • undefined symbol: clock_gettime

    Environment: CentOS 6, GCC 4.9 (via devtools-3), Java 7, kernel 2.6.32-504.12.2.el6.x86_64

    I am getting the following error after running a test for a while:

    /usr/java/latest/bin/java: symbol lookup error: /tmp/librocksdbjni2974434001434564758..so: undefined symbol: clock_gettime

    The same test runs fine on RocksDB 3.6. My investigation shows that RocksDB is not correctly linked against librt.so. My test (with 3.10.2) worked correctly after I ran export LD_PRELOAD=/lib64/rtkaio/librt.so.1

    To investigate the 3.10.2 library, I ran nm:

    $ nm /tmp/librocksdbjni2974434001434564758..so | grep clock
    0000000000332260 T _ZNSt6chrono3_V212steady_clock3nowEv
    000000000037e33c R _ZNSt6chrono3_V212steady_clock9is_steadyE
    0000000000332230 T _ZNSt6chrono3_V212system_clock3nowEv
    000000000037e33d R _ZNSt6chrono3_V212system_clock9is_steadyE
                     U clock_gettime

    I ran nm on the older rocksdb-3.6:

    $ nm /tmp/librocksdbjni1323312933457066341..so | grep clock
    0000000000287390 T _ZNSt6chrono3_V212steady_clock3nowEv
    00000000002c783c R _ZNSt6chrono3_V212steady_clock9is_steadyE
    0000000000287360 T _ZNSt6chrono3_V212system_clock3nowEv
    00000000002c783d R _ZNSt6chrono3_V212system_clock9is_steadyE

    You can see that clock_gettime is undefined (U) in 3.10.2, in the last entry of the first nm output. Looking at the code, the single call to this function is only included in the C++ code if OS_LINUX or OS_FREEBSD is defined.

    Judging from the nm results above, do you think that neither of the two flags was set in 3.6, whereas in 3.10 at least one of them somehow gets set?
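For context on the linking question: on glibc versions before 2.17, clock_gettime is provided by librt rather than libc, so any binary or shared object that calls it has to be linked with -lrt. The following is a minimal sketch of such a call; the helper name monotonic_nanos and the file names in the comment are hypothetical and are not taken from RocksDB.

```cpp
#include <cstdint>
#include <time.h>

// Hypothetical helper (not RocksDB code): returns a monotonic timestamp in
// nanoseconds using clock_gettime.
//
// On glibc older than 2.17 the symbol lives in librt, so the link line needs
// -lrt, for example (illustrative file names):
//   g++ -shared -fPIC clock_probe.cc -o libclock_probe.so -lrt
// Without -lrt the object is left with an undefined "U clock_gettime" entry,
// which only resolves at load time if librt happens to be loaded already.
std::uint64_t monotonic_nanos() {
  timespec ts{};
  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
    return 0;  // clock unavailable
  }
  return static_cast<std::uint64_t>(ts.tv_sec) * 1000000000ULL +
         static_cast<std::uint64_t>(ts.tv_nsec);
}
```

Running ldd on the resulting shared object should show whether librt.so is among its dependencies.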

    java-api 
    opened by pshareghi 110
  • Sort L0 files by newly introduced epoch_num

    Context: Sorting L0 files by largest_seqno has at least two inconveniences:

    • File ingestion and compaction involving ingested files can create files whose seqno range overlaps with that of existing files. force_consistency_check=true flags such overlapping seqno ranges even when the overlap is harmless.
      • For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
        • insert k1@1 to memtable m1
        • ingest file s1 with k2@2, ingest file s2 with k3@3
        • insert k4@4 to m1
        • compact files s1, s2 and result in new file s3 of seqno range [2, 3]
        • flush m1 and result in new file s4 of seqno range [1, 4]. And force_consistency_check=true will think s4 and s3 have a file reordering corruption that might cause returning an old value of k1
      • However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
    • Single delete can decrease a file's largest seqno and ordering by largest_seqno can introduce a wrong ordering hence file reordering corruption
      • For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to @ajkr for this example)
        • an existing SST s1 contains only k1@1
        • insert k1@2 to memtable m1
        • ingest file s2 with k3@3, ingest file s3 with k4@4
        • insert single delete k5@5 in m1
        • flush m1 and result in new file s4 of seqno range [2, 5]
        • compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        • compact s4 and result in new file s6 of seqno range [2] due to single delete
      • By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by force_consistency_check=true, there isn't a good way to prevent this from happening if ordering by largest_seqno

    Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to @ajkr, we now introduce epoch_num, which describes the order in which a file was flushed or ingested/imported (a compaction output file has the minimum epoch_num among its input files). This avoids the above inconveniences in the following ways:

    • In the first case above, force_consistency_check=true will no longer check for overlapping seqno ranges but will check epoch_number ordering instead. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class DBCompactionTestL0FilesMisorderCorruption* for more.
    • In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
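The second case can be replayed with a small toy model. This is only a sketch under stated assumptions: L0File, CompactionOutputEpoch, and the two ordering helpers are hypothetical names for illustration, not RocksDB's own types, and the epoch numbers 1 to 4 are simply assigned in the order the files appear in the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model of an L0 file (hypothetical; not RocksDB's FileMetaData).
struct L0File {
  std::string name;
  std::uint64_t smallest_seqno;
  std::uint64_t largest_seqno;
  std::uint64_t epoch_number;
};

// A compaction output takes the minimum epoch_number among its inputs.
std::uint64_t CompactionOutputEpoch(const std::vector<L0File>& inputs) {
  assert(!inputs.empty());
  std::uint64_t min_epoch = inputs.front().epoch_number;
  for (const L0File& f : inputs) {
    min_epoch = std::min(min_epoch, f.epoch_number);
  }
  return min_epoch;
}

// Old criterion: newest file first, ordered by largest_seqno.
std::vector<std::string> NewestFirstBySeqno(std::vector<L0File> files) {
  std::sort(files.begin(), files.end(), [](const L0File& a, const L0File& b) {
    return a.largest_seqno > b.largest_seqno;
  });
  std::vector<std::string> names;
  for (const L0File& f : files) names.push_back(f.name);
  return names;
}

// New criterion: newest file first, ordered by epoch_number.
std::vector<std::string> NewestFirstByEpoch(std::vector<L0File> files) {
  std::sort(files.begin(), files.end(), [](const L0File& a, const L0File& b) {
    return a.epoch_number > b.epoch_number;
  });
  std::vector<std::string> names;
  for (const L0File& f : files) names.push_back(f.name);
  return names;
}

// Final L0 state of the single-delete scenario: only s5 and s6 remain.
std::vector<L0File> SecondScenarioFinalFiles() {
  L0File s1{"s1", 1, 1, 1};  // existing SST with k1@1
  L0File s2{"s2", 3, 3, 2};  // ingested, k3@3
  L0File s3{"s3", 4, 4, 3};  // ingested, k4@4
  L0File s4{"s4", 2, 5, 4};  // flushed m1, seqno range [2, 5]
  L0File s5{"s5", 1, 4, CompactionOutputEpoch({s1, s2, s3})};
  L0File s6{"s6", 2, 2, CompactionOutputEpoch({s4})};  // single delete dropped
  return {s5, s6};
}
```

In this model, NewestFirstBySeqno(SecondScenarioFinalFiles()) puts s5 ahead of s6 even though s6 holds the newer k1@2, while NewestFirstByEpoch puts s6 first, matching the ordering s5 < s6 described above.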

    Summary:

    • Introduce epoch_number stored per ColumnFamilyData and sort CF's L0 files by their assigned epoch_number instead of largest_seqno.
      • epoch_number is increased and assigned upon VersionEdit::AddFile() for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the kReservedEpochNumberForFileIngestedBehind)
      • Compaction output file is assigned with the minimum epoch_number among input files'
        • Refit level: reuse refitted file's epoch_number
      • Other paths needing epoch_number treatment:
        • Import column families: reuse file's epoch_number if exists. If not, assign one based on NewestFirstBySeqNo
        • Repair: reuse file's epoch_number if exists. If not, assign one based on NewestFirstBySeqNo.
      • Assigning a new epoch_number to a file and adding this file to the LSM tree should be atomic. This is guaranteed by assigning epoch_number right upon VersionEdit::AddFile(), where the version edit is applied to the LSM tree shape right afterwards, either while holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only one ongoing edit per CF (e.g., WriteLevel0TableForRecovery, Repair).
      • Assigning the minimum input epoch number to the compaction output file won't misorder L0 files (even through a later Refit(target_level=0)). This is because, for every key "k" in the input range, a legitimate compaction covers a continuous epoch number range of that key. As long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misordering.
    • Persist epoch_number of each file in manifest and recover epoch_number on db recovery
      • Backward compatibility with old db without epoch_number support is guaranteed by assigning epoch_number to recovered files by NewestFirstBySeqno order. See VersionStorageInfo::RecoverEpochNumbers() for more
      • Forward compatibility with manifest is guaranteed by flexibility of NewFileCustomTag
    • Replace force_consistency_check on L0 with an epoch_number check and remove false-positive checks such as case 1 with largest_seqno above
      • Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (NewestFirstBySeqno) to check/sort them till we infer their epoch number. See usages of EpochNumberRequirement.
    • Remove the fix from https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and its outdated file reordering corruption tests, because this PR supersedes that fix.
    • Misc:
      • update existing tests with epoch_number so make check will pass
      • update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using epoch_number and cover universal/fifo compaction/CompactRange/CompactFile cases
      • assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

    Test:

    • make check
    • New unit tests under db/db_compaction_test.cc, db/db_test2.cc, db/version_builder_test.cc, db/repair_test.cc
    • Updated tests (i.e, DBCompactionTestL0FilesMisorderCorruption*) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
    • [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the .orig binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on simple black/white box, cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox
    • [Ongoing] normal db stress test
    • [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
    CLA Signed WIP 
    opened by hx235 103
  • Improve point-lookup performance using a data block hash index

    Summary:

    Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with data blocks created without the hash index. It is disabled by default unless BlockBasedTableOptions::data_block_index_type is set to kDataBlockBinaryAndHash.

    The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using BlockBasedTableOptions::data_block_hash_table_util_ratio. A lower utilization ratio improves point-lookup efficiency further, but takes more space.
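To make the space/lookup tradeoff concrete, here is a simplified toy model of a per-block hash index with one byte per bucket. It is a sketch under assumptions, not the PR's implementation: ToyDataBlock, the two marker values, the use of std::hash, and storing a key index in each bucket are all illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy data block with a one-byte-per-bucket hash index (hypothetical model).
class ToyDataBlock {
 public:
  static constexpr std::uint8_t kNoEntry = 255;    // bucket is empty
  static constexpr std::uint8_t kCollision = 254;  // several keys share bucket

  // keys must be sorted and hold at most 254 entries; util_ratio in (0, 1].
  ToyDataBlock(std::vector<std::string> keys, double util_ratio)
      : keys_(std::move(keys)) {
    assert(keys_.size() <= kCollision);
    assert(util_ratio > 0.0 && util_ratio <= 1.0);
    // Lower utilization ratio means more buckets for the same number of keys.
    std::size_t num_buckets = std::max<std::size_t>(
        1, static_cast<std::size_t>(keys_.size() / util_ratio));
    buckets_.assign(num_buckets, kNoEntry);
    for (std::size_t i = 0; i < keys_.size(); ++i) {
      std::uint8_t& bucket = buckets_[BucketOf(keys_[i])];
      bucket = (bucket == kNoEntry) ? static_cast<std::uint8_t>(i) : kCollision;
    }
  }

  // Returns true if key is present. *used_fallback is set when the hash index
  // could not answer and a binary search over the block was needed.
  bool Contains(const std::string& key, bool* used_fallback) const {
    *used_fallback = false;
    std::uint8_t bucket = buckets_[BucketOf(key)];
    if (bucket == kNoEntry) return false;  // no key hashes here
    if (bucket == kCollision) {
      *used_fallback = true;
      return std::binary_search(keys_.begin(), keys_.end(), key);
    }
    return keys_[bucket] == key;  // single candidate, one comparison
  }

  // Extra space used by the hash table: one byte per bucket.
  std::size_t HashTableBytes() const { return buckets_.size(); }

 private:
  std::size_t BucketOf(const std::string& key) const {
    return std::hash<std::string>{}(key) % buckets_.size();
  }

  std::vector<std::string> keys_;
  std::vector<std::uint8_t> buckets_;
};
```

With four keys, a utilization ratio of 1.0 gives four one-byte buckets (one byte per key), while 0.5 gives eight buckets, which leaves fewer colliding buckets and therefore fewer fallbacks to binary search.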

    Test Plan:

    Added unit tests; ran make -j32 check and made sure all tests pass.

    Some performance numbers. These experiments run against SSDs. CPU Util is the CPU util percentage of the DataBlockIter point-lookup among db_bench. The CPU util percentage is captured by perf.

    # large cache 20GB
           index | Throughput |             | fallback | cache miss | DB Space
    (util_ratio) |     (MB/s) | CPU Util(%) |    ratio |      ratio |     (GB)
    ------------ | -----------| ----------- | -------- | ---------- | --------
          binary |        116 |       27.17 |    1.000 |   0.000494 |     5.41
           hash1 |        123 |       22.21 |    0.524 |   0.000502 |     5.59
         hash0.9 |        126 |       22.89 |    0.559 |   0.000502 |     5.61
         hash0.8 |        129 |       21.65 |    0.487 |   0.000504 |     5.63
         hash0.7 |        127 |       21.12 |    0.463 |   0.000504 |     5.65
         hash0.6 |        130 |       20.62 |    0.423 |   0.000506 |     5.69
         hash0.5 |        132 |       19.34 |    0.311 |   0.000510 |     5.75
    
    
    # small cache 1GB
           index | Throughput |             | fallback | cache miss | DB Space
    (util_ratio) |     (MB/s) | CPU Util(%) |    ratio |      ratio |     (GB)
    ------------ | -----------| ----------- | -------- | ---------- | --------
          binary |       26.8 |        2.02 |    1.000 |   0.923345 |     5.41
           hash1 |       25.9 |        1.49 |    0.524 |   0.924571 |     5.59
         hash0.9 |       27.5 |        1.59 |    0.559 |   0.924561 |     5.61
         hash0.8 |       27.4 |        1.52 |    0.487 |   0.924868 |     5.63
         hash0.7 |       27.7 |        1.44 |    0.463 |   0.924858 |     5.65
         hash0.6 |       26.8 |        1.36 |    0.423 |   0.925160 |     5.69
         hash0.5 |       28.0 |        1.22 |    0.311 |   0.925779 |     5.75
    
    

    We also compare with the master branch on which the feature PR is based, to make sure there is no performance regression in the default binary seek case. These experiments run against tmpfs without perf.

    master: b271f956c Fix a TSAN failure (#4250)
    feature: bf411a50b DataBlockHashIndex: inline SeekForGet() to speedup the fallback path
    
    # large cache 20GB
        branch | Throughput | cache miss | DB Space ||       branch | Throughput | cache miss | DB Space
          #run |     (MB/s) |      ratio |     (GB) ||         #run |     (MB/s) |      ratio |     (GB)
    ---------- | -----------| ---------- | -------- || ------------ | -----------| ---------- | --------
    master/1   |      127.5 |   0.000494 |     5.41 ||  feature/1   |      129.9 |   0.000494 |     5.41
    master/2   |      130.7 |   0.000494 |     5.41 ||  feature/2   |      126.3 |   0.000494 |     5.41
    master/3   |      128.7 |   0.000494 |     5.41 ||  feature/3   |      128.7 |   0.000494 |     5.41
    master/4   |      105.4 |   0.000494 |     5.41 ||  feature/4   |      131.1 |   0.000494 |     5.41
    master/5   |      135.8 |   0.000494 |     5.41 ||  feature/5   |      132.7 |   0.000494 |     5.41
    master/avg |      125.6 |   0.000494 |     5.41 ||  feature/avg |      129.7 |   0.000494 |     5.41
    
    
    # small cache 1GB
        branch | Throughput | cache miss | DB Space ||       branch | Throughput | cache miss | DB Space
          #run |     (MB/s) |      ratio |     (GB) ||         #run |     (MB/s) |      ratio |     (GB)
    ---------- | -----------| ---------- | -------- || ------------ | -----------| ---------- | --------
    master/1   |       36.9 |   0.923190 |     5.41 ||  feature/1   |       37.1 |   0.923189 |     5.41
    master/2   |       36.8 |   0.923184 |     5.41 ||  feature/2   |       35.8 |   0.923196 |     5.41
    master/3   |       35.8 |   0.923190 |     5.41 ||  feature/3   |       36.4 |   0.923183 |     5.41
    master/4   |       27.8 |   0.923200 |     5.41 ||  feature/4   |       36.6 |   0.923191 |     5.41
    master/5   |       37.7 |   0.923162 |     5.41 ||  feature/5   |       36.7 |   0.923141 |     5.41
    master/avg |       35.0 |   0.923185 |     5.41 ||  feature/avg |       36.5 |   0.923180 |     5.41
    							
    
    
    # benchmarking command
    # setting: num=200 million, reads=100 million, key_size=8B, value_size=40B, threads=16
    $DB_BENCH  --data_block_index_type=${block_index} \
               --db=${db} \
               --block_size=16000 --level_compaction_dynamic_level_bytes=1 \
               --num=$num \
               --key_size=$ks \
               --value_size=$vs \
               --benchmarks=fillseq --compression_type=snappy \
               --statistics=false --block_restart_interval=1 \
               --compression_ratio=0.4 \
               --data_block_hash_table_util_ratio=${util_ratio} \
               --statistics=true \
               >${write_log}
    
    $DB_BENCH  --data_block_index_type=${block_index} \
               --db=${db} \
               --block_size=16000 --level_compaction_dynamic_level_bytes=1 \
               --use_existing_db=true \
               --num=${num} \
               --reads=${reads} \
               --key_size=$ks \
               --value_size=$vs \
               --benchmarks=readtocache,readrandom \
               --compression_type=snappy \
               --block_restart_interval=16 \
               --compression_ratio=0.4 \
               --cache_size=${cache_size} \
               --data_block_hash_table_util_ratio=${util_ratio} \
               --use_direct_reads \
               --disable_auto_compactions \
               --threads=${threads} \
               --statistics=true \
               > ${read_log}
    
    
    CLA Signed 
    opened by fgwu 88
  • [4/4][ResourceMngmt] Account Bloom/Ribbon filter construction memory in global memory limit

    Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.

    Context: Similar to https://github.com/facebook/rocksdb/pull/8428, this PR tracks memory usage during (new) Bloom Filter (i.e., FastLocalBloom) and Ribbon Filter (i.e., Ribbon128) construction by charging dummy entries to the block cache, moving toward the goal of a single global memory limit using block cache capacity. It also constrains the size of the banding portion of the Ribbon Filter during construction by falling back to the Bloom Filter if that banding is, at some point, larger than the available space in the cache under LRUCacheOptions::strict_capacity_limit=true.

    The option to turn on this feature is BlockBasedTableOptions::reserve_table_builder_memory = true, which is set to false by default. We decided not to have a separate option for each memory user in table building, so their memory accounting is bundled under one general option.
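The reservation-and-fallback idea can be sketched with a toy cache that only tracks charged bytes. Everything below is a hypothetical model for illustration (ToyCache, FilterKind, ChooseFilter are not RocksDB names), and it assumes a reservation fails only when the capacity limit is strict.

```cpp
#include <algorithm>
#include <cstddef>

// Toy stand-in for a block cache that only tracks charged bytes.
class ToyCache {
 public:
  ToyCache(std::size_t capacity, bool strict_capacity_limit)
      : capacity_(capacity), strict_(strict_capacity_limit) {}

  // Charges `bytes` as a dummy entry. Fails only when the limit is strict
  // and the charge would exceed the capacity.
  bool Reserve(std::size_t bytes) {
    if (strict_ && used_ + bytes > capacity_) return false;
    used_ += bytes;
    return true;
  }

  // Releases a previous charge (clamped so usage never underflows).
  void Release(std::size_t bytes) { used_ -= std::min(bytes, used_); }

  std::size_t used() const { return used_; }

 private:
  std::size_t capacity_;
  bool strict_;
  std::size_t used_ = 0;
};

enum class FilterKind { kRibbon, kBloom };

// Tries to reserve the temporary banding memory a Ribbon filter would need.
// If the reservation fails, falls back to building a Bloom filter instead.
FilterKind ChooseFilter(ToyCache& cache, std::size_t banding_bytes) {
  if (!cache.Reserve(banding_bytes)) {
    return FilterKind::kBloom;  // cache full under strict limit: fall back
  }
  // ... Ribbon construction would use the banding here ...
  cache.Release(banding_bytes);  // banding is freed once the filter is built
  return FilterKind::kRibbon;
}
```

With a strict 1024-byte cache, a 4096-byte banding request falls back to Bloom while a 512-byte request proceeds with Ribbon; with a non-strict cache the large request is simply charged.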

    Summary:

    • Reserved/released cache for creation/destruction of three main memory users using CacheReservationManager:
      • hash entries (i.e., hash_entries.size(); we bucket-charge hash entries during insertion for performance),
      • banding (Ribbon Filter only, bytes_coeff_rows + bytes_result_rows + bytes_backtrack),
      • final filter (i.e, mutable_buf's size).
        • Implementation details: in order to use CacheReservationManager::CacheReservationHandle to account final filter's memory, we have to store the CacheReservationManager object and CacheReservationHandle for final filter in XXPH3BitsFilterBuilder as well as explicitly delete the filter bits builder when done with the final filter in block based table.
    • Added an option to run filter_bench with this memory reservation feature

    Test:

    • Added new tests in db_bloom_filter_test to verify filter construction peak cache reservation under combination of BlockBasedTable::Rep::FilterType (e.g, kFullFilter, kPartitionedFilter), BloomFilterPolicy::Mode(e.g, kFastLocalBloom, kStandard128Ribbon, kDeprecatedBlock) and BlockBasedTableOptions::reserve_table_builder_memory
      • To address the concern about slow tests: tests with memory reservation under kFullFilter + kStandard128Ribbon and kPartitionedFilter take around 3000 - 6000 ms and others take around 1500 - 2000 ms, in total adding 20000 - 25000 ms to the test suite running locally
    • Added new test in bloom_test to verify Ribbon Filter fallback on large banding in FullFilter
    • Added a test in filter_bench to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction. Local results, averaged over 20 runs, are below:
      • FastLocalBloom

        • baseline ./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg':
          • Build avg ns/key: 29.56295 (DEBUG_LEVEL=1), 29.98153 (DEBUG_LEVEL=0)
        • new feature (expected to be similar as above) ./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg':
          • Build avg ns/key: 30.99046 (DEBUG_LEVEL=1), 30.48867 (DEBUG_LEVEL=0)
        • new feature of RibbonFilter with fallback (expected to be similar as above) ./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg' :
          • Build avg ns/key: 31.146975 (DEBUG_LEVEL=1), 30.08165 (DEBUG_LEVEL=0)
      • Ribbon128

        • baseline ./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg':
          • Build avg ns/key: 129.17585 (DEBUG_LEVEL=1), 130.5225 (DEBUG_LEVEL=0)
        • new feature (expected to be similar as above) ./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg':
          • Build avg ns/key: 131.61645 (DEBUG_LEVEL=1), 132.98075 (DEBUG_LEVEL=0)
        • new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) ./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg' :
          • Build avg ns/key: 52.032965 (DEBUG_LEVEL=1), 52.597825 (DEBUG_LEVEL=0)
          • And the warning message of "Cache reservation for Ribbon filter banding failed due to cache full" is indeed logged to console.
    CLA Signed Merged 
    opened by hx235 81
  • Do not hold mutex when write keys if not necessary

    Problem Summary

    RocksDB acquires the global mutex of the DB instance every time a user calls Write. When RocksDB schedules many compaction jobs, they compete with the write thread for this mutex, which hurts write performance.

    Problem Solution:

    I want to use log_write_mutex in place of the global mutex in most cases, so that the write thread does not acquire the global mutex unless a write-stall or write-buffer-full event occurs.
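The locking idea can be illustrated with a toy write path. This is a sketch under assumptions, not the PR's code: ToyDB and its members are hypothetical, a full write buffer is the only slow-path trigger modeled, and resetting a byte counter stands in for the real flush work.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Toy write path: the common case takes only log_write_mutex_, and the
// global db_mutex_ is taken only when the write buffer is full.
class ToyDB {
 public:
  explicit ToyDB(std::size_t write_buffer_limit) : limit_(write_buffer_limit) {}

  void Write(const std::string& record) {
    if (needs_slow_path_.load(std::memory_order_acquire)) {
      // Slow path: lock order is always db_mutex_ first, then log_write_mutex_.
      std::lock_guard<std::mutex> db_lock(db_mutex_);
      std::lock_guard<std::mutex> log_lock(log_write_mutex_);
      if (buffered_bytes_ >= limit_) {  // re-check under the locks
        buffered_bytes_ = 0;            // stand-in for switching the buffer
        ++flushes_;
      }
      needs_slow_path_.store(false, std::memory_order_release);
    }
    // Fast path: append under the log mutex only.
    std::lock_guard<std::mutex> log_lock(log_write_mutex_);
    log_.push_back(record);
    buffered_bytes_ += record.size();
    if (buffered_bytes_ >= limit_) {
      needs_slow_path_.store(true, std::memory_order_release);
    }
  }

  std::size_t flushes() const {
    std::lock_guard<std::mutex> log_lock(log_write_mutex_);
    return flushes_;
  }

  std::size_t log_size() const {
    std::lock_guard<std::mutex> log_lock(log_write_mutex_);
    return log_.size();
  }

 private:
  const std::size_t limit_;
  std::mutex db_mutex_;                       // contended by background jobs
  mutable std::mutex log_write_mutex_;        // protects the write path state
  std::atomic<bool> needs_slow_path_{false};  // set when the buffer is full
  std::vector<std::string> log_;
  std::size_t buffered_bytes_ = 0;
  std::size_t flushes_ = 0;
};
```

In this model a writer touches db_mutex_ only on the write that follows a full buffer, so background work holding db_mutex_ does not block ordinary writes.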

    Test plan

    1. make check
    2. CI
    3. COMPILE_WITH_TSAN=1 make db_stress
       make crash_test
       make crash_test_with_multiops_wp_txn
       make crash_test_with_multiops_wc_txn
       make crash_test_with_atomic_flush
    CLA Signed 
    opened by Little-Wallace 80
  • Rocks DB crash when used via JNI. Version: 6.20.3

    Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

    Expected behavior

    Rocks DB continues to serve reads and writes

    Actual behavior

    Rocks DB crashes.

    Steps to reproduce the behavior

    I can provide the backtrace dump reported for now. Do let me know what other pieces of information are needed.

    
    Host: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, 32 cores, 376G, Red Hat Enterprise Linux Server release 7.8 (Maipo)
    Time: Mon Aug  2 17:21:25 2021 PDT elapsed time: 480.081057 seconds (0d 0h 8m 0s)
    
    ---------------  T H R E A D  ---------------
    
    Current thread (0x00007f0e02e36800):  JavaThread "grpc-default-executor-24" daemon [_thread_in_native, id=48415, stack(0x00007f0defb58000,0x00007f0defc59000)]
    
    Stack: [0x00007f0defb58000,0x00007f0defc59000],  sp=0x00007f0defc56bc0,  free space=1018k
    Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
    C  [librocksdbjni15379259497632127760.so+0x29bb93]  rocksdb::LegacyFileSystemWrapper::NewSequentialFile(std::string const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSSequentialFile, std::default_delete<rocksdb::FSSequentialFile> >*, rocksdb::IODebugContext*)+0x33
    C  [librocksdbjni15379259497632127760.so+0x3d2df0]  rocksdb::ReadFileToString(rocksdb::FileSystem*, std::string const&, std::string*)+0x90
    C  [librocksdbjni15379259497632127760.so+0x37fc0b]  rocksdb::VersionSet::GetCurrentManifestPath(std::string const&, rocksdb::FileSystem*, std::string*, unsigned long*)+0x5b
    C  [librocksdbjni15379259497632127760.so+0x397b24]  rocksdb::VersionSet::ListColumnFamilies(std::vector<std::string, std::allocator<std::string> >*, std::string const&, rocksdb::FileSystem*)+0x64
    C  [librocksdbjni15379259497632127760.so+0x27e7eb]  rocksdb::DB::ListColumnFamilies(rocksdb::DBOptions const&, std::string const&, std::vector<std::string, std::allocator<std::string> >*)+0x5b
    C  [librocksdbjni15379259497632127760.so+0x1dfda9]  Java_org_rocksdb_RocksDB_listColumnFamilies+0x89
    J 4117  org.rocksdb.RocksDB.listColumnFamilies(JLjava/lang/String;)[[B (0 bytes) @ 0x00007f0e38785fa2 [0x00007f0e38785ec0+0x00000000000000e2]
    J 16357 c2 org.apache.hadoop.hdds.utils.db.RDBStore.<init>(Ljava/io/File;Lorg/rocksdb/DBOptions;Lorg/rocksdb/WriteOptions;Ljava/util/Set;Lorg/apache/hadoop/hdds/utils/db/CodecRegistry;Z)V (717 bytes) @ 0x00007f0e3938e460 [0x00007f0e3938d9e0+0x0000000000000a80]
    J 14139 c2 org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaOneImpl.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;JLjava/lang/String;Z)V (18 bytes) @ 0x00007f0e3906a668 [0x00007f0e39066b00+0x0000000000003b68]
    J 15249 c2 org.apache.hadoop.ozone.container.common.utils.ContainerCache.getDB(JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/utils/ReferenceCountedDB; (339 bytes) @ 0x00007f0e391e30d0 [0x00007f0e391e2ac0+0x0000000000000610]
    J 18254 c2 org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.getBlock(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/hdds/client/BlockID;)Lorg/apache/hadoop/ozone/container/common/helpers/BlockData; (299 bytes) @ 0x00007f0e38f40858 [0x00007f0e38f406e0+0x0000000000000178]
    J 17900 c2 org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (11 bytes) @ 0x00007f0e3966b42c [0x00007f0e39668340+0x00000000000030ec]
    J 17904 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (1105 bytes) @ 0x00007f0e39658378 [0x00007f0e39656ba0+0x00000000000017d8]
    J 8487 c2 org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(Ljava/lang/Object;Lorg/apache/hadoop/hdds/function/FunctionWithServiceException;Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object; (205 bytes) @ 0x00007f0e38c0e908 [0x00007f0e38c0e6e0+0x0000000000000228]
    J 12956 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (38 bytes) @ 0x00007f0e38dd88d0 [0x00007f0e38dd8740+0x0000000000000190]
    J 17877 c2 org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(Ljava/lang/Object;)V (9 bytes) @ 0x00007f0e3962efd8 [0x00007f0e3962ef60+0x0000000000000078]
    J 17878 c2 org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(Ljava/lang/Object;)V (155 bytes) @ 0x00007f0e39631718 [0x00007f0e39631380+0x0000000000000398]
    J 17572 c2 org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext()V (77 bytes) @ 0x00007f0e3923dc40 [0x00007f0e3923d8c0+0x0000000000000380]
    J 14423 c2 org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run()V (35 bytes) @ 0x00007f0e3889bf28 [0x00007f0e3889be80+0x00000000000000a8]
    J 14467 c2 org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run()V (99 bytes) @ 0x00007f0e38bbb27c [0x00007f0e38bbb180+0x00000000000000fc]
    J 15549 c2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V [email protected] (187 bytes) @ 0x00007f0e39278620 [0x00007f0e39278460+0x00000000000001c0]
    J 6773 c1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V [email protected] (9 bytes) @ 0x00007f0e31304be4 [0x00007f0e31304b40+0x00000000000000a4]
    J 6760 c1 java.lang.Thread.run()V [email protected] (17 bytes) @ 0x00007f0e31303934 [0x00007f0e313037c0+0x0000000000000174]
    v  ~StubRoutines::call_stub
    V  [libjvm.so+0x88abd6]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x366
    V  [libjvm.so+0x888bdd]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x1ed
    V  [libjvm.so+0x935d0c]  thread_entry(JavaThread*, Thread*)+0x6c
    V  [libjvm.so+0xe2c91a]  JavaThread::thread_main_inner()+0x1fa
    V  [libjvm.so+0xe2933f]  Thread::call_run()+0x14f
    V  [libjvm.so+0xc6fb9e]  thread_native_entry(Thread*)+0xee
    
    Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
    J 4117  org.rocksdb.RocksDB.listColumnFamilies(JLjava/lang/String;)[[B (0 bytes) @ 0x00007f0e38785f2d [0x00007f0e38785ec0+0x000000000000006d]
    J 16357 c2 org.apache.hadoop.hdds.utils.db.RDBStore.<init>(Ljava/io/File;Lorg/rocksdb/DBOptions;Lorg/rocksdb/WriteOptions;Ljava/util/Set;Lorg/apache/hadoop/hdds/utils/db/CodecRegistry;Z)V (717 bytes) @ 0x00007f0e3938e460 [0x00007f0e3938d9e0+0x0000000000000a80]
    J 14139 c2 org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaOneImpl.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;JLjava/lang/String;Z)V (18 bytes) @ 0x00007f0e3906a668 [0x00007f0e39066b00+0x0000000000003b68]
    J 15249 c2 org.apache.hadoop.ozone.container.common.utils.ContainerCache.getDB(JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/utils/ReferenceCountedDB; (339 bytes) @ 0x00007f0e391e30d0 [0x00007f0e391e2ac0+0x0000000000000610]
    J 18254 c2 org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.getBlock(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/hdds/client/BlockID;)Lorg/apache/hadoop/ozone/container/common/helpers/BlockData; (299 bytes) @ 0x00007f0e38f40858 [0x00007f0e38f406e0+0x0000000000000178]
    J 17900 c2 org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (11 bytes) @ 0x00007f0e3966b42c [0x00007f0e39668340+0x00000000000030ec]
    J 17904 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (1105 bytes) @ 0x00007f0e39658378 [0x00007f0e39656ba0+0x00000000000017d8]
    J 8487 c2 org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(Ljava/lang/Object;Lorg/apache/hadoop/hdds/function/FunctionWithServiceException;Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object; (205 bytes) @ 0x00007f0e38c0e908 [0x00007f0e38c0e6e0+0x0000000000000228]
    J 12956 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (38 bytes) @ 0x00007f0e38dd88d0 [0x00007f0e38dd8740+0x0000000000000190]
    J 17877 c2 org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(Ljava/lang/Object;)V (9 bytes) @ 0x00007f0e3962efd8 [0x00007f0e3962ef60+0x0000000000000078]
    J 17878 c2 org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(Ljava/lang/Object;)V (155 bytes) @ 0x00007f0e39631718 [0x00007f0e39631380+0x0000000000000398]
    J 17572 c2 org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext()V (77 bytes) @ 0x00007f0e3923dc40 [0x00007f0e3923d8c0+0x0000000000000380]
    J 14423 c2 org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run()V (35 bytes) @ 0x00007f0e3889bf28 [0x00007f0e3889be80+0x00000000000000a8]
    J 14467 c2 org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run()V (99 bytes) @ 0x00007f0e38bbb27c [0x00007f0e38bbb180+0x00000000000000fc]
    J 15549 c2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V [email protected] (187 bytes) @ 0x00007f0e39278620 [0x00007f0e39278460+0x00000000000001c0]
    J 6773 c1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V [email protected] (9 bytes) @ 0x00007f0e31304be4 [0x00007f0e31304b40+0x00000000000000a4]
    J 6760 c1 java.lang.Thread.run()V [email protected] (17 bytes) @ 0x00007f0e31303934 [0x00007f0e313037c0+0x0000000000000174]
    v  ~StubRoutines::call_stub
    
    siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000050```
    [hs_err_pid70046.log](https://github.com/facebook/rocksdb/files/6926076/hs_err_pid70046.log)
    
    java-api 
    opened by kerneltime 77
  • Avoid recompressing cold block in CompressedSecondaryCache

    Avoid recompressing cold block in CompressedSecondaryCache

    Summary: The first time a block is looked up from the secondary cache, we only insert a dummy block into the primary cache (charging the actual size of the block) and do not erase the block from the secondary cache. Lookup returns a standalone handle. Only when the block is hit again do we erase it from the secondary cache and add it to the primary cache.

    The first time a block is evicted from the primary cache to the secondary cache, we only insert a dummy block (size 0) into the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.

    Implementation Details

    Add a new state of LRUHandle: the handle is never inserted into the LRUCache (neither the hash table nor the LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0. (refs >= 1 && in_cache == false && IS_STANDALONE == true)

    The behaviors of LRUCacheShard::Lookup() are updated if the secondary_cache is CompressedSecondaryCache:

    1. If a handle is found in primary cache:
      • 1.1. If the handle's value is not nullptr, it is returned immediately.
      • 1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
        • 1.2.1. If no valid handle can be looked up from secondary cache, return nullptr.
        • 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
    2. If a handle is not found in primary cache:
      • 2.1. If no valid handle can be looked up from secondary cache, return nullptr.
      • 2.2. If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block) and return a standalone handle.

    The behaviors of LRUCacheShard::Promote() are updated as follows:

    1. If e->sec_handle has value, one of the following steps can happen:
      • 1.1. Insert a dummy handle and return a standalone handle to caller when secondary_cache_ is CompressedSecondaryCache and e is a standalone handle.
      • 1.2. Insert the item into the primary cache and return the handle to caller.
      • 1.3. Exception handling.
    2. If e->sec_handle has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.

    The behavior of CompressedSecondaryCache::Insert() is updated:

    1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
    2. If a dummy item is found for a block, the block is inserted into the secondary cache.

    The behavior of CompressedSecondaryCache::Lookup() is updated:

    1. If a handle is not found or it is a dummy item, a nullptr is returned.
    2. If erase_handle is true, the handle is erased.

    The behaviors of LRUCacheShard::Release() are adjusted for the standalone handles.
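
    The secondary-cache side of the admission policy described above can be sketched with a toy model. This is a minimal illustration, not RocksDB code: the class ToySecondaryCache, its Entry struct, and the bool-plus-out-parameter Lookup signature are hypothetical names chosen for this example, and the sketch ignores charging, compression, and the primary cache.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy model of the two-touch admission policy; names are hypothetical.
class ToySecondaryCache {
 public:
  // Called when the primary cache evicts a block.
  // Returns true if the block's data was actually stored.
  bool Insert(const std::string& key, const std::string& value) {
    auto it = entries_.find(key);
    if (it == entries_.end()) {
      // First eviction: remember the key with a size-0 dummy only.
      entries_[key] = Entry{true, std::string()};
      return false;
    }
    // A dummy (or an older copy) exists: store the real block.
    it->second = Entry{false, value};
    return true;
  }

  // Returns false on a miss or when only a dummy is present.
  // On a hit, copies the block into *out and optionally erases the entry.
  bool Lookup(const std::string& key, bool erase_handle, std::string* out) {
    auto it = entries_.find(key);
    if (it == entries_.end() || it->second.is_dummy) {
      return false;
    }
    *out = it->second.value;
    if (erase_handle) {
      entries_.erase(it);
    }
    return true;
  }

 private:
  struct Entry {
    bool is_dummy;
    std::string value;
  };
  std::unordered_map<std::string, Entry> entries_;
};
```

    In this model a block that is evicted only once never costs a compression or a copy; only the second eviction stores real data, which is the point of avoiding recompression of cold blocks.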

    Test Plan:

    1. stress tests.
    2. unit tests.
    3. CPU profiling for db_bench.
    CLA Signed 
    opened by gitbw95 74
  • Support WriteCommit policy with sync_fault_injection=1

    Support WriteCommit policy with sync_fault_injection=1

    Context: Prior to this PR, correctness testing with unsynced data loss disabled transactions (use_txn=1) and thus every txn_write_policy. This PR improves that by adding support for one policy, WriteCommit (txn_write_policy=0).

    Summary: The key to this support is to (a) handle Mark{Begin, End}Prepare/MarkCommit/MarkRollback correctly when constructing ExpectedState under the WriteCommit policy and (b) monitor CI jobs and resolve any test incompatibility issues until the jobs are stable. (b) will be part of the test plan.

    For (a)

    • During prepare (i.e., between MarkBeginPrepare() and MarkEndPrepare(xid)), ExpectedStateTraceRecordHandler buffers all writes by adding them to an internal WriteBatch.
    • On MarkEndPrepare(), that WriteBatch is associated with the transaction's xid.
    • During commit (i.e., on MarkCommit(xid)), ExpectedStateTraceRecordHandler retrieves and iterates the internal WriteBatch and finally applies those writes to ExpectedState.
    • During rollback (i.e., on MarkRollback(xid)), ExpectedStateTraceRecordHandler erases the internal WriteBatch from the map.
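
    The four steps above can be sketched as a small state machine. This is a simplified model, not the db_stress implementation: ToyTraceHandler, Write, and Batch are hypothetical names, the expected state is a plain map, and two details are assumptions made for the example (writes outside a prepare section apply directly, and a committed batch is removed from the map).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins; these names are hypothetical, not db_stress types.
using Write = std::pair<std::string, std::string>;  // key, value
using Batch = std::vector<Write>;

class ToyTraceHandler {
 public:
  void MarkBeginPrepare() {
    in_prepare_ = true;
    pending_.clear();
  }

  void Put(const std::string& key, const std::string& value) {
    if (in_prepare_) {
      pending_.push_back(Write(key, value));  // buffer during prepare
    } else {
      expected_[key] = value;  // assumption: plain writes apply directly
    }
  }

  // Associates the buffered batch with the transaction's xid.
  void MarkEndPrepare(const std::string& xid) {
    prepared_[xid] = pending_;
    pending_.clear();
    in_prepare_ = false;
  }

  // Applies the batch for xid to the expected state.
  void MarkCommit(const std::string& xid) {
    auto it = prepared_.find(xid);
    if (it == prepared_.end()) return;
    for (const Write& w : it->second) expected_[w.first] = w.second;
    prepared_.erase(it);  // assumption: batch is dropped after commit
  }

  // Discards the batch for xid without touching the expected state.
  void MarkRollback(const std::string& xid) { prepared_.erase(xid); }

  const std::map<std::string, std::string>& expected() const {
    return expected_;
  }

 private:
  bool in_prepare_ = false;
  Batch pending_;
  std::map<std::string, Batch> prepared_;
  std::map<std::string, std::string> expected_;
};
```

    Nothing reaches the expected state until MarkCommit, so a transaction that was prepared but never committed in the trace leaves the expected state unchanged.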

    For (b) - one major issue described below:

    • TransactionDB in db_stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling them back at the start of the current run; see a historical PR that predates correctness testing.
    • We then verify those processed keys in the recovered db against their expected state.
    • However, with sync_fault_injection=1 turned on, the expected state is constructed from the trace instead of from the LATEST.state of the previous run, so the expected state used to verify those processed keys won't contain UNKNOWN_SENTINEL as it should. See test 1 for a failing case.
    • Therefore, we decided to manually update their expected state to UNKNOWN_SENTINEL as part of the processing.

    Test:

    1. A test that exposes the major issue described above. It fails without setting UNKNOWN_SENTINEL in the expected state during processing and passes with it:
    db=/dev/shm/rocksdb_crashtest_blackbox
    exp=/dev/shm/rocksdb_crashtest_expected
    dbt=$db.tmp
    expt=$exp.tmp
    
    rm -rf $db $exp
    mkdir -p $exp
    
    echo "RUN 1"
    ./db_stress \
    --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
    --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
    pid=$!
    sleep 0.2
    sleep 20
    kill $pid
    sleep 0.2
    
    echo "RUN 2"
    ./db_stress \
    --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
    --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
    pid=$!
    sleep 0.2
    sleep 20
    kill $pid
    sleep 0.2
    
    echo "RUN 3"
    ./db_stress \
    --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
    --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
    
    2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against the previously crashed TransactionDB's WAL.
      • Run the following command to crash a TransactionDB with WriteCommit policy. Then run ./ldb dump_wal on its WAL file.
    db=/dev/shm/rocksdb_crashtest_blackbox
    exp=/dev/shm/rocksdb_crashtest_expected
    rm -rf $db $exp
    mkdir -p $exp
    
    ./db_stress \
    	--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
    	--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
    pid=$!
    sleep 30
    kill $pid
    sleep 1
    
      • Run the following command to verify recovery of the crashed db under a debugger. Compare the step-wise result with the WAL records (e.g., WriteBatch content, xid, prepare/commit/rollback marker).
    ./db_stress \
    	--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
    	--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
    
    3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure.
    CLA Signed 
    opened by hx235 70
  • Track WAL in MANIFEST: persist WALs to and recover WALs from MANIFEST

    Track WAL in MANIFEST: persist WALs to and recover WALs from MANIFEST

    This PR makes it possible to LogAndApply VersionEdits related to WALs, and to Recover from a MANIFEST that contains WAL-related VersionEdits.

    VersionEdits related to WALs are treated similarly to those related to column family operations: they are not applied to versions, but can be in a commit group. Mixing WAL-related VersionEdits with other types of edits would make the logic in ProcessManifestWrite more complicated, so a WAL-related VersionEdit can only contain WAL additions or deletions, like column family add and drop.

    Test Plan:

    a set of unit tests are added in version_set_test.cc

    CLA Signed Merged 
    opened by ghost 64
  • Change The Way Level Target And Compaction Score Are Calculated

    Change The Way Level Target And Compaction Score Are Calculated

    Summary: The current level targets for dynamic leveling have a problem: the target level size changes dramatically after an L0->L1 compaction. When there are many L0 bytes, lower level compactions are delayed, but they resume after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to revert the level targeting size and instead rely on adjusting the score of each level to prioritize the levels that most need compaction.

    Basic idea:

    (1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level that needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled as L2->L3 compactions rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5?

    (2) The proposal for that case is to calculate the score as actual level size / (target size + estimated upper bytes coming down). The reasoning is that if a large amount of pending L0/L1 bytes is coming down, compacting L2->L3 might be more expensive, because once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, when those bytes come down to L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we can adjust it by adding the estimated downward bytes to the target level size.

    Test Plan: Repurpose the VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.
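
    The proposed score can be written as a one-line function. This is only a sketch of the formula quoted in the summary; the function name LevelScore and its parameter names are hypothetical, and it assumes the target size is non-zero.

```cpp
#include <cassert>
#include <cstdint>

// Score for one level under the proposal: bytes expected to arrive from
// upper levels are added to the target size before dividing.
// Assumes target_bytes + incoming_bytes > 0.
double LevelScore(std::uint64_t actual_bytes, std::uint64_t target_bytes,
                  std::uint64_t incoming_bytes) {
  return static_cast<double>(actual_bytes) /
         static_cast<double>(target_bytes + incoming_bytes);
}
```

    As a worked example with made-up sizes: with 200 units pending from L0/L1, a level with actual size 300 and target 100 scores 300 / (100 + 200) = 1.0 instead of 3.0, while a deep level with actual size 150000 and target 100000 scores about 1.497 instead of 1.5. The small upper level drops in priority while the large lower level is almost unaffected.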

    CLA Signed 
    opened by siying 63
  • Performance for xxh3 on ARM is almost 1.5X better with latest code from xxHash repo

    Performance for xxh3 on ARM is almost 1.5X better with latest code from xxHash repo

    Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

    Expected behavior

    Perf from db_bench --benchmarks=xxh3 should be close on c6i.2xl (x86) and c7g.2xl (ARM)

    Actual behavior

    Perf on x86 is much better

    Steps to reproduce the behavior

    See my blog post. Merging the latest code from the xxHash dev branch will help.

    performance 
    opened by mdcallag 0
  • Fix some unit test failure in ExternalSSTFileBasicTest

    Fix some unit test failure in ExternalSSTFileBasicTest

    Summary: the valgrind build for ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithMixedValueType and ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno started failing (see error message in T141554665). I could not repro it, but I suspect it is due to the file ingestion range overlapping with an ongoing compaction, which caused a new global seqno to be assigned after #10988.

    Test plan: monitor future valgrind tests result.

    CLA Signed 
    opened by cbi42 1
  • Fix CompactionOutputs::AddRangeDels() to not add range tombstone outside SST range

    Fix CompactionOutputs::AddRangeDels() to not add range tombstone outside SST range

    Summary: the following assertion was failing sometimes. This can happen when a range tombstone that starts after an SST file's key range is added to the SST file. For example, suppose we have a range tombstone [a, b)@2 and two SST files in L1 with keys a@5 in file 1 and a@3 in file 2. The current code could add the range tombstone to file 1 even though the range tombstone is outside file 1's key range. This caused the assertion failure because the start and end key computation logic does not handle this case, which should not happen. This PR adds a fix to exclude such range tombstones and adds a unit test for repro.

    rocksdb::CompactionOutputs::AddRangeDels->rocksdb::VersionSet::ApproximateSize : Assertion `icmp.Compare(start, end) <= 0'
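
    The example in the summary depends on how internal keys are ordered: by user key ascending, then by sequence number descending. The sketch below is a simplified model of that ordering and of the "tombstone starts after the file" check. ToyInternalKey, Before, and TombstoneStartsWithinFile are hypothetical names, user keys are compared bytewise, and the actual fix in the PR may be structured differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy internal key: user key plus sequence number (hypothetical struct,
// not rocksdb::InternalKey).
struct ToyInternalKey {
  std::string user_key;
  std::uint64_t seqno;
};

// Internal keys sort by user key ascending, then by seqno descending,
// so a@5 sorts before a@3, which sorts before a@2.
bool Before(const ToyInternalKey& x, const ToyInternalKey& y) {
  if (x.user_key != y.user_key) return x.user_key < y.user_key;
  return x.seqno > y.seqno;
}

// Simplified check: a tombstone whose start sorts after the file's
// largest key lies outside that file's key range.
bool TombstoneStartsWithinFile(const ToyInternalKey& tombstone_start,
                               const ToyInternalKey& file_largest) {
  return !Before(file_largest, tombstone_start);
}
```

    With file 1 ending at a@5 and the tombstone starting at a@2, Before(a@5, a@2) is true, so the tombstone starts after file 1's last key and should not be added to file 1.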
    

    Test plan:

    1. Added unit test for repro.
    2. Stress test with a smaller max_key and more frequent range deletions: python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercent=21 --duration=36000 --range_deletion_width=10000 --max_key=100000
    CLA Signed WIP 
    opened by cbi42 1
  • SIGSEGV in BlockBasedTable DumpDataBlocks (Kafka Streams)

    SIGSEGV in BlockBasedTable DumpDataBlocks (Kafka Streams)

    A project that I'm working on (Kafka Streams) uses RocksDB via JNI. While running tests, I encountered a SIGSEGV. Downstream issue: https://issues.apache.org/jira/browse/KAFKA-14555

    Expected behavior

    No SIGSEGV

    Actual behavior

    JVM Crash with the following stacktrace:

    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00000001269f2f2c, pid=88913, tid=40199
    #
    # JRE version: OpenJDK Runtime Environment Corretto-17.0.4.9.1 (17.0.4.1+9) (build 17.0.4.1+9-LTS)
    # Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.4.9.1 (17.0.4.1+9-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, bsd-aarch64)
    # Problematic frame:
    # C  [librocksdbjni15989196819046251041.jnilib+0x2def2c]  _ZN7rocksdb15BlockBasedTable14DumpDataBlocksERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+0x1650
    
    ---------------  T H R E A D  ---------------
    Current thread is native 
    threadStack: [0x0000000171704000,0x0000000171787000],  sp=0x0000000171784e90,  free space=515k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    C  [librocksdbjni15989196819046251041.jnilib+0x2def2c]  _ZN7rocksdb15BlockBasedTable14DumpDataBlocksERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+0x1650
    C  [librocksdbjni15989196819046251041.jnilib+0x2cff98]  _ZN7rocksdb15BlockBasedTable28PrefetchIndexAndFilterBlocksERKNS_11ReadOptionsEPNS_18FilePrefetchBufferEPNS_20InternalIteratorBaseINS_5SliceEEEPS0_bRKNS_22BlockBasedTableOptionsEimmPNS_23BlockCacheLookupContextE+0x354
    C  [librocksdbjni15989196819046251041.jnilib+0x2ce4d4]  _ZN7rocksdb15BlockBasedTable4OpenERKNS_11ReadOptionsERKNS_16ImmutableOptionsERKNS_10EnvOptionsERKNS_22BlockBasedTableOptionsERKNS_21InternalKeyComparatorEONSt3__110unique_ptrINS_22RandomAccessFileReaderENSG_14default_deleteISI_EEEEyPNSH_INS_11TableReaderENSJ_ISN_EEEERKNSG_10shared_ptrIKNS_14SliceTransformEEEbbibybPNS_17TailPrefetchStatsEPNS_16BlockCacheTracerEmRKNSG_12basic_stringIcNSG_11char_traitsIcEENSG_9allocatorIcEEEEy+0xaa0
    C  [librocksdbjni15989196819046251041.jnilib+0x2bbd04]  _ZNK7rocksdb22BlockBasedTableFactory14NewTableReaderERKNS_11ReadOptionsERKNS_18TableReaderOptionsEONSt3__110unique_ptrINS_22RandomAccessFileReaderENS7_14default_deleteIS9_EEEEyPNS8_INS_11TableReaderENSA_ISE_EEEEb+0x8c
    C  [librocksdbjni15989196819046251041.jnilib+0x18de18]  _ZN7rocksdb10TableCache14GetTableReaderERKNS_11ReadOptionsERKNS_11FileOptionsERKNS_21InternalKeyComparatorERKNS_14FileDescriptorEbbPNS_13HistogramImplEPNSt3__110unique_ptrINS_11TableReaderENSF_14default_deleteISH_EEEERKNSF_10shared_ptrIKNS_14SliceTransformEEEbibmNS_11TemperatureE+0x418
    C  [librocksdbjni15989196819046251041.jnilib+0x18e5e8]  _ZN7rocksdb10TableCache9FindTableERKNS_11ReadOptionsERKNS_11FileOptionsERKNS_21InternalKeyComparatorERKNS_14FileDescriptorEPPNS_5Cache6HandleERKNSt3__110shared_ptrIKNS_14SliceTransformEEEbbPNS_13HistogramImplEbibmNS_11TemperatureE+0x22c
    C  [librocksdbjni15989196819046251041.jnilib+0x18e96c]  _ZN7rocksdb10TableCache11NewIteratorERKNS_11ReadOptionsERKNS_11FileOptionsERKNS_21InternalKeyComparatorERKNS_12FileMetaDataEPNS_18RangeDelAggregatorERKNSt3__110shared_ptrIKNS_14SliceTransformEEEPPNS_11TableReaderEPNS_13HistogramImplENS_17TableReaderCallerEPNS_5ArenaEbimPKNS_11InternalKeyESW_b+0x1ac
    C  [librocksdbjni15989196819046251041.jnilib+0x8fdc8]  _ZN7rocksdb13CompactionJob25ProcessKeyValueCompactionEPNS0_18SubcompactionStateE+0x1be4
    C  [librocksdbjni15989196819046251041.jnilib+0x8d92c]  _ZN7rocksdb13CompactionJob3RunEv+0xed8
    C  [librocksdbjni15989196819046251041.jnilib+0xfb318]  _ZN7rocksdb6DBImpl20BackgroundCompactionEPbPNS_10JobContextEPNS_9LogBufferEPNS0_19PrepickedCompactionENS_3Env8PriorityE+0xbc8
    C  [librocksdbjni15989196819046251041.jnilib+0xf9484]  _ZN7rocksdb6DBImpl24BackgroundCallCompactionEPNS0_19PrepickedCompactionENS_3Env8PriorityE+0xc0
    C  [librocksdbjni15989196819046251041.jnilib+0xf6f58]  _ZN7rocksdb6DBImpl16BGWorkCompactionEPv+0x30
    C  [librocksdbjni15989196819046251041.jnilib+0x3561dc]  _ZN7rocksdb14ThreadPoolImpl4Impl8BGThreadEm+0x1ec
    C  [librocksdbjni15989196819046251041.jnilib+0x35645c]  _ZN7rocksdb14ThreadPoolImpl4Impl15BGThreadWrapperEPv+0x7c
    C  [librocksdbjni15989196819046251041.jnilib+0x357ed8]  _ZN7rocksdb13NewThreadPoolEi+0x2b0
    C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94
    

    Full log: hs_err_pid88913.log

    Steps to reproduce the behavior

    1. Clone Kafka https://github.com/apache/kafka from trunk
    2. Run ./gradlew streams:cleanTest streams:test

    Note: I have only seen this failure once so far and have not yet verified these reproduction steps. I am using macOS Monterey 12.6 with an arm64/aarch64 Apple Silicon M1 Max.

    opened by gharris1727 0
  • Optimize data_block_hash_index unit test

    Optimize data_block_hash_index unit test

    (1) DataBlockHashTest: keep comments consistent with code behavior (50% utilization -> 0.5 util_ratio). (2) DataBlockHashTestCollision: the original implementation is the same as DataBlockHashTest and is not designed for scenarios with a large number of collisions. I have fixed this.

    CLA Signed 
    opened by SGZW 0
Releases(v7.8.3)
  • v7.8.3(Dec 15, 2022)

    7.8.3 (2022-11-29)

    • Revert an internal change in 7.8.0 associated with some memory usage churn.

    7.8.2 (2022-11-27)

    Behavior changes

    • Make best-efforts recovery verify SST unique ID before Version construction (#10962)
    • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
    • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
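
    The configuration condition in the flush retry entry above can be expressed as a predicate. This is a hypothetical helper written for illustration, not a RocksDB API; it only restates the option thresholds quoted in the release note.

```cpp
#include <cassert>

// Hypothetical helper: returns true when a configuration matches the
// conditions the release note lists for the flush retry bug.
bool MatchesFlushRetryBugConditions(int max_background_flushes,
                                    int max_background_jobs,
                                    int max_write_buffer_number) {
  // Parallel flush: either setting enables more than one flush thread.
  const bool parallel_flush =
      max_background_flushes > 1 || max_background_jobs >= 8;
  // Non-default memtable count, as stated in the note.
  const bool non_default_memtable_count = max_write_buffer_number > 2;
  return parallel_flush && non_default_memtable_count;
}
```

    For example, max_background_jobs = 8 with max_write_buffer_number = 3 matches, while the same job count with max_write_buffer_number = 2 does not.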

    Bug Fixes

    • Fixed a regression in scan for async_io. During seek, valid buffers were getting cleared causing a regression.
    • Fixed a performance regression in iterator where range tombstones after iterate_upper_bound are processed.

    7.8.1 (2022-11-02)

    Bug Fixes

    • Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if there was an IOError while reading the data, leading to an empty buffer, and another buffer with an async read already in progress was submitted for reading again.

    7.8.0 (2022-10-22)

    New Features

    • DeleteRange() now supports user-defined timestamp.
    • Provide support for async_io with tailing iterators when ReadOptions.tailing is enabled during scans.
    • Tiered Storage: allow data moving up from the last level to the penultimate level if the input level is penultimate level or above.
    • Added DB::Properties::kFastBlockCacheEntryStats, which is similar to DB::Properties::kBlockCacheEntryStats, except returns cached (stale) values in more cases to reduce overhead.
    • FIFO compaction now supports migrating from a multi-level DB via DB::Open(). During the migration phase, the FIFO compaction picker will:
      • Pick the SST file with the smallest starting key in the bottom-most non-empty level.
      • Note that during the migration phase, the file purge order will only be an approximation of "FIFO", as files in lower levels might sometimes contain newer keys than files in upper levels.
    • Added an option ignore_max_compaction_bytes_for_input to ignore max_compaction_bytes limit when adding files to be compacted from input level. This should help reduce write amplification. The option is enabled by default.
    • Tiered Storage: allow data moving up from the last level even if it's a last level only compaction, as long as the penultimate level is empty.
    • Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through sub directories and list only files in GetChildren API.
    • Add option preserve_internal_time_seconds to preserve the time information for the latest data, which can be used to determine the age of data when preclude_last_level_data_seconds is enabled. The time information is attached to the SST in table property rocksdb.seqno.time.map, which can be parsed by the tools ldb or sst_dump.

    Bug Fixes

    • Fix a bug in io_uring_prep_cancel in AbortIO API for posix, which expects sqe->addr to match the submitted read request; the wrong parameter was being passed.
    • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.
    • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
    • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
    • Fixed a bug causing manual flush with flush_opts.wait=false to stall when database has stopped all writes (#10001).
    • Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
    • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
    • Fixed a memory safety bug in experimental HyperClockCache (#10768)
    • Fixed some cases where ldb update_manifest and ldb unsafe_remove_sst_file are not usable because they were requiring the DB files to match the existing manifest state (before updating the manifest to match a desired state).

    Performance Improvements

    • Try to align the compaction output file boundaries to the next level ones, which can reduce more than 10% compaction load for the default level compaction. The feature is enabled by default, to disable, set AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size to false. As a side effect, it can create SSTs larger than the target_file_size (capped at 2x target_file_size) or smaller files.
    • Improve RoundRobin TTL compaction, which is going to be the same as normal RoundRobin compaction to move the compaction cursor.
    • Fix a small CPU regression caused by a change that UserComparatorWrapper was made Customizable, because Customizable itself has small CPU overhead for initialization.
    • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

    Behavior Changes

    • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).

    Public API changes

    • Make kXXH3 checksum the new default, because it is faster on common hardware, especially with kCRC32c affected by a performance bug in some versions of clang (https://github.com/facebook/rocksdb/issues/9891). DBs written with this new setting can be read by RocksDB 6.27 and newer.
    • Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Introduced an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. More details in include/rocksdb/block_cache_trace_writer.h.
  • v7.7.8(Dec 15, 2022)

    7.7.8 (2022-11-27)

    Bug Fixes

    • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
    • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
    • Fixed a regression in iterator where range tombstones after iterate_upper_bound are processed.

    7.7.7 (2022-11-15)

    Bug Fixes

    • Fixed a regression in scan for async_io. During seek, valid buffers were getting cleared causing a regression.

    7.7.6 (2022-11-03)

    Bug Fixes

    • Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if there was an IOError while reading the data, leading to an empty buffer, and another buffer with an async read already in progress was submitted for reading again.

    7.7.5 (2022-10-28)

    Bug Fixes

    • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

    7.7.4 (2022-10-28)

    Bug Fixes

    • Fixed a case of calling malloc_usable_size on result of operator new[].
  • v7.7.3(Oct 12, 2022)

  • v7.7.2(Oct 7, 2022)

    7.7.2 (2022-10-05)

    Bug Fixes

    • Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
    • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).

    Behavior Changes

    • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).

    7.7.1 (2022-09-26)

    Bug Fixes

    • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
    • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).

    7.7.0 (2022-09-18)

    Bug Fixes

    • Fixed a hang when an operation such as GetLiveFiles or CreateNewBackup is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB.
    • Fixed bug where FlushWAL(true /* sync */) (used by GetLiveFilesStorageInfo(), which is used by checkpoint and backup) could cause parallel writes at the tail of a WAL file to never be synced.
    • Fix periodic_task being unable to re-register the same task type, which may cause SetOptions() to fail to update periodic task times such as stats_dump_period_sec and stats_persist_period_sec.
    • Fixed a bug in the rocksdb.prefetched.bytes.discarded stat. It was counting the prefetch buffer size, rather than the actual number of bytes discarded from the buffer.
    • Fix bug where the directory containing CURRENT can be left unsynced after CURRENT is updated to point to the latest MANIFEST, which leads to risk of unsynced data loss of CURRENT.
    • Update rocksdb.multiget.io.batch.size stat in non-async MultiGet as well.
    • Fix a bug in key range overlap checking with concurrent compactions when user-defined timestamp is enabled. User-defined timestamps should be EXCLUDED when checking if two ranges overlap.
    • Fixed a bug where the blob cache prepopulating logic did not consider the secondary cache (see #10603).
    • Fixed the rocksdb.num.sst.read.per.level, rocksdb.num.index.and.filter.blocks.read.per.level and rocksdb.num.level.read.per.multiget stats in the MultiGet coroutines
    • Fix a bug in io_uring_prep_cancel in the AbortIO API for posix, which expects sqe->addr to match the submitted read request, but the wrong parameter was being passed.
    • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.
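
    The key range overlap fix above says that user-defined timestamps should be excluded when checking whether two ranges overlap. The following is a minimal, standalone sketch of that idea and not RocksDB's implementation: it assumes keys carry a fixed-width timestamp suffix (kTimestampSize, KeyRange, StripTimestamp, RangesOverlap, and WithTimestamp are all hypothetical names introduced for illustration) and compares only the timestamp-free part with plain bytewise ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for this sketch: every key ends with a fixed-width timestamp.
constexpr std::size_t kTimestampSize = 8;

// Returns the user key with its trailing timestamp removed.
std::string StripTimestamp(const std::string& key_with_ts) {
  assert(key_with_ts.size() >= kTimestampSize);
  return key_with_ts.substr(0, key_with_ts.size() - kTimestampSize);
}

// Closed key range [smallest, largest]; both ends carry a timestamp suffix.
struct KeyRange {
  std::string smallest;
  std::string largest;
};

// Two ranges overlap unless one ends strictly before the other begins.
// Only the timestamp-free user keys take part in the comparison.
bool RangesOverlap(const KeyRange& a, const KeyRange& b) {
  const std::string a_lo = StripTimestamp(a.smallest);
  const std::string a_hi = StripTimestamp(a.largest);
  const std::string b_lo = StripTimestamp(b.smallest);
  const std::string b_hi = StripTimestamp(b.largest);
  return !(a_hi < b_lo || b_hi < a_lo);
}

// Helper for building test keys: appends a timestamp made of `ts_fill` bytes.
std::string WithTimestamp(const std::string& user_key, char ts_fill) {
  std::string out = user_key;
  out.append(kTimestampSize, ts_fill);
  return out;
}
```

    With this model, a range ending at user key "k" and a range starting at user key "k" overlap regardless of which timestamps are attached to the two boundary keys.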

    Public API changes

    • Add rocksdb_column_family_handle_get_id, rocksdb_column_family_handle_get_name to get name, id of column family in C API
    • Add a new stat rocksdb.async.prefetch.abort.micros to measure time spent waiting for async prefetch reads to abort

    Java API Changes

    • Add CompactionPriority.RoundRobin.
    • Revert to using the default metadata charge policy when creating an LRU cache via the Java API.

    Behavior Change

    • DBOptions::verify_sst_unique_id_in_manifest is now an on-by-default feature that verifies SST file identity whenever they are opened by a DB, rather than only at DB::Open time.
    • Previously, when the option migration tool (OptionChangeMigration()) migrated to FIFO compaction, it compacted all the data into a single SST file and moved it to L0. This could be a problem for some users: the giant file may soon be deleted to satisfy max_table_files_size, which might leave the DB almost empty. The behavior is changed so that the files are cut to be smaller, but these files might not follow the data insertion order. With the change, after the migration, migrated data might not be dropped in insertion order by FIFO compaction.
    • When a block is first found in CompressedSecondaryCache, we just insert a dummy block into the primary cache and don’t erase the block from CompressedSecondaryCache. A standalone handle is returned to the caller. Only if the block is found again in CompressedSecondaryCache before the dummy block is evicted do we erase the block from CompressedSecondaryCache and insert it into the primary cache.
    • When a block is first evicted from the primary cache to CompressedSecondaryCache, we just insert a dummy block in CompressedSecondaryCache. Only if it is evicted again before the dummy block is evicted from the cache is it treated as a hot block and inserted into CompressedSecondaryCache.
    • Improved the estimation of memory used by cached blobs by taking into account the size of the object owning the blob value and also the allocator overhead if malloc_usable_size is available (see #10583).
    • Blob values now have their own category in the cache occupancy statistics, as opposed to being lumped into the "Misc" bucket (see #10601).
    • Change the optimize_multiget_for_io experimental ReadOptions flag to default on.
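
    The dummy-block admission described above for blocks evicted from the primary cache can be pictured as a "second touch" policy. The sketch below is a toy model under stated assumptions, not the CompressedSecondaryCache code: the class name TwoTouchSecondaryCache and its methods are hypothetical, there is no compression, and capacity-based eviction of dummy entries is not modeled.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy model: the first eviction of a key only records a dummy marker;
// the value is stored for real only when the same key is evicted again
// while its dummy marker is still present.
class TwoTouchSecondaryCache {
 public:
  // Called when the primary cache evicts `key`.
  // Returns true if the real value was stored by this call.
  bool OnPrimaryEviction(const std::string& key, const std::string& value) {
    auto it = entries_.find(key);
    if (it == entries_.end()) {
      entries_.emplace(key, Entry{true, ""});  // first touch: dummy only
      return false;
    }
    it->second.is_dummy = false;  // second touch: treat as hot
    it->second.value = value;
    return true;
  }

  // Returns true and fills `value` only for real (non-dummy) entries.
  bool Lookup(const std::string& key, std::string* value) const {
    auto it = entries_.find(key);
    if (it == entries_.end() || it->second.is_dummy) {
      return false;
    }
    *value = it->second.value;
    return true;
  }

 private:
  struct Entry {
    bool is_dummy;
    std::string value;
  };
  std::unordered_map<std::string, Entry> entries_;
};
```

    The point of such a policy is that blocks evicted only once never cost real space in the secondary cache; only blocks that keep coming back are admitted.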

    New Features

    • RocksDB does internal auto prefetching if it notices 2 sequential reads and readahead_size is not specified. A new option, num_file_reads_for_auto_readahead, is added to BlockBasedTableOptions; it indicates after how many sequential reads internal auto prefetching should start (default is 2).
    • Added new perf context counters block_cache_standalone_handle_count, block_cache_real_handle_count, compressed_sec_cache_insert_real_count, compressed_sec_cache_insert_dummy_count, compressed_sec_cache_uncompressed_bytes, and compressed_sec_cache_compressed_bytes.
    • Memory for blobs which are to be inserted into the blob cache is now allocated using the cache's allocator (see #10628 and #10647).
    • HyperClockCache is an experimental, lock-free Cache alternative for block cache that offers much improved CPU efficiency under high parallel load or high contention, with some caveats. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
    • CompressedSecondaryCacheOptions::enable_custom_split_merge is added for enabling the custom split and merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
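
    To illustrate the num_file_reads_for_auto_readahead item above, here is a small standalone sketch of counting back-to-back sequential reads before turning on prefetching. It is a simplified model and not RocksDB's prefetch logic: SequentialReadDetector is a hypothetical class, and the exact counting convention (whether the threshold is inclusive, and what resets the counter) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Toy detector: a read is "sequential" if it starts exactly where the
// previous read ended. Readahead is suggested once the number of reads in
// the current sequential run reaches the configured threshold.
class SequentialReadDetector {
 public:
  explicit SequentialReadDetector(int reads_before_readahead)
      : threshold_(reads_before_readahead) {
    assert(reads_before_readahead >= 1);
  }

  // Records a read of [offset, offset + len) and returns true if
  // readahead should be active for this and following reads.
  bool OnRead(uint64_t offset, uint64_t len) {
    if (run_length_ > 0 && offset == next_expected_offset_) {
      ++run_length_;
    } else {
      run_length_ = 1;  // a random read starts a new run
    }
    next_expected_offset_ = offset + len;
    return run_length_ >= threshold_;
  }

 private:
  int threshold_;
  int run_length_ = 0;
  uint64_t next_expected_offset_ = 0;
};
```

    With a threshold of 2 (the default named in the note), the second read of a sequential run enables readahead in this model, and any non-contiguous read resets the run.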

    Performance Improvements

    • Iterator performance is improved for DeleteRange() users. Internally, the iterator will skip to the end of a range tombstone when possible, instead of looping through each key and checking individually whether it is range deleted.
    • Eliminated some allocations and copies in the blob read path. Also, PinnableSlice now only points to the blob value and pins the backing resource (cache entry or buffer) in all cases, instead of containing a copy of the blob value. See #10625 and #10647.
    • For scans with async_io enabled, a few optimizations have been added to issue more asynchronous requests in parallel in order to avoid synchronous prefetching.
    • DeleteRange() users should see improvement in get/iterator performance from mutable memtable (see #10547).
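
    The DeleteRange() iterator improvement above replaces a per-key "is this key range deleted?" check with a jump to the end of the covering tombstone. A minimal sketch of that idea over in-memory sorted data follows; it is not the RocksDB iterator, and RangeTombstone and ScanLiveKeys are hypothetical names. It assumes keys are sorted ascending and tombstones are sorted by start key and do not overlap each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Deletes every key in the half-open interval [start, end).
struct RangeTombstone {
  std::string start;
  std::string end;
};

// Returns the keys not covered by any tombstone. When a key is covered,
// the scan seeks directly to the first key >= the tombstone's end instead
// of visiting every deleted key in between.
std::vector<std::string> ScanLiveKeys(
    const std::vector<std::string>& keys,
    const std::vector<RangeTombstone>& tombstones) {
  std::vector<std::string> live;
  std::size_t t = 0;
  auto it = keys.begin();
  while (it != keys.end()) {
    // Skip tombstones that end at or before the current key.
    while (t < tombstones.size() && tombstones[t].end <= *it) {
      ++t;
    }
    if (t < tombstones.size() && tombstones[t].start <= *it) {
      // Current key is covered: jump past the whole deleted range.
      it = std::lower_bound(it, keys.end(), tombstones[t].end);
      ++t;
      continue;
    }
    live.push_back(*it);
    ++it;
  }
  return live;
}
```

    The binary search stands in for an iterator Seek(); the benefit grows with the number of keys hidden under each tombstone.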
  • v7.6.0(Sep 20, 2022)

    New Features

    • Added prepopulate_blob_cache to ColumnFamilyOptions. If enabled, prepopulate warm/hot blobs which are already in memory into blob cache at the time of flush. On a flush, the blobs that are in memory (in memtables) get flushed to the device. If using Direct IO, additional IO is incurred to read these blobs back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. This also helps in the case of a remote file system, since it involves network traffic and higher latencies.
    • Support using secondary cache with the blob cache. When creating a blob cache, the user can set a secondary blob cache by configuring secondary_cache in LRUCacheOptions.
    • Charge memory usage of blob cache when the backing cache of the blob cache and the block cache are different. If an operation reserving memory for blob cache exceeds the available space left in the block cache at some point (i.e., causing a cache full under LRUCacheOptions::strict_capacity_limit = true), creation will fail with Status::MemoryLimit(). To opt in to this feature, enable charging CacheEntryRole::kBlobCache in BlockBasedTableOptions::cache_usage_options.
    • Improve subcompaction range partitioning so that it is likely to be more even. A more even distribution of subcompactions will improve compaction throughput for some workloads. All input files' index blocks are sampled for anchor key points, from which we pick positions to partition the input range. This introduces some CPU overhead in the compaction preparation phase if subcompaction is enabled, but it should be a small fraction of the CPU usage of the whole compaction process. This also brings a behavior change: the subcompaction number is much more likely to be maxed out than before.
    • Add CompactionPri::kRoundRobin, a compaction picking mode that cycles through all the files with a compact cursor in a round-robin manner. This feature is available since 7.5.
    • Provide support for subcompactions for user_defined_timestamp.
    • Added an option memtable_protection_bytes_per_key that turns on memtable per key-value checksum protection. Each memtable entry will be suffixed by a checksum that is computed during writes, and verified in reads/compaction. Detected corruption will be logged, and a corruption status will be returned to the user.
    • Added a blob-specific cache priority level - bottom level. Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. The user can specify the new option low_pri_pool_ratio in LRUCacheOptions to configure the ratio of capacity reserved for low priority cache entries (and therefore the remaining ratio is the space reserved for the bottom level), or configuring the new argument low_pri_pool_ratio in NewLRUCache() to achieve the same effect.
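
    The memtable_protection_bytes_per_key item above describes suffixing each entry with a checksum on write and verifying it on read. The sketch below shows that general pattern in isolation. It is not RocksDB's encoding or hash: it assumes a 64-bit FNV-1a hash truncated to the configured number of bytes, and Fnv1a64, EncodeEntry, and DecodeEntry are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a, used here only as a stand-in checksum.
uint64_t Fnv1a64(const std::string& data) {
  uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Write path: returns payload followed by the low `protection_bytes`
// bytes of its checksum (little-endian).
std::string EncodeEntry(const std::string& payload,
                        std::size_t protection_bytes) {
  assert(protection_bytes <= 8);
  std::string out = payload;
  const uint64_t h = Fnv1a64(payload);
  for (std::size_t i = 0; i < protection_bytes; ++i) {
    out.push_back(static_cast<char>((h >> (8 * i)) & 0xff));
  }
  return out;
}

// Read path: recomputes the checksum and compares it with the stored
// suffix. Returns false (corruption) on mismatch.
bool DecodeEntry(const std::string& entry, std::size_t protection_bytes,
                 std::string* payload) {
  if (entry.size() < protection_bytes) {
    return false;
  }
  const std::string body = entry.substr(0, entry.size() - protection_bytes);
  if (EncodeEntry(body, protection_bytes) != entry) {
    return false;
  }
  *payload = body;
  return true;
}
```

    A caller would treat a false return from the read path the way the note describes: log the event and surface a corruption status instead of the value.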

    Public API changes

    • Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.
    • CompactRangeOptions::exclusive_manual_compaction is now false by default. This ensures RocksDB does not introduce artificial parallelism limitations by default.
    • Tiered Storage: change bottommost_temperature to last_level_temperature. The old option name is kept only for migration; please use the new option. The behavior is changed to apply temperature for the last_level SST files only.
    • Added a new experimental ReadOption flag called optimize_multiget_for_io, which when set attempts to reduce MultiGet latency by spawning coroutines for keys in multiple levels.

    Bug Fixes

    • Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.)
    • Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.
    • Fix race conditions in GenericRateLimiter.
    • Fix a bug in FIFOCompactionPicker::PickTTLCompaction where the total_size calculation might cause underflow.
    • Fix data race bug in hash linked list memtable. With this bug, read request might temporarily miss an old record in the memtable in a race condition to the hash bucket.
    • Fix a bug where best_efforts_recovery may fail to open the DB with mmap read.
    • Fixed a bug where blobs read during compaction would pollute the cache.
    • Fixed a data race in LRUCache when used with a secondary_cache.
    • Fixed a bug where blobs read by iterators would be inserted into the cache even with the fill_cache read option set to false.
    • Fixed the segfault caused by AllocateData() in CompressedSecondaryCache::SplitValueIntoChunks() and MergeChunksIntoValueTest.
    • Fixed a bug in BlobDB where a mix of inlined and blob values could result in an incorrect value being passed to the compaction filter (see #10391).
    • Fixed a memory leak bug in stress tests caused by FaultInjectionSecondaryCache.

    Behavior Change

    • Added checksum handshake during the copying of decompressed WAL fragment. This together with #9875, #10037, #10212, #10114 and #10319 provides end-to-end integrity protection for write batch during recovery.
    • To minimize the internal fragmentation caused by the variable size of the compressed blocks in CompressedSecondaryCache, the original block is split according to the jemalloc bin size in Insert() and then merged back in Lookup().
    • PosixLogger is removed and by default EnvLogger will be used for info logging. The behavior of the two loggers should be very similar when using the default Posix Env.
    • Remove [min|max]_timestamp from VersionEdit for now since they are not tracked in MANIFEST anyway but consume two empty std::string (up to 64 bytes) for each file. Should they be added back in the future, we should store them more compactly.
    • Improve the universal tiered storage compaction picker to avoid extra major compactions triggered by size amplification. If preclude_last_level_data_seconds is enabled, the size amplification is calculated within non-last-level data only, which skips the last level and uses the penultimate level as the size base.
    • If an error is hit when writing to a file (append, sync, etc), RocksDB is more strict with not issuing more operations to it, except closing the file, with exceptions of some WAL file operations in error recovery path.
    • A WriteBufferManager constructed with allow_stall == false will no longer trigger write stall implicitly by thrashing until memtable count limit is reached. Instead, a column family can continue accumulating writes while that CF is flushing, which means memory may increase. Users who prefer stalling writes must now explicitly set allow_stall == true.
    • Add CompressedSecondaryCache into the stress tests.
    • Block cache keys have changed, which will cause any persistent caches to miss between versions.
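
    The CompressedSecondaryCache item above mentions splitting a block according to allocator bin sizes in Insert() and merging it back in Lookup(). Here is a rough sketch of that split/merge round trip. It is not the RocksDB implementation: the bin size table kBinSizes is made up for illustration, chunk headers are ignored, and the greedy "largest bin that is fully used" rule is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical allocator bin sizes in ascending order.
const std::vector<std::size_t> kBinSizes = {128, 256, 512, 1024, 2048, 4096};

// Splits `value` into chunks whose sizes match bins exactly where possible.
// Each step takes the largest bin that the remaining bytes fill completely;
// a tail smaller than the smallest bin becomes the final chunk.
std::vector<std::string> SplitIntoChunks(const std::string& value) {
  std::vector<std::string> chunks;
  std::size_t pos = 0;
  while (pos < value.size()) {
    const std::size_t remaining = value.size() - pos;
    std::size_t take = remaining;
    for (auto it = kBinSizes.rbegin(); it != kBinSizes.rend(); ++it) {
      if (*it <= remaining) {
        take = *it;
        break;
      }
    }
    chunks.push_back(value.substr(pos, take));
    pos += take;
  }
  return chunks;
}

// Reassembles the original value from its chunks.
std::string MergeChunks(const std::vector<std::string>& chunks) {
  std::string out;
  for (const std::string& chunk : chunks) {
    out += chunk;
  }
  return out;
}
```

    In this model a 5000-byte value becomes chunks of 4096, 512, 256, 128, and 8 bytes, so only the small tail wastes space inside its bin.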

    Performance Improvements

    • Instead of constructing FragmentedRangeTombstoneList during every read operation, it is now constructed once and stored in immutable memtables. This improves speed of querying range tombstones from immutable memtables.
    • When using iterators with the integrated BlobDB implementation, blob cache handles are now released immediately when the iterator's position changes.
    • MultiGet can now do more IO in parallel by reading data blocks from SST files in multiple levels, if the optimize_multiget_for_io ReadOption flag is set.
  • v7.5.3(Aug 24, 2022)

    7.5.2 (2022-08-02)

    Bug Fixes

    • Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.)

    7.5.1 (2022-08-01)

    Bug Fixes

    • Fix a bug where rate_limiter_parameter is not passed into PartitionedFilterBlockReader::GetFilterPartitionBlock.

    7.5.0 (2022-07-15)

    New Features

    • Mempurge option flag experimental_mempurge_threshold is now a ColumnFamilyOptions and can now be dynamically configured using SetOptions().
    • Support backward iteration when ReadOptions::iter_start_ts is set.
    • Provide support for ReadOptions.async_io with direct_io to improve Seek latency by using async IO to parallelize child iterator seek and doing asynchronous prefetching on sequential scans.
    • Added support for blob caching in order to cache frequently used blobs for BlobDB.
      • User can configure the new ColumnFamilyOptions blob_cache to enable/disable blob caching.
      • Either sharing the backend cache with the block cache or using a completely separate cache is supported.
      • A new abstraction interface called BlobSource for blob read logic gives all users access to blobs, whether they are in the blob cache, secondary cache, or (remote) storage. Blobs can be potentially read both while handling user reads (Get, MultiGet, or iterator) and during compaction (while dealing with compaction filters, Merges, or garbage collection) but eventually all blob reads go through Version::GetBlob or, for MultiGet, Version::MultiGetBlob (and then get dispatched to the interface -- BlobSource).
    • Add experimental tiered compaction feature AdvancedColumnFamilyOptions::preclude_last_level_data_seconds, which makes sure the new data inserted within preclude_last_level_data_seconds won't be placed on cold tier (the feature is not complete).

    Public API changes

    • Add metadata related structs and functions in C API, including
      • rocksdb_get_column_family_metadata() and rocksdb_get_column_family_metadata_cf() to obtain rocksdb_column_family_metadata_t.
      • rocksdb_column_family_metadata_t and its get functions & destroy function.
      • rocksdb_level_metadata_t and its get functions & destroy function.
      • rocksdb_file_metadata_t and its get functions & destroy function.
    • Add suggest_compact_range() and suggest_compact_range_cf() to C API.
    • When using block cache strict capacity limit (LRUCache with strict_capacity_limit=true), DB operations now fail with Status code kAborted subcode kMemoryLimit (IsMemoryLimit()) instead of kIncomplete (IsIncomplete()) when the capacity limit is reached, because Incomplete can mean other specific things for some operations. In more detail, Cache::Insert() now returns the updated Status code and this usually propagates through RocksDB to the user on failure.
    • NewClockCache calls temporarily return an LRUCache (with similar characteristics as the desired ClockCache). This is because ClockCache is being replaced by a new version (the old one had unknown bugs) but this is still under development.
    • Add two functions, int ReserveThreads(int threads_to_be_reserved) and int ReleaseThreads(int threads_to_be_released), to the Env class. In the default implementation, both return 0. A newly added xxxEnv class that inherits Env should implement these two functions for the thread reservation/releasing features.
    • Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.

    Bug Fixes

    • Fix a bug in which backup/checkpoint can include a WAL deleted by RocksDB.
    • Fix a bug where concurrent compactions might cause unnecessary further write stalling. In some cases, this might cause write rate to drop to minimum.
    • Fix a bug in Logger where, if dbname and db_log_dir are on different filesystems, dbname creation would fail with respect to the db_log_dir path, returning an error and failing to open the DB.
    • Fix a CPU and memory efficiency issue introduced by https://github.com/facebook/rocksdb/pull/8336, which made InternalKeyComparator configurable as an unintended side effect.
    • Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.

    Behavior Change

    • In leveled compaction with dynamic leveling, the level multiplier is no longer adjusted due to oversized L0. Instead, the compaction score is adjusted by increasing the level size target by the incoming bytes from upper levels. This would deprioritize compactions from upper levels if more data from L0 is coming. This is to fix some unnecessary full stalling due to drastic change of level targets, while not wasting write bandwidth for compaction while writes are overloaded.
    • For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).
    • WAL compression now computes/verifies checksum during compression/decompression.

    Performance Improvements

    • Rather than doing a total sort against all files in a level, SortFileByOverlappingRatio() now only finds the top 50 files based on score. This can improve write throughput for use cases where data is loaded in increasing key order and there are a lot of files in one LSM-tree, where applying compaction results is the bottleneck.
    • In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
    • In leveled compaction, try to trivial move more than one file if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where applying compaction results is the bottleneck.
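
    The first item above replaces a full sort of a level's files with finding only the best 50 by score. The standard-library way to express "order only the top k" is std::partial_sort, sketched below. This is an illustration rather than the body of SortFileByOverlappingRatio(): FileScore and SelectTopFiles are hypothetical, and it assumes a lower overlap ratio is preferred, with the file number as a tie-breaker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-file score used to pick compaction candidates.
struct FileScore {
  uint64_t file_number;
  uint64_t overlap_ratio;  // assumption: lower is better
};

// Moves the k best files to the front of `files` in sorted order.
// Elements after the first k are left in unspecified order, which is what
// saves work compared with sorting the whole level.
void SelectTopFiles(std::vector<FileScore>* files, std::size_t k) {
  const std::size_t n = std::min(k, files->size());
  std::partial_sort(
      files->begin(), files->begin() + static_cast<std::ptrdiff_t>(n),
      files->end(), [](const FileScore& a, const FileScore& b) {
        if (a.overlap_ratio != b.overlap_ratio) {
          return a.overlap_ratio < b.overlap_ratio;
        }
        return a.file_number < b.file_number;
      });
}
```

    For N files and a fixed k, std::partial_sort performs roughly N log k comparisons instead of N log N, which matters when a level holds many files and only the first few candidates are ever examined.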
  • v7.4.5(Aug 2, 2022)

    7.4.5 (2022-08-02)

    Bug Fixes

    • Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.)
  • v7.4.4(Jul 28, 2022)

    7.4.4 (2022-07-19)

    Public API changes

    • Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.

    Bug Fixes

    • Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.

    7.4.3 (2022-07-13)

    Behavior Changes

    • For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).

    7.4.2 (2022-06-30)

    Bug Fixes

    • Fix a bug in Logger where, if dbname and db_log_dir are on different filesystems, dbname creation would fail with respect to the db_log_dir path, returning an error and failing to open the DB.

    7.4.1 (2022-06-28)

    Bug Fixes

    • Pass rate_limiter_priority through filter block reader functions to FileSystem.

    7.4.0 (2022-06-19)

    Bug Fixes

    • Fixed a bug in calculating key-value integrity protection for users of in-place memtable updates. In particular, the affected users would be those who configure protection_bytes_per_key > 0 on WriteBatch or WriteOptions, and configure inplace_callback != nullptr.
    • Fixed a bug where a snapshot taken during SST file ingestion would be unstable.
    • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where, in case of a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on second recovery. As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If future recovery starts from the old MANIFEST, it means that writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
    • Fixed a bug where RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.
    • Fix a race condition in WAL size tracking which is caused by an unsafe iterator access after container is changed.
    • Fix unprotected concurrent accesses to WritableFileWriter::filesize_ by DB::SyncWAL() and DB::Put() in two write queue mode.
    • Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the db will not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
    • Fix a bug that could return wrong results with index_type=kHashSearch and using SetOptions to change the prefix_extractor.
    • Fixed a bug in WAL tracking with wal_compression. WAL compression writes a kSetCompressionType record which is not associated with any sequence number. As a result, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to the caller, e.g. Checkpoint, Backup, causing the operations to fail.
    • Avoid a crash if the IDENTITY file is accidentally truncated to empty. A new DB ID will be written and generated on Open.
    • Fixed a possible corruption for users of manual_wal_flush and/or FlushWAL(true /* sync */), together with track_and_verify_wals_in_manifest == true. For those users, losing unsynced data (e.g., due to power loss) could make future DB opens fail with a Status::Corruption complaining about missing WAL data.
    • Fixed a bug in WriteBatchInternal::Append() where WAL termination point in write batch was not considered and the function appends an incorrect number of checksums.
    • Fixed a crash bug introduced in 7.3.0 affecting users of MultiGet with kDataBlockBinaryAndHash.
    • Add some fixes in async_io which was doing extra prefetching in shorter scans.

    Public API changes

    • Add new API GetUnixTime in Snapshot class which returns the unix time at which Snapshot is taken.
    • Add transaction get_pinned and multi_get to C API.
    • Add two-phase commit support to C API.
    • Add rocksdb_transaction_get_writebatch_wi and rocksdb_transaction_rebuild_from_writebatch to C API.
    • Add rocksdb_options_get_blob_file_starting_level and rocksdb_options_set_blob_file_starting_level to C API.
    • Add blobFileStartingLevel and setBlobFileStartingLevel to Java API.
    • Add SingleDelete for DB in C API
    • Add User Defined Timestamp in C API.
      • rocksdb_comparator_with_ts_create to create timestamp aware comparator
      • Put, Get, Delete, SingleDelete, MultiGet APIs have corresponding timestamp-aware APIs with suffix with_ts
      • Add C APIs for Transaction, SstFileWriter, Compaction as mentioned here
    • The contract for implementations of Comparator::IsSameLengthImmediateSuccessor has been updated to work around a design bug in auto_prefix_mode.
    • The API documentation for auto_prefix_mode now notes some corner cases in which it returns different results than total_order_seek, due to design bugs that are not easily fixed. Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
    • Obsoleted the NUM_DATA_BLOCKS_READ_PER_LEVEL stat and introduced the NUM_LEVEL_READ_PER_MULTIGET and MULTIGET_COROUTINE_COUNT stats
    • Introduced WriteOptions::protection_bytes_per_key, which can be used to enable key-value integrity protection for live updates.

    New Features

    • Add FileSystem::ReadAsync API in io_tracing
    • Add blob garbage collection parameters blob_garbage_collection_policy and blob_garbage_collection_age_cutoff to both force-enable and force-disable GC, as well as selectively override age cutoff when using CompactRange.
    • Add an extra sanity check in GetSortedWalFiles() (also used by GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint) to reduce risk of successfully created backup or checkpoint failing to open because of missing WAL file.
    • Add a new column family option blob_file_starting_level to enable writing blob files during flushes and compactions starting from the specified LSM tree level.
    • Add support for timestamped snapshots (#9879)
    • Provide support for AbortIO in posix to cancel submitted asynchronous requests using io_uring.
    • Add support for rate-limiting batched MultiGet() APIs

    Behavior changes

    • DB::Open(), DB::OpenAsSecondary() will fail if a Logger cannot be created (#9984)
    • Removed support for reading Bloom filters using obsolete block-based filter format. (Support for writing such filters was dropped in 7.0.) For good read performance on old DBs using these filters, a full compaction is required.
    • Per KV checksum in write batch is verified before a write batch is written to WAL to detect any corruption to the write batch (#10114).

    Performance Improvements

    • When compiled with folly (Meta-internal integration; experimental in open source build), improve the locking performance (CPU efficiency) of LRUCache by using folly DistributedMutex in place of standard mutex.
  • v7.4.3(Jul 18, 2022)

    7.4.3 (2022-07-13)

    Behavior Changes

    • For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).

    7.4.2 (2022-06-30)

    Bug Fixes

    • Fix a bug in Logger where, if dbname and db_log_dir are on different filesystems, dbname creation would fail with respect to the db_log_dir path, returning an error and failing to open the DB.

    7.4.1 (2022-06-28)

    Bug Fixes

    • Pass rate_limiter_priority through filter block reader functions to FileSystem.

    7.4.0 (2022-06-19)

    Bug Fixes

    • Fixed a bug in calculating key-value integrity protection for users of in-place memtable updates. In particular, the affected users would be those who configure protection_bytes_per_key > 0 on WriteBatch or WriteOptions, and configure inplace_callback != nullptr.
    • Fixed a bug where a snapshot taken during SST file ingestion would be unstable.
    • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where, in case of a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on second recovery. As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If future recovery starts from the old MANIFEST, it means that writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
    • Fixed a bug where RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.
    • Fix a race condition in WAL size tracking which is caused by an unsafe iterator access after container is changed.
    • Fix unprotected concurrent accesses to WritableFileWriter::filesize_ by DB::SyncWAL() and DB::Put() in two write queue mode.
    • Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the db will not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
    • Fix a bug that could return wrong results with index_type=kHashSearch and using SetOptions to change the prefix_extractor.
    • Fixed a bug in WAL tracking with wal_compression. WAL compression writes a kSetCompressionType record which is not associated with any sequence number. As a result, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to the caller, e.g. Checkpoint, Backup, causing the operations to fail.
    • Avoid a crash if the IDENTITY file is accidentally truncated to empty. A new DB ID will be written and generated on Open.
    • Fixed a possible corruption for users of manual_wal_flush and/or FlushWAL(true /* sync */), together with track_and_verify_wals_in_manifest == true. For those users, losing unsynced data (e.g., due to power loss) could make future DB opens fail with a Status::Corruption complaining about missing WAL data.
    • Fixed a bug in WriteBatchInternal::Append() where WAL termination point in write batch was not considered and the function appends an incorrect number of checksums.
    • Fixed a crash bug introduced in 7.3.0 affecting users of MultiGet with kDataBlockBinaryAndHash.
    • Add some fixes in async_io which was doing extra prefetching in shorter scans.

    Public API changes

    • Add new API GetUnixTime in Snapshot class which returns the unix time at which Snapshot is taken.
    • Add transaction get_pinned and multi_get to C API.
    • Add two-phase commit support to C API.
    • Add rocksdb_transaction_get_writebatch_wi and rocksdb_transaction_rebuild_from_writebatch to C API.
    • Add rocksdb_options_get_blob_file_starting_level and rocksdb_options_set_blob_file_starting_level to C API.
    • Add blobFileStartingLevel and setBlobFileStartingLevel to Java API.
    • Add SingleDelete for DB in C API
    • Add User Defined Timestamp in C API.
      • rocksdb_comparator_with_ts_create to create timestamp aware comparator
      • Put, Get, Delete, SingleDelete, MultiGet APIs have corresponding timestamp aware APIs with suffix with_ts
      • Also add C APIs for Transaction, SstFileWriter, Compaction as mentioned here
    • The contract for implementations of Comparator::IsSameLengthImmediateSuccessor has been updated to work around a design bug in auto_prefix_mode.
    • The API documentation for auto_prefix_mode now notes some corner cases in which it returns different results than total_order_seek, due to design bugs that are not easily fixed. Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
    • Obsoleted the NUM_DATA_BLOCKS_READ_PER_LEVEL stat and introduced the NUM_LEVEL_READ_PER_MULTIGET and MULTIGET_COROUTINE_COUNT stats
    • Introduced WriteOptions::protection_bytes_per_key, which can be used to enable key-value integrity protection for live updates.

    New Features

    • Add FileSystem::ReadAsync API in io_tracing
    • Add blob garbage collection parameters blob_garbage_collection_policy and blob_garbage_collection_age_cutoff to both force-enable and force-disable GC, as well as selectively override age cutoff when using CompactRange.
    • Add an extra sanity check in GetSortedWalFiles() (also used by GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint) to reduce risk of successfully created backup or checkpoint failing to open because of missing WAL file.
    • Add a new column family option blob_file_starting_level to enable writing blob files during flushes and compactions starting from the specified LSM tree level.
    • Add support for timestamped snapshots (#9879)
    • Provide support for AbortIO in posix to cancel submitted asynchronous requests using io_uring.
    • Add support for rate-limiting batched MultiGet() APIs

    Behavior changes

    • DB::Open(), DB::OpenAsSecondary() will fail if a Logger cannot be created (#9984)
    • Removed support for reading Bloom filters using obsolete block-based filter format. (Support for writing such filters was dropped in 7.0.) For good read performance on old DBs using these filters, a full compaction is required.
    • Per KV checksum in write batch is verified before a write batch is written to WAL to detect any corruption to the write batch (#10114).

    Performance Improvements

    • When compiled with folly (Meta-internal integration; experimental in open source build), improve the locking performance (CPU efficiency) of LRUCache by using folly DistributedMutex in place of standard mutex.
  • v7.3.1(Jun 10, 2022)

    7.3.1 (2022-06-08)

    Bug Fixes

    • Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the db will not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
    • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where in case of crash, min_log_number_to_keep may not change on recovery and persisting a new MANIFEST with advanced log_numbers for some column families, results in "column family inconsistency" error on second recovery. As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
    • Fixed a bug where RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.

    7.3.0 (2022-05-20)

    Bug Fixes

    • Fixed a bug where manual flush would block forever even though flush options had wait=false.
    • Fixed a bug where RocksDB could corrupt DBs with avoid_flush_during_recovery == true by removing valid WALs, leading to Status::Corruption with message like "SST file is ahead of WALs" when attempting to reopen.
    • Fixed a bug in async_io path where incorrect length of data is read by FilePrefetchBuffer if data is consumed from two populated buffers and request for more data is sent.
    • Fixed a CompactionFilter bug. Compaction filter used to use Delete to remove keys, even if the keys should be removed with SingleDelete. Mixing Delete and SingleDelete may cause undefined behavior.
    • Fixed a bug in WritableFileWriter::WriteDirect and WritableFileWriter::WriteDirectWithChecksum. The rate_limiter_priority specified in ReadOptions was not passed to the RateLimiter when requesting a token.
    • Fixed a bug which might cause process crash when I/O error happens when reading an index block in MultiGet().

    New Features

    • DB::GetLiveFilesStorageInfo is ready for production use.
    • Add new stats: PREFETCHED_BYTES_DISCARDED, which records the number of prefetched bytes discarded by RocksDB FilePrefetchBuffer on destruction, and POLL_WAIT_MICROS, which records wait time for FS::Poll API completion.
    • RemoteCompaction supports table_properties_collector_factories override on compaction worker.
    • Start tracking SST unique id in MANIFEST, which will be used to verify with SST properties during DB open to make sure the SST file is not overwritten or misplaced. A DB option verify_sst_unique_id_in_manifest (default: false) is introduced to enable or disable the verification. If enabled, all SST files will be opened during DB open to verify the unique id, so it is recommended to use it with max_open_files = -1 to pre-open the files.
    • Added the ability to concurrently read data blocks from multiple files in a level in batched MultiGet. This can be enabled by setting the async_io option in ReadOptions. Using this feature requires a FileSystem that supports ReadAsync (PosixFileSystem is not supported yet for this), and for RocksDB to be compiled with folly and c++20.
    • Add FileSystem::ReadAsync API in io_tracing.
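The unique-id check described for verify_sst_unique_id_in_manifest can be pictured with a small standalone sketch. This is only an illustration under stated assumptions: UniqueId128 and VerifySstUniqueIds are hypothetical names, not RocksDB APIs, and the maps stand in for "ids recorded in MANIFEST" and "ids read from SST properties".

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a 128-bit SST unique id (two 64-bit words).
using UniqueId128 = std::array<uint64_t, 2>;

// Hypothetical helper, not RocksDB code. Both maps go from file number to id.
// Returns true only if every file tracked in the manifest is present on disk
// and carries the same id, i.e. it was not overwritten or misplaced.
inline bool VerifySstUniqueIds(
    const std::map<uint64_t, UniqueId128>& manifest_ids,
    const std::map<uint64_t, UniqueId128>& ids_from_sst_properties) {
  for (const auto& entry : manifest_ids) {
    auto it = ids_from_sst_properties.find(entry.first);
    if (it == ids_from_sst_properties.end() || it->second != entry.second) {
      return false;  // missing file or mismatching id
    }
  }
  return true;
}
```

Because the check needs the properties of every tracked file, it is easy to see why the note above suggests pre-opening files with max_open_files = -1.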

    Public API changes

    • Add rollback_deletion_type_callback to TransactionDBOptions so that write-prepared transactions know whether to issue a Delete or SingleDelete to cancel a previous key written during prior prepare phase. The PR aims to prevent mixing SingleDeletes and Deletes for the same key that can lead to undefined behaviors for write-prepared transactions.
    • EXPERIMENTAL: Add new API AbortIO in file_system to abort the read requests submitted asynchronously.
    • CompactionFilter::Decision has a new value: kRemoveWithSingleDelete. If CompactionFilter returns this decision, then CompactionIterator will use SingleDelete to mark a key as removed.
    • Renamed CompactionFilter::Decision::kRemoveWithSingleDelete to kPurge since the latter sounds more general and hides the implementation details of how compaction iterator handles keys.
    • Added ability to specify functions for Prepare and Validate to OptionsTypeInfo. Added methods to OptionTypeInfo to set the functions via an API. These methods are intended for RocksDB plugin developers for configuration management.
    • Added a new immutable DB option, enforce_single_del_contracts. If set to false (default is true), compaction will NOT fail due to a single delete followed by a delete for the same key. The purpose of this temporary option is to help existing use cases migrate.
    • Introduce BlockBasedTableOptions::cache_usage_options and use that to replace BlockBasedTableOptions::reserve_table_builder_memory and BlockBasedTableOptions::reserve_table_reader_memory.
    • Changed GetUniqueIdFromTableProperties to return a 128-bit unique identifier, which will be the standard size now. The old functionality (192-bit) is available from GetExtendedUniqueIdFromTableProperties. Both functions are no longer "experimental" and are ready for production use.
    • In IOOptions, mark prio as deprecated for future removal.
    • In file_system.h, mark IOPriority as deprecated for future removal.
    • Add an option, CompressionOptions::use_zstd_dict_trainer, to indicate whether zstd dictionary trainer should be used for generating zstd compression dictionaries. The default value of this option is true for backward compatibility. When this option is set to false, zstd API ZDICT_finalizeDictionary is used to generate compression dictionaries.
    • The Seek API, which positions every LevelIterator on the correct data block in the correct SST file, can be parallelized if the ReadOptions.async_io option is enabled.
    • Add new stat number_async_seek in PerfContext that indicates number of async calls made by seek to prefetch data.

    Bug Fixes

    • RocksDB calls FileSystem::Poll API during FilePrefetchBuffer destruction which impacts performance as it waits for read requests completion which is not needed anymore. Calling FileSystem::AbortIO to abort those requests instead fixes that performance issue.
    • Fixed unnecessary block cache contention when queries within a MultiGet batch and across parallel batches access the same data block, which previously could cause severely degraded performance in this unusual case. (In more typical MultiGet cases, this fix is expected to yield a small or negligible performance improvement.)

    Behavior changes

    • Enforce the existing contract of SingleDelete so that SingleDelete cannot be mixed with Delete because it leads to undefined behavior. Fix a number of unit tests that violate the contract but happen to pass.
    • ldb --try_load_options defaults to true if --db is specified and not creating a new DB. The user can still explicitly disable that by --try_load_options=false (or explicitly enable that by --try_load_options).
    • During Flush write or Compaction write/read, the WriteController is used to determine whether DB writes are stalled or slowed down. The priority (Env::IOPriority) can then be determined accordingly and be passed in IOOptions to the file system.
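The SingleDelete contract enforced above (no mixing of SingleDelete and Delete for the same key) can be sketched as a check over a sequence of operations. This is a deliberately simplified model, not RocksDB's enforcement logic: OpType, Op, and RespectsSingleDeleteContract are hypothetical names, and the model flags any key that ever sees both kinds of delete.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical operation model for illustration only.
enum class OpType { kPut, kDelete, kSingleDelete };

struct Op {
  OpType type;
  std::string key;
};

// Returns true if no key is removed with both Delete and SingleDelete.
inline bool RespectsSingleDeleteContract(const std::vector<Op>& ops) {
  // Per key: bit 0 set = saw Delete, bit 1 set = saw SingleDelete.
  std::unordered_map<std::string, unsigned> seen;
  for (const Op& op : ops) {
    unsigned& bits = seen[op.key];
    if (op.type == OpType::kDelete) bits |= 1u;
    if (op.type == OpType::kSingleDelete) bits |= 2u;
    if (bits == 3u) return false;  // both delete kinds used on one key
  }
  return true;
}
```

A sequence such as Put(a), SingleDelete(a), Delete(a) fails this check, which is the pattern that enforce_single_del_contracts = false temporarily tolerates during migration.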
  • v7.2.2(May 5, 2022)

    7.2.2 (2022-04-28)

    Bug Fixes

    • Fixed a bug in async_io path where incorrect length of data is read by FilePrefetchBuffer if data is consumed from two populated buffers and request for more data is sent.

    7.2.1 (2022-04-26)

    Bug Fixes

    • Fixed a bug where RocksDB could corrupt DBs with avoid_flush_during_recovery == true by removing valid WALs, leading to Status::Corruption with message like "SST file is ahead of WALs" when attempting to reopen.
    • RocksDB calls FileSystem::Poll API during FilePrefetchBuffer destruction which impacts performance as it waits for read requests completion which is not needed anymore. Calling FileSystem::AbortIO to abort those requests instead fixes that performance issue.

    7.2.0 (2022-04-15)

    Bug Fixes

    • Fixed bug which caused rocksdb failure in the situation when rocksdb was accessible using UNC path
    • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
    • Fixed a heap use-after-free race with DropColumnFamily.
    • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
    • Fixed file_type, relative_filename and directory fields returned by GetLiveFilesMetaData(), which were added in inheriting from FileStorageInfo.
    • Fixed a bug affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#9766).
    • Fix segfault in FilePrefetchBuffer with async_io as it doesn't wait for pending jobs to complete on destruction.
    • Fix ERROR_HANDLER_AUTORESUME_RETRY_COUNT stat whose value was set wrong in portal.h
    • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where in case of crash, min_log_number_to_keep may not change on recovery and persisting a new MANIFEST with advanced log_numbers for some column families, results in "column family inconsistency" error on second recovery. As a solution the corrupted WALs whose numbers are larger than the corrupted wal and smaller than the new WAL will be moved to archive folder.
    • Fixed a bug in RocksDB DB::Open() which may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.

    New Features

    • For db_bench when --seed=0 or --seed is not set then it uses the current time as the seed value. Previously it used the value 1000.
    • For db_bench when --benchmark lists multiple tests and each test uses a seed for a RNG then the seeds across tests will no longer be repeated.
    • Added an option to dynamically charge an updating estimated memory usage of block-based table reader to block cache if block cache available. To enable this feature, set BlockBasedTableOptions::reserve_table_reader_memory = true.
    • Add new stat ASYNC_READ_BYTES that calculates number of bytes read during async read call and users can check if async code path is being called by RocksDB internal automatic prefetching for sequential reads.
    • Enable async prefetching if ReadOptions.readahead_size is set along with ReadOptions.async_io in FilePrefetchBuffer.
    • Add event listener support on remote compaction compactor side.
    • Added a dedicated integer DB property rocksdb.live-blob-file-garbage-size that exposes the total amount of garbage in the blob files in the current version.
    • RocksDB does internal auto prefetching if it notices sequential reads. It starts with readahead size initial_auto_readahead_size which now can be configured through BlockBasedTableOptions.
    • Add a merge operator that allows users to register specific aggregation functions so that they can do aggregation using different aggregation types for different keys. See comments in include/rocksdb/utilities/agg_merge.h for actual usage. The feature is experimental and the format is subject to change and we won't provide a migration tool.
    • Meta-internal / Experimental: Improve CPU performance by replacing many uses of std::unordered_map with folly::F14FastMap when RocksDB is compiled together with Folly.
    • Experimental: Add CompressedSecondaryCache, a concrete implementation of rocksdb::SecondaryCache, that integrates with compression libraries (e.g. LZ4) to hold compressed blocks.
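The new db_bench seeding rule from the list above (--seed=0 or unset now means "use the current time" instead of the old fixed value 1000) amounts to a one-line selection. The sketch below is an assumption-laden illustration: EffectiveSeed is a hypothetical helper, not db_bench source, and the current time is passed in as a parameter so the behavior is easy to check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not db_bench code. A seed flag of 0 (or unset, which
// is modeled as 0 here) selects the current time; any other value is used
// as given.
inline uint64_t EffectiveSeed(uint64_t seed_flag, uint64_t now_micros) {
  return seed_flag == 0 ? now_micros : seed_flag;
}
```

Passing an explicit non-zero --seed therefore remains the way to get repeatable benchmark runs.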

    Behavior changes

    • Disallow usage of commit-time-write-batch for write-prepared/write-unprepared transactions if TransactionOptions::use_only_the_last_commit_time_batch_for_recovery is false to prevent two (or more) uncommitted versions of the same key in the database. Otherwise, bottommost compaction may violate the internal key uniqueness invariant of SSTs if the sequence numbers of both internal keys are zeroed out (#9794).
    • Make DB::GetUpdatesSince() return NotSupported early for write-prepared/write-unprepared transactions, as the API contract indicates.

    Public API changes

    • Exposed APIs to examine results of block cache stats collections in a structured way. In particular, users of GetMapProperty() with property kBlockCacheEntryStats can now use the functions in BlockCacheEntryStatsMapKeys to find stats in the map.
    • Add fail_if_not_bottommost_level to IngestExternalFileOptions so that ingestion will fail if the file(s) cannot be ingested to the bottommost level.
    • Add output parameter is_in_sec_cache to SecondaryCache::Lookup(). It is to indicate whether the handle is possibly erased from the secondary cache after the Lookup.
  • v7.1.2(Apr 20, 2022)

    7.1.2 (2022-04-19)

    Bug Fixes

    • Fixed bug which caused rocksdb failure in the situation when rocksdb was accessible using UNC path
    • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
    • Fixed a heap use-after-free race with DropColumnFamily.
    • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
    • Fixed file_type, relative_filename and directory fields returned by GetLiveFilesMetaData(), which were added in inheriting from FileStorageInfo.
    • Fixed a bug affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#9766).
  • v7.1.1(Apr 13, 2022)

    7.1.1 (2022-04-07)

    Bug Fixes

    • Fix segfault in FilePrefetchBuffer with async_io as it doesn't wait for pending jobs to complete on destruction.

    7.1.0 (2022-03-23)

    New Features

    • Allow WriteBatchWithIndex to index a WriteBatch that includes keys with user-defined timestamps. The index itself does not have timestamp.
    • Add support for user-defined timestamps to write-committed transaction without API change. The TransactionDB layer APIs do not allow timestamps because we require that all user-defined-timestamps-aware operations go through the Transaction APIs.
    • Added BlobDB options to ldb
    • BlockBasedTableOptions::detect_filter_construct_corruption can now be dynamically configured using DB::SetOptions.
    • Automatically recover from retryable read IO errors during background flush/compaction.
    • Experimental support for preserving file Temperatures through backup and restore, and for updating DB metadata for outside changes to file Temperature (UpdateManifestForFilesState or ldb update_manifest --update_temperatures).
    • Experimental support for async_io in ReadOptions which is used by FilePrefetchBuffer to prefetch some of the data asynchronously, if reads are sequential and auto readahead is enabled by rocksdb internally.

    Bug Fixes

    • Fixed a major performance bug in which Bloom filters generated by pre-7.0 releases are not read by early 7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name() in #9590. This can severely impact read performance and read I/O on upgrade or downgrade with existing DB, but not data correctness.
    • Fixed a data race on versions_ between DBImpl::ResumeImpl() and threads waiting for recovery to complete (#9496)
    • Fixed a bug caused by a race among flush, incoming writes and taking snapshots. Queries to snapshots created under this race condition can return incorrect results, e.g. resurfacing deleted data.
    • Fixed a bug that DB flush uses options.compression even if options.compression_per_level is set.
    • Fixed a bug that DisableManualCompaction may assert when disabling an unscheduled manual compaction.
    • Fix a race condition when canceling manual compaction with DisableManualCompaction. Also DB close can cancel the manual compaction thread.
    • Fixed a potential timer crash when opening and closing DB concurrently.
    • Fixed a race condition for alive_log_files_ in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
    • Fixed a bug that Iterator::Refresh() reads stale keys after DeleteRange() is performed.
    • Fixed a race condition when disabling and re-enabling manual compaction.
    • Fixed automatic error recovery failure in atomic flush.
    • Fixed a race condition when mmaping a WritableFile on POSIX.

    Public API changes

    • Added pure virtual FilterPolicy::CompatibilityName(), which is needed for fixing major performance bug involving FilterPolicy naming in SST metadata without affecting Customizable aspect of FilterPolicy. This change only affects those with their own custom or wrapper FilterPolicy classes.
    • options.compression_per_level is dynamically changeable with SetOptions().
    • Added WriteOptions::rate_limiter_priority. When set to something other than Env::IO_TOTAL, the internal rate limiter (DBOptions::rate_limiter) will be charged at the specified priority for writes associated with the API to which the WriteOptions was provided. Currently the support covers automatic WAL flushes, which happen during live updates (Put(), Write(), Delete(), etc.) when WriteOptions::disableWAL == false and DBOptions::manual_wal_flush == false.
    • Add DB::OpenAndTrimHistory API. This API will open DB and trim data to the timestamp specified by trim_ts (The data with timestamp larger than specified trim bound will be removed). This API should only be used at a timestamp-enabled column families recovery. If the column family doesn't have timestamp enabled, this API won't trim any data on that column family. This API is not compatible with avoid_flush_during_recovery option.
    • Remove BlockBasedTableOptions.hash_index_allow_collision which already takes no effect.
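The trimming rule described for DB::OpenAndTrimHistory above (data with a timestamp larger than trim_ts is removed) can be modeled on a plain vector. This is a sketch under assumptions, not the RocksDB implementation: VersionedEntry and TrimHistory are hypothetical names, and timestamps are modeled as uint64_t for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of one timestamped key version.
struct VersionedEntry {
  std::string key;
  uint64_t ts;  // assumption: timestamps modeled as plain integers
  std::string value;
};

// Hypothetical helper: drops every entry whose timestamp is larger than
// trim_ts and keeps entries with ts <= trim_ts in their original order.
inline void TrimHistory(std::vector<VersionedEntry>& entries,
                        uint64_t trim_ts) {
  entries.erase(
      std::remove_if(entries.begin(), entries.end(),
                     [trim_ts](const VersionedEntry& e) {
                       return e.ts > trim_ts;
                     }),
      entries.end());
}
```

In this model a column family without timestamps would simply never be passed to TrimHistory, matching the note that such column families are left untouched.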
  • v7.0.4(Mar 29, 2022)

    7.0.4 (2022-03-29)

    Bug Fixes

    • Fixed a race condition when disabling and re-enabling manual compaction.
    • Fixed a race condition for alive_log_files_ in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
    • Fixed a race condition when mmaping a WritableFile on POSIX.
    • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
    • Fixed a heap use-after-free race with DropColumnFamily.
    • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
  • v6.29.5(Mar 29, 2022)

    6.29.5 (2022-03-29)

    Bug Fixes

    • Fixed a race condition for alive_log_files_ in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
    • Fixed a race condition when mmaping a WritableFile on POSIX.
    • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
    • Fixed a heap use-after-free race with DropColumnFamily.
    • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
  • v7.0.3(Mar 25, 2022)

    7.0.3 (2022-03-25)

    Bug Fixes

    • Fixed a major performance bug in which Bloom filters generated by pre-7.0 releases are not read by early 7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name() in #9590. This can severely impact read performance and read I/O on upgrade or downgrade with existing DB, but not data correctness.
    • Fixed a bug that Iterator::Refresh() reads stale keys after DeleteRange() is performed.

    Public API changes

    • Added pure virtual FilterPolicy::CompatibilityName(), which is needed for fixing major performance bug involving FilterPolicy naming in SST metadata without affecting Customizable aspect of FilterPolicy. For source code, this change only affects those with their own custom or wrapper FilterPolicy classes, but does break compiled library binary compatibility in a patch release.
    • Since RocksDB 7, RocksJava now requires Java 8 (previously Java 7).
  • v6.29.4(Mar 23, 2022)

    6.29.4 (2022-03-22)

    Bug Fixes

    • Fixed a bug caused by a race among flush, incoming writes and taking snapshots. Queries to snapshots created under this race condition can return incorrect results, e.g. resurfacing deleted data.
    • Fixed a bug that DisableManualCompaction may assert when disabling an unscheduled manual compaction.
    • Fixed a bug that Iterator::Refresh() reads stale keys after DeleteRange() is performed.
    • Fixed a race condition when disabling and re-enabling manual compaction.
    • Fix a race condition when canceling manual compaction with DisableManualCompaction. Also DB close can cancel the manual compaction thread.
    • Fixed a data race on versions_ between DBImpl::ResumeImpl() and threads waiting for recovery to complete (#9496)
    • Fixed a read-after-free bug in DB::GetMergeOperands().
    • Fixed NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats to be reported once per MultiGet batch per level.
  • v7.0.2(Mar 14, 2022)

  • v7.0.1(Mar 12, 2022)

    Rocksdb Change Log

    7.0.1 (2022-03-02)

    Bug Fixes

    • Fix a race condition when canceling manual compaction with DisableManualCompaction. Also DB close can cancel the manual compaction thread.
    • Fixed a data race on versions_ between DBImpl::ResumeImpl() and threads waiting for recovery to complete (#9496)
    • Fixed a bug caused by a race among flush, incoming writes and taking snapshots. Queries to snapshots created under this race condition can return incorrect results, e.g. resurfacing deleted data.

    7.0.0 (2022-02-20)

    Bug Fixes

    • Fixed a major bug in which batched MultiGet could return old values for keys deleted by DeleteRange when memtable Bloom filter is enabled (memtable_prefix_bloom_size_ratio > 0). (The fix includes a substantial MultiGet performance improvement in the unusual case of both memtable_whole_key_filtering and prefix_extractor.)
    • Fixed more cases of EventListener::OnTableFileCreated called with OK status, file_size==0, and no SST file kept. Now the status is Aborted.
    • Fixed a read-after-free bug in DB::GetMergeOperands().
    • Fix a data loss bug for 2PC write-committed transaction caused by concurrent transaction commit and memtable switch (#9571).
    • Fixed NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats to be reported once per MultiGet batch per level.

    Performance Improvements

    • Mitigated the overhead of building the file location hash table used by the online LSM tree consistency checks, which can improve performance for certain workloads (see #9351).
    • Switched to using a sorted std::vector instead of std::map for storing the metadata objects for blob files, which can improve performance for certain workloads, especially when the number of blob files is high.
    • DisableManualCompaction() doesn't have to wait for a scheduled manual compaction to be executed in the thread pool to cancel the job.

    Public API changes

    • Require C++17 compatible compiler (GCC >= 7, Clang >= 5, Visual Studio >= 2017) for compiling RocksDB and any code using RocksDB headers (previously required C++11). See #9388.
    • Require Java 8 for compiling RocksJava (previously Java 7). See #9541
    • Removed deprecated automatic finalization of RocksJava RocksObjects, the user must explicitly call close() on their RocksJava objects. See #9523.
    • Added ReadOptions::rate_limiter_priority. When set to something other than Env::IO_TOTAL, the internal rate limiter (DBOptions::rate_limiter) will be charged at the specified priority for file reads associated with the API to which the ReadOptions was provided.
    • Remove HDFS support from main repo.
    • Remove librados support from main repo.
    • Remove obsolete backupable_db.h and type alias BackupableDBOptions. Use backup_engine.h and BackupEngineOptions. Similar renamings are in the C and Java APIs.
    • Removed obsolete utility_db.h and UtilityDB::OpenTtlDB. Use db_ttl.h and DBWithTTL::Open.
    • Remove deprecated API DB::AddFile from main repo.
    • Remove deprecated API ObjectLibrary::Register() and the (now obsolete) Regex public API. Use ObjectLibrary::AddFactory() with PatternEntry instead.
    • Remove deprecated option DBOption::table_cache_remove_scan_count_limit.
    • Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit.
    • Remove deprecated API AdvancedColumnFamilyOptions::hard_rate_limit.
    • Remove deprecated API DBOption::base_background_compactions.
    • Remove deprecated API DBOptions::purge_redundant_kvs_while_flush.
    • Remove deprecated overloads of API DB::CompactRange.
    • Remove deprecated option DBOptions::skip_log_error_on_recovery.
    • Remove ReadOptions::iter_start_seqnum which has been deprecated.
    • Remove DBOptions::preserved_deletes and DB::SetPreserveDeletesSequenceNumber().
    • Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds.
    • Removed timestamp from WriteOptions. Accordingly, added to DB APIs Put, Delete, SingleDelete, etc. accepting an additional argument 'timestamp'. Added Put, Delete, SingleDelete, etc to WriteBatch accepting an additional argument 'timestamp'. Removed WriteBatch::AssignTimestamps(vector) API. Renamed WriteBatch::AssignTimestamp() to WriteBatch::UpdateTimestamps() with clarified comments.
    • Changed type of cache buffer passed to Cache::CreateCallback from void* to const void*.
    • Significant updates to FilterPolicy-related APIs and configuration:
      • Remove public API support for deprecated, inefficient block-based filter (use_block_based_builder=true).
        • Old code and configuration strings that would enable it now quietly enable full filters instead, though any built-in FilterPolicy can still read block-based filters. This includes changing the longstanding default behavior of the Java API.
        • Remove deprecated FilterPolicy::CreateFilter() and FilterPolicy::KeyMayMatch()
        • Remove rocksdb_filterpolicy_create() from C API, as the only C API support for custom filter policies is now obsolete.
        • If temporary memory usage in full filter creation is a problem, consider using partitioned filters, smaller SST files, or setting reserve_table_builder_memory=true.
      • Remove support for "filter_policy=experimental_ribbon" configuration string. Use something like "filter_policy=ribbonfilter:10" instead.
      • Allow configuration string like "filter_policy=bloomfilter:10" without bool, to minimize acknowledgement of obsolete block-based filter.
      • Made FilterPolicy Customizable. Configuration of filter_policy is now accurately saved in OPTIONS file and can be loaded with LoadOptionsFromFile. (Loading an OPTIONS file generated by a previous version only enables reading and using existing filters, not generating new filters. Previously, no filter_policy would be configured from a saved OPTIONS file.)
      • Change meaning of nullptr return from GetBuilderWithContext() from "use block-based filter" to "generate no filter in this case."
        • Also, when user specifies bits_per_key < 0.5, we now round this down to "no filter" because we expect a filter with >= 80% FP rate is unlikely to be worth the CPU cost of accessing it (esp with cache_index_and_filter_blocks=1 or partition_filters=1).
        • bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP rate)
      • Remove class definitions for FilterBitsBuilder and FilterBitsReader from public API, so these can evolve more easily as implementation details. Custom FilterPolicy can still decide what kind of built-in filter to use under what conditions.
      • Also removed deprecated functions
        • FilterPolicy::GetFilterBitsBuilder()
        • NewExperimentalRibbonFilterPolicy()
      • Remove default implementations of
        • FilterPolicy::GetBuilderWithContext()
    • Remove default implementation of Name() from FileSystemWrapper.
    • Rename SizeApproximationOptions.include_memtabtles to SizeApproximationOptions.include_memtables.
    • Remove deprecated option DBOptions::max_mem_compaction_level.
    • Return Status::InvalidArgument from ObjectRegistry::NewObject if a factory exists but the object could not be created (returns NotFound if the factory is missing).
    • Remove deprecated overloads of API DB::GetApproximateSizes.
    • Remove deprecated option DBOptions::new_table_reader_for_compaction_inputs.
    • Add Transaction::SetReadTimestampForValidation() and Transaction::SetCommitTimestamp(). Default impl returns NotSupported().
    • Add support for decimal patterns to ObjectLibrary::PatternEntry
    • Remove deprecated remote compaction APIs CompactionService::Start() and CompactionService::WaitForComplete(). Please use CompactionService::StartV2(), CompactionService::WaitForCompleteV2() instead, which provides the same information plus extra data like priority, db_id, etc.
    • ColumnFamilyOptions::OldDefaults and DBOptions::OldDefaults are marked deprecated, as they are no longer maintained.
    • Add subcompaction callback APIs: OnSubcompactionBegin() and OnSubcompactionCompleted().
    • Add file Temperature information to FileOperationInfo in event listener API.
    • Change the type of SizeApproximationFlags from enum to enum class. Also update the signature of DB::GetApproximateSizes API from uint8_t to SizeApproximationFlags.
    • Add Temperature hints information from RocksDB in API NewSequentialFile(). Backup and checkpoint operations need to open the source files with NewSequentialFile(), which will have the temperature hints. Other operations are not covered.
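
    The bits_per_key rounding described in the FilterPolicy notes above can be sketched as a small standalone function. This is only an illustration of the stated thresholds: EffectiveBitsPerKey is a hypothetical name, not a RocksDB API, and a return value of 0.0 is used here to stand for "no filter".

```cpp
// Hypothetical helper (not part of RocksDB): models the rounding rule
// described in the release notes. A result of 0.0 means "generate no filter".
double EffectiveBitsPerKey(double bits_per_key) {
  if (bits_per_key < 0.5) {
    return 0.0;  // rounded down to "no filter"
  }
  if (bits_per_key < 1.0) {
    return 1.0;  // values in [0.5, 1.0) are rounded up to 1.0
  }
  return bits_per_key;  // values >= 1.0 are left unchanged in this sketch
}
```

    Under this sketch, a setting of 0.3 yields no filter, 0.5 and 0.99 both yield 1.0, and 10.0 is passed through unchanged.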

    Behavior Changes

    • Disallow the combination of DBOptions.use_direct_io_for_flush_and_compaction == true and DBOptions.writable_file_max_buffer_size == 0. This combination can cause WritableFileWriter::Append() to loop forever, and it does not make much sense in direct IO.
    • ReadOptions::total_order_seek no longer affects DB::Get(). The original motivation for this interaction has been obsolete since RocksDB has been able to detect whether the current prefix extractor is compatible with that used to generate table files, probably RocksDB 5.14.0.
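
    The disallowed option combination above can be expressed as a simple predicate. The sketch below uses a hypothetical struct that only mirrors the two DBOptions fields named in the note; it is not RocksDB's own validation code.

```cpp
#include <cstddef>

// Hypothetical stand-in for the two DBOptions fields named in the note.
struct DirectIoOptions {
  bool use_direct_io_for_flush_and_compaction;
  std::size_t writable_file_max_buffer_size;
};

// Returns false only for the combination the release note says is rejected:
// direct IO for flush and compaction together with a zero-sized write buffer.
bool IsAllowedCombination(const DirectIoOptions& opts) {
  return !(opts.use_direct_io_for_flush_and_compaction &&
           opts.writable_file_max_buffer_size == 0);
}
```

    A zero buffer size remains acceptable in this sketch as long as direct IO for flush and compaction is off.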

    New Features

    • Introduced an option BlockBasedTableOptions::detect_filter_construct_corruption for detecting corruption during Bloom Filter (format_version >= 5) and Ribbon Filter construction.
    • Improved the SstDumpTool to read the comparator from table properties and use it to read the SST File.
    • Extended the column family statistics in the info log so the total amount of garbage in the blob files and the blob file space amplification factor are also logged. Also exposed the blob file space amp via the rocksdb.blob-stats DB property.
    • Introduced the API rocksdb_create_dir_if_missing in c.h that calls underlying file system's CreateDirIfMissing API to create the directory.
    • Added last level and non-last level read statistics: LAST_LEVEL_READ_*, NON_LAST_LEVEL_READ_*.
    • Experimental: Add support for new APIs ReadAsync in FSRandomAccessFile that reads the data asynchronously and Poll API in FileSystem that checks if requested read request has completed or not. ReadAsync takes a callback function. Poll API checks for completion of read IO requests and should call callback functions to indicate completion of read requests.
  • v6.29.3(Feb 18, 2022)

    Rocksdb Change Log

    6.29.3 (2022-02-17)

    Bug Fixes

    • Fix a data loss bug for 2PC write-committed transaction caused by concurrent transaction commit and memtable switch (#9571).

    6.29.2 (2022-02-15)

    Performance Improvements

    • DisableManualCompaction() no longer has to wait for a scheduled manual compaction to be executed in the thread pool before cancelling the job.

    6.29.1 (2022-01-31)

    Bug Fixes

    • Fixed a major bug in which batched MultiGet could return old values for keys deleted by DeleteRange when memtable Bloom filter is enabled (memtable_prefix_bloom_size_ratio > 0). (The fix includes a substantial MultiGet performance improvement in the unusual case of both memtable_whole_key_filtering and prefix_extractor.)

    6.29.0 (2022-01-21)

    Note: The next release will be major release 7.0. See https://github.com/facebook/rocksdb/issues/9390 for more info.

    Public API change

    • Added values to TraceFilterType: kTraceFilterIteratorSeek, kTraceFilterIteratorSeekForPrev, and kTraceFilterMultiGet. They can be set in TraceOptions to filter out the operation types after which they are named.
    • Added TraceOptions::preserve_write_order. When enabled it guarantees write records are traced in the same order they are logged to WAL and applied to the DB. By default it is disabled (false) to match the legacy behavior and prevent regression.
    • Made the Env class extend the Customizable class. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Options::OldDefaults is marked deprecated, as it is no longer maintained.
    • Add ObjectLibrary::AddFactory and ObjectLibrary::PatternEntry classes. This method and associated class are the preferred mechanism for registering factories with the ObjectLibrary going forward. The ObjectLibrary::Register method, which uses regular expressions and may be problematic, is deprecated and will be removed in a future release.
    • Changed BlockBasedTableOptions::block_size from size_t to uint64_t.
    • Added API warning against using Iterator::Refresh() together with DB::DeleteRange(), which are incompatible and have always risked causing the refreshed iterator to return incorrect results.

    Behavior Changes

    • DB::DestroyColumnFamilyHandle() will return Status::InvalidArgument() if called with DB::DefaultColumnFamily().
    • On 32-bit platforms, mmap reads are no longer quietly disabled, just discouraged.

    New Features

    • Added Options::DisableExtraChecks() that can be used to improve peak write performance by disabling checks that should not be necessary in the absence of software logic errors or CPU+memory hardware errors. (Default options are slowly moving toward some performance overheads for extra correctness checking.)

    Performance Improvements

    • Improved read performance when a prefix extractor is used (Seek, Get, MultiGet), even compared to version 6.25 baseline (see bug fix below), by optimizing the common case of prefix extractor compatible with table file and unchanging.

    Bug Fixes

    • Fixed a bug in which FlushMemTable may return OK even when the flush did not succeed.
    • Fixed a bug of Sync() and Fsync() not using fcntl(F_FULLFSYNC) on OS X and iOS.
    • Fixed a significant performance regression in version 6.26 when a prefix extractor is used on the read path (Seek, Get, MultiGet). (Excessive time was spent in SliceTransform::AsString().)

    New Features

    • Added RocksJava support for MacOS universal binary (ARM+x86)
  • v6.28.2(Feb 2, 2022)

    6.28.2 (2022-01-31)

    Bug Fixes

    • Fixed a major bug in which batched MultiGet could return old values for keys deleted by DeleteRange when memtable Bloom filter is enabled (memtable_prefix_bloom_size_ratio > 0). (The fix includes a substantial MultiGet performance improvement in the unusual case of both memtable_whole_key_filtering and prefix_extractor.)

    6.28.1 (2022-01-10)

    Bug Fixes

    • Fixed compilation errors on newer compiler, e.g. clang-12

    6.28.0 (2021-12-17)

    New Features

    • Introduced 'CommitWithTimestamp' as a new tag. Currently, there is no API for users to trigger a write with this tag to the WAL. This is part of the effort to support write-committed transactions with user-defined timestamps.

    Bug Fixes

    • Fixed a bug in RocksDB's automatic implicit prefetching, which was broken by the new adaptive_readahead feature: internal prefetching got disabled when the iterator moved from one file to the next.
    • Fixed a bug in TableOptions.prepopulate_block_cache which causes segmentation fault when used with TableOptions.partition_filters = true and TableOptions.cache_index_and_filter_blocks = true.
    • Fixed a bug affecting custom memtable factories which are not registered with the ObjectRegistry. The bug could result in failure to save the OPTIONS file.
    • Fixed a bug causing two duplicate entries to be appended to a file opened in non-direct mode and tracked by FaultInjectionTestFS.
    • Fixed a bug in TableOptions.prepopulate_block_cache to support block-based filters also.
    • Block cache keys no longer use FSRandomAccessFile::GetUniqueId() (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect result or crash (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.

    Behavior Changes

    • MemTableList::TrimHistory now uses allocated bytes when max_write_buffer_size_to_maintain > 0 (default in TransactionDB, introduced in PR#5022). Fixes #8371.

    Public API change

    • Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional checker argument that performs additional checking on timestamp sizes.
    • Introduce a new EventListener callback that will be called upon the end of automatic error recovery.

    Performance Improvements

    • Replaced the map property TableProperties::properties_offsets with the uint64_t property external_sst_file_global_seqno_offset to reduce the memory used by table properties.
    • Block cache accesses are faster by RocksDB using cache keys of fixed size (16 bytes).

    Java API Changes

    • Removed Java API TableProperties.getPropertiesOffsets() as it exposed internal details to external users.
  • v6.27.3(Dec 20, 2021)

    6.27.3 (2021-12-10)

    Bug Fixes

    • Fixed a bug in TableOptions.prepopulate_block_cache which causes segmentation fault when used with TableOptions.partition_filters = true and TableOptions.cache_index_and_filter_blocks = true.
    • Fixed a bug affecting custom memtable factories which are not registered with the ObjectRegistry. The bug could result in failure to save the OPTIONS file.

    6.27.2 (2021-12-01)

    Bug Fixes

    • Fixed a bug in RocksDB's automatic implicit prefetching, which was broken by the new adaptive_readahead feature: internal prefetching got disabled when the iterator moved from one file to the next.

    6.27.1 (2021-11-29)

    Bug Fixes

    • Fixed a bug that could, with WAL enabled, cause backups, checkpoints, and GetSortedWalFiles() to fail randomly with an error like IO error: 001234.log: No such file or directory

    6.27.0 (2021-11-19)

    New Features

    • Added new ChecksumType kXXH3 which is faster than kCRC32c on almost all x86_64 hardware.
    • Added a new online consistency check for BlobDB which validates that the number/total size of garbage blobs does not exceed the number/total size of all blobs in any given blob file.
    • Provided support for tracking per-sst user-defined timestamp information in MANIFEST.
    • Added new option "adaptive_readahead" in ReadOptions. For iterators, RocksDB does auto-readahead on noticing sequential reads and by enabling this option, readahead_size of current file (if reads are sequential) will be carried forward to next file instead of starting from the scratch at each level (except L0 level files). If reads are not sequential it will fall back to 8KB. This option is applicable only for RocksDB internal prefetch buffer and isn't supported with underlying file system prefetching.
    • Added the read count and read bytes related stats to Statistics for tiered storage hot, warm, and cold file reads.
    • Added an option to dynamically charge an updating estimated memory usage of block-based table building to block cache if block cache available. It currently only includes charging memory usage of constructing (new) Bloom Filter and Ribbon Filter to block cache. To enable this feature, set BlockBasedTableOptions::reserve_table_builder_memory = true.
    • Add a new API OnIOError in listener.h that notifies listeners when an IO error occurs during FileSystem operation along with filename, status etc.
    • Added compaction readahead support for blob files to the integrated BlobDB implementation, which can improve compaction performance when the database resides on higher-latency storage like HDDs or remote filesystems. Readahead can be configured using the column family option blob_compaction_readahead_size.
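
    The BlobDB consistency check listed above boils down to two inequalities per blob file. The following is a minimal sketch of that invariant, assuming hypothetical counter names; it does not reflect how RocksDB stores or validates these values internally.

```cpp
#include <cstdint>

// Hypothetical per-blob-file counters; the field names are illustrative only.
struct BlobFileCounters {
  std::uint64_t total_blob_count;
  std::uint64_t total_blob_bytes;
  std::uint64_t garbage_blob_count;
  std::uint64_t garbage_blob_bytes;
};

// The invariant from the note: garbage may not exceed the total, either in
// number of blobs or in total size.
bool GarbageIsConsistent(const BlobFileCounters& f) {
  return f.garbage_blob_count <= f.total_blob_count &&
         f.garbage_blob_bytes <= f.total_blob_bytes;
}
```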

    Bug Fixes

    • Prevent a CompactRange() with CompactRangeOptions::change_level == true from possibly causing corruption to the LSM state (overlapping files within a level) when run in parallel with another manual compaction. Note that setting force_consistency_checks == true (the default) would cause the DB to enter read-only mode in this scenario and return Status::Corruption, rather than committing any corruption.
    • Fixed a bug in CompactionIterator when write-prepared transaction is used. A released earliest write conflict snapshot may cause assertion failure in dbg mode and unexpected key in opt mode.
    • Fixed the ticker WRITE_WITH_WAL ("rocksdb.write.wal"). The bug was caused by an extra RecordTick(stats_, WRITE_WITH_WAL) call in two places; the fix removes the extra RecordTick calls and corrects the corresponding test case.
    • EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. Now the status is Aborted.
    • Fixed a bug in CompactionIterator when write-prepared transaction is used. Releasing earliest_snapshot during compaction may cause a SingleDelete to be output after a PUT of the same user key whose seq has been zeroed.
    • Added input sanitization on negative bytes passed into GenericRateLimiter::Request.
    • Fixed an assertion failure in CompactionIterator when write-prepared transaction is used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete.
    • Fixed a bug of timestamp-based GC which can cause all versions of a key under full_history_ts_low to be dropped. This bug will be triggered when some of the ikeys' timestamps are lower than full_history_ts_low, while others are newer.
    • In some cases outside of the DB read and compaction paths, SST block checksums are now checked where they were not before.
    • Explicitly check for and disallow the BlockBasedTableOptions if insertion into one of {block_cache, block_cache_compressed, persistent_cache} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)
    • Users who configured a dedicated thread pool for bottommost compactions by explicitly adding threads to the Env::Priority::BOTTOM pool will no longer see RocksDB schedule automatic compactions exceeding the DB's compaction concurrency limit. For details on per-DB compaction concurrency limit, see API docs of max_background_compactions and max_background_jobs.
    • Fixed a bug of background flush thread picking more memtables to flush and prematurely advancing column family's log_number.
    • Fixed an assertion failure in ManifestTailer.

    Behavior Changes

    • NUM_FILES_IN_SINGLE_COMPACTION was only counting the first input level files, now it's including all input files.
    • TransactionUtil::CheckKeyForConflicts can also perform conflict-checking based on user-defined timestamps in addition to sequence numbers.
    • Removed GenericRateLimiter's minimum refill bytes per period previously enforced.
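
    The change to NUM_FILES_IN_SINGLE_COMPACTION above amounts to summing over all input levels instead of reading only the first. A rough sketch, using hypothetical function names and a plain vector of per-level file counts rather than RocksDB's compaction structures:

```cpp
#include <cstddef>
#include <vector>

// files_per_input_level[i] is the number of input files taken from the
// i-th input level of a single compaction (illustrative representation).

// Previous behavior as described in the note: only the first input level.
std::size_t CountFirstLevelOnly(
    const std::vector<std::size_t>& files_per_input_level) {
  return files_per_input_level.empty() ? 0 : files_per_input_level.front();
}

// New behavior as described in the note: all input files are counted.
std::size_t CountAllInputFiles(
    const std::vector<std::size_t>& files_per_input_level) {
  std::size_t total = 0;
  for (std::size_t n : files_per_input_level) {
    total += n;
  }
  return total;
}
```

    For a compaction with 3 files from the first input level and 5 from the next, the first function gives 3 and the second gives 8.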

    Public API change

    • When options.ttl is used with leveled compaction with compaction priority kMinOverlappingRatio, files exceeding half of TTL value will be prioritized more, so that by the time TTL is reached, fewer extra compactions will be scheduled to clear them up. At the same time, when compacting files with data older than half of TTL, output files may be cut off based on those files' boundaries, in order for the early TTL compaction to work properly.
    • Made FileSystem extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Clarified in API comments that RocksDB is not exception safe for callbacks and custom extensions. An exception propagating into RocksDB can lead to undefined behavior, including data loss, unreported corruption, deadlocks, and more.
    • Marked WriteBufferManager as final because it is not intended for extension.
    • Removed unimportant implementation details from table_properties.h
    • Add API FSDirectory::FsyncWithDirOptions(), which provides extra information like directory fsync reason in DirFsyncOptions. File system like btrfs is using that to skip directory fsync for creating a new file, or when renaming a file, fsync the target file instead of the directory, which improves the DB::Open() speed by ~20%.
    • DB::Open() is no longer blocked by obsolete file purge if DBOptions::avoid_unnecessary_blocking_io is set to true.
    • In builds where glibc provides gettid(), info log ("LOG" file) lines now print a system-wide thread ID from gettid() instead of the process-local pthread_self(). For all users, the thread ID format is changed from hexadecimal to decimal integer.
    • In builds where glibc provides pthread_setname_np(), the background thread names no longer contain an ID suffix. For example, "rocksdb:bottom7" (and all other threads in the Env::Priority::BOTTOM pool) are now named "rocksdb:bottom". Previously large thread pools could breach the name size limit (e.g., naming "rocksdb:bottom10" would fail).
    • Deprecated ReadOptions::iter_start_seqnum and DBOptions::preserve_deletes; please use the user-defined timestamp feature instead. The options will be removed in a future release; for now, using them logs a warning message.
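
    The thread-name note above is easier to follow with the length arithmetic spelled out. The sketch below assumes the Linux limit of 16 bytes per thread name, including the terminating null byte; FitsThreadNameLimit and kMaxThreadNameBytes are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <string>

// Assumption: Linux limits thread names to 16 bytes, counting the
// terminating null byte, so at most 15 visible characters fit.
constexpr std::size_t kMaxThreadNameBytes = 16;

// Hypothetical helper: true if the name plus its null terminator fits.
bool FitsThreadNameLimit(const std::string& name) {
  return name.size() + 1 <= kMaxThreadNameBytes;
}
```

    "rocksdb:bottom7" is 15 characters and fits, while "rocksdb:bottom10" is 16 characters and does not, which matches the failure mentioned in the note. The suffix-free "rocksdb:bottom" fits regardless of pool size.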

    Performance Improvements

    • Released some memory related to filter construction earlier in BlockBasedTableBuilder for FullFilter and PartitionedFilter case (#9070)
  • v6.26.1(Nov 18, 2021)

    6.26.1 (2021-11-18)

    Bug Fixes

    • Fix builds for some platforms.

    6.26.0 (2021-10-20)

    Bug Fixes

    • Fixes a bug in direct IO mode when calling MultiGet() for blobs in the same blob file. The bug is caused by not sorting the blob read requests by file offsets.
    • Fix the incorrect disabling of SST rate limited deletion when the WAL and DB are in different directories. Only WAL rate limited deletion should be disabled if it is in a different directory.
    • Fix DisableManualCompaction() to cancel compactions even when they are waiting on automatic compactions to drain due to CompactRangeOptions::exclusive_manual_compactions == true.
    • Fix contract of Env::ReopenWritableFile() and FileSystem::ReopenWritableFile() to specify any existing file must not be deleted or truncated.
    • Fixed bug in calls to IngestExternalFiles() with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after IngestExternalFiles() returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
    • Fixed a possible race condition impacting users of WriteBufferManager who constructed it with allow_stall == true. The race condition led to undefined behavior (in our experience, typically a process crash).
    • Fixed a bug where stalled writes would remain stalled forever after the user calls WriteBufferManager::SetBufferSize() with new_size == 0 to dynamically disable memory limiting.
    • Make DB::Close() thread-safe.
    • Fix a bug in atomic flush where one bg flush thread will wait forever for a preceding bg flush thread to commit its result to MANIFEST but encounters an error which is mapped to a soft error (DB not stopped).

    New Features

    • Print information about blob files when using "ldb list_live_files_metadata"
    • Provided support for SingleDelete with user defined timestamp.
    • Experimental new function DB::GetLiveFilesStorageInfo offers essentially a unified version of other functions like GetLiveFiles, GetLiveFilesChecksumInfo, and GetSortedWalFiles. Checkpoints and backups could show small behavioral changes and/or improved performance as they now use this new API.
    • Add remote compaction read/write bytes statistics: REMOTE_COMPACT_READ_BYTES, REMOTE_COMPACT_WRITE_BYTES.
    • Introduced an experimental feature to dump the blocks from the block cache and insert them into the secondary cache to reduce cache warmup time (e.g., while migrating a DB instance). More information is in the classes CacheDumper and CacheDumpedLoader at rocksdb/utilities/cache_dump_load.h. Note that this feature is still experimental and subject to change.
    • Introduced a new BlobDB configuration option blob_garbage_collection_force_threshold, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction.
    • Added EXPERIMENTAL support for table file (SST) unique identifiers that are stable and universally unique, available with new function GetUniqueIdFromTableProperties. Only SST files from RocksDB >= 6.24 support unique IDs.
    • Added GetMapProperty() support for "rocksdb.dbstats" (DB::Properties::kDBStats). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user write related stats and uptime.
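
    The trigger condition for blob_garbage_collection_force_threshold described above ("meets or exceeds") can be illustrated with a small predicate. This is a sketch only: the function name is hypothetical, and measuring the garbage ratio in bytes is an assumption made for the example, not a statement about RocksDB's internal accounting.

```cpp
#include <cstdint>

// Hypothetical predicate: true when the garbage ratio of the oldest blob
// files meets or exceeds the configured force threshold.
// Assumption: the ratio is computed from byte counts.
bool MeetsForceThreshold(std::uint64_t garbage_bytes,
                         std::uint64_t total_bytes,
                         double force_threshold) {
  if (total_bytes == 0) {
    return false;  // nothing to collect; also avoids division by zero
  }
  const double ratio = static_cast<double>(garbage_bytes) /
                       static_cast<double>(total_bytes);
  return ratio >= force_threshold;
}
```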

    Public API change

    • Made SystemClock extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Made SliceTransform extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method. The Capped and Prefixed transform classes return a short name (no length); use GetId for the fully qualified name.
    • Made FileChecksumGenFactory, SstPartitionerFactory, TablePropertiesCollectorFactory, and WalFilter extend the Customizable class and added a CreateFromString method.
    • Some fields of SstFileMetaData are deprecated for compatibility with new base class FileStorageInfo.
    • Add file_temperature to IngestExternalFileArg such that when ingesting SST files, we are able to indicate the temperature of this batch of files.
    • If DB::Close() failed with a non aborted status, calling DB::Close() again will return the original status instead of Status::OK.
    • Add CacheTier to advanced_options.h to describe the cache tier in use. Add a lowest_used_cache_tier option to DBOptions (immutable) and pass it to BlockBasedTableReader. By default it is CacheTier::kNonVolatileBlockTier, which means both the block cache (kVolatileTier) and the secondary cache (kNonVolatileBlockTier) are always used. Setting it to CacheTier::kVolatileTier makes the DB not use the secondary cache.
    • Even when options.max_compaction_bytes is hit, compaction output files are only cut when it aligns with grandparent files' boundaries. options.max_compaction_bytes could be slightly violated with the change, but the violation is no more than one target SST file size, which is usually much smaller.

    Performance Improvements

    • Improved CPU efficiency of building block-based table (SST) files (#9039 and #9040).

    Java API Changes

    • Add Java API bindings for new integrated BlobDB options
    • keyMayExist() supports ByteBuffer.
    • Fix multiget throwing Null Pointer Exception for num of keys > 70k (https://github.com/facebook/rocksdb/issues/8039).
  • v6.26.0(Nov 10, 2021)

    6.26.0 (2021-10-20)

    Bug Fixes

    • Fixes a bug in direct IO mode when calling MultiGet() for blobs in the same blob file. The bug is caused by not sorting the blob read requests by file offsets.
    • Fix the incorrect disabling of SST rate limited deletion when the WAL and DB are in different directories. Only WAL rate limited deletion should be disabled if it is in a different directory.
    • Fix DisableManualCompaction() to cancel compactions even when they are waiting on automatic compactions to drain due to CompactRangeOptions::exclusive_manual_compactions == true.
    • Fix contract of Env::ReopenWritableFile() and FileSystem::ReopenWritableFile() to specify any existing file must not be deleted or truncated.
    • Fixed bug in calls to IngestExternalFiles() with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after IngestExternalFiles() returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
    • Fixed a possible race condition impacting users of WriteBufferManager who constructed it with allow_stall == true. The race condition led to undefined behavior (in our experience, typically a process crash).
    • Fixed a bug where stalled writes would remain stalled forever after the user calls WriteBufferManager::SetBufferSize() with new_size == 0 to dynamically disable memory limiting.
    • Make DB::Close() thread-safe.
    • Fix a bug in atomic flush where one bg flush thread will wait forever for a preceding bg flush thread to commit its result to MANIFEST but encounters an error which is mapped to a soft error (DB not stopped).

    New Features

    • Print information about blob files when using "ldb list_live_files_metadata"
    • Provided support for SingleDelete with user defined timestamp.
    • Experimental new function DB::GetLiveFilesStorageInfo offers essentially a unified version of other functions like GetLiveFiles, GetLiveFilesChecksumInfo, and GetSortedWalFiles. Checkpoints and backups could show small behavioral changes and/or improved performance as they now use this new API.
    • Add remote compaction read/write bytes statistics: REMOTE_COMPACT_READ_BYTES, REMOTE_COMPACT_WRITE_BYTES.
    • Introduced an experimental feature to dump the blocks from the block cache and insert them into the secondary cache to reduce cache warmup time (e.g., while migrating a DB instance). More information is in the classes CacheDumper and CacheDumpedLoader at rocksdb/utilities/cache_dump_load.h. Note that this feature is still experimental and subject to change.
    • Introduced a new BlobDB configuration option blob_garbage_collection_force_threshold, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction.
    • Added EXPERIMENTAL support for table file (SST) unique identifiers that are stable and universally unique, available with new function GetUniqueIdFromTableProperties. Only SST files from RocksDB >= 6.24 support unique IDs.
    • Added GetMapProperty() support for "rocksdb.dbstats" (DB::Properties::kDBStats). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user write related stats and uptime.

    Public API change

    • Made SystemClock extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Made SliceTransform extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method. The Capped and Prefixed transform classes return a short name (no length); use GetId for the fully qualified name.
    • Made FileChecksumGenFactory, SstPartitionerFactory, TablePropertiesCollectorFactory, and WalFilter extend the Customizable class and added a CreateFromString method.
    • Some fields of SstFileMetaData are deprecated for compatibility with new base class FileStorageInfo.
    • Add file_temperature to IngestExternalFileArg such that when ingesting SST files, we are able to indicate the temperature of this batch of files.
    • If DB::Close() failed with a non aborted status, calling DB::Close() again will return the original status instead of Status::OK.
    • Add CacheTier to advanced_options.h to describe the cache tier in use. Add a lowest_used_cache_tier option to DBOptions (immutable) and pass it to BlockBasedTableReader. By default it is CacheTier::kNonVolatileBlockTier, which means both the block cache (kVolatileTier) and the secondary cache (kNonVolatileBlockTier) are always used. Setting it to CacheTier::kVolatileTier makes the DB not use the secondary cache.
    • Even when options.max_compaction_bytes is hit, compaction output files are only cut when it aligns with grandparent files' boundaries. options.max_compaction_bytes could be slightly violated with the change, but the violation is no more than one target SST file size, which is usually much smaller.

    Performance Improvements

    • Improved CPU efficiency of building block-based table (SST) files (#9039 and #9040).

    Java API Changes

    • Add Java API bindings for new integrated BlobDB options
    • keyMayExist() supports ByteBuffer.
    • Fix multiget throwing Null Pointer Exception for num of keys > 70k (https://github.com/facebook/rocksdb/issues/8039).
  • v6.25.3(Oct 15, 2021)

    6.25.3 (2021-10-14)

    Bug Fixes

    • Fixed bug in calls to IngestExternalFiles() with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after IngestExternalFiles() returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
    • Fixed a possible race condition impacting users of WriteBufferManager who constructed it with allow_stall == true. The race condition led to undefined behavior (in our experience, typically a process crash).
    • Fixed a bug where stalled writes would remain stalled forever after the user calls WriteBufferManager::SetBufferSize() with new_size == 0 to dynamically disable memory limiting.

    6.25.2 (2021-10-11)

    Bug Fixes

    • Fix DisableManualCompaction() to cancel compactions even when they are waiting on automatic compactions to drain due to CompactRangeOptions::exclusive_manual_compactions == true.
    • Fix contract of Env::ReopenWritableFile() and FileSystem::ReopenWritableFile() to specify any existing file must not be deleted or truncated.
  • v6.25.1(Oct 13, 2021)

    6.25.1 (2021-09-28)

    Bug Fixes

    • Fixes a bug in direct I/O mode when calling MultiGet() for blobs in the same blob file. The bug was caused by not sorting the blob read requests by file offset.
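
The ordering requirement behind this fix can be sketched in isolation. The snippet below is a minimal illustration, not RocksDB's code: `BlobReadRequest` and `SortByFileOffset` are hypothetical names introduced here, and the sketch only shows read requests for one file being ordered by ascending offset before they are handed to a batched read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one read request against a single blob file.
struct BlobReadRequest {
  uint64_t offset;   // byte offset of the blob within the file
  size_t length;     // number of bytes to read
  size_t key_index;  // position of the originating key in the batch
};

// Orders the requests by ascending file offset. key_index is carried along
// so that results can be mapped back to the caller's original key order.
inline void SortByFileOffset(std::vector<BlobReadRequest>& requests) {
  auto by_offset = [](const BlobReadRequest& a, const BlobReadRequest& b) {
    return a.offset < b.offset;
  };
  std::stable_sort(requests.begin(), requests.end(), by_offset);
  // Debug-only sanity check of the postcondition.
  assert(std::is_sorted(requests.begin(), requests.end(), by_offset));
}
```

Keeping the original batch position in each request is a design choice of this sketch: it lets the caller sort in place and still return values in the order the keys were supplied.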

    6.25.0 (2021-09-20)

    Bug Fixes

    • Allow secondary instance to refresh iterator. Assign read seq after referencing SuperVersion.
    • Fixed a bug where the secondary instance's last_sequence could go backward, causing reads on the secondary to miss recent updates from the primary.
    • Fixed a bug that could lead to duplicate DB ID or DB session ID in POSIX environments without /proc/sys/kernel/random/uuid.
    • Fix a race in DumpStats() with column family destruction due to not taking a Ref on each entry while iterating the ColumnFamilySet.
    • Fix a race in item ref counting in LRUCache when promoting an item from the SecondaryCache.
    • Fix a race in BackupEngine if RateLimiter is reconfigured during concurrent Restore operations.
    • Fix a bug on POSIX in which failure to create a lock file (e.g. out of space) can prevent future LockFile attempts in the same process on the same file from succeeding.
    • Fix a bug that backup_rate_limiter and restore_rate_limiter in BackupEngine could not limit read rates.
    • Fix the implementation of prepopulate_block_cache = kFlushOnly to only apply to flushes rather than to all generated files.
    • Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together. The sync WAL should work with locked log_write_mutex_.
    • Add checks for validity of the IO uring completion queue entries, and fail the BlockBasedTableReader MultiGet sub-batch if there's an invalid completion
    • Add an interface RocksDbIOUringEnable() that, if defined by the user, will allow them to enable/disable the use of IO uring by RocksDB
    • Fix a bug where, when direct I/O is used and MultiRead() returns a short result, RandomAccessFileReader::MultiRead() still returns a full-size buffer containing the short result followed by leftover data from the original buffer. This bug is unlikely to cause incorrect results, because (1) the FileSystem layer is expected to retry on a short result, so a short result is only possible when requesting bytes past the end of the file, which RocksDB doesn't do when using MultiRead(); and (2) the checksum is unlikely to match.

    New Features

    • RemoteCompaction's interface now includes db_name, db_id, and session_id, which can help the user uniquely identify a compaction job across DB instances and sessions.
    • Added a ticker statistic, "rocksdb.verify_checksum.read.bytes", reporting how many bytes were read from file to serve VerifyChecksum() and VerifyFileChecksums() queries.
    • Added ticker statistics, "rocksdb.backup.read.bytes" and "rocksdb.backup.write.bytes", reporting how many bytes were read and written during backup.
    • Added properties for BlobDB: rocksdb.num-blob-files, rocksdb.blob-stats, rocksdb.total-blob-file-size, and rocksdb.live-blob-file-size. The existing property rocksdb.estimate-live-data-size was also extended to include live bytes residing in blob files.
    • Added two new RateLimiter IOPriorities: Env::IO_USER and Env::IO_MID. Env::IO_USER has priority over all other RateLimiter IOPriorities and is not subject to the fair scheduling constraint.
    • SstFileWriter now supports Puts and Deletes with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.
    • Allow a single write batch to include keys from multiple column families whose timestamps' formats can differ. For example, some column families may disable timestamp, while others enable timestamp.
    • Add compaction priority information in RemoteCompaction, which can be used to schedule high-priority jobs first.
    • Added new callback APIs OnBlobFileCreationStarted, OnBlobFileCreated and OnBlobFileDeleted in the EventListener class of listener.h. They notify listeners during creation/deletion of individual blob files in Integrated BlobDB. RocksDB also logs the blob file creation finished event and the deletion event in the LOG file.
    • Batch blob read requests for DB::MultiGet using MultiRead.
    • Add support for fallback to local compaction: the user can return CompactionServiceJobStatus::kUseLocal to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result.
    • Add built-in rate limiter's implementation of RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri) for the total number of requests that are pending for bytes in the rate limiter.
    • Charge memory usage during data buffering, from which training samples are gathered for dictionary compression, to block cache. Unbuffering data can now be triggered if the block cache becomes full and strict_capacity_limit=true for the block cache, in addition to existing conditions that can trigger unbuffering.

    Public API change

    • Remove obsolete implementation details FullKey and ParseFullKey from public API
    • Change SstFileMetaData::size from size_t to uint64_t.
    • Made Statistics extend the Customizable class and added a CreateFromString method. Implementations of Statistics need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
    • Extended FlushJobInfo and CompactionJobInfo in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct members blob_file_addition_infos and blob_file_garbage_infos that contain this information.
    • Extended parameter output_file_names of CompactFiles API to also include paths of the blob files generated by the compaction in Integrated BlobDB.
    • Most BackupEngine functions now return IOStatus instead of Status. Most existing code should be compatible with this change but some calls might need to be updated.
  • v6.24.2(Oct 13, 2021)

    6.24.2 (2021-09-16)

    Bug Fixes

    • Add checks for validity of the IO uring completion queue entries, and fail the BlockBasedTableReader MultiGet sub-batch if there's an invalid completion

    6.24.1 (2021-08-31)

    Bug Fixes

    • Fix a race in item ref counting in LRUCache when promoting an item from the SecondaryCache.

    6.24.0 (2021-08-20)

    Bug Fixes

    • If the primary's CURRENT file is missing or inaccessible, the secondary instance should not hang repeatedly trying to switch to a new MANIFEST. It should instead return the error code encountered while accessing the file.
    • Restoring backups with BackupEngine is now a logically atomic operation, so that if a restore operation is interrupted, DB::Open on it will fail. Using BackupEngineOptions::sync (default) ensures atomicity even in case of power loss or OS crash.
    • Fixed a race related to the destruction of ColumnFamilyData objects. The earlier logic unlocked the DB mutex before destroying the thread-local SuperVersion pointers, which could result in a process crash if another thread managed to get a reference to the ColumnFamilyData object.
    • Removed a call to RenameFile() on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since the error was swallowed. Errors in renaming the "LOG" file are now no longer swallowed.
    • Fixed an issue where OnFlushCompleted was not called for atomic flush.
    • Fixed a bug affecting the batched MultiGet API when used with keys spanning multiple column families and sorted_input == false.
    • Fixed a potential incorrect result in opt mode and assertion failures caused by releasing snapshot(s) during compaction.
    • Fixed the passing of BlobFileCompletionCallback to the compaction job and atomic flush job, which previously received the default parameter (nullptr). BlobFileCompletionCallback is an internal callback that manages the addition of blob files to SSTFileManager.
    • Fixed MultiGet not updating the block_read_count and block_read_byte PerfContext counters.

    New Features

    • Made the EventListener extend the Customizable class.
    • EventListeners that have a non-empty Name() and that are registered with the ObjectRegistry can now be serialized to/from the OPTIONS file.
    • Insert warm blocks (data blocks, uncompressed dict blocks, index and filter blocks) in Block cache during flush under option BlockBasedTableOptions.prepopulate_block_cache. Previously it was enabled for only data blocks.
    • BlockBasedTableOptions.prepopulate_block_cache can be dynamically configured using DB::SetOptions.
    • Add CompactionOptionsFIFO.age_for_warm, which allows RocksDB to move old files to warm tier in FIFO compactions. Note that file temperature is still an experimental feature.
    • Add a comment suggesting that btrfs users disable file preallocation by setting options.allow_fallocate=false.
    • The fast forward option in trace replay was changed to double type to allow replaying at a lower speed by setting the value between 0 and 1. This option can be set via ReplayOptions in Replayer::Replay(), or via --trace_replay_fast_forward in db_bench.
    • Add property LiveSstFilesSizeAtTemperature to retrieve SST file sizes at different temperatures.
    • Added a stat rocksdb.secondary.cache.hits
    • Added a PerfContext counter secondary_cache_hit_count
    • The integrated BlobDB implementation now supports the tickers BLOB_DB_BLOB_FILE_BYTES_READ, BLOB_DB_GC_NUM_KEYS_RELOCATED, and BLOB_DB_GC_BYTES_RELOCATED, as well as the histograms BLOB_DB_COMPRESSION_MICROS and BLOB_DB_DECOMPRESSION_MICROS.
    • Added hybrid configuration of Ribbon filter and Bloom filter where some LSM levels use Ribbon for memory space efficiency and some use Bloom for speed. See NewRibbonFilterPolicy. This also changes the default behavior of NewRibbonFilterPolicy to use Bloom for flushes under Leveled and Universal compaction and Ribbon otherwise. The C API function rocksdb_filterpolicy_create_ribbon is unchanged, and a new function rocksdb_filterpolicy_create_ribbon_hybrid is added.
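
As an illustration of the trace replay fast forward option listed above, the arithmetic of a fractional speed factor can be sketched as follows. This is a minimal sketch under the assumption that a recorded time offset is divided by the factor; `ScaledReplayOffset` is a hypothetical helper introduced here and is not part of the RocksDB API.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Scales the time offset of a recorded operation (relative to the start of
// the trace) by a fast-forward factor. Values between 0 and 1 stretch the
// offset, i.e. replay more slowly; values above 1 compress it.
// Assumption: the offset is simply divided by the factor.
inline std::chrono::microseconds ScaledReplayOffset(
    std::chrono::microseconds recorded_offset, double fast_forward) {
  assert(fast_forward > 0.0);  // a zero or negative factor has no meaning here
  const double scaled =
      static_cast<double>(recorded_offset.count()) / fast_forward;
  return std::chrono::microseconds(std::llround(scaled));
}
```

With a factor of 0.5, an operation recorded 1000 microseconds into the trace would be replayed 2000 microseconds in, which is why an integer-typed option could not express slower-than-recorded replay.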

    Public API change

    • Added APIs to decode and replay trace file via Replayer class. Added DB::NewDefaultReplayer() to create a default Replayer instance. Added TraceReader::Reset() to restart reading a trace file. Created trace_record.h, trace_record_result.h and utilities/replayer.h files to access the decoded Trace records, replay them, and query the actual operation results.
    • Added Configurable::GetOptionsMap to the public API for use in creating new Customizable classes.
    • Generalized bits_per_key parameters in C API from int to double for greater configurability. Although this is a compatible change for existing C source code, anything depending on C API signatures, such as foreign function interfaces, will need to be updated.

    Performance Improvements

    • Try to avoid updating DBOptions if SetDBOptions() does not change any option value.

    Behavior Changes

    • StringAppendOperator additionally accepts a string as the delimiter.
    • BackupEngineOptions::sync (default true) now applies to restoring backups in addition to creating backups. This could slow down restores, but ensures they are fully persisted before returning OK. (Consider increasing max_background_operations to improve performance.)
  • v6.23.3(Oct 13, 2021)

    6.23.3 (2021-08-09)

    Bug Fixes

    • Removed a call to RenameFile() on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since the error was swallowed. Errors in renaming the "LOG" file are now no longer swallowed.
    • Fixed a bug affecting the batched MultiGet API when used with keys spanning multiple column families and sorted_input == false.

    6.23.2 (2021-08-04)

    Bug Fixes

    • Fixed a race related to the destruction of ColumnFamilyData objects. The earlier logic unlocked the DB mutex before destroying the thread-local SuperVersion pointers, which could result in a process crash if another thread managed to get a reference to the ColumnFamilyData object.
    • Fixed an issue where OnFlushCompleted was not called for atomic flush.

    6.23.1 (2021-07-22)

    Bug Fixes

    • Fix a race condition during multiple DB instances opening.

    6.23.0 (2021-07-16)

    Behavior Changes

    • Obsolete keys in the bottommost level that were preserved for a snapshot will now be cleaned upon snapshot release in all cases. This form of compaction (snapshot release triggered compaction) previously had an artificial limitation that multiple tombstones needed to be present.

    Bug Fixes

    • Blob file checksums are now printed in hexadecimal format when using the manifest_dump ldb command.
    • GetLiveFilesMetaData() now populates the temperature, oldest_ancester_time, and file_creation_time fields of its LiveFileMetaData results when the information is available. Previously these fields always contained zero indicating unknown.
    • Fix mismatches of OnCompaction{Begin,Completed} in case of DisableManualCompaction().
    • Fix continuous logging of an existing background error on every user write.
    • Fix a bug where Get() returned Status::OK() and an empty value for a non-existent key when read_options.read_tier = kBlockCacheTier.
    • Fix a bug where stats in get_context were not accumulated into statistics when the query failed.

    New Features

    • ldb has a new feature, list_live_files_metadata, that shows the live SST files, as well as their LSM storage level and the column family they belong to.
    • The new BlobDB implementation now tracks the amount of garbage in each blob file in the MANIFEST.
    • Integrated BlobDB now supports Merge with base values (Put/Delete etc.).
    • RemoteCompaction supports sub-compaction; the job_id in the user interface is changed from int to uint64_t to support sub-compaction IDs.
    • Expose statistics option in RemoteCompaction worker.

    Public API change

    • Added APIs to the Customizable class to allow developers to create their own Customizable classes. Created the utilities/customizable_util.h file to contain helper methods for developing new Customizable classes.
    • Change signature of SecondaryCache::Name(). Make SecondaryCache customizable and add SecondaryCache::CreateFromString method.
  • v6.22.1(Jul 12, 2021)

    6.22.1 (2021-06-25)

    Bug Fixes

    • GetLiveFilesMetaData() now populates the temperature, oldest_ancester_time, and file_creation_time fields of its LiveFileMetaData results when the information is available. Previously these fields always contained zero indicating unknown.

    6.22.0 (2021-06-18)

    Behavior Changes

    • Added two additional tickers, MEMTABLE_PAYLOAD_BYTES_AT_FLUSH and MEMTABLE_GARBAGE_BYTES_AT_FLUSH. These stats can be used to estimate the ratio of "garbage" (outdated) bytes in the memtable that are discarded at flush time.
    • Added API comments clarifying safe usage of Disable/EnableManualCompaction and EventListener callbacks for compaction.
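
The garbage ratio mentioned for the two new tickers can be computed as in the sketch below. It is a minimal illustration, assuming the payload ticker counts all memtable bytes seen at flush time, including the garbage bytes; `MemtableGarbageRatio` is a hypothetical helper, and reading the ticker values from a Statistics object is left out.

```cpp
#include <cassert>
#include <cstdint>

// Estimates the fraction of memtable bytes that were already outdated at
// flush time. The arguments are the accumulated values of
// MEMTABLE_PAYLOAD_BYTES_AT_FLUSH and MEMTABLE_GARBAGE_BYTES_AT_FLUSH.
// Assumption: payload bytes include the garbage bytes.
inline double MemtableGarbageRatio(uint64_t payload_bytes_at_flush,
                                   uint64_t garbage_bytes_at_flush) {
  assert(garbage_bytes_at_flush <= payload_bytes_at_flush);
  if (payload_bytes_at_flush == 0) {
    return 0.0;  // nothing flushed yet, so no garbage to report
  }
  return static_cast<double>(garbage_bytes_at_flush) /
         static_cast<double>(payload_bytes_at_flush);
}
```

A ratio close to 1 would suggest that most flushed bytes were overwritten or deleted while still in the memtable.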

    Bug Fixes

    • fs_posix.cc GetFreeSpace() always reported the disk space available to root, even when running as non-root. Linux defaults often have disk mounts with 5 to 10 percent of total space reserved only for root, so non-root users could run out of space unexpectedly.
    • Subcompactions are now disabled when user-defined timestamps are used, since the subcompaction boundary picking logic is currently not timestamp-aware, which could lead to incorrect results when different subcompactions process keys that only differ by timestamp.
    • Fix an issue where DeleteFilesInRange() could cause an ongoing compaction to report a corruption exception, or to hit an ASSERT in debug builds. We found no actual data loss or corruption.
    • Fixed confusingly duplicated output in LOG for periodic stats ("DUMPING STATS"), including "Compaction Stats" and "File Read Latency Histogram By Level".
    • Fixed performance bugs in background gathering of block cache entry statistics, that could consume a lot of CPU when there are many column families with a shared block cache.
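
The GetFreeSpace() item above turns on the difference between two fields that POSIX `statvfs()` reports: `f_bfree` counts all free blocks, including those reserved for root, while `f_bavail` counts only the blocks available to unprivileged processes. The sketch below shows both values side by side; `FreeSpace` and `QueryFreeSpace` are hypothetical names, and this is not the code in fs_posix.cc.

```cpp
#include <sys/statvfs.h>

#include <cassert>
#include <cstdint>
#include <string>

// Free space in bytes as seen by root and by unprivileged users.
struct FreeSpace {
  uint64_t for_root;      // derived from f_bfree (includes reserved blocks)
  uint64_t for_non_root;  // derived from f_bavail (excludes reserved blocks)
};

// Queries the file system containing `path`. Returns false if statvfs()
// fails, for example because the path does not exist.
inline bool QueryFreeSpace(const std::string& path, FreeSpace* out) {
  assert(out != nullptr);
  struct statvfs buf;
  if (statvfs(path.c_str(), &buf) != 0) {
    return false;
  }
  // POSIX expresses f_bfree and f_bavail in units of f_frsize.
  const uint64_t unit = static_cast<uint64_t>(buf.f_frsize);
  out->for_root = static_cast<uint64_t>(buf.f_bfree) * unit;
  out->for_non_root = static_cast<uint64_t>(buf.f_bavail) * unit;
  return true;
}
```

A process running as a non-root user would typically base its out-of-space decisions on `for_non_root`.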

    New Features

    • Marked the Ribbon filter and optimize_filters_for_memory features as production-ready, each enabling memory savings for Bloom-like filters. Use NewRibbonFilterPolicy in place of NewBloomFilterPolicy to use Ribbon filters instead of Bloom, or ribbonfilter in place of bloomfilter in configuration string.
    • Allow DBWithTTL to use DeleteRange api just like other DBs. DeleteRangeCF() which executes WriteBatchInternal::DeleteRange() has been added to the handler in DBWithTTLImpl::Write() to implement it.
    • Add BlockBasedTableOptions.prepopulate_block_cache. If enabled, it prepopulates the block cache at flush time with warm/hot data blocks that are already in memory. On a flush, the data blocks that are in memory (in memtables) get flushed to the device. With direct I/O, reading this data back into memory incurs additional I/O, which this option avoids; it also helps with distributed file systems. More details in include/rocksdb/table.h.
    • Added a cancel field to CompactRangeOptions, allowing individual in-process manual range compactions to be cancelled.
  • v6.20.3(May 6, 2021)

    6.20.3 (2021-05-05)

    Bug Fixes

    • Fixed a bug where GetLiveFiles() output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use GetLiveFiles(), failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and fail_if_options_file_error == false. Read-only DBs were impacted when no OPTIONS files existed.

    6.20.2 (2021-04-23)

    Bug Fixes

    • Fixed a bug in handling file rename error in distributed/network file systems when the server succeeds but client returns error. The bug can cause CURRENT file to point to non-existing MANIFEST file, thus DB cannot be opened.
    • Fixed a bug where ingested files were written with incorrect boundary key metadata. In rare cases this could have led to a level's files being wrongly ordered and queries for the boundary keys returning wrong results.
    • Fixed a data race between insertion into memtables and the retrieval of the DB properties rocksdb.cur-size-active-mem-table, rocksdb.cur-size-all-mem-tables, and rocksdb.size-all-mem-tables.
    • Fixed the false-positive alert when recovering from the WAL file. Avoid reporting "SST file is ahead of WAL" on a newly created empty column family, if the previous WAL file is corrupted.

    Behavior Changes

    • Due to the fix for the false-positive alert of "SST file is ahead of WAL", all CFs with no SST file (empty CFs) now bypass the consistency check. We fixed a false-positive, but introduced a very rare true-negative which is triggered under the following conditions: a CF has some delete operations among its last few queries that result in an empty CF (they are flushed to an SST file and a compaction is triggered that combines this file with all other SST files and generates an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL is corrupted, while the CF's log number points to the next WAL (due to the flush). In that case the DB can only recover to the point without these trailing deletions, which leaves the DB in an inconsistent state.

    6.20.0 (2021-04-16)

    Behavior Changes

    • ColumnFamilyOptions::sample_for_compression now takes effect for creation of all block-based tables. Previously it only took effect for block-based tables created by flush.
    • CompactFiles() can no longer compact files from a lower level to an upper level, which risks corrupting the DB (details: #8063). The validation is also added to all compactions.
    • Fixed some cases in which DB::OpenForReadOnly() could write to the filesystem. If you want a Logger with a read-only DB, you must now set DBOptions::info_log yourself, such as using CreateLoggerFromOptions().
    • get_iostats_context() will never return nullptr. If thread-local support is not available and the user does not opt out of the iostats context, compilation will fail. The same applies to the perf context.

    Bug Fixes

    • Use thread-safe strerror_r() to get error messages.
    • Fixed a potential hang in shutdown for a DB whose Env has high-pri thread pool disabled (Env::GetBackgroundThreads(Env::Priority::HIGH) == 0)
    • Made BackupEngine thread-safe and added documentation comments to clarify what is safe for multiple BackupEngine objects accessing the same backup directory.
    • Fixed crash (divide by zero) when compression dictionary is applied to a file containing only range tombstones.
    • Fixed a backward iteration bug with partitioned filter enabled: not including the prefix of the last key of the previous filter partition in current filter partition can cause wrong iteration result.
    • Fixed a bug that allowed DBOptions::max_open_files to be set with a non-negative integer with ColumnFamilyOptions::compaction_style = kCompactionStyleFIFO.
    • Fixed a bug in handling file rename error in distributed/network file systems when the server succeeds but client returns error. The bug can cause CURRENT file to point to non-existing MANIFEST file, thus DB cannot be opened.
    • Fixed a data race between insertion into memtables and the retrieval of the DB properties rocksdb.cur-size-active-mem-table, rocksdb.cur-size-all-mem-tables, and rocksdb.size-all-mem-tables.

    Performance Improvements

    • On ARM platforms, use yield instead of wfe to relax the CPU for better performance.

    Public API change

    • Added TableProperties::slow_compression_estimated_data_size and TableProperties::fast_compression_estimated_data_size. When ColumnFamilyOptions::sample_for_compression > 0, they estimate what TableProperties::data_size would have been if the "fast" or "slow" (see ColumnFamilyOptions::sample_for_compression API doc for definitions) compression had been used instead.
    • Update DB::StartIOTrace to remove the Env object from its arguments, as it is redundant: the DB already has an Env object that is passed down to IOTracer::StartIOTrace.
    • Added FlushReason::kWalFull, which is reported when a memtable is flushed due to the WAL reaching its size limit; those flushes were previously reported as FlushReason::kWriteBufferManager. Also, changed the reason for flushes triggered by the write buffer manager to FlushReason::kWriteBufferManager; they were previously reported as FlushReason::kWriteBufferFull.
    • Extend file_checksum_dump ldb command and DB::GetLiveFilesChecksumInfo API for IntegratedBlobDB and get checksum of blob files along with SST files.

    New Features

    • Added the ability to open BackupEngine backups as read-only DBs, using BackupInfo::name_for_open and env_for_open provided by BackupEngine::GetBackupInfo() with include_file_details=true.
    • Added BackupEngine support for integrated BlobDB, with blob files shared between backups when table files are shared. Because of current limitations, blob files always use the kLegacyCrc32cAndFileSize naming scheme, and incremental backups must read and checksum all blob files in a DB, even for files that are already backed up.
    • Added an optional output parameter to BackupEngine::CreateNewBackup(WithMetadata) to return the BackupID of the new backup.
    • Added BackupEngine::GetBackupInfo / GetLatestBackupInfo for querying individual backups.
    • Made the Ribbon filter a long-term supported feature in terms of the SST schema (compatible with version >= 6.15.0), though the API for enabling it is expected to change.