Verneuil is a VFS extension for SQLite that asynchronously replicates databases to S3-compatible blob stores.


Verneuil: streaming replication for sqlite

[asciicast demo]

Verneuil [vɛʁnœj][1] is a VFS (OS abstraction layer) for sqlite that accesses local database files like the default unix VFS, while asynchronously replicating snapshots to S3-compatible blob stores. We wrote it to improve the scalability and availability of pre-existing services for which sqlite is a good fit, at least for single-node deployments.

The primary design goal of Verneuil is to add asynchronous read replication to working single-node systems without introducing new catastrophic failure modes. Avoiding new failure modes takes precedence over all other considerations, including replication lag: there is no attempt to bound or minimise the staleness of read replicas. Verneuil read replicas should only be used when stale data is acceptable.

In keeping with this conservative approach to replication, the local database file on disk remains the source of truth, and the VFS is fully compatible with sqlite's default unix VFS, even for concurrent (with file locking) accesses. Verneuil stores all state that must persist across sqlite transactions on disk, so multiple processes can still access and replicate the same database with Verneuil.

Verneuil also paces all API calls (with a [currently hardcoded] limit of 30 calls/second/process) to avoid "surprising" cloud bills, and decouples the sqlite VFS from the replication worker threads that upload data to a remote blob store with a crash-safe buffer directory whose worst-case disk footprint is bounded to roughly four times the size of the source database file. It's thus always safe to disable access to the blob store: buffered replication data may grow over time, but always within bounds.

Replacing the default unix VFS with Verneuil of course impacts local sqlite operations: writes must be slower, in order to queue updates for replication. However, this slowdown is usually proportional to the time it took to perform the write itself, and often dominated by the two fsyncs incurred by sqlite transaction commits in rollback mode. Moreover, the replication logic runs with the write lock downgraded to a read lock, so subsequent transactions only block on the new replication step once they're ready to commit.

This effort is not directly comparable to litestream: Verneuil is meant for asynchronous read replication, with streaming backups as a nice side effect. The replication approach is thus completely different. In particular, while litestream only works with sqlite databases in WAL mode, Verneuil only supports rollback journaling. See doc/DESIGN.md for details.

What's in this repo

  1. A "Linux" VFS (c/linuxvfs.c) that implements everything that sqlite needs for a non-WAL DB, without all the backward compatibility cruft in sqlite's Unix VFS. The new VFS's behaviour is fully compatible with upstream's Unix VFS! It's a simpler starting point for new (Linux-only) sqlite VFSes.

  2. A Rust crate with a C interface (see include/verneuil.h) to configure and register:

    • The verneuil VFS, which hooks into the Linux VFS to track changes, generate snapshots in spooling directories, and asynchronously upload spooled data to a remote blob store like S3. This VFS is only compatible with sqlite's rollback journal mode. It can be used directly from Rust code, or via its C interface.

    • The verneuil_snapshot VFS that lets sqlite access snapshots stored in S3-compatible blob stores.

  3. A runtime-loadable sqlite extension, libverneuil_vfs, that lets sqlite open databases with the verneuil VFS (to replicate the database to remote storage), or with the verneuil_snapshot VFS (to access a replicated snapshot).

  4. The verneuilctl command-line tool to restore snapshots, forcibly upload spooled data, synchronise a database file to remote storage, and perform other ad hoc administrative tasks.

Quick start

There is more detailed setup information, including how to directly link against the verneuil crate instead of loading it as a sqlite extension, in doc/VFS.md and doc/SNAPSHOT_VFS.md. The rusqlite_integration example shows how that works for a Rust crate.

For quick hacks and test drives, the easiest way to use Verneuil is to build it as a runtime loadable extension for sqlite (libverneuil_vfs).

cargo build --release --examples --features='dynamic_vfs'

The verneuilctl tool will also be useful.

cargo build --release --examples --features='vendor_sqlite'

Verneuil needs additional configuration to know where to spool replication data, and where to upload or fetch data from remote storage. That configuration data must be encoded in JSON, and will be deserialised into a verneuil::Options struct (in src/lib.rs).

A minimal configuration string looks as follows. See doc/VFS.md and doc/SNAPSHOT_VFS.md for more details.

{
  // "make_default": true, to use the replicating VFS by default
  // "tempdir": "/my/tmp/", to override the location of temporary files
  "replication_spooling_dir": "/tmp/verneuil/",
  "replication_targets": [
    {
      "s3": {
        "region": "us-east-1",
        // "endpoint": "http://127.0.0.1:9000", //for non-standard regions
        "chunk_bucket": "verneuil_chunks",
        "manifest_bucket": "verneuil_manifests",
        "domain_addressing": true  // or false for the legacy bucket-as-path interface
        // "create_buckets_on_demand": true // to create private buckets as needed
      }
    }
  ]
}

That's a mouthful to pass as query string parameters to sqlite3_open_v2, so Verneuil currently looks for that configuration string in the VERNEUIL_CONFIG environment variable. If that variable's value starts with an at sign, like "@/path/to/config.json", Verneuil looks for the configuration JSON in that file.

The configuration file does not include any credentials: Verneuil gets those from the environment, either by hitting the local EC2 credentials daemon, or by reading the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Now that the environment is set up, we can load the extension in sqlite, and start replicating our writes to S3, or any other compatible blob server (we use minio for testing).

$ RUST_LOG=warn VERNEUIL_CONFIG=@/path/to/config.json sqlite3
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load ./libverneuil_vfs  -- Load the Verneuil VFS extension.
sqlite> .open file:source.db?vfs=verneuil
-- The contents of source.db will now be spooled for replication before
-- letting each transaction close.
sqlite> .open file:verneuil://source.host.name/path/to/replicated.db?vfs=verneuil_snapshot
-- opens a read replica for the most current snapshot replicated to s3 by `source.host.name`
-- for the database at `/path/to/replicated.db`.

Outside the sqlite shell, extension loading must be enabled in order to allow access to the load_extension SQL function.

URI filenames must also be enabled in order to specify the VFS in the connection string; it's also possible to pass a VFS argument to sqlite3_open_v2.
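
For applications that use sqlite's C API directly, the same flow can be sketched as follows. This is only a minimal illustration: the extension path and database name are placeholders, VERNEUIL_CONFIG and any S3 credentials are assumed to already be set in the environment, and error handling is abbreviated.

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db = NULL;
    sqlite3 *replicated = NULL;
    char *err = NULL;

    /* Load the extension through a throwaway connection: loading
     * libverneuil_vfs registers the verneuil and verneuil_snapshot
     * VFSes for the whole process. */
    if (sqlite3_open(":memory:", &db) != SQLITE_OK)
        return 1;
    sqlite3_enable_load_extension(db, 1);
    if (sqlite3_load_extension(db, "./libverneuil_vfs", NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "failed to load libverneuil_vfs: %s\n", err ? err : "(unknown)");
        sqlite3_free(err);
        return 1;
    }
    sqlite3_close(db);

    /* The SQLITE_OPEN_URI flag lets us select the VFS in the connection
     * string; alternatively, pass "verneuil" as the fourth argument to
     * sqlite3_open_v2.  Writes to source.db are now spooled for
     * replication before each transaction commit returns. */
    if (sqlite3_open_v2("file:source.db?vfs=verneuil", &replicated,
                        SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_URI,
                        NULL) != SQLITE_OK)
        return 1;

    /* ... use `replicated` like any other sqlite connection ... */
    sqlite3_close(replicated);
    return 0;
}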

Replication data is buffered to the replication_spooling_dir synchronously, before the end of each sqlite transaction. Actually uploading the data to remote storage happens asynchronously: we wouldn't want to block transaction commit on network calls.

After exiting the shell or closing an application, we can make sure that all spooled data is flushed to remote storage with verneuilctl flush $REPLICATION_SPOOLING_DIR: this command will attempt to synchronously upload all pending spooled data in the spooling directory, and log noisily / error out on failure.

Find documentation for other verneuilctl subcommands with verneuilctl help:

$ ./verneuilctl --help
verneuilctl 0.1.0
utilities to interact with Verneuil snapshots

USAGE:
    verneuilctl [OPTIONS] <SUBCOMMAND>

FLAGS:
    -h, --help
            Prints help information

    -V, --version
            Prints version information


OPTIONS:
    -c, --config <config>
            The Verneuil JSON configuration used when originally copying the database to remote storage.

            A value of the form "@/path/to/json.file" refers to the contents of that file; otherwise, the argument
            itself is the configuration string.

            This parameter is optional, and defaults to the value of the `VERNEUIL_CONFIG` environment variable.
    -l, --log <log>
            Log level, in the same format as `RUST_LOG`.  Defaults to only logging errors to stderr; `--log=info`
            increases the verbosity to also log info and warning to stderr.

            To fully disable logging, pass `--log=off`.

SUBCOMMANDS:
    flush            The verneuilctl flush utility accepts the path to a spooling directory, (i.e., a value for
                     `verneuil::Options::replication_spooling_dir`), and attempts to upload all the files pending
                     replication in that directory
    help             Prints this message or the help of the given subcommand(s)
    manifest         The verneuilctl manifest utility accepts the path to a source replicated file and an optional
                     hostname, and outputs the contents of the corresponding manifest file to `--out`, or stdout by
                     default
    manifest-name    The verneuilctl manifest-name utility accepts the path to a source replicated file and an
                     optional hostname, and prints the name of the corresponding manifest file to stdout
    restore          The verneuilctl restore utility accepts the path to a verneuil manifest file, and reconstructs
                     its contents to the `--out` argument (or stdout by default)
    sync             The verneuilctl sync utility accepts the path to a sqlite db, and uploads a fresh snapshot to
                     the configured replication targets
$ ./verneuilctl restore --help
verneuilctl-restore 0.1.0
The verneuilctl restore utility accepts the path to a verneuil manifest file, and reconstructs its contents to the
`--out` argument (or stdout by default)

USAGE:
    verneuilctl restore [OPTIONS]

FLAGS:
        --help
            Prints help information

    -V, --version
            Prints version information


OPTIONS:
    -h, --hostname <hostname>
            The hostname of the machine that generated the snapshot.

            Defaults to the current machine's hostname.
    -m, --manifest <manifest>
            The manifest file that describes the snapshot to restore. These are typically stored as objects in
            versioned buckets; it is up to the invoker to fish out the relevant version.

            If missing, verneuilctl restore will attempt to download it from remote storage, based on `--hostname`
            and `--source_path`.

            As special cases, an `http://` or `https://` prefix will be downloaded over HTTP(S), an
            `s3://bucket.region[.endpoint]/path/to/blob` URI will be loaded via HTTPS domain-addressed S3,
            `verneuil://machine-host-name/path/to/sqlite.db` will be loaded based on that hostname (or the current
            machine's hostname if empty) and source path, and a `file://` prefix will always be read as a local
            path.

    -o, --out <out>
            The path to the reconstructed output file. Defaults to stdout.

    -s, --source-path <source-path>
            The path to the source file that was replicated by Verneuil, when it ran on `--hostname`

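For example, restoring the most recent snapshot that source.host.name replicated for /path/to/replicated.db could look like the following; the flag values mirror the snapshot VFS example above, the output path is illustrative, and VERNEUIL_CONFIG must point at the same configuration that was used for replication.

$ ./verneuilctl restore --hostname source.host.name --source-path /path/to/replicated.db --out restored.db
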
But why?

Backtrace shards most of its backend metadata across a few thousand small (1-2 MB) to medium-size (up to 1-2 GB) sqlite databases, with an average aggregate write rate of a few dozen write transactions per second (with a few hot databases and many cold ones). Before Verneuil, this approach offered adequate performance and availability. However, things could be better, and that's why we wrote Verneuil: to distribute logic that can work with slightly stale read replicas and to simplify our disaster recovery playbooks, without introducing new failure modes in single-node code that already works well enough.

In fact, making sure replicas are up to date is explicitly not a goal. Nevertheless, we find that once our backend reaches its steady state, less than 0.1% of write transactions take more than 5 seconds to replicate, and we detect a replication lag of more than one minute more rarely than once every million writes. Of course, this all depends on the write load and the number of replicated databases on a machine or process. For example, we experience temporary spikes in replication lag whenever a service restarts and writes to a few hundred databases in rapid succession.

Data freshness is not a goal because Verneuil prioritises disaster avoidance over everything else. That's why we interpose a wait-free crash-safe replication buffer (implemented as files on disk) between the snapshot update logic, which must run synchronously with sqlite transaction commits, and the copier worker threads that upload snapshot data to remote blob stores. We trust this buffer to act as a "data diode" that architecturally shuts off feedback loops from the copier workers back to the sqlite VFS (i.e., back to the application). Crucially, the amount of buffered data for a given sqlite database is bounded to a multiple of that database file's size, even when copiers are completely stuck. Even when the blob store is inaccessible or a target bucket is misconfigured, local operations will not be interrupted by an ever-growing replication queue. The buffer is also updated without fsync calls that could easily impact the whole storage subsystem; Verneuil instead achieves crash safety by discarding all replication state after a reboot.

All too often, distributed solutions for scalability and availability end up introducing new catastrophic failure modes, and the result is a system that might offer resilience to rare (once a year or less) events like hardware failure or power loss, but does so by increasing complexity to a level such that unforeseen interactions between correctly functioning pieces of code regularly cause protracted customer-impacting issues. Verneuil's conservative approach gives us some confidence that we can use it to improve the scalability and availability of our pre-existing centralised systems without worsening the reliability of everything that already works well enough.

Disaster avoidance includes bounding cloud costs. Verneuil can guarantee cost effectiveness for a wide range of update rates because it's always able to throttle the API calls that update data: the replication buffer will simply squash snapshots and always bound the replication data's footprint to four times the size of the source database file.

Regardless of the update pattern (frequency and number of databases), we can count on Verneuil to remain within our budget for replication: it will never average more than 30[2] API calls/replication target/second/process. Each call uploads either a chunk (64 KB for incompressible data, less if zstd is useful), or a manifest (16 bytes per 64 KB chunk in the source database file, so 512 KB for a 2 GB file).

Chunks can be reaped by a retention rule that deletes them after a week of inactivity (Verneuil attempts to "touch" useful chunks once a day), so, even when there's a lot of churn, a chunk upload to a standard bucket in US East costs at most $5e-6 + 64 K/1 GB * $0.023 / 4 (weeks per month) < $6e-6.

Manifests for multi-GB databases can be much larger, but manifest updates are throttled to less than one per second per database, and manifest blobs can be deleted more aggressively (e.g., as soon as a version becomes stale). With a 24h retention rule, uploading the manifest for a 2 GB database adds up to less than $6e-6 for the API call and churned storage.

We could also add the storage costs for the permanent footprint of the replicated databases ($0.023/GB/month for standard buckets in US East) to this upper bound, but those are usually dominated by API costs.

At an average rate of 30 uploads/replication target/second/process, the cost of churned data thus adds up to less than $15.55/replication target/day/process. There is usually only one replication target and one replicating process per machine, so this translates into a maximum of $15.55/day/machine (comparable to a c5.4xlarge). In practice, the average daily cost for Backtrace's backend fleet (millions of writes a day scattered across a few thousand databases) is on the order of $40/day.
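
For reference, the back-of-the-envelope arithmetic behind these bounds works out as follows:

  manifest size for a 2 GB database: 2 GB / 64 KB = 32,768 chunks × 16 bytes = 512 KB
  worst-case cost per chunk upload:  $5e-6 (PUT) + (64 KB / 1 GB) × $0.023/GB-month ÷ 4 weeks/month ≈ $5.4e-6 < $6e-6
  daily ceiling per process:         30 calls/second × 86,400 seconds/day × $6e-6/call ≈ $15.55/day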

How is it tested?

In addition to simply running this in production to confirm that regular single-node operations still work and that the code correctly paces its API calls, we use sqlite's open source regression test suite, after replacing the default Unix VFS with Verneuil. Unfortunately, some tests assume WAL DBs are supported, so we have to disable them; some others inject failures to exercise sqlite's failure handling logic, and those too must be disabled. The resulting test suite lives at https://github.com/pkhuong/sqlite/tree/pkhuong/no-wal-baseline

Configure a sqlite build directory from the mutilated test suite, then run verneuil/t/test.sh to build test executables that load the Verneuil VFS and make it the new default. The test script also spins up a local minio container for the Verneuil VFS.

In test mode, the VFS executes internal consistency checking code, and panics whenever it notices a spooling or replication failure.

The logic for read replicas can't piggyback on top of the sqlite test suite as easily. It is instead subjected to classic unit testing and manual smoke testing.

What's missing for 1.0

  • Configurability: most of the plumbing is there to configure individual sqlite connections, but the current implementation is geared towards a program linking directly against libverneuil and configuring it with C calls. We can already load Verneuil in sqlite by configuring it with an environment variable (which matches the current global configuration structure), but we should add support for reading configuration data from the connection string (sqlite query string parameters).

Things we should do after 1.0

  1. We currently always create the journal file with mode 0644. The umask applies, but it would make sense to implement the same logic as sqlite's Unix VFS and inherit the main db file's permissions.

  2. The first page in a sqlite DB almost always changes after a write transaction. Should we bundle it with the directory proto?

  3. Many filesystems now support copy-on-write; we should think about using that for the commit step, instead of a journal file!

  4. The S3 client library is really naive. We should reuse HTTPS connections.

  5. Consider some way to get chunks without indirecting through S3. We could gossip promising chunks ahead of time, or simply serve them on demand over a request-response protocol like HTTP.

Footnotes

  1. The Verneuil process was the first commercial method of manufacturing synthetic gemstones... and DRH insists on pronouncing sqlite like a mineral, surely a precious one (:

  2. This hardcoded limit, coupled with the patrol logic that "touches" every extant chunk once a day, limits the total size of replicated databases for a single process: the replication logic may break down around 20-30 GB, but local operations should not be affected, except for the bounded growth in buffered replication data. That's not an issue for us because we only store metadata in sqlite, metadata that tends to be orders of magnitude smaller than the data.

Issues
  • Audit the public interface for usage of the internal runtime

    In #15, we observed an issue with the public interface's transitive use of Verneuil's internal async runtime. I don't believe there's any left (everything else just pushes work to the internal threads or uses fresh worker threads), but let's make extra sure.

    opened by pkhuong 4
  • Issues running verneuil operations within an async runtime

    As noted in the title, when invoking verneuil operations other than the main verneuil async replication loop inside an async runtime, your program will error at runtime with:

    thread 'main' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.', /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/runtime/enter.rs:39:9
    stack backtrace:
       0:     0x5608fe2ebff0 - std::backtrace_rs::backtrace::libunwind::trace::h706b838f5bbd876b
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
       1:     0x5608fe2ebff0 - std::backtrace_rs::backtrace::trace_unsynchronized::hc15f29ae7822b7b8
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
       2:     0x5608fe2ebff0 - std::sys_common::backtrace::_print_fmt::h7c580c971f91926c
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/sys_common/backtrace.rs:67:5
       3:     0x5608fe2ebff0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h2a772198c4032452
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/sys_common/backtrace.rs:46:22
       4:     0x5608fe310b7c - core::fmt::write::h9a6d9c74526a6c1b
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/core/src/fmt/mod.rs:1115:17
       5:     0x5608fe2e4c65 - std::io::Write::write_fmt::h00f38d9eb2c32e02
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/io/mod.rs:1663:15
       6:     0x5608fe2ee53b - std::sys_common::backtrace::_print::h5b3c171e6f864ae4
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/sys_common/backtrace.rs:49:5
       7:     0x5608fe2ee53b - std::sys_common::backtrace::print::h93b9e9ed2a98e611
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/sys_common/backtrace.rs:36:9
       8:     0x5608fe2ee53b - std::panicking::default_hook::{{closure}}::hd8da92bb68d520c5
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:208:50
       9:     0x5608fe2ee011 - std::panicking::default_hook::hdbc8b2951c5afbab
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:225:9
      10:     0x5608fe2eec04 - std::panicking::rust_panic_with_hook::h8a4c841655926f4e
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:622:17
      11:     0x5608fe2d02e7 - std::panicking::begin_panic::{{closure}}::hb0d33eb2055508d2
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:542:9
      12:     0x5608fe2d02c0 - std::sys_common::backtrace::__rust_end_short_backtrace::hd3b4c97b5a7cad4c
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/sys_common/backtrace.rs:141:18
      13:     0x5608fde1800e - std::panicking::begin_panic::hba99d266457f13fb
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:541:12
      14:     0x5608fe2b7b69 - tokio::runtime::enter::enter::h9f1b3f7a8528ef6d
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/runtime/enter.rs:39:9
      15:     0x5608fde81ab9 - tokio::runtime::handle::Handle::block_on::h83d62ebbad7b513c
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/runtime/handle.rs:352:34
      16:     0x5608fdee64b1 - verneuil::tracker::Tracker::flush_spooled_data::hb8ee1dce4a49165a
                                   at /home/noah/code/reddit_exporter/rust/verneuil/src/tracker/mod.rs:384:13
      17:     0x5608fdecbea7 - verneuil__file_flush_replication_data
                                   at /home/noah/code/reddit_exporter/rust/verneuil/src/vfs_ops.rs:88:11
      18:     0x5608fdf085e8 - linux_file_control
                                   at /home/noah/code/reddit_exporter/rust/verneuil/c/vfs.c:2150:36
      19:     0x5608fdf085e8 - linux_file_control
                                   at /home/noah/code/reddit_exporter/rust/verneuil/c/vfs.c:2107:1
      20:     0x7f458b082da0 - sqlite3_file_control
      21:     0x7f458b1161a1 - <unknown>
      22:     0x7f458b11a9b3 - <unknown>
      23:     0x7f458b11f809 - <unknown>
      24:     0x7f458b11fbef - <unknown>
      25:     0x7f458b11fcea - sqlite3_prepare_v2
      26:     0x5608fe126be0 - rusqlite::inner_connection::InnerConnection::prepare::h8dbdc4b6e92dad97
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/rusqlite-0.26.3/src/inner_connection.rs:247:17
      27:     0x5608fde28950 - rusqlite::Connection::prepare::hfcb7f155e657f120
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/rusqlite-0.26.3/src/lib.rs:711:9
      28:     0x5608fde28950 - rusqlite::pragma::<impl rusqlite::Connection>::pragma::h7c49312f1339a295
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/rusqlite-0.26.3/src/pragma.rs:234:24
      29:     0x5608fde2ac77 - reddit_exporter::run::{{closure}}::h0b08dcc4bebf950f
                                   at /home/noah/code/reddit_exporter/rust/reddit_exporter/src/lib.rs:84:9
      30:     0x5608fde2ac77 - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h23826ff10c1568e1
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/core/src/future/mod.rs:80:19
      31:     0x5608fde2ac77 - reddit_exporter::main::{{closure}}::h4a70bb1ad8b599de
                                   at /home/noah/code/reddit_exporter/rust/reddit_exporter/src/main.rs:5:5
      32:     0x5608fde2ac77 - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h0c01af9e689a1fba
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/core/src/future/mod.rs:80:19
      33:     0x5608fde271df - tokio::park::thread::CachedParkThread::block_on::{{closure}}::hbeb1622ff3aac55c
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/park/thread.rs:263:54
      34:     0x5608fde271df - tokio::coop::with_budget::{{closure}}::h5bfc9a4967bb24f0
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/coop.rs:102:9
      35:     0x5608fde271df - std::thread::local::LocalKey<T>::try_with::h0bc854f57be45633
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/thread/local.rs:399:16
      36:     0x5608fde271df - std::thread::local::LocalKey<T>::with::hbcf97c3248396ffc
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/thread/local.rs:375:9
      37:     0x5608fde29c5e - tokio::coop::with_budget::hae6ca180114e01a9
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/coop.rs:95:5
      38:     0x5608fde29c5e - tokio::coop::budget::h2509df568e464e85
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/coop.rs:72:5
      39:     0x5608fde29c5e - tokio::park::thread::CachedParkThread::block_on::h12239001b61cbb4f
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/park/thread.rs:263:31
      40:     0x5608fde30fdf - tokio::runtime::enter::Enter::block_on::h6fa1fb9e9d8e58e2
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/runtime/enter.rs:151:13
      41:     0x5608fde30fdf - tokio::runtime::thread_pool::ThreadPool::block_on::ha300f1a646b9eff9
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/runtime/thread_pool/mod.rs:73:9
      42:     0x5608fde27725 - tokio::runtime::Runtime::block_on::h68b28e80d59b6be7
                                   at /home/noah/.cargo/registry/src/localhost-20d513fd1fb2345e/tokio-1.17.0/src/runtime/mod.rs:477:43
      43:     0x5608fde27725 - reddit_exporter::main::h1bfb50ff936ba5b0
                                   at /home/noah/code/reddit_exporter/rust/reddit_exporter/src/main.rs:6:5
      44:     0x5608fde28400 - core::ops::function::FnOnce::call_once::hbfe679e2efad3469
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/core/src/ops/function.rs:227:5
      45:     0x5608fde28400 - std::sys_common::backtrace::__rust_begin_short_backtrace::hd96ca07378af8c4f
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/sys_common/backtrace.rs:125:18
      46:     0x5608fde2f7fd - std::rt::lang_start::{{closure}}::h8b17894e59eb2d0d
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/rt.rs:63:18
      47:     0x5608fe2ef20a - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hcbe5785d54a49ce2
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/core/src/ops/function.rs:259:13
      48:     0x5608fe2ef20a - std::panicking::try::do_call::h1c15e7ee8563c29d
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:401:40
      49:     0x5608fe2ef20a - std::panicking::try::hb8b07cef009cb4fc
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:365:19
      50:     0x5608fe2ef20a - std::panic::catch_unwind::h65a8efbe596bcbb1
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panic.rs:434:14
      51:     0x5608fe2ef20a - std::rt::lang_start_internal::{{closure}}::h66ded7ab0f18e9fb
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/rt.rs:45:48
      52:     0x5608fe2ef20a - std::panicking::try::do_call::h29cafa65efb3f3f3
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:401:40
      53:     0x5608fe2ef20a - std::panicking::try::h412fc5378b4dbfc4
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panicking.rs:365:19
      54:     0x5608fe2ef20a - std::panic::catch_unwind::h555def6aac5bf192
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/panic.rs:434:14
      55:     0x5608fe2ef20a - std::rt::lang_start_internal::hfff48202d7fee6e3
                                   at /rustc/240ff4c4a0d0936c9eeb783fa9ff5c0507a6ffb4/library/std/src/rt.rs:45:20
      56:     0x5608fde2790a - main
      57:     0x7f458a99b0b3 - __libc_start_main
      58:     0x5608fde1b50e - _start
      59:                0x0 - <unknown>
    Aborted (core dumped)
    

    We tried a patch earlier to fix this, to no effect:

    diff --git a/src/tracker/mod.rs b/src/tracker/mod.rs
    index 63b6c78..ae63cf4 100644
    --- a/src/tracker/mod.rs
    +++ b/src/tracker/mod.rs
    @@ -372,6 +372,20 @@ impl Tracker {
         /// tracked database.
         #[instrument(skip(self))]
         pub fn flush_spooled_data(&self) -> Result<()> {
    -        crate::copier::copy_spool_path(self.buffer.spooling_directory())
    +        let to_flush = self.buffer.spooling_directory();
    +
    +        if let Ok(handle) = tokio::runtime::Handle::try_current() {
    +            // We use our own Tokio runtime in `copy_spool_path`.
    +            // Avoid panics due to nested runtimes by flushing in a
    +            // temporary thread.  Flushes have to upload data over the
    +            // network, so they're not exactly fast anyway.
    +            let to_flush = to_flush.to_owned();
    +
    +            handle
    +                .block_on(handle.spawn_blocking(move || crate::copier::copy_spool_path(&to_flush)))
    +                .map_err(|e| chain_error!(e, "copy_spool_path thread failed"))?
    +        } else {
    +            crate::copier::copy_spool_path(to_flush)
    +        }
         }
     }
    
    opened by Church- 4
  • Somewhat large manifest protos for > 1 GB databases

    We currently refer to each 64KB page in a database with a 16 byte fingerprint. That's a good 4000x reduction, but still means a 256 KB incompressible manifest for a 1 GB db file.

    Most DB writes don't actually change that many pages, so we should be able to expose that redundancy to the compressor. Do so by stashing a base list of fingerprints (as raw bytes in a content-addressed chunk), and xor-ing that with the list of fingerprints before serialisation and after deserialisation. This should result in a lot of zeros for zstd to shrink.

    In order to achieve this:

    1. Decode manifest protos with an optional base chunk
    2. Encode manifest protos with a base chunk
    3. Try to decompress manifest blobs when they look zstd-compressed
    4. Compress manifest blobs
    5. ~~Avoid re-uploading chunks back-to-back in copier.rs~~ already handled! (RecentWorkSet)
    6. Figure out a policy to reset the base chunk
    7. Make sure not to use base chunks for dbs < a certain size
    8. Stash the latest Arc<Chunk> for each db somewhere: this guarantees we keep them in the global cache.
    opened by pkhuong 1
  • Lack of sqlite flush on shutdown

    When running a program in a oneshot mode rather than as an ongoing daemon process, the model verneuil operates under (with respect to replication lag) means we may see data loss at the end of the program. This will generally only hit short-running programs, but may also affect long-running ones.

    As a workaround, we can call verneuil::copy_all_spool_paths at program end to force an upload of any locally spooled replication data that has not made it to the remote.

    opened by Church- 0
  • Let snapshots load chunks on demand

    When a snapshot is small (maybe < 10-20 chunks, < 2 MB), it makes sense to download everything as soon as the snapshot is created. Snapshots for larger databases (e.g., 100+ MB) may not need to load all that data. Add the option to load only a prefix of the database (where we expect all b-tree root and internal pages to live) on snapshot creation, and the rest on demand.

    The current implementation fetches only one chunk at a time, with no prefaulting, so it will be slow. We can fix that later. For now, on-demand loading is disabled by default and available when it makes sense.

    Closes #12.

    opened by pkhuong 0
  • Let snapshots start with partial data

    For large databases, fully loading all the chunks in a snapshot can be useless, and is always slow. Sqlite page allocation tends to put root btree pages at the beginning of the file, so we might want to load only the first few chunks in a snapshot, and the rest on demand.

    opened by pkhuong 0
  • Bundle high-churn chunks with the manifest

    Implements the improvement sketched in #7.

    We already made manifests a lot smaller with #3, at the potential expense of one extra I/O. We can recover some of that by sending chunks we don't expect to benefit from deduplication (because they're rarely reused across transactions) inline with the manifest. The net effect will be similar data transfer, but fewer blobs. Combined with #3, we'll obtain lower transfer bandwidth, storage footprint, and API calls compared to v0.1.

    The first chunk, the one that includes sqlite's header page, is a prime candidate: the header contains a 32-bit integer that's incremented whenever the file's contents change. The rest of the page serves as the root B-tree page for the schema table. Root pages are often extra sparse (only they get to hit < 50% occupancy), and the schema table is mostly SQL DDL, i.e., nicely compressible. Once compressed by zstd, it shouldn't add too much to the manifest's size.

    opened by pkhuong 0
  • Introduce the concept of well-known chunks

    The loader treats the all-zero 64KB chunk specially: it knows the corresponding fingerprint and immediately returns the corresponding data. That's a useful concept to expose to the rest of the crate, if only because we sometimes need a dummy fingerprint.

    opened by pkhuong 0
  • Bundle high-churn chunks with the manifest

    The first chunk (which contains sqlite's first page, and thus starts with a header that includes a file change counter) is unlikely to see meaningful deduplication from content addressing. It also tends to be highly compressible (it holds the sqlite schema btree, which is usually small and mostly consists of SQL text).

    We should bundle it with the manifest.

    opened by pkhuong 0
  • Split up `Tracker::snapshot_file_contents`

    It's a linear pipeline, but a lot of operations. Splitting it up a little should help readers. Splitting up the tracker module into more files also helps enforce more internal invariants: just because we split up the snapshot generation logic in more methods doesn't mean it makes sense to call these methods in any other context or order. Generating a new snapshot involves a lot of state, some Rust, but mostly in the filesystem.

    Fixes #5

    opened by pkhuong 0
  • Fix #3 with a xor diffing scheme

    When manifests grow large, we can now represent their constituent fingerprints as xor-diffs on top of a base chunk. As long as the bytes in the base chunk are similar to the list of fingerprints, the manifest will have a lot of zero bytes when serialised, and zstd will easily compress them away.

    The main change is the optional xoring of a base chunk. However, a lot of the new code goes into making this change effective, in particular:

    • Adding compression for manifests, to actually save space when the manifest is full of zeros
    • Keeping a reference to the last base chunk we read for a database path, to guarantee it will be available in cache

    These changes alleviate the issue with large manifests for large (e.g., 1-2 GB) databases that change slowly (maybe 2-3 pages per transaction), and were primarily tested by running the sqlite tests with invariant checks on the snapshots (staged, ready, and consuming) around every write.

    opened by pkhuong 0
  • Consider a different/dedicated type for chunk ids

    Sqlite tops out at 2^31 pages, and 64K is its (current) maximum page size. Simply using u32 for the chunk index type would prevent some type confusion errors between offsets in a file and chunk index.

    opened by pkhuong 0