Why
One of the largest costs of running a RaiBlocks node is the IO needed just to keep up with the current write rate.
General Info
Spinning disks can do ~75 to 100 IOPS, or ~120 megabytes per second of sequential IO.
Consumer SSDs can do ~10K IOPS, or ~375 megabytes per second of IO.
Problem Description
Bootstrapping currently requires ~1k IOPS and ~3 megabytes/s. LMDB is generating a lot of very small writes for every block, but it is not actually writing much data. A single spinning disk could easily sustain this write rate if the IOs were structured differently.
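To make the mismatch concrete, here is a small back-of-the-envelope sketch in Python using only the approximate figures quoted in this issue. The constant and function names are illustrative, and 1 MB is taken as 1000 KB for simplicity.

```python
# Approximate figures quoted in this issue.
BOOTSTRAP_IOPS = 1_000        # ~1k write operations per second
BOOTSTRAP_MB_PER_S = 3        # ~3 megabytes per second
DISK_IOPS = 100               # upper end of ~75 to 100 IOPS for a spinning disk
DISK_SEQ_MB_PER_S = 120       # sequential throughput of a spinning disk


def avg_write_kb(mb_per_s, iops):
    """Average size of one write in kilobytes (1 MB taken as 1000 KB)."""
    return mb_per_s * 1000 / iops


def load_factor(demand, capacity):
    """How much of a device limit the workload consumes (1.0 = fully used)."""
    return demand / capacity


avg_kb = avg_write_kb(BOOTSTRAP_MB_PER_S, BOOTSTRAP_IOPS)
iops_load = load_factor(BOOTSTRAP_IOPS, DISK_IOPS)
bandwidth_load = load_factor(BOOTSTRAP_MB_PER_S, DISK_SEQ_MB_PER_S)

print(f"average write size: {avg_kb:.1f} KB")
print(f"IOPS needed vs. disk limit: {iops_load:.0f}x")
print(f"bandwidth needed vs. disk limit: {bandwidth_load:.1%}")
```

Under these assumptions the workload needs roughly ten times the IOPS a spinning disk can deliver while using only a few percent of its sequential bandwidth, so the bottleneck is the number of writes, not the amount of data.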
That's not ideal for this use case, where we are more concerned with sustaining a large write rate. Access is strongly skewed toward recent data: newer data is more likely to be read, while old data is less likely to be read. So we should choose a data-storage technology that allows for very cheap writes, has relatively cheap reads on recent data, and can scale to large amounts of data.
LMDB is a memory-mapped B-Tree. It makes for very fast random reads; however, writes are expensive.
Log Structured Merge Trees, however, have exactly the properties we're looking for. See: The advantages of an LSM vs a B-Tree
Log structured merge trees allow writes to come in at a very high rate while generating only a small number of larger IOs. So we should think about replacing LMDB with a log structured merge tree. The best of breed currently is RocksDB. It also has the added advantage that it can compress blocks.
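As a rough illustration of why buffering writes helps, the sketch below is a toy log-structured store in Python. It is not RocksDB and does not model its internals: there is no write-ahead log, compaction, or real persistence, and the names `ToyLSM` and `memtable_limit` are hypothetical. It only counts how many large "disk writes" result from many small puts.

```python
import bisect


class ToyLSM:
    """Toy log-structured merge tree that counts simulated disk writes."""

    def __init__(self, memtable_limit=1000):
        self.memtable_limit = memtable_limit
        self.memtable = {}      # recent writes, held in memory
        self.runs = []          # flushed sorted runs, newest last
        self.disk_writes = 0    # number of simulated large sequential writes

    def put(self, key, value):
        # A put only touches memory until the memtable fills up.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the whole memtable out as one sorted run (one large IO).
        if not self.memtable:
            return
        self.runs.append(sorted(self.memtable.items()))
        self.disk_writes += 1
        self.memtable = {}

    def get(self, key):
        # Newest data first: memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = ToyLSM(memtable_limit=1000)
for i in range(10_000):
    db.put(f"block-{i:05d}", i)
db.flush()

print(db.disk_writes)           # 10 flushes for 10,000 puts
print(db.get("block-00042"))    # 42
```

In this toy model recent keys are served from memory and older keys need a search through the flushed runs, which matches the access pattern described above: cheap writes, cheap reads on recent data, and more expensive reads on old data.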
Suggested Solution
- Add RocksDB
- Add ZSTD
- Configure RocksDB with universal compaction.
- Add a flag to allow using RocksDB.
- After it's all tested and shown to be working, remove the LMDB code.
Steps to reproduce the issue:
- Start a new rai node
- Run iostat -dxt 5
- Notice the very small IOs being issued.
Environment:
cpu:
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1995 MHz
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1950 MHz
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 1733 MHz
Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz, 2444 MHz
storage:
Intel 8 Series SATA Controller 1 [AHCI mode]
network:
eno1 Intel Ethernet Connection I218-V
network interface:
eno1 Ethernet network interface
lo Loopback network interface
docker0 Ethernet network interface
veth8964baa Ethernet network interface
disk:
/dev/sda Crucial_CT120M50
partition:
/dev/sda1 Partition
/dev/sda2 Partition
/dev/sda3 Partition
Logs:
01/24/2018 07:39:03 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 777.60 0.00 3742.40 9.63 3.77 4.85 0.00 4.85 0.04 3.04
01/24/2018 07:39:08 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 868.20 0.00 4256.80 9.81 4.18 4.82 0.00 4.82 0.04 3.44
01/24/2018 07:39:13 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 696.60 0.00 3480.00 9.99 3.49 5.00 0.00 5.00 0.04 2.88
01/24/2018 07:39:18 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.40 0.00 586.40 0.00 2907.20 9.92 3.03 5.17 0.00 5.17 0.04 2.48
01/24/2018 07:39:23 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.60 0.00 490.60 0.00 2472.00 10.08 2.54 5.18 0.00 5.18 0.04 2.08
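The average write size can be read straight off these samples. The Python sketch below copies the five `sda` rows from the log above and divides `wkB/s` by `w/s`. The helper name `parse` is illustrative, and the cross-check assumes `avgrq-sz` is reported in 512-byte sectors, as documented for this iostat output format.

```python
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util")

# The five sda samples from the log above, copied verbatim.
SAMPLES = [
    "sda 0.00 0.40 0.00 777.60 0.00 3742.40 9.63 3.77 4.85 0.00 4.85 0.04 3.04",
    "sda 0.00 0.40 0.00 868.20 0.00 4256.80 9.81 4.18 4.82 0.00 4.82 0.04 3.44",
    "sda 0.00 0.40 0.00 696.60 0.00 3480.00 9.99 3.49 5.00 0.00 5.00 0.04 2.88",
    "sda 0.00 0.40 0.00 586.40 0.00 2907.20 9.92 3.03 5.17 0.00 5.17 0.04 2.48",
    "sda 0.00 0.60 0.00 490.60 0.00 2472.00 10.08 2.54 5.18 0.00 5.18 0.04 2.08",
]


def parse(line):
    """Map iostat column names to float values for one device row."""
    names = HEADER.split()[1:]                    # drop the "Device:" label
    values = [float(v) for v in line.split()[1:]]  # drop the device name
    return dict(zip(names, values))


rows = [parse(sample) for sample in SAMPLES]

# Kilobytes written per second divided by writes per second.
avg_write_kb = [row["wkB/s"] / row["w/s"] for row in rows]

# Cross-check: avgrq-sz converted from 512-byte sectors to kilobytes.
avgrq_kb = [row["avgrq-sz"] * 512 / 1024 for row in rows]

for kb, rq in zip(avg_write_kb, avgrq_kb):
    print(f"avg write: {kb:.2f} KB (avgrq-sz: {rq:.2f} KB)")
```

Each sample works out to roughly 5 KB per write at several hundred writes per second, which is the pattern of many very small IOs described in the problem description.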