Describe the bug
I am implementing a cache using the hybrid cache feature of CacheLib. In a test, I insert a large amount of data so that it spills into NvmCache (I explicitly wait via `handle.wait()`), and then read the data back using `handle.find()`. Many of the returned handles that `wentToNvm` now point to `nullptr`.
In detail, here is my cache config:
```cpp
config
    .setCacheSize(40 * 1024 * 1024)  // 40MB DRAM cache
    .setCacheName(name_)
    .setAccessConfig(std::pow(2, cacheEntriesPower_));
// 100MB simple-file navy device; the third argument is the truncate-file
// flag, which I leave as false.
nvmConfig.navyConfig.setSimpleFile(<some path>, 100 * 1024 * 1024, false);
nvmConfig.navyConfig.setBlockSize(4096);
nvmConfig.navyConfig.blockCache().setRegionSize(16 * 1024 * 1024);
config.enableNvmCache(nvmConfig);
```
I add a small DRAM pool:
```cpp
cacheInternal_->addPool(poolName, 10 * 1024 * 1024);
```
Here is my test:
```cpp
for (auto i = 0; i < 10240; i++) {
  std::string key = std::to_string(i);
  std::string val = longText + key;  // longText is 4KB in size
  // put vertex
  auto status = cache->Insert(key, val);
  EXPECT_TRUE(status.ok());
}
for (auto i = 0; i < 10240; i++) {
  std::string key = std::to_string(i);
  auto handle = cache->lookup(key);  // this function wraps find() and wait()
  EXPECT_NE(handle, nullptr);
  // ...
}
```
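For context, here is a minimal sketch of what my `lookup()` wrapper does, written against my understanding of the CacheLib handle API (the allocator alias, the `ReadHandle` type name, and the `wentToNvm()` diagnostic are my assumptions, not the exact production code):

```cpp
#include <iostream>
#include <string>

#include "cachelib/allocator/CacheAllocator.h"

using Cache = facebook::cachelib::LruAllocator;  // assumed allocator type

// find() may return a handle that is still being filled asynchronously
// from NVM; wait() blocks until that fill completes, after which the
// handle is either valid or null.
Cache::ReadHandle lookup(Cache& cache, const std::string& key) {
  auto handle = cache.find(key);
  handle.wait();
  if (!handle && handle.wentToNvm()) {
    // The lookup was dispatched to NvmCache, so a null handle here means
    // navy itself reported a miss for this key.
    std::cerr << "NVM miss for key " << key << "\n";
  }
  return handle;
}
```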
Note that I am using a fixed file name for the navy device file.
In this test, the small keys (starting from 0) are evicted to NVM. On the first run, roughly the first 1300 handles are non-null, but the remaining handles served from NVM are `nullptr`. On a second run, more handles (around 3000) are non-null. If I keep repeating this, after about 4 runs all returned handles are non-null and the data can be retrieved.
As for the navy file size: the first run fails and leaves the file at 16MB. The second run also fails, but the file grows to 32MB, and so on. After the final, successful, run the file size is 81MB.
So my guess is that the device file does not grow as fast as items are evicted to NVM. What if I `dd` a file in advance? Indeed, if I `dd` a 100MB file as the navy device file beforehand, the aforementioned test passes.
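For reference, a minimal sketch of pre-allocating the device file programmatically (the same effect I get from `dd`); this is just my own Linux-specific illustration, and `preallocateNavyFile` is a hypothetical helper, not a CacheLib API:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Create the navy device file with real disk blocks allocated up front,
// so the file does not have to grow while navy writes out regions.
bool preallocateNavyFile(const char* path, off_t bytes) {
  int fd = ::open(path, O_CREAT | O_RDWR, 0644);
  if (fd < 0) {
    return false;
  }
  // posix_fallocate allocates actual blocks (unlike ftruncate, which
  // would only produce a sparse file).
  int rc = ::posix_fallocate(fd, 0, bytes);
  ::close(fd);
  return rc == 0;
}
```

Calling e.g. `preallocateNavyFile(path, 100 * 1024 * 1024)` before constructing the cache should be equivalent to the `dd` workaround.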
Another interesting thing: if I increase the DRAM pool size from 10MB to 30MB, the test also passes. With this in mind, I tried to reproduce the error with your `ConcurrentFills` test in `cachelib/allocator/nvmcache/tests/NvmCacheTests.cpp`.
To Reproduce
Here are the changes I made:
```diff
diff --git a/cachelib/allocator/nvmcache/tests/NvmCacheTests.cpp b/cachelib/allocator/nvmcache/tests/NvmCacheTests.cpp
index 28051e7f..944ec266 100644
--- a/cachelib/allocator/nvmcache/tests/NvmCacheTests.cpp
+++ b/cachelib/allocator/nvmcache/tests/NvmCacheTests.cpp
@@ -357,11 +357,11 @@ TEST_F(NvmCacheTest, ConcurrentFills) {
   auto& nvm = this->cache();
   auto pid = this->poolId();
 
-  const int nKeys = 1024;
+  const int nKeys = 10240;
   for (unsigned int i = 0; i < nKeys; i++) {
     auto key = std::string("blah") + folly::to<std::string>(i);
-    auto it = nvm.allocate(pid, key, 15 * 1024);
+    auto it = nvm.allocate(pid, key, 4 * 1024);
     ASSERT_NE(nullptr, it);
     *((int*)it->getMemory()) = i;
     nvm.insertOrReplace(it);
@@ -370,7 +370,7 @@ TEST_F(NvmCacheTest, ConcurrentFills) {
   auto doConcurrentFetch = [&](int id) {
     auto key = std::string("blah") + folly::to<std::string>(id);
     std::vector<std::thread> thr;
-    for (unsigned int j = 0; j < 50; j++) {
+    for (unsigned int j = 0; j < 1; j++) {
       thr.push_back(std::thread([&]() {
         auto hdl = nvm.find(key);
         hdl.wait();
@@ -378,12 +378,12 @@ TEST_F(NvmCacheTest, ConcurrentFills) {
         ASSERT_EQ(id, *(int*)hdl->getMemory());
       }));
     }
-    for (unsigned int j = 0; j < 50; j++) {
+    for (unsigned int j = 0; j < 1; j++) {
       thr[j].join();
     }
   };
 
-  for (unsigned int i = 0; i < 10; i++) {
+  for (unsigned int i = 0; i < 10240; i++) {
     doConcurrentFetch(i);
   }
 }
diff --git a/cachelib/allocator/tests/NvmTestUtils.h b/cachelib/allocator/tests/NvmTestUtils.h
index 8f53d24b..2f1c8974 100644
--- a/cachelib/allocator/tests/NvmTestUtils.h
+++ b/cachelib/allocator/tests/NvmTestUtils.h
@@ -31,7 +31,7 @@ inline NavyConfig getNvmTestConfig(const std::string& cacheDir) {
   config.setDeviceMetadataSize(4 * 1024 * 1024);
   config.setBlockSize(1024);
   config.setNavyReqOrderingShards(10);
-  config.blockCache().setRegionSize(4 * 1024 * 1024);
+  config.blockCache().setRegionSize(16 * 1024 * 1024);
   config.bigHash()
       .setSizePctAndMaxItemSize(50, 100)
       .setBucketSize(1024)
```
And you will see the error.

System:
- OS: Ubuntu 20.04
- Disk: NVMe SSD
Additional context
Could you help me figure out whether this is a potential bug or a configuration issue? Thanks!