A low-latency LRU approximation cache in C++ using CLOCK second-chance algorithm. Multi level cache too. Up to 2.5 billion lookups per second.

Overview

LruClockCache

Low-latency LRU approximation cache in C++ using CLOCK second-chance algorithm. (see wiki for details)

using MyKeyType = std::string;
using MyValueType = MinecraftChunk;

LruClockCache<MyKeyType,MyValueType> cache(1024*5,[&](MyKeyType key){ 
  // cache miss (read)
  // access data-store (network, hdd, graphics card, anything that is slower than RAM or higher-latency than RAM-latency x2)
  return readChunkFromHDD(key);
  },[&](MyKeyType key,MyValueType value){ 
  
  // cache miss (write)
  // access data-store
  writeChunkToHDD(key,value);
});

// cache handles all cace-miss functions automatically
MinecraftChunk chunk = cache.get("world coordinates 1500 35 2000");

// cache handles all cace-miss functions automatically
cache.set("world coordinates 1502 35 1999",chunk);

cache.flush(); // clears all pending-writes in the cache and writes to backing-store

160x speedup on noise-generation for procedural terrain (2x speedup against AVX/SIMD optimized version):

160x speedup on noise-generation for procedural terrain

Benchmarks:

Up to 50 million lookups per second on an old CPU like FX8150 under heavy get/set usage.

Lowest cache-miss latency performance with char key, char value: 27 nanoseconds

Lowest cache-hit latency performance with char key, char value: 16 nanoseconds

Highest cache-miss bandwidth performance with very long std::string key, very long std::string value: 1.9 GB/s (~21% peak bw)

Highest cache-hit bandwidth performance with very long std::string key, very long std::string value: 4.9 GB/s (~55% peak bw)

Test system: FX8150 CPU @ 3.6GHz (CPU from 2011), 1-channel DDR3 RAM @ 1600 MHz (9GB/s peak bw and 150ns latency), Ubuntu 18.04 LTS 64-bit

Test method for cache-miss: single-threaded simple loop doing get&set using 100k different keys & values with only 300 cache items

Test method for cache-hit: single-threaded simple loop doing get&set using 10 keys & 15 cache items


Image Softening Algorithm: read each pixel and its 4 closest neighbor pixels and write to same pixel the average of 5 pixels

Image size = 1024x1024 pixels

Cache size = 1024x5 pixels (1024x1024+1000 pixels)

Repeats: 10 (50)

Key: 32-bit integer (same)

Value: 32-bit integer (same)

Total time: 10.4 seconds (6.7 seconds)

Cache time: 9.6 seconds (6.25 seconds)

Throughput = 6.5 million pixel get/set operations per second (~50 million pixels per second)

Cache-hit-ratio (read): 78% (100%)

Timings (50 million pixel lookups per second = 20 nanosecond average per access) include all of the computation work: integer division after pixel summations and chrono time measurement latency.


Multi Level Cache (read+write coherency + multithreaded = 75 million lookups per second)

int main()
{
	std::vector<int> data(1024*1024); // simulating a backing-store
    
	MultiLevelCache<int,int> cache(
		64*1024 /* direct-mapped L1 cache elements */,
		256,1024 /* n-way set-associative (LRU approximation) L2 cache elements */,
		[&](int key){ return data[key]; } /* cache-miss function to get data from backingstore into cache */,
		[&](int key, int value){ data[key]=value; } /* cache-miss function to set data on backging-store during eviction */
	);
	cache.set(5,10); // this is single-thread example, sets value 10 at key position of 5
	cache.flush(); // writes all latest bits of data to backing-store
	std::cout<<data[5]<<std::endl;
	auto val = cache.getThreadSafe(5); // this is thread-safe from any number of threads
	cache.setThreadSafe(10,val); //    thread-safe, any number of threads
	return 0;
}

Async Multi Level Cache (read+write weak-coherency(threads are responsible to use barrier) + multithread = up to 180 million lookups per second)

int main()
{
	std::vector<int> data(1000000); // backing-store simulation
	
	// L1 cache = direct mapped 128 elements
	// L2 cache = fully associative 128*1024 elements
	// similar cache-miss functions with MultiLevelCache
	AsyncCache<int,int> cache(128,128*1024,[&](int key){ return data[key]; },[&](int key, int value){ data[key]=value; });
	
	int val;
	
	// immediately returns, cache runs asynchronously to serve in a dedicated thread
	int slot = cache.setAsync(5,100); 
	
	// immediately returns, any slot id can be used but a thread should use its own unique slot id on all operations for maximum performance
	cache.getAsync(5,&val,slot);	
	
	std::cout<<data[5]<<" "<<val<<std::endl;
	
	// waits for completion of operations issued into slot
	cache.barrier(slot); 
	
	std::cout<<data[5]<<" "<<val<<std::endl;
	
	// writes all dirty data to backing-store and waits for completion
	cache.flush(); 
	
	std::cout<<data[5]<<" "<<val<<std::endl;
	return 0;
}

Multi Level Cache (read-only + multithreaded = 2.5 billion lookups per second)

If keys are integer type (char, int, size_t), then a L1 direct-mapped cache can be added in front of an L2 LRU cache to act as a single-thread front of an LLC cache given by user (which implements thread-safe set/get operations). Currently it supports read-only multi-thread or read+write single thread:

// simulating a backing-store
std::vector<int> backingStore(10000000);

// this is to be used by multi-level cache constructor and must implement getThreadSafe setThreadSafe methods.
auto LLC = std::make_shared<LruClockCache<int,int>>(LLCsize,
[ & ](int key) {
  readmiss++;
  return backingStore[key];
},
[ & ](int key, int value) {
  writemiss++;
  backingStore[key] = value;
});

// this is the multi-level cache that adds an L1 and an L2 infront of the LLC.
size_t L1size=1024*32; // this needs to be (power of 2) sized cache because it is direct-mapped (with N-way tags) cache
size_t L2size=1024*512; // this can be any size, it is LRU cache
CacheThreader<LruClockCache,int,int> cache(LLC,L1size,L2size);
cache.set(500,10); // if it is in L1 then returns directly in 1-2 nanoseconds
                    // if it is not in L1 then goes L2 and returns within ~50 nanoseconds if its an L2 hit
                    // if it is not L2 hit then it goes LLC in a thread-safe way and gets data much slower like 500 nanoseconds or more due to std::lock_guard
 
cache.get(500); // same as set method but returns a value

// currently multithreading not supported due to lack of write-invalidation method but with a few changes it is ready to be used as a read-only cache
// when invalidation method is implemented, it will be multithreaded read-write cache. For now, it is single-thread read-write cache.

cache.flush(); // write latest bits of data to the LLC
LLC.flush(); // write latest bits of data to the backing store

Benchmarks for Multi-Level Cache:

Up to 400 million lookups per second for FX8150 3.6GHz in single-threaded Gaussian Blur algorithm. This is equivalent to 2.5 nanoseconds average access latency per pixel. For a new CPU like Ryzen, it should be as fast as a billion lookups per second.

~900 million get calls per second in read-only usage with multiple threads: https://github.com/tugrul512bit/LruClockCache/wiki/How-To-Do-Multithreading-With-a-Read-Only-Multi-Level-Cache

LLC cache hit ratio (read)=1
LLC cache hit ratio (write)=1
image size=34x34 L1 tags = 65536 L2 tags=262144 LLC tags=1048576 performance: 175530469 nanoseconds     (bandwidth = 1009.24 MB/s)      (throughput = 3.96 nanoseconds per iteration) 
Finished!
LLC cache hit ratio (read)=1.00
LLC cache hit ratio (write)=1.00
image size=68x68 L1 tags = 65536 L2 tags=262144 LLC tags=1048576 performance: 198700826 nanoseconds     (bandwidth = 947.92 MB/s)      (throughput = 4.22 nanoseconds per iteration) 
Finished!
LLC cache hit ratio (read)=1.00
LLC cache hit ratio (write)=1.00
image size=136x136 L1 tags = 65536 L2 tags=262144 LLC tags=1048576 performance: 192917413 nanoseconds     (bandwidth = 1005.22 MB/s)      (throughput = 3.98 nanoseconds per iteration) 
Finished!
LLC cache hit ratio (read)=1.00
LLC cache hit ratio (write)=1.00
image size=272x272 L1 tags = 65536 L2 tags=262144 LLC tags=1048576 performance: 298379386 nanoseconds     (bandwidth = 654.78 MB/s)      (throughput = 6.11 nanoseconds per iteration) 
Finished!
LLC cache hit ratio (read)=1.00
LLC cache hit ratio (write)=1.00
image size=544x544 L1 tags = 65536 L2 tags=262144 LLC tags=1048576 performance: 3387914152 nanoseconds     (bandwidth = 55.49 MB/s)      (throughput = 72.08 nanoseconds per iteration) 
Finished!
LLC cache hit ratio (read)=0.93
LLC cache hit ratio (write)=0.45
image size=1088x1088 L1 tags = 65536 L2 tags=262144 LLC tags=1048576 performance: 4518605673 nanoseconds     (bandwidth = 41.76 MB/s)      (throughput = 95.78 nanoseconds per iteration) 
Finished!
LLC cache hit ratio (read)=0.89
LLC cache hit ratio (write)=0.12
image size=2176x2176 L1 tags = 65536 L2 tags=262144 LLC tags=1048576 performance: 5058006420 nanoseconds     (bandwidth = 37.38 MB/s)      (throughput = 107.02 nanoseconds per iteration) 
Finished!

You might also like...
Teaching materials for Algorithm Bootcamp: Data Structure.
Teaching materials for Algorithm Bootcamp: Data Structure.

Data Structure Materials Materials Topics Code Introduction to Data Structures Struct Pointer Dynamic Memory Allocation 00_intro_to_ds.cpp Linked List

libsais is a library for linear time suffix array and burrows wheeler transform construction based on induced sorting algorithm.

libsais libsais is a library for fast (see Benchmarks below) linear time suffix array and Burrows-Wheeler transform construction based on induced sort

Open-source graph editor, with built-it step-by-step Dijkstra's Algorithm.
Open-source graph editor, with built-it step-by-step Dijkstra's Algorithm.

Visual Dijkstra - Simple visual graph editor, with built-in step-by-step Dijkstra's algorithm Visual Dijkstra is a free and open-source tool, designed

Simple C++ Genetic Algorithm library
Simple C++ Genetic Algorithm library

crsGA: Simple C++ Genetic Algorithm library crsGA is a simple C++ template library for developing genetic algorithms, plus some other utilities (Logge

Explore the world of Data Structures and Algorithm
Explore the world of Data Structures and Algorithm

Hey Everyone! 👋 DSA- PlayYard is the first open source project of Lets Grow More Community. It is the perfect place to start with or to test your DSA

Useful Algorithm in C

useful-algorithm Useful Algorithm in C Content Number System (Dec, Bin, Hex) Searching Algorithms (Linear, Bin, Jump, ...) Sort Algorithms (Quick, Mer

This algorithm is amazing and take a high performance to search something under array.
This algorithm is amazing and take a high performance to search something under array.

Sequential Binary Algorithm O(n) Algoritmo Este é um algoritmo de complexidade O(log n), que possui uma alta performance em percorrer um vetor de inte

This is a Program, to sort Arrays with the QuickSort Algorithm.

QuickSort This is a program, to sort arrays with the QuickSort Algorithm. The Algorithm is optimized to be quick, but it isn't the fastest. I have wri

Implementation of two of the most famous subdivision algorithm: Loops Subdivision and CatMull-Clark Subdivision.
Implementation of two of the most famous subdivision algorithm: Loops Subdivision and CatMull-Clark Subdivision.

3D-Subdivision-Surface Implementation of two of the most famous subdivision algorithms: Loops Subdivision for Triangles and CatMull-Clark Subdivision

Owner
Hüseyin Tuğrul BÜYÜKIŞIK
Physics engineer Web developer GPGPU Weird equations Optimizations
Hüseyin Tuğrul BÜYÜKIŞIK
Interpolating and plotting empirical data with polynomial approximation

Graph creator I needed a way to plot empirical data in my programs. A nth order polynomial as an interpolating function works for this purpose fine en

Long Nguyen 19 Mar 17, 2022
This project contains the carefully compiled easy to medium level Data Structures & Algorithm questions.

Engineering-Daze Hey everyone! ?? This project contains the carefully compiled easy to medium level Data Structures & Algorithm questions. Engineering

DSC-Banasthali 8 Apr 30, 2022
Multi-Scale Geometric Consistency Guided and Planar Prior Assisted Multi-View Stereo

ACMMP [News] The code for ACMH is released!!! [News] The code for ACMM is released!!! [News] The code for ACMP is released!!! About This repository co

Qingshan Xu 96 Jan 7, 2023
A fast multi-producer, multi-consumer lock-free concurrent queue for C++11

moodycamel::ConcurrentQueue An industrial-strength lock-free queue for C++. Note: If all you need is a single-producer, single-consumer queue, I have

Cameron 7.4k Jan 2, 2023
A bounded multi-producer multi-consumer concurrent queue written in C++11

MPMCQueue.h A bounded multi-producer multi-consumer concurrent queue written in C++11. It's battle hardened and used daily in production: In the Frost

Erik Rigtorp 836 Dec 25, 2022
This repository contains the Assignment code of Data Structures and Algorithms Assignments of SPPU, Second Year IT Syllabus (2019 pattern)

DSAL This repository contains the Assignment code of Data Structures and Algorithms Assignments of SPPU, Second Year IT Syllabus (2019 pattern) Assign

Mujeeb Khan 2 Dec 11, 2022
180+ Algorithm & Data Structure Problems using C++

180+ Algorithm & Data Structure Problems using C++

Ravi Mandliya 5.2k Jan 8, 2023
Flexible level editor

Tiled Map Editor - https://www.mapeditor.org/ About Tiled Tiled is a general purpose tile map editor for all tile-based games, such as RPGs, platforme

mapeditor.org 9.3k Dec 31, 2022
Higher level programming in C

Cello Cello is a library that brings higher level programming to C. By acting as a modern, powerful runtime system Cello makes many things easy that w

Daniel Holden 6k Jan 1, 2023
Implementing dynamic lists in C

Dynamic-Lists Implementing dynamic lists in C <begginer level> DESCRIPTION: This header file <list.h> implements python-like lists and it's functions.

null 2 Feb 2, 2022