SetSketch: Filling the Gap between MinHash and HyperLogLog

Overview

SetSketch: Filling the Gap between MinHash and HyperLogLog

This repository contains the source code to reproduce all the results and figures presented in the paper "SetSketch: Filling the Gap between MinHash and HyperLogLog" (arXiv preprint).

Abstract

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Robust and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The developed methods can also be used for HyperLogLog sketches and allow estimation of joint quantities such as the intersection size with a smaller error compared to the common estimation approach based on the inclusion-exclusion principle.

Steps to reproduce the results and figures on Windows 10

  1. Install Windows Subsystem for Linux (WSL) with Ubuntu 20.04 LTS

  2. Install required packages:

    sudo apt install gradle g++ libboost-dev python3-matplotlib python3-scipy texlive-full
    
  3. Clone repository including submodules:

    git clone --recursive https://github.com/dynatrace-research/set-sketch-paper.git
    
  4. Switch to set-sketch-paper directory:

    cd set-sketch-paper
    
  5. Run simulations (takes several hours):

    gradle runCardinalityTest runJointTest runPerformanceTest
    
  6. Generate all figures:

    gradle pdfFigures
    
You might also like...
"Wireless Made Easy!" - Microchip MRF MiWi package is MiWi P2P and Star Stacks for MRF24J40 and MRF89XA transceivers running on MPLAB X IDE

MRF-MiWi "Wireless Made Easy!" - Microchip MiWi P2P and Star Stack Opened for MRF24J40 and MRF89XA transceiver running on MPLAB X IDE Devices: | MRF24

A simple single header 64 and 32 bit hash function using only add, sub, ror, and xor.
A simple single header 64 and 32 bit hash function using only add, sub, ror, and xor.

K-HASH A simple single header 64 bit hash function using only add, sub, ror, and xor. This a just general-purpose hash function for like making hash m

Given two arrays X and Y of positive integers, find the number of pairs such that x^y y^x (raised to power of) where x is an element from X and y is an element from Y.

Number-of-pairs Given two arrays X and Y of positive integers, find the number of pairs such that x^y y^x (raised to power of) where x is an element

Analysing and implementation of lossless data compression techniques like Huffman encoding and LZW was conducted along with JPEG lossy compression technique based on discrete cosine transform (DCT) for Image compression.

PROJECT FILE COMPRESSION ALGORITHMS - Huffman compression LZW compression DCT Aim of the project - Implement above mentioned compression algorithms an

Cross-platform STL-styled and STL-compatible library with implementing containers, ranges, iterators, type traits and other tools; actors system; type-safe config interface.

Yato A small repository where I'm gatherting useful snippets and abstractions for C++ development. Yato includes 3 main modules: multidimensional cont

R :package: and header-only C++ library for geospatial space-division based compression and encoding
R :package: and header-only C++ library for geospatial space-division based compression and encoding

spress spress provides utilities for encoding and compressing geospatial objects, such as sf objects. Installation This package requires C++11 for com

R :package: and header-only C++ library for geospatial space-division based compression and encoding
R :package: and header-only C++ library for geospatial space-division based compression and encoding

spress spress provides utilities for encoding and compressing geospatial objects, such as sf objects. Installation This package requires C++11 for com

C++ library and cmdline tools for parsing and manipulating VCF files

vcflib A C++ library for parsing and manipulating VCF files. overview The Variant Call Format (VCF) is a flat-file, tab-delimited textual format that

High performance build system for Windows, OSX and Linux. Supporting caching, network distribution and more.

FASTBuild FASTBuild is a build system for Windows, OSX and Linux, supporting distributed compilation and object caching. It is used by many game devel

Comments
  • Similarity search with set-sketch-paper

    Similarity search with set-sketch-paper

    Hi,

    I was using HLL --cmp for some similarity searches between two data sets (Sequencing data vs. Reference Genome). I would like to try the same with set-sketch. Is there anyway to run a similarity search using set-sketch?

    Thanks

    best, Ahmad

    opened by lutfia95 4
Owner
Dynatrace Research
Dynatrace Research
Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as task graphs that are scheduled concurrently and asynchronously on both CPUs and GPUs.

Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as tasks in a graph structure, where edges represent task dependencies

null 28 Dec 18, 2022
Rajesh Kumar Sah 1 Nov 20, 2021
Is a linear data structure with O(log n) searches and O(cbrt n) insertions and index lookups.

A binary cube is a linear data structure that contains a sorted two dimensional dynamic array of nodes which each point to a sorted array

null 22 Jul 13, 2022
Repository of problems and solutions of labsheets used for Data Structures and Algorithms (CS F211) in Semester 2, 2020-21 at BITS Pilani - Hyderabad Campus.

CS F211 Data Structures and Algorithms (BITS Pilani - Hyderabad Campus) This repository contains the problems, solution approaches & explanations and

Rohit Dwivedula 27 Oct 31, 2022
Algo-Tree is a collection of Algorithms and data structures which are fundamentals to efficient code and good software design

Algo-Tree is a collection of Algorithms and data structures which are fundamentals to efficient code and good software design. Creating and designing excellent algorithms is required for being an exemplary programmer. It contains solutions in various languages such as C++, Python and Java.

DSC-Banasthali 53 Oct 4, 2022
A family of header-only, very fast and memory-friendly hashmap and btree containers.

The Parallel Hashmap Overview This repository aims to provide a set of excellent hash map implementations, as well as a btree alternative to std::map

Gregory Popovitch 1.7k Jan 9, 2023
C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types, continuous memory storage, and no pointers are involved

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types, continuous memory storage, and no pointers are involved

Hossein Moein 1.7k Jan 9, 2023
Templates, algorithms and data structures implemented and collected for programming contests.

Templates, algorithms and data structures implemented and collected for programming contests.

Shahjalal Shohag 2k Jan 2, 2023
This repository aims to contain solutions and explanations to various competitive programming problems, which may be important for interviews and online tests of different companies.

Competitive Programming Solutions Compilation Hello everyone ?? This repository contains solutions and explanations to various competitive programming

Abhinav Agrawal 33 Dec 14, 2022