PTHash is a C++ library implementing fast and compact minimal perfect hash functions

Overview

PTHash

PTHash is a C++ library implementing fast and compact minimal perfect hash functions as described in the paper PTHash: Revisiting FCH Minimal Perfect Hashing [1].

Given a set S of n distinct keys, a function f that bijectively maps the keys of S into the first n natural numbers is called a minimal perfect hash function (MPHF) for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest. For instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f.

PTHash is one such algorithm [1].

The following guide is meant to provide a brief overview of the library by illustrating its functionalities through some examples. Scripts to generate the data for the tables used in the paper are also provided.

Table of contents

Integration

Integrating PTHash in your own project is very simple: just get the source code and include the header include/pthash.hpp in your code. No other configurations are needed.

If you use git, the easiest way to add PTHash is via git add submodule as follows.

git submodule add https://github.com/jermp/pthash.git

Compiling the Code

The code is tested on Linux with gcc and on Mac with clang. To build the code, CMake is required.

Clone the repository with

git clone --recursive https://github.com/jermp/pthash.git

If you have cloned the repository without --recursive, you will need to perform the following commands before compiling:

git submodule init
git submodule update

To compile the code for a release environment (see file CMakeLists.txt for the used compilation flags), it is sufficient to do the following:

mkdir build
cd build
cmake ..
make -j

For a testing environment, use the following instead:

mkdir debug_build
cd debug_build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DUSE_SANITIZERS=On
make

(NOTE: Beware that the software will result in a much slower execution when running in debug mode and using sanitizers. Use this only for debug purposes, not to run performance tests.)

Quick Start

For a quick start, see the source file src/example.cpp (reported below). The example shows how to setup a simple build configuration for PTHash (parameters, bash hasher, and encoder).

After compilation, run this example with

./example

which will build a PTHash MPHF on 10M random 64-bit keys using c = 7.0 and alpha = 0.94. It also shows how to serialize the data structure on disk and re-load it for evaluation.

#include <iostream>
#include "../include/pthash.hpp"
#include "util.hpp"  // for functions distinct_keys and check

int main() {
    using namespace pthash;

    /* Generate 10M random 64-bit keys as input data. */
    static const uint64_t num_keys = 10000000;
    static const uint64_t seed = 1234567890;
    std::cout << "generating input data..." << std::endl;
    std::vector<uint64_t> keys = distinct_keys<uint64_t>(num_keys, seed);
    assert(keys.size() == num_keys);

    /* Set up a build configuration. */
    build_configuration config;
    config.c = 6.0;
    config.alpha = 0.94;
    config.verbose_output = true;

    /* Declare the PTHash function. */
    typedef single_mphf// base hasher
                        dictionary_dictionary  // encoder type
                        >
        pthash_type;
    pthash_type f;

    /* Build the function in internal memory. */
    std::cout << "building the MPHF..." << std::endl;
    f.build_in_internal_memory(keys.begin(), keys.size(), config);

    /* Compute and print the number of bits spent per key. */
    double bits_per_key = static_cast<double>(f.num_bits()) / f.num_keys();
    std::cout << "MPHF uses " << bits_per_key << " [bits/key]" << std::endl;

    /* Sanity check! */
    if (check(keys.begin(), keys.size(), f)) std::cout << "EVERYTHING OK!" << std::endl;

    /* Now evaluate f on some keys. */
    for (uint64_t i = 0; i != 10; ++i) {
        std::cout << "f(" << keys[i] << ") = " << f(keys[i]) << '\n';
    }

    /* Serialize the data structure to a file. */
    std::cout << "serializing the MPHF to disk..." << std::endl;
    std::string output_filename("mphf.bin");
    essentials::save(f, output_filename.c_str());

    /* Now reload from disk and query. */
    pthash_type other;
    essentials::load(other, output_filename.c_str());
    for (uint64_t i = 0; i != 10; ++i) {
        std::cout << "f(" << keys[i] << ") = " << other(keys[i]) << '\n';
        assert(f(keys[i]) == other(keys[i]));
    }

    return 0;
}

Build Examples

All the examples below must be run from within the directory where the code was compiled (see the section Compiling the Code), using the driver program called build.

Running the command

./build --help

shows the usage of the driver program, as reported below.

Usage: ./build [-h,--help] num_keys c alpha encoder_type [-p num_partitions] [-s seed] [-t num_threads] [-i input_filename] [-o output_filename] [-d tmp_dir] [--external] [--verbose] [--check] [--lookup]

 num_keys
	The size of the input.
 c
	A constant that trades construction speed for space effectiveness.
 alpha
	The table load factor. It must be a quantity > 0 and <= 1.
 encoder_type
	The encoder type. See include/encoders/encoders.hpp for a list of available types.
 [-p num_partitions]
	Number of partitions.
 [-s seed]
	Seed to use for construction.
 [-t num_threads]
	Number of threads to use for construction.
 [-i input_filename]
	A string input file name. If this is not provided, then num_keys 64-bit random keys will be used as input instead.
 [-o output_filename]
	Output file name where the MPHF will be serialized.
 [-d tmp_dir]
	Temporary directory used for building in external memory. Default is directory '.'.
 [--external]
	Build the MPHF in external memory.
 [--verbose]
	Verbose output during construction.
 [--check]
	Check correctness after construction.
 [--lookup]
	Measure average lookup time after construction.
 [-h,--help]
	Print this help text and silently exits.

Example 1

./build 1000000 4.5 0.99 dictionary_dictionary -s 727369 --verbose --check --lookup -o mphf.bin

This example will build a MPHF over 1M random 64-bit keys (generated with seed 727369), using c = 4.5, alpha = 0.99, and compressing the MPHF data structure with the encoder dictionary_dictionary.

The data structure will be serialized on a binary file named mphf.bin.

It will also check the correctness of the data structure (flag --check) and measure average lookup time (flag --lookup).

Construction will happen in internal memory, using a single processing thread. (Experimental setting of the paper.)

Example 2

For the following example, we are going to use the strings from the UK-2005 URLs collection, which can be downloaded by clicking here. (This is also one of the datasets used in the paper.)

The file is ~300 MB compressed using gzip (2.86 GB uncompressed).

After download, place the dataset in the build directory and run

gunzip uk-2005.urls.gz

to uncompress it. The file contains one string per line, for a total of 39,459,925 strings.

NOTE: Input files are read line by line (i.e., individual strings are assumed to be separated by the character \n).

The following command will build a MPHF using the strings of the file as input keys, with c = 7.0, alpha = 0.94.

./build 39459925 7.0 0.94 dictionary_dictionary -s 1234567890 -i uk-2005.urls --verbose --check --lookup

Example 3

./build 39459925 7.0 0.94 dictionary_dictionary -s 1234567890 -i uk-2005.urls --verbose --check --lookup -p 128 -t 4

This example will run the construction over the same input and parameters used in Example 2, but with 128 partitions and using 4 parallel threads. The resulting data structure will consume essentially the same space as that built in Example 2 and only slightly slower at lookup.

Example 4

./build 39459925 7.0 0.94 dictionary_dictionary -s 1234567890 -i uk-2005.urls --verbose --check --lookup --external

This example will run the construction over the same input and parameters used in Example 2, but using external memory. The resulting data structure will be exactly the same as that built in Example 2.

An Example Benchmark

The script script/run_benchmark.sh runs the 4 trade-off configurations (encoder, alpha, c) described in Section 5.2 of the paper [1] on 100M and 1000M keys.

C-C stands for "compact-compact" encoder; D-D for "dictionary-dictionary"; and EF for "Elias-Fano".

From within the directory where the code has been compiled, just run

bash ../script/run_benchmark.sh 2> results.json

to reproduce the bottom part of Table 5 of the SIGIR 2021 paper [1].

Below, the result of the benchmark across different processors and compilers. The code is compiled with -O3 and -march=native in all cases.

Intel i9-9900K @ 3.60 GHz, gcc 9.2.1, GNU/Linux 5.4.0-70-generic x86_64

Configuration 100M keys 1000M keys
constr. (sec) space (bits/key) lookup (ns/key) constr. (sec) space (bits/key) lookup (ns/key)
(1) C-C, alpha = 0.99, c = 7.0 42 3.36 28 1042 3.23 37
(2) D-D, alpha = 0.88, c = 11.0 19 4.05 46 308 3.94 64
(3) EF, alpha = 0.99, c = 6.0 45 2.26 49 1799 2.17 101
(4) D-D, alpha = 0.94, c = 7.0 26 3.23 37 689 2.99 55

Intel i7-7700 @ 3.60 GHz, gcc 9.3.0, GNU/Linux 5.4.0-66-generic x86_64

Configuration 100M keys 1000M keys
constr. (sec) space (bits/key) lookup (ns/key) constr. (sec) space (bits/key) lookup (ns/key)
(1) C-C, alpha = 0.99, c = 7.0 59 3.36 35 1145 3.23 40
(2) D-D, alpha = 0.88, c = 11.0 27 4.05 57 357 3.94 69
(3) EF, alpha = 0.99, c = 6.0 86 2.26 66 1918 2.17 110
(4) D-D, alpha = 0.94, c = 7.0 45 3.23 48 796 2.99 61

Intel i7-4790K @ 4.00GHz, gcc 8.3.0, GNU/Linux 5.0.0-27-generic x86_64

Configuration 100M keys 1000M keys
constr. (sec) space (bits/key) lookup (ns/key) constr. (sec) space (bits/key) lookup (ns/key)
(1) C-C, alpha = 0.99, c = 7.0 55 3.36 41 1156 3.23 51
(2) D-D, alpha = 0.88, c = 11.0 26 4.05 55 422 3.94 69
(3) EF, alpha = 0.99, c = 6.0 81 2.26 69 1921 2.17 147
(4) D-D, alpha = 0.94, c = 7.0 42 3.23 47 812 2.99 60

Other Resources

We maintain a benchmark to evaluate MPHF algorithms available at

https://github.com/roberto-trani/mphf_benchmark.

This benchmark was also used for the experiments in the SIGIR 2021 paper [1].

Authors

References

You might also like...
A tool to decrypt Call of Duty: World War II's Fast File

A tool to decrypt Call of Duty: World War II's Fast File. This tool was made to allow people making HUDs in Call of Duty: Black Ops III's mod tools to aquire the assets needed to port HUDs from Call of Duty: World War II.

MIRACL Cryptographic SDK: Multiprecision Integer and Rational Arithmetic Cryptographic Library is a C software library that is widely regarded by developers as the gold standard open source SDK for elliptic curve cryptography (ECC).

MIRACL What is MIRACL? Multiprecision Integer and Rational Arithmetic Cryptographic Library – the MIRACL Crypto SDK – is a C software library that is

x509cert is a tool and library for generating X.509 certificates and certificate requests.

x509cert is a tool and library for generating X.509 certificates and certificate requests. It is written in C99 and uses BearSSL to decode keys and compute signatures.

HashLibPlus is a recommended C++11 hashing library that provides a fluent interface for computing hashes and checksums of strings, files, streams, bytearrays and untyped data to mention but a few.

HashLibPlus HashLibPlus is a recommended C++11 hashing library that provides a fluent interface for computing hashes and checksums of strings, files,

An open source, portable, easy to use, readable and flexible SSL library

README for Mbed TLS Mbed TLS is a C library that implements cryptographic primitives, X.509 certificate manipulation and the SSL/TLS and DTLS protocol

TLS/SSL and crypto library

Welcome to the OpenSSL Project OpenSSL is a robust, commercial-grade, full-featured Open Source Toolkit for the Transport Layer Security (TLS) protoco

Library and command line tool to detect SHA-1 collision in a file

sha1collisiondetection Library and command line tool to detect SHA-1 collisions in files Copyright 2017 Marc Stevens [email protected] Distributed

Tink is a multi-language, cross-platform, open source library that provides cryptographic APIs that are secure, easy to use correctly, and hard(er) to misuse.

Tink A multi-language, cross-platform library that provides cryptographic APIs that are secure, easy to use correctly, and hard(er) to misuse. Ubuntu

Header-only C++11 library to encode/decode base64, base64url, base32, base32hex and hex (a.k.a. base16) as specified in RFC 4648, plus Crockford's base32. MIT licensed with consistent, flexible API.

cppcodec Header-only C++11 library to encode/decode base64, base64url, base32, base32hex and hex (a.k.a. base16) as specified in RFC 4648, plus Crockf

Comments
  • Compile issue on OSX (building on bioconda)

    Compile issue on OSX (building on bioconda)

    Hi @jermp,

    I'm trying to put piscem up on bioconda, and the linux build works fine. The OSX build, however, gives this error (https://dev.azure.com/bioconda/bioconda-recipes/_build/results?buildId=24063&view=logs&j=1b052f1d-4456-52f0-9d43-71c4c5bd734d&t=edc48dcd-1fc2-5e3b-7036-7be9cea00123&l=597). It seems to be a stricter rule for conversions in clang than g++. Any chance of fixing this upstream?

    Thanks! Rob

    opened by rob-p 4
  • Compile issue in gcc 10.2.1

    Compile issue in gcc 10.2.1

    The following line in util.hpp causes compile error in gcc 10.2.1: #define MAX_BUCKET_SIZE static_cast<bucket_size_type>(100) error message expected unqualified-id before ‘static_cast’

    I found that replacing that line with the following fix the problem: constexpr bucket_size_type MAX_BUCKET_SIZE = 100;

    opened by beling 2
  • distinct_keys doesn't actually remove duplicates

    distinct_keys doesn't actually remove duplicates

    The following code removes duplicates, but doesn't shrink the vector to actually remove the duplicates that std::unique moved to its end:

    https://github.com/jermp/pthash/blob/2607873d28c5467205b9962c89b30680f5b788b4/src/util.hpp#L92

    The line should probably be:

    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    
    opened by Kristine1975 1
Releases(v1.0.1)
Owner
Giulio Ermanno Pibiri
Postdoctoral researcher in Computer Science. His research focuses on data compression, data structures, and code optimization.
Giulio Ermanno Pibiri
Retter - A collection of hash functions, ciphers, tools, libraries, and materials related to cryptography & security

Retter - A collection of hash functions, ciphers, tools, libraries, and materials related to cryptography & security.

Maciej A. Czyzewski 79 Nov 26, 2022
LibSWIFFT - A fast C/C++ library for the SWIFFT secure homomorphic hash function

LibSWIFFT - A fast C/C++ library for the SWIFFT secure homomorphic hash function Official Repository LibSWIFFT is a production-ready C/C++ library pro

Gvili Tech Ltd 23 Oct 23, 2022
A Powerful, Easy-to-Use, Compact, Cross-Platform and Installation-Free Crypto Tool. 一个强大,易用,小巧,跨平台且免安装的加密解密签名工具。

GpgFrontend GpgFrontend is a Powerful, Easy-to-Use, Compact, Cross-Platform, and Installation-Free OpenPGP Crypto Tool. By using GpgFrontend, you can

Saturn&Eric 203 Jan 7, 2023
Implementation and console application of Sha256 hash function.

Sha256 WARNING: This repository was the first version of Sha256, for a newer one check RedLibrary. What is it? This is an implementation and console a

Mr.Red 1 Feb 13, 2022
🔨 Linux Hash Cracker

?? Linux Hash Cracker Technologies • Project • Installing • How to use • Contributing • License ?? Technologies This project was developed with the fo

</Dantalion> 32 Nov 4, 2022
a header-file-only, SHA256 hash generator in C++

PicoSHA2 - a C++ SHA256 hash generator Copyright © 2017 okdshin Introduction PicoSHA2 is a tiny SHA256 hash generator for C++ with following propertie

Shintarou Okada 531 Dec 29, 2022
The Keccak (SHA-3) hash used by Ethereum.

The Keccak (SHA3) digest for Ruby This Ruby extension exposes the Keccak (SHA3) digest C bindings in the non-final version used by Ethereum. It is bas

Afr Schoe 18 Nov 3, 2022
This repository aims to provide an easy-to-use implementation of the Secure Hash Standard as specified in FIPS 180-4

HashLibCpp This repository aims to provide an easy-to-use implementation of the Secure Hash Standard. (currently implemented are SHA224, SHA256 and SH

ADD1609 1 Feb 2, 2022
Mbedcrypto - a portable, small, easy to use and fast c++14 library for cryptography.

mbedcrypto mbedcrypto is a portable, small, easy to use, feature rich and fast c++14 library for cryptography based on fantastic and clean mbedtlsnote

amir zamani 38 Nov 22, 2022