SSHash

A compressed, associative, exact dictionary for k-mers.

Overview

This is a compressed dictionary data structure for k-mers (strings of length k over the DNA alphabet {A,C,G,T}), based on Sparse and Skew Hashing.

A (pre-print) paper describing the data structure will appear soon.

For a dictionary of n k-mers, two basic queries are supported:

  • i = Lookup(g), where i is in [0,n) if the k-mer g is found in the dictionary or i = -1 otherwise;
  • g = Access(i), where g is the k-mer associated to the identifier i.

A membership query (determining whether a given k-mer is present in the dictionary) is therefore supported by means of the lookup query. The dictionary can also stream through all k-mers of a given DNA file (.fasta or .fastq format) to determine their membership in the dictionary.

NOTE: The Lookup query treats two k-mers that are the reverse complement of each other as the same k-mer.

Compiling the Code

The code is tested on Linux with gcc and on Mac with clang. To build the code, CMake is required.

Clone the repository with

git clone --recursive https://github.com/jermp/sshash.git

If you have cloned the repository without --recursive, you will need to perform the following commands before compiling:

git submodule init
git submodule update

To compile the code for a release environment (see file CMakeLists.txt for the used compilation flags), it is sufficient to do the following:

mkdir build
cd build
cmake ..
make -j

For a testing environment, use the following instead:

mkdir debug_build
cd debug_build
cmake .. -D CMAKE_BUILD_TYPE=Debug -D SSHASH_USE_SANITIZERS=On
make

Dependencies

The repository has minimal dependencies: it only uses the PTHash library (for minimal perfect hashing), and zlib to read gzip-compressed streams.

To automatically pull the PTHash dependency, just clone the repo with --recursive as explained in Compiling the Code.

If you do not have zlib installed, you can do

sudo apt-get install zlib1g

if you are on Linux/Ubuntu, or

brew install zlib

if you have a Mac.

Build a Dictionary

The driver program called build can be used to build a dictionary.

From within the directory where the code was compiled (see the section Compiling the Code), run the command:

./build --help

to show the usage of the driver program (reported below for convenience).

Usage: ./build [-h,--help] input_filename k m [-s seed] [-n max_num_kmers] [-l l] [-c c] [--canonical-parsing] [-o output_filename] [--check] [--bench] [--verbose]

 input_filename
	Must be a FASTA file (.fa/.fasta extension), gzip-compressed (.gz) or not:
	- without duplicate nor invalid k-mers
	- one DNA sequence per line.
	For example, it could be the de Bruijn graph topology output by BCALM.

 k
	K-mer length.

 m
	Minimizer length (must be <= k).

 [-s seed]
	Seed for construction (default is 1).

 [-n max_num_kmers]
	Build the dictionary from at most this number of k-mers.

 [-l l]
	An (integer) constant that controls the space/time trade-off of the dictionary. A reasonable value lies between 2 and 12 (default is 6).

 [-c c]
	A (floating point) constant that trades construction speed for space effectiveness of minimal perfect hashing. A reasonable value lies between 3.0 and 10.0 (default is 3.000000).

 [--canonical-parsing]
	Canonical parsing of k-mers. This option changes the parsing and results in a trade-off between index space and lookup time.

 [-o output_filename]
	Output file name where the data structure will be serialized.

 [--check]
	Check correctness after construction.

 [--bench]
	Run benchmark after construction.

 [--verbose]
	Verbose output during construction.

 [-h,--help]
	Print this help text and silently exit.

Examples

For the examples, we are going to use some collections of stitched unitigs from the directory ../data/unitigs_stitched. These collections were built for k = 31, so dictionaries should be built with k = 31 as well to ensure correctness.

In the section Input Files, we explain how such collections of stitched unitigs can be obtained from raw FASTA files.

Example 1

./build ../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz 31 13 --check --bench -o salmonella_enterica.index

This example builds a dictionary for the k-mers read from the file ../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz, with k = 31 and m = 13. It also checks the correctness of the dictionary (--check option), runs a performance benchmark (--bench option), and serializes the index on disk to the file salmonella_enterica.index.

To run a performance benchmark after construction of the index, use:

./bench salmonella_enterica.index

Example 2

./build ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz 31 15 -l 2 -o salmonella_100.index

This example builds a dictionary from the input file ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz (a pangenome consisting of 100 genomes of Salmonella enterica), with k = 31, m = 15, and l = 2. It also serializes the index on disk to the file salmonella_100.index.

To perform some streaming membership queries, use:

./query salmonella_100.index ../data/queries/SRR5833294.10K.fastq.gz

if your queries are meant to be read from a FASTQ file, or

./query salmonella_100.index ../data/queries/salmonella_enterica.fasta.gz --multiline

if your queries are to be read from a (multi-line) FASTA file.

Input Files

SSHash is meant to index k-mers from collections that do not contain duplicates nor invalid k-mers (strings containing symbols different from {A,C,G,T}). These collections can be obtained, for example, by extracting the maximal unitigs of a de Bruijn graph.

This is easy to do with the tool BCALM2, which builds a compacted de Bruijn graph and outputs its maximal unitigs. From the output of BCALM2, we can then stitch (i.e., glue) some unitigs to reduce the number of nucleotides. The stitching process is carried out with the UST tool.

Below we provide a complete example (assuming both BCALM2 and UST are installed correctly) that downloads the Human (GRCh38) Chromosome 13 and extracts the maximal stitched unitigs for k = 31.

mkdir DNA_datasets
wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -O DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
~/bcalm/build/bcalm -in ~/DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -kmer-size 31 -abundance-min 1 -nb-cores 8
~/UST/ust -k 31 -i ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa
gzip Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa.ust.fa
rm ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa

See also the script scripts/download_and_preprocess_datasets.sh for precise arguments.

Comments
  • SSHash as a C++ API / library

    Hello @jermp: thank you for designing this awesome data structure! We at @COMBINE-lab (tag: @rob-p) are finding SSHash very useful.

    We are interested in using SSHash from within C++ codes, and were wondering if you could provide some C++ API / header for that. We are, for now, interested in the following basic functionalities. I'm using the "unitig" terminology in the following, but the input sequence types (i.e. unitigs / maximal paths) does not matter.

    Given a k-mer x, find its:

    • hash value determined by SSHash, i.e. lookup(x)
    • unitig ID, i.e. unitig(x)
    • offset within the unitig that contains it, i.e. unitig_offset(x)

    Given a unitig ID u, find its:

    • size, i.e. size(u)
    • list of neighbor unitigs, neighbors(u): this likely consists of a query of 2 * 4 (canonical) k-mers.

    We hope that these might be feasible for you to provide.

    Thanks again!

    question discussion 
    opened by jamshed 42
  • error with 64-bit hash code

    hello,

    i have this error :

    terminate called after throwing an instance of 'std::runtime_error' what(): Using 64-bit hash codes with more than 2^30 keys can be dangerous due to collisions: use 128-bit hash codes instead.

    and i don't find how i can use a 128-bit hash codes when building

    opened by PandorasRed 17
  • External memory construction seems too high

    Hi @jermp,

    I'm trying out the new external memory construction feature of sshash (super-exciting!). I've noticed that the memory usage seems surprisingly high. For example, I ran on the compacted dBG (not the UST) of the human genome with:

    /usr/bin/time ./build  unitigs.fa 31 19 -d /scratch/shash_tmp -o grch38_k31_seg.sshash --verbose
    

    and I am getting ~12G of memory usage. The time is ~16m, but I'd expect this given it's running on spinning rust rather than SSD.

    2316.51user 47.52system 16:27.23elapsed 239%CPU (0avgtext+0avgdata 12690096maxresident)k
    8155944inputs+50695376outputs (1major+8309647minor)pagefaults 0swaps
    

    Any idea why the memory usage might be like this? I noticed that your initial mention was that it was ~4G for the human genome with k=31. Is it because of my minimizer size? When I use a minimizer of 15, this goes down to ~9G, which is better, but still larger than I'd have anticipated. Any thoughts on what might be going on here?

    Thanks! Rob

    question discussion 
    opened by rob-p 11
  • Segmentation fault

    Hi, I'm getting this Segmentation fault. Am I doing something stupid or is it a bug?

    ./build/build test.fa 31 13 -o ./index/out.sshash
    
    k = 31, m = 13, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
    reading file 'test.fa'...
    m_buffer_size 29411764
    sorting buffer...
    saving to file './sshash.tmp.run_1653987257158295118.minimizers.0.bin'...
    read 1 sequences, 63 bases, 33 kmers
    num_kmers 33
    num_super_kmers 4
    num_pieces 2 (+3.63636 [bits/kmer])
    === step 1: 'parse_file' 7.9e-05 [sec] (2393.94 [ns/kmer])
    Segmentation fault
    

    This is the file test.fna (I had to change the extension to .txt for github because github did not allow .fa file extension).

    test.txt

    opened by jnalanko 6
  • Error: inlining failed in call to ‘always_inline’

    I tried to compile the package to give it a try but met a problem at compiling:

    /usr/lib/gcc/x86_64-linux-gnu/11/include/popcntintrin.h: In function ‘uint64_t pthash::util::popcount(uint64_t)’:
    /usr/lib/gcc/x86_64-linux-gnu/11/include/popcntintrin.h:42:1: error: inlining failed in call to ‘always_inline’ ‘long long int _mm_popcnt_u64(long long unsigned int)’: target specific option mismatch
       42 | _mm_popcnt_u64 (unsigned long long __X)
          | ^~~~~~~~~~~~~~
    

    Googled for a while and found out the the problem is related to the -mpopcnt option, but I am not familiar with the cmake and the _mm_popcnt_u64() function.
    My system is Linux 5.13.0-24-generic #24-Ubuntu-21.10 I appreciate any help to resolve this issue.

    opened by yifangt 6
  • No queries being run

    Hi again,

    I'm trying to run queries on a small sshash index, but the program reports that zero queries were ran. What could be the issue?

    [email protected]:~/code/SBWT_experiments$ ./sshash/build/query index.sshash test.fna
    2022-06-01 19:02:46: loading index from file 'index.sshash'...
    index size: 2.13636 [MB] (7.09913 [bits/kmer])
    2022-06-01 19:02:46: performing queries from file 'test.fna'...
    2022-06-01 19:02:46: DONE
    ==== query report:
    num_kmers = 0
    num_valid_kmers = 0 (-nan% of kmers)
    num_positive_kmers = 0 (-nan% of valid kmers)
    num_searches = 140332537301184/0 (inf%)
    num_extensions = 2/0 (inf%)
    elapsed = 0.005 millisec / 5e-06 sec / 8.33333e-08 min / inf ns/kmer
    

    The index and the queries are attached here. I'm not attaching the original sequences because they are over 30GB. data.zip

    opened by jnalanko 5
  • Default constructor to the iterator

    Hey, I am using sshash as submodule in my project. It will make my life much easier if you added an empty constructor for dictionary::iterator. I am getting this error "error: no matching function for call to ‘sshash::dictionary::iterator::iterator()’". I am not using the constructor directly but I am keeping an iterator as a member in my classes.

    Thanks, Moustafa

    opened by shokrof 5
  • question about input data

    hello,

    i'm trying your software to see memory consumption of it for a metagenomics dataset i have, the problem is that i can't use bcalm to generate unitig for it(https://github.com/GATB/bcalm/issues/71) due to the big number of kmer, have you any software recomandation to transform a lot of read into a input format that work with your software?

    question 
    opened by PandorasRed 2
  • Permute and reverse-complement the strings to minimize the number of abundance runs

This command re-orders (and possibly reverse-complements) the strings in the collection to minimize the number of runs in the abundances and, hence, optimize the encoding of the abundances. This is achieved via the program ./permute which, in turn, uses the builder/cover.hpp class to build the permutation.

    opened by jermp 0
Releases (latest: v3.0.0)
  • v3.0.0(Jul 1, 2022)

    This release of the library features a restructured public API for the dictionary and its supported queries.

    • "Advanced" lookup queries now include, besides the usual absolute kmer_id: contig information (the contig_id and contig_size of the contig where the k-mer lies), the relative (within the contig) identifier of the k-mer (named kmer_id_in_contig), and the orientation of the k-mer in the contig. For any positive query, 0 <= kmer_id_in_contig < contig_size holds.
    • Streaming queries are now general, not just streaming membership queries as they were in the previous releases, and return advanced lookup information by default.
    • Support for Navigational queries has been added. Given a k-mer g[1..k], a navigational query determines if g[2..k]+x is present (forward neighbourhood) and if x+g[1..k-1] is present (backward neighbourhood) in the dictionary, for x = A, C, G, T ('+' here means string concatenation). If a contig identifier is specified for a navigational query (rather than a k-mer), then the backward neighbourhood of the first k-mer and the forward neighbourhood of the last k-mer in the contig are returned.
    Source code(tar.gz)
    Source code(zip)
  • v2.1.0(May 25, 2022)

  • v2.0.0(May 24, 2022)

No major changes compared to the previous version (other than renaming of variables for consistency with the papers), but we removed a (useless) serialized 4-byte integer from skew_index, so previous index binary files are not compatible with this library release.

    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Apr 2, 2022)

This release adds a new tool called permute that re-orders (and possibly reverse-complements) the strings in an input (weighted) collection to minimize the number of runs in the abundances and, hence, optimize the encoding of the abundances.

    The abundances are encoded in O(r) space on top of the space for a SSHash dictionary, where r is the number of runs (i.e., maximal substrings formed by a single abundance value) in the abundances. The i-th abundance in the sequence, corresponding to the k-mer of identifier i, is retrieved in O(log r) time.

    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Mar 1, 2022)

    This release adds a new feature: compressed abundances. The SSHash dictionary now can also store the abundances in highly compressed space.

    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Mar 1, 2022)

Owner
Giulio Ermanno Pibiri
Postdoctoral researcher in Computer Science. His research focuses on data compression, data structures, and code optimization.