A fast Python Common substrings of multiple strings library with C++ implementation

Overview

GitHub

A fast Python Common substrings of multiple strings library with C++ implementation

Having a bunch of strings, can I print some substrings which appear K times ? Can I know which is longest substrings ?

Installing

Make sure Cython is installed properly !

Building from source:

cd python
python setup.py install

Usuage and examples

Step 1. Import the library

from common_multiple_strings import PyCommon_multiple_strings
tree = PyCommon_multiple_strings()  # init

Step 2. Build the data structure

Build from list of str:

tree.from_strings(list_str)

or build from file:

tree.from_path(<path_to_file>)

It is noted that sentences in file are presentened line-by-line. Each line is a sentence.

Sample code:

>>> from common_multiple_strings import PyCommon_multiple_strings
>>> tree = PyCommon_multiple_strings()  # init
>>> list_str = [
       'abc',
       'abcxa',
       'xamnb',
       'yamnc',
       'abcd'
    ]
>>> tree.from_strings(list_str)

Step 3. Query

This library introduces 4 types of query:

a) List some substrings appear TIMES times:

tree.query(times=TIMES)

Ouput is a dictionary with key is an integer K, value is a list of some substrings which appear exactly K times.

Sample code:

>>> print(tree.query(times=(2, None)))
{2: ['amn', 'xa', 'n', 'mn'], 3: ['bc', 'abc'], 4: ['c', 'b'], 5: ['a']}
>>> print(tree.query(times=(3, 3)))
{3: ['bc', 'abc']}

b) Length of the longest substring appears TIMES times:

tree.length_longest_substring(times=TIMES)

Ouput is an integer.

Sample code:

>>> print(tree.length_longest_substring(times=(2, None)))
3

c) Lengths of the longest substrings appear TIMES times:

tree.lengths_longest_substrings(times=TIMES)

Ouput is a dictionary with key is an integer K, value is the length of the longest substring which appear exactly K times.

Sample code:

>>> print(tree.lengths_longest_substrings(times=(2, None)))
{2: 3, 3: 3, 4: 1, 5: 1}

d) List some substrings which have length of L and appear TIMES times:

tree.filter_substrings_by_length(length_input=L, times=TIMES)

Ouput is a dictionary with key is an integer K, value is a list of some substrings which appear exactly K times and have length of L.

Sample code:

>>> print(tree.filter_substrings_by_length(length_input=2, times=(2, None)))
{2: ['xa', 'mn'], 3: ['bc']}
>>> print(tree.filter_substrings_by_length(length_input=10, times=(2, None)))
{}

Some Notes:

- Params:

times, default (None, None) is a tuple represents the minimum and maximum appearances of the desired output.

Query substrings appear exactly N times, then times=(N, N).

Query substrings appear more than N times, then times=(N, None).

Query substrings appear less than N times, then times=(None, N).

Query substrings appear less than N times and more than M times, then times=(M, N).

length_input is an integer, represents the length of substrings

- This library accepts various utf8 characters including punctuations, numbers, upper-case characters, ... For more details, checkout alphabet.cpp file. Thanks coccoc-tokenizer for providing these available lists

- Query a) and d) does not output all possible results, but you can use it for analytic purposes.

Algorithm and Complexity

Data structure suffix tree is the core of this library. It encourages fast query and efficient storing.

If the total length of all input strings are L, then the average complexity should be L * log(L).

References

Algorithm from the lecture of String Algorithms and Algrorithms in Computational Biology - Gusfield

You might also like...
C++ implementation of a fast hash map and hash set using robin hood hashing

A C++ implementation of a fast hash map and hash set using robin hood hashing The robin-map library is a C++ implementation of a fast hash map and has

Algo-Tree is a collection of Algorithms and data structures which are fundamentals to efficient code and good software design. Creating and designing excellent algorithms is required for being an exemplary programmer. It contains solutions in various languages such as C++, Python and Java. Generic parse tree, configurable lexer, `lemon` parser generator, wrapped for C++17 and Python 3.
Generic parse tree, configurable lexer, `lemon` parser generator, wrapped for C++17 and Python 3.

This project augments the Lemon parser generator with a high-level parse tree interface, grammar action DSL, and an integrated, configurable lexer allowing the creation of an entire standalone, object-oriented parser from a single input grammar file. The entire parser is written in native C/C++, and the parser interface is made comfortably available to both C++ and Python3 applications.

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

algos - Leetcode Algorithms with Python and C++

algos - Leetcode Algorithms with Python and C++ Dependencies (C++) C++ Compiler ( g++ or clang++ ) (C++) cmake (optional but recommended) (Python) Pyt

C and Python examples from my book on using PETSc to solve  PDEs
C and Python examples from my book on using PETSc to solve PDEs

p4pdes PETSc for Partial Differential Equations is a new book on using PETSc to solve partial differential equations by modern numerical methods. Orde

Simple and fast configuration file library (written in C99)

Features Configuration file reading Supported operating systems Ubuntu MacOS Windows Build requirements C99 compiler CMake 3.10+ Cloning git clone htt

A fast and efficient non-iterating hashmap library

libhash: a fast and efficient non-iterating hashmap library Libhash is a fast and efficient non-iterating hashmap library Usage Usage is easy and simp

Owner
Đào Nguyên Dương
Đào Nguyên Dương try best @daokiem ~~Rời xa đội tuyển là Bão Tố~~
Đào Nguyên Dương
Experimental managed C-strings library

Stricks Managed C strings library. ?? API Why ? Because handling C strings is tedious and error-prone. Appending while keeping track of length, null-t

Francois Alcover 88 Oct 10, 2022
C++ library to manage strings with different encodings

StringSuite C++ library to manage strings and (almost) any kind of encoded data. License Encmetric is written under the GNU Lesser General Public Lice

Paolo 3 Feb 12, 2022
StringCheese is a CTF tool to solve easy challenges automatically in many cases where a strings | grep is just not enough

StringCheese StringCheese is a script written in Python to extract CTF flags (or any other pattern with a prefix) automatically. It works like a simpl

Mathis HAMMEL 61 Nov 28, 2022
➿ mulle-c-string-escape turns data into C-strings

mulle-c-string-escape ➿ mulle-c-string-escape turns data into C-strings Non-ASCII characters will be escaped to hex or octal. C-escapes are used for k

Nat! 9 Apr 11, 2022
🔗 Common Data Structures and Algorithms

?? Data Structures and Algorithms This library provides common data structures. It will also provide some data structures which needed in render or ga

Recep Aslantas 43 Oct 13, 2022
HashTableBenchmark - A simple cross-platform speed & memory-efficiency benchmark for the most common hash-table implementations in the C++ world

Hash-Tables Benchmarks This repository contains a bunch of extendable benchmarks mostly for following containers: std:unordered_map from STL. google::

Unum 9 Nov 20, 2022
A collection of multiple types of lists used during pentesting, collected in one place. List types include usernames, passwords, combos, wordlist and may more..

Access list is a collection of multiple types of lists used during pentesting, collected in one place, created by Undercode This list include a collec

UNDERCODE UTILITIES 10 Nov 21, 2022
tail multiple files at once using inotify

tailer - tail entire directories at once using inotify This utility is based on Erik Zenker's inotify-cpp library for monitoring file create/modify/re

Ray Burgemeestre 2 Nov 22, 2021
Dry-comparisons - C++17 Utility classes for comparing multiple values in one simple expression

dry-comparisons This is the source code for the any_of, all_of and none_of C++17 utility mentioned in the blog post "DRY multicomparisons" Public doma

Björn Fahller 200 Nov 25, 2022
C++ implementation of a fast hash map and hash set using hopscotch hashing

A C++ implementation of a fast hash map and hash set using hopscotch hashing The hopscotch-map library is a C++ implementation of a fast hash map and

Thibaut Goetghebuer-Planchon 573 Nov 26, 2022