The approximate regex matching library and agrep command line tool.

Overview

Introduction

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression.

In other words, the time complexity of the algorithm is O(M^2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.

Hacking

Here's how to work with this code.

Prerequisites

You will need the following tools installed on your system:

  • autoconf
  • automake
  • gettext
  • libtool
  • zip (optional)

Building

First, prepare the tre. Change to the root of the source directory and run

./utils/autogen.sh

This will regenerate various things using the prerequisite tools so that you end up with a buildable tree.

After this, you can run the configure script and build TRE as usual:

./configure
make
make check
make install

Building a source code package

In a prepared tree, this command creates a source code tarball:

./configure && make dist

Alternatively, you can run

./utils/build-sources.sh

which builds the source code packages and puts them in the dist subdirectory. This script needs a working zip command.

Features

TRE is not just yet another regexp matcher. TRE has some features which are not there in most free POSIX compatible implementations. Most of these features are not present in non-free implementations either, for that matter.

Approximate matching

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match.

Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

TRE includes a version of the agrep (approximate grep) command line tool for approximate regexp matching in the style of grep. Unlike other agrep implementations (like the one by Sun Wu and Udi Manber from University of Arizona) TRE agrep allows full regexps of any length, any number of errors, and non-uniform costs for insertion, deletion and substitution.

Strict standard conformance

POSIX defines the behaviour of regexp functions precisely. TRE attempts to conform to these specifications as strictly as possible. TRE always returns the correct matches for subpatterns, for example. Very few other implementations do this correctly. In fact, the only other implementations besides TRE that I am aware of (free or not) that get it right are Rx by Tom Lord, Regex++ by John Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy.

The standard TRE tries to conform to is the IEEE Std 1003.1-2001, or Open Group Base Specifications Issue 6, commonly referred to as "POSIX". It can be found online here. The relevant parts are the base specifications on regular expressions (and the rationale) and the description of the regcomp() API.

For an excellent survey on POSIX regexp matchers, see the testregex pages by Glenn Fowler of AT&T Labs Research.

Predictable matching speed

Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.

Predictable and modest memory consumption

A regexec() call never allocates memory from the heap. TRE allocates all the memory it needs during a regcomp() call, and some temporary working space from the stack frame for the duration of the regexec() call. The amount of temporary space needed is constant during matching and does not depend on the searched string. For regexps of reasonable size TRE needs less than 50K of dynamically allocated memory during the regcomp() call, less than 20K for the compiled pattern buffer, and less than two kilobytes of temporary working space from the stack frame during a regexec() call. There is no time/memory tradeoff. TRE is also small in code size; statically linking with TRE increases the executable size less than 30K (gcc-3.2, x86, GNU/Linux).

Wide character and multibyte character set support

TRE supports multibyte character sets. This makes it possible to use regexps seamlessly with, for example, Japanese locales. TRE also provides a wide character API.

Binary pattern and data support

TRE provides APIs which allow binary zero characters both in regexps and searched strings. The standard API cannot be easily used to, for example, search for printable words from binary data (although it is possible with some hacking). Searching for patterns which contain binary zeroes embedded is not possible at all with the standard API.

Completely thread safe

TRE is completely thread safe. All the exported functions are re-entrant, and a single compiled regexp object can be used simultaneously in multiple contexts; e.g. in main() and a signal handler, or in many threads of a multithreaded application.

Portable

TRE is portable across multiple platforms. Here's a table of platforms and compilers that have been successfully used to compile and run TRE:

Platform(s) Compiler(s)
AIX 4.3.2 - 5.3.0 GCC, C for AIX compiler version 5
Compaq Tru64 UNIX V5.1A/B Compaq C V6.4-014 - V6.5-011
Cygwin 1.3 - 1.5 GCC
Digital UNIX V4.0 DEC C V5.9-005
FreeBSD 4 and above GCC
GNU/Linux systems on x86, x86_64, ppc64, s390 GCC
HP-UX 10.20- 11.00 GCC, HP C Compiler
IRIX 6.5 GCC, MIPSpro Compilers 7.3.1.3m
Max OS X
NetBSD 1.5 and above GCC, egcs
OpenBSD 3.3 and above GCC
Solaris 2.7-10 sparc/x86 GCC, Sun Workshop 6 compilers
Windows 98 - XP Microsoft Visual C++ 6.0

TRE 0.7.5 should compile without changes on all of the above platforms. Tell me if you are using TRE on a platform that is not listed above, and I'll add it to the list. Also let me know if TRE does not work on a listed platform.

Depending on the platform, you may need to install libutf8 to get wide character and multibyte character set support.

Free

TRE is released under a license which is essentially the same as the "2 clause" BSD-style license used in NetBSD. See the file LICENSE for details.

Roadmap

There are currently two features, both related to collating elements, missing from 100% POSIX compliance. These are:

  • Support for collating elements (e.g. [[.<X>.]], where <X> is a collating element). It is not possible to support multi-character collating elements portably, since POSIX does not define a way to determine whether a character sequence is a multi-character collating element or not.

  • Support for equivalence classes, for example [[=<X>=]], where <X> is a collating element. An equivalence class matches any character which has the same primary collation weight as <X>. Again, POSIX provides no portable mechanism for determining the primary collation weight of a collating element.

Note that other portable regexp implementations don't support collating elements either. The single exception is Regex++, which comes with its own database for collating elements for different locales. Support for collating elements and equivalence classes has not been widely requested and is not very high on the TODO list at the moment.

These are other features I'm planning to implement real soon now:

  • All the missing GNU extensions enabled in GNU regex, such as [[:<:]] and [[:>:]]

  • A REG_SHORTEST regexec() flag for returning the shortest match instead of the longest match.

  • Perl-compatible syntax

    • [:^class:]
    • Matches anything but the characters in class. Note that
    • [^[:class:]] works already, this would be just a
    • convenience shorthand.
    • \A
    • Match only at beginning of string
    • \Z
    • Match only at end of string, or before newline at the end
    • \z
    • Match only at end of string
    • \l
    • Lowercase next char (think vi)
    • \u
    • Uppercase next char (think vi)
    • \L
    • Lowercase till \E (think vi)
    • \U
    • Uppercase till \E (think vi)
    • (?=pattern)
    • Zero-width positive look-ahead assertions.
    • (?!pattern)
    • Zero-width negative look-ahead assertions.
    • (?<=pattern)
    • Zero-width positive look-behind assertions.
    • (?<!pattern)
    • Zero-width negative look-behind assertions.

Documentation especially for the nonstandard features of TRE, such as approximate matching, is a work in progress (with "progress" loosely defined...)

Comments
  • Long pattern strings

    Long pattern strings

    This pull request adds two configure options via environment variables:

    • TRE_EXPRESSION_STACK_SIZE_IN_BYTES
    • TRE_EXPRESSION_RING_BUFFER_SIZE_IN_BYTES

    This allows the user to compile tre with much large stack and ring buffer sizes, if needed to parse very large pattern strings.

    The expression ring buffer is still statically allocated.

    A PACKAGING.md file is also added with instructions on how to build a distribution package.

    (I'm using tre in a project that needs fuzzy matching with very large pattern strings - hence this changeset.)

    opened by adamfeuer 5
  • MinGW and WChar_t

    MinGW and WChar_t

    Hi, I've compiled the TRE with MinGw and enabled wchar_t. It worked with English words. I've tested it with Persian language but did not work. Is the problem related to MinGw ? How can I fix it ?

    opened by Behrouz-m 1
  • Fix Windows compilation errors

    Fix Windows compilation errors

    • On Windows, Visual Studio 2008 Express Edition fails to find tre-config.h. This pull request provides a copy of it in the appropriate location.
    • On Windows, there's no setup.py file; only the setup.py.in template is provided. The pull request provides an template-replaced version of setup.py.
    • On Windows, tre.dll is placed in C;\Pythonxy along with a warning. This pull request places it in the site-packages directory alongside of tre.pyd.
    opened by bjones1 1
  • missing install-sh?

    missing install-sh?

    [email protected]:~/cvs/fossil/cwal/th1ish/tre$ ./configure --prefix=$HOME configure: error: cannot find install-sh, install.sh, or shtool in utils "."/utils

    opened by sgbeal 1
  • Update for Markdown formatting

    Update for Markdown formatting

    The <X> was rendered invisible.

    (It could be set as inline code (like above) instead of "backslashed", but I wanted to keep your choice of formatting.)

    opened by ghost 0
  • Trivial typo fix

    Trivial typo fix

    Comment had an ISO-8859-1 encoded non-breaking space, probably unintentional. Replacing it with a regular space makes this an ASCII file (arguably preferred).

    opened by mvkorpel 0
  • Fix compilation when TRE_USE_SYSTEM_REGEX

    Fix compilation when TRE_USE_SYSTEM_REGEX

    tre.h was changed to define REG_USEBYTES if TRE_USE_SYSTEM_REGEX was undefined which broke the compilation step when it was defined. This fixes the compilation by defining REG_USEBYTES.

    Error message:

    make  all-recursive
    make[1]: Entering directory `/usr/share/media/src/git/tre'
    Making all in lib
    make[2]: Entering directory `/usr/share/media/src/git/tre/lib'
    make  all-am
    make[3]: Entering directory `/usr/share/media/src/git/tre/lib'
    /bin/sh ../libtool --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I..     -g -O2 -MT tre-ast.lo -MD -MP -MF .deps/tre-ast.Tpo -c -o tre-ast.lo tre-ast.c
    libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I.. -g -O2 -MT tre-ast.lo -MD -MP -MF .deps/tre-ast.Tpo -c tre-ast.c  -fPIC -DPIC -o .libs/tre-ast.o
    mv -f .deps/tre-ast.Tpo .deps/tre-ast.Plo
    /bin/sh ../libtool --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I..     -g -O2 -MT tre-compile.lo -MD -MP -MF .deps/tre-compile.Tpo -c -o tre-compile.lo tre-compile.c
    libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I.. -g -O2 -MT tre-compile.lo -MD -MP -MF .deps/tre-compile.Tpo -c tre-compile.c  -fPIC -DPIC -o .libs/tre-compile.o
    tre-compile.c: In function 'tre_compile':
    tre-compile.c:1894:33: error: 'REG_USEBYTES' undeclared (first use in this function)
       parse_ctx.cur_max = (cflags & REG_USEBYTES) ? 1 : TRE_MB_CUR_MAX;
                                     ^
    tre-compile.c:1894:33: note: each undeclared identifier is reported only once for each function it appears in
    make[3]: *** [tre-compile.lo] Error 1
    make[3]: Leaving directory `/usr/share/media/src/git/tre/lib'
    make[2]: *** [all] Error 2
    make[2]: Leaving directory `/usr/share/media/src/git/tre/lib'
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory `/usr/share/media/src/git/tre'
    make: *** [all] Error 2
    
    opened by h3xx 0
  • Squash MSVS 2012 warnings.

    Squash MSVS 2012 warnings.

    Hello,

    If you would like to compile TRE with the newest MSVC compiler without warnings, have a look at d0426f8, please. The vcbuild branch includes header and project files to build the static library with MSVS.

    --- Ferda

    opened by prantlf 0
  • Remove broken agrep test entry (fails with bash >= 5.2)

    Remove broken agrep test entry (fails with bash >= 5.2)

    It's meant to cause agrep to return with exit code 2, but asserts that it's exit code 1 instead.

    It's meant to ensure that using "." as pattern results in exit code 2 because it matches also an empty string. However, glob expansion results in "." picking up files such as "." and ".." from the CWD, which get interpreted as valid pattern. This results in exit status 1 (no match found) which is what the .ok file expects, but that's invalid.

    With bash 5.2, glob expansion no longer matches "." and ".." by default, so the test works as intended by accident, causing a mismatch with the expected wrong exit code.

    It's unfortunately not easily possible to avoid glob expansion in this case.

    Just remove the test for now.

    opened by Vogtinator 0
  • Create tags and releases

    Create tags and releases

    0.8.0 was released in 2009 but there are no tags or releases for that or any preceding version in this repository. Please create them and attach the original distfiles to the releases. Preserving history is important.

    opened by ryandesign 0
  • Partial zh_CN translation + README fix

    Partial zh_CN translation + README fix

    This PR contains a partial zh_CN translation, in that the error messages are done but not the longer help messages. I also fixed up the README by a bit, so the last section about what PCRE extensions to include make more sense. I also added a sensence pointing people to the header if they want to see what functions provide the extensions, as I don't think I have the time to edit the HTML docs yet.

    opened by Artoria2e5 0
  • Remove duplicate public headers

    Remove duplicate public headers

    The current way tre.h is duplicated greatly confuses maintainers. Debian, for example, mistakenly packages the includes/tre/tre.h instead of lib/tre.h, resulting in an inability for users to use the R additions from #14. In addition, a mismatching tre-config.h delays compile errors to link time, which is also bad.

    This commit moves lib/tre.h to the public header location, which 277d66ced2ed70f2ede1bc8773f103218a51effd implies is the preferred place.

    The changes build on a Debian buster WSL1 environment and installs the correct headers. Visual Studio 2019 wants to upgrade the vcxproj files, so I did not verify them.

    opened by Artoria2e5 0
  • PlayStation 3 build support

    PlayStation 3 build support

    Here's a quick PS3 port of TRE (library) 😄

    Requirements:

    • open-source ps3 toolchain
    • open-source PSL1GHT SDK

    Build & install:

    make -f ps3/Makefile install
    

    (I wasn't sure if I should update the Readme.md with instructions or not)

    opened by bucanero 0
  • Certificate expired on https://laurikari.net

    Certificate expired on https://laurikari.net

    Hello! Could you please renew the certificate on https://laurikari.net/tre/ ? It has expired a couple weeks ago:

    Your connection is not private Attackers might be trying to steal your information from laurikari.net (for example, passwords, messages, or credit cards). Learn more NET::ERR_CERT_DATE_INVALID

    Thank you! Dmitry

    opened by DmitryFromDubna 3
Owner
Ville Laurikari
CTO at Zefort
Ville Laurikari
A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams

Name libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams Table of Contents Name Statu

OpenResty 600 Dec 22, 2022
Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Takayuki MATSUOKA 3 Oct 27, 2021
Easier CPP interface to PCRE regex engine with global match and replace

RegExp Easier CPP interface to PCRE regex engine with global match and replace. I was looking around for better regex engine than regcomp for my C/C++

Yasser Asmi 5 May 21, 2022
High-performance regular expression matching library

Hyperscan Hyperscan is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre libra

Intel Corporation 4k Jan 1, 2023
A small implementation of regular expression matching engine in C

cregex cregex is a compact implementation of regular expression (regex) matching engine in C. Its design was inspired by Rob Pike's regex-code for the

Jim Huang 72 Dec 6, 2022
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

This is the source code repository for RE2, a regular expression library. For documentation about how to install and use RE2, visit https://github.co

Google 7.5k Jan 4, 2023
regular expression library

Oniguruma https://github.com/kkos/oniguruma Oniguruma is a modern and flexible regular expressions library. It encompasses features from different reg

K.Kosako 1.9k Jan 3, 2023
Perl Incompatible Regular Expressions library

This is PIRE, Perl Incompatible Regular Expressions library. This library is aimed at checking a huge amount of text against relatively many regular

Yandex 320 Oct 9, 2022
Onigmo is a regular expressions library forked from Oniguruma.

Onigmo (Oniguruma-mod) https://github.com/k-takata/Onigmo Onigmo is a regular expressions library forked from Oniguruma. It focuses to support new exp

K.Takata 585 Dec 6, 2022
SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and complex regex expressions.

SRL-CPP SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and co

Telepati 0 Mar 9, 2022
A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams

Name libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams Table of Contents Name Statu

OpenResty 600 Dec 22, 2022
✔️The smallest header-only GUI library(4 KLOC) for all platforms

Welcome to GUI-lite The smallest header-only GUI library (4 KLOC) for all platforms. 中文 Lightweight ✂️ Small: 4,000+ lines of C++ code, zero dependenc

null 6.6k Jan 8, 2023
Yori is a CMD replacement shell that supports backquotes, job control, and improves tab completion, file matching, aliases, command history, and more.

Yori is a CMD replacement shell that supports backquotes, job control, and improves tab completion, file matching, aliases, command history, and more.

Malcolm Smith 1.1k Dec 30, 2022
Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Takayuki MATSUOKA 3 Oct 27, 2021
led is a line-oriented text editor in command line

led is a line-oriented text editor in command line. This editor is similar to the standard program on unix systems - GNU ed. But i'm not going to make an exact clone of that program, it's just a pet project.

Artem Mironov 16 Dec 27, 2022
Easier CPP interface to PCRE regex engine with global match and replace

RegExp Easier CPP interface to PCRE regex engine with global match and replace. I was looking around for better regex engine than regcomp for my C/C++

Yasser Asmi 5 May 21, 2022
The Approximate Library is a WiFi Arduino library for building proximate interactions between your Internet of Things and the ESP8266 or ESP32

The Approximate Library The Approximate library is a WiFi Arduino Library for building proximate interactions between your Internet of Things and the

David Chatting 102 Dec 7, 2022
LZFSE compression library and command line tool

LZFSE This is a reference C implementation of the LZFSE compressor introduced in the Compression library with OS X 10.11 and iOS 9. LZFSE is a Lempel-

null 1.7k Jan 4, 2023
Library and command line tool to detect SHA-1 collision in a file

sha1collisiondetection Library and command line tool to detect SHA-1 collisions in files Copyright 2017 Marc Stevens [email protected] Distributed

Marc Stevens 1.2k Dec 29, 2022
Warp speed Data Transfer (WDT) is an embeddedable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths.

WDT Warp speed Data Transfer Design philosophy/Overview Goal: Lowest possible total transfer time - to be only hardware limited (disc or network bandw

Facebook 2.7k Dec 31, 2022