The approximate regex matching library and agrep command line tool.

Related tags

tre
Overview

Introduction

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression.

In other words, the time complexity of the algorithm is O(M^2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.

Hacking

Here's how to work with this code.

Prerequisites

You will need the following tools installed on your system:

  • autoconf
  • automake
  • gettext
  • libtool
  • zip (optional)

Building

First, prepare the tre. Change to the root of the source directory and run

./utils/autogen.sh

This will regenerate various things using the prerequisite tools so that you end up with a buildable tree.

After this, you can run the configure script and build TRE as usual:

./configure
make
make check
make install

Building a source code package

In a prepared tree, this command creates a source code tarball:

./configure && make dist

Alternatively, you can run

./utils/build-sources.sh

which builds the source code packages and puts them in the dist subdirectory. This script needs a working zip command.

Features

TRE is not just yet another regexp matcher. TRE has some features which are not there in most free POSIX compatible implementations. Most of these features are not present in non-free implementations either, for that matter.

Approximate matching

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match.

Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

TRE includes a version of the agrep (approximate grep) command line tool for approximate regexp matching in the style of grep. Unlike other agrep implementations (like the one by Sun Wu and Udi Manber from University of Arizona) TRE agrep allows full regexps of any length, any number of errors, and non-uniform costs for insertion, deletion and substitution.

Strict standard conformance

POSIX defines the behaviour of regexp functions precisely. TRE attempts to conform to these specifications as strictly as possible. TRE always returns the correct matches for subpatterns, for example. Very few other implementations do this correctly. In fact, the only other implementations besides TRE that I am aware of (free or not) that get it right are Rx by Tom Lord, Regex++ by John Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy.

The standard TRE tries to conform to is the IEEE Std 1003.1-2001, or Open Group Base Specifications Issue 6, commonly referred to as "POSIX". It can be found online here. The relevant parts are the base specifications on regular expressions (and the rationale) and the description of the regcomp() API.

For an excellent survey on POSIX regexp matchers, see the testregex pages by Glenn Fowler of AT&T Labs Research.

Predictable matching speed

Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.

Predictable and modest memory consumption

A regexec() call never allocates memory from the heap. TRE allocates all the memory it needs during a regcomp() call, and some temporary working space from the stack frame for the duration of the regexec() call. The amount of temporary space needed is constant during matching and does not depend on the searched string. For regexps of reasonable size TRE needs less than 50K of dynamically allocated memory during the regcomp() call, less than 20K for the compiled pattern buffer, and less than two kilobytes of temporary working space from the stack frame during a regexec() call. There is no time/memory tradeoff. TRE is also small in code size; statically linking with TRE increases the executable size less than 30K (gcc-3.2, x86, GNU/Linux).

Wide character and multibyte character set support

TRE supports multibyte character sets. This makes it possible to use regexps seamlessly with, for example, Japanese locales. TRE also provides a wide character API.

Binary pattern and data support

TRE provides APIs which allow binary zero characters both in regexps and searched strings. The standard API cannot be easily used to, for example, search for printable words from binary data (although it is possible with some hacking). Searching for patterns which contain binary zeroes embedded is not possible at all with the standard API.

Completely thread safe

TRE is completely thread safe. All the exported functions are re-entrant, and a single compiled regexp object can be used simultaneously in multiple contexts; e.g. in main() and a signal handler, or in many threads of a multithreaded application.

Portable

TRE is portable across multiple platforms. Here's a table of platforms and compilers that have been successfully used to compile and run TRE:

Platform(s) Compiler(s)
AIX 4.3.2 - 5.3.0 GCC, C for AIX compiler version 5
Compaq Tru64 UNIX V5.1A/B Compaq C V6.4-014 - V6.5-011
Cygwin 1.3 - 1.5 GCC
Digital UNIX V4.0 DEC C V5.9-005
FreeBSD 4 and above GCC
GNU/Linux systems on x86, x86_64, ppc64, s390 GCC
HP-UX 10.20- 11.00 GCC, HP C Compiler
IRIX 6.5 GCC, MIPSpro Compilers 7.3.1.3m
Max OS X
NetBSD 1.5 and above GCC, egcs
OpenBSD 3.3 and above GCC
Solaris 2.7-10 sparc/x86 GCC, Sun Workshop 6 compilers
Windows 98 - XP Microsoft Visual C++ 6.0

TRE 0.7.5 should compile without changes on all of the above platforms. Tell me if you are using TRE on a platform that is not listed above, and I'll add it to the list. Also let me know if TRE does not work on a listed platform.

Depending on the platform, you may need to install libutf8 to get wide character and multibyte character set support.

Free

TRE is released under a license which is essentially the same as the "2 clause" BSD-style license used in NetBSD. See the file LICENSE for details.

Roadmap

There are currently two features, both related to collating elements, missing from 100% POSIX compliance. These are:

  • Support for collating elements (e.g. [[.<X>.]], where <X> is a collating element). It is not possible to support multi-character collating elements portably, since POSIX does not define a way to determine whether a character sequence is a multi-character collating element or not.

  • Support for equivalence classes, for example [[=<X>=]], where <X> is a collating element. An equivalence class matches any character which has the same primary collation weight as <X>. Again, POSIX provides no portable mechanism for determining the primary collation weight of a collating element.

Note that other portable regexp implementations don't support collating elements either. The single exception is Regex++, which comes with its own database for collating elements for different locales. Support for collating elements and equivalence classes has not been widely requested and is not very high on the TODO list at the moment.

These are other features I'm planning to implement real soon now:

  • All the missing GNU extensions enabled in GNU regex, such as [[:<:]] and [[:>:]]

  • A REG_SHORTEST regexec() flag for returning the shortest match instead of the longest match.

  • Perl-compatible syntax

    • [:^class:]
    • Matches anything but the characters in class. Note that
    • [^[:class:]] works already, this would be just a
    • convenience shorthand.
    • \A
    • Match only at beginning of string
    • \Z
    • Match only at end of string, or before newline at the end
    • \z
    • Match only at end of string
    • \l
    • Lowercase next char (think vi)
    • \u
    • Uppercase next char (think vi)
    • \L
    • Lowercase till \E (think vi)
    • \U
    • Uppercase till \E (think vi)
    • (?=pattern)
    • Zero-width positive look-ahead assertions.
    • (?!pattern)
    • Zero-width negative look-ahead assertions.
    • (?<=pattern)
    • Zero-width positive look-behind assertions.
    • (?<!pattern)
    • Zero-width negative look-behind assertions.

Documentation especially for the nonstandard features of TRE, such as approximate matching, is a work in progress (with "progress" loosely defined...)

Issues
  • Long pattern strings

    Long pattern strings

    This pull request adds two configure options via environment variables:

    • TRE_EXPRESSION_STACK_SIZE_IN_BYTES
    • TRE_EXPRESSION_RING_BUFFER_SIZE_IN_BYTES

    This allows the user to compile tre with much large stack and ring buffer sizes, if needed to parse very large pattern strings.

    The expression ring buffer is still statically allocated.

    A PACKAGING.md file is also added with instructions on how to build a distribution package.

    (I'm using tre in a project that needs fuzzy matching with very large pattern strings - hence this changeset.)

    opened by adamfeuer 5
  • R additions

    R additions

    This adds the additional functions from the R library mentioned in issue #13 along with REG_USEBYTES.

    opened by opoplawski 3
  • missing install-sh?

    missing install-sh?

    [email protected]:~/cvs/fossil/cwal/th1ish/tre$ ./configure --prefix=$HOME configure: error: cannot find install-sh, install.sh, or shtool in utils "."/utils

    opened by sgbeal 1
  • Fix Windows compilation errors

    Fix Windows compilation errors

    • On Windows, Visual Studio 2008 Express Edition fails to find tre-config.h. This pull request provides a copy of it in the appropriate location.
    • On Windows, there's no setup.py file; only the setup.py.in template is provided. The pull request provides an template-replaced version of setup.py.
    • On Windows, tre.dll is placed in C;\Pythonxy along with a warning. This pull request places it in the site-packages directory alongside of tre.pyd.
    opened by bjones1 1
  • MinGW and WChar_t

    MinGW and WChar_t

    Hi, I've compiled the TRE with MinGw and enabled wchar_t. It worked with English words. I've tested it with Persian language but did not work. Is the problem related to MinGw ? How can I fix it ?

    opened by Behrouz-m 1
  • Squash MSVS 2012 warnings.

    Squash MSVS 2012 warnings.

    Hello,

    If you would like to compile TRE with the newest MSVC compiler without warnings, have a look at d0426f8, please. The vcbuild branch includes header and project files to build the static library with MSVS.

    --- Ferda

    opened by prantlf 0
  • Fix overflow of TRE_CHAR_MAX

    Fix overflow of TRE_CHAR_MAX

    TRE_CHAR_MAX must fit in int.

    opened by andreas-schwab 0
  • Fix compilation when TRE_USE_SYSTEM_REGEX

    Fix compilation when TRE_USE_SYSTEM_REGEX

    tre.h was changed to define REG_USEBYTES if TRE_USE_SYSTEM_REGEX was undefined which broke the compilation step when it was defined. This fixes the compilation by defining REG_USEBYTES.

    Error message:

    make  all-recursive
    make[1]: Entering directory `/usr/share/media/src/git/tre'
    Making all in lib
    make[2]: Entering directory `/usr/share/media/src/git/tre/lib'
    make  all-am
    make[3]: Entering directory `/usr/share/media/src/git/tre/lib'
    /bin/sh ../libtool --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I..     -g -O2 -MT tre-ast.lo -MD -MP -MF .deps/tre-ast.Tpo -c -o tre-ast.lo tre-ast.c
    libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I.. -g -O2 -MT tre-ast.lo -MD -MP -MF .deps/tre-ast.Tpo -c tre-ast.c  -fPIC -DPIC -o .libs/tre-ast.o
    mv -f .deps/tre-ast.Tpo .deps/tre-ast.Plo
    /bin/sh ../libtool --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I..     -g -O2 -MT tre-compile.lo -MD -MP -MF .deps/tre-compile.Tpo -c -o tre-compile.lo tre-compile.c
    libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I.. -g -O2 -MT tre-compile.lo -MD -MP -MF .deps/tre-compile.Tpo -c tre-compile.c  -fPIC -DPIC -o .libs/tre-compile.o
    tre-compile.c: In function 'tre_compile':
    tre-compile.c:1894:33: error: 'REG_USEBYTES' undeclared (first use in this function)
       parse_ctx.cur_max = (cflags & REG_USEBYTES) ? 1 : TRE_MB_CUR_MAX;
                                     ^
    tre-compile.c:1894:33: note: each undeclared identifier is reported only once for each function it appears in
    make[3]: *** [tre-compile.lo] Error 1
    make[3]: Leaving directory `/usr/share/media/src/git/tre/lib'
    make[2]: *** [all] Error 2
    make[2]: Leaving directory `/usr/share/media/src/git/tre/lib'
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory `/usr/share/media/src/git/tre'
    make: *** [all] Error 2
    
    opened by h3xx 0
  • Trivial typo fix

    Trivial typo fix

    Comment had an ISO-8859-1 encoded non-breaking space, probably unintentional. Replacing it with a regular space makes this an ASCII file (arguably preferred).

    opened by mvkorpel 0
  • Partial zh_CN translation + README fix

    Partial zh_CN translation + README fix

    This PR contains a partial zh_CN translation, in that the error messages are done but not the longer help messages. I also fixed up the README by a bit, so the last section about what PCRE extensions to include make more sense. I also added a sensence pointing people to the header if they want to see what functions provide the extensions, as I don't think I have the time to edit the HTML docs yet.

    opened by Artoria2e5 0
  • Remove duplicate public headers

    Remove duplicate public headers

    The current way tre.h is duplicated greatly confuses maintainers. Debian, for example, mistakenly packages the includes/tre/tre.h instead of lib/tre.h, resulting in an inability for users to use the R additions from #14. In addition, a mismatching tre-config.h delays compile errors to link time, which is also bad.

    This commit moves lib/tre.h to the public header location, which 277d66ced2ed70f2ede1bc8773f103218a51effd implies is the preferred place.

    The changes build on a Debian buster WSL1 environment and installs the correct headers. Visual Studio 2019 wants to upgrade the vcxproj files, so I did not verify them.

    opened by Artoria2e5 0
  • PlayStation 3 build support

    PlayStation 3 build support

    Here's a quick PS3 port of TRE (library) 😄

    Requirements:

    • open-source ps3 toolchain
    • open-source PSL1GHT SDK

    Build & install:

    make -f ps3/Makefile install
    

    (I wasn't sure if I should update the Readme.md with instructions or not)

    opened by bucanero 0
  • Certificate expired on https://laurikari.net

    Certificate expired on https://laurikari.net

    Hello! Could you please renew the certificate on https://laurikari.net/tre/ ? It has expired a couple weeks ago:

    Your connection is not private Attackers might be trying to steal your information from laurikari.net (for example, passwords, messages, or credit cards). Learn more NET::ERR_CERT_DATE_INVALID

    Thank you! Dmitry

    opened by DmitryFromDubna 2
  • error while loading shared libraries: libtre.so.5: cannot open shared object file: No such file or directory

    error while loading shared libraries: libtre.so.5: cannot open shared object file: No such file or directory

    Hi there, thanks for this amazing library. I am used to work with third-party APIs that expose TRE, but for the first time I will need to use it directly from within C++.

    So, here is what I did. I downloaded a copy of the repository and followed the build instructions:

    ./configure
    make
    make check
    

    At that point, all checks seemed to work without any errors until the agrep test stage, where only one test is passed - all the rest fail. In the test log, I see many many identical errors:

    error while loading shared libraries: libtre.so.5: cannot open shared object file: No such file or directory

    For the purpose of further testing, I then continued:

    make install

    Everything else seems to proceed wit no additional errors. Next, I tried to compile a small test code from a file called test.cpp, located in the root folder, that contains merely:

    extern "C" {
    #include "tre/lib/tre.h"
    }
    
    int main()
    {
        regex_t tgt;
        tre_regcomp(&tgt, "First Try", REG_EXTENDED);
    
        return 0;
    }
    

    When I compile with g++ test.cpp -ltre, still no error thrown. However, when I then try ./a.out, I get the error:

    ./a.out: error while loading shared libraries: libtre.so.5: cannot open shared object file: No such file or directory

    How could I solve such issues?

    opened by nyangz 1
  • Add distinct error message for maximum number in {} being too large

    Add distinct error message for maximum number in {} being too large

    This PR add a distinct error message for the maximum number in {} being too large (hard-coded to 255 in tre.h).

    We recently switched to using tre-regex in Pi-hole and received a report today, that the regex

    a{2,256}
    

    was returning

    {} content invalid
    

    (return code REG_BADBR).

    This was a bit of a mystery at first, as GNU regex doesn't seem to enforce a maximum number of iterations in a bound expression. I propose adding an explicit warning message to help users seeing what is wrong.

    The new return code REG_BADMAX has a distinct error message including the upper limit:

    Maximum repetition in {} larger than 255
    

    Including the limit here seems meaningful as the limit is set at compile time and not customizable during runtime. Hence, users have no access to its actual value otherwise. I use stringification of the C preprocessor to implement this transparently and without overhead.

    I added a test case for the new return type. make check passes.

    opened by DL6ER 1
  • Specifying the function return values in the webpage would be nice

    Specifying the function return values in the webpage would be nice

    The doc/tre-api.html file in the Git repository explains the return codes of the various functions; it would be helpful if these could be included in the online documentation at https://laurikari.net/tre/documentation/regcomp/ etc. could include this information as well.

    opened by juliangilbey 0
  • Regex (^|\b) is not handled correctly, but (\b|^) is

    Regex (^|\b) is not handled correctly, but (\b|^) is

    The following small example shows that this regex is not handled correctly.

    #include <stdio.h>
    #include <tre/tre.h>
    
    int main() {
      regex_t preg;
      int ret;
    
      tre_regcomp(&preg, "\\ba", REG_EXTENDED);
      ret = tre_regexec(&preg, "this is a word", 0, NULL, 0);
      printf("return value for \\ba is %d\n", ret);
    
      tre_regcomp(&preg, "^a", REG_EXTENDED);
      ret = tre_regexec(&preg, "this is a word", 0, NULL, 0);
      printf("return value for ^a is %d\n", ret);
    
      tre_regcomp(&preg, "(^|\\b)a", REG_EXTENDED);
      ret = tre_regexec(&preg, "this is a word", 0, NULL, 0);
      printf("return value for (^|\\b)a is %d\n", ret);
    
      tre_regcomp(&preg, "(\\b|^)a", REG_EXTENDED);
      ret = tre_regexec(&preg, "this is a word", 0, NULL, 0);
      printf("return value for (\\b|^)a is %d\n", ret);
    }
    
    euler:/tmp $ gcc -o testtre testtre.c -ltre
    euler:/tmp $ ./testtre 
    return value for \ba is 0
    return value for ^a is 1
    return value for (^|\b)a is 1
    return value for (\b|^)a is 0
    

    I'm using libtre version 0.8.0-6 on Debian, which is based on the 0.8.0 release.

    Best wishes,

    Julian

    opened by juliangilbey 0
  • docs: fix simple typo, sotfware -> software

    docs: fix simple typo, sotfware -> software

    There is a small typo in python/tre-python.c.

    Should read software rather than sotfware.

    opened by timgates42 0
  • Supporting GNU regex syntax bits?

    Supporting GNU regex syntax bits?

    Hi.

    I'm wondering how hard it would be for your library to supply an API like that of the GNU regex? In particular, I need the ability to support different flavors of syntax, depending upon different sets of syntax bits.

    This would be to drop into GNU awk, if possible.

    Thanks!

    opened by arnoldrobbins 0
Owner
Ville Laurikari
CTO at Zefort
Ville Laurikari
Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Takayuki MATSUOKA 2 Oct 13, 2021
A small implementation of regular expression matching engine in C

cregex cregex is a compact implementation of regular expression (regex) matching engine in C. Its design was inspired by Rob Pike's regex-code for the

Jim Huang 64 Jul 6, 2021
High-performance regular expression matching library

Hyperscan Hyperscan is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre libra

Intel Corporation 3.3k Oct 12, 2021
Onigmo is a regular expressions library forked from Oniguruma.

Onigmo (Oniguruma-mod) https://github.com/k-takata/Onigmo Onigmo is a regular expressions library forked from Oniguruma. It focuses to support new exp

K.Takata 548 Oct 14, 2021
A Compile time PCRE (almost) compatible regular expression matcher.

Compile time regular expressions v3 Fast compile-time regular expressions with support for matching/searching/capturing during compile-time or runtime

Hana Dusíková 2.1k Oct 22, 2021
regular expression library

Oniguruma https://github.com/kkos/oniguruma Oniguruma is a modern and flexible regular expressions library. It encompasses features from different reg

K.Kosako 1.6k Oct 19, 2021
C++ regular expressions made easy

CppVerbalExpressions C++ Regular Expressions made easy VerbalExpressions is a C++11 Header library that helps to construct difficult regular expressio

null 348 Oct 6, 2021
Registry Explorer - enhanced Registry editor/viewer

Registry Explorer Replacement for the Windows built-in Regedit.exe tool. Improvements over that tool include: Show real Registry (not just the standar

Pavel Yosifovich 735 Oct 22, 2021
Super Light Regexp engine for C/C++

SLRE: Super Light Regular Expression library Documentation and API reference are at docs.cesanta.com/slre Contributions To submit contributions, sign

Cesanta Software 487 Sep 13, 2021
Perl Incompatible Regular Expressions library

This is PIRE, Perl Incompatible Regular Expressions library. This library is aimed at checking a huge amount of text against relatively many regular

Yandex 301 Sep 21, 2021
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

This is the source code repository for RE2, a regular expression library. For documentation about how to install and use RE2, visit https://github.co

Google 6.1k Oct 16, 2021