A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams

Overview

Name

libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams

Table of Contents

Status

This library is already quite usable and some people are already using it in production.

Nevertheless this library is still under heavy development. The API is still in flux and may be changed quickly without notice.

This is a pure C library that is designed to have zero dependencies.

No pathological regexes exist for this regex engine because it does not use a backtracking algorithm at all.

Already rewrote the code base of Russ Cox's re1 library using the nginx coding style (yes, I love it!), also incorporated a clone of the nginx memory pool into it for memory management.

Already ported the Thompson and Pike VM backends to sregex. The former is just for yes-or-no matching, and the latter also supports sub-match capturing.

Implemented the case-insensitive matching mode via the SRE_REGEX_CASELESS flag.

The full streaming matching API for the sregex engine has already been implemented, for both the Pike and Thompson regex VMs. The sub-match capturing also supports streaming processing. When the state machine is yielded (that is, returning SRE_AGAIN on the current input data chunk), sregex will always output the current value range for the $& sub-match capture in the user-supplied ovector array.

Almost all the relevant test cases for PCRE 8.32 and Perl 5.16.2 have been imported into sregex's test suite and all tests are passing right now.

Already implemented an API for assembling multiple user regexes and returning an ID indicating exactly which regex is matched (first), as well as the corresponding sub-match captures.

There is also a Just-in-Time (JIT) compiler targeting x86_64 for the Thompson VM.

Syntax Supported

The following Perl 5 regex syntax features have already been implemented.

^             match the beginning of lines
$             match the end of lines

\A            match only at beginning of stream
\z            match only at end of stream

\b            match a word boundary
\B            match except at a word boundary

.             match any char

[ab0-9]       character classes (positive)
[^ab0-9]      character classes (negative)

\d            match a digit character ([0-9])
\D            match a non-digit character ([^0-9])

\s            match a whitespace character ([ \f\n\r\t])
\S            match a non-whitespace character ([^ \f\n\r\t])

\h            match a horizontal whitespace character
\H            match a character that isn't horizontal whitespace

\v            match a vertical whitespace character
\V            match a character that isn't vertical whitespace

\w            match a "word" character ([A-Za-z0-9_])
\W            match a non-"word" character ([^A-Za-z0-9_])

\cK           control char (example: VT)

\N            match a character that isn't a newline

ab            concatenation; first match a, and then b
a|b           alternation; match a or b

(a)           capturing parentheses
(?:a)         non-capturing parantheses

a?            match 1 or 0 times, greedily
a*            match 0 or more times, greedily
a+            match 1 or more times, greedily

a??           match 1 or 0 times, not greedily
a*?           match 0 or more times, not greedily
a+?           match 1 or more times, not greedily

a{n}          match exactly n times
a{n,m}        match at least n but not more than m times, greedily
a{n,}         match at least n times, greedily

a{n}?         match exactly n times, not greedily (redundant)
a{n,m}?       match at least n but not more than m times, not greedily
a{n,}?        match at least n times, not greedily

The following escaping sequences are supported:

\t          tab
\n          newline
\r          return
\f          form feed
\a          alarm
\e          escape
\b          backspace (in character class only)
\x{}, \x00  character whose ordinal is the given hexadecimal number
\o{}, \000  character whose ordinal is the given octal number

Escaping a regex meta character yields the literal character itself, like \{ and \..

Only the octet mode is supported; no multi-byte character encoding love (yet).

API

This library provides a pure C API. This API is still in flux and may change in the near future without notice.

Back to TOC

Constants

This library provides the following public constants for use in the various API functions.

  • SRE_OK
  • SRE_DECLINED
  • SRE_AGAIN
  • SRE_ERROR

The actual meanings of these constants depend on the concrete API functions using them.

Back to TOC

Memory pool API

This library utilizes a memory pool to simplify memory management. Most of the low-level API functions provided by this library does accept a memory pool pointer as an argument.

The operations on the memory pool on the user side are limited to

  1. creating a memory pool,
  2. destroying a memory pool, and
  3. resetting a memory pool.

Back to TOC

sre_create_pool

sre_pool_t *sre_create_pool(size_t size);

Creates a memory pool with a page size of size. Returns the pool as an opaque pointer type sre_pool_t.

Usually the page size you specify should not be too large. Usually 1KB or 4KB should be sufficient. Optimal values depend on your actual regexes and input data pattern involved and should be tuned empirically.

The returned memory pool pointer is usually fed into other API functions provided by this library as an argument.

It is your responsibility to destroy the pool when you no longer need it via the sre_destroy_pool function. Failing to destroy the pool will result in memory leaks.

Back to TOC

sre_destroy_pool

void sre_destroy_pool(sre_pool_t *pool);

Destroys the memory pool created by the sre_create_pool function.

Back to TOC

sre_reset_pool

void sre_reset_pool(sre_pool_t *pool);

Back to TOC

Regex parsing and compilation API

Before running a regex (or set of multiple regexes), you need to parse and compile them first, such that you can run the compiled form of the regex(es) over and over again at maximum speed.

Back to TOC

sre_regex_parse

typedef uint8_t     sre_char;
typedef uintptr_t   sre_uint_t;
typedef intptr_t    sre_int_t;

sre_regex_t *sre_regex_parse(sre_pool_t *pool, sre_char *regex,
    sre_uint_t *ncaps, int flags, sre_int_t *err_offset);

Parses the string representation of the user regex specified by the regex parameter (as a null-terminated string).

Returns a parsed regex object of the opaque pointer type sre_regex_t if no error happens. Otherwise returns a NULL pointer and set the offset in the regex string where the parse failure happens.

The parsed regex object pointer is an Abstract-Syntax-Tree (AST) representation of the string regex. It can later be fed into API function calls like sre_regex_compile as an argument.

The first parameter, pool, is a memory pool created by the sre_create_pool API function.

The ncaps parameter is used to output the number of sub-match captures found in the regex. This integer can later be used to extract sub-match captures.

The flags parameter specifies additional regex compiling flags like below:

  • SRE_REGEX_CASELESS case-insensitive matching mode.

Back to TOC

sre_regex_parse_multi

typedef uint8_t     sre_char;
typedef uintptr_t   sre_uint_t;
typedef intptr_t    sre_int_t;

sre_regex_t *sre_regex_parse_multi(sre_pool_t *pool, sre_char **regexes,
    sre_int_t nregexes, sre_uint_t *max_ncaps, int *multi_flags,
    sre_int_t *err_offset, sre_int_t *err_regex_id);

Similar to the sre_regex_parse API function but works on multiple regexes at once.

These regexes are specified by the C string array regexes, whose size is determined by the nregexes parameter.

All these input regexes are combined into a single parsed regex object, returned as the opaque pointer of the type sre_regex_t, just like sre_regex_parse. These regexes are logically connected via the alternative regex operator (|), so the order of these regexes determine their relative precedence in a tie. Despite of being connected by | logically, the regex execution API can still signify which of these regexes is matched by returning the regex ID which is the offset of the regex in the regexes input array.

Upon failures, returns the NULL pointer and sets

  • the output parameter err_regex_id for the number of regex having syntax errors (i.e., the 0-based offset of the regex in the regexes input parameter array), and
  • the output parameter err_offset for the string offset in the guilty regex where the failure happens.

The output parameter max_ncaps returns the maximum number of sub-match captures in all these regexes. Note that, this is is the maximum instead of the sum.

The multi_flags is an input array consisting of the regex flags for every regex specified in the regexes array. The size of this array must be no shorter than the size specified by nregexes. For what regex flags you can use, just check out the documentation for the sre_regex_parse API function.

Back to TOC

sre_regex_compile

sre_program_t *sre_regex_compile(sre_pool_t *pool, sre_regex_t *re);

Compiles the parsed regex object (returned by sre_regex_parse) into a bytecode representation of the regex, of the opaque pointer type sre_program_t.

Returns the NULL pointer in case of failures.

The memory pool specified by the pool parameter does not have to be the same as the one used by the earlier sre_regex_parse call. But you could use the same memory pool if you want.

The compiled regex form (or bytecode form) returned can be fed into one of the regex backend VMs provided by this library for execution. See regex execution API for more details.

Back to TOC

Regex execution API

The regex execution API provides various different virtual machines (VMs) for running the compiled regexes by different algorithms.

Currently the following VMs are supported:

Back to TOC

Thompson VM

The Thompson VM uses the Thompson NFA simulation algorithm to execute the compiled regex(es) by matching against an input string (or input stream).

Back to TOC

sre_vm_thompson_create_ctx

sre_vm_thompson_ctx_t *sre_vm_thompson_create_ctx(sre_pool_t *pool,
    sre_program_t *prog);

Creates and returns a context structure (of the opaque type sre_vm_thompson_ctx_t) for the Thompson VM. Returns NULL in case of failure (like running out of memory).

This return value can later be used by the sre_vm_thompson_exec function as an argument.

The prog parameter accepts the compiled bytecode form of the regex(es) returned by the sre_regex_compile function. This compiled regex(es) is embedded into the resulting context structure.

Accepts a memory pool created by the sre_create_pool function as the first argument. This memory pool does not have to be the same as the pool used for parsing or compiling the regex(es).

Back to TOC

sre_vm_thompson_exec

typedef intptr_t    sre_int_t;
typedef uint8_t     sre_char;

sre_int_t sre_vm_thompson_exec(sre_vm_thompson_ctx_t *ctx, sre_char *input,
    size_t size, unsigned int eof);

Executes the compiled regex(es) on the input string data atop the Thompson VM (without Just-In-Time optimizations).

The ctx argument value is returned by the sre_vm_thompson_create_ctx function. The compiled (bytecode) form of the regex(es) are already embedded in this ctx value. This ctx argument can be changed by this function call and must be preserved for all the sre_vm_thompson_exec calls on the same data stream. Different data streams MUST use different ctx instances. When a data stream is completely processed, the corresponding ctx instance MUST be discarded and cannot be reused again.

The input data is specified by a character data chunk in a data stream. The input parameter specifies the starting address of the data chunk, the size parameter specifies the size of the chunk, while the eof parameter identifies whether this chunk is the last chunk in the stream. If you just want to match on a single C string, then always specify 1 as the eof argument and exclude the NULL string terminator in your C string while computing the size argument value.

This function may return one of the following values:

  • SRE_OK A match is found.
  • SRE_DECLINED No match can be found. This value can never be returned when the eof parameter is unset (because a match MAY get found when seeing more input string data).
  • SRE_AGAIN More data (in a subsequent call) is needed to obtain a match. The current data chunk can be discarded after this call returns. This value can only be returned when the eof parameter is not set.
  • SRE_ERROR A fatal error has occurred (like running out of memory).

This function does not return the regex ID of the matched regex when multiple regexes are specified at once via the sre_regex_parse_multi function is used. This may change in the future.

Sub-match captures are not supported in this Thompson VM by design. You should use the Pike VM instead if you want that.

Back to TOC

Just-In-Time Support for Thompson VM

The Thompson VM comes with a Just-In-Time compiler. Currently only the x86_64 architecture is supported. Support for other architectures may come in the future.

Back to TOC

sre_vm_thompson_jit_compile
typedef intptr_t    sre_int_t;

sre_int_t sre_vm_thompson_jit_compile(sre_pool_t *pool, sre_program_t *prog,
    sre_vm_thompson_code_t **pcode);

Compiles the bytecode form of the regex(es) created by sre_regex_compile down into native code.

It returns one of the following values:

  • SRE_OK Compilation is successful.
  • SRE_DECLINED The current architecture is not supported.
  • SRE_ERROR A fatal error occurs (like running out of memory).

The pool parameter specifies a memory pool created by sre_create_pool. This pool is used for the JIT compilation.

The prog parameter is the compiled bytecode form of the regex(es) created by the sre_regex_compile function call.

The resulting JIT compiled native code along with the runtime information is saved in the output argument pcode of the opaque type sre_vm_thompson_code_t. This structure is allocated by this function in the provided memory pool.

This sre_vm_thompson_code_t object can later be executed by running the C function pointer fetched from this object via the sre_vm_thompson_jit_get_handler call.

Back to TOC

sre_vm_thompson_jit_get_handler
typedef uint8_t     sre_char;
typedef intptr_t    sre_int_t;
typedef sre_int_t (*sre_vm_thompson_exec_pt)(sre_vm_thompson_ctx_t *ctx,
    sre_char *input, size_t size, unsigned int eof);

sre_vm_thompson_exec_pt sre_vm_thompson_jit_get_handler(
    sre_vm_thompson_code_t *code);

Fetches a C function pointer from the JIT compiled form of the regex(es) generated via an earlier sre_vm_thompson_jit_compile.

The C function pointer is of the exactly same function prototype of the interpreter entry function sre_vm_thompson_exec. The only difference is that the sre_vm_thompson_ctx_t object MUST be created via the sre_vm_thompson_jit_create_ctx function instead of the sre_vm_thompson_create_ctx function. Despite that, the resulting C function pointer can be used as the same way as sre_vm_thompson_exec.

Back to TOC

sre_vm_thompson_jit_create_ctx
sre_vm_thompson_ctx_t *sre_vm_thompson_jit_create_ctx(sre_pool_t *pool,
    sre_program_t *prog);

Allocates a context structure for executing the compiled native code form of the regex(s) generated by the Just-In-Time compiler of the Thompson VM.

This context object should only be used by the C function returned by the sre_vm_thompson_jit_get_handler function call. Use of this object in sre_vm_thompson_exec is prohibited.

Back to TOC

Pike VM

The Pike VM uses an enhanced version of the Thompson NFA simulation algorithm that supports sub-match captures.

Back to TOC

sre_vm_pike_create_ctx

typedef intptr_t    sre_int_t;

sre_vm_pike_ctx_t *sre_vm_pike_create_ctx(sre_pool_t *pool, sre_program_t *prog,
    sre_int_t *ovector, size_t ovecsize);

Creates and returns a context structure (of the opaque type sre_vm_pike_ctx_t) for the Pike VM. Returns NULL in case of failure (like running out of memory).

This return value can later be used by the sre_vm_pike_exec function as an argument.

The prog parameter accepts the compiled bytecode form of the regex(es) returned by the sre_regex_compile function. This compiled regex(es) is embedded into the resulting context structure.

Accepts a memory pool created by the sre_create_pool function as the first argument. This memory pool does not have to be the same as the pool used for parsing or compiling the regex(es).

The ovector parameter specifies an array for outputting the beginning and end offsets of the (sub-)match captures. The elements of the array are used like below:

  1. The 1st element of the array holds the beginning offset of the whole match,
  2. the 2nd element holds the end offset of the whole match,
  3. the 3rd element holds the beginning offset of the 1st sub-match capture,
  4. the 4th element holds the end offset of the 1st sub-match capture,
  5. the 5rd element holds the beginning offset of the 2st sub-match capture,
  6. the 6th element holds the end offset of the 2st sub-match capture,
  7. and so on...

The size of the ovector array is specified by the ovecsize parameter, in bytes. The size of the array can be computed as follows:

    ovecsize = 2 * (ncaps + 1) * sizeof(sre_int_t)

where ncaps is the value previously output by the sre_regex_parse or sre_regex_parse_multi function.

The ovector array is allocated by the caller and filled by this function call.

Back to TOC

sre_vm_pike_exec

typedef uint8_t     sre_char;
typedef intptr_t    sre_int_t;

sre_int_t sre_vm_pike_exec(sre_vm_pike_ctx_t *ctx, sre_char *input, size_t size,
    unsigned eof, sre_int_t **pending_matched);

Executes the compiled regex(es) on the input string data atop the Pike VM (without Just-In-Time optimizations).

The ctx argument value is returned by the sre_vm_pike_create_ctx function. The compiled (bytecode) form of the regex(es) are already embedded in this ctx value. This ctx argument can be changed by this function call and must be preserved for all the sre_vm_pike_exec calls on the same data stream. Different data streams MUST use different ctx instances. When a data stream is completely processed, the corresponding ctx instance MUST be discarded and cannot be reused again.

The input data is specified by a character data chunk in a data stream. The input parameter specifies the starting address of the data chunk, the size parameter specifies the size of the chunk, while the eof parameter identifies whether this chunk is the last chunk in the stream. If you just want to match on a single C string, then always specify 1 as the eof argument and exclude the NULL string terminator in your C string while computing the size argument value.

The pending_matched parameter outputs an array holding all the pending matched captures (whole-match only, no sub-matches) if no complete matches have been found yet (i.e., this call returns SRE_AGAIN). This is very useful for doing regex substitutions on (large) data streams where the caller can use the info in pending_matched to decide exactly how much data in the current to-be-thrown data chunk needs to be buffered. The caller should never allocate the space for this array, rather, this function call takes care of it and just sets the (double) pointer to point to its internal (read-only) storage.

This function may return one of the following values:

  • a non-negative value A match is found and the value is the ID of the (first) matched regex if multiple regexes are parsed at once via the sre_regex_parse_multi function. A regex ID is the 0-based index of the corresponding regex in the regexes array fed into the sre_regex_parse_multi function.
  • SRE_DECLINED No match can be found. This value can never be returned when the eof parameter is unset (because a match MAY get found when seeing more input string data).
  • SRE_AGAIN More data (in a subsequent call) is needed to obtain a match. The current data chunk can be discarded after this call returns. This value can only be returned when the eof parameter is not set.
  • SRE_ERROR A fatal error has occurred (like running out of memory).

Back to TOC

Examples

Please check out the sregex-cli command-line utility's source for usage:

https://github.com/agentzh/sregex/blob/master/src/sre_cli.c#L1

The sregex-cli command-line interface can be used as a convenient way to exercise the engine:

./sregex-cli 'a|ab' 'blab'

It also supports the --flags option which can be used to enable case-insensitive matching:

./sregex-cli --flags i 'A|AB' 'blab'

And also the --stdin option for reading data chunks from stdin:

# one single data chunk to be matched:
perl -e '$s="foobar";print length($s),"\n$s"' \
    | ./sregex-cli --stdin foo

# 3 data chunks (forming a single input stream) to be matched:
perl -e '$s="foobar";print length($s),"\n$s" for 1..3' \
    | sregex-cli --stdin foo

A real-world application of this library is the ngx_replace_filter module:

https://github.com/agentzh/replace-filter-nginx-module

Back to TOC

Installation

make
make install

Gnu make and gcc are required. (On operating systems like FreeBSD and Solaris, you should type gmake instead of make here.)

It will build libsregex.so (or libsregex.dylib on Mac OS X), libsregex.a, and the command-line utility sregex-cli and install them into the prefix /usr/local/ by default.

If you want to install into a custom location, then just specify the PREFIX variable like this:

make PREFIX=/opt/sregex
make install PREFIX=/opt/sregex

If you are building a binary package (like an RPM package), then you will find the DESTDIR variable handy, as in

make PREFIX=/opt/sregex
make install PREFIX=/opt/sregex DESTDIR=/path/to/my/build/root

If you run make distclean before make, then you also need bison 2.7+ for generating the regex parser files.

Back to TOC

Test Suite

The test suite is driven by Perl 5.

To run the test suite

make test

Gnu make, perl 5.16.2, and the following Perl CPAN modules are required:

  • Cwd
  • IPC::Run3
  • Test::Base
  • Test::LongString

If you already have perl installed in your system, you can use the following command to install these CPAN modules (you may need to run it using root):

cpan Cwd IPC::Run3 Test::Base Test::LongString

You can also run the test suite using the Valgrind Memcheck tool to check memory issues in sregex:

make valtest

Because we have a huge test suite, to run the test suite in parallel, you can specify the parallelism level with the jobs make variable, as in

make test jobs=8

or similarly

make valtest jobs=8

So the test suite will run in 8 parallel jobs (assuming you have 8 CPU cores).

The streaming matching API is much more thoroughly excerised by the test suite of the ngx_replace_filter module.

Back to TOC

TODO

  • implement the (?i) and (?-i) regex syntax.
  • implement a simplified version of the backreferences.
  • implement the comment notation (?#comment).
  • implement the POSIX character class notation.
  • allow '\0' be used in both the regex and the subject string.
  • add a bytecode optimizer to the regex VM (which also generates minimized DFAs for the Thompson VM).
  • add a JIT compiler for the Pike VM targeting x86_64.
  • port the existing x86_64 JIT compiler for the Thompson VM to other architectures like i386.
  • implement the generalized look-around assertions like (?=pattern), (?!pattern), (?<=pattern), and (?<!pattern).
  • implement the UTF-8, GBK, and Latin1 matching mode.

Back to TOC

Author

Yichun "agentzh" Zhang (章亦春) [email protected], OpenResty Inc.

Back to TOC

Copyright and License

Part of this code is from the NGINX open source project: http://nginx.org/LICENSE

This library is licensed under the BSD license.

Copyright (C) 2012-2017, by Yichun "agentzh" Zhang (章亦春), OpenResty Inc.

Copyright (C) 2007-2009 Russ Cox, Google Inc. All rights reserved.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of Google, Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Back to TOC

See Also

Back to TOC

Comments
  • add multiline support

    add multiline support

    i.e. flag that remove special meaning of newlines in input data, so ^ will match first char in first chunk, and $ will match last byte in last chunk (i.e one with EOF set).

    opened by socketpair 6
  • fix hexdecimal literal string

    fix hexdecimal literal string

    Fix these warnings only:

    $ perl -l -e 'print $^V'
    v5.16.2
    $ perl t/03-pcre-testinput1-06.t  >/dev/null
    Illegal hexadecimal digit '\' ignored at (eval 66) line 1, <DATA> line 1.
    Illegal hexadecimal digit 'Z' ignored at (eval 66) line 1, <DATA> line 1.
    Illegal hexadecimal digit '\' ignored at (eval 67) line 1, <DATA> line 1.
    Illegal hexadecimal digit '\' ignored at (eval 70) line 1, <DATA> line 1.
    Illegal hexadecimal digit 'Z' ignored at (eval 70) line 1, <DATA> line 1.
    
    opened by cofyc 5
  • Add perl version checking in test suite

    Add perl version checking in test suite

    t/03-pcre-testinput1-02.t .. 211/? 
    #   Failed test 'TEST 37: testinput1:113 - bad regex: ^\ca\cA\c[\c{\c:: Use ";" instead of "\c{" at t/SRegex.pm line 162.
    # '
    #   at t/SRegex.pm line 166.
    # Looks like you failed 1 test of 359.
    t/03-pcre-testinput1-02.t .. Dubious, test returned 1 (wstat 256, 0x100)
    Failed 1/359 subtests 
    t/03-pcre-testinput1-03.t .. ok     
    
    opened by socketpair 4
  • perl 5.18-style incompatibilities

    perl 5.18-style incompatibilities

    with prove -j1 -r -e perl5.18.0 t some tests are failing, which look like API incompatibilities. They might need documentation and perl version specific checks in the test suite.

    t/03-pcre-testinput1-07.t (Wstat: 256 Tests: 363 Failed: 1)
      Failed test:  363
      Non-zero exit status: 1
    t/03-pcre-testinput1-19.t (Wstat: 1024 Tests: 382 Failed: 4)
      Failed tests:  48, 50, 56, 58
      Non-zero exit status: 4
    
    t/03-pcre-testinput1-19.t .. 1/?
    #   Failed test 'TEST 7: testinput1:3678 - pike vm capture ok'
    #   at t/SRegex.pm line 260.
    #          got: '(1, 6)'
    #     expected: '(1, 7)'
    
    #   Failed test 'TEST 7: testinput1:3678 - splitted pike vm capture ok'
    #   at t/SRegex.pm line 263.
    #          got: '(1, 6)'
    #     expected: '(1, 7)'
    
    #   Failed test 'TEST 8: testinput1:3681 - pike vm capture ok'
    #   at t/SRegex.pm line 260.
    #          got: '(1, 6)'
    #     expected: '(1, 7)'
    
    #   Failed test 'TEST 8: testinput1:3681 - splitted pike vm capture ok'
    #   at t/SRegex.pm line 263.
    #          got: '(1, 6)'
    #     expected: '(1, 7)'
    
    opened by rurban 4
  • Bad pike performance

    Bad pike performance

    I am experiencing bad performance and large performance drops comeparted to the complexity of the regex.

    As you can see in the data bellow the performance drop from one char to two chars is huge. and the performance drop from two chars to a more advanced regex is relatively smale.

    Am i doing somthing wrong in my usage of the library or is this exepected performance?

    Data:

    single char

    regex:       `a`
    throughput:   ~867 MiB/s 
    cpu_time:     sre_vm_pike_exec ~23%
    

    two chars

    regex:        `aa`
    throughput:   ~82 MiB/s 
    cpu_time:     sre_vm_pike_exec ~91%
    

    all image and script tags

    regex:        `<img[^>]+>|<script( [^>]+)?>([^<]*)</script>`
    throughput:   ~49 MiB/s
    cpu_time:     sre_vm_pike_exec ~93.26%
    

    testscript

    #define BUFFER_SIZE 1024
    #define POOL_SIZE 1024
    size_t pos,c;
    int fd = open(filepath,O_RDONLY);
    int eof = 0;
    char buffer[BUFFER_SIZE];
    sre_int_t *pending_matched;
    if(fd < 0) {
    	exit(1);
    }
    sre_pool_t *pool = sre_create_pool(POOL_SIZE);
    sre_vm_pike_ctx_t *ctx = sre_vm_pike_create_ctx(pool,prog,ovector,ovecsize);
    for (;;) {
    	c = read(fd,buffer,BUFFER_SIZE);
    	if (c < BUFFER_SIZE)
    		eof = 1;
    	sre_int_t ret;
    	char *tmpos = buffer;
    go:
        ret	= sre_vm_pike_exec(ctx,tmpos,c,eof,&pending_matched);
    	if(ret > 0 && ovector[1] < pos+c) {
    		tmpos += ovector[1];
    		c -= ovector[1];
    		pos += ovector[1];
    		goto go;
    	}
    	pos += c;
    	fprintf(stdout,"%.*s",c,buffer);
    	if(eof)
    		exit(0);
    }
    

    testdata

    CUNY webpage source repeated until it made a 1.1G file

    opened by Miniwoffer 3
  • Failed tests with perl-5.22.2-361.fc24.x86_64

    Failed tests with perl-5.22.2-361.fc24.x86_64

    Fedora 24 gcc-6.1.1-3.fc24.x86_64 perl-5.22.2-361.fc24.x86_64

    #   Failed test 'TEST 37: testinput1:113 - bad regex: ^\ca\cA\c[\c{\c:: Use ";" instead of "\c{" at t/SRegex.pm line 162.
    # '
    #   at t/SRegex.pm line 166.
    # Looks like you failed 1 test of 359.
    t/03-pcre-testinput1-02.t .. 
    Dubious, test returned 1 (wstat 256, 0x100)
    Failed 1/359 subtests 
    
    #   Failed test 'TEST 9: testinput1:4081 - bad regex: ^\c^?: Character following "\c" must be printable ASCII at t/SRegex.pm line 162.
    # '
    #   at t/SRegex.pm line 166.
    # Looks like you failed 1 test of 371.
    t/03-pcre-testinput1-20.t .. 
    Dubious, test returned 1 (wstat 256, 0x100)
    Failed 1/371 subtests 
    
    Test Summary Report
    -------------------
    t/03-pcre-testinput1-02.t (Wstat: 256 Tests: 359 Failed: 1)
      Failed test:  263
      Non-zero exit status: 1
    t/03-pcre-testinput1-20.t (Wstat: 256 Tests: 371 Failed: 1)
      Failed test:  63
      Non-zero exit status: 1
    Files=44, Tests=14414, 11 wallclock secs ( 2.09 usr  0.14 sys +  7.46 cusr  3.13 csys = 12.82 CPU)
    Result: FAIL
    Makefile:159: recipe for target 'test' failed
    
       160              if (!ref $re && !defined $block->no_match && !defined $block->cap) {
       161                  eval {
       162                      $s =~ m/$prefix$re/sm;
       163                  };
    

    I think something has changed how perl handles \c sequences with perl 5.22

    From perl5220delta:

    In double-quotish "\cX", X must now be a printable ASCII character In prior releases, failure to do this raised a deprecation warning.

    opened by maage 3
  • Support for single line mode

    Support for single line mode

    Hi, Is there any chance support could be added for single-line mode? (meaning the dot token will match newlines as well when this flag is specified) I can try my hand at adding this functionality if you'd like (any guidance you could offer would be very much appreciated!).

    opened by int0x2e 3
  • `make test` requires several Perl modules that don't seem to be default (on Ubuntu 15.10 at least)

    `make test` requires several Perl modules that don't seem to be default (on Ubuntu 15.10 at least)

    Hi, thought you guys might want to know, on a fresh Ubuntu 15.10 install, make test generates a few dozen of these:

    Can't locate Test/Base.pm in @INC (you may need to install the Test::Base module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.20.2 /usr/local/share/perl/5.20.2 /usr/lib/x86_64-linux-gnu/perl5/5.20 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.20 /usr/share/perl/5.20 /usr/local/lib/site_perl .) at t/SRegex.pm line 6.
    BEGIN failed--compilation aborted at t/SRegex.pm line 6.
    Compilation failed in require at t/01-sanity-01.t line 3.
    BEGIN failed--compilation aborted at t/01-sanity-01.t line 3.
    t/01-sanity-01.t ...........
    Dubious, test returned 2 (wstat 512, 0x200)
    No subtests run
    

    Fresh Ubuntu 15.10, only things I installed were things like lrzip, pv, tmux, things like that. Not a huge issue or anything, I don't think many people make test, and you guys might already know about this and are not really worried about it.

    If anyone came here via googling the error code, installing Test::Base, IPC::Run3 and Test::LongString with cpan fixed my issue.

    opened by raincoats 3
  • Versioned/tagged release?

    Versioned/tagged release?

    I'm looking at packaging sregex as an RPM and to do so would be easiest with a specific tagged release and version numbering. I know sregex isn't yet fully stable, so even as a pre-release or unstable version number would be really helpful. I see there's https://github.com/openresty/sregex/releases but that's quite a while back.

    Thanks!

    opened by davidjb 2
  • multiple parsers - sregex needs own namespace

    multiple parsers - sregex needs own namespace

    If there is multiple bison parsers with default prefix, program might not run properly. As it probably has multiple yyparse definitions.

    See: info bison Multiple Parsers in the Same Program

    diff --git a/Makefile b/Makefile
    index ac08ab7..a084f58 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -138,7 +138,7 @@ $(FILE_A): $(lib_o_files)
    
     %.c %.h: %.y
            $(E) "BISON     $@"
    -       $(Q)bison -v $<
    +       $(Q)bison -p sregex_yy -v $<
    
     %.h: %.dasc
            $(E) "DYNASM    $@"
    
    opened by maage 2
  • make error on centos

    make error on centos

    在ubuntu下没有问题,在centos下make出错

    我的环境是: os:centos 6.3 kernel:2.6.32-220.el6.x86_64 gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)

    以下是我下载后的操作 [root@server ~]# git clone https://github.com/agentzh/sregex.git Initialized empty Git repository in /root/sregex/.git/ remote: Counting objects: 1229, done. remote: Compressing objects: 100% (588/588), done. remote: Total 1229 (delta 857), reused 1012 (delta 640) Receiving objects: 100% (1229/1229), 391.62 KiB | 161 KiB/s, done. Resolving deltas: 100% (857/857), done. [root@server ~]# cd sregex/ [root@server sregex]# ls bench dynasm LICENSE Makefile misc README.markdown src t util [root@server sregex]# make BISON src/sregex/sre_yyparser.h CC src/sregex/sre_palloc.o CC src/sregex/sre_regex.o CC src/sregex/sre_yyparser.o cc1: warnings being treated as errors src/sregex/sre_yyparser.c: In function ‘yyparse’: src/sregex/sre_yyparser.c:1218: error: ‘yylloc.last’ may be used uninitialized in this function src/sregex/sre_yyparser.c:1218: error: ‘yylloc.pos’ may be used uninitialized in this function make: *** [src/sregex/sre_yyparser.o] Error 1 [root@server sregex]#

    opened by loveshell 2
  • Misleading indentation – maybe a bug?

    Misleading indentation – maybe a bug?

    In file included from src/sregex/sre_vm_thompson_jit.c:16:
    ./dynasm/dasm_x86.h: In function 'dasm_put':
    ./dynasm/dasm_x86.h:207:2: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
      207 |  if (*p++ == 1 && *p == DASM_DISP) mrm = n; continue;
          |  ^~
    ./dynasm/dasm_x86.h:207:45: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
      207 |  if (*p++ == 1 && *p == DASM_DISP) mrm = n; continue;
          |                                             ^~~~~~~~
    
    opened by jirutka 0
  • 01-sanity-02.t – tests 34, 36, 37 are broken

    01-sanity-02.t – tests 34, 36, 37 are broken

    t/01-sanity-02.t ........... 1/? 
    #   Failed test 'TEST 34: - bad regex: x{0, 1}: Unescaped left brace in regex is illegal here in regex; marked by <-- HERE in m/x{ <-- HERE 0, 1}/ at sregex-0.0.1/t/SRegex.pm line 162.
    # '
    #   at sregex-0.0.1/t/SRegex.pm line 166.
    
    #   Failed test 'TEST 36: - bad regex: x{0 ,1}: Unescaped left brace in regex is illegal here in regex; marked by <-- HERE in m/x{ <-- HERE 0 ,1}/ at sregex-0.0.1/t/SRegex.pm line 162.
    # '
    #   at sregex-0.0.1/t/SRegex.pm line 166.
    
    #   Failed test 'TEST 37: - bad regex: x{,12}: Unescaped left brace in regex is illegal here in regex; marked by <-- HERE in m/x{ <-- HERE ,12}/ at sregex-0.0.1/t/SRegex.pm line 162.
    # '
    #   at sregex-0.0.1/t/SRegex.pm line 166.
    

    Environment:

    • gcc 10.2.1_pre1
    • perl v5.32.0
    • IPC::Run3 0.048
    • Test::Base 0.89
    • Test::LongString 0.17
    • PathTools (Cwd) 3.75
    • Alpine Linux Edge
    opened by jirutka 0
  • Serialize AST

    Serialize AST

    I've been writing a project in Lua/OR and so far relied on a custom regex implementation in Lua-Land because I need access to the AST of parsed expressions. That's not possible with the PCRE bindings in OR. It seems like sregex is one step closer because parsing, compiling, and executing are different steps, with the parser returning "a pointer to the AST" according to the readme.

    Is there a way to access that AST and serialize it (to JSON or similar) for further processing?

    opened by turbo 0
  • How to discard ctx

    How to discard ctx

    I try to follow the steps to excute the Pike VM(sre_vm_pike_create_ctx then sre_vm_pike_exec). According to the manual , "Different data streams MUST use different ctx instances", so i create another ctx when i excute with another data stream. This causes a problem that it will lead to running out of the memory when creating ctx frequenty. However , i cannot not find any API to free the ctx.Are there any solutions for this?

    opened by encore-zhou 0
Owner
OpenResty
A Fast and Scalable Web Platform by Extending NGINX with LuaJIT. Commercial support is provided at https://openresty.com/
OpenResty
SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and complex regex expressions.

SRL-CPP SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and co

Telepati 0 Mar 9, 2022
The approximate regex matching library and agrep command line tool.

Introduction TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzz

Ville Laurikari 698 Dec 26, 2022
Easier CPP interface to PCRE regex engine with global match and replace

RegExp Easier CPP interface to PCRE regex engine with global match and replace. I was looking around for better regex engine than regcomp for my C/C++

Yasser Asmi 5 May 21, 2022
Perl Incompatible Regular Expressions library

This is PIRE, Perl Incompatible Regular Expressions library. This library is aimed at checking a huge amount of text against relatively many regular

Yandex 320 Oct 9, 2022
Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Takayuki MATSUOKA 3 Oct 27, 2021
A small implementation of regular expression matching engine in C

cregex cregex is a compact implementation of regular expression (regex) matching engine in C. Its design was inspired by Rob Pike's regex-code for the

Jim Huang 72 Dec 6, 2022
High-performance regular expression matching library

Hyperscan Hyperscan is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre libra

Intel Corporation 4k Jan 1, 2023
A Compile time PCRE (almost) compatible regular expression matcher.

Compile time regular expressions v3 Fast compile-time regular expressions with support for matching/searching/capturing during compile-time or runtime

Hana Dusíková 2.6k Jan 5, 2023
Super Light Regexp engine for C/C++

SLRE: Super Light Regular Expression library Documentation and API reference are at docs.cesanta.com/slre Contributions To submit contributions, sign

Cesanta Software 505 Dec 21, 2022
An implementation of Finite Automata (NFA, DFA) in C++

Automata An implementation of Finite Automata (Nondeterminstic Finite Automata, Determinstic Finite Automata) in C++ Several Automata Algorithms are i

Jack Xia 1 Dec 14, 2021
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

This is the source code repository for RE2, a regular expression library. For documentation about how to install and use RE2, visit https://github.co

Google 7.5k Jan 4, 2023
SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and complex regex expressions.

SRL-CPP SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and co

Telepati 0 Mar 9, 2022
The approximate regex matching library and agrep command line tool.

Introduction TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzz

Ville Laurikari 698 Dec 26, 2022
Easier CPP interface to PCRE regex engine with global match and replace

RegExp Easier CPP interface to PCRE regex engine with global match and replace. I was looking around for better regex engine than regcomp for my C/C++

Yasser Asmi 5 May 21, 2022
A large scale non-linear optimization library

Ceres Solver Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It is a feature rich, matur

null 2.9k Jan 3, 2023
Perl Incompatible Regular Expressions library

This is PIRE, Perl Incompatible Regular Expressions library. This library is aimed at checking a huge amount of text against relatively many regular

Yandex 320 Oct 9, 2022
A C++ Web Framework built on top of Qt, using the simple approach of Catalyst (Perl) framework.

Cutelyst - The Qt Web Framework A Web Framework built on top of Qt, using the simple and elegant approach of Catalyst (Perl) framework. Qt's meta obje

Cutelyst 809 Dec 19, 2022
Similar to C++ streams, but the stream elements are structured JSON data rather than characters.

JIOS : JSON Input Output Streams Similar to C++ streams, but the stream elements are structured JSON data rather than characters. Contents Features [P

Castedo Ellerman 3 Aug 16, 2019
HashLibPlus is a recommended C++11 hashing library that provides a fluent interface for computing hashes and checksums of strings, files, streams, bytearrays and untyped data to mention but a few.

HashLibPlus HashLibPlus is a recommended C++11 hashing library that provides a fluent interface for computing hashes and checksums of strings, files,

Telepati 6 Dec 22, 2022
A simple PoC to demonstrate that is possible to write Non writable memory and execute Non executable memory on Windows

WindowsPermsPoC A simple PoC to demonstrate that is possible to write Non writable memory and execute Non executable memory on Windows You can build i

Lorenzo Maffia 55 Jul 21, 2022