CTPG

C++ Compile Time Parser Generator

A C++ single-header library which takes a language description as C++ code and turns it into an LR(1) table parser with a deterministic finite automaton lexical analyzer, all at compile time. What's more, the generated parser is itself capable of parsing at compile time. All it needs is a C++17 compiler!

Installation

Option 1.

It is a single header library. You can just copy the include/ctpg.hpp header wherever you want.

Option 2.

Use cmake to build the library:

git clone https://github.com/peter-winter/ctpg
cd ctpg
mkdir build
cd build
cmake ..
make
make test       #optionally
make install    #needs appropriate permissions (root)

This option allows integrating CTPG into dependent cmake projects.

Compiler support

Tested on:

  • GCC 10.3, 11.x
  • Clang 12.x, 13.x
  • MSVC 19.30

Usage

The following code demonstrates a simple parser which takes a comma-separated list of integer numbers as an argument and prints their sum.

readme-example.cpp

number("number"); int to_int(std::string_view sv) { int i = 0; std::from_chars(sv.data(), sv.data() + sv.size(), i); return i; } constexpr parser p( list, terms(',', number), nterms(list), rules( list(number) >= to_int, list(list, ',', number) >= [](int sum, char, const auto& n){ return sum + to_int(n); } ) ); int main(int argc, char* argv[]) { if (argc < 2) return -1; auto res = p.parse(string_buffer(argv[1]), std::cerr); bool success = res.has_value(); if (success) std::cout << res.value() << std::endl; return success ? 0 : -1; }">
#include <ctpg/ctpg.hpp>
#include <iostream>
#include <charconv>

using namespace ctpg;
using namespace ctpg::buffers;

constexpr nterm<int> list("list");

constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term 
   number(
   "number");


   int 
   to_int(std::string_view sv)
{
    
   int i = 
   0;
    
   std::from_chars(sv.
   data(), sv.
   data() + sv.
   size(), i);
    
   return i;
}


   constexpr parser 
   p(
    list,
    
   terms(
   ',', number),
    nterms(list),
    rules(
        
   list(number) >=
            to_int,
        list(list, 
   ',', number)
            >= [](
   int sum, 
   char, 
   const 
   auto& n){ 
   return sum + 
   to_int(n); }
    )
);


   int 
   main(
   int argc, 
   char* argv[])
{
    
   if (argc < 
   2)
        
   return -
   1;
    
   auto res = p.
   parse(
   string_buffer(argv[
   1]), std::cerr);
    
   bool success = res.
   has_value();
    
   if (success)
        std::cout << res.
   value() << std::endl;
    
   return success ? 
   0 : -
   1;
}
  

Compile and run:

g++ readme-example.cpp -std=c++17 -o example && example "10, 20, 30"

You should see the output: 60. If incorrect text is supplied as an argument:

g++ readme-example.cpp -std=c++17 -o example && example "1, 2, 3x"

you should see:

[1:8] PARSE: Unexpected character: x

Explanation

Header

#include "ctpg.hpp"

Namespaces

The ctpg namespace is the top-level namespace. There are a couple of feature namespaces, like buffers:

using namespace ctpg;
using namespace ctpg::buffers;

Terminal symbols

Terminal symbols (terms for short) are the atomic building blocks used in a grammar definition. Examples of terms from the C++ language are: identifiers, the '+' operator, various keywords, etc.

To define a term, use one of the char_term, string_term and regex_term classes.

Here is an example of a regex_term with a common integer number regex pattern.

number("number");">
constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term 
   number(
   "number");
  

The constructor argument ("number") indicates a debug name and can be omitted, however it is not advised. Names are handy to diagnose problems with the grammar. If omitted, the name will be set to the pattern string.

Note: the pattern needs to have a static linkage to be allowed as a template parameter. This is C++17 limitation, and CTPG does not support C++20 features yet.

Other types of terms

char_term is used when we need to match a single character, like the + or , operator. string_term is used when we need to match a whole string, like a language keyword.
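
Both can also be defined explicitly as named objects. A minimal sketch (the names are illustrative, and the string_term constructor is assumed here to mirror char_term's, taking the matched text as its argument):

constexpr char_term comma(',');             // matches the single character ','
constexpr string_term kw_return("return");  // assumed: matches the keyword "return"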

Nonterminal symbols

Nonterminal symbols (nonterms for short) are essentially all non-atomic symbols in the grammar. In the C++ language these are things like: expression, class definition, function declaration, etc.

To define a nonterm use the nterm class.

constexpr nterm<int> list("list");  

The constructor argument ("list") is a debug name as well, like in the case of regex_term. The difference is in nterms names are neccessary, because they serve as unique identifiers as well. Therefore it is a requirement that nonterm names are unique.

Template parameter in this case is a value type. More on this concept later.

Parser definition

The parser class, together with its template deduction guides, allows defining parsers using 4 arguments:

  • Grammar root - symbol which is a top level nonterm for a grammar.
  • List of all terms
  • List of all nonterms
  • List of rules

The parser object should be declared as constexpr, which makes all the necessary calculations of the LR(1) parse table happen at compile time.

Let's break down the arguments:

constexpr parser p(
    list,
    terms(',', number),         
    nterms(list),  

Grammar root.

When the root symbol gets matched (in this case list) the parse is successful.

Term list.

List of terms enclosed in a terms call. In our case there are two: number and a ,.

Note: the , term is not defined earlier in the code. It is an implicit char_term: the code implicitly converts the char to the char_term class. Therefore char_terms (as well as string_terms) are allowed not to be defined in advance. Their debug names default to the char (or string) they represent.

Nonterm list.

List of nonterms enclosed in an nterms call. In our case, just a single list nonterm is enough.

Rules

List of rules enclosed in a rules call. Each rule has the form: nonterm(symbols...) >= functor. The nonterm part is called the left side of the rule. The symbols are called the right side.

The right side can contain any number of nterm objects as well as terms (regex_terms, char_terms or string_terms). Terms can appear in their implicit form, like , in the example. Implicit string_terms are written as quoted "strings".

    rules(
        list(number)
            >= to_int,
        list(list, ',', number)
            >= [](int sum, char, const auto& n)
            { return sum + to_int(n); }
    )

The first rule list(number) indicates that the list nonterm can be parsed using a single number regex term.

The second rule uses what's known as left recursion. In other words, a list can be parsed as a list followed by a , and a number.
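
As an aside, implicit string_terms appear in rules directly as quoted literals. A hypothetical sketch, not part of this grammar (the "ret" keyword would also have to be listed in the terms(...) call, just like the implicit ','):

        list("ret", number)
            >= [](std::string_view /* the matched "ret" text */, const auto& n){ return to_int(n); }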

Functors

The functors are any callables that accept exactly as many arguments as there are symbols on the right side and return the value type of the left side. The nth argument needs to accept a value of the value type of the nth right-side symbol.

So in the case of the first to_int functor, it is required to accept a value type of regex_term and return an int.

The second functor is a lambda which accepts 3 arguments: an int for the list, a char for the , and auto for whatever is passed as a value type for the regex_term.

Note: Functors are called in a way that allows taking advantage of move semantics, so defining their arguments as rvalue references is encouraged.

Value types for terms

Terms, unlike nonterms (which have their value types defined as a template parameter of the nterm definition), have their value types predefined: term_value<char> for a char_term, and term_value<std::string_view> for both regex_term and string_term.

The term_value class template is a simple wrapper that is implicitly convertible to its template parameter (either a char or a std::string_view). That's why, when providing functors, we can simply declare arguments as either a char or a std::string_view. In our case the to_int functor has a std::string_view argument, which accepts a term_value just fine. Of course an auto in a lambda will always do the trick.

The advantage of declaring functor arguments as term_value specializations is that we can access other features (like source tracking) using the term_value methods.
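
For example, the to_int functor from the example could be written to take the wrapper type directly and report where the number was found (a sketch; the to_int_traced name is illustrative):

int to_int_traced(term_value<std::string_view> v)
{
    std::cout << "number at " << v.get_sp() << std::endl;  // source point (line:column) of the matched text
    return to_int(v);                                      // term_value converts implicitly to std::string_view
}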

Parse method call

Use the parse method with 2 arguments:

  • a buffer
  • an error stream

Buffers

Use a string_buffer from the buffers namespace to parse a null-terminated string or a std::string.

Error stream

A stream reference like std::cerr or any other std::ostream can be passed as the stream argument. This is where the parse method emits error messages, such as syntax errors.

auto res = p.parse(string_buffer(argv[1]), std::cerr);

Parse return value

The parse method returns a std::optional<T>, where T is the value type of the root symbol. Use .has_value() and .value() to check and access the result of the parse.

Note: White space characters are skipped by default between consecutive terms.

Compile time parsing

The example code can easily be changed to perform actual compile time parsing. First, all the functors need to be constexpr. To achieve this, change the to_int function to:

constexpr int to_int(std::string_view sv)
{
    int sum = 0;
    for (auto c : sv) { sum *= 10; sum += c - '0'; }
    return sum;
}

The function is now constexpr, and the <charconv> header is no longer needed.

Note: To allow constexpr parsing all of the nonterm value types have to be literal types.

Also change main to use a cstring_buffer and declare the parse result constexpr. The error stream argument is not available in constexpr parsing.

int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        constexpr char example_text[] = "1, 20, 3";

        constexpr auto cres = p.parse(cstring_buffer(example_text)); // notice cstring_buffer and no std::cerr output
        std::cout << cres.value() << std::endl;
        return 0;
    }

    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}

Now when no argument is passed to the program, it prints the compile time result of parsing "1, 20, 3".

g++ readme-example.cpp -std=c++17 -o example && example

should print the number 24.

Invalid input in constexpr parsing

If the example_text variable contained invalid input, the cres.value() call would throw, because cres would be a std::optional with no value.

Changing the parse call to:

constexpr int cres = p.parse(cstring_buffer(example_text)).value();

would cause a compilation error, because throwing std::bad_optional_access is not allowed in a constant expression.
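
One way to turn such errors into deliberate compile-time checks (a sketch using the parser defined above) is a static_assert on the result:

constexpr auto cres = p.parse(cstring_buffer("1, 20, 3"));
static_assert(cres.has_value(), "input rejected by the grammar");
static_assert(cres.value() == 24, "unexpected parse result");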

LR(1) parser

CTPG uses an LR(1) parser. The name is short for left-to-right with 1 symbol of lookahead.

Algorithm

The parser uses a parse table which somewhat resembles a state machine. Here is pseudocode for the algorithm:

struct entry
   int next          // valid if shift
   int rule_length   // valid if reduce
   int nterm_nr      // valid if reduce
   enum kind { success, shift, reduce, error }

bool parse(input, sr_table[states_count][terms_count], goto_table[states_count][nterms_count])
   state = 0
   states.push(state)
   needs_term = true

   while (true)
      if (needs_term)
         term_nr = get_next_term(input)
      entry = sr_table[state, term_nr]
      kind = entry.kind

      if (kind == success)
         return true

      else if (kind == shift)
         needs_term = true
         state = entry.next
         states.push(state)
         continue

      else if (kind == reduce)
         needs_term = false            // keep the current term as the lookahead
         states.pop_n(entry.rule_length)
         state = states.top()
         state = goto_table[state, entry.nterm_nr]
         states.push(state)            // enter the state reached by the goto
         continue

      else
         return false

The parser contains a state stack, which grows when the algorithm encounters a shift operation and shrinks on a reduce operation.

Aside from the state stack, there is also a value stack dedicated to parse result calculation. Each shift pushes a value onto it, and each reduce calls the appropriate functor with values from the value stack, removing those values and replacing them with a single value associated with the rule's left side.
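
To make the value stack's role concrete, here is a minimal sketch (plain ints for simplicity; none of these names come from the library) of what a reduce by the rule list(list, ',', number) does:

#include <vector>

// Reducing list(list, ',', number): pop one value per right-side symbol,
// call the rule's functor with them, push the single result back.
void reduce_list_rule(std::vector<int>& value_stack)
{
    int number = value_stack.back(); value_stack.pop_back();
    value_stack.pop_back();                  // the value of ',' is discarded
    int sum = value_stack.back(); value_stack.pop_back();
    value_stack.push_back(sum + number);     // functor result becomes the value of the left-side nonterm
}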

Table creation

This topic is out of the scope of this manual. There is plenty of material online on LR parsers. Recommended book on the topic: Compilers: Principles, Techniques and Tools.

Conflicts

There are situations (parser states) in which, when a particular term is encountered on the input, there is an ambiguity regarding the operation the parser should perform.

In other words, a language grammar may be defined in such a way that both shift and reduce can lead to a successful parse result; however, the result will be different in each case.

Example 1

Consider a classic expression parser (functors omitted for clarity):

constexpr parser p(
    expr,
    terms('+', '*', number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '*', expr)
    )
);

Consider the input 2 + 2 * 2 being parsed, with the parser in a state after successfully matching 2 + 2 and encountering the * term.

Both shifting the * term and reducing by the rule expr(expr, '+', expr) would be valid, but they would produce different results. This is the classic operator precedence case, and this conflict needs to be resolved somehow. This is where precedence and associativity come into play.

Precedence and associativity

CTPG parsers can resolve such conflict based on precedence and associativity rules defined in a grammar.

The example above can be fixed by explicit term definitions.

Normally, char_terms can be introduced by implicit definition in the terms call. However, when a precedence needs to be defined, an explicit definition is required.

Simply change the code to:

constexpr char_term o_plus('+', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);   // precedence set to 2

constexpr parser p(
    expr,
    terms(o_plus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),      // note: no need for o_plus and o_mul in the rules, however possible
        expr(expr, '*', expr)
    )
);

The higher the precedence value, the higher the term's precedence. The default term precedence is 0.

This explicit precedence definition gives the * operator higher precedence than +.

Example 2

constexpr char_term o_plus('+', 1);  // precedence set to 1
constexpr char_term o_minus('-', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);   // precedence set to 2

constexpr parser p(
    expr,
    terms(o_plus, o_minus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '-', expr),   // extra rule allowing binary -
        expr(expr, '*', expr),
        expr('-', expr)          // extra rule allowing unary -
    )
);

Binary - and + operators have the same precedence in pretty much all languages. Unary -, however, almost always has higher precedence than all binary operators. We can't achieve this by simply defining the precedence of - in the char_term definition. We need a way to state that expr('-', expr) has higher precedence than all the binary rules.

To achieve this, override the term's precedence with a rule precedence, changing:

expr('-', expr) to expr('-', expr)[3]

The [] operator does exactly this: it explicitly sets the rule precedence so the parser does not have to deduce it from a term.

So the final code looks like this:

constexpr char_term o_plus('+', 1);  // precedence set to 1
constexpr char_term o_minus('-', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);   // precedence set to 2

constexpr parser p(
    expr,
    terms(o_plus, o_minus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '-', expr),   // extra rule allowing binary -
        expr(expr, '*', expr),
        expr('-', expr)[3]       // extra rule allowing unary -, with biggest precedence
    )
);
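
For completeness, a sketch of the functors that were omitted above for clarity, assuming expr is declared as nterm<int>, to_int is the conversion from the first example, and the [] precedence marker composes with >= as shown:

constexpr parser p(
    expr,
    terms(o_plus, o_minus, o_mul, number),
    nterms(expr),
    rules(
        expr(number)          >= to_int,
        expr(expr, '+', expr) >= [](int l, auto, int r){ return l + r; },
        expr(expr, '-', expr) >= [](int l, auto, int r){ return l - r; },
        expr(expr, '*', expr) >= [](int l, auto, int r){ return l * r; },
        expr('-', expr)[3]    >= [](auto, int x){ return -x; }
    )
);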

Example 3

Consider the final code and let's say the input is 2 + 2 + 2. The parser has read 2 + 2 and is about to read the second +. What is the required behaviour in this case? Should the first 2 + 2 be reduced, or should the second + be shifted? (This may not matter for integer calculations, but it can make a big difference in situations like expression type deduction in C++ when operator overloading is involved.)

This is the classic associativity case, which can be solved by explicitly defining the term associativity.

There are 3 types of associativity available: left to right, right to left and not associative as the default.

To explicitly define a term associativity change the term definitions to:

constexpr char_term o_plus('+', 1, associativity::ltor);
constexpr char_term o_minus('-', 1, associativity::ltor);
constexpr char_term o_mul('*', 2, associativity::ltor);

Now all of these operators are left associative, meaning reduce will be preferred over shift.

Should the associativity be defined as associativity::rtol, shift would be preferred.

With no associativity defined, shift is preferred by default.

Precedence and associativity summary

When a shift reduce conflict is encountered these rules apply in order:

Let r be the rule that is a candidate for reduction and t the term encountered on input.

  1. When the explicit precedence of r (from the [] operator) is higher than the precedence of t, perform a reduce.
  2. When the precedence of the last term in r is higher than the precedence of t, perform a reduce.
  3. When the precedence of the last term in r equals the precedence of t and that last term is left associative, perform a reduce.
  4. Otherwise, perform a shift.

Reduce - reduce conflicts

In some cases the grammar is ill formed and the parser contains a state in which there is an ambiguity between several reduce actions.

Consider example:

special_op("op"); constexpr parser p( op, terms('!', '*', '+'), nterms(special_op, op), rules( special_op('!'), op('!'), op('*'), op('+'), op(special_op) ) );">
constexpr nterm<char> op("op");
constexpr nterm<char> special_op("special_op");

constexpr parser p(
    op,
    terms('!', '*', '+'),
    nterms(special_op, op),
    rules(
        special_op('!'),
        op('!'),
        op('*'),
        op('+'),
        op(special_op)
    )
);

Let's say we parse an input !. The parser has no way of telling if it should reduce using rule special_op('!') or op('!').

This is an example of reduce/reduce conflict and such parser behaviour should be considered undefined.

There is a diagnostic tool included in CTPG which detects such conflicts so they can be addressed.
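
One way to remove the ambiguity (a sketch that changes the grammar rather than the parser) is to produce '!' only through special_op, so that no state ends up with two competing reductions:

constexpr parser p_fixed(
    op,
    terms('!', '*', '+'),
    nterms(special_op, op),
    rules(
        special_op('!'),
        op('*'),
        op('+'),
        op(special_op)
    )
);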

Functors - advanced

Consider a parser matching whitespace-separated names (strings).

name("name"); using name_type = std::string_view; using list_type = std::vector ; constexpr nterm list("list"); constexpr parser p( list, terms(name), nterms(list), rules( list(), list(list, name) ) );">
constexpr char pattern[] = "[a-zA-Z0-9_]+";
constexpr regex_term 
     name(
     "name");

     using name_type = std::string_view;

     using list_type = std::vector
     
      ;

      constexpr nterm
       
       list(
       "list");


       constexpr parser 
       p(
    list,
    
       terms(name),
    nterms(list),
    rules(
        
       list(),
        list(list, name)
    )
);
      
     
    

How exactly would the functors look for this kind of parser?

The first rule list() is an example of an empty rule. This means the list can be reduced from no input.

Since the rule's left side is a list the functor needs to return its value type, which is a list_type. The right side is empty so the functor needs to have no arguments.

So let's return an empty vector: [](){ return list_type{}; }

The second rule reduces a list from a name and a list, therefore the functor needs to accept:

  • list_type for the first argument: list
  • term_value<std::string_view> for the second argument: name
  • return a list_type

So let's create a functor:

[](auto&& list, auto&& name){ list.emplace_back(std::move(name)); return list; }

The name argument will resolve to term_value<std::string_view>&&, which is convertible to std::string_view&&.

Now the parser looks like this:

name("name"); using name_type = std::string_view; using list_type = std::vector ; constexpr nterm list("list"); constexpr parser p( list, terms(name), nterms(list), rules( list() >= [](){ return list_type{}; }, list(list, name) >= [](auto&& list, auto&& name){ list.push_back(name); return std::move(list); } ) );">
constexpr char pattern[] = "[a-zA-Z0-9_]+";
constexpr regex_term 
     name(
     "name");

     using name_type = std::string_view;

     using list_type = std::vector
     
      ;

      constexpr nterm
       
       list(
       "list");


       constexpr parser 
       p(
    list,
    
       terms(name),
    nterms(list),
    rules(
        
       list()
            >= [](){ 
       return list_type{}; },
        
       list(list, name)
            >= [](
       auto&& list, 
       auto&& name){ list.
       push_back(name); 
       return 
       std::move(list); }
    )
);
      
     
    

Note: Here we take advantage of move semantics which are supported in the functor calls. This way we are working with the same std::vector instance we created as empty using the first rule.

Important note: It is possible for functors to have reference (both const and non-const) argument types; however, the lifetime of the objects passed to functors ends immediately after the functor returns. So it is better to avoid using reference types as nterm value types.

Functor helpers

There are a couple of handy, ready-to-use functor templates:

val

Use when a functor needs to return a fixed value which doesn't depend on the right-side symbols:

using namespace ctpg::ftors;
constexpr nterm<bool> binary("binary");

constexpr parser p(
    binary,
    terms('0', '1', '&', '|'),
    nterms(binary),
    rules(
        binary('0')
            >= val(false),
        binary('1')
            >= val(true),
        binary(binary, '&', binary)
            >= [](bool b1, auto, bool b2){ return b1 & b2; },
        binary(binary, '|', binary)
            >= [](bool b1, auto, bool b2){ return b1 | b2; }
    )
);   

create

Use when a functor needs to return a default value of a given type:

// word list parser from one of previous examples

using namespace ctpg::ftors;

constexpr parser p(
    list,
    terms(name),
    nterms(list),
    rules(
        list()
            >= create<list_type>{},    // use instead of a lambda
        list(list, name)
            >= [](auto&& list, auto&& name){ list.push_back(name); return std::move(list); }
    )
);

element placeholders

Use whenever a rule simply passes the nth element from the right side through:

number("number"); constexpr to_int(std::string_view x){ /*implement*/ } constexpr nterm expr("expr"); constexpr parser p( expr, terms('+', '(', ')', number), nterms(expr), rules( expr(number) >= to_int, expr(expr, '+', expr) >= [](int i1, auto, int i2){ return i1 + i2; }, expr('(', expr, ')') >= _e2 // here, just return the second element ) );">
using namespace ctpg::ftors;

constexpr char pattern[] = "[1-9][0-9]*";
constexpr regex_term 
    number(
    "number");


    constexpr 
    to_int(std::string_view x){ 
    /*implement*/ }


    constexpr nterm<
    int> 
    expr(
    "expr");


    constexpr parser 
    p(
    expr,
    
    terms(
    '+', 
    '(', 
    ')', number),
    nterms(expr),
    rules(
        
    expr(number)
            >= to_int,
        expr(expr, 
    '+', expr)
            >= [](
    int i1, 
    auto, 
    int i2){ 
    return i1 + i2; },
        
    expr(
    '(', expr, 
    ')')
            >= _e2      
    // here, just return the second element
    )
);
   

list helpers

Use push_back or emplace_back when dealing with common list tasks.

The push_back helper calls push_back on the first element, passing the second element as the argument:

list(list, element) >= push_back<1, 2>{}

The emplace_back helper works similarly but supports move semantics.

// word list parser from one of previous examples

using namespace ctpg::ftors;

constexpr parser p(
    list,
    terms(name),
    nterms(list),
    rules(
        list()
            >= create<list_type>{},
        list(list, name)
            >= push_back<1, 2>{}
    )
);

Both push_back and emplace_back are class templates which take two indexes (one-indexed, like the element placeholders), denoting the element numbers of the list and of the value to append.

So in the case of comma-separated numbers, we can simply use:

rules(
    list() >= create<std::vector<int>>{},
    list(list, ',', number) >= push_back<1, 3>{}   // 1 is the list, 3 is the number
)

Default functors

There is a situation where the functor can be omitted entirely: whenever the left-side value type is move constructible from the right-side value types:

word("word"); constexpr char protocol_pattern[] = "http://|https://"; constexpr regex_term protocol("protocol"); using list_type = std::vector ; struct url_type { constexpr url_type(std::string_view pr, list_type&& l): pr(pr), l(std::move(l)) {} std::string_view pr; list_type l; }; constexpr nterm url; constexpr nterm list; constexpr parser p( url, terms(word, '.', protocol), nterms(url, list), rules( list(word) >= [](auto w){ return list_type{w}; }, list(list, '.', word) >= [](auto&& l, auto, auto w){ l.push_back(w); return std::move(l); }, url(protocol, list) // skip functor entirely, url_type move constructible from right side value types ) );">
// Example parser, accepts url addresses in for of a protocol and a list of words, like: https://www.example.com

constexpr char word_pattern[] = "[0-9A-Za-z]+";
constexpr regex_term 
       word(
       "word");

       constexpr 
       char protocol_pattern[] = 
       "http://|https://";

       constexpr regex_term
        
        protocol(
        "protocol");


        using list_type = std::vector
        
         ;

         struct 
         url_type
{
    
         constexpr 
         url_type(std::string_view pr, list_type&& l):
         pr(pr), l(std::move(l))
    {}
    std::string_view pr;
    list_type l;
};


         constexpr nterm
         
           url;

          constexpr nterm
          
            list;


           constexpr parser 
           p(
    url,
    
           terms(word, 
           '.', protocol),
    nterms(url, list),
    rules(
        
           list(word)
            >= [](
           auto w){ 
           return list_type{w}; },
        
           list(list, 
           '.', word)
            >= [](
           auto&& l, 
           auto, 
           auto w){ l.
           push_back(w); 
           return 
           std::move(l); },
        
           url(protocol, list)
            
           // skip functor entirely, url_type move constructible from right side value types
    )
);
          
         
        
       
      

Various features

Parse options

To change the parse options simply provide a parse_options instance to the parse call:

p.parse(parse_options{}, cstring_buffer("abc"), std::cerr);

To set a particular option, use one of the set_xxx methods:

p.parse(parse_options{}.set_verbose(), cstring_buffer("abc"), std::cerr);

Note: The set_xxx methods return the parse_options instance (*this), so they can be chained together.

The list of available parse options:

  • set_skip_whitespace(bool value)

By default the parser skips whitespace characters between terms; this can be changed using this option.

  • set_verbose(bool value)

Sets the parser to verbose mode. More on this in the Verbose output section.
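
Both options can be combined in one call thanks to chaining (a sketch, reusing the example call shown above):

p.parse(parse_options{}.set_verbose().set_skip_whitespace(false), cstring_buffer("abc"), std::cerr);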

Verbose output

To allow verbose output for debugging purposes call parse method with such arguments:

p.parse(parse_options{}.set_verbose(), cstring_buffer("abc"), std::cerr);

The default parse_options is extended with the set_verbose call, thus changing the verbosity option. The last argument can be anything convertible to a std::ostream reference.

The verbose output stream contains, alongside the usual syntax errors, the detailed process of syntax and lexical analysis. The shift and reduce actions are written to the output, which is useful together with the Diagnostics information. The lexical analyzer DFA actions are also printed, again useful during diagnostics.

Source tracking

Source tracking is a feature that makes the parser keep track of the source point (that is, line and column) it is currently at. This feature is always available, and source point information is attached to every term value that is passed to a functor.

To use this information, make the functor accept term_value arguments for the terms.

For char_terms the value type is term_value<char>; for both string_term and regex_term the value type is term_value<std::string_view>. Each of these types has a source_point member that can be accessed using the get_sp() method.

The source_point struct has line and column public members and can be written to a stream using the << operator.

This is an example of a parser that accepts whitespace-separated words and stores them in a collection together with their source points.

Take a look at the functor that uses both the value and the source point of a word through its const auto& w argument, by calling get_value() and get_sp() respectively.

#include "ctpg.hpp"
#include <iostream>

using namespace ctpg;
using namespace ctpg::ftors;
using namespace ctpg::buffers;

struct word_t
{
    std::string w;
    source_point sp;
};

using text_t = std::vector<word_t>;

auto&& add_word(text_t&& txt, std::string_view sv, source_point sp)
{
    txt.push_back(word_t{std::string(sv), sp});
    return std::move(txt);
}

constexpr char word_pattern[] = "[A-Za-z]+";
constexpr regex_term<word_pattern> word("word");

constexpr nterm<text_t> text("text");

constexpr parser p(
    text,
    terms(word),
    nterms(text),
    rules(
        text() >= create<text_t>{},
        text(text, word) >= [](auto&& txt, const auto& w) { return add_word(std::move(txt), w.get_value(), w.get_sp()); }
    )
);

int main(int argc, char* argv[])
{
    if (argc < 2)
        return -1;
    auto res = p.parse(string_buffer(argv[1]), std::cout);
    if (res.has_value())
    {
        for (const auto& w : res.value())
        {
            std::cout << w.w << " at " << w.sp << std::endl;
        }
    }
    return 0;
}

Buffers

There are currently two types of buffers available: cstring_buffer, useful for constexpr parsing of a static array-like buffer, and string_buffer for runtime parsing.

It is, however, easy to add custom types of buffers; there are just a couple of requirements for a type to be eligible as a buffer.

The buffer needs to expose a public iterator type, obtainable from begin and end methods which return iterators to the start and past the end of the buffer, respectively.

The get_view member should return a std::string_view given two iterators, one at the start of the view and the other past the end.

iterator begin() const { return iterator{ data }; }
iterator end() const { return iterator{ data + N - 1 }; }
std::string_view get_view(iterator start, iterator end) const

The iterator type should expose the following public member methods:

char operator *() const;      // dereference to the pointed char
iterator& operator ++();      // pre- and post-increment
iterator operator ++(int);
bool operator == (const iterator& other) const;    // comparison operator
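
A minimal sketch of a custom buffer wrapping a std::string_view (all names here are illustrative and not part of the library; only the requirements listed above are implemented):

#include <cstddef>
#include <string_view>

class sv_buffer
{
public:
    struct iterator
    {
        const char* ptr;
        char operator *() const { return *ptr; }                               // dereference to the pointed char
        iterator& operator ++() { ++ptr; return *this; }                        // pre-increment
        iterator operator ++(int) { iterator tmp(*this); ++ptr; return tmp; }   // post-increment
        bool operator == (const iterator& other) const { return ptr == other.ptr; }
    };

    explicit sv_buffer(std::string_view sv) : data(sv) {}

    iterator begin() const { return iterator{ data.data() }; }
    iterator end() const { return iterator{ data.data() + data.size() }; }

    std::string_view get_view(iterator start, iterator end) const
    {
        return std::string_view(start.ptr, static_cast<std::size_t>(end.ptr - start.ptr));
    }

private:
    std::string_view data;
};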

Typed terms

It is possible to define term value types as custom types, not limited to char or std::string_view. This can be achieved using the typed_term class template.

Wrap the usual term definition:

char_term plus('+');

with typed_term like this:

// a custom type for the plus term
struct plus_tag{};

typed_term plus(char_term('+'), create<plus_tag>{});

create is a functor template available in the ctpg::ftors namespace which simply creates an object of the given type using its default constructor, ignoring all arguments passed to it. In fact, any callable object which accepts a std::string_view can be used instead of create; this is just an example. The plus term's value type is identical to the return type of the functor, plus_tag in this case.

Take a look at typed-terms.cpp in the examples; it uses this feature to create a simple calculator, but instead of the runtime switch statement on the char value as in simple-expr-parser.cpp, the functor object has an overload for each arithmetic operator.

Note: Typed terms cannot be used in their implicit versions in rules like the basic terms (char_term, string_term) can. They have to be referenced through the typed_term object.
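
A sketch of how the typed term could then appear in a rule (assuming expr is an nterm<int>); the functor's second parameter receives the plus_tag value produced by create:

expr(expr, plus, expr)
    >= [](int l, plus_tag, int r){ return l + r; }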

Error recovery

If the special error term is used in a rule, the parser tries to recover from syntax errors.

Consider the error-recovery.cpp example (simplified here):

constexpr parser p(
    exprs,
    terms(number, o_plus, ';'),
    nterms(exprs, expr),
    rules(
        exprs() >= create<std::vector<int>>{},
        exprs(exprs, expr, ';') >= push_back<1, 2>{},
        exprs(exprs, error, ';') >= _e1,
        expr(expr, '+', expr) >= [](int x1, skip, int x2){ return x1 + x2; },
        expr(number) >= [](const auto& sv){ return get_int(sv); }
    )
);

This rule allows the parser to recover from a syntax error when an expression is ill formed; the _e1 functor simply passes on the expressions parsed up to this point:

exprs(exprs, error, ';') >= _e1,

Recovery follows these rules:

  • When a syntax error occurs, the special error term is presented to the LR algorithm.
  • Parser states are reverted (popped from the stack) until a state accepting the error term is encountered.
    • If at any point there are no more states to pop, the algorithm fails.
  • The error term is shifted, and the shift action is performed.
  • Terminals are consumed and ignored until a terminal which would not result in a syntax error is encountered.
    • If at any point the end of input is encountered, the algorithm fails.

To see how the error term in rules affects parse table generation, take a look at the diagnostic output and look for its occurrences. See the Diagnostics section for details.

Regular expressions

When defining a regex pattern for a regex term:

number("number");">
constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term 
   number(
   "number");
  

use the following supported features (in precedence descending order):

Feature                      Example      Meaning
Single char                  a            character a
Escaped char                 \|           character |
Escaped char (hex)           \x20         space character
Any char                     .            any character
Char range                   [a-z]        lower case letter
Char set                     [abc]        a, b, or c character
Inverted char range          [^a-z]       everything but a lower case letter
Inverted char set            [^!]         everything but the ! character
Complex char set             [_a-zA-Z]    any letter or an underscore character
Concatenation                ab           characters a and b in order
Repetition (zero or more)    a*           zero or more a characters
Repetition (one or more)     a+           one or more a characters
Optional                     a?           an optional a character
Repetition (defined number)  a{4}         4 a characters
Alternative                  a|b          an a character or a b character
Grouping                     (a|b)*       any number of a or b characters
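
A few illustrative patterns combining these features (assumptions for demonstration only, not taken from the library or its examples):

constexpr char identifier_pattern[] = "[_a-zA-Z][_a-zA-Z0-9]*";   // complex char set + repetition
constexpr char hex_pattern[]        = "0x[0-9a-fA-F]+";           // concatenation + one-or-more
constexpr char float_pattern[]      = "[0-9]+(\\.[0-9]+)?";       // grouping + optional + escaped char

constexpr regex_term<identifier_pattern> identifier("identifier");
constexpr regex_term<hex_pattern> hex_number("hex_number");
constexpr regex_term<float_pattern> real_number("real_number");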

Diagnostics

To diagnose problems in the parser, use the write_diag_str method, which writes the parser state machine details to a stream:

p.write_diag_str(std::cerr)

The output contains 2 sections: one for the syntax analyzer, starting with the word PARSER, and one for the lexical analyzer, starting with LEXICAL ANALYZER.

Parser section

Parser Object size: 51576

The first piece of information in the section is the size of the parser object. This can easily be a couple of megabytes for complex grammars, so consider declaring the parser as a static object rather than on the local stack.

Next there is a state machine description in form of:

STATE nr

followed by a description of all possible situations the parser can be in while in this state. Each situation refers to a single rule and has the form:

nterm <- s0 s1 s2 ... s(n) . s(n+1) ... s(rule_length) ==> lookahead_term

The nterm is the name of the left side nonterm, s0, s1 ... are right side symbols from the same rule.

The . after the s(n) means the parser is done matching the part of the rule before the . (all the symbols before the .).

The lookahead_term is the term expected after the whole rule is matched. If the parser encounters the lookahead term after the rule is matched, the reduce operation is performed.

After the situations there is an action list (in order: goto actions, shift actions and reduce actions):

On <nterm> goto <state nr>
...
On <term> shift <state nr>
...
On <term> reduce using (<rule nr>)
...

Goto and shift actions are basically the same; the only difference is that a goto action refers to a nonterm and a shift action to a term.

They both relate to the . in the situation: given the symbol after the . in this state, the parser goes to a new state with the . advanced past that symbol.

Reduce actions occur when the whole rule is matched, hence reduce actions are present only when the state contains a situation with the . at the end. What the action means is: given this term as the lookahead, reduce using the rule with the indicated number. Rules are numbered according to their appearance in the source code (in the rules call during the parser definition), starting from 0.

Conflicts

Shift/reduce conflicts are presented with lines:

On <term> shift to <state nr>  S/R CONFLICT, prefer reduce(<rule nr>) over shift
On <term> shift to <state nr>  S/R CONFLICT, prefer shift over reduce(<rule nr>)

Reduce/reduce conflicts look like this:

On <term> R/R CONFLICT - !!! FIX IT !!!

Lexical analyzer section

This section contains the deterministic finite automaton which corresponds to all of the terms used in the grammar.

Each line represents a single machine state:

STATE <nr> [recognized <term>] { <char> -> <state nr> } { <char> -> <state nr> } ...

The [recognized <term>] part is optional and means that the DFA in this state could return the recognized term; however, it tries to match the longest possible input, so it continues consuming characters. When it reaches an error state (no new state for the character), the last recognized term is returned, or an 'unexpected character' error occurs if no term has been recognized so far.

The <char> -> <state nr> part represents a DFA transition on a character described by <char>. Character descriptions are in the form of a single printable character or, in the case of a non-printable one, its hex representation, like 0x20 for the space character. Character descriptions can also contain a character range in the form [start-end].

There will be unreachable states in the form:

STATE <nr> (unreachable)

These are leftovers from the regular expression to DFA conversion, just ignore them.

Comments
  • Fix up CMake build and CI

    Fix up CMake build and CI

    There were a number of issues with the CMake build before. This PR fixes all of them. They are:

    1. There was no way to disable warnings-as-errors.
    2. The standard BUILD_TESTING variable wasn't honored.
    3. FetchContent and add_subdirectory users got the project's install rules by default.
    4. CI didn't build the examples (and they didn't have a CMakeLists.txt to begin with).
    5. The project searched for a C compiler it never used (C++ only)
    6. The path for installing the generated CMake package files wasn't independently configurable (important for Debian, vcpkg, and surely others)
    7. The project would incorrectly find Catch2 version 3, when only 2 is supported.
    8. The package set a variable for the include path.
    9. The language example unnecessarily used include_directories.
    10. The usage of #include <ctpg.hpp> and #include <ctpg/ctpg.hpp> was inconsistent.
    11. There was no ctpg::ctpg alias for FetchContent and add_subdirectory users.

    Happy to work with you to make these changes more palatable.

    opened by alexreinking 13
  • Lifetime Warning from VC++

    Lifetime Warning from VC++

    Not sure if this is a valid warning, but here it is:

    Severity	Code	Description	Project	File	Line	Suppression State
    Warning	C26800	Use of a moved from object: ''(*start)'' (lifetime.1).	scratchpad	D:\Ben\Dev\ctpg-master\include\ctpg.hpp	2503	
    
    
    opened by BenHanson 8
  • [question] Cant Parameterize and

    [question] Cant Parameterize and "Seperate out" Parser

    I am trying to split apart (and parameterize) my parser, something like this..

    template< 
            auto ConvertToTypeConstantParameter, 
            typename OperhandTypeParameterType, 
            auto DataTypeNameParameterConstant, 
            auto& LiteralRegexParameterConstant, 
                    // = defualt_constant_expression_literal_regex, 
            auto& IdentifierRegexParameterConstant, 
                    // = defualt_constant_expression_identifier_regex, 
            auto PlusCharacterParameterConstant = '+', 
            auto MinusCharacterParameterConstant = '-', 
            auto MultiplyCharacterParameterConstant = '*', 
            auto DivideCharacterParameterConstant = '/', 
            auto LeftParanethesisCharacterParameterConstant = '(', 
            auto RightParanethesisCharacterParameterConstant = ')', 
            auto& IdentifierNameParameterConstant 
                    = identifier_name_string, 
            auto& FactorNameParameterConstant 
                    = factor_name_string, 
            auto& SumNameParameterConstant 
                    = sum_name_string, 
            auto& ParanthesisScopeParameterConstant 
                    = parenthesis_scope_name_string 
        >
    struct ConstantExpression
    {
        using DataType = OperhandTypeParameterType;
        constexpr static auto data_type_name = DataTypeNameParameterConstant;
    
        constexpr static auto literal_term = ctpg::regex_term< LiteralRegexParameterConstant >{ 
                "num"
            };
    
        constexpr static auto identifier = ctpg::regex_term< IdentifierRegexParameterConstant >{ 
                "id"
            };
    
        constexpr static auto factor = ctpg::nterm< OperhandTypeParameterType >{ FactorNameParameterConstant };
        constexpr static auto sum = ctpg::nterm< OperhandTypeParameterType >{ SumNameParameterConstant };
        constexpr static auto parenthesis_scope = ctpg::nterm< OperhandTypeParameterType >{ ParanthesisScopeParameterConstant };
    
        constexpr static auto nterms = ctpg::nterms( 
                factor, sum, parenthesis_scope 
            );
    
        constexpr static auto plus_term = ctpg::char_term{ 
                PlusCharacterParameterConstant, 
                1, 
                ctpg::associativity::ltor 
            };
        constexpr static auto minus_term = ctpg::char_term{ 
                MinusCharacterParameterConstant, 
                1, 
                ctpg::associativity::ltor 
            };
        constexpr static auto multiply_term = ctpg::char_term{ 
                MultiplyCharacterParameterConstant, 
                2, 
                ctpg::associativity::ltor 
            };
        constexpr static auto divide_term = ctpg::char_term{ 
                DivideCharacterParameterConstant, 
                2, 
                ctpg::associativity::ltor 
            };
        constexpr static auto left_parenthesis_term = ctpg::char_term{ 
                LeftParanethesisCharacterParameterConstant, 
                3, 
                ctpg::associativity::ltor 
            };
        constexpr static auto right_parenthesis_term = ctpg::char_term{ 
                RightParanethesisCharacterParameterConstant, 
                3, 
                ctpg::associativity::ltor 
            };
    
        constexpr static auto terms = ctpg::terms(
                plus_term, 
                minus_term, 
                multiply_term, 
                divide_term, 
                left_parenthesis_term, 
                right_parenthesis_term 
            );
    
        constexpr static auto rules = ctpg::rules( 
                factor( literal_term ) >= ConvertToTypeConstantParameter, 
                factor( factor, multiply_term, literal_term ) 
                    >= []( size_t current_factor, auto, const auto& next_token ) {  
                            return current_factor * ConvertToTypeConstantParameter( next_token ); 
                        }, 
                //More rules...
            );
    };
    

    This way, for multiple data-types I can make more constant expresssion regexes for different data-types (and for different types of literals with different regexes e.g natural number [0-9][0-9]* vs float [0-9]*.[0-9]+).

    The problem comes when I try to use a literal, it seems to have trouble accepting the string for the regex in the literal_term

    I have tried multiple methods, at first all the parameters were just plain auto, and I tried serializing the strings into templates, then injecting them back into an array type with the size known at compile time (as a bunch of errors with the size of the type not being known at compile time come up otherwise).

    template< auto FirstConstantParameter, auto... SeriesConstantParameters >
    struct TakeOneFromTemplateSeries {
        constexpr static auto first = FirstConstantParameter;
        using NextType = TakeOneFromTemplateSeries< SeriesConstantParameters... >;
    };
    
    template< auto... ElementParameterConstants >
    struct RawTemplateArray
    {
        using ElementType = decltype( 
                TakeOneFromTemplateSeries< ElementParameterConstants... >::first 
            );
        constexpr static auto size = sizeof...( ElementParameterConstants );
        constexpr static ElementType array[ size ] = { ElementParameterConstants... };
        constexpr static ElementType* pointer = array;
    };
    
    template< 
            auto ArrayParameterConstant, 
            size_t IndexParameterConstant, 
            size_t ArrayLengthParameterConstant, 
            auto... ElementParameterConstants 
        >
    struct ToRawTemplateArrayImplementation
    {
        using ResultType = typename ToRawTemplateArrayImplementation< 
                ArrayParameterConstant, 
                IndexParameterConstant + 1, 
                ArrayLengthParameterConstant, 
                ElementParameterConstants..., 
                ArrayParameterConstant[ IndexParameterConstant % ArrayLengthParameterConstant ] 
            >::ResultType;
    };
    
    template< 
            auto ArrayParameterConstant, 
            size_t IndexParameterConstant, 
            auto... ElementParameterConstants 
        >
    struct ToRawTemplateArrayImplementation< 
            ArrayParameterConstant, 
            IndexParameterConstant, 
            IndexParameterConstant, 
            ElementParameterConstants... 
        >
    {
        using ResultType = RawTemplateArray< 
                ElementParameterConstants... 
            >;
    };
    
    template< auto ArrayParameterConstant >
    struct ToRawTemplateArray
    {
        using ResultType = typename ToRawTemplateArrayImplementation< 
                ArrayParameterConstant, 
                0, 
                // std::strlen( ArrayParameterConstant ) 
                ctpg::utils::str_len( ArrayParameterConstant ) + 1
            >::ResultType;
    };
    
    constexpr static const char natural_number_regex_string[] = "[0-9][0-9]*";
    constexpr static auto natural_number_regex = ToRawTemplateArray< natural_number_regex_string >::ResultType{};
    constexpr static auto defualt_constant_expression_literal_regex = natural_number_regex;
    
    

    I would then pass natural_number_regex into ConstantExpression If I do it like this

    template< 
            auto ConvertToTypeConstantParameter, 
            typename OperhandTypeParameterType, 
            auto& DataTypeNameParameterConstant = decltype( defualt_constant_expression_data_type_name )::array, 
            auto& LiteralRegexParameterConstant = decltype( defualt_constant_expression_literal_regex )::array,
            auto& IdentifierRegexParameterConstant = decltype( defualt_constant_expression_identifier_regex )::array, 
            auto& IdentifierNameParameterConstant = decltype( defualt_constant_expression_identifier_term_name )::array, 
            auto& FactorNameParameterConstant = decltype( defualt_constant_expression_factor_nterm_name )::array, 
            auto& SumNameParameterConstant = decltype( defualt_constant_expression_sum_nterm_name )::array, 
            auto& ParanthesisScopeParameterConstant = decltype( defualt_constant_expression_parenthesis_scope_nterm_name )::array, 
            auto PlusCharacterParameterConstant = '+', 
            auto MinusCharacterParameterConstant = '-', 
            auto MultiplyCharacterParameterConstant = '*', 
            auto DivideCharacterParameterConstant = '/', 
            auto LeftParanethesisCharacterParameterConstant = '(', 
            auto RightParanethesisCharacterParameterConstant = ')', 
        >
    
    

    The compiler would spit out a whole bunch of template barf I have spent a while formatting and trying to make some sense of, clang is giving me something GCC wont

    ctpg.hpp:180:25: warning: ISO C++20 considers use of overloaded operator '==' (with operand types 'ctpg::stdex::cvector<unsigned short, 13>::iterator' and 'ctpg::stdex::cvector<unsigned short, 13>::iterator') to be ambiguous despite there being a unique best viable function [-Wambiguous-reversed-operator]
                while (!(it == end()))
                         ~~ ^  ~~~~~
    /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:2546:25: note: in instantiation of member function 'ctpg::stdex::cvector<unsigned short, 13>::erase' requested here
            ps.cursor_stack.erase(ps.cursor_stack.end() - ri.r_elements, ps.cursor_stack.end());
    
    ng_buffer<12>, ctpg::detail::no_stream>' requested here
            auto res = p.parse(
                         ^
    /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:3285:43: note: in instantiation of function template specialization 'ctpg::regex::analyze_dfa_size<12UL>' requested here
        static const size_t dfa_size = regex::analyze_dfa_size(Pattern);
                                              ^
    /root/workdir/Include/Warp/Expression.hpp:134:42: note: in instantiation of template class 'ctpg::regex_term<natural_number_regex_string>' requested here
        constexpr static auto literal_term = ctpg::regex_term< LiteralRegexParameterConstant >{ 
                                             ^
    /root/workdir/Source/Main.cpp:19:9: note: in instantiation of template class 'ConstantExpression<&to_size_t, unsigned long, natural_number_name_string, natural_number_regex_string, identifier_regex_string, '+', '-', '*', '/', '(', ')', identifier_name_string, factor_name_string, sum_name_string, parenthesis_scope_name_string>' requested here
            NaturalNumberConstantExpression::factor, 
            ^
    /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:127:28: note: ambiguity is between a regular call to this operator and a call with the argument order reversed
                constexpr bool operator == (const it_type& other) const { return cast()->ptr == other.ptr; }
                               ^
    /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:180:25: warning: ISO C++20 considers use of overloaded operator '==' (with operand types 'ctpg::stdex::cvector<unsigned short, 26>::iterator' and 'ctpg::stdex::cvector<unsigned short, 26>::iterator') to be ambiguous despite there being a unique best viable function [-Wambiguous-reversed-operator]
                while (!(it == end()))
                         ~~ ^  ~~~~~
    /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:2546:25: note: in instantiation of member function 'ctpg::stdex::cvector<unsigned short, 26>::erase' requested here
            ps.cursor_stack.erase(ps.cursor_stack.end() - ri.r_elements, ps.cursor_stack.end());
    
    /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:127:28: note: ambiguity is between a regular call to this operator and a call with the argument order reversed
                constexpr bool operator == (const it_type& other) const { return cast()->ptr == other.ptr; }
                               ^
    In file included from /root/workdir/Source/Main.cpp:1:
    In file included from /root/workdir/Include/Warp/Expression.hpp:10:
    In file included from /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:12:
    /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/variant:1100:7: error: static_assert failed due to requirement '__detail::__variant::__exactly_once<ctpg::term_value<std::basic_string_view<char, std::char_traits<char>>>, std::nullptr_t, ctpg::no_type, unsigned long, ctpg::term_value<char>>' "T must occur exactly once in alternatives"
          static_assert(__detail::__variant::__exactly_once<_Tp, _Types...>,
    

    These are just a few errors/warnings/notes I thought may be relevant

    I have also tried specifying auto& (because I saw it in ctpg::buffers::cstring_buffer) instead of just auto -- no dice

    With either, if I specify raw literals in the parameters I get something like

    /root/workdir/Source/Main.cpp:6:9: error: pointer to subobject of string literal is not allowed in a template argument
            "[0-9][0-9]*",
    

    If I try to substitute constants such as

    
    constexpr static const char natural_number_regex_string[] = "[0-9][0-9]*";
    
    

    I get right back to the errors I had before

    I have no idea how to make this work properly and think I may be forced to go back to a "monolithic" parser. Anyone know how to make this work

    Thank you,

    • Chris P.s Here is a gist of a full compiler output and here is my header file and this is a Main.cpp using it, if you comment out the second rule, it works because it does not have the literal term P.s.s If I substitute the mock rules with the actual ones I want too use, I get
    /root/.conan/data/ctpg/1.3.6/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include/ctpg.hpp:127:28: note: ambiguity is between a regular call to this operator and a call with the argument order reversed
                constexpr bool operator == (const it_type& other) const { return cast()->ptr == other.ptr; }
                               ^
    /root/workdir/Source/Main.cpp:16:9: error: excess elements in struct initializer
            NaturalNumberConstantExpression::factor, 
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    4 warnings and 1 error generated.
    

    This warning is showing up quite persistently.

    opened by cgbsu 7
  • Make tests and examples executables conditional

    Make tests and examples executables conditional

    Hi,

    Having test and example executables optionally removed would be nice to have as it would cut down installation and package generation times (I'm writing an AUR package for ctpg right now btw).

    Thanks for your work so far :)

    Regards, Jules

    opened by JPenuchot 6
  • clang is not supported?

    When I try to use ctpg with Clang 13, the build fails. It works correctly with GCC 11. Is Clang not supported?

    include/ctpg.hpp:1466:18: error: function 'create_regex_parser<ctpg::regex::dfa_size_analyzer>' with deduced return type cannot be used before it is defined
            auto p = create_regex_parser(a);
                     ^
    include/ctpg.hpp:1460:20: note: 'create_regex_parser<ctpg::regex::dfa_size_analyzer>' declared here
        constexpr auto create_regex_parser(Builder& b);
                       ^
    include/ctpg.hpp:1463:24: error: no return statement in constexpr function
        constexpr size32_t analyze_dfa_size(const char (&pattern)[N])
                           ^
    include/ctpg.hpp:1612:25: error: function 'create_regex_parser<ctpg::regex::dfa_size_analyzer>' with deduced return type cannot be used before it is defined
            auto p = regex::create_regex_parser(a);
                            ^
    include/ctpg.hpp:1460:20: note: 'create_regex_parser<ctpg::regex::dfa_size_analyzer>' declared here
        constexpr auto create_regex_parser(Builder& b);
                       ^
    include/ctpg.hpp:1641:29: error: function 'create_regex_parser<ctpg::regex::dfa_size_analyzer>' with deduced return type cannot be used before it is defined
                auto p = regex::create_regex_parser(a);
                                ^
    include/ctpg.hpp:1460:20: note: 'create_regex_parser<ctpg::regex::dfa_size_analyzer>' declared here
        constexpr auto create_regex_parser(Builder& b);
                       ^
    include/ctpg.hpp:831:36: error: in-class initializer for static data member is not a constant expression
        static const size_t dfa_size = regex::analyze_dfa_size(Pattern);
                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    opened by toge 5
  • Add support for passing external context to parsers

    Hi!

    I was trying to use this library to write a simple and fast expression evaluator which would support the use of external variables and functions inside expressions, but I couldn't find any method that would allow me to pass such an external context into the reducing functors. The parser itself is constexpr, so it is impossible to capture anything in the lambdas, and the "parse" method doesn't accept any parameters that could be forwarded to the functors.

    I think use of external contexts is a very important feature, because the only alternative is creating an AST and then traversing it with a separate algorithm. That requires performing tons of small memory allocations (one for each AST node), and I'd prefer to avoid it and, at least, use some preallocated memory chunk instead (which is still impossible, because I can't tell the parser where this chunk is located).

    So, I've tried to add support for external contexts myself. See the added example and the commit description for more info. It is just a first draft, so feel free to request any changes.

    Also, a known issue is that this implementation is incompatible with the list helpers, because they accept any arguments, including contexts, which leads to hard errors on instantiation of std::is_invocable.
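
    As a rough, generic illustration of the pattern being proposed (passing state as an explicit argument instead of capturing it), the following sketch is plain C++ and is not CTPG's actual API:

    struct eval_context
    {
        double x;
    };

    // A constexpr-constructed parser cannot hold functors that capture runtime
    // state; a capture-less functor that takes the context explicitly can still
    // read external variables at parse time.
    constexpr auto reduce = [](eval_context& ctx, double operand)
    {
        return ctx.x + operand;
    };

    int main()
    {
        eval_context ctx{2.0};
        return reduce(ctx, 40.0) == 42.0 ? 0 : 1;
    }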

    Regards, Ilya Nozhkin

    opened by ilya-nozhkin 4
  • [Feature Request] Integration with CTRE

    CTRE is a dedicated compile-time regex library; as such, it seems to have more regex features and supports Unicode. Would it be possible to integrate it into CTPG?

    opened by cgbsu 3
  • Ignore all whitespaces but newlines

    First of all I really like your library and am very curious to learn about the implementation details and tinker around with it. I just had one question:

    Is there a way yet to ignore all whitespace except newlines, in order to define a language where newlines are a syntax element, like e.g. the HTTP protocol?

    opened by stephanroslen 3
  • Parsing exponent notation in json-parser

    I just saw that to_js_number() does not work when the exponent does not have a sign (like 1.0e8). A quick look at the JSON grammar tells me that this is allowed.

    Was this missed or is support for these not required?

    Here is a simple fix anyway. I am happy to provide a pull request if necessary; I'm just confused right now.

    https://godbolt.org/z/7bo3d41K6
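
    For reference, the JSON grammar makes the exponent sign optional; a number pattern along these lines (an illustrative regex written here as a plain constant, not necessarily the pattern the json-parser example uses) accepts 1.0e8 as well as 1.0e+8:

    // Illustrative JSON number pattern with an optional exponent sign.
    constexpr char js_number_pattern[] =
        "-?(0|[1-9][0-9]*)([.][0-9]+)?([eE][+-]?[0-9]+)?";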

    opened by nandanvasudevan 3
  • Regex Fails to Match

    #include <ctpg/ctpg.hpp>

    using namespace ctpg;
    using namespace ctpg::buffers;
    
    constexpr char number_pattern[] = "0|[1-9][0-9]*";
    constexpr ctpg::regex::expr<number_pattern> r;
    const bool match = r.match(string_buffer("0"));
    
    

    match returns false, whereas matching string_buffer("1") returns true.

    opened by BenHanson 3
  • Add construct ftor

    It turned out the list helpers example for comma separated numbers did not work, as the first rule does not take the first number and the second rule expects to start with a comma. So I thought it might be a good idea to provide a helper that initializes a container with a given argument.

    • added construct ftor to do so
    • added testcase in list_helpers.cpp
    • changed documentation
    opened by stephanroslen 2
  • Maybe it is possible to remove the lexer scanner?

    I am writing a grammar using another parser library, and I find the lexer/scanner unnatural. When we define a token, we usually give it a name with some semantics, such as VARIABLE, STRING, INT, FLOAT, BOOL, etc. This is unnatural because the lexer should not carry any information about semantics. It might be more suitable to use names like LITTLE_CHAR_SET, CHARS_SET_WITH_QUOTES, DIGIT_SET instead of VARIABLE, STRING, INT, but obviously these names are too verbose. It seems unimportant, but when I define a grammar I always need to make a trade-off between a natural but complex grammar and a simple but incoherent one, because the places using the same token often have different semantics.

    So, I am considering whether we can get a more natural grammar definition by removing the lexer/scanner and replacing lexer tokens with inline regexes. This made me think of your library, and I feel it would be a good fit because it could also address the problem of lexer priority.

    opened by 95833 2
  • Precedence When Matching/Lexing

    Is it possible to use precedence in the pattern matching step?

    I have 3 terms: one for digits ("[0-9]+"), one for "u", and one for identifiers ("[a-zA-Z_][a-zA-Z_0-9]+").

    I use "u" to identify an unsigned integer, and I can specify its bit arity with a number after it,

    e.g. "100u8" is an unsigned 8-bit integer with value 100.

    "100u" works fine (an unsigned integer with value 100 and unspecified bit arity).

    However, "100" is parsed as "Digits" <- "Base10Digits" (I do this so I can parse digits in different bases), then "u8" is parsed as an identifier, and there is no rule to match this.

    Right now "u" has a higher precedence than "Identifier", but the identifier gets matched first.

    Would it be possible to: A) try matching patterns with higher precedence first, or B) if there is no suitable rule for the matched term, fall back and see if there is a suitable term (case: unexpected Identifier)?
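
    As background, the behaviour described above is what maximal-munch lexing produces: the longest match wins, with precedence typically used only to break ties between equally long matches. The following is a tiny generic illustration in plain C++ (it is not taken from CTPG and makes no claim about CTPG's actual lexer):

    #include <cassert>
    #include <cstddef>
    #include <string_view>

    // Length matched by the fixed term "u" at the start of the input.
    constexpr std::size_t match_u(std::string_view s)
    {
        return (!s.empty() && s[0] == 'u') ? 1 : 0;
    }

    // Length matched by the identifier pattern [a-zA-Z_][a-zA-Z_0-9]+ at the start.
    constexpr std::size_t match_identifier(std::string_view s)
    {
        std::size_t i = 0;
        while (i < s.size() &&
               (s[i] == '_' ||
                (s[i] >= 'a' && s[i] <= 'z') ||
                (s[i] >= 'A' && s[i] <= 'Z') ||
                (i > 0 && s[i] >= '0' && s[i] <= '9')))
            ++i;
        return i >= 2 ? i : 0;   // the trailing '+' requires at least two characters
    }

    int main()
    {
        // After "100" has been consumed from "100u8", the remaining input is "u8":
        // the identifier pattern matches 2 characters while "u" matches only 1,
        // so a longest-match lexer emits an identifier token.
        assert(match_identifier("u8") > match_u("u8"));
    }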

    opened by cgbsu 2
  • [Question] Is it applicable to parsing markdown?

    I want to convert between 2 similar formats (DokuWiki -> Obsidian), but I fail to describe a term for "arbitrary text that is not matched by any other term, but is surrounded by recognizable terms". For example, ** any such text // and another one//**, where "**" and "//" are terms properly described to the parser generator.

    opened by codemonc 2
  • string_view buffer

    Thanks for your work! I'm enjoying playing with the library.

    I found this handy fwiw:

    #include <string_view>

    // A small buffer over std::string_view providing begin/end/get_view.
    struct sv_buffer
    {
    public:
        constexpr sv_buffer(std::string_view str) : str(str) {}
        constexpr sv_buffer(const char *str) : str(str) {}
    
        constexpr auto begin() const { return str.cbegin(); }
        constexpr auto end() const { return str.cend(); }
    
        using iterator = std::string_view::const_iterator;
    
        constexpr std::string_view get_view(iterator start, iterator end) const
        {
            return std::string_view(str.data() + (start - str.begin()),
                                    end - start);
        }
    
    private:
        std::string_view str;
    };
    
    
    opened by mjhurd 1
  • Constexpr parsing: non-constexpr containers

    I'm currently facing an issue while using ctpg::parser:

    • libstdc++ doesn't have a constexpr-ready version of std::optional
    • libc++ doesn't have a constexpr-ready version of std::vector

    This makes ctpg::parser currently unusable in constexpr functions, since both are required but no standard library implementation has both of them constexpr-ready.
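
    One way to check which of these pieces a given standard library already implements as constexpr is to look at the standard feature-test macros; a small diagnostic sketch (the macros and values below are the standard ones for constexpr std::optional and constexpr std::vector, nothing CTPG-specific, and it needs -std=c++20 or later for <version>):

    #include <version>
    #include <cstdio>

    int main()
    {
        // P2231: fully constexpr std::optional bumps __cpp_lib_optional to 202106.
    #if defined(__cpp_lib_optional) && __cpp_lib_optional >= 202106L
        std::puts("std::optional is constexpr-ready");
    #endif
        // P1004: constexpr std::vector defines __cpp_lib_constexpr_vector.
    #if defined(__cpp_lib_constexpr_vector)
        std::puts("std::vector is constexpr-ready");
    #endif
    }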

    There could be several ways to make constexpr ctpg happen:

    I will probably go for one of these options on my side, because I need ctpg to work in constexpr contexts for my research, but I would like to know whether merging one of these changes on your side sounds reasonable to you and, if so, which one you prefer so I can put my focus on it.

    Regards, Jules

    opened by JPenuchot 5
Releases: v1.3.7
Owner: Piotr Winter (C++ dev since 2005)