a header-only crossplatform type-safe dynamic compiler generator based on C++ 17.

Overview

Mu Compiler Generator

MuCompilerGenerator(MuCplGen)

a Header-Only dynamic compiler generator based on C++ 17.

Why MuCplGen?

    header-only cross-platform
    self-contained (without any dependencies)
    type-safe semantic action (debug-friendly)

frequent-update-repo.

git clone https://e.coding.net/revdolgaming/musys/MuCompilerGenerator.git

Introduction

MuCompilerGenerator(MuCplGen) is a Header-Only dynamic compiler generator based on C++ 17.

MuCplGen covers the front-end of a compiler, which can be divided into 2 parts, respectively Scanner and Parser. A Scanner employs a Determined Finite Automaton(DFA) to tokenize a series of words. While the rule for word match is called Regular Expression(Regex). And the output of a Scanner is a token list, every token in the list contains the info of the word recognized, e.g. the line number, the type, and so on. Then Parser takes the list of tokens as input, working as a Push Down Automaton(PDA) to decide whether the token sequence satisfies the designed Context Free Grammar(CFG).

It's annoying to construct DFA and PDA by yourself(manually), however they are theoretically possible to generate automatically with certain algorithm. MuCplGen takes the dull job of this part.

The only things you need to do is:

  • Define the Token
  • Define the Regex Rules for Scanner
  • Define the Transform between tokens and terminators.
  • Define the CFG and Semantic Action for Parser

Yes, your job is to give rules.

Include MuCplGen

MuCplGen is header-only, copy the folder MuCplGen to your include folder, and everything done.

If you want to run the Examples, CMake is required. CMake GUI is recommended.

command line:

mkdir build
cd build
cmake ..
cmake --build

if you are Linux user, make sure that you g++8 is available. [linux install g++8](#install g++8)

Requirement

c++ std compiler
Windows c++17 MVSC
Linux c++17 g++8
other c++17

Quick Start

In this part we are going to create an easy calculator from scratch, covering using predefined EasyScanner, implementing a Parser and defining its rules.

Read from Console

Create main.cpp in your project. Include MuCplGen/MuCplGen.h.

//main.cpp
#include <MuCplGen/MuCplGen.h>
using namespace MuCplGen;

int main()
{
	std::vector<LineContent> lines;
	LineContent line;
	line.line_no = 1;
    std::cout << "Calculate:";
	std::cin >> line.content;
	lines.push_back(line);
}

For convinience, this time, we read the console as input.

Recognize Number & Operator

EasyScanner is a predefined Scanner, which can recognize:

  • number
  • arithmetic operator
  • relation operator
  • ...

we just ask EasyScanner to recoginze number and arithmetic operator for us.

int main()
{
    ...
	EasyScanner easyScanner;
	std::vector<EasyToken> tokens = easyScanner.Scann(lines);
}

after scanning, EasyScanner returns a list of tokens to us, which will be the input of our Parser.

Analyze Expression

To define our Parser, we need some rules, which is called CFG.

An arithmetic expression can be defined as follows:

Expr->	E
//priority 3(+ -)
E	->	E + T
E	->	E - T
E	->	T
//priority 2(* /)
T	->	T * P
T	->	T / P
T	->	P
//priority 1(^)
P	->	P ^ F //power
P	->	F
//priority 0
F	->	Num
F	->	( E )

The part on the left of -> is called Head of the production, while items on the right called Body. Every production indicates which Head can produce which Body.

Explaining arithmetic expression in this way help us to take the expression apart tree-like obeying the arithmetic rule.

e.g. When E -> E + T happens, we know it's safe and logical to do the addition operation between E and T, without any fear about the priority.

Create a Parser

Create a cpp header file in your project, called "Calculator", this time, a SLR parser is enough, and we use the predefined token EasyToken.

using namespace MuCplGen;
class Calculator :public SyntaxDirected<SLRParser<EasyToken>>
{
	Calculator(std::ostream& log = std::cout) : SyntaxDirected(log)
	{

	}
};

Define the Semanitc Action

Once a production happen, the Semantic Action attaching on it will involke.

We use the semantic action to pass on the calculated value, and finally get the result when Expr -> E happens. So we need to define how to process and pass on the value in every intermediate production.

Before doing so, we need to tell our parser how we translate an input token as terminator(who can never be a Head, cuz as its name, it's the end of the produce process)

using namespace MuCplGen;
class Calculator :public SyntaxDirected<SLRParser<EasyToken>>
{
public:
	Calculator(std::ostream& log = std::cout) : SyntaxDirected(log)
	{
		{
			//translate number token as terminator Num
			auto& t = CreateTerminator();
			t.name = "Num";
			t.translation = [this](const Token& token)
			{
				return token.type == Token::TokenType::number;
			};
		}
        ...
	}
};

and then, we define the first Semantic Action right after the terminator definition.

{
    auto& p = CreateParseRule();
    p.expression = "F -> Num";
    //first template parameter is the return type, and others are input parameters
    p.SetAction<float, Empty>(
        [this](Empty)->float
        {
            auto& token = CurrentToken();
            return std::stof(token.name);
        });
}

here, when F -> Num happens, we take the current token, and we know it should be a number, so we just convert it from string to float, and pass it on. Thus any CFG rule who has an F in its Body can get this float data from the F.

Notice that, we places an Empty as the parameter, we need it to save a place for those who doesn't pass any data. In this case, the Num doesn't take any data, so we place Empty for it.

as for, P -> F, we know F takes a float, and this production is just for a priority decreasing, without further semantic aim, so we simply PassOn the first data. i.e. P takes the float from F. Thus any CFG rule who has an P in its Body can get this float data from the P.

{
    auto& p = CreateParseRule();
    p.expression = "P -> F";
    p.SetAction(PassOn(0));
}

SetAction(PassOn(0)) is defualt.

cool, now we are going to calculate something.

P -> P ^ F

{
    auto& p = CreateParseRule();
    p.action_name = "Power()";
    p.expression = "P -> P ^ F";
    p.SetAction<float, float, Empty, float>(
        [this](float P, Empty, float F)->float
        {
            return std::pow(P, F);
        });
}

this time, we give a name to this action(only for debug and readability).

But why is the action so simple?

We solve the problem from small! We know P takes a float data, so does F, only thing we need to do is to do the power(P,F).

But, here pay attention to the Empty , which saves the place for ^, cuz ^ doesn't takes any data. if you forget it, the type-safety-checker will throw an exception to you.

So do the rest rules.

{
    auto& p = CreateParseRule();
    p.expression = "T -> P";
}

{
    auto& p = CreateParseRule();
    p.action_name = "Multipy()";
    p.expression = "T -> T * P";
    p.SetAction<float, float, Empty, float>(
        [this](float T, Empty, float P)->float
        {
            return T * P;
        });
}

{
    auto& p = CreateParseRule();
    p.action_name = "Divid()";
    p.expression = "T -> T / P";
    p.SetAction<float, float, Empty, float>(
        [this](float T, Empty, float P)->float
        {
            return T / P;
        });
}

{
    auto& p = CreateParseRule();
    p.expression = "E -> T";
}

{
    auto& p = CreateParseRule();
    p.action_name = "Add()";
    p.expression = "E -> E + T";
    p.SetAction<float, float, Empty, float>(
        [this](float E, Empty, float T)->float
        {
            return E + T;
        });
}

{
    auto& p = CreateParseRule();
    p.action_name = "Sub()";
    p.expression = "E -> E - T";
    p.SetAction<float, float, Empty, float>(
        [this](float E, Empty, float T)->float
        {
            return E - T;
        });
}

{
    auto& p = CreateParseRule();
    p.action_name = "Compress()";
    p.expression = "F -> ( E )";
    p.SetAction(PassOn(1));
}

Finially never forget the Expr -> E, this is the entrance of all productions, we need to set this rule as the first rule, although it does nothing at all.

{
    //translate number token as terminator Num
    auto& t = CreateTerminator();
    t.name = "Num";
    t.translation = [this](const Token& token)
    {
        return token.type == Token::TokenType::number;
    };
}

{
    auto& p = CreateParseRule();
    p.expression = "Expr -> E";
    p.SetAction<Empty, float>(
        [this](float res)->Empty
        {
            std::cout << "Result = " << res << std::endl;
            return Empty{};
        });
}
...

Use Our Own Parser

Go back to our main.cpp file, include Calculator.h, and create a Calculator object in the main function, call Parse() on Calculator to build the process the input tokens.

//main.cpp
#include <MuCplGen/MuCplGen.h>
#include <Calculator.h>
using namespace MuCplGen;

int main()
{
	std::vector<LineContent> lines;
	LineContent line;
	line.line_no = 1;
	std::cout << "Calculate:";
	std::cin >> line.content;
	lines.push_back(line);

	EasyScanner easyScanner;
	std::vector<EasyToken> tokens = easyScanner.Scann(lines);
    Highlight(lines, tokens);
    
	Calculator calculator;
	calculator.Parse(lines, tokens);
}

Run:

image-20211121173909113

If you have any problem, please check Examples/Calculator, or using CMake to build the Examples.

cool! It works. but, what if something wrong?

image-20211121174052577

good!

So how can we get more detail debug info, e.g. how the PDA works?

class Calculator :public SyntaxDirected<SLRParser<EasyToken>>
{
public:
	Calculator(std::ostream& log = std::cout) : SyntaxDirected(log)
	{
		debug_option = Debug::DebugOption::ShowReductionProcess;
        ...
    }
}

Setup debug_option as ShowReductionProcess.

Run again.

image-20211121175018648

It works pretty well. There are quite a lot of Debug Info MuCplGen can provide, please set the debug_options to test the debug info by yourself.

Further

In the following content, we are going into more details about how MuCplGen works.

  • if you want to create a Scanner by yourself please check the Chapter Scanner.

  • if you need more info about Parser, please check the Chapter [Syntax-Directed Parser](#Syntax-Directed Parser)

Scanner

Regular Expression (Regex) based.

  • Regex Rule for Scanner

    type field detail
    std::string [tokenType](#Regex with Action) readability string token type name
    std::regex [expression](#Regex with Action) regex rule
    int [priority](#Recognize Priority) to solve conflicts
    std::function [onSucceed](#Scanner Action) callback when token recognized

How Custom Scanner Looks like

struct CustomToken : public BaseToken
{
    ...
}

class CustomScanner : public Scanner<CustomToken>
{
    using Token = CustomToken;
    CustomScanner()
    {
        auto& rule = CreateRule();
        rule.tokenType = "typename";
        //regular expression
        rule.expression = R"(....)";
        rule.priority = 0;
        //action
        rule.onSucceed = 
           	[this](std::smatch&, Token& token)->ScannActionResult
        	{
            	return SaveToken;
        	};
        //other rules
        ...
    }
}

Regex with Action

auto& num = CreateRule();
//for readability
num.tokenType = "number";
//regex
num.expression = R"(^(\-|\+)?\d+(\.\d+)?)";
//action to do when a token is recoginized
num.onSucceed = [this](std::smatch&, Token& token)->ScannActionResult
{
    //custom data, for fast token type dicision
    token.type = Token::TokenType::number;
    //only for debug token highlight as you can see above
    token.color = ConsoleForegroundColor::White;
    //tell the scanner to save this Token
    return SaveToken;
};

Remember to start your regex with ^, or something weird may happen. e.g. Hhello will match the regex hello.

std::smatch& is the regex match info, in this case, ignored.

Scanner Action

auto& blank = CreateRule();
blank.tokenType = "Blank";
blank.expression = CommonRegex::Blank;
blank.onSucceed = [this](std::smatch&, Token&)->ScannActionResult
{
    //to ignore blank as ' ', '\t', '\n' ...
    //if you need the blank token, never Discard it. (e.g. python keeps the blank token)
    return DiscardThisToken;
};

auto& comment = CreateRule();
comment.tokenType = "Comment";
comment.expression = "^//.*";
comment.onSucceed = [this](std::smatch&, Token&)->ScannActionResult
{
    //Skip current line
    //usually, comment should be removed by preprocessor
    //it's just example
    return (ScannActionResult)(DiscardThisToken | SkipCurrentLine);
};

view: EasyScanner.h

Default Action

Default action is to "Save the Token" and of course you can set the default action of a scanner

struct Scanner
{
public:
    using ScannAction = std::function<ScannActionResult(std::smatch&, Token&)>;
    ScannAction defaultAction;
    ...
}

Priority

Sometimes conflicts occur, e.g. custom-defined identifier may be the same as your predefined keyword, so there's a priority option for you.

A smaller number indicates a higher priority. Default priority is 0.

auto& keyword = CreateRule();
keyword.tokenType = "keyword";
keyword.expression = 
    "^(void|char|float|int|return|enum|struct|class|private|switch"
    "|case|break|default|if|else|while|do)";
keyword.onSucceed = [this](std::smatch, Token& token)->ScannActionResult
{
    token.type = Token::TokenType::keyword;
    token.color = ConsoleForegroundColor::Blue;
    return SaveToken;
};

auto& id = CreateRule();
id.priority = 1;
id.tokenType = "identifier";
id.expression = CommonRegex::Identifier;
id.onSucceed = [this](std::smatch, Token& token)->ScannActionResult
{
    token.type = Token::TokenType::identifier;
    token.color = ConsoleForegroundColor::White;
    return SaveToken;
};

So "int" will be a keyword rather than an identifier.

view: EasyScanner.h

Macroscopic Greedy-Match

A Scanner tent to match as long as it can, i.e. if more than 2 regex rule are satisfied, the longer one will be chosen (if they are same in priority).

e.g.

  • :: will be recognized as :: rather than :and:

  • ->will be recognized as -> rather than -and>

which is pretty sensible.

if you want to break the rule, use the priority field of a regex rule.

Debug Your Scanner

cuz, it's your response to define a Token, so, if you want to highlight the Token in console, derive your token from DebugToken

namespace MuCplGen::Debug
{
	struct DebugToken :public BaseToken
	{
		ConsoleForegroundColor color = ConsoleForegroundColor::White;
	};
}
struct MyToken : public DebugToken
{
    ...
}

Thus, MuCplGen::Debug::Highlight() will highlight your content.

auto input_text = FileLoader::Load("easy.txt");
EasyScanner easyScanner;
auto token_set = easyScanner.Scann(input_text);
Debug::Highlight(input_text, token_set);

Of curse, Highlight() can show error info more readably.

template<typename Token = DebugToken>
void Highlight(
    std::vector<LineContent>& input_text, std::vector<Token>& token_set,
    std::vector<std::pair<size_t, std::string>>& error_info_pair, std::ostream& log = std::cout)

The first field of std::pair<size_t, std::string> is a token iterator, indicates the error token index in token_set, and the second is, self-explanatory, the error info.

Syntax-Directed Parser

Context Free Grammar (CFG) based.

Parser Usage
SLR SLRParser<UserToken,T> ✔️
LR1 LR1Parser<UserToken,T> ✔️
BaseParser

Store Tables

It consumes substantial time to build a PDA from CFG, when the amount of states grows more than 3k. So MuCplGen provides a mechanism to store the build result in binary form.

The binary form storage is not cross-platform, however, is machine-relative.

It's up to you whether to save or load a build result. (Sometimes IO is slower than directly build a PDA ) .

this is the first time to build a PDA,which takes 69ms.

image-20211118132824549

while, the second time, we just load the build result form the disk, which takes only 2ms. We actually solve the most time consuming part. The total time falls as fast as 1/10.

image-20211118132910802

To control the save & load strategy.

enum class BuildOption
{
    Runtime = 0,
    Save = 1,
    Load = 1 << 1,
    LoadAndSave = Load | Save
};

//custom code:
struct CustomToken
{
    ...
}

class CustomParser :public SyntaxDirected<SLRParser<CustomToken>>
{
public:
	CustomParser(std::ostream& log = std::cout) :SyntaxDirected(log)
	{
		generation_option = BuildOption::LoadAndSave;
		SetStorage("./storage/ILGen.bin");
        ...
    }
}
enum description
Runtime never load and save, always build a PDA
Save always save the build result after build a PDA
Load always load the build result from the disk
LoadAndSave asways load and save.

if load fails, the Parser roll back to the runtime mode, rebuild the PDA from your CFG.

Appendix

Install g++8

ubuntu

bash

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-8
#set g++8 as defualt
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 100
sudo update-alternatives --config g++
#check the version
g++ --version
Owner
MuGdxy
MuGdxy
PL/0 to C compiler to teach basic compiler construction from a practical, hands-on perspective.

pl0c pl0c is a compiler for the PL/0 language. It reads in PL/0 source code and outputs equivalent C source code. It was written to be the subject of

Brian Callahan 80 Nov 13, 2021
An automatic crafting rotation generator for ffxiv.

ffxiv-crafting-guide This is an automatic crafting rotation generator for ffxiv, written in C++. Introduction Given your character status and recipe i

Renko 11 Oct 16, 2021
Collection of C99 dynamic array implementations

darc darc stands for Dynamic ARray Collection. This repo hosts 3 type-generic C99 implementations : mga (Macro Generated Array) type-safe 0-cost abstr

A.P. Jo. 5 Nov 29, 2021
A single file C++ header-only minizip wrapper library

cpp-zipper A single file C++ header-only minizip wrapper library This code is based on 'Making MiniZip Easier to Use' by John Schember. https://nachti

null 3 Nov 23, 2021
Modern, header-only, compact and cross platform C++ network/sockets library

cpp-net-lib Modern, header-only, compact and cross-platform C++ network/sockets library. Don't mind the crappy name, I suck at naming things. Why? I n

Marc 6 Nov 27, 2021
A Compiler Writing Journey

In this Github repository, I'm documenting my journey to write a self-compiling compiler for a subset of the C language. I'm also writing out the details so that, if you want to follow along, there will be an explanation of what I did, why, and with some references back to the theory of compilers.

Warren 5.8k Dec 7, 2021
An operating system, but it only plays Tetris.

TETRIS-OS: An operating system that only plays Tetris. Video with an explanation of the development process. Features: It's Tetris. 32-bit (x86) Fully

null 3.5k Jul 13, 2021
A very small v8 javascript runtime for linux only

Just A very small v8 javascript runtime for linux only Build and Run Currently working on modern linux (debian/ubuntu and alpine teste

theanarkh 9 Nov 8, 2021
A reliable and easy to use CPP program header file for simplifying, code writing in cpp

CPP Custom Header This header file main purpose is to implement most famous and most used algorithm that are easy to implement but quite lengthy and t

Jitesh Kumar 1 Dec 1, 2021
Code profiler based on Frida

Code Profiler Based on Frida This repository contains the code to profile LIEF functions with Frida. Get Started Make sure to download the right versi

LIEF 26 Nov 14, 2021
A curated list of project-based tutorials in C

A list of tutorials that work towards the making of small to large projects in C.

R 7.8k Nov 29, 2021
Ubpa small flat containers based on C++20

USmallFlat Ubpa small flat containers based on C++20 Containers basic_flat_map basic_flat_multimap basic_flat_multiset basic_flat_set basic_small_vect

Ubpa 15 Sep 14, 2021
λQ: A Simple Quantum Programming Language based on QWIRE.

λQ λQ: A Simple Language based on QWIRE which can be compiled to QASM. The name λQ means lambda calculus with quantum circuits. This is a term project

Wenhao Tang 5 Jul 11, 2021
Ideas, thoughts, and notes on a typeclass/interface based polymorphism pattern for standard C

Polymorphism through Typeclasses / Interface / Traits Ideas, thoughts, and notes on an action based polymorphism pattern for good ol' C. Originally us

Chase 16 Sep 25, 2021
Cmusic source code, based on the CARL 0.2.0 distribution

CMUSIC These are the sources and makefiles for a build of the classic CARL cmusic and related programs. The build has been tested on OSX and Linux, an

null 6 Apr 10, 2021
Stack based programming language

stacky WIP stack-based compiled concatenative programming language. Currently it's somewhere between B and C programming languages in universe where B

Robert Bendun 2 Dec 2, 2021
Implementation of kcp protocol based on c++11

?? kcp-cpp A C++11 header-only kcp library,It has been heavily optimized to support native heartbeat packets and multithreading There are a lot of ins

 KH 8 Nov 20, 2021
Single-header header-only C++11 / C++14 / C++17 library for easily managing set of auto-generated type-safe flags.

Single-header header-only C++11 / C++14 / C++17 library for easily managing set of auto-generated type-safe flags. Quick start #include <bitflags/bitf

Marin Peko 66 Sep 29, 2021
Cross-platform STL-styled and STL-compatible library with implementing containers, ranges, iterators, type traits and other tools; actors system; type-safe config interface.

Yato A small repository where I'm gatherting useful snippets and abstractions for C++ development. Yato includes 3 main modules: multidimensional cont

Alexey 6 Sep 2, 2021
Easy to use, header only, macro generated, generic and type-safe Data Structures in C

C Macro Collections Easy to use, header only, macro generated, generic and type-safe Data Structures in C. Table of Contents Installation Contributing

Leonardo Vencovsky 258 Dec 1, 2021
Easy to use, header only, macro generated, generic and type-safe Data Structures in C

C Macro Collections Easy to use, header only, macro generated, generic and type-safe Data Structures in C. Table of Contents Installation Contributing

Leonardo Vencovsky 258 Dec 1, 2021
variant lite - A C++17-like variant, a type-safe union for C++98, C++11 and later in a single-file header-only library

variant lite: A single-file header-only version of a C++17-like variant, a type-safe union for C++98, C++11 and later Contents Example usage In a nuts

Martin Moene 198 Nov 14, 2021
Type-safe zero-boilerplate interfaces for pure C99, implemented as a single-header library.

Interface99 Type-safe zero-boilerplate interfaces for pure C99, implemented as a single-header library. [ examples/state.c ] #include <interface99.h>

null 105 Nov 26, 2021
Single-header, ranges-compatible generator type built on C++20 coroutines

generator Single-header, ranges-compatible generator type built with C++20 coroutines. A generator allows implementing sequence producers which are te

Sy Brand 19 Nov 22, 2021
"SaferCPlusPlus" is essentially a collection of safe data types intended to facilitate memory and data race safe C++ programming

A collection of safe data types that are compatible with, and can substitute for, common unsafe native c++ types.

null 306 Nov 28, 2021