Lingo - Text encoding for modern C++

Overview

Lingo is an encoding-aware string library for C++11 and up. It aims to be a drop-in replacement for the standard library strings by defining new string classes that mirror the standard library as closely as possible, while extending them with new functionality made possible by its encoding- and code-page-aware design.

Features

  • Encoding- and code-page-aware lingo::string and lingo::string_view classes, almost fully compatible with std::string and std::string_view.
  • Conversion constructors between lingo::strings of different encodings and code pages.
  • lingo::encoding::* for low-level encoding and decoding of code points.
  • lingo::page::* for additional code point information and conversion between different code pages.
  • lingo::error::* for different error handling behaviours.
  • lingo::encoding::point_iterator and lingo::page::point_mapper helpers to manually iterate over or convert points individually.
  • lingo::string_converter to manually convert entire strings.
  • Null-terminator-aware lingo::string_view.
  • lingo::make_null_terminated helper function for APIs that only support C strings, which ensures that a string is null-terminated with minimal copying.

How it works

The string class in the C++ standard library is defined like this:

namespace std
{
    template <class CharT, class Traits, class Allocator>
    class basic_string;
}

CharT is the code point type, and Traits contains all operations to work with the code units. This setup works fine for simple ASCII strings, but runs into problems when working with more complicated encodings.

  • It assumes that every CharT is a code point, while in reality most strings use a variable-length encoding such as UTF-8 or UTF-16, where one code point can span several code units.
  • It has no information about the code page used. char could be ASCII, UTF-8, ISO 8859-1, or anything else. And while the standard is adding char8_t, char16_t and char32_t for Unicode, it only knows that the data is some form of Unicode and has no idea how to actually encode, decode or transform it.
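The first problem is easy to reproduce with nothing but the standard library. The sketch below is not part of Lingo; count_utf8_code_points and sample_text are hypothetical helpers written for this illustration, and the counter assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 byte sequence by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is well-formed.
inline std::size_t count_utf8_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char unit : utf8)
    {
        if ((unit & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}

// "héllo" spelled with explicit bytes, so the contents do not depend on the
// source or execution character set. U+00E9 is the byte pair C3 A9 in UTF-8.
inline std::string sample_text()
{
    return std::string("h\xC3\xA9llo");
}
```

Here sample_text().size() reports 6 code units, while count_utf8_code_points(sample_text()) reports 5 code points. std::string only knows about the former.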

To solve this problem, Lingo defines a new string type:

namespace lingo
{
    template <typename Encoding, typename Page, typename Allocator>
    class basic_string;
}

Lingo splits the responsibility of managing the code points of a string between an Encoding type and a Page type. The Encoding type defines how a code point can be encoded to and decoded from one or more code units. The Page type defines what every decoded code point actually means, and knows how to convert it to other Pages.
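To make the Encoding half of that split concrete, here is a minimal standalone sketch of turning one code point into one to four UTF-8 code units. It is not Lingo's implementation or interface: encoded_utf8 and encode_utf8 are hypothetical names, and signalling an error with a size of 0 is an assumption made for brevity.

```cpp
#include <cstddef>

// Hypothetical result type: the code units produced and how many are used.
struct encoded_utf8
{
    unsigned char units[4];
    std::size_t size; // 0 means the code point cannot be encoded
};

// Encodes a single Unicode code point as UTF-8 code units.
inline encoded_utf8 encode_utf8(char32_t point)
{
    encoded_utf8 out = {{0, 0, 0, 0}, 0};
    if (point < 0x80)
    {
        out.units[0] = static_cast<unsigned char>(point);
        out.size = 1;
    }
    else if (point < 0x800)
    {
        out.units[0] = static_cast<unsigned char>(0xC0 | (point >> 6));
        out.units[1] = static_cast<unsigned char>(0x80 | (point & 0x3F));
        out.size = 2;
    }
    else if (point < 0x10000)
    {
        if (point >= 0xD800 && point <= 0xDFFF)
        {
            return out; // surrogates are not valid code points
        }
        out.units[0] = static_cast<unsigned char>(0xE0 | (point >> 12));
        out.units[1] = static_cast<unsigned char>(0x80 | ((point >> 6) & 0x3F));
        out.units[2] = static_cast<unsigned char>(0x80 | (point & 0x3F));
        out.size = 3;
    }
    else if (point <= 0x10FFFF)
    {
        out.units[0] = static_cast<unsigned char>(0xF0 | (point >> 18));
        out.units[1] = static_cast<unsigned char>(0x80 | ((point >> 12) & 0x3F));
        out.units[2] = static_cast<unsigned char>(0x80 | ((point >> 6) & 0x3F));
        out.units[3] = static_cast<unsigned char>(0x80 | (point & 0x3F));
        out.size = 4;
    }
    return out; // size stays 0 for values above U+10FFFF
}
```

What the resulting code point means (for example, that U+20AC is the euro sign) is a separate question, and that is the part a Page type answers.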

Here are some examples of what that actually looks like:

using ascii_string = lingo::basic_string<
    lingo::encoding::none<char, char>,
    lingo::page::ascii>;

using utf8_string = lingo::basic_string<
    lingo::encoding::utf8<char8_t, char32_t>,
    lingo::page::unicode>;

using utf16_string = lingo::basic_string<
    lingo::encoding::utf16<char16_t, char32_t>,
    lingo::page::unicode>;

using utf32_string = lingo::basic_string<
    lingo::encoding::utf32<char32_t, char32_t>,
    lingo::page::unicode>;

using iso_8859_1_string = lingo::basic_string<
    lingo::encoding::none<unsigned char, unsigned char>,
    lingo::page::iso_8859_1>;

You may wonder why there is a lingo::encoding::utf32 encoding, since there is no difference between UTF-32 and decoded Unicode. It is indeed possible to use lingo::encoding::none instead, and still have a fully functional UTF-32 string. However, lingo::encoding::utf32 does add some extra validation, such as detecting surrogate code units, making it better at dealing with invalid inputs.
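As a rough idea of what that extra validation involves, the check below accepts only Unicode scalar values. It is a standalone sketch rather than Lingo's code, and is_valid_utf32_unit is a hypothetical name.

```cpp
// True when `unit` is a Unicode scalar value: no larger than U+10FFFF and
// outside the surrogate range U+D800..U+DFFF, which is reserved for UTF-16.
constexpr bool is_valid_utf32_unit(char32_t unit)
{
    return unit <= 0x10FFFF && !(unit >= 0xD800 && unit <= 0xDFFF);
}

static_assert(is_valid_utf32_unit(U'A'), "ordinary code points are accepted");
static_assert(!is_valid_utf32_unit(0xD800), "surrogates are rejected");
```

An encoding that performs no validation would pass both kinds of invalid value through unchanged.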

Currently implemented

Encodings

  • lingo::encoding::none
  • lingo::encoding::utf8
  • lingo::encoding::utf16
  • lingo::encoding::utf32
  • lingo::encoding::base64

Meta encodings

  • lingo::encoding::swap_endian: Swaps the endianness of the code units.
  • lingo::encoding::join: Chains multiple encodings together (e.g. join<swap_endian, utf16> to create utf16_be).
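For a 16-bit code unit, swapping the endianness amounts to exchanging its two bytes. The function below is an illustrative sketch only; swap_bytes is a hypothetical name and not how lingo::encoding::swap_endian is necessarily implemented.

```cpp
// Exchanges the high and low byte of a 16-bit code unit.
constexpr char16_t swap_bytes(char16_t unit)
{
    return static_cast<char16_t>(((unit & 0x00FF) << 8) | ((unit & 0xFF00) >> 8));
}

static_assert(swap_bytes(0x0041) == 0x4100, "U+0041 becomes 0x4100 when swapped");
static_assert(swap_bytes(swap_bytes(0xD83D)) == 0xD83D, "swapping twice is a no-op");
```

Chaining such a step with a UTF-16 encoding gives code units in the opposite byte order from the platform's native one.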

Code pages

  • lingo::page::ascii
  • lingo::page::unicode
  • lingo::page::iso_8859_n for n from 1 to 16, except 12.

Error handlers

  • lingo::error::strict: Throws an exception on error.

Algorithms

Will be added in a future version.

How to build

Lingo is a header-only library, but some of the header files have to be generated first. You can check the latest release for a package that has all headers generated for you.

If you want to generate the headers yourself, you will have to build the CMake project. All you need is CMake 3.12 or higher, Python 3 (for the code generation) and a C++11 compatible compiler. The tests are written using Catch and can be run with ctest.

How to include in your project

Since Lingo is a header-only library, all you need to do is copy the header files and add their location as an include directory.

There is one thing that you do need to look out for, which is the execution character set. This library assumes by default that char is UTF-8, and that wchar_t is UTF-16 or UTF-32, depending on the size of wchar_t.
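One way to find out what your compiler actually does is to inspect a narrow string literal at compile time. The check below is a sketch that does not depend on Lingo; narrow_literals_are_utf8 is a hypothetical helper. It relies on U+00E9 being encoded as the two bytes C3 A9 in UTF-8, while single-byte code pages such as ISO 8859-1 store it in one byte.

```cpp
// "\u00E9" is stored in the execution character set. In UTF-8 it occupies
// two bytes (C3 A9), so the literal is three bytes including the terminator.
constexpr bool narrow_literals_are_utf8()
{
    return sizeof("\u00E9") == 3
        && static_cast<unsigned char>("\u00E9"[0]) == 0xC3
        && static_cast<unsigned char>("\u00E9"[1]) == 0xA9;
}
```

Wrapping the call in a static_assert in one of your own source files makes a mismatch fail the build early instead of producing wrongly decoded strings at run time.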

This matches the default settings of GCC and Clang, but not of Visual Studio. If your compiler's execution character set does not match the defaults, you have two options:

Configure your compiler

Configure the library

The following macros can be defined to override the default encodings and code pages for char and wchar_t:

  • LINGO_CHAR_ENCODING
  • LINGO_WCHAR_ENCODING
  • LINGO_CHAR_PAGE
  • LINGO_WCHAR_PAGE

So for example, if you want to use ISO/IEC 8859-1 for chars, you will have to define the following macros:

  • -DLINGO_CHAR_ENCODING=none
  • -DLINGO_CHAR_PAGE=iso_8859_1

This method is not recommended. Compiler flags are a much more reliable way to set the correct execution encoding.

Other documentation

Comments
  • Issue Using in Windows Build

    Hello,

    I am trying to utilize this in a program I am writing, however, after generating the headers, I am receiving the following errors when running this code:

    #include <lingo/string.hpp> 
    #include <lingo/string_converter.hpp>
    
    lingo::utf8_string value_str_utf8;
    lingo::utf16_string value_str_utf16(value_str_utf8);
    sql_stmt->setNString(1, odbc::NString(std::u16string(value_str_utf16.data())));
    

    Errors:

    1>C:\Users\user\source\lib\lingo\include\lingo\string_storage.hpp(21,69): warning C4003: not enough arguments for function-like macro invocation 'max'
    1>C:\Users\user\source\lib\lingo\include\lingo\string_storage.hpp(21,69): error C2589: '(': illegal token on right side of '::'
    1>C:\Users\user\source\lib\lingo\include\lingo\string_storage.hpp(22): message : see reference to class template instantiation 'lingo::internal::basic_string_storage_long_marker<Unit>' being compiled
    1>C:\Users\user\source\lib\lingo\include\lingo\string_storage.hpp(21,69): error C2059: syntax error: '('
    1>C:\Users\user\source\lib\lingo\include\lingo\string_storage.hpp(241,40): warning C4003: not enough arguments for function-like macro invocation 'max'
    1>C:\Users\user\source\lib\lingo\include\lingo\string_storage.hpp(241,40): error C2589: '(': illegal token on right side of '::'
    1>C:\Users\user\source\lib\lingo\include\lingo\string_storage.hpp(404): message : see reference to class template instantiation 'lingo::basic_string_storage<Unit,Allocator>' being compiled
    1>C:\Users\user\source\lib\lingo\include\lingo\encoding\endian.hpp(54,30): error C2589: '(': illegal token on right side of '::'
    1>C:\Users\user\source\lib\lingo\include\lingo\encoding\endian.hpp(63): message : see reference to class template instantiation 'lingo::encoding::swap_endian<Encoding>' being compiled
    1>Done building project "project.vcxproj" -- FAILED.
    ========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
    
    opened by EliSauder 4
  • Refactor string_storage

    string_storage should be responsible not just for allocating memory, but also moving code units around efficiently.

    Design a better interface that allows string to simply tell the string_storage where the data needs to go, while string_storage puts everything in the right place.

    It should also be efficient, switching to std::memcpy or std::move where possible, and also have a strong exception guarantee.

    refactor 
    opened by rick-de-water 0
  • Add C++20 concepts

    • [ ] lingo::concept::encoding
    • [ ] lingo::concept::page
    • [ ] lingo::concept::encode_result
    • [ ] lingo::concept::decode_result
    • [ ] lingo::concept::error_handler
    feature 
    opened by rick-de-water 0
  • Unicode normalization algorithm

    • [ ] Add normalize to the Unicode page.
    • [ ] Add normalize as a free function that can take any string object with a page that has a normalize function.
    • [ ] Add normalize as a member function to lingo::strings with a page that has a normalize function.
    feature unicode 
    opened by rick-de-water 0
Releases(v0.1.1)
Owner
Rick de Water