Simdutf - Unicode routines (UTF8, UTF16): billions of characters per second.

Overview

simdutf: Unicode validation and transcoding at billions of characters per second

Most modern software relies on the Unicode standard. In memory, Unicode strings are represented using either UTF-8 or UTF-16. The UTF-8 format is the de facto standard on the web (JSON, HTML, etc.) and it has been adopted as the default in many popular programming languages (Go, Rust, Swift, etc.). The UTF-16 format is standard in Java, C# and in many Windows technologies.

Not all sequences of bytes are valid Unicode strings. It is unsafe to use Unicode strings in UTF-8 and UTF-16LE without first validating them. Furthermore, we often need to convert strings from one encoding to another, by a process called transcoding. For security purposes, such transcoding should be validating: it should refuse to transcode incorrect strings.
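
For instance, a validating conversion in this library reports zero output units when the input is invalid (see the API section below). A minimal sketch, assuming the headers are available as in the later example:

#include <iostream>

#include "simdutf.h"

int main() {
  // 0xFF can never appear in valid UTF-8, so this two-byte input is invalid.
  const char *bogus = "\xff\xff";
  char16_t output[2];
  // A validating transcoder refuses the input and reports 0 units written.
  size_t written = simdutf::convert_utf8_to_utf16(bogus, 2, output);
  std::cout << "wrote " << written << " char16_t" << std::endl; // wrote 0 char16_t
  return 0;
}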

This library provides fast Unicode functions such as

  • UTF-8 and UTF-16LE validation,
  • UTF-8 to UTF-16LE transcoding, with or without validation,
  • UTF-16LE to UTF-8 transcoding, with or without validation,
  • From a UTF-8 string, compute the size of the UTF-16 equivalent string,
  • From a UTF-16 string, compute the size of the UTF-8 equivalent string,
  • UTF-8 and UTF-16LE character counting.

The functions are accelerated using SIMD instructions (e.g., ARM NEON, SSE, AVX, etc.). When your strings contain hundreds of characters, we can often transcode them at speeds exceeding a billion characters per second. You should expect high speeds not only with English strings (ASCII) but also Chinese, Japanese, Arabic, and so forth. We handle the full character range (including, for example, emojis).

The library compiles down to tens of kilobytes. Our functions are exception-free and non-allocating. We have extensive tests.

How fast is it?

Over a wide range of realistic data sources, we transcode a billion characters per second or more. Our approach can be 3 to 10 times faster than the popular ICU library on difficult (non-ASCII) strings. We can be 20x faster than ICU when processing easy strings (ASCII). Our good results apply to both recent x64 and ARM processors.

To illustrate, we present benchmark results below, with values in billions of characters processed per second.

Datasets: https://github.com/lemire/unicode_lipsum

Please refer to our benchmarking tool for a proper interpretation of the numbers. Our results are reproducible.

Requirements

  • C++11 compatible compiler. We support LLVM clang, GCC, Visual Studio. (Our optional benchmark tool requires C++17.)
  • For high speed, you should have a recent 64-bit system (e.g., ARM or x64).
  • If you rely on CMake, you should use a recent CMake (at least 3.15); otherwise you may use the single-header version. The library is also available from Microsoft's vcpkg.

Usage

We made a video to help you get started with the library.

Usage (CMake)

cmake -B build
cmake --build build
cd build
ctest .

Visual Studio users must specify whether they want to build the Release or Debug version.

To run benchmarks, execute the benchmark command. You can get help on its usage by first building it and then calling it with the --help flag. E.g., under Linux you may do the following:

cmake -B build
cmake --build build
./build/benchmarks/benchmark --help

Instructions are similar for Visual Studio users.

Since ICU is so common and popular, we assume that you may have it already on your system. When it is not found, it is simply omitted from the benchmarks. Thus, to benchmark against ICU, make sure you have ICU installed on your machine and that cmake can find it. For macOS, you may install it with brew using brew install icu4c. If you have ICU on your system but cmake cannot find it, you may need to provide cmake with a path to ICU, such as ICU_ROOT=/usr/local/opt/icu4c cmake -B build.

Single-header version

You can create a single-header version of the library where all of the code is put into two files (simdutf.h and simdutf.cpp). We publish a zip archive containing these files, e.g., see https://github.com/simdutf/simdutf/releases/download/v1.0.0/singleheader.zip

You may generate it on your own using a Python script.

python3 ./singleheader/amalgamate.py

We require Python 3 or better.

Under Linux and macOS, you may test it as follows:

cd singleheader
c++ -o amalgamation_demo amalgamation_demo.cpp -std=c++17
./amalgamation_demo

Example

Using the single-header version, you could compile the following program.

#include <iostream>
#include <memory>

#include "simdutf.cpp"
#include "simdutf.h"

int main(int argc, char *argv[]) {
  const char *source = "1234";
  // 4 == strlen(source)
  bool validutf8 = simdutf::validate_utf8(source, 4);
  if (validutf8) {
    std::cout << "valid UTF-8" << std::endl;
  } else {
    std::cerr << "invalid UTF-8" << std::endl;
    return EXIT_FAILURE;
  }
  // We need a buffer large enough to hold the UTF-16LE output.
  size_t expected_utf16words = simdutf::utf16_length_from_utf8(source, 4);
  std::unique_ptr<char16_t[]> utf16_output{new char16_t[expected_utf16words]};
  // convert to UTF-16LE
  size_t utf16words =
      simdutf::convert_utf8_to_utf16(source, 4, utf16_output.get());
  std::cout << "wrote " << utf16words << " UTF-16LE words." << std::endl;
  // It wrote utf16words * sizeof(char16_t) bytes.
  bool validutf16 = simdutf::validate_utf16(utf16_output.get(), utf16words);
  if (validutf16) {
    std::cout << "valid UTF-16LE" << std::endl;
  } else {
    std::cerr << "invalid UTF-16LE" << std::endl;
    return EXIT_FAILURE;
  }
  // convert it back:
  // We need a buffer large enough to hold the UTF-8 output.
  size_t expected_utf8words =
      simdutf::utf8_length_from_utf16(utf16_output.get(), utf16words);
  std::unique_ptr<char[]> utf8_output{new char[expected_utf8words]};
  // convert to UTF-8
  size_t utf8words = simdutf::convert_utf16_to_utf8(
      utf16_output.get(), utf16words, utf8_output.get());
  std::cout << "wrote " << utf8words << " UTF-8 words." << std::endl;
  std::string final_string(utf8_output.get(), utf8words);
  std::cout << final_string << std::endl;
  if (final_string != source) {
    std::cerr << "bad conversion" << std::endl;
    return EXIT_FAILURE;
  } else {
    std::cerr << "perfect round trip" << std::endl;
  }
  return EXIT_SUCCESS;
}

API

Our API is made of a few non-allocating functions. They typically take a pointer and a length as a parameter, and they sometimes take a pointer to an output buffer. Users are responsible for memory allocation.

namespace simdutf {


/**
 * Validate the UTF-8 string.
 *
 * Overridden by each implementation.
 *
 * @param buf the UTF-8 string to validate.
 * @param len the length of the string in bytes.
 * @return true if and only if the string is valid UTF-8.
 */
simdutf_warn_unused bool validate_utf8(const char *buf, size_t len) noexcept;

/**
 * Validate the UTF-16LE string.
 *
 * Overridden by each implementation.
 *
 * This function is not BOM-aware.
 *
 * @param buf the UTF-16LE string to validate.
 * @param len the length of the string in number of 2-byte words (char16_t).
 * @return true if and only if the string is valid UTF-16LE.
 */
simdutf_warn_unused bool validate_utf16(const char16_t *buf, size_t len) noexcept;

/**
 * Convert possibly broken UTF-8 string into UTF-16LE string.
 *
 * During the conversion also validation of the input string is done.
 * This function is suitable to work with inputs from untrusted sources.
 *
 * @param input         the UTF-8 string to convert
 * @param length        the length of the string in bytes
 * @param utf16_buffer  the pointer to buffer that can hold conversion result
 * @return the number of written char16_t; 0 if the input is not a valid UTF-8 string
 */
simdutf_warn_unused size_t convert_utf8_to_utf16(const char * input, size_t length, char16_t* utf16_buffer) noexcept;

/**
 * Convert valid UTF-8 string into UTF-16LE string.
 *
 * This function assumes that the input string is valid UTF-8.
 *
 * @param input         the UTF-8 string to convert
 * @param length        the length of the string in bytes
 * @param utf16_buffer  the pointer to buffer that can hold conversion result
 * @return the number of written char16_t
 */
simdutf_warn_unused size_t convert_valid_utf8_to_utf16(const char * input, size_t length, char16_t* utf16_buffer) noexcept;

/**
 * Compute the number of 2-byte words that this UTF-8 string would require in UTF-16LE format.
 *
 * This function does not validate the input.
 *
 * @param input         the UTF-8 string to process
 * @param length        the length of the string in bytes
 * @return the number of char16_t words required to encode the UTF-8 string as UTF-16LE
 */
simdutf_warn_unused size_t utf16_length_from_utf8(const char * input, size_t length) noexcept;

/**
 * Convert possibly broken UTF-16LE string into UTF-8 string.
 *
 * During the conversion also validation of the input string is done.
 * This function is suitable to work with inputs from untrusted sources.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to convert
 * @param length        the length of the string in 2-byte words (char16_t)
 * @param utf8_buffer   the pointer to buffer that can hold conversion result
 * @return number of written bytes (char); 0 if input is not a valid UTF-16LE string
 */
simdutf_warn_unused size_t convert_utf16_to_utf8(const char16_t * input, size_t length, char* utf8_buffer) noexcept;

/**
 * Convert valid UTF-16LE string into UTF-8 string.
 *
 * This function assumes that the input string is valid UTF-16LE.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to convert
 * @param length        the length of the string in 2-byte words (char16_t)
 * @param utf8_buffer   the pointer to buffer that can hold the conversion result
 * @return number of written bytes (char); 0 if conversion is not possible
 */
simdutf_warn_unused size_t convert_valid_utf16_to_utf8(const char16_t * input, size_t length, char* utf8_buffer) noexcept;

/**
 * Compute the number of bytes that this UTF-16LE string would require in UTF-8 format.
 *
 * This function does not validate the input.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to convert
 * @param length        the length of the string in 2-byte words (char16_t)
 * @return the number of bytes required to encode the UTF-16LE string as UTF-8
 */
simdutf_warn_unused size_t utf8_length_from_utf16(const char16_t * input, size_t length) noexcept;

/**
 * Count the number of code points (characters) in the string assuming that
 * it is valid.
 *
 * This function assumes that the input string is valid UTF-16LE.
 *
 * This function is not BOM-aware.
 *
 * @param input         the UTF-16LE string to process
 * @param length        the length of the string in 2-byte words (char16_t)
 * @return number of code points
 */
simdutf_warn_unused size_t count_utf16(const char16_t * input, size_t length) noexcept;

/**
 * Count the number of code points (characters) in the string assuming that
 * it is valid.
 *
 * This function assumes that the input string is valid UTF-8.
 *
 * @param input         the UTF-8 string to process
 * @param length        the length of the string in bytes
 * @return number of code points
 */
simdutf_warn_unused size_t count_utf8(const char * input, size_t length) noexcept;


}
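
The example above does not exercise the counting functions. Here is a minimal sketch of count_utf8 (count_utf16 works the same way, with the length given in char16_t units); it assumes you link against the library or include the amalgamated simdutf.cpp as in the earlier example.

#include <cstdlib>
#include <iostream>

#include "simdutf.h"

int main() {
  // "café" is five UTF-8 bytes but only four code points.
  const char *text = "caf\xc3\xa9";
  size_t textlen = 5;
  if (!simdutf::validate_utf8(text, textlen)) {
    return EXIT_FAILURE;
  }
  // count_utf8 assumes valid UTF-8; we validated just above.
  size_t codepoints = simdutf::count_utf8(text, textlen);
  std::cout << textlen << " UTF-8 bytes, " << codepoints << " code points"
            << std::endl; // prints: 5 UTF-8 bytes, 4 code points
  return EXIT_SUCCESS;
}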

Usage

The library is used by haskell/text.

License

This code is made available under the Apache License 2.0 as well as the MIT license.

We include a few competitive solutions under the benchmarks/competition directory. They are provided for research purposes only.

Issues
  • Preliminary version of the new UTF8=>UTF16 + some benchmarks

    This PR adds some UTF8=>UTF16 benchmarks and a new SSE-based transcoder.

    The new transcoder seems empirically faster. The main reason I built it was to make it more portable. It should be "easily" ported to NEON and AVX; that is my next step.

    The general idea is to first index a block of data. Currently it works in "cache line" sizes. So you find out where the "continuation bytes" are (0b10______) in a 64-byte block, and that tells you right away whether you are dealing with ASCII. If you are, you can just quickly transcode ASCII => UTF16. There is a cheaper way to detect ASCII, but the nice thing here is that if you know where the "continuation bytes" are, you have all you need to know where the UTF-8 characters begin. No need for another movemask.

    Furthermore, given that we are working in large blocks (say 64 bytes), we can repeatedly call a decoding routine with the same mask computed once.

    This should be more portable because NEON lacks movemask and requires many operations to emulate the same result. However, NEON does fine doing a movemask over a large block since you can amortize the cost somewhat.
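
    For illustration only (this is not the PR's code), a minimal SSE2 sketch of computing the continuation-byte bitmap for a 64-byte block could look like this:

    #include <immintrin.h>
    #include <cstdint>

    // Illustrative sketch: bit i of the result is set when block[i] is a
    // UTF-8 continuation byte (0b10______). Assumes SSE2 and a full 64-byte block.
    uint64_t continuation_bitmap(const uint8_t *block) {
      const __m128i top_two_bits = _mm_set1_epi8(static_cast<char>(0xC0));
      const __m128i continuation = _mm_set1_epi8(static_cast<char>(0x80));
      uint64_t bitmap = 0;
      for (int i = 0; i < 4; i++) {
        __m128i chunk = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(block + 16 * i));
        __m128i is_cont =
            _mm_cmpeq_epi8(_mm_and_si128(chunk, top_two_bits), continuation);
        bitmap |= uint64_t(uint16_t(_mm_movemask_epi8(is_cont))) << (16 * i);
      }
      // A zero bitmap means no continuation bytes: the ASCII fast-path signal above.
      return bitmap;
    }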

    Currently, the haswell kernel just contains my previous UTF8=>UTF16 SSE transcoder while the westmere kernel contains the new UTF8=>UTF16 SSE transcoder. Here are the numbers:

    $  ./build/benchmarks/benchmark -P convert -F benchmarks/dataset/wikipedia_mars/chinese.txt 
    testcases: 1
    convert_valid_utf8_to_utf16+fallback, input size: 75146, iterations: 100, 
      18.842 ins/byte,    3.395 GHz,    0.797 GB/s 
    convert_valid_utf8_to_utf16+haswell, input size: 75146, iterations: 100, 
       2.989 ins/byte,    3.398 GHz,    1.673 GB/s 
    convert_valid_utf8_to_utf16+westmere, input size: 75146, iterations: 100, 
       3.253 ins/byte,    3.401 GHz,    2.865 GB/s 
    $ ./build/benchmarks/benchmark -P convert -F benchmarks/dataset/wikipedia_mars/english.txt 
    testcases: 1
    convert_valid_utf8_to_utf16+fallback, input size: 181798, iterations: 100, 
      21.947 ins/byte,    3.395 GHz,    0.836 GB/s 
    convert_valid_utf8_to_utf16+haswell, input size: 181798, iterations: 100, 
       1.192 ins/byte,    3.405 GHz,    8.843 GB/s 
    convert_valid_utf8_to_utf16+westmere, input size: 181798, iterations: 100, 
       1.175 ins/byte,    3.407 GHz,    8.835 GB/s 
    $ ./build/benchmarks/benchmark -P convert -F benchmarks/dataset/wikipedia_mars/french.txt 
    testcases: 1
    convert_valid_utf8_to_utf16+fallback, input size: 245549, iterations: 100, 
      21.790 ins/byte,    3.394 GHz,    0.703 GB/s 
    convert_valid_utf8_to_utf16+haswell, input size: 245549, iterations: 100, 
       2.226 ins/byte,    3.395 GHz,    1.299 GB/s 
    convert_valid_utf8_to_utf16+westmere, input size: 245549, iterations: 100, 
       3.707 ins/byte,    3.396 GHz,    2.045 GB/s 
    
    opened by lemire 20
  • Port to ARM NEON of the utf16 to utf8 transcoder.

    I think it is important to port the utf16 to utf8 transcoder to ARM NEON.

    Apple M1

    | | llvm | arm64 |
    |----|------|--------|
    | Arabic-Lipsum.utf16.txt | 0.338 | 4.881 |
    | Chinese-Lipsum.utf16.txt | 0.389 | 3.312 |
    | Emoji-Lipsum.utf16.txt | 0.351 | 0.591 |
    | Hebrew-Lipsum.utf16.txt | 0.336 | 6.632 |
    | Hindi-Lipsum.utf16.txt | 0.275 | 3.290 |
    | Japanese-Lipsum.utf16.txt | 0.355 | 3.242 |
    | Korean-Lipsum.utf16.txt | 0.375 | 3.324 |
    | Latin-Lipsum.utf16.txt | 0.401 | 21.514 |
    | Russian-Lipsum.utf16.txt | 0.260 | 6.472 |

    These numbers are very, very good.

    Zen2 results:

    | | avx2 |
    |----|------|
    | Arabic-Lipsum.utf16.txt | 3.549 |
    | Chinese-Lipsum.utf16.txt | 2.307 |
    | Emoji-Lipsum.utf16.txt | 0.329 |
    | Hebrew-Lipsum.utf16.txt | 3.543 |
    | Hindi-Lipsum.utf16.txt | 2.228 |
    | Japanese-Lipsum.utf16.txt | 2.252 |
    | Korean-Lipsum.utf16.txt | 2.192 |
    | Latin-Lipsum.utf16.txt | 6.769 |
    | Russian-Lipsum.utf16.txt | 3.540 |

    opened by lemire 13
  • The issue of conversion between UTF-8 and UTF-32

    When I use the conversion between UTF-8 and UTF-32, I get an incorrect result. The sample code is as follows:

    #include <fstream>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    
    #include <simdutf.h>
    
    std::string read_file(const char *path) {
      std::ifstream ifs;
      ifs.open(path);
    
      if (!ifs) {
        throw std::runtime_error("Can not open file");
      }
    
      std::string data;
    
      auto size = ifs.seekg(0, std::ifstream::end).tellg();
      data.resize(size);
      ifs.seekg(0, std::ifstream::beg).read(std::data(data), size);
    
      return data;
    }
    
    std::u32string utf8_to_utf32(const std::string &str) {
      auto source = std::data(str);
      auto source_size = std::size(str);
    
      if (!simdutf::validate_utf8(source, source_size)) {
        throw std::runtime_error("validate_utf8() failed");
      }
    
      std::u32string result;
      result.resize(simdutf::utf32_length_from_utf8(source, source_size));
      simdutf::convert_valid_utf8_to_utf32(source, source_size, std::data(result));
    
      return result;
    }
    
    std::string utf32_to_utf8(const std::u32string &str) {
      auto source = std::data(str);
      auto source_size = std::size(str);
    
      if (!simdutf::validate_utf32(source, source_size)) {
        throw std::runtime_error("validate_utf32() failed");
      }
    
      std::string result;
      result.resize(simdutf::utf8_length_from_utf32(source, source_size));
      simdutf::convert_valid_utf32_to_utf8(source, source_size, std::data(result));
    
      return result;
    }
    
    int main() {
      auto source = read_file("100012892.txt");
    
      auto utf32 = utf8_to_utf32(source);
      auto utf8 = utf32_to_utf8(utf32);
    
      if (source != utf8) {
        std::cerr << "error\n";
      }
    }
    

    The test file is here

    The diff file is here

    Did I do something wrong?

    bug 
    opened by KaiserLancelot 11
  • Ascii validation

    Fixes #103

    This provides the analogous method validate_ascii, which checks whether a string is valid ASCII (i.e., the most significant bit of each byte is 0).
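
    For reference, a scalar sketch of that check (illustrative only; the PR's validate_ascii is SIMD-accelerated):

    #include <cstddef>
    #include <cstdint>

    // A buffer is ASCII when no byte has its most significant bit set.
    bool is_ascii_scalar(const char *buf, size_t len) {
      uint8_t running_or = 0;
      for (size_t i = 0; i < len; i++) {
        running_or |= static_cast<uint8_t>(buf[i]);
      }
      return (running_or & 0x80) == 0;
    }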

    opened by NicolasJiaxin 8
  • Make atomic_ptr optional for active_implementation?

    When cross-compiling simdutf targeting wasm32-wasi, compilation will fail due to the lack of <atomic> and such. Would you accept a patch that opts out of atomic_ptr on platforms that lack threads and atomics?

    opened by TerrorJack 8
  • This provides output-size functions.

    In practice, users will want to know exactly how big the output buffer should be. One way to solve this is to proceed with a two-pass run where a first pass goes really quickly through the input and determines just the output size.

    Tests and acceleration are missing so it is a WIP.

    (I have come to believe that this is an important missing functionality.)

    opened by lemire 8
  • Implemented Inoue et al. (2008) as a competitor + added Cameron (2008)

    This is my honest attempt at implementing what I believe is the first recorded instance of a practical SIMD UTF8-to-UTF16 transcoder. The original implementation was meant for the POWER processor. I could not find it anywhere. There are slides at https://researcher.watson.ibm.com/researcher/files/jp-INOUEHRS/IPSJPRO2008_SIMDdecoding.pdf

    I implemented it both for ARM NEON and for x64 (SSE). Because of how the algorithm is designed, it seemed best suited for ARM NEON.

    It does not handle 4-byte UTF-8 sequences. It does not validate the input.

    Reference: Hiroshi Inoue and Hideaki Komatsu and Toshio Nakatani, Accelerating UTF-8 Decoding Using SIMD Instructions (in Japanese), Information Processing Society of Japan Transactions on Programming 1 (2), 2008.

    The performance is not good but the algorithm is pretty.

    ❯ ./buildarm/benchmarks/benchmark -P convert_valid_utf8_to_utf16  -F benchmarks/dataset/wikipedia_mars/arabic.txt
    testcases: 1
    input detected as UTF8
    current system detected as arm64
    ===========================
    convert_valid_utf8_to_utf16+arm64, input size: 945989, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
    kpc_set_config failed, run the program with sudo
       3.418 GB/s (1.1 %)
    convert_valid_utf8_to_utf16+fallback, input size: 945989, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       1.926 GB/s (1.1 %)
    convert_valid_utf8_to_utf16+inoue2008, input size: 945989, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       1.373 GB/s (0.5 %)
    
    ❯ ./buildarm/benchmarks/benchmark -P convert_valid_utf8_to_utf16  -F benchmarks/dataset/wikipedia_mars/english.txt
    testcases: 1
    input detected as UTF8
    current system detected as arm64
    ===========================
    convert_valid_utf8_to_utf16+arm64, input size: 991380, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
    kpc_set_config failed, run the program with sudo
      13.930 GB/s (15.1 %)
    WARNING: Measurements are noisy, try increasing iteration count (-I).
    convert_valid_utf8_to_utf16+fallback, input size: 991380, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       5.074 GB/s (0.5 %)
    convert_valid_utf8_to_utf16+inoue2008, input size: 991380, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
      11.455 GB/s (1.3 %)
    

    AMD Rome results...

    $ ./build/benchmarks/benchmark -P convert_valid_utf8_to_utf16  -F benchmarks/dataset/wikipedia_mars/arabic.txt
    testcases: 1
    input detected as UTF8
    current system detected as haswell
    ===========================
    convert_valid_utf8_to_utf16+fallback, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
      11.489 ins/byte,    3.394 GHz,    0.760 GB/s (1.2 %),    2.574 ins/cycle, 0.0802461 b.misses/byte, 0 c.mis/byte 
    convert_valid_utf8_to_utf16+haswell, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       2.202 ins/byte,    3.396 GHz,    3.018 GB/s (2.9 %),    1.956 ins/cycle, 0.00351404 b.misses/byte, 0 c.mis/byte 
    convert_valid_utf8_to_utf16+inoue2008, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       5.894 ins/byte,    3.394 GHz,    0.655 GB/s (0.6 %),    1.138 ins/cycle, 0.00204923 b.misses/byte, 0 c.mis/byte 
    convert_valid_utf8_to_utf16+westmere, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       2.620 ins/byte,    3.396 GHz,    2.858 GB/s (1.2 %),    2.205 ins/cycle, 0.00383248 b.misses/byte, 0 c.mis/byte 
    $ ./build/benchmarks/benchmark -P convert_valid_utf8_to_utf16  -F benchmarks/dataset/wikipedia_mars/english.txt
    testcases: 1
    input detected as UTF8
    current system detected as haswell
    ===========================
    convert_valid_utf8_to_utf16+fallback, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       3.969 ins/byte,    3.398 GHz,    3.061 GB/s (0.6 %),    3.575 ins/cycle, 0.00213974 b.misses/byte, 0 c.mis/byte 
    convert_valid_utf8_to_utf16+haswell, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       0.754 ins/byte,    3.404 GHz,    9.580 GB/s (0.9 %),    2.123 ins/cycle, 0.000302534 b.misses/byte, 0 c.mis/byte 
    convert_valid_utf8_to_utf16+inoue2008, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       1.853 ins/byte,    3.401 GHz,    5.891 GB/s (0.6 %),    3.210 ins/cycle, 0.00121013 b.misses/byte, 0 c.mis/byte 
    convert_valid_utf8_to_utf16+westmere, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       1.163 ins/byte,    3.406 GHz,    9.114 GB/s (1.3 %),    3.113 ins/cycle, 0.000286032 b.misses/byte, 0 c.mis/byte 
    

    I also added an implementation of Cameron (2008) that I found. There is no NEON support, but it looks like it works under x64.

    Reference: Cameron, Robert D, A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, 91--98.

    It claims to be a validating transcoder but I have not tested it at all. It beats our fallback code, but it does not get near our validating UTF8-UTF16 transcoder.

     ./build/benchmarks/benchmark -P convert_utf8_to_utf16  -F benchmarks/dataset/wikipedia_mars/arabic.txt
    testcases: 1
    input detected as UTF8
    current system detected as haswell
    ===========================
    convert_utf8_to_utf16+fallback, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
      14.540 ins/byte,   14.540 cycle/byte,    3.394 GHz,    0.648 GB/s (1.0 %),    2.777 ins/cycle, 0.084086 b.misses/byte, 0 c.mis/byte 
    convert_utf8_to_utf16+haswell, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       2.965 ins/byte,    2.965 cycle/byte,    3.396 GHz,    2.757 GB/s (1.1 %),    2.408 ins/cycle, 0.00372009 b.misses/byte, 0 c.mis/byte 
    convert_utf8_to_utf16+u8u16, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       8.530 ins/byte,    8.530 cycle/byte,    3.395 GHz,    1.491 GB/s (0.8 %),    3.748 ins/cycle, 0.000157345 b.misses/byte, 0 c.mis/byte 
    convert_utf8_to_utf16+westmere, input size: 266929, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/arabic.txt
       4.596 ins/byte,    4.596 cycle/byte,    3.395 GHz,    2.169 GB/s (0.8 %),    2.935 ins/cycle, 0.00332298 b.misses/byte, 0 c.mis/byte 
    $ ./build/benchmarks/benchmark -P convert_utf8_to_utf16  -F benchmarks/dataset/wikipedia_mars/english.txt 
    testcases: 1
    input detected as UTF8
    current system detected as haswell
    ===========================
    convert_utf8_to_utf16+fallback, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       3.435 ins/byte,    3.435 cycle/byte,    3.397 GHz,    3.015 GB/s (0.9 %),    3.048 ins/cycle, 0.00174919 b.misses/byte, 0 c.mis/byte 
    convert_utf8_to_utf16+haswell, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       0.793 ins/byte,    0.793 cycle/byte,    3.405 GHz,    9.709 GB/s (2.4 %),    2.261 ins/cycle, 0.000143016 b.misses/byte, 0 c.mis/byte 
    convert_utf8_to_utf16+u8u16, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       2.489 ins/byte,    2.489 cycle/byte,    3.399 GHz,    4.294 GB/s (4.5 %),    3.144 ins/cycle, 0.000192521 b.misses/byte, 0 c.mis/byte 
    convert_utf8_to_utf16+westmere, input size: 181798, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       1.220 ins/byte,    1.220 cycle/byte,    3.406 GHz,    8.571 GB/s (2.9 %),    3.071 ins/cycle, 0.00017602 b.misses/byte, 0 c.mis/byte 
    
    opened by lemire 7
  • Introducing a fast path in the utf8 to utf16 transcoder

    This PR can be hugely beneficial in some cases (+30% on the Chinese-Lipsum.utf8.txt file).

    Before:

    convert_utf8_to_utf16+haswell, input size: 69840, iterations: 400, dataset: unicode_lipsum/lipsum/Chinese-Lipsum.utf8.txt
       3.694 ins/byte,    3.694 cycle/byte,    2.901 GB/s (1.5 %), 0.000057 b.misses/byte, 0.000000 c.mis/byte,    3.404 GHz,    3.148 ins/cycle
      10.996 ins/char,   10.996 cycle/char,    0.974 Gc/s (1.5 %), 0.000171 b.misses/char, 0.000000 c.mis/char,     2.98 byte/char
    

    After:

    convert_utf8_to_utf16+haswell, input size: 69840, iterations: 400, dataset: unicode_lipsum/lipsum/Chinese-Lipsum.utf8.txt
       3.612 ins/byte,    3.612 cycle/byte,    4.004 GB/s (2.7 %), 0.002205 b.misses/byte, 0.000000 c.mis/byte,    3.407 GHz,    4.245 ins/cycle
      10.754 ins/char,   10.754 cycle/char,    1.345 Gc/s (2.7 %), 0.006564 b.misses/char, 0.000000 c.mis/char,     2.98 byte/char
    

    The gist of it is that we introduce some fast paths.

    When the branch predictor does well, we sidestep the data dependency in our slow loop and really shine.

    Before merging this, I will review carefully all experimental results.

    opened by lemire 6
  • Apple instrumentation (ARM) + enabling sanitizer builds.

    Apple instrumentation (see https://lemire.me/blog/2021/03/24/counting-cycles-and-instructions-on-the-apple-m1-processor/) is somewhat depressing.... The Apple processor can retire up to 8 instructions per cycle, and probably 4 NEON instructions, but my numbers are far below that...

    ❯ sudo ./build/benchmarks/benchmark -P convert_utf8_to_utf16 -F benchmarks/dataset/wikipedia_mars/english.txt
    Password:
    testcases: 1
    convert_utf8_to_utf16+arm64, input size: 991380, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       0.622 ins/byte,    3.207 GHz,   14.806 GB/s (1.0 %),    2.873 ins/cycle, 0.000329843 b.misses/byte, 0.00215356 c.mis/byte 
    convert_utf8_to_utf16+fallback, input size: 991380, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/english.txt
       3.438 ins/byte,    3.205 GHz,    5.041 GB/s (1.5 %),    5.407 ins/cycle, 0.00105711 b.misses/byte, 0.0083066 c.mis/byte 
    ❯ sudo ./build/benchmarks/benchmark -P convert_utf8_to_utf16 -F benchmarks/dataset/wikipedia_mars/chinese.txt 
    testcases: 1
    convert_utf8_to_utf16+arm64, input size: 378464, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/chinese.txt
       2.646 ins/byte,    3.206 GHz,    2.693 GB/s (1.2 %),    2.223 ins/cycle, 0.00921356 b.misses/byte, 0.00121015 c.mis/byte 
    convert_utf8_to_utf16+fallback, input size: 378464, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/chinese.txt
       6.188 ins/byte,    3.206 GHz,    2.440 GB/s (0.8 %),    4.711 ins/cycle, 0.0174548 b.misses/byte, 0.00228555 c.mis/byte 
    
    
    ❯ sudo ./build/benchmarks/benchmark -P convert_utf8 -F benchmarks/dataset/wikipedia_mars/chinese.txt 
    testcases: 1
    convert_utf8_to_utf16+arm64, input size: 378464, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/chinese.txt
       2.646 ins/byte,    3.206 GHz,    2.678 GB/s (7.2 %),    2.211 ins/cycle, 0.00967331 b.misses/byte, 0.0010939 c.mis/byte 
    convert_utf8_to_utf16+fallback, input size: 378464, iterations: 400, dataset: benchmarks/dataset/wikipedia_mars/chinese.txt
       6.188 ins/byte,
    

    I also made it possible to sanitize our code. It could be useful for checking code sanity, but it did not find any bugs.

    opened by lemire 6
  • API project

    https://github.com/lemire/simdutf/blob/19a0db4cc5938abdf1893ae58ae472648ceff8fd/include/simdutf/implementation.h#L10

    Let me share my vision of the API, i.e., what functions we should provide:

    simdutf::validate_utf8
    simdutf::validate_utf16
    simdutf::utf8_to_utf16 -- with error checking
    simdutf::utf16_to_utf8 -- with error checking
    simdutf::unchecked::utf8_to_utf16 -- without error checking
    simdutf::unchecked::utf16_to_utf8 -- without error checking
    
    // maybe later
    simdutf::utf8_length
    simdutf::utf16_length
    

    Do you think it makes sense?

    opened by WojciechMula 6
  • Fix issue #132.

    Fix #132. There is an issue when using convert_valid_utf8_to_utf32 related to a non-ASCII character following a long chain of ASCII characters. Credit to @KaiserLancelot for identifying the bug.
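
    A minimal regression-style sketch of the scenario (a long ASCII prefix followed by one non-ASCII character), assuming the UTF-32 functions and signatures shown in the UTF-8/UTF-32 issue report above:

    #include <cassert>
    #include <string>
    #include <vector>

    #include "simdutf.h"

    int main() {
      std::string input(1024, 'a');
      input += "\xc3\xa9"; // U+00E9, a two-byte UTF-8 sequence
      assert(simdutf::validate_utf8(input.data(), input.size()));
      size_t expected = simdutf::utf32_length_from_utf8(input.data(), input.size());
      std::vector<char32_t> output(expected);
      size_t written = simdutf::convert_valid_utf8_to_utf32(
          input.data(), input.size(), output.data());
      // 1024 ASCII code points plus one non-ASCII code point.
      assert(written == 1025);
      return 0;
    }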

    opened by NicolasJiaxin 5
  • Build fast one-pass "detect encoding function"

    Currently we might do something like this...

    simdutf_warn_unused encoding_type implementation::autodetect_encoding(const char * input, size_t length) const noexcept {
        // If there is a BOM, then we trust it.
        auto bom_encoding = simdutf::BOM::check_bom(input, length);
        if(bom_encoding != encoding_type::unspecified) { return bom_encoding; }
        // UTF8 is common, it includes ASCII, and is commonly represented
        // without a BOM, so if it fits, go with that. Note that it is still
        // possible to get it wrong, we are only 'guessing'. If someone has UTF-16
        // data without a BOM, it could pass as UTF-8.
        //
        // An interesting twist might be to check for UTF-16 ASCII first (every
        // other byte is zero).
        if(validate_utf8(input, length)) { return encoding_type::UTF8; }
        // The next most common encoding that might appear without BOM is probably
        // UTF-16LE, so try that next.
        if((length % 2) == 0) {
          if(validate_utf16(reinterpret_cast<const char16_t*>(input), length/2)) { return encoding_type::UTF16_LE; }
        }
        if((length % 4) == 0) {
          if(validate_utf32(reinterpret_cast<const char32_t*>(input), length/4)) { return encoding_type::UTF32_LE; }
        }
        return encoding_type::unspecified;
    }
    

    It would not normally be so bad, but our SIMD validate_utf* functions are optimistic: they expect the input to be valid and so they always scan the whole input. If the input is UTF-8, it is pretty good... We could use our fallback functions instead...

    Importantly, the 'answer' is not well defined. What we would like is a function that returns which Unicode encodings are possible given the input. So the result is an enumeration: utf8, utf16_le, utf16_be, utf32_le, utf32_be...?

    But it seems that there ought to be a one-pass algorithm to detect the encoding. So you scan the input, using a single pass, and when you are done, you know exactly which encodings are allowed.
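
    A hypothetical sketch of such a result type (not part of the simdutf API), using bit flags so that several encodings can remain possible at once; the length checks mirror the modulo tests in the snippet above:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical result type: each set bit marks an encoding that the
    // input could still be. Purely illustrative.
    enum possible_encodings : uint32_t {
      maybe_utf8     = 1u << 0,
      maybe_utf16_le = 1u << 1,
      maybe_utf16_be = 1u << 2,
      maybe_utf32_le = 1u << 3,
      maybe_utf32_be = 1u << 4,
    };

    // Cheap structural narrowing before any scanning.
    inline uint32_t narrow_by_length(size_t length) {
      uint32_t result = maybe_utf8 | maybe_utf16_le | maybe_utf16_be |
                        maybe_utf32_le | maybe_utf32_be;
      if ((length % 2) != 0) { result &= ~(maybe_utf16_le | maybe_utf16_be); }
      if ((length % 4) != 0) { result &= ~(maybe_utf32_le | maybe_utf32_be); }
      return result;
    }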

    opened by lemire 2
  • Add support for UTF-32BE

    Currently, we refer to UTF-32LE as UTF-32. We need to add support for UTF-32BE, which is relatively easy.

    Related: https://github.com/simdutf/simdutf/issues/3

    opened by lemire 1