A Haskell library for fast decoding of JSON documents using the simdjson C++ library

hermes

A Haskell interface over the simdjson C++ library for decoding JSON documents. Hermes, messenger of the gods, was the maternal great-grandfather of Jason, son of Aeson.

Overview

This library exposes functions that can be used to write decoders for JSON documents using the simdjson On Demand API. From the simdjson On Demand design documentation:

Good applications for the On Demand API might be:

  • You are working from pre-existing large JSON files that have been vetted. You expect them to be well formed according to a known JSON dialect and to have a consistent layout. For example, you might be doing biomedical research or machine learning on top of static data dumps in JSON.
  • Both the generation and the consumption of JSON data are within your system. Your team controls both the software that produces the JSON and the software that parses it, and your team knows and controls the hardware. Thus you can fully test your system.
  • You are working with stable JSON APIs which have a consistent layout and JSON dialect.

With this in mind, Data.Hermes parsers can potentially decode Haskell types faster than traditional Data.Aeson.FromJSON instances, especially in cases where you only need to decode a subset of the document. This is because Data.Aeson.FromJSON converts the entire document into a Data.Aeson.Value, which means memory usage increases linearly with the input size. The simdjson::ondemand API does not have this constraint because it iterates over the JSON string in memory without constructing any abstract representation.
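As a rough illustration of decoding only a subset of a document, the sketch below pulls two fields out of each object in an array and never builds Haskell values for the remaining keys. It is written against the combinators shown in this README (withObject, atKey, list, decodeEither); the StatusSummary type and the "id" and "text" keys are hypothetical and chosen only for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.Hermes
import Data.Text (Text)

-- Hypothetical record: two fields taken from a much larger status object.
data StatusSummary = StatusSummary
  { statusId   :: Int
  , statusText :: Text
  } deriving (Eq, Show)

-- Only the keys named here are decoded into Haskell values. Any other
-- keys in the object are never requested by this decoder.
statusSummaryDecoder :: Value -> Decoder StatusSummary
statusSummaryDecoder = withObject $ \obj ->
  StatusSummary
    <$> atKey "id" int obj
    <*> atKey "text" text obj

-- Decode a top-level JSON array of status objects.
decodeSummaries :: ByteString -> Either HermesException [StatusSummary]
decodeSummaries = decodeEither $ list statusSummaryDecoder
```

An equivalent Data.Aeson.FromJSON instance would first materialize every object, including the unused keys, as a Data.Aeson.Value.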

Usage

This library does not offer a Haskell API over the entire simdjson On Demand API. It currently binds only to what is needed for defining and running a Decoder. You can see the tests and benchmarks for example usage. Decoder a is a thin layer over IO that keeps some context around for better error messages. simdjson::ondemand exceptions will be caught and re-thrown with enough information to troubleshoot. In the worst case you may run into a segmentation fault that is not caught, which you are encouraged to report as a bug.

Decoders

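{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.Hermes
import Data.Scientific (Scientific)
import Data.Text (Text)

-- Assumed shape of Person: _id, index, guid, isActive, balance, picture, latitude.
data Person = Person Text Int Text Bool Text (Maybe Text) Scientific

-- Decode one JSON object into a Person, one key per field.
personDecoder :: Value -> Decoder Person
personDecoder = withObject $ \obj ->
  Person
    <$> atKey "_id" text obj
    <*> atKey "index" int obj
    <*> atKey "guid" text obj
    <*> atKey "isActive" bool obj
    <*> atKey "balance" text obj
    <*> atKey "picture" (nullable text) obj
    <*> atKey "latitude" scientific obj

-- Decode a strict ByteString.
decodePersons :: ByteString -> Either HermesException [Person]
decodePersons = decodeEither $ list personDecoder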

It looks a little like Waargonaut.Decode.Decoder m, just not as polymorphic. The interface is copied because it's elegant and does not rely on typeclasses. However, hermes does not give you a cursor to play with; the cursor is implied and forward-only (except when accessing object fields). This limitation allows us to write very fast decoders.

Exceptions

When decoding fails for a known reason, you will get a Left HermesException indicating whether the error came from simdjson or from an internal hermes call. The exception contains an HError record with some useful information, for example:

*Main> decodeEither (withObject . atKey "hello" $ list text) "{ \"hello\": [\"world\", false] }" 
Left (SIMDException (HError {path = "/hello/1", errorMsg = "Error while getting value of type text. The JSON element does not have the requested type.", docLocation = "false] }", docDebug = "json_iterator [ depth : 3, structural : 'f', offset : 21', error : No error ]"}))
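In application code the Either is usually handled explicitly. The following is a minimal sketch that assumes nothing about the exception type beyond the Show instance visible in the REPL output above; helloDecoder and describeResult are hypothetical names for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.Hermes
import Data.Text (Text)

-- Decode the "hello" field of an object as a list of text values.
helloDecoder :: Value -> Decoder [Text]
helloDecoder = withObject . atKey "hello" $ list text

-- Turn the result into a message suitable for logging. Only the Show
-- instance of HermesException is used, so no constructor or field
-- names are assumed here.
describeResult :: ByteString -> String
describeResult input =
  case decodeEither helloDecoder input of
    Right values -> "decoded " <> show (length values) <> " value(s)"
    Left err     -> "decode failed: " <> show err
```

If the constructors and the HError fields are exported by your version of the library, pattern matching on them gives access to the path and error message separately.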

Benchmarks

We benchmark decoding a very small object into a Map, full decoding of a large-ish (12 MB) JSON array of objects, and then a partial decoding of Twitter status objects to highlight the on-demand benefits.

Benchmark machine: Intel Core i7-7500U @ 2.70GHz, 2x8GB LPDDR3 RAM

[Figure: benchmark results, non-threaded runtime]

[Figure: benchmark results, threaded runtime]

Performance Tips

  • Decode to Text instead of String wherever possible!
  • Decode to Int or Double instead of Scientific if you can.
  • If you know the key ordering of the JSON then you can use atOrderedKey instead of atKey. This is faster but it cannot handle missing keys.
  • You can improve performance by holding onto your own HermesEnv. Each call to decodeEither creates and destroys the simdjson instances, which adds overhead. Beware: do not share a HermesEnv across multiple threads.
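The first three tips can be combined in one decoder. This sketch assumes that atOrderedKey takes the same arguments as atKey, which the README implies but does not show; check the signature in your version of the library. The Point type and its keys are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.Hermes
import Data.Text (Text)

-- Hypothetical record using Text, Int and Double rather than
-- String and Scientific.
data Point = Point Text Int Double
  deriving (Eq, Show)

-- Assumes the producer always writes "label", "count", "weight" in
-- exactly this order and never omits one of them.
-- Assumption: atOrderedKey has the same argument order as atKey.
pointDecoder :: Value -> Decoder Point
pointDecoder = withObject $ \obj ->
  Point
    <$> atOrderedKey "label" text obj
    <*> atOrderedKey "count" int obj
    <*> atOrderedKey "weight" double obj

decodePoints :: ByteString -> Either HermesException [Point]
decodePoints = decodeEither $ list pointDecoder
```

If the key order is not guaranteed by whatever produces the JSON, atKey is the safer choice.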

Limitations

Because the On Demand API uses a forward-only iterator (except for object fields), you must be mindful to not access values out of order. In other words, you should not hold onto a Value to parse later since the iterator may have already moved beyond it.
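A minimal sketch of this discipline: convert each field to a plain Haskell value inside the callback, and never return the raw Value. The orderDecoder name and its keys are hypothetical, and the commented-out anti-pattern is shown only to illustrate what to avoid.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.Hermes
import Data.Text (Text)

-- Each field is fully decoded inside the callback, so nothing that
-- refers to the iterator escapes the decoder.
orderDecoder :: Value -> Decoder (Text, [Int])
orderDecoder = withObject $ \obj ->
  (,) <$> atKey "name" text obj
      <*> atKey "items" (list int) obj

-- Anti-pattern (do not do this): handing back the raw Value lets the
-- caller try to parse it after the iterator has moved past it.
--
-- leakValue :: Value -> Decoder Value
-- leakValue = withObject $ \obj -> atKey "items" pure obj

decodeOrder :: ByteString -> Either HermesException (Text, [Int])
decodeOrder = decodeEither orderDecoder
```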

Further work is coming to wrap the simdjson::dom API, which should allow walking the DOM in any order you want, but at the expense of parsing the entire document into a DOM.

Because the On Demand API does not validate the entire document upon creating the iterator (besides UTF-8 validation and basic well-formed checks), it is possible to parse an invalid JSON document but not realize it until later. If you need the entire document to be validated up front then a DOM parser is a better fit for you.

The On Demand approach is less safe than DOM: we only validate the components of the JSON document that are used and it is possible to begin ingesting an invalid document only to find out later that the document is invalid. Are you fine ingesting a large JSON document that starts with well formed JSON but ends with invalid JSON content?

This library currently cannot decode scalar documents, e.g. a single string, number, boolean, or null as a JSON document.
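One possible workaround, offered as a sketch and not as part of the library, is to wrap the input in array brackets and decode a one-element list. decodeScalar is a hypothetical helper; it copies the input once to add the brackets and reports errors as a String.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.Hermes

-- Hypothetical helper: decode a document consisting of a single scalar
-- by wrapping it as a one-element JSON array.
decodeScalar :: (Value -> Decoder a) -> ByteString -> Either String a
decodeScalar decoder input =
  case decodeEither (list decoder) ("[" <> input <> "]") of
    Right [x] -> Right x
    Right _   -> Left "expected exactly one scalar value"
    Left err  -> Left (show err)
```

For example, decodeScalar int applied to the bytes 42 would go through the array path instead of being rejected as a scalar document.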

Portability

Per the simdjson documentation:

A recent compiler (LLVM clang6 or better, GNU GCC 7.4 or better, Xcode 11 or better) on a 64-bit (PPC, ARM or x64 Intel/AMD) POSIX systems such as macOS, freeBSD or Linux. We require that the compiler supports the C++11 standard or better.

However, this library relies on std::string_view without a shim, so C++17 or better is highly recommended.

Issues
  • Make `-march=native` opt-in

    simdjson detects CPU capabilities at runtime and selects the appropriate optimized routines. So -march=native is not essential to getting good performance, and I think it should be explicitly chosen by people who are sure they'll be running binaries on the same machines they build on.

    This PR is motivated by a SIGILL I caught after building an application on an AVX-512-enabled Intel machine and running it on a slightly older AMD machine.

    opened by ethercrow 2
  • Null byte emits arbitrary garbage data

    repl> newtype Person = Person Int deriving Show
    repl> personDecoder = withObject $ \obj -> Person <$> atKey "lol" int obj
    repl> decodeEither personDecoder "\x00"
    Left (SIMDException (HError {path = "", errorMsg = "A JSON document made of a scalar (number, Boolean, null or string) is treated as a value. Use get_bool(), get_double(), etc. on the document instead. ", docLocation = "", docDebug = "json_iterator [ depth : 1, structural : '\NUL/\DEL\NUL\NULO\aB\NUL\NUL\NULA\nB\NUL\NUL\NULh(\DEL\NUL\NUL\SOH\NUL\NUL\NUL"}))
    repl> decodeEither personDecoder "\x00"
    Left (SIMDException (HError {path = "", errorMsg = "A JSON document made of a scalar (number, Boolean, null or string) is treated as a value. Use get_bool(), get_double(), etc. on the document instead. ", docLocation = "", docDebug = "json_iterator [ depth : 1, structural : '\NULM\ETXB\NUL\NUL\NUL`(\DEL\NUL\NUL9M\ETXB\NUL\NUL\NUL7M\ETXB\NUL\NUL\NUL\EMe^\EOT"}))
    repl> decodeEither personDecoder "\x00"
    Left (SIMDException (HError {path = "", errorMsg = "A JSON document made of a scalar (number, Boolean, null or string) is treated as a value. Use get_bool(), get_double(), etc. on the document instead. ", docLocation = "", docDebug = "json_iterator [ depth : 1, structural : '\NUL\STX\tB\NUL\NUL\NULy\ETXB\NUL\NUL\NULJ~\ETXB\NUL\NUL\NULh(\DEL\NUL\NUL\SOH\NUL\NUL\NUL"}))
    

    Scroll to the end of the line to see some garbage data.

    The fact that the same input bytestring returns unequal errors breaks referential transparency, which users expect.

    This also makes it harder to use Hermes in property testing.

    Here is an expression that I would expect to be True:

    repl> fromLeft undefined (decodeEither personDecoder "\x00") == fromLeft undefined (decodeEither personDecoder "\x00")
    False
    
    opened by ysangkok 2
  • Confusing docLocation on missing key (and crash)

    *Data.Hermes> newtype Person = Person Int deriving Show
    *Data.Hermes> 
    *Data.Hermes> personDecoder = withObject $ \obj -> Person <$> atKey "lol" int obj
    *Data.Hermes> decodeEither personDecoder "{}"
    Left (SIMDException (HError {path = "/lol", errorMsg = "The JSON field referenced does not exist in this object.", docLocation = "/\DEL", docDebug = "json_iterator [ depth : 0, structural : 'n', offset : 2', error : No error ]"}))
    

    What does the "/\DEL" mean here? I am worried that it is accessing something out of bounds. I would expect the location to be /.

    So I tried running it a few times and something is definitely wrong here:

    *Data.Hermes> decodeEither personDecoder "{}"
    *** Exception: Cannot decode byte '\xb8': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
    *Data.Hermes> decodeEither personDecoder "{}"
    cabal: repl failed for hermes-json-0.1.0.0. The build process segfaulted (i.e.
    SIGSEGV).
    

    This is on Debian Bullseye and ghc 8.10.7 from ghcup and commit

    commit a3fe3eb9d8da07295d7c7ac3030dc5daaf89f27f (HEAD -> master, origin/master, origin/HEAD)
    Author: Josh Miller <[email protected]>
    Date:   Tue Nov 2 17:18:48 2021 -0500
    
        use CStringLen for JSON pointer as well
    
    opened by ysangkok 2
  • Benchmarks should specify version of compiler/Aeson used and its flags

    I just found out about the cffi flag for Aeson. It would be nice to know whether the benchmarks in the readme were performed with or without that flag set.

    opened by ysangkok 0
  • Non-finite values of IEEE 754 floats are not decodable

    Ok, maybe you don't even wanna support this. But it would be nice to specify if this is on purpose.

    Aeson 2 can roundtrip plus/minus infinity and NaN just fine:

    repl> Data.Aeson.decode @Double $ encode @Double ((1)/0)
    Just Infinity
    repl> Data.Aeson.decode @Double $ encode @Double ((-1)/0)
    Just (-Infinity)
    repl> Data.Aeson.decode @Double $ encode @Double ((0)/0)
    Just NaN
    

    But Hermes can't decode these encodings:

    repl> newtype Person = Person Double deriving (Eq, Show)
    repl> personDecoder = Data.Hermes.withObject $ \obj -> Person <$> atKey "lol" double obj
    repl> Data.Hermes.decodeEither personDecoder $ "{\"lol\": " <> (toStrict $ encode @Double ((1)/0)) <> "}"
    Left (SIMDException (HError {path = "/lol", errorMsg = "Error while getting value of type double. The JSON element does not have the requested type.", docLocation = "\"+inf\"}", docDebug = "json_iterator [ depth : 2, structural : '\"', offset : 8', error : No error ]"}))
    repl> Data.Hermes.decodeEither personDecoder $ "{\"lol\": " <> (toStrict $ encode @Double ((-1)/0)) <> "}"
    Left (SIMDException (HError {path = "/lol", errorMsg = "Error while getting value of type double. The JSON element does not have the requested type.", docLocation = "\"-inf\"}", docDebug = "json_iterator [ depth : 2, structural : '\"', offset : 8', error : No error ]"}))
    repl> Data.Hermes.decodeEither personDecoder $ "{\"lol\": " <> (toStrict $ encode @Double ((0)/0)) <> "}"
    Left (SIMDException (HError {path = "/lol", errorMsg = "Error while getting value of type double. The JSON element does not have the requested type.", docLocation = "null}", docDebug = "json_iterator [ depth : 2, structural : 'n', offset : 8', error : No error ]"}))
    

    Given that people are likely to use this library as a replacement for Aeson, I think it would be useful to point out differences like this.

    opened by ysangkok 3
Releases(0.2.0.0)
Owner
Josh Miller
(´・ω・)っ λ