The Universal Storage Engine

Overview

TileDB is a powerful engine for storing and accessing dense and sparse multi-dimensional arrays, which can help you model any complex data efficiently. It is an embeddable C++ library that works on Linux, macOS, and Windows. It is open-sourced under the permissive MIT License, developed and maintained by TileDB, Inc. To distinguish this project from other TileDB offerings, we often refer to it as TileDB Embedded.

TileDB includes the following features:

  • Support for both dense and sparse arrays
  • Support for dataframes and key-value stores (via sparse arrays)
  • Cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage)
  • Chunked (tiled) arrays
  • Multiple compression, encryption and checksum filters
  • Fully multi-threaded implementation
  • Parallel IO
  • Data versioning (rapid updates, time traveling)
  • Array metadata
  • Array groups
  • Numerous APIs on top of the C++ library
  • Numerous integrations (Spark, Dask, MariaDB, GDAL, etc.)

You can use TileDB to store data in a variety of applications, such as Genomics, Geospatial, Finance and more. The power of TileDB stems from the fact that any data can be modeled efficiently as either a dense or a sparse multi-dimensional array, which is the format used internally by most data science tooling. By storing your data and metadata in TileDB arrays, you abstract away all the data storage and management pains, while efficiently accessing the data with your favorite data science tool.

Quickstart

You can install the TileDB library as follows:

# Homebrew (macOS):
$ brew update
$ brew install tiledb-inc/stable/tiledb

# Or Conda (macOS, Linux, Windows):
$ conda install -c conda-forge tiledb

Alternatively, you can use the Docker image we provide:

$ docker pull tiledb/tiledb
$ docker run -it tiledb/tiledb

We include several examples in the examples/ directory of the repository.
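For instance, here is a minimal sketch of creating, writing, and reading a small dense array with the C++ API (the array name, dimension, and attribute below are illustrative; the bundled examples cover many more cases):

#include <tiledb/tiledb>

#include <iostream>
#include <string>
#include <vector>

int main() {
  tiledb::Context ctx;
  const std::string array_name = "quickstart_dense";

  // Create a 1D dense array with domain [1, 4] and a single int32 attribute "a".
  tiledb::Domain domain(ctx);
  domain.add_dimension(tiledb::Dimension::create<int>(ctx, "d", {{1, 4}}, 4));
  tiledb::ArraySchema schema(ctx, TILEDB_DENSE);
  schema.set_domain(domain);
  schema.add_attribute(tiledb::Attribute::create<int>(ctx, "a"));
  tiledb::Array::create(array_name, schema);

  // Write four values to attribute "a".
  std::vector<int> data = {10, 20, 30, 40};
  tiledb::Array array_w(ctx, array_name, TILEDB_WRITE);
  tiledb::Query query_w(ctx, array_w);
  query_w.set_layout(TILEDB_ROW_MAJOR).set_buffer("a", data);  // set_data_buffer in newer releases
  query_w.submit();
  array_w.close();

  // Read the whole domain back.
  std::vector<int> subarray = {1, 4};
  std::vector<int> result(4);
  tiledb::Array array_r(ctx, array_name, TILEDB_READ);
  tiledb::Query query_r(ctx, array_r);
  query_r.set_subarray(subarray).set_layout(TILEDB_ROW_MAJOR).set_buffer("a", result);
  query_r.submit();
  array_r.close();

  for (int v : result)
    std::cout << v << " ";
  std::cout << "\n";
}

Compile it with something like g++ -std=c++17 quickstart.cpp -o quickstart -ltiledb, adjusting include and library paths for your installation.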

Documentation

You can find the detailed TileDB documentation at https://docs.tiledb.com.

Format Specification

The TileDB data format is open-source, and its specification is included in this repository.

APIs

The TileDB team maintains a variety of APIs built on top of the C++ library, including Python, R, Java, and Go.

Integrations

TileDB is also integrated with several popular databases and data science tools, such as Spark, Dask, MariaDB, and GDAL.

Get involved

TileDB Embedded is an open-source project and welcomes all forms of contributions. Contributors to the project should read over the contribution docs for more information.

We'd love to hear from you. Drop us a line at [email protected], visit our forum or contact form, or follow us on Twitter to stay informed of updates and news.

Issues
  • Build on Ubuntu 20.04 LTS

    Build on Ubuntu 20.04 LTS

    Hi,

    I just tried building TileDB on a new system with Ubuntu 20.04 LTS and for some reason I get a cmake error:

    RegularExpression::compile(): Nested *?+.
    RegularExpression::compile(): Error in compile.
    CMake Error at examples/cpp_api/CMakeLists.txt:52 (string):
      string sub-command REGEX, mode REPLACE failed to compile regex
      "^/home/arne/Development/c++/TileDB/examples/cpp_api/".
    
    
    CMake Error at examples/cpp_api/CMakeLists.txt:56 (string):
      string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
      command.
    
    
    CMake Error at examples/cpp_api/CMakeLists.txt:62 (build_TileDB_example_cppapi):
      build_TileDB_example_cppapi Function invoked with incorrect arguments for
      function named: build_TileDB_example_cppapi
    

    I've tried building it with CMake 3.10.2 and CMake 3.18.20.

    opened by aosterthun 28
  • Fix catching polymorphic type errors when building with GCC 8

    Fix catching polymorphic type errors when building with GCC 8

    Inspired by https://github.com/WebAssembly/binaryen/pull/1400/files

    opened by eLvErDe 22
  • No documentation on how to actually use the S3 backend

    No documentation on how to actually use the S3 backend

    We have no easy-to-use guide/docs on how to set up the S3 backend (AWS), connect to a bucket (with the correct config settings), and write and read back some data.
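    For reference, a rough sketch of the kind of setup such a guide would cover, using the C++ API (the bucket, region, and credentials below are placeholders; credentials can also come from the environment or an IAM role):

    tiledb::Config config;
    config["vfs.s3.region"] = "us-east-1";
    config["vfs.s3.aws_access_key_id"] = "<access-key-id>";
    config["vfs.s3.aws_secret_access_key"] = "<secret-access-key>";
    tiledb::Context ctx(config);

    // Array URIs then use the s3:// scheme; `schema` is assumed to have been
    // created with this context.
    tiledb::Array::create("s3://my-bucket/my_array", schema);
    tiledb::Array array(ctx, "s3://my-bucket/my_array", TILEDB_READ);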

    docs enhancement 
    opened by jakebolewski 17
  • TileDB Errors are not actionable

    TileDB Errors are not actionable

    Currently, when using the C API, there is no way to tell what class of error occurred without parsing the error string.
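    To illustrate the point, the error object exposed by the C API today only carries a message string, so reacting to a specific failure means string matching (the sketch below uses the existing error functions and assumes a valid ctx):

    tiledb_error_t* err = nullptr;
    tiledb_ctx_get_last_error(ctx, &err);
    const char* msg = nullptr;
    tiledb_error_message(err, &msg);
    // There is no error class or code to switch on; the only option is to
    // inspect `msg`, e.g. with strstr(msg, "already exists").
    tiledb_error_free(&err);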

    opened by jakebolewski 14
  • Deserialization error?

    Deserialization error?

    Hi,

    I'm using release-1.7.6 of TileDB-core, and have been receiving the following:

    Error in libtiledb_array_open([email protected], uri, query_type) : [TileDB::Filter] Error: Deserialization error; unexpected metadata length

    Creating schemas etc. seem to go ok (no errors are thrown, and the schema looks ok in a debugger), and using the created domain from within C++ seems to go ok (i.e., I can read in / write out data from TileDB into temp arrays). However, in trying to see and work with the same array from Python & R I get the above error. Both Python and R are 'latest'; I've also tried R from within your Docker instance.

    Thanks in advance for any insight. Steve

    opened by iphcsteve 12
  • Support TileDB compilation under Cygwin

    Support TileDB compilation under Cygwin

    If you want TileDB to compile under Cygwin, then all of the #if(n)def _WIN32 code lines need to be converted to #if(n)def _MSC_VER. The authors assumed that the _WIN32 flag was only set by Visual Studio, but in fact it is also set for Cygwin (a Linux-like environment) and maybe MinGW. To ensure that the _WIN32 conditionals are invoked only when using Visual Studio, one needs to use _MSC_VER.
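    A minimal sketch of the substitution proposed here (illustrative, not an actual TileDB source file):

    // Before: also taken under Cygwin/MinGW if they define _WIN32.
    #ifdef _WIN32
      // Visual Studio-specific code path
    #endif

    // After: only taken when compiling with the MSVC compiler.
    #ifdef _MSC_VER
      // Visual Studio-specific code path
    #endif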

    Windows build cygwin 
    opened by DennisHeimbigner 11
  • support non-continuous subsetting

    support non-continuous subsetting

    This has been a major hurdle preventing me from doing more extensive benchmarking, since this type of query would be the most frequently used in our applications.

    I know I was told that random slicing is not optimized yet. But I wonder if it is even supported by the current C/C++ API? By random slicing, I simply mean [-like indexing in R. That is, instead of providing a start and end for each dim, we supply an array of indices for each, e.g.

     query.set_subarray<int>({{1,3,5,7,9}, {2,4,6,8}});
    

    I know I could build my query by submitting multiple (start, end) queries, but I was hoping this was part of the natively supported API, because chances are my version won't be an efficient implementation.
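    For what it's worth, a sketch of how point-wise selection could be emulated with the multi-range C++ API, assuming a TileDB version that provides Query::add_range and a 2D integer array already opened for reading (one degenerate [i, i] range is added per requested index):

    tiledb::Query query(ctx, array);
    for (int i : {1, 3, 5, 7, 9})
      query.add_range(0, i, i);  // dimension 0: rows 1, 3, 5, 7, 9
    for (int j : {2, 4, 6, 8})
      query.add_range(1, j, j);  // dimension 1: columns 2, 4, 6, 8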

    API Addition C API C++ API enhancement 
    opened by mikejiang 11
  • Issue with subarray writes in sorted order.

    Issue with subarray writes in sorted order.

    Hi, I am facing an issue in updating a dense subarray with ordered writes.

    • I have modified the tiledb_dense_create.cc example to create an array of dimensions 2^15 x 2^15, with one integer attribute and one float attribute, and a space tile extent of 128x128.
    • I have modified "tiledb_dense_write_ordered_subarray.cc" to create an int buffer and a float buffer to update a subarray of 2^18 elements (a 512 x 512 subarray).
    • When tiledb_dense_write_ordered_subarray is executed, it crashes.
    • This crash does not happen with smaller subarray sizes. It also does not happen with global-order writes. I understand that TileDB performs data reorganization to convert from sorted order to global order, but is there any memory limit on the subarray size for sorted writes?

    I have attached the modified versions of both the files. tiledb issue.zip

    Regards, Kalyan.

    opened by kavskalyan 10
  • Upgrade AWS SDK to latest version

    Upgrade AWS SDK to latest version

    The AWS SDK API changed over the past year which causes distribution issues when packaging TileDB with S3 support.

    s3 
    opened by jakebolewski 10
  • replace ReadFromOffset with ReadRange

    replace ReadFromOffset with ReadRange

    After switching to TileDB, we (Satelligence) had much higher (as in 50-100 times) egress traffic from Google Cloud Storage, even leading to a warning from Google that we hit the quota for storage egress bandwidth. We have been able to work around this by limiting the number of simultaneous read processes, but that is of course not a sustainable solution.

    After investigating, the excess traffic appeared to be caused by cancelled requests, which in turn were issued by TileDB because it uses ReadFromOffset instead of ReadRange. Where the latter translates to a closed HTTP range request, the former translates to an open-ended HTTP range request, effectively (starting to) read the whole blob from the offset onwards. Switching to ReadRange in GCS::read fixed the issue.

    Note that the egress increase in this case might even be purely administrative. TileDB does issue open-ended range requests, but then only reads length + read_ahead_length bytes before closing the stream (and thereby cancelling the request), effectively resulting in a closed range read. Also, when looking at the instance/pod ingress instead of the bucket's egress, the excess traffic did not appear. So apparently Google measures bucket egress traffic by analyzing requests rather than by probing NICs. Another possibility is that some buffering appliance sits in between, and the bucket's egress is measured before the buffer while the pod's ingress is measured after it. In any case, even if this is purely administrative, we need a fix for it to allow us to scale our processing again without hitting Google's bandwidth limit.

    The fix appeared simple: use ReadRange instead of ReadFromOffset.
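    For illustration, the two calls differ roughly as follows in the google-cloud-cpp storage API (a sketch only, not TileDB's actual GCS::read code; client, bucket, object, offset, and length are assumed given):

    namespace gcs = google::cloud::storage;

    // Open-ended: starts streaming the rest of the blob from `offset`; the
    // request is cancelled once enough bytes have been consumed.
    auto open_ended = client.ReadObject(bucket, object, gcs::ReadFromOffset(offset));

    // Bounded: a closed HTTP range request for exactly [offset, offset + length).
    auto bounded = client.ReadObject(bucket, object, gcs::ReadRange(offset, offset + length));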

    Because an image is worth more than 1000 words, here it is:

    [graph: bucket egress traffic, before and after the fix]

    This graph shows bytes sent from the bucket while running the same job twice, first without and then with the proposed fix. Note that the blue line (in the left part) is bytes from requests with status "CANCELLED", while the greenish line is from requests with status "OK". In the right part there is no blue line because there are no cancelled requests.

    NB this was also discussed yesterday in a chat with @stavrospapadopoulos and @normanb.

    Btw, this is my first PR to TileDB, so although I've read the contribution guidelines, I might be missing some of your conventions. For example: should I add a line to HISTORY, or is this change too small for that? Also, I'm not sure what I should put as "TYPE". Improvement? Bug?


    TYPE: IMPROVEMENT DESC: replace ReadFromOffset with ReadRange in GCS::read() to avoid excess gcs egress traffic

    backport release-2.3 
    opened by vincentschut 10
  • [Backport release-2.3] Fixing tile extent calculations for signed integer domains

    [Backport release-2.3] Fixing tile extent calculations for signed integer domains

    Backport 217f668425b06ed1b2fae7f160ce28d0ae1b2def from #2303

    opened by github-actions[bot] 0
  • Refactor and fix defects in buffer classes: read, set_offset, advance_offset

    Refactor and fix defects in buffer classes: read, set_offset, advance_offset

    The buffer classes Buffer, ConstBuffer, and PreallocatedBuffer share a lot of common code. It looks like there was copy-paste at some point in the past, and what was copied had defects. This change creates a common base class and fixes their common defects.
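    A rough sketch of the shared-base-class idea (names and members are illustrative, not the actual TileDB classes):

    #include <cstdint>
    #include <cstring>

    // Offset bookkeeping and bounds-checked reads live in one place.
    class BaseBuffer {
     public:
      BaseBuffer(const void* data, uint64_t size) : data_(data), size_(size) {}
      void set_offset(uint64_t offset) { offset_ = offset < size_ ? offset : size_; }
      void advance_offset(uint64_t nbytes) { set_offset(offset_ + nbytes); }
      void read(void* dest, uint64_t nbytes) {
        // Never copy past the end of the buffer.
        uint64_t n = nbytes <= size_ - offset_ ? nbytes : size_ - offset_;
        std::memcpy(dest, static_cast<const char*>(data_) + offset_, n);
        offset_ += n;
      }
     protected:
      const void* data_;
      uint64_t size_;
      uint64_t offset_ = 0;
    };

    // Buffer, ConstBuffer, and PreallocatedBuffer would then differ only in
    // ownership and mutability, not in this common logic.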


    TYPE: BUG DESC: Fix defects in buffer classes: read, set_offset, advance_offset

    opened by eric-hughes-tiledb 0
  • Follow up fixes to floating point calculations for tile extents

    Follow up fixes to floating point calculations for tile extents

    TYPE: BUG DESC: Follow up fixes to floating point calculations for tile extents

    backport release-2.3 
    opened by KiterLuc 0
  • Refactor get/set buffer APIs

    Refactor get/set buffer APIs

    TYPE: C_API DESC: Refactoring [get/set]_buffer APIs

    opened by bekadavis9 0
  • Support datatype conversion

    Support datatype conversion

    This branch adds the ability to return results as different datatypes, e.g. cast a float to a double, a double to an int, and so on. The filter "ConversionFilter" is used to convert from one numeric datatype to another. When a value being cast falls outside the range of the target datatype, we clamp it to the target's max/min value instead of attempting a cast with surprising results; for example, casting -1 of type int32_t to uint32_t returns 0. In the C API, a function is added like this: int32_t tiledb_query_set_query_datatype(tiledb_ctx_t* ctx, tiledb_query_t* query, const char* buffer_name, tiledb_datatype_t datatype, bool* var_length). The initial implementation focuses only on fixed-length numeric datatypes for read queries. We might also add support for write queries and variable-length datatypes.
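    Illustrative sketches of the clamping behavior described above (not the actual ConversionFilter code):

    #include <cstdint>
    #include <limits>

    // int32_t -> uint32_t: negative values saturate to 0 (e.g. -1 -> 0).
    uint32_t to_uint32_clamped(int32_t v) {
      return v < 0 ? 0u : static_cast<uint32_t>(v);
    }

    // double -> int32_t: out-of-range values saturate to INT32_MIN / INT32_MAX.
    int32_t to_int32_clamped(double v) {
      if (v <= static_cast<double>(std::numeric_limits<int32_t>::min()))
        return std::numeric_limits<int32_t>::min();
      if (v >= static_cast<double>(std::numeric_limits<int32_t>::max()))
        return std::numeric_limits<int32_t>::max();
      return static_cast<int32_t>(v);
    }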


    TYPE: FEATURE DESC: Support datatype conversion for numerical values

    opened by bdeng-xt 0
  • Maximum fragment size consolidation parameter

    Maximum fragment size consolidation parameter

    There's an example in the docs of how consolidation can somewhat approximate LSM trees.

    It would be helpful when operating in such a mode (consolidating after every write) to define a maximum fragment size preventing overly expensive consolidation. More expensive / unbounded consolidation can then be run manually at another time.
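    For context, consolidation can already be tuned through Config parameters, but none of them bounds the size of the fragments it produces. A sketch of the existing knobs (parameter names are from the 2.x Config options and may differ between releases):

    tiledb::Config config;
    config["sm.consolidation.steps"] = "1";
    config["sm.consolidation.step_min_frags"] = "2";
    config["sm.consolidation.step_max_frags"] = "4";
    config["sm.consolidation.step_size_ratio"] = "0.5";
    tiledb::Context ctx(config);
    tiledb::Array::consolidate(ctx, "my_array");  // "my_array" is a placeholder URI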

    opened by gatesn 2
  • Change tp_index_ from a static to a shared_ptr to a class singleton.

    Change tp_index_ from a static to a shared_ptr to a class singleton.

    ThreadPool has a race condition when its instances are defined static. There are static members of ThreadPool that must still be present when a ThreadPool destructor runs. Because the order of static destruction is not defined, a ThreadPool destructor can throw.

    The correction is to make these members into class singletons, initialized on demand in the first ThreadPool constructor.
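    A minimal sketch of the approach (names are illustrative, not the actual ThreadPool internals): each instance holds a shared_ptr to the singleton, so the singleton cannot be destroyed before any ThreadPool that still needs it, regardless of static destruction order.

    #include <map>
    #include <memory>

    class ThreadPool {
     public:
      ThreadPool() : tp_index_(index_singleton()) {}

     private:
      using Index = std::map<const ThreadPool*, int>;

      // Constructed on first use, in the first ThreadPool constructor.
      static std::shared_ptr<Index> index_singleton() {
        static std::shared_ptr<Index> index = std::make_shared<Index>();
        return index;
      }

      std::shared_ptr<Index> tp_index_;  // keeps the singleton alive for this instance
    };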


    TYPE: BUG DESC: remove race condition in ThreadPool

    opened by eric-hughes-tiledb 2
  • Array is unreadable with string dimension containing only empty strings

    Array is unreadable with string dimension containing only empty strings

    With only empty strings, auto size = s1.size() + s2.size(); is zero, which results in an exception when fetching the array_fragments or the non-empty domain, unfortunately making the array unreadable.

    Traceback (most recent call last):
      File "empty_strings.py", line 27, in <module>
        tiledb.array_fragments(array_name)
      File ".../lib/python3.6/site-packages/tiledb/highlevel.py", line 123, in array_fragments
        return tiledb.FragmentInfoList(uri, ctx)
      File ".../lib/python3.6/site-packages/tiledb/fragment.py", line 28, in __init__
        self.nonempty_domain = fi.get_non_empty_domain(schema)
    tiledb.libtiledb.TileDBError: [TileDB::FragmentInfo] Error: Cannot get non-empty domain var size; Dimension is fixed sized
    

    https://github.com/TileDB-Inc/TileDB/blob/5658e01d24a70cca9bd60872980ec2703693e945/tiledb/sm/misc/types.h#L116-L126

    https://github.com/TileDB-Inc/TileDB/blob/5658e01d24a70cca9bd60872980ec2703693e945/tiledb/sm/misc/types.h#L227-L230

    Example
    import shutil
    
    import numpy as np
    import tiledb
    from tiledb import *
    
    array_name = 'empty_strings'
    
    s = ArraySchema(
        domain=Domain(
            Dim('a', dtype=np.int32, domain=(-10, 10)),
            Dim('b', dtype=np.bytes_, domain=(None, None)),
        ),
        attrs=[Attr('x', dtype=np.int32)],
        sparse=True,
    )
    
    shutil.rmtree(array_name, ignore_errors=True)
    SparseArray.create(array_name, schema=s)
    
    with SparseArray(array_name, mode='w') as A:
        A[[1, 2, 3], ['', '', '']] = [1, 2, 3]
    
    tiledb.array_fragments(array_name)
    
    with SparseArray(array_name) as A:
        # Also fails, but with less obvious error message
        A.nonempty_domain()
    
    opened by gatesn 1
  • Cost of sorting sparse array coordinates

    Cost of sorting sparse array coordinates

    I’m looking to store DataFrames as sparse arrays using TileDB but have a few questions about performance.

    Provided my query intersects only a single fragment, everything seems nice and fast. As soon as multiple fragments are involved, I see in the verbose stats dump a huge increase in the "Time to compute range result coordinates" duration.

    For the given example (below), it takes 6.03 seconds to read all three fragments, but only 0.52 seconds to read a single fragment.

    Example
    import contextlib
    import random
    import time
    from timeit import timeit
    
    import numpy as np
    import pandas as pd
    import tiledb
    
    # Name of the array to create.
    array_name = "quickstart_sparse"
    
    
    def create_array():
        dom = tiledb.Domain(
            tiledb.Dim(name="x", domain=(None, None), dtype=np.bytes_),
            tiledb.Dim(name="y", domain=(None, None), dtype=np.bytes_),
            # Index dimension to ensure coords are unique
            tiledb.Dim(name="i", domain=(0, np.iinfo(np.int64).max - 1), dtype=np.int64),
        )
    
        schema = tiledb.ArraySchema(
            domain=dom, sparse=True, attrs=[tiledb.Attr(name="value", dtype=np.float64)]
        )
    
        # Create the (empty) array on disk.
        tiledb.SparseArray.create(array_name, schema)
    
    
    def write_array(n=100_000):
        for x in ('A', 'B', 'C'):
            df = pd.DataFrame([{
                'x': x,
                'y': random.choice(['a', 'b', 'c']),
                'data': random.random()
            } for _ in range(n)])
    
            # Open the array and write to it.
            with tiledb.SparseArray(array_name, mode="w") as A:
                A[df['x'], df['y'], df.index] = df['data']
    
    
    def read_array(n):
        with tiledb.SparseArray(array_name, mode="r") as A:
            print(n, "all", timeit(lambda: A.df[:], setup=lambda: A.df[:], number=3))
            print(n, "x", timeit(lambda: A.df['A', :, :], setup=lambda: A.df[:], number=3))
            print(n, "y", timeit(lambda: A.df[: 'b', :], setup=lambda: A.df[:], number=3))
            
    
    for n in [100, 500, 1_000, 5_000, 10_000, 50_000, 100_000, 500_000, 1_000_000]:
        import shutil
        shutil.rmtree(array_name, ignore_errors=True)
    
        if tiledb.object_type(array_name) != "array":
            create_array()
            write_array(n=n)
    
        read_array(n=n)
    
    print(tiledb.libtiledb.__file__)
    print(tiledb.version, tiledb.libtiledb.version())
    

    Adding a couple more timer stats to the TileDB code, I discovered that almost all the additional overhead comes from sorting the ResultCoords here:

    https://github.com/TileDB-Inc/TileDB/blob/70ddc9af4da5e1af9e0a9cfb38332455ab15d033/tiledb/sm/query/reader.cc#L1054-L1062

    I’ve verified that when I set allow_duplicates in the schema then the performance is fast again.

    I’m not totally familiar with the internals so wanted to start a thread to discuss a couple of thoughts:

    1. First, it seems that we ought to be able to benefit from long runs of already sorted result coordinates. Perhaps suggesting that an algorithm like TimSort might perform better than QuickSort for these cases?
    2. Second is that in this case the fragment MBRs are non-overlapping in the first dimension, so we should be able to sort the MBRs and then concatenate the result coordinates without sorting them further. I’m not sure where this check could/should happen?
    3. Finally, my understanding is that within each tile (and maybe fragment?) the coordinates are already sorted. If this is the case, then it might mean we can perform a k-way merge of sorted coordinates in O(n log k) instead of concatenation and an (admittedly parallelised) O(n log n) sort? (A generic sketch of such a merge follows below.)
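    For concreteness, a generic sketch of a k-way merge of already-sorted runs in O(n log k) using a min-heap (illustrative only, not TileDB's reader code):

    #include <cstddef>
    #include <queue>
    #include <utility>
    #include <vector>

    std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs) {
      using Entry = std::pair<int, std::size_t>;  // (value, index of the run it came from)
      auto cmp = [](const Entry& a, const Entry& b) { return a.first > b.first; };
      std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> heap(cmp);

      std::vector<std::size_t> pos(runs.size(), 0);
      for (std::size_t k = 0; k < runs.size(); ++k)
        if (!runs[k].empty())
          heap.emplace(runs[k][0], k);

      std::vector<int> merged;
      while (!heap.empty()) {
        auto [value, k] = heap.top();
        heap.pop();
        merged.push_back(value);
        if (++pos[k] < runs[k].size())
          heap.emplace(runs[k][pos[k]], k);
      }
      return merged;
    }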

    I took a stab at trying out the k-way merge here instead of simply concatenating:



    https://github.com/TileDB-Inc/TileDB/blob/70ddc9af4da5e1af9e0a9cfb38332455ab15d033/tiledb/sm/query/reader.cc#L1150-L1154

    I asserted that the overall coordinates were indeed sorted for my use-case, but I have absolutely no idea whether there are other places or details that mean this assumption doesn't hold or that this solution doesn't work in the general case! This would need someone more familiar to validate.

    Even so, after removing the assertion code and just running the k-way merge I see these numbers:

    • read_all 2.92 seconds (all data from all three fragments)
    • read_x 0.57 seconds (all data from a single fragment, selected by the first dimension)
    • read_y 0.91 seconds (some smaller amount of data from all three fragments)

    I also ran against varying sizes of n. The log-log scaled graph suggests that there isn't actually an algorithmic improvement here, but I do think the constants involved might make the change worth it?

    [screenshots: log-log timing graphs for varying n]
    opened by gatesn 2
  • Make CMake safer for non-standard development environments

    Make CMake safer for non-standard development environments

    CMake has a deficiency where it does not propagate toolchain files to external projects, nor cache entries related to those toolchains. To support a workaround, this change:

    • Adds propagate_cache_variables(), which takes a list of designated cache variables and constructs definitions to pass to external projects, and uses it for all the external projects touched by this commit.
    • Changes CONFIGURE_COMMAND to CMAKE_ARGS for external projects; CONFIGURE_COMMAND does not pass on certain CMake configuration parameters, most notably any generator specified by -G.
    • Removes some architecture specifications that not only assumed Visual Studio, but are passed to external projects by default when CONFIGURE_COMMAND is not used.
    • Adds the definition CMAKE_POSITION_INDEPENDENT_CODE=ON, which abstracts the -fPIC compiler flag.


    TYPE: BUG DESC: Make CMake safe for non-standard development environments

    opened by eric-hughes-tiledb 0