Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Overview

Apache Arrow


Powering In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.

What's in the Arrow libraries?

The reference Arrow libraries contain many distinct software components:

  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • IO interfaces to local and remote filesystems
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures
  • Readers and writers for various widely-used file formats (such as Parquet, CSV)
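
For a concrete sense of several of these components together, here is a minimal sketch using the Python library, pyarrow (assuming a pyarrow install; the data is made up): it builds a nested columnar table, serializes it with the self-describing IPC stream format into a reference-counted off-heap buffer, and reads it back.

    import pyarrow as pa

    # Columnar table-like container with a nested (struct) column
    table = pa.table({
        "id": [1, 2, 3],
        "point": [{"x": 0.0, "y": 1.0}, {"x": 2.0, "y": 3.0}, None],
    })

    # Self-describing IPC stream format: serialize into an in-memory sink
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()  # reference-counted off-heap buffer

    # Read the stream back and verify the round trip
    with pa.ipc.open_stream(buf) as reader:
        assert reader.read_all().equals(table)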

Implementation status

The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git master.

How to Contribute

Please read our latest project contribution guide.

Getting involved

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved.

Comments
  • ARROW-10224: [Python] Add support for Python 3.9 except macOS wheel and Windows wheel


    Adds support and testing for Python 3.9. I am looking for review as this change may have touched too many things, but I'm also looking to get the CI to test all the different environments.

    H/T: @kou, the documentation and #5685 for helping me get this off the ground.

    Component: Python 
    opened by terencehonles 182
  • ARROW-14892: [Python][C++] GCS Bindings


    Incorporates the GCS filesystem into Python, plus other bug fixes.

    Bugs/Other changes:

    • Add GCS bindings mostly based on AWS bindings in Python and associated unit tests
    • Tell() was incorrect: it double-counted when the stream was constructed with an offset.
    • A missing define in config.cmake meant FileSystemFromUri was never tested and didn't compile; this is now fixed.
    • Refine the logic for GetFileInfo with a single path to recognize prefixes followed by a slash as a directory. This allows datasets to work as expected with a toy dataset generated on a local filesystem and copied to the cloud (I believe this is typical of how other systems write to GCS as well).
    • Switch the convention for creating directories to always end in "/" and make use of this as another indicator. From testing with a sample Iceberg table, this appears to be the convention used for Hive partitioning, so I assume it is common practice for other Hive-related writers (i.e. what we want to support).
    • Fix bug introduced in https://github.com/apache/arrow/commit/a5e45cecb24229433b825dac64e0ffd10d400e8c which caused failures when a deletion occurred on a bucket (not an object in the bucket).
    • Ensure output streams are closed on destruction (this is consistent with S3)
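
    A hypothetical sketch of the Python surface this adds (the bucket and dataset names are made up; GcsFileSystem and FileSystem.from_uri are the pyarrow.fs entry points):

    import pyarrow.fs as fs

    # Explicit construction; anonymous access works for public buckets
    gcs = fs.GcsFileSystem(anonymous=True)

    # URI dispatch, which the config.cmake fix above makes testable again
    filesystem, path = fs.FileSystem.from_uri("gs://my-bucket/my-dataset")

    # GetFileInfo recognizes a prefix followed by a slash as a directory
    print(filesystem.get_file_info(path))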
    Component: C++ Component: Python 
    opened by emkornfield 137
  • ARROW-16340: [C++][Python] Move all Python related code into PyArrow


    This PR moves the src/arrow/python directory into pyarrow and arranges for PyArrow to build it. The build on the Python side proceeds in two steps:

    1. _run_cmake_pyarrow_cpp(), where the C++ part of pyarrow is built first (the part that was moved in the refactoring)
    2. _run_cmake(), where pyarrow is built as before

    No changes are needed in the user-side build process to successfully build pyarrow after this refactoring. The tests for PyArrow C++ will, however, be moved into Cython and can currently be run with:

    $ pushd python/build/dist/temp
    $ ctest
    
    Component: C++ Component: Python Component: FlightRPC Component: Documentation 
    opened by AlenkaF 122
  • ARROW-17545: [C++][CI] Mandate C++17 instead of C++11


    This PR switches our build system to require C++17 instead of C++11.

    Because the conda packaging jobs are out of sync with the conda-forge files, the Windows conda packaging jobs are broken with this change. The related task (sync conda packaging files with conda-forge) is tracked in ARROW-17635.

    Component: R Component: Java Component: Parquet Component: C++ Component: Python Component: Ruby Component: C++ - Gandiva Component: GLib Component: MATLAB Component: Documentation 
    opened by pitrou 120
  • ARROW-12626: [C++] Support toolchain xsimd, update toolchain version to version 8.1.0


    This also updates the pinned vcpkg to use xsimd 8.1.0.

    This also implements an automatic python-wheel-windows-vs2017 image update mechanism. We have a problem with "docker build" on Windows: it doesn't reuse pulled images as caches, and always rebuilds the image. This implements a manual reuse mechanism like the following:

    if ! docker pull "${IMAGE}"; then
      # build only when a previously built image can't be pulled
      docker build -t "${IMAGE}" .
    fi
    docker run "${IMAGE}"
    

    But this doesn't work when ci/docker/python-wheel-windows-vs2017.dockerfile is updated but the pinned vcpkg revision isn't changed. In that case, "docker build" isn't run because "docker pull" succeeds.

    To make this mechanism work, this introduces "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION". We must bump it manually whenever we update ci/docker/python-wheel-windows-vs2017.dockerfile. "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION" is used in the image tag name, so "docker pull" fails for a new "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION" and "docker build" is used instead.

    Component: C++ 
    opened by wesm 101
  • ARROW-6920: [Packaging] Build python 3.8 wheels


    Adds Python 3.8 wheels.

    As far as I can tell, Python 3.8 isn't available for Conda yet (https://github.com/conda-forge/python-feedstock/pull/274), so that will have to be added later.

    opened by sjhewitt 95
  • ARROW-17635: [Python][CI] Sync conda recipe with the arrow-cpp feedstock


    Corresponds to the status of the feedstock as of https://github.com/conda-forge/arrow-cpp-feedstock/pull/848, minus obvious & intentional divergences in the setup here (with the exception of unpinning xsimd, which was pinned as of 9.0.0 but isn't anymore).

    opened by h-vetinari 73
  • ARROW-17692: [R] Add support for building with system AWS SDK C++


    This PR uses "pkg-config --static ... arrow" to collect build flags. It reports suitable build flags that depend on the build options and the libraries used by Apache Arrow C++. This works with the system AWS SDK for C++.

    Component: R Component: C++ 
    opened by thisisnic 71
  • ARROW-15639 [C++][Python] UDF Scalar Function Implementation


    PR for Scalar UDF integration

    This is the first phase of UDF integration into Arrow. This version only includes ScalarFunctions. In future PRs, vector UDFs (using Arrow VectorFunction), UDTFs (user-defined table functions) and aggregation UDFs will be integrated. This PR includes the following, with a registration sketch after the list:

    • [x] UDF Python Scalar Function registration and usage
    • [x] UDF Python Scalar Function Examples
    • [x] UDF Python Scalar Function test cases
    • [x] UDF C++ Example extended from Compute Function Example
    • [x] Added an aggregation example (optional to this PR; if required it can be removed and pushed in a different PR)
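
    The Python registration surface for scalar UDFs looks roughly like the sketch below (written against the released pyarrow.compute API; the function and its name are made-up examples):

    import pyarrow as pa
    import pyarrow.compute as pc

    # A toy scalar UDF: the first argument is a UDF context, then the inputs
    def multiply_by_two(ctx, x):
        return pc.multiply(x, pa.scalar(2, type=pa.int64()))

    pc.register_scalar_function(
        multiply_by_two,                      # callable
        "multiply_by_two",                    # registered name
        {"summary": "double each value",
         "description": "toy example of a scalar UDF"},
        {"x": pa.int64()},                    # input name -> type
        pa.int64(),                           # output type
    )

    print(pc.call_function("multiply_by_two", [pa.array([1, 2, 3])]))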
    Component: C++ Component: Python Component: GLib 
    opened by vibhatha 68
  • ARROW-16584: [Java] Java JNI with S3 support


    The macOS deployment target is changed from 10.11 to 10.13. See also the discussion on the mailing list: https://lists.apache.org/thread/pjgjrl716gvqzql586cnnoxb38nb0j5w

    opened by REASY 65
  • ARROW-14506: [C++] Conda support for google-cloud-cpp


    This PR adds support for google-cloud-cpp to the Conda files. Probably the most difficult change to grok is the change to compile with C++17 when using Conda:

    • Conda defaults all its builds to C++17; this bug goes into some detail as to why.
    • Arrow defaults to C++11 if no CMAKE_CXX_STANDARD argument is provided.
    • Abseil's ABI changes when used from C++11 vs. C++17, see https://github.com/abseil/abseil-cpp/issues/696
    • Therefore, one must compile with C++17 to use Abseil in Conda.
    • And because google-cloud-cpp has a direct dependency on Abseil, exposed through the headers, one must use C++17 to use google-cloud-cpp too.
    Component: C++ 
    opened by coryan 65
  • [C++][Parquet] The DictEncoder is always PLAIN_DICTIONARY even in parquet_v2 format


    Describe the enhancement requested

    In DictEncoderImpl, the encoding is fixed to PLAIN_DICTIONARY. Per the standard, it should be RLE_DICTIONARY or PLAIN_DICTIONARY, decided by the Parquet format version.

    Though the final format may be right, the temporary encoding might be tricky here.

    I'd like to fix this if you'd like it.
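
    A small pyarrow sketch to observe the reported encodings (the file path is made up; exact output depends on the pyarrow version and on this fix):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 1, 2, 2, 3] * 100})
    pq.write_table(table, "/tmp/v2.parquet", version="2.6")  # parquet_v2

    # Inspect the encodings recorded for the first column chunk; per this
    # issue, the dictionary encoding reports PLAIN_DICTIONARY even here,
    # where RLE_DICTIONARY would be expected.
    meta = pq.ParquetFile("/tmp/v2.parquet").metadata
    print(meta.row_group(0).column(0).encodings)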

    Component(s)

    C++, Parquet

    Type: enhancement 
    opened by mapleFU 0
  • [C++] Add an option for the order by node to be stable


    Describe the enhancement requested

    This will require support for ARROW-17762 first, and it only makes sense if there is some kind of existing (even implicit) ordering. We can resequence batches as we accumulate them and then sort the sequenced batches; we should get stable results if we use a stable sort.
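
    A plain-Python sketch of the resequence-then-stable-sort idea (illustrative only, not the Acero implementation):

    # Tag accumulated batches with their implicit sequence number, then
    # rely on a stable sort: equal keys keep their arrival order.
    accumulated = [(0, {"key": 2}), (1, {"key": 1}), (2, {"key": 1})]

    # Python's sorted() is stable, so the two key == 1 batches stay in
    # sequence order (1 before 2)
    ordered = sorted(accumulated, key=lambda seq_batch: seq_batch[1]["key"])
    assert [seq for seq, _ in ordered] == [1, 2, 0]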

    Component(s)

    C++

    Type: enhancement Component: C++ 
    opened by westonpace 0
  • GH-14951: [C++][Parquet] Support Benchmark for DELTA_BINARY_PACKED


    This patch adds a benchmark for DELTA_BINARY_PACKED. Unlike PLAIN, it should consider both the case where the data can be compressed well and the case where it can't.

    • Closes: #14951
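
    From Python, the two cases can be generated with the existing column_encoding writer option (an illustrative sketch; the benchmark itself is C++, and the file paths here are made up):

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Small, regular deltas pack well; random 62-bit values do not
    compressible = pa.table({"v": np.arange(1_000_000, dtype=np.int64)})
    incompressible = pa.table(
        {"v": np.random.default_rng(0).integers(0, 2**62, 1_000_000)})

    for name, table in [("compressible", compressible),
                        ("incompressible", incompressible)]:
        # column_encoding requires dictionary encoding to be disabled
        pq.write_table(table, f"/tmp/{name}.parquet",
                       use_dictionary=False,
                       column_encoding={"v": "DELTA_BINARY_PACKED"})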
    Component: Parquet Component: C++ 
    opened by mapleFU 3
  • [C++] arrow.pc is missing dependencies with Windows static builds


    Describe the bug, including details regarding any error messages, version, and platform.

    I have been having to manually edit arrow.pc from a vcpkg installation of a static build of arrow (using the triplet x64-windows-static-md) for a few reasons. #14869 was a start, but now for the tricky (to me) one: missing dependencies. It doesn't seem to be a vcpkg issue, as non-vcpkg builds also generate "unconsumable" pkg-config files, so I'm posting here. Or maybe it goes all the way up to cmake, I don't know (I'm a cmake avoider...).

    The following is using arrow master as of a0d16306229fc08f9dc64361f9459806e02b5932.

    For example, using VS2022 (17.4.3) (and cmake 3.24.202208181-MSVC_2 from there) configured as:

    cmake -G Ninja -DVCPKG_TARGET_TRIPLET=x64-windows-static-md -DARROW_WITH_BROTLI=ON -DARROW_WITH_BZ2=ON \
      -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON \
      -DARROW_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX=/path/to/somewhere -DARROW_DEPENDENCY_SOURCE=VCPKG \
      -DARROW_BUILD_STATIC=ON -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_TESTS=OFF \
      -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_WITH_BACKTRACE=OFF \
      -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreadedDebugDLL -DCMAKE_BUILD_TYPE=Debug ../..
    

    when built and installed generates arrow.pc:

    ...
    Requires:
    Requires.private:
    Libs: -L${libdir} -larrow /path/to/arrow.git/cpp/vcpkg_installed/x64-windows-static-md/lib/snappy.lib optimized;/path/to/arrow.git/cpp/vcpkg_installed/x64-windows-static-md/lib/bz2.lib;debug;/path/to/arrow.git/cpp/vcpkg_installed/x64-windows-static-md/debug/lib/bz2d.lib
    Libs.private:
    Cflags: -I${includedir} -DARROW_STATIC
    Cflags.private:
    

    This is essentially the same as an install via vcpkg. There are a few problems here:

    • Most of the compression deps are missing (libbrotlidec, libbrotlienc, zlib, liblz4, libzstd). The cmake invocation generates the following:

      -- pkg-config package for libbrotlidec for static link isn't found
      -- pkg-config package for libbrotlienc for static link isn't found
      -- pkg-config package for zlib for static link isn't found
      ...
      

      so it's not surprising, I suppose! The .pc files are there in the vcpkg_installed/x64-windows-static-md/debug/lib/pkgconfig dir, however (and the libs themselves a level up).

    • Libs has bz2 and snappy without -l, and with a file extension and path prefix. I don't think this is ideal: without the -l the file may be treated as a plain object file rather than an import library, causing issues with certain toolchains. Ultimately this could be worked around in meson, maybe (or libpkgconf, which I think it uses), but...

    • bzip2 should ideally be a Requires too; since there are separate release and debug pkgconfig dirs, they will cope with the file naming difference.

    • And snappy too, but its .pc file is also broken (missing completely here; with a vcpkg install of arrow it has Libs: -L${libdir} -l, i.e. not -lsnappy). I haven't investigated that at all yet, so it stays in Libs, but without a path.

    I.e. a usable arrow.pc would look something like:

    ...
    Requires: bzip2 libzstd liblz4 zlib libbrotlidec libbrotlienc libbrotlicommon
    Libs: -L${libdir} -larrow -lsnappy
    Cflags: -I"${includedir}" -DARROW_STATIC
    

    Now I'm sure there are a gazillion other cases to cope with where doing this would be wrong, but at least in the "get everything via vcpkg" case this arrow.pc works for me (as does a similarly edited arrow.pc from a vcpkg install of arrow).

    As an aside, the same set up of options (except no CMAKE_MSVC_RUNTIME_LIBRARY and x64-linux triplet) on Fedora 37 (cmake 3.25.1) generates:

    ...
    Requires: snappy libbrotlidec libbrotlienc zlib liblz4 libzstd
    Requires.private:
    Libs: -L${libdir} -larrow optimized;/path/to/arrow.git/cpp/vcpkg_installed/x64-linux/lib/libbz2.a;debug;/path/to/arrow.git/cpp/vcpkg_installed/x64-linux/debug/lib/libbz2d.a -larrow_bundled_dependencies
    Libs.private:
    Cflags: -I${includedir} -DARROW_STATIC
    Cflags.private:
    

    This is more reasonable, though I personally think bzip2 is handled wrongly (for the same reasons as above), even if it "works". The debug/optimized stuff is still problematic, but vcpkg is adding support for that (https://github.com/microsoft/vcpkg/pull/23898), so it will start to work at some point - though obviously potentially not if consuming the pc file via another route.

    I have spent some time hacking ThirdpartyToolchain.cmake to see if I could make the pc files more sane, but unsuccessfully. The best I could do was getting bzip2 "better" with:

    if(ARROW_WITH_BZ2)
      resolve_dependency(BZip2
        PC_PACKAGE_NAMES
        bzip2)
    
      # resolve_dependency(BZip2)
    
      # if(${BZip2_SOURCE} STREQUAL "SYSTEM")
      #   string(APPEND ARROW_PC_LIBS_PRIVATE " ${BZIP2_LIBRARIES}")
      # endif()
    
      # if(NOT TARGET BZip2::BZip2)
      #   add_library(BZip2::BZip2 UNKNOWN IMPORTED)
      #   set_target_properties(BZip2::BZip2
      #                         PROPERTIES IMPORTED_LOCATION "${BZIP2_LIBRARIES}"
      #                                    INTERFACE_INCLUDE_DIRECTORIES "${BZIP2_INCLUDE_DIR}")
      # endif()
    endif()
    

    but I've no idea what that breaks for other cases. As I say, I'm not a cmake person and would like to keep it that way 😀 Happy to try any suggestions, though.

    Thanks!

    Component(s)

    C++

    Type: bug Component: C++ 
    opened by lukester1975 2
Owner
The Apache Software Foundation
Digital Signal Processing Library and Audio Toolbox for the Modern Synthesist.

Digital Signal Processing Library and Audio Toolbox for the Modern Synthesist. Attention This library is still under development! Read the docs and ch

everdrone 81 Nov 25, 2022
A generic and robust calibration toolbox for multi-camera systems

MC-Calib Toolbox described in the paper "MultiCamCalib: A Generic Calibration Toolbox for Multi-Camera Systems". Installation Requirements: Ceres, Boo

null 204 Jan 5, 2023
Extension types for geospatial data for use with 'Arrow'

geoarrow The goal of geoarrow is to prototype Arrow representations of geometry. This is currently a first-draft specification and nothing here should

Dewey Dunnington 95 Jan 2, 2023
Isaac ROS image_pipeline package for hardware-accelerated image processing in ROS2.

isaac_ros_image_pipeline Overview This metapackage offers similar functionality as the standard, CPU-based image_pipeline metapackage, but does so by

NVIDIA AI IOT 32 Dec 15, 2022
Proof of Concept 'GeoPackage' to Arrow Converter

gpkg The goal of gpkg is to provide a proof-of-concept reader for SQLite queries into Arrow C Data interface structures. Installation You can install

Dewey Dunnington 8 May 20, 2022
Zero-Knowledge Proof Toolbox

Zkrypt is an open-source C library of zero-knowledge proof algorithms, designed to provide users with concise and efficient interfaces for non-interactive zero-knowledge proof protocols. Users can call the interfaces to run a complete zero-knowledge proof protocol, including public parameter setup, proof generation, and verification. The project is developed and maintained by Zhi Guan's cryptography research group at Peking University. Features: supports multiple zero-knowledge proof protocols (including Groth16, Plo

Zhi Guan 16 Dec 14, 2022
A long-read analysis toolbox for cancer genomics

Lorax: A long-read analysis toolbox for cancer genomics In cancer genomics, long-read de novo assembly approaches may not be applicable because of tum

Tobias Rausch 11 Dec 15, 2022
Mod - MASTERS of DATA, a course about videogames data processing and optimization

MASTERS of DATA Welcome to MASTERS of DATA. A course oriented to Technical Designers, Technical Artists and any game developer that wants to understan

Ray 35 Dec 28, 2022
Mirror of Apache ODE

============== Apache ODE ============== Apache ODE is a WS-BPEL compliant web services orchestration engine. It organizes web services calls follo

The Apache Software Foundation 44 Jun 28, 2022
Mirror of Apache Portable Runtime

Apache Portable Runtime Library (APR) ===================================== The Apache Portable Runtime Library provides a predictable and cons

The Apache Software Foundation 379 Dec 9, 2022
C/C++ language server supporting multi-million line code base, powered by libclang. Emacs, Vim, VSCode, and others with language server protocol support. Cross references, completion, diagnostics, semantic highlighting and more

Archived cquery is no longer under development. clangd and ccls are both good replacements. cquery cquery is a highly-scalable, low-latency language s

Jacob Dufault 2.3k Jan 2, 2023
A CUDA-accelerated cloth simulation engine based on Extended Position Based Dynamics (XPBD).

Velvet Velvet is a CUDA-accelerated cloth simulation engine based on Extended Position Based Dynamics (XPBD). Why another cloth simulator? There are a

Vital Chen 39 Dec 21, 2022
"SaferCPlusPlus" is essentially a collection of safe data types intended to facilitate memory and data race safe C++ programming

A collection of safe data types that are compatible with, and can substitute for, common unsafe native c++ types.

null 329 Nov 24, 2022
The Synthesis ToolKit in C++ (STK) is a set of open source audio signal processing and algorithmic synthesis classes written in the C++ programming language.

The Synthesis ToolKit in C++ (STK) By Perry R. Cook and Gary P. Scavone, 1995--2021. This distribution of the Synthesis ToolKit in C++ (STK) contains

null 832 Jan 2, 2023
Multi-dimensional dynamically distorted staggered multi-bandpass LV2 plugin

B.Angr A multi-dimensional dynamically distorted staggered multi-bandpass LV2 plugin, for extreme soundmangling. Based on Airwindows XRegion. Key featur

null 21 Nov 7, 2022
Unicorn is a lightweight, multi-platform, multi-architecture CPU emulator framework, based on QEMU.

Unicorn Engine Unicorn is a lightweight, multi-platform, multi-architecture CPU emulator framework, based on QEMU. Unicorn offers some unparalleled fe

lazymio 1 Nov 7, 2021
Exploring the Design Space of Page Management for Multi-Tiered Memory Systems (USENIX ATC'21)

AutoTiering This repo contains the kernel code in the following paper: Exploring the Design Space of Page Management for Multi-Tiered Memory Systems (

Computer Systems Laboratory @ Ajou University 23 Dec 20, 2022
A multi-bank MRAM based memory card for Roland instruments

Roland compatible multi-bank MRAM memory card (click to enlarge) This is a replacement memory card for old Roland instruments of the late 80s and earl

Joachim Fenkes 23 Nov 25, 2022