An R interface to the 'Apache Arrow' C API

Overview

carrow


The goal of carrow is to wrap the Arrow C data interface and the Arrow C stream interface to provide lightweight Arrow support for R packages, letting them consume and produce streams of data in Arrow format.

Installation

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("paleolimbot/carrow")

Creating arrays

You can create an Arrow array using as_carrow_array(). For many types (e.g., integers and doubles), this is done without any copying of memory: carrow just arranges the existing R vector memory and protects it for the lifetime of the underlying struct ArrowArray.

library(carrow)
(array <- as_carrow_array(1:5))
#> <carrow_array i[5]>
#> - schema:
#>   <carrow_schema 'i' at 0x130978070>
#>   - format: i
#>   - name: NULL
#>   - flags: 
#>   - metadata:  list()
#>   - dictionary: NULL
#>   - children[0]:
#> - array_data:
#>   <carrow_array_data at 0x1309794b0>
#>   - length: 5
#>   - null_count: 0
#>   - offset: 0
#>   - buffers[2]: List of 2
#>     $ : NULL
#>     $ : int [1:5] 1 2 3 4 5
#>   - dictionary: NULL
#>   - children[0]:

For Arrays and RecordBatches from the arrow package, this is almost always a zero-copy operation and is instantaneous even for very large Arrays.

library(arrow)
(array2 <- as_carrow_array(Array$create(1:5)))
#> <carrow_array i[5]>
#> - schema:
#>   <carrow_schema 'i' at 0x1308adee0>
#>   - format: i
#>   - name: 
#>   - flags: nullable
#>   - metadata:  list()
#>   - dictionary: NULL
#>   - children[0]:
#> - array_data:
#>   <carrow_array_data at 0x1308ab320>
#>   - length: 5
#>   - null_count: 0
#>   - offset: 0
#>   - buffers[2]: List of 2
#>     $ :<externalptr> 
#>     $ :<externalptr> 
#>   - dictionary: NULL
#>   - children[0]:

Exporting arrays

To convert an array object to some other type, use from_carrow_array():

str(from_carrow_array(array))
#>  int [1:5] 1 2 3 4 5

The carrow package has built-in defaults for converting arrays to R objects; you can also specify your own using the ptype argument:

str(from_carrow_array(array, ptype = double()))
#>  num [1:5] 1 2 3 4 5
from_carrow_array(array, ptype = arrow::Array)
#> Array
#> <int32>
#> [
#>   1,
#>   2,
#>   3,
#>   4,
#>   5
#> ]

Streams

The Arrow C API also specifies an experimental stream interface. In addition to handling streams created elsewhere, you can create streams based on a carrow_array():

stream1 <- as_carrow_array_stream(as_carrow_array(1:3))
carrow_array_stream_get_next(stream1)
#> <carrow_array i[3]>
#> - schema:
#>   <carrow_schema 'i' at 0x116664b50>
#>   - format: i
#>   - name: NULL
#>   - flags: 
#>   - metadata:  list()
#>   - dictionary: NULL
#>   - children[0]:
#> - array_data:
#>   <carrow_array_data at 0x116663b30>
#>   - length: 3
#>   - null_count: 0
#>   - offset: 0
#>   - buffers[2]: List of 2
#>     $ :<externalptr> 
#>     $ :<externalptr> 
#>   - dictionary: NULL
#>   - children[0]:
carrow_array_stream_get_next(stream1)
#> NULL

…or based on a function that returns one or more carrow_array()s:

counter <- -1
rows_per_chunk <- 5
csv_file <- readr::readr_example("mtcars.csv")
schema <- as_carrow_array(
  readr::read_csv(
    csv_file, 
    n_max = 0,
    col_types = readr::cols(.default = readr::col_double())
  )
)$schema

stream2 <- carrow_array_stream_function(schema, function() {
  counter <<- counter + 1L
  result <- readr::read_csv(
    csv_file, 
    skip = 1 + (counter * rows_per_chunk),
    n_max = rows_per_chunk,
    col_names = c(
      "mpg", "cyl", "disp", "hp", "drat",
      "wt", "qsec", "vs", "am", "gear", "carb"
    ),
    col_types = readr::cols(.default = readr::col_double())
  )
  
  if (nrow(result) > 0) result else NULL
})

You can pass these to Arrow as a RecordBatchReader using carrow_array_stream_to_arrow():

reader <- carrow_array_stream_to_arrow(stream2)
as.data.frame(reader$read_table())
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

Currently, attempting to export an arrow RecordBatchReader() segfaults for an unknown reason, but in theory one could also go the other direction.

C data access

The C data interface is ABI stable (and a version of the stream interface will be ABI stable in the future) so you can access the underlying pointers in compiled code from any R package (or inline C or C++ code). A carrow_schema() is an external pointer to a struct ArrowSchema, a carrow_array_data() is an external pointer to a struct ArrowArray, and a carrow_array() is a list() of a carrow_schema() and a carrow_array_data().

#include <R.h>
#include <Rinternals.h>
#include "carrow.h"

SEXP extract_null_count(SEXP array_data_xptr) {
  struct ArrowArray* array_data = (struct ArrowArray*) R_ExternalPtrAddr(array_data_xptr);
  return Rf_ScalarInteger(array_data->null_count);
}
.Call("extract_null_count", as_carrow_array(c(NA, NA, 1:5))$array_data)
#> [1] 2
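
A similar (hypothetical) helper can read fields on the schema side; for example, this sketch returns the format string stored in the struct ArrowSchema behind a carrow_schema() (the name extract_format is made up for illustration):

#include <R.h>
#include <Rinternals.h>
#include "carrow.h"

SEXP extract_format(SEXP schema_xptr) {
  // a carrow_schema() is an external pointer to a struct ArrowSchema
  struct ArrowSchema* schema = (struct ArrowSchema*) R_ExternalPtrAddr(schema_xptr);
  // format is a null-terminated string (e.g., "i" for int32)
  return Rf_mkString(schema->format);
}
.Call("extract_format", as_carrow_array(1:5)$schema)
#> [1] "i"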

The lifecycles of objects pointed to by the external pointers are managed by R’s garbage collector: any object that gets garbage collected has its release() callback called (if it isn’t NULL) and the underlying memory for the struct Arrow... freed. You can call the release() callback yourself from compiled code but you probably don’t want to unless you’re explicitly limiting access to an object.
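
If you do need to release an array's data explicitly from compiled code, a minimal sketch (using only the conventions of the Arrow C data interface; the helper name is hypothetical) looks like this:

#include "carrow.h"

// explicitly release the data behind a struct ArrowArray
void release_array_data(struct ArrowArray* array_data) {
  // a NULL release callback means the data was already released
  if (array_data->release != NULL) {
    // the callback frees the producer's resources and marks the struct
    // as released by setting array_data->release to NULL itself
    array_data->release(array_data);
  }
}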

Comments
  • Minor compiler warnings

    Thanks for putting this together -- I've been following it on and off under the different names it has had over the last few months.

    One thing I noticed, which may be due to my use of gcc / g++ (now at -10 or -11) while you presumably use clang++, is that I get two nuisance warnings. They are easy to fix and I would be happy to send you a PR:

    • in src/cast-from-array.c you declare double_copy_result to hold the function return, but the compiler is unhappy that the result is never used -- the simple fix is to not declare and assign it
    • in src/narrow/buffer.cc there is a simple signed/unsigned mismatch.

    That's all, and I don't mean to intrude or trample on your feet, but you appear to be a rather careful coder, so maybe you'd value two fewer warnings.

    opened by eddelbuettel 6
  • Could narrow_pointer_addr_dbl be exported as well?

    You have

    # only needed for arrow package before 7.0.0
    narrow_pointer_addr_dbl <- function(ptr) {
      .Call(narrow_c_pointer_addr_dbl, ptr)
    }

    and I had been / am using schema and array pointers cast to double, which maps reasonably well to non-R use. Is there a reason this could not be exported, regardless of whether or not Arrow 7.0 changed this? (Also, changed to ... what? I did not immediately find a replacement or alternative in the updated arrow package, but I will admit that I find navigating its code base a little difficult given its size and rate of change.)
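
    As a rough illustration of this convention (a sketch only, with a hypothetical helper name, assuming the address is stored as the plain numeric value of a double), a non-R consumer could recover the pointer like this:

    #include <stdint.h>

    struct ArrowSchema;  /* full definition comes from the Arrow C data interface header */

    /* recover a schema pointer that was passed around as the numeric
       (double) value of its address */
    struct ArrowSchema* schema_from_addr_dbl(double addr) {
      /* exact as long as the address fits in a double's 53-bit mantissa */
      return (struct ArrowSchema*) (uintptr_t) addr;
    }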

    Also, #5 really asked for a C-level API, as helpful as the exported R functions are. I may still play with a local variant to export the object code and see how it goes.

    Thanks again for the package. It is quite helpful.

    opened by eddelbuettel 4
  • Exporting record batch readers from arrow package segfaults

    ...probably a misunderstanding on my part about the object types that are expected:

    library(carrow)

    # some test data
    df <- data.frame(a = 1L, b = 2, c = "three")
    batch <- arrow::record_batch(df)
    tf <- tempfile()

    # write a valid file
    file_obj <- arrow::FileOutputStream$create(tf)
    writer <- arrow::RecordBatchFileWriter$create(file_obj, batch$schema)
    writer$write(batch)
    writer$close()
    file_obj$close()

    # create the reader
    read_file_obj <- arrow::ReadableFile$create(tf)
    reader <- arrow::RecordBatchFileReader$create(read_file_obj)

    # export it to carrow
    stream <- as_carrow_array_stream(reader)

    schema <- carrow_array_stream_get_schema(stream)
    identical(
      carrow_schema_info(schema, recursive = TRUE),
      carrow_schema_info(as_carrow_schema(reader$schema), recursive = TRUE)
    )
    #> [1] TRUE

    # skip("Attempt to read batch from exported RecordBatchReader segfaults")
    # batch <- carrow_array_stream_get_next(stream)

    read_file_obj$close()
    unlink(tf)

    Created on 2021-11-23 by the reprex package (v2.0.1)

    opened by paleolimbot 1
  • Fedora 36 gcc 12 problems

    narrow-check.zip shows what happens when R CMD check is run on Fedora 36 with gcc 12.1.1 and the Fedora libarrow etc. binaries. Same for geoarrow: geoarrow-check.zip

    The GDAL 3.5.0 Arrow and Parquet drivers read the GDAL autotest/ogr/data examples OK. My reason for raising a diffuse issue is https://github.com/wcjochem/sfarrow/issues/14, a new failing revdep check on F36/gcc12 with Fedora's arrow RPMs (previously no external arrow libraries were involved at all). Since you are obviously looking at checking the GDAL drivers, I think it might be worth seeing how to cross-check sf with GDAL 3.5.0 and the two experimental drivers, geoarrow and sfarrow (and others if they appear - maybe Python and reticulate). What do you think (@edzer for reference, this format feels promising)?

    opened by rsbivand 4
  • Could the C API be exported ?

    narrow already automates creating the needed instantiations via src/init.c. That registers the object code.

    What would you think about going one step further and providing a file, say narrowAPI.h, with a bunch of declarations such as the following (untested, adapting from prior work)?

    SEXP attribute_hidden narrow_c_array_info(SEXP x) {
      static SEXP(*fun)(SEXP) = (SEXP(*)(SEXP)) R_GetCCallable("narrow","narrow_c_array_info");
      return fun(x);
    }
    

    and so on.
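
    For reference, a sketch of the matching registration side (assuming narrow_c_array_info is one of the routines registered in src/init.c) would use R_RegisterCCallable():

    #include <R.h>
    #include <Rinternals.h>
    #include <R_ext/Rdynload.h>

    SEXP narrow_c_array_info(SEXP x);  /* defined elsewhere in the package */

    void R_init_narrow(DllInfo* dll) {
      /* makes the routine retrievable via R_GetCCallable("narrow", "narrow_c_array_info") */
      R_RegisterCCallable("narrow", "narrow_c_array_info", (DL_FUNC) narrow_c_array_info);
      /* ... existing routine and .Call registration ... */
    }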

    A number of packages do this (e.g. nloptr; I have two small packages exporting some of R's own (unexported) API functions via copies in the package, and have added it to a few other small-scope packages). The canonical example (also in WRE) is always lme4 and Matrix.

    Would you have any interest in supporting such use?

    opened by eddelbuettel 11
Owner
Dewey Dunnington
R developer at @voltrondata, former @rstudio summer intern, ggplot2