Apache Arrow Cookbook

Overview

Apache Arrow Cookbooks

Cookbooks are a collection of recipes for common tasks that Arrow users might want to perform. The cookbook is actually composed of multiple cookbooks, one for each supported platform, each containing the recipes for that specific platform.

All cookbooks can be built to HTML and verified by running a set of tests that confirm the recipes still work as expected.

Each cookbook is implemented using platform-specific tools. For this reason, a Makefile is provided that abstracts away platform-specific concerns and makes it possible to build and test all cookbooks without any platform-specific knowledge (as long as the dependencies are available on the target system).

Building All Cookbooks

make all

Testing All Cookbooks

make test

Listing Available Commands

make help

Building a Platform-Specific Cookbook

Refer to make help to learn the commands that build or test the cookbook for the platform you are targeting.
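For example, the following platform targets are mentioned elsewhere in this document (run make help to confirm the exact names available in your checkout):

make cpp
make pytest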

Prerequisites

Both the R and Python cookbooks will try to install the dependencies they need (including the latest pyarrow/Arrow R package versions). This means that as long as you have a working Python/R environment able to install dependencies through the respective package manager, you shouldn't need to install anything manually.

Contributing to the Cookbook

Please refer to the CONTRIBUTING.md file for instructions about how to contribute to the Apache Arrow Cookbook.


All participation in the Apache Arrow project is governed by the Apache Software Foundation’s code of conduct.

Comments
  • [Java] Cookbook Java example does not work as expected with multiple batches

I am trying out the cookbook's Java Flight example. The only change is that I am writing multiple batches; see the "batch" comments in the code.

Upon running this example I am seeing unexpected overlapping results! This gets weirder with multi-threading. Please suggest the correct way of sending multiple batches.

    S1: Server (Location): Listening on port 33333
    C1: Client (Location): Connected to grpc+tcp://0.0.0.0:33333
    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by org.apache.arrow.memory.util.MemoryUtil (file:/Users/rentsher/.m2/repository/org/apache/arrow/arrow-memory-core/8.0.0/arrow-memory-core-8.0.0.jar) to field java.nio.Buffer.address
    WARNING: Please consider reporting this to the maintainers of org.apache.arrow.memory.util.MemoryUtil
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    C2: Client (Populate Data): Wrote 6 batches with 10 rows each
    C3: Client (Get Metadata): FlightInfo{schema=Schema<name: Int(64, true) not null>, descriptor=profiles, endpoints=[FlightEndpoint{locations=[Location{uri=grpc+tcp://0.0.0.0:33333}], ticket=org.apache.arrow.flight.Ticket@58871b0a}], bytes=-1, records=60}
    C4: Client (Get Stream):
    Client Received batch #1, Data:
    vector size: 10
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    Client Received batch #2, Data:
    vector size: 10
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    Client Received batch #3, Data:
    vector size: 10
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    Client Received batch #4, Data:
    vector size: 10
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    Client Received batch #5, Data:
    vector size: 10
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    Client Received batch #6, Data:
    vector size: 10
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    C5: Client (List Flights Info): FlightInfo{schema=Schema<name: Int(64, true) not null>, descriptor=profiles, endpoints=[FlightEndpoint{locations=[Location{uri=grpc+tcp://0.0.0.0:33333}], ticket=org.apache.arrow.flight.Ticket@58871b0a}], bytes=-1, records=60}
    C6: Client (Do Delete Action): Delete completed
    C7: Client (List Flights Info): After delete - No records
    C8: Server shut down successfully
    
    Process finished with exit code 0
    
    package com.iamsmkr.arrowflight;
    
    import org.apache.arrow.flight.Action;
    import org.apache.arrow.flight.AsyncPutListener;
    import org.apache.arrow.flight.Criteria;
    import org.apache.arrow.flight.FlightClient;
    import org.apache.arrow.flight.FlightDescriptor;
    import org.apache.arrow.flight.FlightInfo;
    import org.apache.arrow.flight.FlightServer;
    import org.apache.arrow.flight.FlightStream;
    import org.apache.arrow.flight.Location;
    import org.apache.arrow.flight.Result;
    import org.apache.arrow.flight.Ticket;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.BigIntVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;
    import org.apache.arrow.vector.types.pojo.Schema;
    
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.Iterator;
    
    public class CookbookApp {
    
        public static void main(String[] args) {
    
            Location location = Location.forGrpcInsecure("0.0.0.0", 33333);
            try (BufferAllocator allocator = new RootAllocator()) {
                // Server
                try (FlightServer flightServer = FlightServer.builder(allocator, location, new ArrowFlightProducer(allocator, location)).build()) {
                    try {
                        flightServer.start();
                        System.out.println("S1: Server (Location): Listening on port " + flightServer.getPort());
                    } catch (IOException e) {
                        System.exit(1);
                    }
    
                    // Client
                    try (FlightClient flightClient = FlightClient.builder(allocator, location).build()) {
                        System.out.println("C1: Client (Location): Connected to " + location.getUri());
    
                        // Populate data
                        Schema schema = new Schema(Arrays.asList(
                                new Field("name", new FieldType(false, new ArrowType.Int(64, true), null), null)));
    
                        try (
                                VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schema, allocator);
                                BigIntVector names = (BigIntVector) vectorSchemaRoot.getVector("name")
                        ) {
                            FlightClient.ClientStreamListener listener =
                                    flightClient.startPut(
                                            FlightDescriptor.path("profiles"),
                                            vectorSchemaRoot,
                                            new AsyncPutListener()
                                    );
    
                            // Write six batches of 10 rows each, reusing the same vector buffers
                            for (int batch = 0; batch < 6; batch++) {
                                long start = batch * 10L;
                                for (int j = 0; j < 10; j++) {
                                    names.setSafe(j, start + j);
                                }
                                vectorSchemaRoot.setRowCount(10);
    
                                // Wait until the stream can accept another batch
                                while (!listener.isReady()) {
                                    try {
                                        Thread.sleep(1);
                                    } catch (InterruptedException e) {
                                        e.printStackTrace();
                                    }
                                }
    
                                listener.putNext();
                            }
    
                            listener.completed();
                            listener.getResult();
                            
                            System.out.println("C2: Client (Populate Data): Wrote 2 batches with 3 rows each");
                        }
    
                        // Get metadata information
                        FlightInfo flightInfo = flightClient.getInfo(FlightDescriptor.path("profiles"));
                        System.out.println("C3: Client (Get Metadata): " + flightInfo);
    
                        // Get data information
                        try (FlightStream flightStream = flightClient.getStream(new Ticket(
                                FlightDescriptor.path("profiles").getPath().get(0).getBytes(StandardCharsets.UTF_8)))) {
                            int batch = 0;
                            try (
                                    VectorSchemaRoot vectorSchemaRootReceived = flightStream.getRoot();
                                    BigIntVector names = (BigIntVector) vectorSchemaRootReceived.getVector("name")
                            ) {
                                System.out.println("C4: Client (Get Stream):");
                                while (flightStream.next()) {
                                    batch++;
                                    System.out.println("Client Received batch #" + batch + ", Data:");
                                    int rowCount = vectorSchemaRootReceived.getRowCount();
                                    System.out.println("vector size: " + rowCount);
                                    for (int j = 0; j < rowCount; j++) {
                                        System.out.println(names.get(j));
                                    }
                                }
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
    
                        // Get all metadata information
                        Iterable<FlightInfo> flightInfosBefore = flightClient.listFlights(Criteria.ALL);
                        System.out.print("C5: Client (List Flights Info): ");
                        flightInfosBefore.forEach(t -> System.out.println(t));
    
                        // Do delete action
                        Iterator<Result> deleteActionResult = flightClient.doAction(new Action("DELETE",
                                FlightDescriptor.path("profiles").getPath().get(0).getBytes(StandardCharsets.UTF_8)));
                        while (deleteActionResult.hasNext()) {
                            Result result = deleteActionResult.next();
                            System.out.println("C6: Client (Do Delete Action): " +
                                    new String(result.getBody(), StandardCharsets.UTF_8));
                        }
    
                        // Get all metadata information (to validate the delete action)
                        Iterable<FlightInfo> flightInfos = flightClient.listFlights(Criteria.ALL);
                        flightInfos.forEach(System.out::println);
                        System.out.println("C7: Client (List Flights Info): After delete - No records");
    
                        // Server shut down
                        flightServer.shutdown();
                        System.out.println("C8: Server shut down successfully");
                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    }
    
    opened by iamsmkr 13
  • gh-pages or Apache hosting?

    It appears that in addition to gh-pages we can use Apache hosting. The only real difference would be the URLs.

    https://apache.github.io/arrow-cookbook
    https://arrow.apache.org/cookbook

However, the latter approach may require some synchronization with the main Arrow repository (I'm not 100% sure whether it is sufficient to just make sure the main Arrow site doesn't have a cookbook directory). We might need to ask Infra, but I'd rather test whether there is interest before doing that.

    discussion 
    opened by westonpace 11
  • [C++] Build failure with clang 10.0 and clang-tidy 10.0

    I'm not sure whether this is expected or not. I'm simply trying make cpp

    /home/antoine/arrow/cookbook/cpp/code/basic_arrow.cc:23:1: error: class 'BasicArrow_ReturnNotOkNoMacro_Test' defines a copy constructor and a copy assignment operator but does not define a destructor, a move constructor or a move assignment operator [hicpp-special-member-functions,-warnings-as-errors]
    TEST(BasicArrow, ReturnNotOkNoMacro) {
    ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/gtest.h:2338:42: note: expanded from macro 'TEST'
    #define TEST(test_suite_name, test_name) GTEST_TEST(test_suite_name, test_name)
                                             ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/gtest.h:2332:3: note: expanded from macro 'GTEST_TEST'
      GTEST_TEST_(test_suite_name, test_name, ::testing::Test, \
      ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1355:9: note: expanded from macro 'GTEST_TEST_'
      class GTEST_TEST_CLASS_NAME_(test_suite_name, test_name)                    \
            ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1347:3: note: expanded from macro 'GTEST_TEST_CLASS_NAME_'
      test_suite_name##_##test_name##_Test
      ^
    note: expanded from here
    /home/antoine/arrow/cookbook/cpp/code/basic_arrow.cc:46:1: error: class 'BasicArrow_ReturnNotOk_Test' defines a copy constructor and a copy assignment operator but does not define a destructor, a move constructor or a move assignment operator [hicpp-special-member-functions,-warnings-as-errors]
    TEST(BasicArrow, ReturnNotOk) {
    ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/gtest.h:2338:42: note: expanded from macro 'TEST'
    #define TEST(test_suite_name, test_name) GTEST_TEST(test_suite_name, test_name)
                                             ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/gtest.h:2332:3: note: expanded from macro 'GTEST_TEST'
      GTEST_TEST_(test_suite_name, test_name, ::testing::Test, \
      ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1355:9: note: expanded from macro 'GTEST_TEST_'
      class GTEST_TEST_CLASS_NAME_(test_suite_name, test_name)                    \
            ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1347:3: note: expanded from macro 'GTEST_TEST_CLASS_NAME_'
      test_suite_name##_##test_name##_Test
      ^
    note: expanded from here
    19395 warnings generated.
    Suppressed 19481 warnings (19393 in non-user code, 88 NOLINT).
    Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well.
    2 warnings treated as errors
    make[3]: *** [CMakeFiles/basic_arrow.dir/build.make:76: CMakeFiles/basic_arrow.dir/basic_arrow.cc.o] Error 2
    make[3]: *** Waiting for unfinished jobs....
    /home/antoine/arrow/cookbook/cpp/code/creating_arrow_objects.cc:23:1: error: class 'CreatingArrowObjects_CreateArrays_Test' defines a copy constructor and a copy assignment operator but does not define a destructor, a move constructor or a move assignment operator [hicpp-special-member-functions,-warnings-as-errors]
    TEST(CreatingArrowObjects, CreateArrays) {
    ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/gtest.h:2338:42: note: expanded from macro 'TEST'
    #define TEST(test_suite_name, test_name) GTEST_TEST(test_suite_name, test_name)
                                             ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/gtest.h:2332:3: note: expanded from macro 'GTEST_TEST'
      GTEST_TEST_(test_suite_name, test_name, ::testing::Test, \
      ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1355:9: note: expanded from macro 'GTEST_TEST_'
      class GTEST_TEST_CLASS_NAME_(test_suite_name, test_name)                    \
            ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1347:3: note: expanded from macro 'GTEST_TEST_CLASS_NAME_'
      test_suite_name##_##test_name##_Test
      ^
    note: expanded from here
    19622 warnings generated.
    Suppressed 19709 warnings (19621 in non-user code, 88 NOLINT).
    Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well.
    1 warning treated as error
    make[3]: *** [CMakeFiles/creating_arrow_objects.dir/build.make:76: CMakeFiles/creating_arrow_objects.dir/creating_arrow_objects.cc.o] Error 1
    make[3]: *** Waiting for unfinished jobs....
    /home/antoine/arrow/cookbook/cpp/code/datasets.cc:78:1: error: class 'DatasetReadingTest_DatasetRead_Test' defines a copy constructor and a copy assignment operator but does not define a destructor, a move constructor or a move assignment operator [hicpp-special-member-functions,-warnings-as-errors]
    TEST_F(DatasetReadingTest, DatasetRead) {
    ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/gtest.h:2369:3: note: expanded from macro 'TEST_F'
      GTEST_TEST_(test_fixture, test_name, test_fixture, \
      ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1355:9: note: expanded from macro 'GTEST_TEST_'
      class GTEST_TEST_CLASS_NAME_(test_suite_name, test_name)                    \
            ^
    /home/antoine/miniconda3/envs/pyarrow/include/gtest/internal/gtest-internal.h:1347:3: note: expanded from macro 'GTEST_TEST_CLASS_NAME_'
      test_suite_name##_##test_name##_Test
      ^
    note: expanded from here
    /home/antoine/arrow/cookbook/cpp/code/datasets.cc:148:29: error: no member named 'UseAsync' in 'arrow::dataset::ScannerBuilder' [clang-diagnostic-error]
      ASSERT_OK(scanner_builder.UseAsync(true));
                                ^
    /home/antoine/arrow/cookbook/cpp/code/common.h:33:73: note: expanded from macro 'ASSERT_OK'
      for (const ::arrow::Status& _st = ::arrow::internal::GenericToStatus((expr)); \
                                                                            ^
    31717 warnings and 1 error generated.
    Error while processing /home/antoine/arrow/cookbook/cpp/code/datasets.cc.
    Suppressed 31851 warnings (31716 in non-user code, 135 NOLINT).
    Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well.
    1 warning treated as error
    
    cpp 
    opened by pitrou 10
  • [Java]: Java cookbook for create arrow jni dataset

Java cookbook for creating an Arrow JNI dataset.

We are seeing lots of improvements in the Arrow Java dataset JNI library, latest version 7.0.0.

We are able to run the JNI dataset 7.0.0 version on the latest GitHub CI images without problems, with no need to install or upgrade custom dependencies.

We are adding GitHub CI to support Java cookbook testing on Linux and macOS.


    History messages:

Proposal: use macos-latest for the JNI cookbooks, as it offers more support for the libarrow_dataset_jni library.

    Consider:

• Current limitations for testing libarrow_dataset_jni using GitHub CI:

Linux: Ubuntu 21 is needed for libre2.so.9 and libc.so.6; the latest version available on GitHub is Ubuntu 20.

macOS: libprotobuf.28.dylib needs to be installed, but it is not available as a brew formula; a custom protobuf formula has to be created.

    • Library dependencies:

    (base) ➜ /tmp objdump -p libarrow_dataset_jni.so | grep NEEDED
      NEEDED liblz4.so.1
      NEEDED libsnappy.so.1
      NEEDED libz.so.1
      NEEDED libzstd.so.1
      NEEDED libutf8proc.so.2
      NEEDED libre2.so.9
      NEEDED libthrift-0.13.0.so
      NEEDED libstdc++.so.6
      NEEDED libm.so.6
      NEEDED libgcc_s.so.1
      NEEDED libpthread.so.0
      NEEDED libc.so.6
      NEEDED ld-linux-x86-64.so.2

    (base) ➜ /tmp otool -L libarrow_dataset_jni.dylib
    libarrow_dataset_jni.dylib:
      @rpath/libarrow_dataset_jni.600.dylib (compatibility version 600.0.0, current version 600.0.0)
      /usr/local/opt/lz4/lib/liblz4.1.dylib (compatibility version 1.0.0, current version 1.9.3)
      /usr/local/opt/snappy/lib/libsnappy.1.dylib (compatibility version 1.0.0, current version 1.1.9)
      /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
      /usr/local/opt/zstd/lib/libzstd.1.dylib (compatibility version 1.0.0, current version 1.5.0)
      /usr/local/opt/protobuf/lib/libprotobuf.28.dylib (compatibility version 29.0.0, current version 29.3.0)
      /usr/local/opt/utf8proc/lib/libutf8proc.2.dylib (compatibility version 2.0.0, current version 2.4.1)
      /usr/local/opt/re2/lib/libre2.9.dylib (compatibility version 9.0.0, current version 9.0.0)
      /usr/local/opt/thrift/lib/libthrift-0.15.0.dylib (compatibility version 0.0.0, current version 0.0.0)
      /usr/local/opt/llvm/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
      /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

    opened by davisusanibar 6
  • Explicit arr object creation

Hi, I have started playing with the cookbook examples and there is a little doc section that I did not find clear, hence this PR.

    Issue

The testsetup:: blocks in the rst files are not rendered in the resulting HTML. In the "Given an array with 100 numbers, from 0 to 99" sections, the arr object pops up ex nihilo in the documentation, which might lose the reader a bit, even though most people will guess the np.arange here.

    Proposed solution

Add the array creation code to the testcode:: blocks for those sections to make it more explicit.

    That is going from

    Given an array with 100 numbers, from 0 to 99
    
    print(f"{arr[0]} .. {arr[-1]}")
    
    0 .. 99
    

    to

    Given an array with 100 numbers, from 0 to 99
    
    import numpy as np
    import pyarrow as pa
    
    arr = pa.array(np.arange(100))
    
    print(f"{arr[0]} .. {arr[-1]}")
    
    0 .. 99
    

    in the resulting HTML.

    Thanks, Nathanaël

    opened by Nlte 6
  • [R] Broken links in section 2

    It seems all the links in Section 2 are broken.

    Below are some examples:

    • https://arrow.apache.org/cookbook/r/reading-and-writing-data.html
    • https://arrow.apache.org/cookbook/r/reading-and-writing-data.html#write-a-parquet-file
    • https://arrow.apache.org/cookbook/r/reading-and-writing-data.html#read-a-feather-file


    opened by andreranza 5
  • [CI] Run cookbook recipes against arrow nightlies only manually and scheduled

Currently, on PRs, we run tests against both the stable and development versions of Arrow; this might cause issues when working against the current stable version. The idea is to add a stable branch in the future and deploy that one, but in the meantime we should only run CI against Arrow dev on demand or as a scheduled task.

    opened by raulcd 4
  • [Java] Link to install instructions (was: Cannot resolve io.netty:netty-transport-native-unix-common:4.1.72.Final)

I tried the Java Flight example from the cookbook (https://arrow.apache.org/cookbook/java/flight.html), but I keep getting the following error when building the project:

    Cannot resolve io.netty:netty-transport-native-unix-common:4.1.72.Final 
    

Since the cookbook doesn't list the dependencies explicitly, I copied them from https://github.com/apache/arrow/blob/master/java/flight/flight-integration-tests/pom.xml. It would be really nice if the docs specified the dependencies clearly.

My pom.xml looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.iamsmkr</groupId>
        <artifactId>arrow-flight-java</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <dependencies>
    
            <dependency>
                <groupId>org.apache.arrow</groupId>
                <artifactId>arrow-vector</artifactId>
                <version>8.0.0</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.arrow</groupId>
                <artifactId>arrow-memory-core</artifactId>
                <version>8.0.0</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.arrow</groupId>
                <artifactId>flight-core</artifactId>
                <version>8.0.0</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.arrow</groupId>
                <artifactId>flight-sql</artifactId>
                <version>8.0.0</version>
            </dependency>
    
            <dependency>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
                <version>3.20.1</version>
            </dependency>
    
            <dependency>
                <groupId>commons-cli</groupId>
                <artifactId>commons-cli</artifactId>
                <version>1.5.0</version>
            </dependency>
    
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>1.7.36</version>
            </dependency>
    
        </dependencies>
    
        <build>
            <plugins>
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>3.0.0</version>
                    <configuration>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    
    </project>
    

Please suggest how I could resolve this problem.

    opened by iamsmkr 4
  • [Python][Flight] How to log the GeneratorStream duration

I made a do_get function that returns a GeneratorStream object.

    import logging
    import time
    from functools import wraps
    
    import pyarrow as pa
    import pyarrow.flight
    
    def time_it(func):
        """This decorator prints the execution time for the decorated function."""
    
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            end = time.time()
            logging.debug("{} executed in {}s".format(func.__name__, round(end - start, 2)))
            return result
    
        return wrapper
    
    class FlightServer(pa.flight.FlightServerBase):
        def __init__(self, host="localhost", location=None, filesystem="s3", **kwargs):
            super(FlightServer, self).__init__(location, **kwargs)
    
        @time_it
        def do_get(self, context, ticket):
            ....
            scanner = dataset.scanner(
                batch_size=1000*1000*10
                )
            return pa.flight.GeneratorStream(
                scanner.projected_schema, scanner.to_batches()
                )
    

    The logged time is not for the complete data transfer between the server and the client but for the initial connection. How can I log the complete duration of the data streaming (from the server-side...)?
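    One possible approach (a sketch, not an official cookbook recipe; the timed_batches helper below is made up for illustration): since do_get only builds the stream object, wrap the batch generator itself so the elapsed time is logged once the stream is exhausted on the server side.

    import logging
    import time

    def timed_batches(batches, label="do_get stream"):
        """Yield record batches, logging the total streaming time when exhausted."""
        start = time.time()
        try:
            for batch in batches:
                yield batch
        finally:
            # Runs when the client has consumed (or abandoned) the stream
            logging.debug("%s streamed for %ss", label, round(time.time() - start, 2))

    # Then, inside do_get:
    #     return pa.flight.GeneratorStream(
    #         scanner.projected_schema, timed_batches(scanner.to_batches())
    #     )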

    python 
    opened by motybz 4
  • [Java]: Java cookbook recipes

1. Initial Java cookbook recipes for:
    • Reading and Writing Data
    • Creating Arrow Objects
    • Working with Schema
    • Data Manipulation
2. Pending tasks:
• Define a way to validate the Java recipe documentation. Planning to use java-sphinx, but it is out of scope. We could probably implement this as a Java unit test and test the source code before the documentation is created, but that validates the source, not the code in the documentation.

• Another pending task is to review the GitHub workflow and align the Java recipes with it.

    opened by davisusanibar 4
  • [Python] Makefile: pytest target fails

I'm getting 1 failed test when running the make pytest target.

    Document: io
    ------------
    **********************************************************************
    File "io.rst", line 799, in default
    Failed example:
        dataset = ds.dataset("s3://ursa-labs-taxi-data/2011",
                             partitioning=["month"])
        for f in dataset.files[:10]:
            print(f)
        print("...")
    Exception raised:
        Traceback (most recent call last):
          File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/doctest.py", line 1336, in __run
            exec(compile(example.source, filename, "single",
          File "<doctest default[0]>", line 1, in <module>
            dataset = ds.dataset("s3://ursa-labs-taxi-data/2011",
          File "/Users/nathanael.leaute/Documents/github/arrow-cookbook/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 655, in dataset
            return _filesystem_dataset(source, **kwargs)
          File "/Users/nathanael.leaute/Documents/github/arrow-cookbook/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 410, in _filesystem_dataset
            return factory.finish(schema)
          File "pyarrow/_dataset.pyx", line 2402, in pyarrow._dataset.DatasetFactory.finish
          File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
          File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
        OSError: Error creating dataset. Could not read schema from 'ursa-labs-taxi-data/2011/01/data.parquet': Could not open Parquet input source 'ursa-labs-taxi-data/2011/01/data.parquet': AWS Error [code 15]: Access Denied. Is this a 'parquet' file?
    **********************************************************************
    1 items had failures:
       1 of  27 in default
    27 tests in 1 items.
    26 passed and 1 failed.
    ***Test Failed*** 1 failures.
    
    

It seems like the ACL on the ursa-labs-taxi-data bucket doesn't allow public access. I don't know if you want to open up the bucket/prefix to the public and incur the AWS bandwidth costs, though. Those are definitely a thing.
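
    If the bucket does allow anonymous reads, one way to avoid the credential lookup (a sketch, not a confirmed fix for this bucket's ACL) is to pass an explicitly anonymous S3 filesystem:

    import pyarrow.dataset as ds
    from pyarrow import fs

    # Anonymous access skips AWS credential resolution; it only works if
    # the bucket policy permits public reads.
    s3 = fs.S3FileSystem(anonymous=True)
    dataset = ds.dataset("ursa-labs-taxi-data/2011",
                         partitioning=["month"], filesystem=s3)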

    bug python 
    opened by Nlte 4
  • [R] Revert installing more dependencies for using nightly packages

    https://github.com/apache/arrow-cookbook/pull/284 installs more packages such as libssh-dev to use nightly R packages.

It's caused by https://github.com/apache/arrow/pull/14235, but it's not expected behavior.

I've fixed it in https://github.com/apache/arrow/commit/43b95e6e01d2bb411a1cc04b6f3f6c07c615b34f, so we can revert the workaround.

    opened by kou 0
  • [R] Fix matomo integration for R bookdown

PR https://github.com/apache/arrow-cookbook/pull/283 added the Matomo integration to the cookbooks. After deploying, we can see the Matomo script on the different cookbooks and the index page, but it does not seem to work with the R bookdown: https://arrow.apache.org/cookbook/r/index.html. This issue's purpose is to track fixing that.

    opened by raulcd 1
  • [Python] Add recipe for appending/replacing data set partitions

The pyarrow docs for the existing_data_behavior param contain this hint:

    This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.

    ‘delete_matching’ is useful when you are writing a partitioned dataset. The first time each partition directory is encountered the entire directory will be deleted. This allows you to overwrite old partitions completely.

    A fully formed recipe for this would be nice as this hint is a bit hidden.
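
    A minimal sketch of what such a recipe could look like (paths, schema, and column names here are invented for illustration):

    import uuid

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"year": [2022, 2022, 2023], "value": [1, 2, 3]})
    part = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")

    # Replace partitions: the first time each partition directory is touched,
    # its old contents are deleted, so matching partitions are rewritten
    # wholesale while the others stay untouched.
    ds.write_dataset(table, "/tmp/my_dataset", format="parquet",
                     partitioning=part,
                     existing_data_behavior="delete_matching")

    # Append instead: keep existing files and use a unique basename_template
    # per write so new files never collide with old ones.
    ds.write_dataset(table, "/tmp/my_dataset", format="parquet",
                     partitioning=part,
                     basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
                     existing_data_behavior="overwrite_or_ignore")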

    opened by assignUser 0