An Aspiring Drop-In Replacement for Pandas at Scale

Overview

Legate Pandas

Legate Pandas is a distributed and accelerated drop-in replacement for Pandas. It enables high-performance, scalable execution of dataframe programs on multi-GPU systems by combining the Legion runtime with the GPU-accelerated dataframe kernels in cuDF. Legate Pandas targets dataframe programs whose data processing requirements cannot be met by a single GPU.

Here we show some preliminary performance results with Legate Pandas.

The following shows the weak scaling performance of a join micro-benchmark measured on an NVIDIA DGX SuperPOD (lower is better):

[Figure: weak scaling performance of the join micro-benchmark on an NVIDIA DGX SuperPOD]

All implementations (Legate Pandas, Dask+cuDF, and MPI+cuDF) use the same GPU-accelerated dataframe kernels in cuDF and differ only in the programming system and communication API. (The "explicit comm" version used hand-optimized all-to-all shuffle code instead of Dask's shuffle implementation.) The Legate Pandas version achieved almost the same level of performance as the hand-written MPI+cuDF version, even though the latter required significantly more effort to write. Both Dask versions struggled to keep up with Legate Pandas.

Legate Pandas also showed better performance than Dask+cuDF on a realistic example. The following shows weak scaling performance of the mortgage data example measured on an NVIDIA DGX SuperPOD (the lower the line is, the better); Legate Pandas was on average ~2.4X faster than Dask:

[Figure: weak scaling performance of the mortgage data example on an NVIDIA DGX SuperPOD]

Legate Pandas is still a work in progress and is missing some features that Dask+cuDF supports, but we believe that its performance and composability demonstrate the value of our approach.

If you have questions, please contact us at legate(at)nvidia.com.


  1. Dependencies
  2. Installation
  3. Execution
  4. Supported Features
  5. Differences between Pandas and Legate Pandas
  6. Limitations and Known Issues

Dependencies

Legate Pandas requires Python >= 3.6 and the following packages:

  • arrow-cpp=1.0.1
  • arrow-cpp-proc=3.0.0
  • pyarrow=1.0.1
  • numpy
  • nccl>=2.7
  • cudf=0.19
  • librmm=0.19
  • rmm=0.19

All packages except the last three (cudf, librmm, and rmm) can be found in the conda-forge channel; those three can be downloaded from the rapidsai channel.

We provide a conda environment file that installs all these dependencies in one step. Use the following command to create a conda environment with it:

conda env create -n legate -f conda/legate_pandas_dev_nccl2.8.yml

Users must also install Legate Core to build Legate Pandas from source.
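
As a quick sanity check before building, something like the following sketch can report whether the Python-level dependencies are visible in the active environment. The helper names are hypothetical and not part of Legate Pandas. The sketch only checks that the modules can be found; it does not check the pinned versions, and it does not cover the native libraries (arrow-cpp, nccl, librmm).

```python
import sys
from importlib.util import find_spec

# Import names of the Python-level dependencies listed above.
# Native libraries (arrow-cpp, nccl, librmm) are not importable and are not checked.
PYTHON_DEPENDENCIES = ("numpy", "pyarrow", "cudf", "rmm")


def python_is_supported(version=sys.version_info):
    """Return True if the interpreter version is at least 3.6."""
    return tuple(version[:2]) >= (3, 6)


def missing_dependencies(names=PYTHON_DEPENDENCIES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    print("Missing packages:", missing_dependencies() or "none")
```

Running the script inside the conda environment created above should report no missing packages if the environment file installed cleanly.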

Installation

Legate Pandas can be built and installed from source using two install scripts, setup.py and install.py. Both can be used interchangeably, but the latter provides additional flags to configure the build, which are intended for advanced users. To see the list of all build flags, run ./install.py --help.

Both scripts take five essential flags, one for a path to each dependency:

  • --with-core (path to the legate.core installation)
  • --with-arrow (path to arrow-cpp)
  • --with-cudf (path to cudf)
  • --with-rmm (path to rmm)
  • --with-nccl (path to nccl)

These paths need to be provided only once; they are cached for later rebuilds. An example command to install Legate Pandas in a conda environment created as described above is:

./install.py --with-core (path-to-legate.core) \
  --with-arrow "$CONDA_PREFIX" \
  --with-cudf "$CONDA_PREFIX" \
  --with-rmm "$CONDA_PREFIX" \
  --with-nccl "$CONDA_PREFIX"

Execution

The very first step to start using Legate Pandas is to replace this import statement in your program:

import pandas as pd

with this:

import legate.pandas as pd

To execute Legate Pandas programs, you need to use the custom Python launcher legate included in the Legate Core installation. The launcher uses only CPUs by default, and you can command it to use GPUs with --gpus (gpu count). The launcher takes other machine flags (such as the maximum framebuffer size to use) to configure execution; see the "How Do I Use Legate" section for further details.

Supported Features

Pandas has such a large API that any attempt to build a drop-in replacement for it takes enormous effort. Legate Pandas is still a work in progress and currently covers the following features, which we think are essential for data processing:

  • Joins (merge and join)
  • Groupby reductions (groupby.sum, groupby.mean, groupby.var, etc.)
  • Sorting (sort_values and sort_index)
  • Arithmetic and comparison operators
  • Cumulative operators (cumsum, cummax, etc.)
  • Reduction operators (sum, mean, var, etc.)
  • IO with CSV and Parquet file formats

See the API reference for a complete list of supported features.

Legate Pandas currently supports the following data types:

  • All primitive data types
  • String
  • Categorical types
  • datetime64[ns]

Differences between Pandas and Legate Pandas

Since Pandas was not designed and implemented with parallel execution in mind, many of its semantic guarantees are not well suited to a distributed execution setting. We strive to match Pandas' semantics as closely as possible so that users can port their code to Legate Pandas effortlessly, but in some cases we had to diverge for performance reasons:

  • Joins do not sort keys of the outputs.

  • Groupby reductions do not sort keys of the outputs by default. Users can still request sorted outputs with sort=True, which merely sorts the outputs before returning them.

  • Concatenation and append operations on axis=0 perform a union of the operand dataframes/series rather than a back-to-back concatenation.
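
Because join and groupby outputs are not ordered by key, code ported from Pandas that compares results row by row may need an order-insensitive check. The following is a minimal sketch in plain Python, assuming the rows have already been extracted as tuples; the helper name `same_rows` and the sample data are hypothetical and not part of Legate Pandas.

```python
from collections import Counter


def same_rows(left_rows, right_rows):
    """Return True if both iterables hold the same rows, ignoring row order.

    Duplicate rows are counted, so a row appearing twice on one side
    must also appear twice on the other.
    """
    return Counter(map(tuple, left_rows)) == Counter(map(tuple, right_rows))


# Rows as Pandas might return them (sorted by key) ...
sorted_rows = [(1, "a"), (2, "b"), (2, "c")]
# ... and the same rows in an arbitrary order, as an unsorted join might return them.
unsorted_rows = [(2, "c"), (1, "a"), (2, "b")]

print(same_rows(sorted_rows, unsorted_rows))
print(same_rows(sorted_rows, [(1, "a"), (2, "b")]))
```

When a sorted result is actually needed, sorting explicitly with sort_values (or passing sort=True to groupby) is the more direct option.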

Limitations and Known Issues

The following is the list of limitations and known issues in the current release of Legate Pandas. We are actively working on addressing these issues. Any comments, suggestions, and feature requests are welcome.

  • The current partitioning strategy simply distributes a dataframe or series across all the GPUs (or CPUs, if GPUs are not available) as evenly as possible. A more intelligent partitioning heuristic taking the data size into account is still in progress.
  • All operations that take more than one dataframe or series, such as binary operators, expect all the operands to be aligned on their indexes. For example, Legate Pandas throws an unimplemented exception on the following code:
    x = pd.Series([1, 2])
    y = pd.Series([1, 2], index=[3, 4])
    x + y
    
  • Legate Pandas raises a hard error for cases where Pandas would have raised exceptions, such as failing to encode a value into a given set of categories.
  • Categories must be strings for now.
  • Legate Pandas currently piggybacks on Pandas for indexes and categorical types and provides no native abstractions for them. In particular, accessing the index property of a dataframe or a series materializes the index, which is distributed across multiple memories, into a single Python object. This can be problematic when the index does not fit in system memory. Users should avoid accessing index and instead convert the index to a column with reset_index.
  • Some options of the operations listed in the API reference may not be supported. Legate Pandas raises an exception when the user passes such an option. In particular, the loc and iloc locators do not accept arbitrary lists of index values as indexers for now.
  • Locators, head, and tail currently copy the selected part of the operand dataframe/series, where Pandas would return a view of it. For example, the following in-place update internally copies 999 untouched elements:
    x = pd.Series(range(1000))
    x.iloc[0] = 0
    
  • In general, the CPU implementation in Legate Pandas is only functionally correct and not optimized. We recommend you run Legate Pandas on GPUs for any performance evaluation.
  • There is a known issue in NCCL 2.8 that causes hangs on multi-node systems using Volta or older generations of GPUs. On such systems, please use NCCL 2.7 to avoid the issue. We provide a conda environment file that installs NCCL 2.7 instead of 2.8.
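
To make the first limitation above concrete, here is a minimal sketch of what distributing rows "as evenly as possible" can look like. It assumes contiguous row ranges, which is only one plausible reading; the function name is hypothetical, and the layout Legate Pandas actually chooses is not specified here.

```python
def even_partition(num_rows, num_procs):
    """Split range(num_rows) into num_procs contiguous [start, stop) ranges.

    Range sizes differ by at most one row; the first (num_rows % num_procs)
    ranges receive the extra row.
    """
    if num_procs <= 0:
        raise ValueError("num_procs must be positive")
    base, extra = divmod(num_rows, num_procs)
    bounds = []
    start = 0
    for rank in range(num_procs):
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


print(even_partition(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Note that such a scheme depends only on the row count and the number of processors, so a very small dataframe is still spread across every GPU. This is the kind of case a size-aware heuristic would handle differently.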