PGSpider: High-Performance SQL Cluster Engine for distributed big data.

Related tags

Database pgspider
Overview

PGSpider

PGSpider is High-Performance SQL Cluster Engine for distributed big data.
PGSpider can access a number of data sources using Foreign Data Wrapper(FDW) and retrieves the distributed data source vertically.
Usage of PGSpider is the same as PostgreSQL except its program name is pgspider and default port number is 4813. You can use any client applications such as libpq and psql.

Features

  • Multi-Tenant User can get records in multi tables by one SQL easily.
    If there are tables with similar schema in each data source, PGSpider can view them as a single virtual table: We call it as Multi-Tenant table.

  • Parallel processing
    PGSpider executes queries and fetches results from child nodes in parallel.
    PGSpider expands Multi-Tenant table to child tables, creates new threads for each child table to access corresponding data source.

  • Pushdown
    WHERE clause and aggregation functions are pushed down to child nodes.
    Pushdown to Multi-tenant tables occur error when AVG, STDDEV and VARIANCE are used.
    PGSPider improves this error, PGSpider can execute them.

How to build PGSpider

Clone PGSpider source code.

git clone https://github.com/pgspider/pgspider.git

Build and install PGSpider and extensions.

cd pgspider
./configure
make
sudo make install
cd contrib/pgspider_core_fdw
make
sudo make install
cd ../pgspider_fdw
make
sudo make install

Default install directory is /usr/local/pgspider.

Usage

For example, we will create 2 different child nodes, SQLite and PostgreSQL. They are accessed by PGSpider as root node. Please install SQLite and PostgreSQL for child nodes.

After that, we install PostgreSQL FDW and SQLite FDW into PGSpider.

Install SQLite FDW

cd ../
git clone https://github.com/pgspider/sqlite_fdw.git
cd sqlite_fdw
make
sudo make install

Install PostgreSQL FDW

cd ../postgres_fdw
make
sudo make install

Start PGSpider

PGSpider binary name is same as PostgreSQL.
Default install directory is changed.

/usr/local/pgspider

Create database cluster and start server.

cd /usr/local/pgspider/bin
./initdb -D ~/pgspider_db
./pg_ctl -D ~/pgspider_db start
./createdb pgspider

Connect to PGSpider.

./psql pgspider

Load extension

PGSpider (Parent node)

CREATE EXTENSION pgspider_core_fdw;

PostgreSQL, SQLite (Child node)

CREATE EXTENSION postgres_fdw;
CREATE EXTENSION sqlite_fdw;

Create server

PGSpider (Parent node)

CREATE SERVER parent FOREIGN DATA WRAPPER pgspider_core_fdw OPTIONS (host '127.0.0.1', port '4813');

PostgreSQL, SQLite (Child node)
In this example, child PostgreSQL node is localhost and port is 5432.
SQLite node's database is /tmp/temp.db.

CREATE SERVER postgres_svr FOREIGN DATA WRAPPER postgres_fdw OPTIONS(host '127.0.0.1', port '5432', dbname 'postgres');
CREATE SERVER sqlite_svr FOREIGN DATA WRAPPER sqlite_fdw OPTIONS(database '/tmp/temp.db');

Create user mapping

PGSpider (Parent node)

Create user mapping for PGSpider. User and password are for current psql user.

CREATE USER MAPPING FOR CURRENT_USER SERVER parent OPTIONS(user 'user', password 'pass');

PostgreSQL (Child node)

CREATE USER MAPPING FOR CURRENT_USER SERVER postgres_svr OPTIONS(user 'user', password 'pass');

SQLite (Child node)
No need to create user mapping.

Create Multi-Tenant table

PGSpider (Parent node)
You need to declare a column named "__spd_url" on parent table.
This column is node location in PGSpider. It allows you to know where the data is comming from node.
In this example, we define 't1' table to get data from PostgreSQL node and SQLite node.

CREATE FOREIGN TABLE t1(i int, t text, __spd_url text) SERVER parent;

When expanding Multi-Tenant table to data source tables, PGSpider searches child node tables by name having [Multi-Tenant table name]__[data source name]__0.

PostgreSQL, SQLite (Child node)

CREATE FOREIGN TABLE t1__postgres_svr__0(i int, t text) SERVER postgres_svr OPTIONS (table_name 't1');
CREATE FOREIGN TABLE t1__sqlite_svr__0(i int, t text) SERVER sqlite_svr OPTIONS (table 't1');

Access Multi-Tenant table

SELECT * FROM t1;

  i |  t  | __spd_url 
----+-----+----------------
  1 | aaa | /sqlite_svr/
  2 | bbb | /sqlite_svr/
 10 | a   | /postgres_svr/
 11 | b   | /postgres_svr/
(4 rows)

Access Multi-Tenant table using node filter

You can choose getting node with 'IN' clause after FROM items (Table name).

SELECT * FROM t1 IN ('/postgres_svr/');

  i | t | __spd_url 
----+---+----------------
 10 | a | /postgres_svr/
 11 | b | /postgres_svr/
(2 rows)

Tree Structure

PGSpider can get data from child PGSpider, it means PGSpider can create tree structure.
For example, we will create a new PGSpider as root node which connects to PGSpider of previous example.
The new root node is parent of previous PGSpider node.

Start new root PGSpider

Create new database cluster with initdb and change port number.
After that, start and connect to new root node.

Load extension

PGSpider (new root node)
If child node is PGSpider, PGSpider use pgspider_fdw.

CREATE EXTENSION pgspider_core_fdw;
CREATE EXTENSION pgspider_fdw;

Create server

PGSpider (new root node)

CREATE SERVER new_root FOREIGN DATA WRAPPER pgspider_core_fdw OPTIONS (host '127.0.0.1', port '54813') ;

PGSpider (Parent node)

CREATE SERVER parent FOREIGN DATA WRAPPER pgspider_svr OPTIONS
(host '127.0.0.1', port '4813') ;

Create user mapping

PGSpider (new root node)

CREATE USER MAPPING FOR CURRENT_USER SERVER new_root OPTIONS(user 'user', password 'pass');

PGSpider (Parent node)

CREATE USER MAPPING FOR CURRENT_USER SERVER parent OPTIONS(user 'user', password 'pass');

Create Multi-Tenant table

PGSpider (new root node)

CREATE FOREIGN TABLE t1(i int, t text, __spd_url text) SERVER new_root;

PGSpider (Parent node)

CREATE FOREIGN TABLE t1__parent__0(i int, t text, __spd_url text) SERVER parent;

Access Multi-Tenant table

SELECT * FROM t1;

  i |  t  |      __spd_url 
----+-----+-----------------------
  1 | aaa | /parent/sqlite_svr/
  2 | bbb | /parent/sqlite_svr/
 10 | a   | /parent/postgres_svr/
 11 | b   | /parent/postgres_svr/
(4 rows)

Note

When a query to foreign tables fails, you can find why it fails by seeing a query executed in PGSpider with EXPLAIN (VERBOSE).

Contributing

Opening issues and pull requests are welcome.

License

Portions Copyright (c) 2018-2021, TOSHIBA CORPORATION

Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies.

See the LICENSE file for full details.

Issues
  • No rule to make target 'pgspider_core_util.o', needed by 'pgspider_core_fdw.so'

    No rule to make target 'pgspider_core_util.o', needed by 'pgspider_core_fdw.so'

    i use docker build , but got this error dockerfile

    FROM debian:stretch-slim
    RUN apt-get update && apt-get install -y build-essential wget libreadline-dev  zlib1g-dev
    WORKDIR /app
    RUN wget https://ftp.postgresql.org/pub/source/v11.6/postgresql-11.6.tar.gz
    RUN tar xvf postgresql-11.6.tar.gz
    COPY pgspider/pgspider.patch /app/pgspider.patch
    RUN patch -p1 -d postgresql-11.6 < /app/pgspider.patch
    RUN cd postgresql-11.6 \
       && ./configure \
       && make && make install \
       && cd contrib/pgspider_core_fdw \
       && make && make install \ 
       && cd ../pgspider_fdw \
       && make && make install
    RUN  rm -rf /var/lib/apt/lists/*
    
    opened by rongfengliang 2
  • Error when running sample code

    Error when running sample code

    @khieuvm Thank you for sharing but it comes Error when running sample code Can you help me to see why?

    2022-06-06 15:14:30.219 CST [9643] ERROR: SQL error during prepare: no such table: main.t1 SELECT i, t FROM main."t1" 2022-06-06 15:14:30.219 CST [9643] ERROR: PGSpider fail to iterate tuple from child thread 2022-06-06 15:14:30.219 CST [9643] STATEMENT: SELECT * FROM t1; ERROR: SQL error during prepare: no such table: main.t1 SELECT i, t FROM main."t1" ERROR: PGSpider fail to iterate tuple from child thread

    opened by roycyz 1
  • Make this an extension instead of a patch

    Make this an extension instead of a patch

    If there are places where PostgreSQL needs hooks or other extensibility mechanisms to enable this, it'd be great to point those out and send them to the developer community.

    opened by davidfetter 5
Releases(v2.0.0)
Owner
PGSpider
Toshiba Corporation Corporate Software Engineering & Technology Center. Distributed Data Search Framework for utilization of big-data and IoT-data
PGSpider
OceanBase is an enterprise distributed relational database with high availability, high performance, horizontal scalability, and compatibility with SQL standards.

What is OceanBase database OceanBase Database is a native distributed relational database. It is developed entirely by Alibaba and Ant Group. OceanBas

OceanBase 4.5k Jul 30, 2022
YugabyteDB is a high-performance, cloud-native distributed SQL database that aims to support all PostgreSQL features

YugabyteDB is a high-performance, cloud-native distributed SQL database that aims to support all PostgreSQL features. It is best to fit for cloud-native OLTP (i.e. real-time, business-critical) applications that need absolute data correctness and require at least one of the following: scalability, high tolerance to failures, or globally-distributed deployments.

yugabyte 6.7k Aug 2, 2022
dqlite is a C library that implements an embeddable and replicated SQL database engine with high-availability and automatic failover

dqlite dqlite is a C library that implements an embeddable and replicated SQL database engine with high-availability and automatic failover. The acron

Canonical 3.1k Jul 29, 2022
MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.

Copyright (c) 2000, 2021, Oracle and/or its affiliates. This is a release of MySQL, an SQL database server. License information can be found in the

MySQL 8.1k Aug 6, 2022
Kunlun distributed DBMS is a NewSQL OLTP relational distributed database management system

Kunlun distributed DBMS is a NewSQL OLTP relational distributed database management system. Application developers can use Kunlun to build IT systems that handles terabytes of data, without any effort on their part to implement data sharding, distributed transaction processing, distributed query processing, crash safety, high availability, strong consistency, horizontal scalability. All these powerful features are provided by Kunlun.

zettadb 102 Jul 29, 2022
GridDB is a next-generation open source database that makes time series IoT and big data fast,and easy.

Overview GridDB is Database for IoT with both NoSQL interface and SQL Interface. Please refer to GridDB Features Reference for functionality. This rep

GridDB 1.8k Jul 28, 2022
An open-source big data platform designed and optimized for the Internet of Things (IoT).

An open-source big data platform designed and optimized for the Internet of Things (IoT).

null 18.7k Aug 4, 2022
TimescaleDB is an open-source database designed to make SQL scalable for time-series data.

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.

Timescale 13.4k Aug 7, 2022
Nebula Graph is a distributed, fast open-source graph database featuring horizontal scalability and high availability

Nebula Graph is an open-source graph database capable of hosting super large scale graphs with dozens of billions of vertices (nodes) and trillions of edges, with milliseconds of latency.

vesoft inc. 817 Aug 4, 2022
A type safe SQL template library for C++

sqlpp11 A type safe embedded domain specific language for SQL queries and results in C++ Documentation is found in the wiki So what is this about? SQL

Roland Bock 2k Aug 2, 2022
DuckDB is an in-process SQL OLAP Database Management System

DuckDB is an in-process SQL OLAP Database Management System

DuckDB 5.7k Aug 7, 2022
A bare-bone SQL implementation

MiniSQL A bare-bone SQL implementation. Project Structure include folder contains header files of all modules. These header files are meant to be shar

null 16 Apr 23, 2022
A PostgreSQL extension providing an async networking interface accessible via SQL using a background worker and curl.

pg_net is a PostgreSQL extension exposing a SQL interface for async networking with a focus on scalability and UX.

Supabase 41 Jun 19, 2022
A lightweight header-only C++11 library for quick and easy SQL querying with QtSql classes.

EasyQtSql EasyQtSql is a lightweight header-only C++11 library for quick and easy SQL querying with QtSql classes. Features: Header only C++11 library

null 44 Jun 28, 2022
BerylDB is a data structure data manager that can be used to store data as key-value entries.

BerylDB is a data structure data manager that can be used to store data as key-value entries. The server allows channel subscription and is optimized to be used as a cache repository. Supported structures include lists, sets, and keys.

BerylDB 194 Jul 27, 2022
pgagroal is a high-performance protocol-native connection pool for PostgreSQL.

pgagroal is a high-performance protocol-native connection pool for PostgreSQL.

Agroal 524 Jul 27, 2022
High-performance time-series aggregation for PostgreSQL

PipelineDB has joined Confluent, read the blog post here. PipelineDB will not have new releases beyond 1.0.0, although critical bugs will still be fix

PipelineDB 2.5k Aug 3, 2022
A high performance fiber RPC network framework. 高性能协程RPC网络框架

ACID: 高性能协程RPC框架 学习本项目需要有一定的C++,网络,RPC知识 项目依赖 1.项目用到了大量C++17/20新特性,如constexpr if的编译期代码生成,基于c++20 coroutine的无栈协程状态机解析 URI 和 HTTP 协议等。注意,必须安装g++-11,否则不支

null 193 Jul 29, 2022
BaikalDB, A Distributed HTAP Database.

BaikalDB supports sequential and randomised realtime read/write of structural data in petabytes-scale. BaikalDB is compatible with MySQL protocol and it supports MySQL style SQL dialect, by which users can migrate their data storage from MySQL to BaikalDB seamlessly.

Baidu 966 Aug 3, 2022