High-performance time-series aggregation for PostgreSQL

PipelineDB has joined Confluent, read the blog post here.

PipelineDB will not have new releases beyond 1.0.0, although critical bugs will still be fixed.

PipelineDB

Overview

PipelineDB is a PostgreSQL extension for high-performance time-series aggregation, designed to power realtime reporting and analytics applications.

PipelineDB allows you to define continuous SQL queries that perpetually aggregate time-series data and store only the aggregate output in regular, queryable tables. You can think of this concept as extremely high-throughput, incrementally updated materialized views that never need to be manually refreshed.

Raw time-series data is never written to disk, making PipelineDB extremely efficient for aggregation workloads.

Continuous queries produce their own output streams, and thus can be chained together into arbitrary networks of continuous SQL.
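
For example, a second continuous view can aggregate over another view's output stream via output_of(), which also appears later in this README. A minimal sketch (the stream and view names here are made up for illustration):

CREATE FOREIGN TABLE clicks_stream (user_id integer) SERVER pipelinedb;

-- First-level aggregation directly over the stream
CREATE VIEW clicks_per_minute WITH (action=materialize) AS
  SELECT date_trunc('minute', arrival_timestamp) AS minute, count(*)
  FROM clicks_stream
  GROUP BY minute;

-- A downstream continuous view reading from the first view's output stream
CREATE VIEW clicks_update_rate WITH (action=materialize) AS
  SELECT count(*) AS updates
  FROM output_of('clicks_per_minute');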

PostgreSQL compatibility

PipelineDB runs on 64-bit architectures and currently supports the following PostgreSQL versions:

  • PostgreSQL 10: 10.1, 10.2, 10.3, 10.4, 10.5
  • PostgreSQL 11: 11.0

Getting started

If you just want to start using PipelineDB right away, head over to the installation docs to get going.

If you'd like to build PipelineDB from source, keep reading!

Building from source

Since PipelineDB is a PostgreSQL extension, you'll need to have the PostgreSQL development packages installed to build PipelineDB.

Next you'll have to install ZeroMQ, which PipelineDB uses for inter-process communication. Here's a gist with instructions to build and install ZeroMQ from source. You'll also need to install some Python dependencies if you'd like to run PipelineDB's Python test suite:

pip install -r src/test/py/requirements.txt

Build PipelineDB

Once PostgreSQL is installed, you can build PipelineDB against it:

make USE_PGXS=1
make install
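
If you're installing against an existing PostgreSQL server rather than using the development flow described below, you'll also need to enable the extension in your database. A minimal sketch, assuming the standard extension setup covered in the installation docs (treat the exact configuration step as an assumption):

-- Assumption: the library must be preloaded before the extension can be created.
-- In postgresql.conf:
--   shared_preload_libraries = 'pipelinedb'
-- Then, in the target database:
CREATE EXTENSION pipelinedb;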

Test PipelineDB (optional)

Run the following command:

make test

Bootstrap the PipelineDB environment

Create PipelineDB's physical data directories, configuration files, etc:

make bootstrap

make bootstrap only needs to be run the first time you install PipelineDB. The resources that make bootstrap creates may continue to be used as you change and rebuild PipelineDB.

Run PipelineDB

Run all of the daemons necessary for PipelineDB to operate:

make run

Enter Ctrl+C to shut down PipelineDB.

make run uses the binaries in the PipelineDB source root compiled by make, so after code changes you don't need to run make install before make run; only make needs to be run.

The basic development flow is:

make
make run
^C

# Make some code changes...
make
make run

Send PipelineDB some data

Now let's generate some test data and stream it into a simple continuous view. First, create the stream and the continuous view that reads from it:

$ psql
=# CREATE FOREIGN TABLE test_stream (key integer, value integer) SERVER pipelinedb;
CREATE FOREIGN TABLE
=# CREATE VIEW test_view WITH (action=materialize) AS SELECT key, COUNT(*) FROM test_stream GROUP BY key;
CREATE VIEW

Events can be emitted to PipelineDB streams using regular SQL INSERTs. Since test_stream is a foreign table using the pipelinedb server, inserting into it delivers events to every continuous query reading from it. Let's emit a single event into the test_stream stream since our continuous view is reading from it:

$ psql
=# INSERT INTO test_stream (key, value) VALUES (0, 42);
INSERT 0 1

The 1 in the INSERT 0 1 response means that 1 event was emitted into a stream that is actually being read by a continuous query. Now let's insert some random data:

=# INSERT INTO test_stream (key, value) SELECT random() * 10, random() * 10 FROM generate_series(1, 100000);
INSERT 0 100000

Query the continuous view to verify that it was properly updated. Were there actually 100,001 events counted?

$ psql -c "SELECT sum(count) FROM test_view"
  sum
-------
100001
(1 row)

What were the 10 most common randomly generated keys?

$ psql -c "SELECT * FROM test_view ORDER BY count DESC limit 10"
 key | count
-----+-------
   2 | 10124
   8 | 10100
   1 | 10042
   7 |  9996
   4 |  9991
   5 |  9977
   3 |  9963
   6 |  9927
   9 |  9915
  10 |  4997
   0 |  4969
(11 rows)

Comments
  • Memory leak

    May 25 07:22:33 vapipeline01 kernel: [733833.174735] [ 8820]  1001  8820  3149706  1516475    2992        0             0 pipeline-server
    May 25 07:22:33 vapipeline01 kernel: [733833.174868] Out of memory: Kill process 8820 (pipeline-server) score 369 or sacrifice child
    May 25 07:22:33 vapipeline01 kernel: [733833.179622] Killed process 8820 (pipeline-server) total-vm:12598824kB, anon-rss:6065816kB, file-rss:84kB
    May 25 07:22:33 vapipeline01 pipeline[8814]: [83-1] LOG:  autovacuum launcher process (PID 8820) was terminated by signal 9: Killed
    May 25 12:45:51 vapipeline01 pipeline[20481]: [8820-1] LOG:  out of file descriptors: Too many open files; release and retry
    May 25 12:45:51 vapipeline01 pipeline[20481]: [8820-2] STATEMENT:  planedemo_altitude
    May 25 13:36:14 vapipeline01 pipeline[20482]: [8820-1] LOG:  out of file descriptors: Too many open files; release and retry
    May 25 13:36:14 vapipeline01 pipeline[20482]: [8820-2] STATEMENT:  planedemo_altitude
    
    bug 
    opened by usmanm 39
  • PipelineDB <-> PostgreSQL interface

    All of the internal refactoring work for #1596 is fairly straightforward and well defined. The most important thing in my mind that needs to be figured out is what exactly we want the PipelineDB <-> PostgreSQL interface to look and feel like at a high level. That is, how do users interact with PipelineDB's abstractions now that we're going to only be using PostgreSQL's interface/syntax?

    I see a few different approaches we can take here. In no particular order,

    0) Keep PipelineDB syntax

    Continuous views, transforms, and streams are fairly unique and simply do not map very directly to any existing PG objects. So it's nice to have a small bit of syntax to represent these objects, and it helps new users become familiar with PipelineDB's abstractions. It also makes documentation more distinctive and precise.

    However, PG does not support syntax extension, so this approach would require working around that. Fundamentally we'd probably need a very thin shim process in front of PostgreSQL that detects and handles PipelineDB DDL statements.

    If it's a CREATE CONTINUOUS VIEW, CREATE CONTINUOUS TRANSFORM, or CREATE STREAM statement, we'd transpile the query into a valid PG query using function calls. Otherwise, the statement would just be quickly passed off to PG.

    Obviously this isn't ideal, but I do see quite a bit of value in keeping our syntax for our primitives. This approach would also incur minimal migration friction.

    1) Map PipelineDB objects to PostgreSQL objects

    Streams are already foreign tables internally, so if we simply removed CREATE STREAM syntax and required that streams be created with the foreign table interface, we'd be able to detect when queries are reading from a stream. That would allow us to leverage existing syntax to create continuous views and transforms.

    For example, you could create a stream like so:

    CREATE FOREIGN TABLE stream (x integer) SERVER pipelinedb_stream;
    

    And perhaps use CREATE MATERIALIZED VIEW syntax to create a continuous view:

    CREATE MATERIALIZED VIEW continuous_view AS SELECT count(*) FROM stream;
    

    So we'd just detect the stream in the FROM clause, and create a continuous view accordingly. Similarly, for transforms we could use a VIEW over a stream:

    CREATE VIEW continuous_transform AS SELECT transform(x) FROM stream;
    

    These objects map relatively intuitively to PipelineDB objects. The downside is that their behavior is quite a bit different: The "materialized view" in this case would be automatically updating, whereas actual materialized views are updated via REFRESH. And the "view" would be writing rows out to another stream (its output stream), which is kind of weird for a view.
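
    For reference, the example near the top of this README shows roughly where 1.0 landed: streams are foreign tables using the pipelinedb server, and continuous queries are plain views created with an action option. A sketch (the action=transform form is recalled from the 1.0 docs and should be treated as an assumption):

    CREATE FOREIGN TABLE stream (x integer) SERVER pipelinedb;

    CREATE VIEW continuous_view WITH (action=materialize) AS
      SELECT count(*) FROM stream;

    CREATE VIEW continuous_transform WITH (action=transform) AS
      SELECT transform(x) FROM stream;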

    2) Use functions for everything

    We could just encapsulate all PipelineDB DDL functionality in function calls. This is what approach 0) would do internally. For example,

    -- CREATE STREAM
    SELECT pipelinedb.create_stream('stream_name', '(x integer, y integer)');
    
    -- CREATE CONTINUOUS VIEW
    SELECT pipelinedb.create_continuous_view('cv_name',
      'SELECT x, count(*) FROM stream GROUP BY x');
    
    -- CREATE CONTINUOUS TRANSFORM
    SELECT pipelinedb.create_continuous_transform('xform_name',
      'SELECT transform(x) FROM stream');
    

    This is probably the most straightforward approach, but does feel a bit lacking. It feels weird to be writing SQL within strings, rather than being able to leverage things like syntax highlighting.


    There may be fundamentally better approaches, or perhaps the best one is some combination of the above. We'd love to get some user feedback here and have a discussion about what the best design looks like. Getting this interface just right as we become an extension is the most important part of the remaining work.

    question 
    opened by derekjn 36
  • High-fanout TRANSFORM stall

    With the following I can consistently crash Pipeline:

    DROP STREAM IF EXISTS s CASCADE;
    CREATE STREAM s (json jsonb);
    
    CREATE CONTINUOUS TRANSFORM t AS
      SELECT json, interval, resource
      FROM s
        CROSS JOIN generate_series(1, 4) AS x (interval)
        CROSS JOIN jsonb_array_elements(json->'foo') AS y (resource);
    
    CREATE CONTINUOUS VIEW v AS
      SELECT 1 FROM output_of('t');
    
    -- generate some json and insert it into the stream
    -- INSERT INTO s (json) SELECT json
    --   FROM generate_series(1, 100) AS row_count
    --     CROSS JOIN LATERAL (
    --       SELECT row_count, jsonb_build_object('foo', jsonb_agg(el)) AS json
    --       FROM generate_series(1, 100) AS n
    --        CROSS JOIN LATERAL (
    --         SELECT n, jsonb_object_agg(key, value) AS el
    --         FROM generate_series(1, 3) AS y
    --         CROSS JOIN LATERAL (
    --           SELECT md5((random() + y)::text) AS key, string_agg(md5(random()::text), '') AS value
    --           FROM generate_series(1, 400) AS n
    --         ) x
    --       ) x
    --     ) x;
    
    COPY (SELECT json
      FROM generate_series(1, 100) AS row_count
        CROSS JOIN LATERAL (
          SELECT row_count, jsonb_build_object('foo', jsonb_agg(el)) AS json
          FROM generate_series(1, 100) AS n
           CROSS JOIN LATERAL (
            SELECT n, jsonb_object_agg(key, value) AS el
            FROM generate_series(1, 3) AS y
            CROSS JOIN LATERAL (
              SELECT md5((random() + y)::text) AS key, string_agg(md5(random()::text), '') AS value
              FROM generate_series(1, 400) AS n
            ) x
          ) x
        ) x) TO '/home/pipeline/json';
    
    COPY s (json) FROM '/home/pipeline/json';
    

    Note that the commented out INSERT will also crash the database.

    The following, which is the equivalent of the above CV without the TRANSFORM, does not crash:

    DROP STREAM IF EXISTS s CASCADE;
    CREATE STREAM s (json jsonb);
    
    CREATE CONTINUOUS VIEW v AS
      SELECT 1
      FROM s
        CROSS JOIN generate_series(1, 4) AS x (interval)
        CROSS JOIN jsonb_array_elements(json->'foo') AS y (resource);
    
    -- generate some json and insert it into the stream
    -- INSERT INTO s (json) SELECT json
    --   FROM generate_series(1, 100) AS row_count
    --     CROSS JOIN LATERAL (
    --       SELECT row_count, jsonb_build_object('foo', jsonb_agg(el)) AS json
    --       FROM generate_series(1, 100) AS n
    --        CROSS JOIN LATERAL (
    --         SELECT n, jsonb_object_agg(key, value) AS el
    --         FROM generate_series(1, 3) AS y
    --         CROSS JOIN LATERAL (
    --           SELECT md5((random() + y)::text) AS key, string_agg(md5(random()::text), '') AS value
    --           FROM generate_series(1, 400) AS n
    --         ) x
    --       ) x
    --     ) x;
        
    COPY (SELECT json
      FROM generate_series(1, 100) AS row_count
        CROSS JOIN LATERAL (
          SELECT row_count, jsonb_build_object('foo', jsonb_agg(el)) AS json
          FROM generate_series(1, 100) AS n
           CROSS JOIN LATERAL (
            SELECT n, jsonb_object_agg(key, value) AS el
            FROM generate_series(1, 3) AS y
            CROSS JOIN LATERAL (
              SELECT md5((random() + y)::text) AS key, string_agg(md5(random()::text), '') AS value
              FROM generate_series(1, 400) AS n
            ) x
          ) x
        ) x) TO '/home/pipeline/json';
    
    COPY s (json) FROM '/home/pipeline/json';
    

    There seems to be a second bug: in the second code snippet, using the INSERT causes a deadlock. I can create a second issue for that if you like.

    If it makes any difference to repro, I am running in a docker container with 8 combiners and 8 workers, 768MB of shmem. Let me know if you need any other details.

    bug deadlock 
    opened by EliSnow 32
  • Background invalidation of sliding windows

    pgload01g/test M # select pipeline_version();
                                                                              pipeline_version
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------
     PipelineDB 0.8.0 at revision bc8c4e3decbcc219689330987ea46ddb6b3edd34 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit
    (1 row)
    
    Time: 0.298 ms
    pgload01g/test M # \dS+ recent_sessions
                           View "public.recent_sessions"
       Column   |           Type           | Modifiers | Storage  | Description
    ------------+--------------------------+-----------+----------+-------------
     uid        | bigint                   |           | plain    |
     session_id | text                     |           | extended |
     last_ts    | timestamp with time zone |           | plain    |
    View definition:
     SELECT uid,
        session_id,
        max(ts) AS last_ts
       FROM ONLY stream1
      WHERE arrival_timestamp > (clock_timestamp() - '00:15:00'::interval) AND session_id IS NOT NULL
      GROUP BY uid, session_id;
    
    pgload01g/test M # \dS+ recent_sessions_mrel0
                               Table "public.recent_sessions_mrel0"
       Column   |           Type           | Modifiers | Storage  | Stats target | Description
    ------------+--------------------------+-----------+----------+--------------+-------------
     _0         | timestamp with time zone |           | plain    |              |
     uid        | bigint                   |           | plain    |              |
     session_id | text                     |           | extended |              |
     last_ts    | timestamp with time zone |           | plain    |              |
    Indexes:
        "recent_sessions_mrel0_expr_idx" btree (ls_hash_group(_0, uid, session_id))
        "recent_sessions_mrel0_uid_idx" btree (uid)
    Options: fillfactor=50
    
    pgload01g/test M # select count(*) from recent_sessions;
     count
    -------
         0
    (1 row)
    
    Time: 1228.627 ms
    pgload01g/test M # select count(*) from recent_sessions_mrel0 ;
      count
    ---------
     5000008
    (1 row)
    
    Time: 719.131 ms
    pgload01g/test M # select clock_timestamp(), min(last_ts), clock_timestamp() - min(last_ts) as diff from recent_sessions_mrel0 ;
            clock_timestamp        |             min              |      diff
    -------------------------------+------------------------------+-----------------
     2015-08-24 11:44:31.624587+03 | 2015-08-24 10:29:28.20038+03 | 01:15:03.424211
    (1 row)
    
    Time: 914.546 ms
    pgload01g/test M #
    

    Am I right that these tuples should have been removed an hour ago? Is there a way to trigger such a process manually?

    bug vacuum sprint 
    opened by man-brain 30
  • 0.8.4 results in start-up error

    We are trying to build a new Docker image for the latest PipelineDB 0.8.4 version, but we experience the following error:

    LOG: starting background worker process "worker0 [pipeline]"
    LOG: continuous query process "worker0 [pipeline]" running with pid 63
    LOG: continuous query scheduler process (PID 62) was terminated by signal 7: Bus error
    LOG: terminating any other active server processes
    LOG: unregistering background worker "worker0 [pipeline]"
    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
    HINT: In a moment you should be able to reconnect to the database and repeat your command.
    FATAL: the database system is in recovery mode
    FATAL: the database system is in recovery mode
    FATAL: the database system is in recovery mode
    FATAL: the database system is in recovery mode
    FATAL: the database system is in recovery mode
    LOG: all server processes terminated; reinitializing

    We built the image from an ubuntu14.04 debootstrap base, added the libxml2 library, and used the .deb file from pipelinedb.com.

    bug packaging docker 
    opened by Degola 18
  • Factor out into a standard PostgreSQL extension

    Our interfaces and core functionality are stable and well-defined enough at this point that we can cleanly formulate our API in terms of the extension API. The main areas of work here are:

    Syntax - Grammar is not extensible, so we lose CREATE CONTINUOUS VIEW et al. They become SELECT pipelinedb.create_cv('...') etc.

    Syscache lookups - We'll need to write fast lookup support around our catalog tables since they won't be native catalogs anymore.

    Planner/executor - CQ node types will need to be handled via hooks instead of natively.

    Dump/restore - We'll need to make dump/restore work with vanilla pg_dump/pg_restore.

    This refactor should probably happen in a few phases:

    • Phase 0: get the diff against upstream Postgres down to an absolute minimum
    • Phase 1: factor PipelineDB out as an extension of itself
    • Phase 2: stabilize the extension against upstream Postgres, possibly landing some changes upstream where necessary
    opened by derekjn 17
  • first_values ordered set aggregate

    I would like the ability to get the top n ranked rows in a CV. In vanilla PostgreSQL this is accomplished with:

    CREATE TEMPORARY TABLE s (name text, part text, value int);
    CREATE OR REPLACE VIEW v AS
    SELECT
      name,
      part,
      value,
      rank
      FROM (
        SELECT
          name,
          part,
          value,
          dense_rank() OVER (PARTITION BY name, part ORDER BY value) AS rank
        FROM s) AS x
      WHERE rank <= 3;
    
    INSERT INTO s VALUES ('foo', 'stuff', 7), ('foo', 'blah', 18), ('foo', 'jar', 8),
      ('foo', 'stuff', 123), ('foo', 'blah', 6), ('foo', 'jar', 199),
      ('foo', 'stuff', 63), ('foo', 'blah', 63), ('foo', 'jar', 9),
      ('foo', 'stuff', 44), ('foo', 'blah', 26), ('foo', 'jar', 909),
      ('foo', 'stuff', 9), ('foo', 'blah', 61), ('foo', 'jar', 2),
      ('foo', 'stuff', 1), ('foo', 'blah', 996), ('foo', 'jar', 0);
    
    SELECT * FROM v;
    

    PipelineDB does not allow window functions in a subquery and reports:

    subqueries in continuous views cannot contain window functions

    Perhaps PipelineDB can allow certain window functions in a subquery.
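
    Purely as an illustration of the request (not an existing PipelineDB aggregate; the name and syntax here are hypothetical, and s is assumed to be a stream), the desired continuous view might look something like:

    CREATE CONTINUOUS VIEW top3_per_part AS
      SELECT name::text, part::text,
             first_values(3) WITHIN GROUP (ORDER BY value::int) AS top_values
      FROM s
      GROUP BY name, part;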

    opened by EliSnow 17
  • Segfault joining stream with partitioned table

    The background worker segfaults when a continuous view tries to join a stream value to a record of a partitioned table.

    The users table is partitioned with PostgreSQL 10 declarative list partitioning:

    CREATE FOREIGN TABLE ping_stream (
      user_id UUID,
      session_id TEXT,
      session_start_time TIMESTAMPTZ,
      organization_id UUID
    ) SERVER pipelinedb;

    CREATE VIEW online_users AS
      SELECT user_id, arrival_timestamp, name
      FROM ping_stream
      INNER JOIN users ON users.id = ping_stream.user_id;
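
    (The report doesn't include the users table definition; a minimal PostgreSQL 10 declarative list-partitioned table matching the description might look like the following, which is hypothetical and not from the original issue.)

    CREATE TABLE users (
      id              UUID NOT NULL,
      name            TEXT,
      organization_id UUID NOT NULL
    ) PARTITION BY LIST (organization_id);

    CREATE TABLE users_org_1 PARTITION OF users
      FOR VALUES IN ('00000000-0000-0000-0000-000000000001');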
    
                          postgres | Segmentation fault (PID 1613)
                          postgres | PostgreSQL version: 10.5 (Debian 10.5-2.pgdg90+1)
                          postgres | PipelineDB version: 1.0.0 at revision 12a01a3107dfac74b33e7ae3086a44f6d05edfab
                          postgres | query: online_users
                          postgres | backtrace:
                          postgres | /usr/lib/postgresql/10/lib/pipelinedb.so(debug_segfault+0x33)[0x7f8b25a2d193]
                          postgres | /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f8b2de6b0c0]
                          postgres | postgres: bgworker: worker0 [help]   (ExecLockNonLeafAppendTables+0x3a)[0x559e7d56213a]
                          postgres | postgres: bgworker: worker0 [help]   (ExecInitAppend+0x44)[0x559e7d565084]
                          postgres | postgres: bgworker: worker0 [help]   (ExecInitNode+0x3cd)[0x559e7d55defd]
                          postgres | postgres: bgworker: worker0 [help]   (ExecInitHashJoin+0xab)[0x559e7d56eceb]
                          postgres | postgres: bgworker: worker0 [help]   (ExecInitNode+0x1d3)[0x559e7d55dd03]
                          postgres | /usr/lib/postgresql/10/lib/pipelinedb.so(+0x5282f)[0x7f8b259f682f]
                          postgres | /usr/lib/postgresql/10/lib/pipelinedb.so(ContinuousQueryWorkerMain+0x235)[0x7f8b259f6d25]
                          postgres | /usr/lib/postgresql/10/lib/pipelinedb.so(cont_bgworker_main+0x220)[0x7f8b25a2de70]
                          postgres | postgres: bgworker: worker0 [help]   (StartBackgroundWorker+0x2cc)[0x559e7d61570c]
                          postgres | postgres: bgworker: worker0 [help]   (+0x30d445)[0x559e7d622445]
                          postgres | postgres: bgworker: worker0 [help]   (+0x30e005)[0x559e7d623005]
                          postgres | /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f8b2de6b0c0]
                          postgres | /lib/x86_64-linux-gnu/libc.so.6(__select+0x13)[0x7f8b2bb003a3]
                          postgres | postgres: bgworker: worker0 [help]   (+0xb2f75)[0x559e7d3c7f75]
                          postgres | postgres: bgworker: worker0 [help]   (PostmasterMain+0xfea)[0x559e7d62434a]
                          postgres | postgres: bgworker: worker0 [help]   (main+0x854)[0x559e7d3c9f74]
                          postgres | /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f8b2ba3f2e1]
                          postgres | postgres: bgworker: worker0 [help]   (_start+0x2a)[0x559e7d3ca02a]
                          postgres | 2018-12-15 05:17:48.475 UTC [1] LOG:  worker process: worker0 [help] (PID 1613) exited with exit code 1
    
    bug 
    opened by esatterwhite 16
  • Continuous SQL trigger does not fire

    I am trying to use a trigger on a continuous view to update rows in another table. It does not seem to fire. Here is the code (following the pattern of #1552):

    CREATE CONTINUOUS VIEW cv_demo AS SELECT x::int, count(DISTINCT y::int) FROM stream GROUP BY x;

    CREATE TABLE t_demo (id int, c int);

    INSERT INTO t_demo VALUES (1,0), (2, 0), (3, 0);

    CREATE OR REPLACE FUNCTION p_demo( ) RETURNS TRIGGER AS $$ BEGIN UPDATE t_demo set c = c + 1 where id = 1; RETURN NULL; END; $$ LANGUAGE plpgsql;

    CREATE TRIGGER tg_demo AFTER INSERT ON cv_demo FOR EACH ROW EXECUTE PROCEDURE p_demo();

    --==============
    INSERT INTO stream (x, y) SELECT x % 100 AS x, random() * 1000000 AS y FROM generate_series(1, 10000000) x;
    --==============

    SELECT id,c FROM t_demo;

     id | c
    ----+---
      1 | 0
      2 | 0
      3 | 0
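
    (As the output streams label suggests, the usual alternative to triggers on continuous views is a continuous transform reading the view's output stream. A rough sketch using the pre-1.0 syntax from memory; the THEN EXECUTE PROCEDURE clause and the (new) record syntax should be treated as assumptions.)

    CREATE CONTINUOUS TRANSFORM tg_demo_xform AS
      SELECT (new).x, (new).count FROM output_of('cv_demo')
      THEN EXECUTE PROCEDURE p_demo();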

    output streams 
    opened by vryzhov 16
  • Improved image for Docker Hub

    Should resolve #1578. Image size: 237.1 MB

    It behaves very similarly to the official Postgres image: https://hub.docker.com/_/postgres/ https://github.com/docker-library/postgres/tree/e4942cb0f79b61024963dc0ac196375b26fa60dd/9.6

    I added an easier way to bind a custom pipelinedb.conf into the image. You can test the image with the following commands:

    docker-compose -f pkg/docker/hub/tests/.yml build
    docker-compose -f pkg/docker/hub/tests/.yml up
    docker-compose -f pkg/docker/hub/tests/.yml rm -vf

    E.g.:

    docker-compose -f pkg/docker/hub/tests/custom_init.yml build
    docker-compose -f pkg/docker/hub/tests/custom_init.yml up
    docker-compose -f pkg/docker/hub/tests/custom_init.yml rm -vf

    Please note that you need a newer version of docker-compose (for version 2 syntax).

    I'd love to see an Alpine Linux based image in the future as they are much smaller (Postgres: 264MB [official] vs 31MB [Alpine based]). As noted in #1578, one needs to patch a tiny glibc dependency (Alpine ships with musl).

    opened by ankoh 15
  • Memory corruption with Pipeline 0.9.3

    Based on log messages, there seems to be a memory corruption issue with Pipeline 0.9.3 and pipeline_kafka 0.9.3.

    I am running Pipeline within a docker container running Debian 8 on a CentOS 6 host. I have had no issues with Pipeline 0.9.1, and I couldn't really test it with 0.9.2 due to the file descriptor leak.

    Just before the corruption, there was a series of disconnect messages. I've seen them before, and these disconnects haven't been an issue with Pipeline 0.9.1.

    After the memory corruption message, PipelineDB tries to restart the Kafka consumer and fails to do so.

    We have 100 partitions for that particular Kafka topic.
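
    (For context, pipeline_kafka consumers are typically started with something like the following; the broker, topic, and stream names are placeholders, and the parallelism parameter name is recalled from the pipeline_kafka docs and should be treated as an assumption.)

    SELECT pipeline_kafka.add_broker('kafka-broker:9092');
    SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', parallelism := 8);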

    Here is the relevant log message:

    2016-06-03 23:05:37 EDTLOG:  [pipeline_kafka consumer (208)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:30 EDTLOG:  [pipeline_kafka consumer (210)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:30 EDTLOG:  [pipeline_kafka consumer (204)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:31 EDTLOG:  [pipeline_kafka consumer (207)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:32 EDTLOG:  [pipeline_kafka consumer (214)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:32 EDTLOG:  [pipeline_kafka consumer (216)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:33 EDTLOG:  [pipeline_kafka consumer (227)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:34 EDTLOG:  [pipeline_kafka consumer (215)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:34 EDTLOG:  [pipeline_kafka consumer (206)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:15:36 EDTLOG:  [pipeline_kafka consumer (205)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-03 23:25:29 EDTLOG:  [pipeline_kafka consumer (231)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:25:31 EDTLOG:  [pipeline_kafka consumer (209)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-03 23:25:32 EDTLOG:  [pipeline_kafka consumer (212)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-03 23:25:32 EDTLOG:  [pipeline_kafka consumer (213)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:25:33 EDTLOG:  [pipeline_kafka consumer (211)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:25:33 EDTLOG:  [pipeline_kafka consumer (219)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:25:34 EDTLOG:  [pipeline_kafka consumer (225)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:25:39 EDTLOG:  [pipeline_kafka consumer (221)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:25:41 EDTLOG:  [pipeline_kafka consumer (220)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:25:57 EDTLOG:  [pipeline_kafka consumer (217)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-03 23:35:30 EDTLOG:  [pipeline_kafka consumer (218)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:35:32 EDTLOG:  [pipeline_kafka consumer (229)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-03 23:35:38 EDTLOG:  [pipeline_kafka consumer (232)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-03 23:45:30 EDTLOG:  [pipeline_kafka consumer (230)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-03 23:45:30 EDTLOG:  [pipeline_kafka consumer (233)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-04 00:45:31 EDTLOG:  [pipeline_kafka consumer (208)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-04 00:45:35 EDTLOG:  [pipeline_kafka consumer (210)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-04 00:45:37 EDTLOG:  [pipeline_kafka consumer (215)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-04 01:05:31 EDTLOG:  [pipeline_kafka consumer (219)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-04 01:05:39 EDTLOG:  [pipeline_kafka consumer (218)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-04 01:15:35 EDTLOG:  [pipeline_kafka consumer (233)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-04 01:35:37 EDTLOG:  [pipeline_kafka consumer (204)]: chrlnc01-ci-bgd0007d1.conops.timewarnercable.com:9092/3: Receive failed: Disconnected
    2016-06-04 01:45:32 EDTLOG:  [pipeline_kafka consumer (209)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-04 01:45:45 EDTLOG:  [pipeline_kafka consumer (212)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-04 01:55:32 EDTLOG:  [pipeline_kafka consumer (217)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-04 01:55:39 EDTLOG:  [pipeline_kafka consumer (229)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    2016-06-04 01:55:49 EDTLOG:  [pipeline_kafka consumer (230)]: 69.134.66.17:9092/bootstrap: Receive failed: Disconnected
    *** Error in `pipelinedb: bgworker: combiner0 [pipeline]   ': malloc(): memory corruption: 0x00000000016d1ea0 ***
    2016-06-04 01:58:25 EDTLOG:  worker process: combiner0 [pipeline] (PID 176) was terminated by signal 6: Aborted
    2016-06-04 01:58:25 EDTDETAIL:  Failed process was running: combiner0 [pipeline]
    2016-06-04 01:58:25 EDTLOG:  terminating any other active server processes
    2016-06-04 01:58:26 EDTWARNING:  terminating connection because of crash of another server process
    2016-06-04 01:58:26 EDTDETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
    2016-06-04 01:58:26 EDTHINT:  In a moment you should be able to reconnect to the database and repeat your command.
    
    bug unbreak now! 
    opened by sutram 14
  • Incredible bug

    Hello, I have been using the PipelineDB plugin for a while, but I recently encountered a very strange phenomenon. As part of a process I need to repeatedly drop and recreate the SQL objects that PipelineDB creates. After doing so, I found that the foreign table could no longer read data from Kafka or accept inserted data, and even dropping the extension and creating it again did not help. Only by creating a new PostgreSQL instance could it be restored. Looking forward to your answer, thank you.

    opened by GreenplumDB 0
  • Question/Maintenance

    I used this several years ago, maybe 2017 or so, before it was an extension, IIRC. I have a desire for continuous queries on PostgreSQL again, so I have been looking around.

    I have experience working on the OpenTSDB project. Would there be any interest in someone taking over maintenance and potentially improving or continuing this project? Or is there a better alternative for PostgreSQL that would render this unnecessary?

    opened by johann8384 2
  • Can we please add a default column called time just like the one called arrival_timestamp so it can be visualized using Grafana as Timeseries? Thanks

    If there were a time column alongside arrival_timestamp, the data could be visualized as a time series in Grafana instead of as a table view.
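
    (One possible workaround, sketched here as an assumption rather than a confirmed answer: alias a time bucket as "time" in the continuous view itself. some_stream below is a placeholder.)

    CREATE VIEW metrics_per_minute WITH (action=materialize) AS
      SELECT date_trunc('minute', arrival_timestamp) AS "time", count(*)
      FROM some_stream
      GROUP BY "time";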

    Thanks

    opened by mostafaahamidmanon 0
  • makefile: question about SHLIB_LINK

    @usmanm Hi, I have a problem. In the Makefile, the zmq static library is referenced by an absolute path, which raises some doubts about how users are expected to compile. Installing zmq with yum puts it under /usr/local/ automatically. So why doesn't SHLIB_LINK just reference the dynamic library and let the linker search for it, e.g. SHLIB_LINK += -lzmq -lstdc++?

    opened by vagabond1132 0