Parallel, indexed xz compressor

Overview

pixz

Pixz (pronounced pixie) is a parallel, indexing version of xz.

Repository: https://github.com/vasi/pixz

Downloads: https://github.com/vasi/pixz/releases

pixz vs xz

The existing XZ Utils provide great compression in the .xz file format, but they produce just one big block of compressed data. Pixz instead produces a collection of smaller blocks, which makes random access to the original data possible. This is especially useful for large tarballs.
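To see the block structure for yourself, the sketch below counts the blocks in a .xz-format file using xz's machine-readable listing (`xz --robot --list`). The helper name `xz_block_count` is made up for this illustration, and it assumes xz and awk are installed.

```shell
#!/bin/sh
# Print the number of xz blocks in a .xz-format file.
# In "xz --robot --list" output, the line starting with "file" carries the
# stream count in column 2 and the total block count in column 3.
xz_block_count() {
    xz --robot --list "$1" | awk -F '\t' '$1 == "file" { print $3 }'
}
```

Running `xz_block_count` on the same data compressed once with xz and once with pixz should show the difference in block counts, though the exact numbers depend on input size and settings.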

Differences from xz

  • pixz automatically indexes tarballs during compression
  • pixz supports parallel decompression, which xz does not
  • pixz defaults to using all available CPU cores, while xz defaults to using only one core
  • pixz provides -i and -o command line options to specify input and output files
  • pixz does not support the command line option -z or --compress
  • pixz does not support the command line option -c or --stdout
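  • -f has a different meaning than in xz, where it means --force
  • -l produces different output than xz's --list
  • -q has a different meaning than in xz, where it means --quiet
  • -t has a different meaning than in xz, where it means --test; in pixz it disables tarball handling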

Building pixz

To see the options available for the configuration step, run:

./configure --help

Dependencies

  • pthreads
  • liblzma 4.999.9-beta-212 or later (from the xz distribution)
  • libarchive 2.8 or later
  • AsciiDoc to generate the man page
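Before running configure, it can help to check which prerequisites are present. The following is a rough sketch, not part of pixz: the function name `check_build_deps` is hypothetical, and it assumes the libraries ship pkg-config files named `liblzma` and `libarchive`. It does not compare versions against the minimums listed above.

```shell
#!/bin/sh
# Report which build prerequisites appear to be present.
# Prints one line per check: "name: found", "name: missing" or "name: unknown".
check_build_deps() {
    # a2x is the command-line tool shipped with AsciiDoc.
    for tool in cc make a2x; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
    # Library checks rely on pkg-config; without it they are reported as unknown.
    for lib in liblzma libarchive; do
        if ! command -v pkg-config >/dev/null 2>&1; then
            echo "$lib: unknown (pkg-config not installed)"
        elif pkg-config --exists "$lib"; then
            echo "$lib: found ($(pkg-config --modversion "$lib"))"
        else
            echo "$lib: missing"
        fi
    done
}

check_build_deps
```

A "missing" line only means the tool or pkg-config file was not found on this system's search paths; the library may still be installed somewhere configure can be pointed at.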

Build from Release Tarball

./configure
make
make install

You may need sudo permissions to run make install.

Build from GitHub

git clone https://github.com/vasi/pixz.git
cd pixz
./autogen.sh
./configure
make
make install

You may need sudo permissions to run make install.

Usage

Single Files

Compress a single file (no tarball, just compression), multi-core:

pixz bar bar.xz

Decompress it, multi-core:

pixz -d bar.xz bar
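As a sanity check, the two commands above can be combined into a round trip that confirms the decompressed data matches the original. This is a minimal sketch; the function name `pixz_roundtrip` and its return codes are invented for this example.

```shell
#!/bin/sh
# Round-trip a file through pixz and confirm the result is byte-identical.
# Returns 0 on success, 1 on mismatch or failure, 2 if pixz is not installed.
pixz_roundtrip() {
    src=$1
    command -v pixz >/dev/null 2>&1 || { echo "pixz not installed" >&2; return 2; }
    tmpdir=$(mktemp -d) || return 1
    # Input and output names are both given explicitly, and all output
    # goes to a temporary directory that is removed afterwards.
    pixz "$src" "$tmpdir/data.xz" &&
        pixz -d "$tmpdir/data.xz" "$tmpdir/data.out" &&
        cmp -s "$src" "$tmpdir/data.out"
    status=$?
    rm -rf "$tmpdir"
    [ "$status" -eq 0 ] || return 1
}
```

For example, `pixz_roundtrip bar && echo "round trip OK"`.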

Tarballs

Compress and index a tarball, multi-core:

pixz foo.tar foo.tpxz

Very quickly list the contents of the compressed tarball:

pixz -l foo.tpxz
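Because listing is fast, it can be used to check whether a path is in the archive before extracting it. The sketch below assumes the listing prints one path per line; `listing_contains` is a hypothetical helper that reads such a listing on standard input.

```shell
#!/bin/sh
# Succeed if the listing on stdin contains a line exactly equal to the given path.
# -F: match a fixed string, -x: match the whole line, -q: print nothing.
listing_contains() {
    grep -Fxq -- "$1"
}

# Example use (requires pixz and an indexed archive):
#   pixz -l foo.tpxz | listing_contains dir/file && echo "present"
```

Matching whole lines avoids false positives from paths that merely contain the name, such as dir/file.bak.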

Decompress the tarball, multi-core:

pixz -d foo.tpxz foo.tar

Very quickly extract a single file, multi-core. This also verifies that the contents match the index:

pixz -x dir/file < foo.tpxz | tar x

Create a tarball using pixz for multi-core compression:

tar -Ipixz -cf foo.tpxz foo/

Specifying Input and Output

These are equivalent (the same forms also work with -x, -d and -l):

pixz foo.tar foo.tpxz
pixz < foo.tar > foo.tpxz
pixz -i foo.tar -o foo.tpxz

Extract the listed files from foo.tpxz into a new tarball foo.tar:

pixz -x -i foo.tpxz -o foo.tar file1 file2 ...

Compress to foo.tpxz, removing the original:

pixz foo.tar

Decompress to foo.tar, removing the original:

pixz -d foo.tpxz

Other Flags

Faster, worse compression:

pixz -1 foo.tar

Better, slower compression:

pixz -9 foo.tar

Use exactly 2 threads:

pixz -p 2 foo.tar
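Since pixz uses every core by default, a shared machine may benefit from leaving one core free. The following sketch computes such a thread count; the helper name `pixz_threads` is invented here, and it assumes `getconf _NPROCESSORS_ONLN` is available, as it is on Linux and macOS.

```shell
#!/bin/sh
# Print a thread count that leaves one core free, never going below 1.
pixz_threads() {
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cores=1
    case $cores in
        ''|*[!0-9]*) cores=1 ;;   # fall back if the value is not a plain number
    esac
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}

# Example use (requires pixz):
#   pixz -p "$(pixz_threads)" foo.tar
```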

Compress, but do not treat it as a tarball, i.e. do not index it:

pixz -t foo.tar

Decompress, but do not check that contents match index:

pixz -d -t foo.tpxz

List the xz blocks instead of files:

pixz -l -t foo.tpxz

For even more tuning flags, check the manual page:

man pixz

Comparison to other Tools

plzip

  • about equally complex and efficient
  • lzip format seems less-used
  • version 1 is theoretically indexable, I think

ChopZip

  • written in Python, much simpler
  • more flexible, supports arbitrary compression programs
  • uses streams instead of blocks, not indexable
  • splits input and then combines output, much higher disk usage

pxz

  • simpler code
  • uses OpenMP instead of pthreads
  • uses streams instead of blocks, not indexable
  • uses temporary files and does not combine them until the whole file is compressed, high disk and memory usage

pbzip2

  • not indexable
  • appears slow
  • bzip2 algorithm is non-ideal

pigz

  • not indexable

dictzip, idzip

  • not parallel
Comments
  • Crash when using -x option

    Sometimes pixz will segfault when using the -x option to extract individual files. The conditions which trigger the crash are such that there is a small chance of a crash for each file extracted using the -x option. The more files one extracts (using -x), the greater the chances of encountering the crash.

    A more detailed analysis of the bug and a fix can be found in pull request https://github.com/vasi/pixz/pull/105

    opened by usefulcat 0
  • Bug fix for segfault

    In read_thread(), in the "Do we need this block?" block, it failed to advance to the next wanted_t when w->end is exactly equal to uend.

    In the particular case I looked at, this resulted in read_thread() erroneously putting two blocks into the queue (pipeline_split(), read.c:570) instead of one.

    This resulted in a subsequent crash in tar_read() at read.c:660, where gArWanted was null (when clearly the code does not expect it to be null at that point).

    My use case:

    I have several hundred large .tar.xz files (created by pixz). Each archive contains over 200k files. I frequently need to extract a lot of files from each archive, but not all files, only a specific subset. So I am making heavy use of the -x option to pixz.

    opened by usefulcat 0
  • Error decoding stream footer when trying to decompress a 3.1 TiB .tpxz file

    I am trying to decompress a 3.1 TiB .tpxz file that was compressed using 1.0.6 and the error message that I am getting is:

    "Error decoding stream footer"

    What can/should I do as a workaround for this issue so that I can try and get my data back?

    Your help is greatly appreciated.

    Thank you.

    opened by alpha754293 8
  • configure: error: AsciiDoc not found, not able to generate the man page.

    [root@localhost pixz-master]# ./autogen.sh
    configure.ac:47: installing './config.guess'
    configure.ac:47: installing './config.sub'
    configure.ac:11: installing './install-sh'
    configure.ac:11: installing './missing'
    src/Makefile.am: installing './depcomp'
    parallel-tests: installing './test-driver'
    [root@localhost pixz-master]# ./config
    config.guess config.sub configure
    [root@localhost pixz-master]# ./configure
    checking for a BSD-compatible install... /usr/bin/install -c
    checking whether build environment is sane... yes
    checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
    checking for gawk... gawk
    checking whether make sets $(MAKE)... yes
    checking whether make supports nested variables... yes
    checking for style of include used by make... GNU
    checking for gcc... gcc
    checking whether the C compiler works... yes
    checking for C compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C compiler... yes
    checking whether gcc accepts -g... yes
    checking for gcc option to accept ISO C89... none needed
    checking dependency style of gcc... gcc3
    checking for gcc option to accept ISO C99... -std=gnu99
    checking for gcc -std=gnu99 option to accept ISO Standard C... (cached) -std=gnu99
    checking for a2x... no
    configure: error: AsciiDoc not found, not able to generate the man page.
    [root@localhost pixz-master]#

    Can I help you?

    opened by chenliangfei 3
  • Syntax for converting existing tar.xz archive to indexed pixz file?

    Just running pixz on an existing tar file creates a tpxz file, but when I then extract from a piped file, I get a warning about seeking, which I presume implies that an index was not created.

    E.g.

    cd /tmp
    curl -OLf https://gitlab.com/api/v4/projects/26210301/packages/generic/glibc/2.27_x86_64/glibc-2.27-chromeos-x86_64.tar.xz
    xz -d glibc-2.27-chromeos-x86_64.tar.xz
    pixz -9 glibc-2.27-chromeos-x86_64.tar 
    curl -Ls file:///tmp/glibc-2.27-chromeos-x86_64.tpxz  |  pixz -x usr/local/lib64/Scrt1.o usr/local/lib64/crti.o usr/local/lib64/crtn.o | tar x
    can not seek in input: Illegal seek
    

    (Here I get everything extracted, instead of just the files I want. The goal is to later just extract the files directly piped from a curl download.)

    opened by satmandu 1
Releases: v1.0.7
Owner: Dave Vasilevsky