Demo of a fast PNG encoder.

Overview

Fast PNG Encoder

This is a proof-of-concept fast PNG encoder that uses AVX2 and a special Huffman table to encode images faster. Speed on a single core is anywhere from 180 to 800 MP/s on a Threadripper 3970x, depending on compile time settings and content.

At the moment, only RGB(A) input is supported.

Comments
  • Add heuristics to speed up predictor search.

    Add heuristics to speed up predictor search.

    @hohMiyazawa could you test on your image set how well this change does?

    In particular, it'd be useful to compare with yours, and with just always using Paeth (i.e. compiling with -DFPNGE_FIXED_PREDICTOR=4).

    terminal:
    
    base:
       728.911 MP/s
         1.178 bits/pixel
    
    new:
       935.213 MP/s
         1.224 bits/pixel
    
    FP4:
      1157.451 MP/s
         1.250 bits/pixel
    
    FP2:
      1736.122 MP/s
         1.554 bits/pixel
    
    flowchart:
    
    base:
       960.315 MP/s
         0.807 bits/pixel
    
    new:
      1476.482 MP/s
         0.903 bits/pixel
    
    F4:
      1631.139 MP/s
         0.859 bits/pixel
    
    F2:
      2528.175 MP/s
         1.214 bits/pixel
    
    flower:
    
    base:
       446.234 MP/s
        11.051 bits/pixel
    
    new:
       622.776 MP/s
        11.052 bits/pixel
    
    F4:
       627.288 MP/s
        11.052 bits/pixel
    
    F2:
       689.025 MP/s
        12.908 bits/pixel
    
    
    opened by veluca93 24
  • Filter search short circuiting

    Filter search short circuiting

    In some images (mainly photographic content), considering other filters than paeth is futile.

    I therefore tested a simple heuristic: Group scanlines into groups of 8. If the filter search selects paeth for all of the first 3, the remaining five get paeth applied as well without other filters considered.

    This reaps most of the benefits of optimal filter selection, at the cost of half of fpnge4's speed advantage over fpnge.

    Sorted file sizes for a corpus of images, both photographic and synthetic images:

    heuristics_2

    While this is just a quick attempt, I think it shows that trying to achieve near optimal filter selection without doing a full filter search is both doable and worthwhile.

    The opposite case, getting rid of paeth calculations in images where paeth doesn't perform well may also yield some gains. (flat areas?)

    opened by hohMiyazawa 12
  • Add targeting for SSE4.1

    Add targeting for SSE4.1

    Not sure if this interests you at all (if not, feel free to close this request), but I added support for targeting SSE4.1 in addition to AVX2, so that it can run on more CPUs.
    There's no dynamic dispatch - it's a compile time switch only for now, so the build script uses -march=native instead of specifically targeting AVX2.

    Currently this will compile to work on CPUs with SSE4.1+POPCNT+PCLMUL minimum. The POPCNT requirement should be easy to remove, and I might include an alternative CRC32 implementation to remove the PCLMUL requirement.

    Most of this was a quick find & replace style modification, so hopefully I got everything right.
    Feel free to let me know what you think, and thanks for this awesome project!

    opened by animetosho 11
  • Re-use Paeth predicted data from TryPredictor

    Re-use Paeth predicted data from TryPredictor

    Most of PNG's predictors are fairly trivial to compute, however Paeth is relatively slow. It also happens to be the most often chosen predictor. This results in Paeth being computed twice for many rows.

    The idea of this change is to save the Paeth predicted data during TryPredictor, then re-use this during actual encoding.
    It's likely not worth doing this for the other predictors, due to being fast to compute and being less likely chosen.

    Comparison on a 12700K:

    Old code - image 1
       378.738 MP/s
        10.787 bits/pixel
    Old code - image 2
       414.767 MP/s
        16.240 bits/pixel
    
    New code - image 1
       447.864 MP/s
        10.787 bits/pixel
    New code - image 2
       457.003 MP/s
        16.240 bits/pixel
    

    CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

    opened by animetosho 7
  • Add speed/compression runtime option

    Add speed/compression runtime option

    As discussed, this removes the defines for speed control, moving it into a runtime option.

    This is done via a new parameter, which accepts an options struct. The struct can be filled with a function, given a compression level, allowing for both a (relatively) simple 'level' setting, and being able to configure behaviour more precisely. NULL can be passed as the new parameter to indicate that defaults should be used.
    The console application now accepts a -1..-5 flag for the compression level.

    CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

    opened by animetosho 5
  • Appears to not handle 16-bit RGB images

    Appears to not handle 16-bit RGB images

    Output image is 8bit from 16bit image input according to file and ImageMagick compare.

    Steps to reproduce:

    git clone <this repo>
    ./build.sh
    # Check inputs
    file 8b.png 
    # 8b.png: PNG image data, 1920 x 1200, 8-bit/color RGB, non-interlaced
    file 12b.png
    # 12b.png: PNG image data, 1920 x 1200, 16-bit/color RGB, non-interlaced
    
    # Convert images
    ./build/fpnge 8b.png 8b-out.png 100
    ./build/fpnge 12b.png 12b-out.png 100
    
    # Compare:
    compare -metric AE 8b.png 8b-out.png compare.png  # gives 0
    compare -metric AE 12b.png 12b-out.png compare.png  # Gives a very very large number (2.304e+06)
    
    file 12b-out.png 
    # 12b-out.png: PNG image data, 1920 x 1200, 8-bit/color RGB, non-interlaced
    

    Note: the 12b-out.png claims to be 8-bit/color. Interestingly opencv reads this still as 16-bit, but wrongly.

    Extra Info:

    CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz OS: Debian GNU/Linux 11 (bullseye) (in docker) Compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110 g++ (Debian 10.2.1-6) 10.2.1 20210110

    Extra extra

    I stumbled on this via the python binding here. Below is an example out showing the errors:

    blank_image = np.random.randint(0, high=np.iinfo(np.uint16).max, size=(1,2,3), dtype=np.uint16)
    image_bytes = fpnge.binding.encode_view(blank_image.data, blank_image.shape[1], blank_image.shape[0], blank_image.shape[2], blank_image.dtype.itemsize * 8)
    image_check: cv2.Mat = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
    assert image_check.dtype == blank_image.dtype
    assert image_check.shape == blank_image.shape
    assert (image_check == blank_image).all()
    # If you print these:
    pprint.pprint(image_check)
    array([[[51929, 35260,  2880],
            [ 6872, 23392, 15276]]], dtype=uint16)
    pprint.pprint(blank_image)
    array([[[16395, 48265, 55754],
            [44091, 24667, 55322]]], dtype=uint16)
    

    Images: 8b 12b

    opened by nznobody 4
  • Assertion failed in HuffmanTable::FillBits with certain images

    Assertion failed in HuffmanTable::FillBits with certain images

    I've discovered some images which fail the assert in HuffmanTable::FillBits. I've been able to reproduce the issue on GCC and MSVC, samples loaded with lodepng or stb_image. Failure happens when trying to encode images in RGB format.

    Sample images, including raw pixel data at 3 channel RGB: sample_images.zip

    Image 1:

    • width: 256
    • height: 256
    • num_channels: 3
    • bytes_per_channel: 1
    • FPNGE_COMPRESS_LEVEL_BEST
    • FPNGE_CICP_NONE img1

    Image 2:

    • width: 512
    • height: 512
    • num_channels: 3
    • bytes_per_channel: 1
    • FPNGE_COMPRESS_LEVEL_BEST/FPNGE_COMPRESS_LEVEL_DEFAULT/3/2/1 (failing all compression levels, libpng encodes this image without issue)
    • FPNGE_CICP_NONE sample0
    opened by mwilsnd 3
  • Add fast approximation for predictor search

    Add fast approximation for predictor search

    From prior discussions: adds FPNGE_APPROX_PREDICTOR switch to enable a fast approximation for predictor search. Benchmarks and effect on compression copied from previous thread:

    --- speedup% --- size%
    aggregate	+35.79	+0.12
    photographic	+36.62	+0.02
    synthetic	+35.97	+0.22
    

    This doesn't really eliminate the other possible heuristics for predictor search that were discussed, as it doesn't attempt to skip it.

    CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

    opened by animetosho 3
  • Remove PCLMUL requirement for CRC computation

    Remove PCLMUL requirement for CRC computation

    This reduces the base ISA requirement to just SSE4.1 (reducing this to SSSE3 shouldn't be too hard, but not something I need).
    CRC32 is done via slice-by-8 algorithm if PCLMUL isn't available.

    This does also change the CLMUL implementation so that it doesn't do final reduction until the hash is needed.

    CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

    opened by animetosho 3
  • Encode speed optimizations (2)

    Encode speed optimizations (2)

    Second batch of optimizations. These shouldn't affect the output in any way.

    Most of this is an implementation of WriteBits using the PEXT instruction, which is disabled on CPUs with a slow implementation of it (AMD CPUs before Zen3).

    Comparison on a 12700K:

    Old code - image 1
       295.585 MP/s
        10.787 bits/pixel
    Old code - image 2
       384.460 MP/s
        16.240 bits/pixel
    
    New code - image 1
       302.302 MP/s
        10.787 bits/pixel
    New code - image 2
       397.900 MP/s
        16.240 bits/pixel
    

    CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

    opened by animetosho 3
  • Remove mask/adler buffers

    Remove mask/adler buffers

    Not sure if these buffers originally had different intended purposes, but removing them reduces cache usage, and simplifies a bunch of code.

    This change also removes the POPCNT requirement, and may slightly change output as RLE now works on the last vector of the line.

    This also seems to yield a minor performance benefit - on a 12700K:

    Old code - image 1
       295.585 MP/s
        10.787 bits/pixel
    Old code - image 2
       384.460 MP/s
        16.240 bits/pixel
    
    New code - image 1
       308.533 MP/s
        10.787 bits/pixel
    New code - image 2
       385.016 MP/s
        16.240 bits/pixel
    

    CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

    opened by animetosho 3
  • Tweak heuristics for fast predictor selection

    Tweak heuristics for fast predictor selection

    Have you tried subtracting 1 iff nbits[0] <= (1/2)?

    Even better I think could be to double all the bit costs (as that changes nothing), but remap bitcost 1 into 1 instead of 2 for 0, and perhaps bitcost 2 into 3 (or 4), or something like that.

    Rationale being that RLE only kicks in for many consecutive 0s, so you'd need the 0 probability to be quite high.

    Originally posted by @veluca93 in https://github.com/veluca93/fpnge/pull/19#discussion_r933998684

    opened by veluca93 0
  • Ideas for improving compression ratio?

    Ideas for improving compression ratio?

    fpnge is pretty fast, so I was wondering what, in theory, could be done to improve compression slightly without sacrificing too much speed. I was hoping you might have some ideas on the topic (regardless of if they'd ever be implemented).
    Thoughts I had:

    RLE

    Currently only full vectors of zeroes are candidates for RLE.

    • Detecting sequences starting/ending partway through a vector might help density very slightly, with relatively little performance cost
    • Detecting sequences completely within a vector may be possible without too much cost, but I suspect would have little impact on compression, due to it being relatively short sequences
    • Detecting sequences of bytes other than 0 - not sure of the effectiveness
      • Perhaps use the distance of one pixel to the left instead of one byte?

    Unfortunately due to how the Huffman table is currently constructed, the LZ77 symbols always get assigned rather long sequences, making RLE unattractive (particularly with 0s) unless it's a rather long sequence. Which leads to the next point.

    Huffman encoding

    The current implementation doesn't consider symbol counts for many symbols, as the Huffman table has to be specially crafted for the algorithm used. Trying to relax some length restrictions often leads to ComputeCodeLengths becoming rather slow - from what I can tell, it appears to be finding the optimal tree via a brute-force approach. Would you happen to know of a faster algorithm for this?

    The other idea is to remove almost all length restrictions, which allows normal optimal Huffman tree algorithms to be used. I can't think of a quick way to allow the middle section (symbols 16-239) to have differing bit lengths, so you'd need to manipulate with symbol counts there a bit to ensure they all end up getting the same bit length. Otherwise, the unrestricted Huffman code could be handled with some slowdown, as, in addition to shuffle-lookup for the low bits, you'd need to do it for the high bits as well.

    Another thing to consider might be having multiple deflate blocks and sampling the image at various points, to be able to adapt to changes which occur vertically down the image.

    I suspect a more optimal Huffman code might generally give a low single-digit percentage gain most of the time, but at some speed cost.

    LZ77

    It seems that there's a fair bit of scope here to improve compression through better match finding, but this is also often the slowest part of deflate implementations. I can't think of a way to implement this without dropping to scalar code either.

    Furthermore, I'm not sure that knowing the data is an image, provides much benefit over generic LZ77 implementations - except that perhaps you can only look for matches on pixel boundaries instead of for every byte.

    Any ideas here?

    opened by animetosho 10
Owner
Luca Versari
Luca Versari
Very fast C++ .PNG writer for 24/32bpp images.

fpng Very fast C++ .PNG writer for 24/32bpp images. fpng.cpp was written to see just how fast you can write .PNG's without sacrificing too much compre

Rich Geldreich 639 Dec 25, 2022
Fast streaming PNG<->QOI converter with some compression-improving extensions

QOIG Fast streaming PNG<->QOI converter with some compression-improving extensions. Can achieve 1%-10% better compression than QOI without sacrificing

David Rutter 3 Oct 3, 2022
CGIF, A fast and lightweight GIF encoder that can create GIF animations and images

CGIF, a GIF encoder written in C A fast and lightweight GIF encoder that can create GIF animations and images. Summary of the main features: user-defi

Daniel Löbl 75 Dec 28, 2022
Arduino PNG image decoder library

An 'embedded-friendly' (aka Arduino) PNG image decoding library

Larry Bank 102 Jan 6, 2023
pngtostl is a program that converts a PNG image into STL 3D models

pngtostl is a program that converts a PNG image into a litophane, in STL format, suitable to be printed by entry level 3D printers.

Salvatore Sanfilippo 157 Dec 17, 2022
libspng is a C library for reading and writing PNG format files with a focus on security and ease of use.

libspng (simple png) is a C library for reading and writing Portable Network Graphics (PNG) format files with a focus on security and ease of use.

Randy 570 Dec 29, 2022
Simple, generally spec-compliant, hacky PNG Decoder written in C99.

Simple, generally spec-compliant, hacky PNG Decoder written in C99.

cristei 2 Nov 2, 2021
An image and texture viewer for tga, png, apng, exr, dds, gif, hdr, jpg, tif, ico, webp, and bmp files

An image and texture viewer for tga, png, apng, exr, dds, gif, hdr, jpg, tif, ico, webp, and bmp files. Uses Dear ImGui, OpenGL, and Tacent. Useful for game devs as it displays information like the presence of an alpha channel and querying specific pixels for their colour.

Tristan Grimmer 159 Dec 31, 2022
Rate-Distortion Optimized Lossy PNG Encoding Tool

rdopng is a command line tool which uses LZ match optimization, Lagrangian multiplier rate distortion optimization (RDO), a simple perceptual error tolerance model, and Oklab-based colorspace error metrics to encode 24/32bpp PNG files which are 30-80% smaller relative to lodepng/libpng.

Rich Geldreich 40 Dec 24, 2022
Guetzli is a JPEG encoder that aims for excellent compression density at high visual quality

Guetzli is a JPEG encoder that aims for excellent compression density at high visual quality. Guetzli-generated images are typically 20-30% smaller than images of equivalent quality generated by libjpeg. Guetzli generates only sequential (nonprogressive) JPEGs due to faster decompression speeds they offer.

Google 12.8k Jan 7, 2023
A fast image processing library with low memory needs.

libvips : an image processing library Introduction libvips is a demand-driven, horizontally threaded image processing library. Compared to similar lib

libvips 26 Nov 10, 2022
The “Quite OK Image” format for fast, lossless image compression

The “Quite OK Image” format for fast, lossless image compression

Dominic Szablewski 6k Dec 30, 2022
QOY - The "Quite OK YCbCr420A" format for fast, lossless image compression

QOY - The "Quite OK YCbCr420A" format for fast, lossless* image compression ( * colorspace conversion to/from RGBA is lossy, if used ) Single-file MIT

Chainfire 19 Oct 1, 2022
A slim, fast and header-only GIF loader written in C

gif_load This is an ANSI C compatible animated GIF loader in a single header file of less than 300 lines of code (less than 200 without empty lines an

Andrey Guskov 68 Dec 10, 2022
PNG encoder and decoder in C and C++.

PNG encoder and decoder in C and C++, without dependencies

Lode Vandevenne 1.7k Dec 25, 2022
conversion from absolute encoder and incremental encoder, control two robotis dynamixel motors, testing qserialport library in qt

Q_dxl This example is created for testing: Serial connection Testing two dynamixel motors (eg. MX-28AT) Doing the conversion from absolute encoder (of

ibov 1 Oct 30, 2021
Second life for famous JPEGView - fast and tiny viewer/editor for JPEG, BMP, PNG, WEBP, TGA, GIF and TIFF images with a minimalist GUI and base image processing.

JPEGView-Image-Viewer-and-Editor Updated Dec 07 2021. Version 1.1.1.0 has been released. Download link1, link2 added. Second life for famous JPEGView

Ann Hatt 40 Dec 27, 2022
Very fast C++ .PNG writer for 24/32bpp images.

fpng Very fast C++ .PNG writer for 24/32bpp images. fpng.cpp was written to see just how fast you can write .PNG's without sacrificing too much compre

Rich Geldreich 639 Dec 25, 2022
Fast streaming PNG<->QOI converter with some compression-improving extensions

QOIG Fast streaming PNG<->QOI converter with some compression-improving extensions. Can achieve 1%-10% better compression than QOI without sacrificing

David Rutter 3 Oct 3, 2022
This is a openGL cube demo program. It was made as a tech demo using PVR_PSP2 Driver layer GPU libraries.

OpenGL Cube Demo using PVR_PSP2 Driver layer GPU libraries This is a openGL cube demo program. It was made as a tech demo using PVR_PSP2 Driver layer

David Cantu 5 Oct 31, 2021